When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.
Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.
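For reference, a minimal Python sketch of the behavior the C++ functional API now matches (shapes chosen arbitrarily):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4)           # N, C, H_in, W_in
grid = torch.rand(1, 8, 8, 2) * 2 - 1   # normalized sampling locations in [-1, 1]
out = F.grid_sample(inp, grid, mode="bicubic", padding_mode="zeros", align_corners=False)
```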
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
Summary:
Similar to what we did previously in D70033166
Previously, for CPU we decomposed addmm if
```
check_device(mat1, mat2, device="cpu")
and statically_known_true(mat1.shape[0] == 1)
and statically_known_true(mat2.shape[0] <= 64)
and statically_known_true(mat2.shape[1] <= 512)
```
We have a new case where `mat2.shape[0] == 80`, and benchmarks show that it is beneficial to decompose, so we update the condition to
```
check_device(mat1, mat2, device="cpu")
and statically_known_true(mat1.shape[0] == 1)
and statically_known_true(mat2.shape[0] <= 128)
and statically_known_true(mat2.shape[1] <= 512)
```
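For reference, a hedged restatement of the updated gate as a standalone predicate; this is not the actual Inductor code (the real check uses `statically_known_true` over symbolic shapes):
```python
import torch

def should_decompose_addmm_cpu(mat1: torch.Tensor, mat2: torch.Tensor) -> bool:
    # mat1: [1, K], mat2: [K, N]; decompose only when the matmul is small enough
    return (
        mat1.device.type == "cpu"
        and mat2.device.type == "cpu"
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 128  # raised from 64 to cover the new K == 80 case
        and mat2.shape[1] <= 512
    )
```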
Differential Revision: D73292985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151730
Approved by: https://github.com/kflu, https://github.com/houseroad
Fixes #135998
Adds support for fp8. These are compared bitwise, without atol and rtol. The implementation uses the same comparison functions, just with atol and rtol forced to zero. The error message is different from the default case; it only tells the user the first mismatch. This is to avoid triggering the error from #135998.
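A minimal sketch of the newly supported comparison, assuming the entry point is `torch.testing.assert_close`:
```python
import torch

a = torch.randn(8).to(torch.float8_e4m3fn)
b = a.clone()

# fp8 tensors are compared bitwise: atol and rtol are forced to zero internally
torch.testing.assert_close(a, b)
```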
Test Plan:
New unit test covers new code paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150002
Approved by: https://github.com/cyyever, https://github.com/zou3519
By constructing the tensor on that device, because the test does not call `self.common` but rather executes directly.
Otherwise `test_add_complex3_mps` would test the CPU inductor rather than the MPS one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.
The storage of the torch.Event base object is not used by torch.cuda.Event or torch.xpu.Event, but we still need to inherit from torch.Event because CPython checks for it.
Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM
New UTs cover submatrices. Submatrix support for the strided batch API is not part of this PR and will be done separately.
There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.
To improve code readability, swapped definition of transA and transB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
By building the wheel with USE_DISTRIBUTED=1.
Otherwise, an attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
import torch.distributed.tensor
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
import torch.distributed.tensor._ops # force import all built-in dtensor ops
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
from ._conv_ops import * # noqa: F403
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
from torch.distributed.tensor.placement_types import (
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
import torch.distributed._functional_collectives as funcol
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
import torch.distributed.distributed_c10d as c10d
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
Summary: Further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see if it is a P2P op for this coalesced group.
Test Plan: Directly test with corner case.
Differential Revision: D73266257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
The name was updated by https://github.com/pytorch/pytorch/pull/151155. The benchmark results weren't updated on the dashboard otherwise.
For PT2 compiler perf benchmark, we are still relying on this old workflow. To get rid of this, we need to update PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)
The results are there on the new database:
```
SELECT
*
FROM
oss_ci_benchmark_v3
WHERE
workflow_id = 14510035576
```
but not on the old database:
```
SELECT
*
FROM
inductor_torch_dynamo_perf_stats
WHERE
workflow_id = 14510035576
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151698
Approved by: https://github.com/seemethere, https://github.com/atalman
Summary: See internal Diff for more details.
In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```
Differential Revision: D73002775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
Summary: It looks like symmetric memory only supports CUDA 12.3+. We do have the definition for CUDA < 12.3 but we don't have an implementation, so it is a good idea to disable the definition as well.
Test Plan: CI
Reviewed By: jianyuh, houseroad, ngimel, jiawenliu64
Differential Revision: D72936993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151203
Approved by: https://github.com/ngimel, https://github.com/houseroad
This PR fixes two things.
The first problem is that, in the vLLM-style usage, standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new
FakeTensorMode with the same shape_env; instead we should just reuse the
existing FakeTensorMode.
The second issue is that compile_fx can mutate the passed-in gm, so we
deepcopy it (since standalone_compile should be standalone).
Test Plan:
- new test
- updated old tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
We were only returning the first one. There's an edge case on what to do
if the original function returns a single Tensor: capture(f) returns a
function that returns a tuple of one Tensor in this case, and we were
originally converting this back to a single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.
Test Plan:
- modified one test to use multiple outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
To follow the pattern set by the CPU and CUDA impls: define common_dtype and optionally cast `elements` and `test_elements` to the common dtype if needed
- Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False, True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Cleanup code a bit
Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
If pip is not installed:
### Before
```console
> python3 torch/utils/collect_env.py
Collecting environment information...
Traceback (most recent call last):
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 694, in <module>
main()
~~~~^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 677, in main
output = get_pretty_env_info()
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 672, in get_pretty_env_info
return pretty_str(get_env_info())
~~~~~~~~~~~~^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 497, in get_env_info
pip_version, pip_list_output = get_pip_packages(run_lambda)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 450, in get_pip_packages
for line in out.splitlines()
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'splitlines'
```
### After
```console
> python3 torch/utils/collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: macOS 15.4 (arm64)
GCC version: Could not collect
Clang version: 20.1.0
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.13.2 (main, Apr 8 2025, 15:27:33) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.4-arm64-arm-64bit-Mach-O
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Apple M2 Pro
Versions of relevant libraries:
[pip3] Could not collect
[conda] Could not collect
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151607
Approved by: https://github.com/malfet
This case creates a subprocess within a subprocess. On Windows the function can't be loaded in this scenario, hence we have to skip it:
```
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'run_model' on <module '__main__' (built-in)>
Traceback (most recent call last):
File "<string>", line 25, in <module>
File "<string>", line 16, in test_multi_process
AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150999
Approved by: https://github.com/guangyey, https://github.com/EikanWang
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
So far it's only for `gather`, but we'll move index_select and index to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad; this PR brings the core implementation on par with those customized implementations. Added benefits: all dtypes are supported, and it is a bit less strict about tensor dimensions/contiguity because we pick the fast path after TensorIterator has collapsed the dimensions.
Biggest part of this PR is not even the kernel (it's dumb, just vectorized loads are enough), but moving utilities for vectorized loads and stores from SymmetricMemory to be generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything
`gather` is equivalent to x[indices] for 1d indices via
```
def fn_gather(x, indices):
    return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))

def fn_index(x, indices):
    return x[indices]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
# Motivation
This PR introduces a unified parent class `HostAllocator` with the following goals:
1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.
The new interface includes:
- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`
# Additional Context
We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new backend-specific API, for example:
```cpp
at::getHostAllocator(at::kCUDA)->empty_cache();
```
This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I plan to make is moving the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`.
Benefits include:
- Consistent host memory handling across all device backends
- Decouple pinned memory functionality from `AcceleratorHooksInterface` in a more modular way
- Clearer separation between device memory allocation and pinned host memory management
This architecture makes the system more maintainable and extensible for future device support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151431
Approved by: https://github.com/albanD
ghstack dependencies: #151403
This adds a non-blocking mode to queue_pop. This allows workers to poll whether work is ready without blocking the main loop. This is useful when you want to keep a GPU at maximum utilization while items are only periodically sent on the queue.
We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
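A hedged sketch of the polling pattern this enables; the `block` keyword follows the description above, and `do_gpu_work`/`handle_item` are placeholders:
```python
import torch.distributed as dist

store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)

while True:
    try:
        # Non-blocking pop: returns immediately instead of waiting for an item.
        item = store.queue_pop("work_queue", block=False)
    except dist.QueueEmptyError:
        do_gpu_work()      # keep the GPU busy while the queue is empty (placeholder)
        continue
    handle_item(item)      # placeholder for real work
```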
Test plan:
```
pytest test/distributed/test_store.py -k queue -v -s -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
Collective APIs accept either a group rank or a global rank for src/dst.
We provide a helper `_canonicalize_group_rank` which converts from either
form to one particular format (selected by the kwarg
`return_global: bool = False`).
In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in. The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.
This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
Summary: add an option to allow skipping the all-reduce of unused parameters; this can improve training throughput significantly when the model has a large number of unused parameters.
Test Plan: unit tests, CI
Differential Revision: D72282069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context
Use the binary docker build action from https://github.com/pytorch/pytorch/pull/151471
Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.
build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)
I don't think this image is actually used anywhere
Also push the docker image to imagename:tag. I had removed this in the PR that created the reusable workflow because I thought it was not in the original scripts, but it actually is there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151483
Approved by: https://github.com/ZainRizvi
If a custom operator does not contain a fake impl, currently draft-export will use the real-tensor propagation to get an output for the operator and continue tracing. However if we retrace the exported model using `ep.run_decompositions`, or `export`, or run the exported program with fake tensors, we'll still fail because there's no fake impl.
With this PR, after draft-export we will generate an operator profile for each operator call that we encounter, and store this on the report attached to the exported program `ep._report.op_profiles`. Users can then use `torch._library.fake_profile.register_fake_profile` to temporarily generate and register a fake impl based on these operator profiles. This way future fake tensor retracing will work.
The workflow would look something like:
```python
class M(torch.nn.Module):
    def forward(self, a, b):
        res = torch.ops.mylib.foo8(a, b)  # no fake impl
        return res

ep = export(M(), (torch.ones(3, 4), torch.ones(3, 4)))  # this fails bc no fake impl
ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)))
ep.run_decompositions()  # this fails bc no fake impl

# this registers fake impls based on the profiles
with torch._library.fake_profile.register_fake_profile(ep._report.op_profiles):
    decomp = ep.run_decompositions()  # this works

new_inp = (
    torch.ones(2, 3, 4),
    torch.ones(2, 3, 4),
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150809
Approved by: https://github.com/zou3519
This implements epilogue visitor tree argument generation (example type [here](3fe62887d8/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp (L332))).
Details:
The codegen task here is to implement a function which can generate a tree of C++ structs and properly extract the correct properties from Inductor buffers and write them to the correct locations in the generated struct. To implement this with the minimum amount of code, I generate the cutlass DAGIR (the EVT internal representation) which specifically has a pass, [pass_argument_type.py](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L4)), which generates a nested tree of custom argument types for each node in the DAGIR. This nested tree of constructors is then passed kwargs to fill in the proper values, where the node's name is used to differentiate between different values in the kwarg dictionary. This however is non-customizable; the nested tree of EVT args is a nested tree of ctypes which looks for *actual values* so that this object can be passed directly to the cutlass-python C++ runner. Inductor on the other hand needs to fill this struct with string C++ expressions representing the values (or extracting the values from kernel launcher args). So `_render_argument_type` implements this: it iterates over the tree of types created by pass_argument_type.py and generates a string representing the nested structs, filling in C++ expressions representing the different fields.
Long term plan:
Long term, I will ask NVIDIA to provide an overridable [visitor_factory](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L82)) which could allow us to override the behavior of pass_argument_type.py to generate the string we would like during DAGIR generation.
Previously merged:
* #150346
* #150345
* #150344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150903
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.
This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
I have noticed that there are some errors like
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 169302: invalid start byte
```
I haven't been able to repro this locally yet, but this change should fix the encoding issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151472
Approved by: https://github.com/masnesral
Summary:
We add states in the constant folding process for AOTInductor.
Basically, there are 3 states:
(1) None: The state when no constants are loaded and uninitialized.
(2) Initialized: The state when constants are loaded, but not yet
folded.
(3) Folded: The state where the model is fully ready with folded
constants.
Note that even if constant folding is not enabled, we still only run
when the state is FOLDED; this is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-through.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update
Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
Now that we have bc_lint in CI, this script is no longer needed (nor has it ever been conclusive). I've already updated the Runbook to not need this script.
Suppressing bc_lint as this script is not shipped as a part of torch; it is not user facing! For context, this script is (rarely) used by the release notes manager to ensure BC across releases. It had been broken since at least 2.6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151453
Approved by: https://github.com/albanD, https://github.com/jbschlosser
Which is a regression, introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779 which I should have reviewed more thoroughly.
- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`, which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future
TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
dtr = [self.dim() - i - 1 for i in range(ndim)]
return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```
Fixes https://github.com/pytorch/pytorch/issues/150629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
As the title states.
**Before this change:**
```C++
[999/1526] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/int4mm.cu.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/cuda/int4mm.cu(142): warning #177-D: variable "at::native::kWarpSize" was declared but never referenced
constexpr int32_t kWarpSize = 32;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151427
Approved by: https://github.com/Skylion007, https://github.com/malfet
Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot.
Test Plan:
We can compare the benchmark results for
```
CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024
```
before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules.
Differential Revision: D72469845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696
Approved by: https://github.com/frank-wei
`has_triton()` returns True if Triton is present on the system and supports _any_ backend we care about. In this case, that means we _always_ check gradients, even though the intended behavior is to skip gradients when testing on CPU.
Fixes a bug from #146911.
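An illustrative sketch of the mismatch (not the actual test code; `run_gradient_checks` is a placeholder):
```python
from torch.utils._triton import has_triton

# has_triton() answers "does Triton support *any* backend we care about",
# so gating on it alone still enables gradient checks for CPU-only test runs.
if has_triton():
    run_gradient_checks()  # placeholder for the gated test logic
```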
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151435
Approved by: https://github.com/masnesral
Summary:
When we have an exported program that looks like this:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",
            clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0);  c_lifted_tensor_0 = None
            index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone);  clone = None
```
The graph after `aot_export_module` could have a name collision; notice that the `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module.
```
def forward(self):
    arg9_1: "f32[10, 2139]"
    _tensor_constant0: "f32[1]" = self._tensor_constant0  # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting
    index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]);  _tensor_constant0 = None
```
This caused the `tensors used as indices must binary, int...` AOTI error on the PT2I dashboard because we later used `clone` as an index.
We hit this error because we created a new `_tensor_constant0` [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrode the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created in `create_proxy` in `proxy.py`.
To fix this, we run a graph pass before we unlift the graph inputs to avoid the name collision.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D72761937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
# Motivation
This stack of PRs aims to generalize and improve PyTorch host allocator code.
This PR introduces a `DeleterFnPtr` template parameter to `CachingHostAllocatorInterface` to resolve circular dependency issues. This change allows for better code reuse and simplifies the implementation of host allocators.
# Additional Context
TODO:
- [ ] Unify host allocator related API
- [ ] Deprecate those device-specific legacy API
- [ ] Move `is_pinned` to host allocator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151403
Approved by: https://github.com/gujinghui, https://github.com/albanD
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
a. If a node has no dependency or only depends on graph inputs: move to the front. Use case is when we move symints to cuda tensor for PaddedTensorSubclass
b. If the only user of a node is OutputNode: move it to the end.
#### Example
The following example shows a padded tensor subclass use case where we copy a symint to a cuda tensor (aka mask) in the middle of the function. With reordering we still generate 1 cudagraph by moving the mask creation to the front.
```python
import torch

torch._inductor.config.graph_partition = True

# Two reasons for this:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
#    so we can still reorder in scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
    mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
    mask[original_size:] = True
    return mask

@create_mask.register_fake
def _(padded_size, original_size, device):
    return torch.empty((padded_size,), dtype=torch.bool, device=device)

def f(padded_tensor, original_tensor, weight):
    original_size = original_tensor.size()[0]
    padded_size = padded_tensor.size()[0]

    # element wise op so we don't care padding value
    padded_tensor = padded_tensor + 1
    padded_tensor = torch.nn.functional.relu(padded_tensor)

    # dot product requires padding with 0
    dot_res = padded_tensor.dot(weight)
    padded_tensor += dot_res

    # min requires padding with inf, so we create mask now
    mask = create_mask(padded_size, original_size, padded_tensor.device)
    min_res = torch.min(
        torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
    )

    # max requires padding with -inf. we can reuse previous mask
    max_res = torch.max(
        torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
    )

    return min_res + max_res + padded_tensor

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(padded_size, original_size):
    padded_tensor = torch.randn(padded_size, device="cuda")
    padded_tensor[original_size:] = 0
    original_tensor = torch.randn(original_size, device="meta")
    weight = torch.randn(padded_size, device="cuda")

    eager_out = f(padded_tensor, original_tensor, weight)
    compiled_out = compiled_f(padded_tensor, original_tensor, weight)
    assert torch.allclose(eager_out[0], compiled_out[0])
    assert torch.allclose(eager_out[1], compiled_out[1])

# new cudagraph
run(8, 4)

# new cudagraph due to recompile
run(8, 6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
# Feature
This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.
The existing logic missed an important edge case. We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.
```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 32
    ynumel = 1
    xnumel = 32
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.
# Test plan
Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151399
Approved by: https://github.com/jansel, https://github.com/eellison
Given an exception in torch.export, I want to try/catch it to add the message "hey try out draft-export!". Currently I only add this message for errors that draft-export is known to fix, like DataDependentErrors, ConstraintViolationErrors, and no fake impl.
Originally the error message looks like:
```
File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
```
Now the error msg looks something like:
```
File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can rerun your program with the `DRAFT_EXPORT=1` envvar, or replace your `export()` call with `draft_export()`.
```
In Python versions >= 3.11, we can use `exception.add_note` to add to the error message. For earlier versions, I use a hack that modifies `e.args`.
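A rough sketch of the version-dependent approach described above (not the exact code in the PR):
```python
def _append_draft_export_hint(e: BaseException, hint: str) -> None:
    if hasattr(e, "add_note"):  # Python >= 3.11
        e.add_note(hint)
    elif e.args:                # older versions: patch the message stored in args
        e.args = (f"{e.args[0]}\n{hint}",) + e.args[1:]
    else:
        e.args = (hint,)
```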
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151065
Approved by: https://github.com/pianpwk
ghstack dependencies: #151051
As the title stated.
**Changes:**
- Remove unnecessary header file
- Remove unnecessary registry logic for PrivateUse1HooksRegistry, such as TORCH_DECLARE_REGISTRY, C10_DEFINE_REGISTRY, etc.
- Use a static global variable for initialization instead of call_once
**Next Step:**
Enable test_openreg.py in CI/CD to guard the quality of PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151005
Approved by: https://github.com/albanD
Implement traceable config patching for Dynamo: this enables restricted patching of the Dynamo config, where a user can use a context manager/decorator to change tracing behavior for parts of the code.
The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.
Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.
Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)
See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.
NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.
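A hedged usage sketch based on the description above; the exact import path and call form of `dont_skip_tracing` are assumptions:
```python
import torch
import torch._dynamo

@torch._dynamo.dont_skip_tracing  # ask Dynamo to ignore most skip rules here
def helper_in_skipped_file(x):
    return x + 1

@torch.compile
def fn(x):
    return helper_in_skipped_file(x)
```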
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
Summary:
D72906445 seemed to cause a SIGABRT when running the test in the test plan. The change I narrowed it down to was where in fake_impls the [`deregister_fake_kernel` no longer calls `lib.destroy`](https://github.com/pytorch/pytorch/pull/150806/files#diff-7fd3f4222276c63b91f3a895530bb5efe137fd23165b48f25afcf3c06a5d2a8fL65-L69).
Calling `lib.destroy` in that handle results in a maximum recursion error where someone calls library.destroy which calls the handle which calls back to library.destroy.
So I compared the implementations of `_del_library` and `lib.destroy`, and the main difference seemed to be deleting `self.m`. Adding that fixed my issue!
Side note, I feel like we can combine `_del_library` and `library._destroy`? But I won't do it in this diff to make sure we don't break too many things 😅
Test Plan:
`buck test 'fbcode//mode/opt' fbcode//aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests -- --exact 'aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests - aiplatform.gmpp.bulk_eval.reader.service.tests.reader_service_handler_tests.ReaderServiceHandlerTests: test_add_preproc_output_into_queue'`
https://www.internalfb.com/intern/testinfra/testrun/10977524170296078
Differential Revision: D73017613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151299
Approved by: https://github.com/zou3519
Currently, `torch._chunk_cat` only supports contiguous inputs (due to `.view()` usage in `_pad_chunk()` supporting only contiguous tensor). This doesn't work for internal models where there can be non-contiguous input tensors:
- size=[8192, 16416], stride=[16448, 1] # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152] # column-major tensor
In this PR, we relax the assumption of a contiguous input tensor by switching from `.view()` to `.reshape()`. Note that since `.reshape()` will try to use `.view()` under the hood whenever possible, this should not cause a regression for existing use cases.
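A minimal sketch of why the switch helps, using the column-major example above: `.view()` rejects incompatible strides while `.reshape()` falls back to a copy:
```python
import torch

x = torch.randn(384, 1152).t()  # size [1152, 384], stride [1, 1152]

x.reshape(-1)                   # ok: reshape copies when a view is impossible
try:
    x.view(-1)                  # RuntimeError: strides are not view-compatible
except RuntimeError as e:
    print(e)
```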
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
After https://github.com/pytorch/pytorch/pull/150652, we still see some ranks missing dumps. Upon looking further, the case is that the FR dump timed out on its first attempt:
watchdog thread: notify FR dump -> wait for 1 min -> throw watchdog timeout -> notify elastic to kill process
FR dump thread: received FR dump signal -> timeout after 1 min on first attempt -> started 2nd attempt -> got killed.
So we want to make the FR dump timeout shorter. In reality, the logs show that the dump finishes within one second. Even if we assume a very slow speed like 200K/s, the usual FR size (1MB at most) takes around 5 secs, so 15 secs gives roughly a 3x buffer.
Also, we still let the watchdog sleep for 1 min so that there is enough time for two dumps to time out and for the following checks, like the GIL checker, to execute.
Also, if we get stuck acquiring the GIL or hit a CUDA hang, 15 seconds should be enough to detect it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151329
Approved by: https://github.com/fegin
Finishes up the work started in #121686 + adds test
Update: this was not as straightforward as I originally imagined. Context below.
**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.
**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.
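For context, the generic pattern looks roughly like this (a minimal sketch):
```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestFoo(TestCase):
    def test_bar(self, device):
        t = torch.ones(2, device=device)
        self.assertEqual(t.sum().item(), 2.0)

# Generates TestFooCPU.test_bar_cpu, TestFooCUDA.test_bar_cuda, ...
instantiate_device_type_tests(TestFoo, globals())

if __name__ == "__main__":
    run_tests()
```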
Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.
(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:
070f389745/test/test_ops.py (L218-L221)
(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:
070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)
070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)
To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)
To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
Introduced by https://github.com/pytorch/pytorch/pull/149512
Before this change, the following warning was generated:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/transformers/attention.cpp:452:71: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
452 | REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta);
| ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151367
Approved by: https://github.com/drisspg
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
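A hedged usage sketch based only on the signatures above; the import location (assumed here to be `torch._inductor`) and whether the loaded artifact is directly callable are assumptions:
```python
from torch._inductor import standalone_compile

artifact = standalone_compile(gm, example_inputs)         # gm: a torch.fx.GraphModule
artifact.save(path="compiled_artifact", format="binary")

# Later, possibly in a different process:
from torch._inductor import CompiledArtifact
loaded = CompiledArtifact.load(path="compiled_artifact", format="binary")
out = loaded(*example_inputs)                             # assumed calling convention
```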
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
241 | index_ = value_.template Construct(std::forward<Args>(args)...);
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
258 | if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
265 | if (!value_.template Assign(index_, std::forward<T>(value))) {
| ^
3 errors generated.
```
Fixes https://github.com/pytorch/pytorch/issues/151316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
compiled_module = graph.compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
return self._compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
self._update_scheduler()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
self.scheduler = Scheduler(self.operations)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
self._init(nodes)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
self.update_zero_dim_cpu_tensor()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
and buffer.get_size() == []
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
return [*self.get_layout().size]
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```
Test Plan: OP said the issue is fixed
Differential Revision: D72575808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243 and we now want to enable the script to work for the slow path. The change is indeed bc-breaking, but it is needed to make this work, and the API is internal-use only; it is not user facing. For the slow path, the individual collectives have input and output sizes recorded but no state; only the final one has the state ready. We check the correctness of each individual collective one by one, but we don't check the state match for these collectives; we can only check the state match for the last one, which is the work item with the coalesced label.
Added more unit tests for the slow path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
Sometimes the Python script doesn't exit normally and the lock file remains in place. In this case, `file_baton.py` may sleep forever waiting for the lock file to be released. This PR adds a warning that shows the existing lock file path, letting the user better understand which file to delete when the wait is too long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149382
Approved by: https://github.com/soulitzer
This PR intends to rework the dispatching of the autograd key.
I.e., currently the DispatchKey.Autograd of the HOPs was triggered even if none of the operands of the HOP have `requires_grad=True`. With this rework, autograd is bypassed if none of the operands require gradients and only invoked if any of the operands require gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151107
Approved by: https://github.com/ydwu4
There are only 118 failures atm, mark them all with xfail to avoid new regressions
Add `xfail_if_mps_unimplemented` decorator to distinguish between tests that call unimplemented eager op vs ones that fail for some other reason.
Added `aten._scaled_dot_product_attention_math_for_mps` fallback to make test behavior consistent between MacOS-15 (where the fallback is in place) and MacOS-14
Weird MacOS-14 specific skips:
- test_torchinductor.py::GPUTests::test_cat_extern_kernel_mps
- test_torchinductor.py::GPUTests::test_sort_transpose_mps (likely an eager bug)
- test_torchinductor.py::GPUTests::test_unaligned_input_mps
Numerous MacOS-13 skips, including a few eager hard crashes; for example, running `test_torchinductor.py::GPUTests::test_scatter5_mps` causes
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayScatter.mm:309: failed assertion `Rank of destination array (1) must be greater than or equal to inner-most dimension of indices array (3)'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150821
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282, #151288
Summary:
as title
`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.
We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`
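For reference, a hedged sketch of the wrapping pattern (the `_WrapperModule` constructor shown here is an assumption; it is a private helper and may change):
```python
import torch
from torch.export._trace import _WrapperModule  # private helper named in this PR

def fn(x, y):
    return x + y

# Assumption: the wrapper takes the function in its constructor and calls it in forward().
wrapped = _WrapperModule(fn)
ep = torch.export.export(wrapped, (torch.randn(2), torch.randn(2)))
print(ep)
```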
Fixes https://github.com/pytorch/pytorch/issues/146867
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```
Differential Revision: D72986826
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151264
Approved by: https://github.com/angelayi
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.
# Feature
This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)
There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.
# Implementation details
The implementation mainly consists of separating direct C++/Python codegen into two phases:
1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.
The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
# Test plan
Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.
The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.
# Future directions
These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
- User-defined Triton kernels.
- Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
- Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
- Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.
These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.
One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.
Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
PT2 benchmark scripts has a pattern like:
```
def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
    cloned_inputs = clone_inputs(inputs)
    self.optimizer_zero_grad(mod)
    with self.autocast(**self.autocast_arg):
        pred = mod(**cloned_inputs)
        loss = self.compute_loss(pred)
    self.grad_scaler.scale(loss).backward()
    self.optimizer_step()
    if collect_outputs:
        return collect_results(mod, pred, loss, cloned_inputs)
    return None
```
for training.
The collect_outputs argument is True only for accuracy testing and it's false for performance testing.
For HF benchmark suite, a model usually returns tuple (loss, logits). For performance testing, even though the logits is never used anywhere, dynamo has to keep it due to the control flow.
A few bad things happen if we keep the logits here:
1. the peak memory will be higher since the logits is large and we cannot release its memory earlier.
2. we cannot do optimizations like chunking for the logits because the tensor needs to be returned from the pre-grad graph
Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to see two runs having mismatched logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes the comparison fairer for us.
On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))
Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see a 5% (2.74 -> 2.79x) perf win for HF. It could be true: we may generate more efficient kernels since we don't need to keep the logits and return it from the pre-grad graph. But I'll double check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
If optree is less than the minimum version, we should pretend it doesn't
exist.
The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raise an error "min optree version is 0.13.0"
The fix is to pretend optree doesn't exist if it is less than the min
version.
There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.
Test Plan:
I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is
an optional dependency of PyTorch. To use it, please upgrade your
optree package to >= 0.13.0
```
I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:
optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)
_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan
sdp on xpu will fall back to the math path in some cases (i.e. training). In the dynamo benchmarks, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation applies to all devices.
I didn't add an `if device == xpu` check here; I suppose cuda devices will not run into the math path anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
The goal is to have a way to compare whether a change makes things better or worse.
```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
This PR is to enable FR for all coalesce ops for fast path. (batch p2p is enabled in the current script, so we will mainly focus on non-P2P ops). To explain what is fast path, let's revisit how coalesced collective is working today:
For non-P2P coalesced ops, there are several ways to call it (due to legacy reasons):
- Way one: Directly call a python api like all_reduce_coalesced in python; this will be deprecated soon.
- Way two: Directly call the api inside PGNCCL like allreduce_coalesced. Way one will eventually call into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: Using _coalescing_manager in python, like:
```
with _coalescing_manager():
    for i in range(num_colls):
        dist.all_reduce(tensors[i])
```
This way has two paths:
- Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we will only launch one big collective by calling the api from case 1.
- Slow path: we call startCoalescing() in the beginning and then a bunch of collectives (each one will generate a FR entry) and then endCoalescing(). Inside startCoalescing(), groupStart() is called and inside endCoalescing(), groupEnd() is then called. So although this is going to be one collective, we call into PGNCCL for each collective coalesced in the slow path case.
- For uneven all-gather (allgather_v) and reduce-scatter, it follows the pattern mentioned in the slow path. It directly calls the cpp api inside PGNCCL.
This PR addresses the fast path because it is the easy case: we store the collectives info on the python side, and we only call into PGNCCL once, so there is only one work and one FR entry. We can just treat them as regular coalesced collectives.
We add some e2e unit tests for the build_db function so that the change to FR is more thoroughly tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
Fixes #151216, #151215
Previously I forgot to revert the timeout after setting it for the timeout test.
To prevent this in the future I split the test into 3 different tests so timeout testing is isolated.
Test plan:
Stress tested
```
pytest test/distributed/test_store.py -k queue -v -s --minutes 10
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151252
Approved by: https://github.com/XilunWu
By using Metal `as_type` which according to documentation does exactly
that:
> Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not
a pointer) to be reinterpreted as another scalar or vector data type of the same size. The bits in
the operand are returned directly without modification as the new type. The usual type
promotion for function arguments is not performed.
Using `reinterpret_cast` created a potential silent correctness error when dtypes of different sizes were bitcast to each other
Add an explicit cast to src_type to avoid errors due to type promotion (i.e.
something like (x+1).view(dtype=torch.float16) would work correctly in
eager mode for the int16 dtype, but would fail in compile, as arithmetic
operations will promote int16 to int32).
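A reproducer of the described discrepancy (assumes an MPS device is available; with this PR both paths agree):
```python
import torch

def f(x):
    return (x + 1).view(dtype=torch.float16)

x = torch.arange(4, dtype=torch.int16, device="mps")
eager = f(x)                     # works: x + 1 stays int16 in eager
compiled = torch.compile(f)(x)   # previously failed when int16 was promoted to int32
torch.testing.assert_close(eager.cpu(), compiled.cpu())
```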
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151272
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
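A hedged usage sketch of the API above; the `torch._inductor` import location, the acceptance of a `symbolic_trace`d graph, and the exact save/load paths are assumptions:
```python
import torch
from torch._inductor import standalone_compile  # assumed import location

# Any FX GraphModule plus matching example inputs; symbolic_trace is just one
# convenient way to get one for this sketch.
gm = torch.fx.symbolic_trace(lambda x: x * 2 + 1)
example_inputs = [torch.randn(4)]

artifact = standalone_compile(gm, example_inputs)
artifact.save(path="/tmp/compiled_artifact", format="binary")
loaded = type(artifact).load(path="/tmp/compiled_artifact", format="binary")
```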
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
To avoid accuracy issues when small reductions are unrolled, cast half to float during the `load` op
As `op_math_t<half>` is indeed float
This fixes `test_unroll_small_reduction` for reduced precision types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151246
Approved by: https://github.com/dcci
ghstack dependencies: #151224
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.
### Context
As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.
By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.
### Testing
We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
Added a context manager, `torch._library.fake_profile.register_fake_profile(op_profiles)`, where given an operator profile, it will generate and register a fake impl for the operator based on the operator profile.
The input to `register_fake_profile` is a dictionary mapping operator name to a set of profiles which describe the input and outputs of the operator. Here's an example of a profile for `mylib.foo.default`:
```
"mylib.foo.default": {
OpProfile(
args_profile=(
TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
),
out_profile=TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
)
}
```
`foo`'s profile contains only one profile, which says that for 2 input tensors of rank 2, dtype float32, device cpu, we will return one tensor of rank 2, dtype float32, and device cpu.
This will then generate a fake kernel where given 2 input tensors of rank 2 (and the other tensor metadata), we will output one tensor of rank 2 (and the other tensor metadata). If the operator also supports other input ranks, then we can add to the profile for the fake impl to support more input types.
This profile can either be manually written or created by draft-export, and then checked into the codebase.
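A hedged sketch of using the context manager described above (the import location of `OpProfile` / `TensorMetadata` alongside `register_fake_profile` is an assumption):
```python
import torch
from torch._library.fake_profile import (  # OpProfile / TensorMetadata location assumed
    OpProfile,
    TensorMetadata,
    register_fake_profile,
)

meta = TensorMetadata(
    rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided
)
op_profiles = {
    "mylib.foo.default": {OpProfile(args_profile=(meta, meta), out_profile=meta)},
}

with register_fake_profile(op_profiles):
    # While registered, tracing/exporting code that calls mylib.foo.default
    # uses the generated fake impl instead of requiring a hand-written one.
    ...
```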
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150807
Approved by: https://github.com/zou3519
ghstack dependencies: #150806
- Added a test to guard bfloat16. The optimizer incorrectly turns bfloat16 initializers into uint16, but this is not relevant to export logic.
- Fix bfloat16 support in onnx_program callable
Tested with the following with cuda
```py
import torch

class BfloatModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.param = torch.nn.Parameter(torch.tensor(2.0, dtype=torch.bfloat16))

    def forward(self, x):
        return x * torch.tensor(1.0, dtype=torch.bfloat16) * self.param

input = torch.randn(1, 10, dtype=torch.bfloat16)
model = BfloatModel()
onnx_program = torch.onnx.export(model, (input,), dynamo=True, optimize=False, verify=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151121
Approved by: https://github.com/titaiwangms
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes #147846. Previously, the out variant of `tensordot` did not error out when `requires_grad=True`. This can cause potential issues when the out tensor is part of a computation graph.
Enforces the out variant of tensordot to run without `requires_grad=True` set. The change is the same as #117067.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).
Change `hasattr()` to `getattr(..., False)` for these cases.
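A minimal sketch of the behavior difference (names are illustrative):
```python
def make_fn(boxed_flag):
    def fn(args):
        return sum(args)
    fn._boxed_call = boxed_flag
    return fn

fn = make_fn(None)  # attribute exists, but explicitly "not boxed"

# Before: only attribute presence was checked, so this counted as boxed.
before = hasattr(fn, "_boxed_call")
# After: a None/False value now behaves the same as not setting it at all.
after = getattr(fn, "_boxed_call", False)
print(before, bool(after))  # True False
```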
Test Plan: unit tests pass
Differential Revision: D72806693
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
This has been pretty helpful for the size-oblivious rewrite. Wanted the variadic args version to avoid `sym_or(a, sym_or(b, sym_or(c, d)))` in favor of `sym_or(a, b, c, d)`. Happy to change this to ban the 1-arg version.
This is better than plain and/or because the whole symbolic expression gets preserved, and if we guard on it or defer as a runtime assert, we preserve all branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150456
Approved by: https://github.com/laithsakka
By adding `pass` in front of the comment for fake set_device call
Which fixes `TestGPU.test_zero_element_mutation_mps`, which previously
failed with
```
torch._inductor.exc.InductorError: RuntimeError: Failed to import /var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmp2emka_sx/7k/c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py
IndentationError: expected an indented block after 'with' statement on line 38 (c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py, line 40)
```
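A small self-contained illustration of why the `pass` is needed (the context manager name here is made up):
```python
import contextlib

@contextlib.contextmanager
def fake_device_guard():  # stand-in for the elided set_device call
    yield

# A `with` body that contains only a comment is an IndentationError,
# so the generated wrapper now emits `pass` before the comment.
with fake_device_guard():
    pass  # set_device(...) is a no-op comment on this backend
```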
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151224
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
Motivation: for the following script:
```
# demo.py
import torch
import json
from transformers import BertModel, BertConfig
CONFIG = """
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.6.0.dev0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
"""
config = json.loads(CONFIG)
bloom_config = BertConfig(**config)
model = BertModel(bloom_config).half().cuda()
torch.compiler.reset()
torch.cuda.empty_cache()
compiled_fn = torch.compile(model)
vocab_size = 30522
for b in range(1, 3):
    for s in range(1, 10):
        print(f"🚀 {b} {s}")
        input_ids = torch.randint(0, vocab_size, (b, s)).cuda()
        attention_mask = torch.ones(b, s).cuda()
        with torch.no_grad():
            out = compiled_fn(input_ids, attention_mask).last_hidden_state
```
when we run it with:
```
time TORCH_LOGS=recompiles python demo.py
```
We can see there are 7 recompilations and it takes 2 mins (fresh build) or 1 min (cached build) on my machine.
One root cause of the recompilations is that there are guards to check the alignment of the inputs (see the patch). So there are unexpected recompilations for `(1, 4)`, `(1, 8)`, `(2, 4)` and `(2, 8)` inputs.
In this patch, we try to always pad the inputs if we don't know their shape at compilation time, to avoid the guards on alignment. It is fine to always pad the tensor; it won't change the semantics.
Now there are only 3 recompilations and it takes 1 min (fresh build) and 17s (cached build) on my machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150403
Approved by: https://github.com/drisspg
This change enables basic NestedTensor operations on HPU,
fixing the runtime error when creating a NestedTensor on HPU.
- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
This can save ~0.2ms on non-cuda devices by skipping the call to `amp_definitely_not_available()`. It can improve small models in torchbench, like lennard_jones, by 10% on xpu for both eager and inductor in the dynamo benchmarks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151111
Approved by: https://github.com/soulitzer
As titled, this pr adds additional `forward_dtype` and `backward_dtype` conversion in DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.
In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gather'ed along the `self.compute_placements`, with datatype cast to `self.param_dtype`. In the backward pass, additionally, the computed gradients are reduce-scatter'ed along the `self.grad_placements`, with datatype cast to `self.reduce_dtype`.
```python
output = x.redistribute(
placements=self.compute_placements,
forward_dtype=self.param_dtype,
backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.
The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.
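A generic sketch of that cast-in-forward / cast-in-backward pattern (this is not DTensor's actual `Redistribute`; the collectives are omitted):
```python
import torch

class CastForComm(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        # In Redistribute, this cast happens before the all-gather in forward.
        return x.to(forward_dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # In Redistribute, this cast happens before the reduce-scatter in backward.
        return grad_output.to(ctx.backward_dtype), None, None

x = torch.randn(4, dtype=torch.float32, requires_grad=True)
y = CastForComm.apply(x, torch.bfloat16, torch.float32)
y.sum().backward()
print(y.dtype, x.grad.dtype)  # torch.bfloat16 torch.float32
```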
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
Summary: CK doesn't support FP32 attention, but aotriton does. If we prefer CK, and the input dtype is FP32, we'll select mem efficient attention but CK doesn't support it. So we'll exclude mem eff attention and pick math.
Differential Revision: D72880985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151132
Approved by: https://github.com/yoyoyocmu
I want to format and refactor the csrc file of pytorch_openreg. To make the code review clearer and easier to understand, I divide the code refactoring into two parts:
- Part 1: Code formatting
- Part 2: Code refactoring and optimization (Next PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151004
Approved by: https://github.com/albanD
ghstack dependencies: #151000
Summary:
as title
`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.
We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`
Fixes https://github.com/pytorch/pytorch/issues/146867
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```
Differential Revision: D69434316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146919
Approved by: https://github.com/angelayi
Prior to this PR, `rng_state` is in `V.graph.graph_inputs` but not in the read_writes of any IRNode. As a result, it is not identified as a partition input:
```python
def partition_0(args):
    primals_2, primals_1 = args
    ...
    buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
    # <----- access fwd_rng_state_0 but it's not an input
    ...

def call(self, args):
    primals_1, primals_2, fwd_rng_state_0 = args
    ...
    partition0_args = [primals_2, primals_1]
    (buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
    # <---- fwd_rng_state_0 is graph_inputs but is not passed to partitions[0]
    ...
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
Added a flag, `allow_override`, to allow overriding existing kernel implementations in `torch.library.register_fake` and `library.impl`. The default is false: if a user tries to register a kernel to a dispatch key that already contains a kernel, it will error. This flag doesn't apply to CustomOpDefs, where overriding a fake kernel is already allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150806
Approved by: https://github.com/zou3519
`compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).
When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs. In some cases, uneven sharding results in an empty
shard.
e.g.
- tensor dim size: 512
- mesh dim size: 30 (chunk size = ceil(512 / 30) = 18)
- ranks 0..27 have local size 18
- rank 28 has local size 8
- rank 29 has local size 0 <--- empty shard
The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, because
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.
Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
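A small self-contained sketch of the sizing and the new offset convention (this mirrors the chunking arithmetic rather than calling the actual DTensor util):
```python
import math

def shard_sizes_and_offsets(dim_size: int, num_shards: int):
    full = math.ceil(dim_size / num_shards)
    sizes, offsets, covered = [], [], 0
    for _ in range(num_shards):
        size = min(full, max(dim_size - covered, 0))
        # Empty shards get offset == dim_size (out of bounds), so they can
        # never collide with a real shard's offset during deduplication.
        offsets.append(covered if size > 0 else dim_size)
        sizes.append(size)
        covered += size
    return sizes, offsets

sizes, offsets = shard_sizes_and_offsets(512, 30)
print(sizes[27], sizes[28], sizes[29])   # 18 8 0
print(offsets[28], offsets[29])          # 504 512
```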
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
[dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True
Reading the feature-enabling param `enable_cpp_framelocals_guard_eval` at the CPP level is time consuming and slows down dynamo, as it is done every time the function using this param is called. Reading the value only once at init isn't an option, as it would disable modification of this param at runtime. Since this feature has been enabled by default for some time and doesn't cause known issues, the `enable_cpp_framelocals_guard_eval` configuration param is deprecated by this commit and its value is hardcoded to true.
Local microbenchmark dynamo_guard_eval.py:
- 931.9 us -> 538.9 us (3.10)
@williamwen42 @jansel @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151008
Approved by: https://github.com/williamwen42
Summary:
When creating an `OpaqueTensorImpl`, currently there's only an option to create it for a non-view tensor, but it can be useful to create one for view tensors as well.
View tensors should contain the same autograd parameters as the original tensor, whereas non-view tensors get created with whatever `inference_mode` option is currently enabled. For this reason, `TensorImpl` has a special view constructor that takes `TensorImpl::ImplType` as its first parameter, so adding a new constructor to `OpaqueTensorImpl` that does the same thing allows us to create views with it.
Test Plan: CI
Reviewed By: scottxu0730
Differential Revision: D71748460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151028
Approved by: https://github.com/scottxu0730, https://github.com/chaos5958
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.
This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.
This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported.
`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.
This also has a few cleanups to error types/documentation in related code.
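A hedged usage sketch of the described semantics (the exact Python method names and accepted value types on the store are assumptions):
```python
# Illustrative only; assumes a store type that supports queues (e.g. libuv TCPStore).
import torch.distributed as dist

store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True, use_libuv=True)

store.queue_push("jobs", "item-1")   # always succeeds: the queue is unbounded
store.queue_push("jobs", "item-2")
first = store.queue_pop("jobs")      # waits for the key, then pops the oldest value
```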
Example trace:
```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```
Test plan:
```
$ pytest test/distributed/test_store.py -k queue -v -s
test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
Add the option for providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-k optimization for GEMMs by decomposing a mm -> bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.
Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:
```
import torch
def f(a, b):
    return torch.matmul(a, b)

def decompose_func(a_in, b_in):
    M, K = a_in.shape
    K, N = b_in.shape
    # TODO: Ideally we want to autotune over this parameter
    kPartitions = 256
    assert K % kPartitions == 0, "K must be divisible by Kmini"
    B = K // kPartitions
    a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
        0, 1
    )  # Shape: (B, M, kPartitions)
    b_reshaped = b_in.reshape(B, kPartitions, N)  # Shape: (B, kPartitions, N)
    result = torch.bmm(a_reshaped, b_reshaped)  # Shape: (B, M, N)
    return result.sum(dim=0).to(torch.float16)  # Sum over B dimension, Shape: (M, N)

for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
    a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
    b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)
    compiled_res = torch.compile(f, dynamic=False)(a, b)
    decompose_res = decompose_func(a, b)
    print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
    print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```
we are able to autotune the decomposeK optimization against aten and the traditional Triton templates in Inductor. DecomposeK is faster than aten by about ~10% on average and gives a > 4x speedup over the best Triton templates on an H100 machine, e.g.:
```
AUTOTUNE mm(32x28672, 28672x32)
decompose_k_mm 0.0126 ms 100.0%
mm 0.0144 ms 87.5%
triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```
https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison
This is the final PR, where everything comes together.
The problem I'm trying to solve is the following: when we register a MemPool with the NCCL ProcessGroup, it calls `ncclCommRegister` on all the allocations that are _currently_ in the pool. However, any later allocation will _not_ be registered with the NCCL communicator!
This is terribly inconvenient, because it means that every piece of code that allocates a tensor must be changed to become aware of whether it's doing so within a private pool, and it must become aware of NCCL and of all the PGs in existence, in order to re-register that pool with them.
Moreover, I believe there can be performance implications because allocating tensors is usually done in the critical path (i.e., during the forward and backward of every step of a training), whereas registering memory is a slow operation that should be done once at init time.
With this PR, once the user registers a Mempool with the NCCL PG, we install some hooks into the CachingAllocator in order to listen for all future memory allocations and, if they belong to the pool, we automatically call `ncclCommRegister` on them! (In fact, we reuse the hooks that already exist for `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150684
Approved by: https://github.com/kwen2501
ghstack dependencies: #150683
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
Differential Revision: [D72760205](https://our.internmc.facebook.com/intern/diff/D72760205/)
We hardcoded to only use GEMM anyway.
This also highlights a problem with high instantiation levels. As the instantiation level goes higher (here it is 3333), the time it takes to list the configs can already be long (here it is >3 minutes).
If we know exactly which configs we care about, we should have a way to generate them without calling generators. But let's see if we need that.
Using this script:
```
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch
import torch._inductor.config
torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.max_autotune_gemm_backends = "Aten,CUTLASS"
# intentionally use no cutlass ops
torch._inductor.config.cuda.cutlass_max_profiling_configs = 0
torch._inductor.config.cuda.cutlass_instantiation_level = "3333"
def main():
    M = 128
    dtype = torch.float16
    A = torch.randn(M, M, device="cuda", dtype=dtype)
    B = torch.randn(M, M, device="cuda", dtype=dtype)
    compiled_model = torch.compile(torch.mm)
    _ = compiled_model(A, B)
    print("done")

if __name__ == "__main__":
    main()
```
before, with logs:
```
CUTLASS library generated 7 operations in 235.03 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 10.51 seconds
```
after:
```
CUTLASS library generated 1 operations in 207.39 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 9.53 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150973
Approved by: https://github.com/ColinPeppler
These two events are really common, and also make up a huge portion of logs (~70%) we get internally in PT2 Compile Events. I don't think it's actually that useful to aggregate them, so instead of logging them to PT2 Compile Events, lets just only log them to chromium.
These two events will still be visible from tlparse: they just won't be in our internal tables. Please let me know if folks disagree.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151053
Approved by: https://github.com/oulgen, https://github.com/masnesral
### Compilation error
The issue is that u0 (an unbacked symint) can come from a smaller int dtype e.g. int16, int32.
```
error: no matching function for call to ‘min(int64_t&, short int&)’
759 | call_add_kernel_with_scaling_0(... std::min(100L, s97, u0) ...);
```
### Diff
The fix is to explicitly specify `int64_t` in the std::min template.
```
int64_t s97 = arg0_1_size[0];
int16_t u0_raw;  // not a long
auto u0 = u0_raw;
// Before
std::min({100L, s97, u0})
// After
std::min<int64_t>({100L, s97, u0})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150894
Approved by: https://github.com/desertfire
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.
This PR is to add torch._scaled_mm for CPU backend.
_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
If the input is channels last, then MPS will return a channels last output.
This fixes `GPUTests.test_convolution_4_mps` from test_torchinductor.py,
which previously failed with
```
AssertionError: expected size 3==3, stride 1==192 at dim=1; expected size 12==12, stride 48==16 at dim=2; expected size 16==16, stride 3==1 at dim=3
```
As the FakeTensor implementation of conv returned a `Contiguous`, rather than `ChannelsLast`, layout on MacOS-15 or later.
This doesn't seem to be very well documented, so will try to document the call path for `ExternKernel` invocation for `aten::convolution`:
- First inductor decomp defined here is called
c93e4b8290/torch/_inductor/kernel/conv.py (L424-L425)
- Then it goes thru FakeTensor decomposition implemented here
320914f1b6/torch/_subclasses/fake_impls.py (L739-L740)
- Finally it goes down to convolution meta registrations implemented here
320914f1b6/torch/_meta_registrations.py (L2416-L2417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151042
Approved by: https://github.com/dcci
Summary:
- Flip default value of `strict` argument from True to False on torch.export.export_for_training API
- All callsites have been updated to provide this argument explicitly to avoid behavior change.
- If you see any breakages, that means you may have a new callsite that was missed; please set `strict=True` explicitly at the callsite to mitigate.
Test Plan: CI
Differential Revision: D72724975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150941
Approved by: https://github.com/ydwu4
Summary:
The reference quantized modules for linear / conv / etc fail to torchscript due to two issues
(1) The type of torch.qscheme doesn't script
(2) The "_DTYPE_TO_QVALUE_BOUNDS" values were resolving to union[float, int] instead of just int. We fix that with a hard cast.
See: <internal post> + comments for more context
Test Plan: unit tests + fixing this NB N6923590
Differential Revision: D72652616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150870
Approved by: https://github.com/jerryzh168
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally closed the socket in the case when nread == 0, thinking it meant the connection was closed. This is not true. According to the libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.
> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).
We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.
Added a unit test to test this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
## Summary of changes
1. Change assertion to a warning, when no all gather or reduce scatter patterns are found, and remove the corresponding unit test. It seems some valid TP graphs may not have any pattern matches, from what I can see.
2. Fix wrong variable name being referenced (`A_with_scatter_dim_0` instead of just `A`)
3. Simplify reshaping to target output shape (don't need to recalculate output shape)
4. When "A" tensor is 2D, so we are doing doing a 2D x 2D scaled mm, we need to fix our handling of the case where the scatter dim is 0. When scatter dim is 0 for the 2D scaled mm output shape, this is actually dim 1 in the unreduced stacked partial scaled mm outputs, which has a (logical) shape of `(group_size, M//group_size, N)`. To summarize:
- Unreduced stacked partials are of shape `(M, N)`
- We view as `(group size, M//group_size, N)` and reduce along the scatter dim (`group_size` / dim 0).
- Reduced output (`reduced_out`) has shape `(M//group_size, N)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150935
Approved by: https://github.com/lw
This modification is to support XPU kernels for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149114
Approved by: https://github.com/guangyey, https://github.com/albanD
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies the workaround such that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 are only applied if CUDA < 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
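For context, the affected scenario is roughly profiling a cudagraph-backed region. A minimal sketch (not from this diff), assuming a CUDA build:
```python
import torch
from torch.profiler import profile, ProfilerActivity

def fn(x):
    return (x * x).sum()

compiled = torch.compile(fn, mode="reduce-overhead")  # cudagraph-backed on CUDA
x = torch.randn(1024, device="cuda")
compiled(x)  # warm up / graph capture

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    compiled(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```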
Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This PR removes the usage of guard_size_oblivious in vector_norm by inlining it into the runtime check;
this prevents any data-dependent error from ever appearing at the locations where guard_size_oblivious
used to exist. Before this PR it could potentially break. This is NOT BC-breaking and does not change semantics from eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
# Root cause
The barrier timeout set to 0.1 is too short, some threads may not have enough time to reach the barrier.
# How to reproduce
Adding a short sleep makes it easy to reproduce.
```python
def test_barrier_timeout_rank_tracing(self):
    N = 3
    store = dist.HashStore()

    def run_barrier_for_rank(i: int):
        if i != 0:
            import time; time.sleep(1)  # Let some thread sleep for a while
        try:
            store_util.barrier(
                store,
                N,
                key_prefix="test/store",
                barrier_timeout=0.1,
                rank=i,
                rank_tracing_decoder=lambda x: f"Rank {x} host",
                trace_timeout=0.01,
            )
        except Exception as e:
            return str(e)
        return ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150768
Approved by: https://github.com/d4l3k
PR https://github.com/pytorch/pytorch/pull/149665 made a change to optimized_add that is causing an issue internally.
In general, make_optimized should only be called with valid new_args; new_args can become None
when an element already exists, and we should break out of the loop in that case.
Note that I also only maintained the optimized summation when both lhs and rhs lengths are <= 2.
This is ok because the optimization is based on the inductive property of adding one symbol at a time;
the [2]+[2] case here serves as the base case (I feel we could also remove it).
Note that while keeping it for all sizes would be correct, I am not sure it is as efficient (we would do N log(N) insertions);
there is no current justification for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150955
Approved by: https://github.com/Mingming-Ding, https://github.com/atalman, https://github.com/bobrenjc93
# Motivation
Adapt `torch.accelerator.device_count` for multi-process usage. For example, `torch.cuda.device_count` avoids poisoning fork, so `torch.accelerator.device_count` should meet the same requirement.
Now that `torch.get_device_module(device).device_count` supports this, `torch.accelerator.device_count` should align with this behavior as well (see the sketch below).
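A minimal sketch (not from the PR) of the scenario this enables: querying the device count in the parent should not initialize the accelerator runtime, so a later fork stays safe.
```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # each child can still query its own accelerator state
    print(rank, torch.accelerator.device_count())

if __name__ == "__main__":
    n = torch.accelerator.device_count()  # must not poison the fork below
    mp.start_processes(worker, nprocs=max(n, 1), start_method="fork")
```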
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149924
Approved by: https://github.com/albanD
ghstack dependencies: #147507
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.
This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.
Related issue: https://github.com/pytorch/pytorch/issues/150943
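A minimal sketch (not from the PR) of the intended usage: `clone()` produces a second handle to the same store that is safe to use from another thread.
```python
import threading
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)

def watchdog(s):
    s.set("error_key", "ok")  # e.g. error propagation from a background thread

t = threading.Thread(target=watchdog, args=(store.clone(),))
t.start()
t.join()
print(store.get("error_key"))  # b'ok'
```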
Test plan:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
Summary:
- We are saying the minimum version of optree (the C++ pytree backend) that PyTorch can use is
0.13.0
- If a user imports torch.utils._cxx_pytree, it will raise an
ImportError if optree doesn't exist or exists and is less than the
minimum version.
Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
the environment might be too old. Instead, raising ImportError in
torch.utils._cxx_pytree solves the issue.
2. We used to emit an "optree version too low" warning. I've deleted the
warning in favor of the more explicit ImportError (see the sketch below).
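A minimal sketch (not from the PR) of what the new behavior means for callers:
```python
# Importing the C++ pytree wrapper with a missing or too-old optree now raises
# ImportError up front, so callers can fall back explicitly.
try:
    import torch.utils._cxx_pytree as pytree
except ImportError:
    import torch.utils._pytree as pytree  # pure-Python fallback

print(pytree.tree_flatten({"a": 1, "b": [2, 3]})[0])  # [1, 2, 3]
```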
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan
If you try to use torch in C++ using modules, it will not compile because static functions are not supported in MSVC when using modules: https://developercommunity.visualstudio.com/t/10323558.
It's also aligned with [C++20 standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf) (ISO/IEC 14882:2020) 10.2.7 Export declaration [module.interface]: "Exported names have either external linkage or no linkage".
Fixes https://github.com/pytorch/pytorch/issues/71309
Tested using the following code.
```c++
export module testModule;
import <torch/torch.h>;
import <memory>;
import <string>;
import <tuple>;
import <iostream>;
export namespace testModule
{
    export void test()
    {
        torch::Tensor tensor1 = torch::rand({ 2, 3 });
        torch::Tensor tensor2 = torch::rand({ 3, 2 });
        // Perform tensor multiplication
        torch::Tensor result = torch::matmul(tensor1, tensor2);
        // Print the tensors
        std::cout << "Tensor 1: " << tensor1 << std::endl;
        std::cout << "Tensor 2: " << tensor2 << std::endl;
        std::cout << "Result of multiplication: " << result << std::endl;
    }
}
```
```c++
import testModule;
int main()
{
    testModule::test();
    return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148675
Approved by: https://github.com/albanD, https://github.com/malfet
Co-authored-by: mantaionut <ionut@janeasystems.com>
Summary:
We add the functionality to allow users to directly pass an at::Tensor
into AOTInductor to be used as the constant.
This user-managed buffer skips the copying step in AOTInductor and lets
users directly manage the memory usage themselves.
Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.
This also fixes real tensor propagation in `run_decomposition()`.
Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```
Differential Revision: D72732714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.
- this may not support all possible combinations of strided and regular
shardings, but the support before this PR was not complete anyway
This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
StridedShard._split_tensor
- full_tensor() reconstruction requires fixes in
StridedShard._to_replicate_tensor to correctly reshuffle the data into
the original pre-sharded order
- Distributed Checkpointing support requires correct computation of the
compute_local_shape_and_global_offset util so it knows how a local
shard maps to the global tensor, for reconstruction during
load/reshard.
This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.
Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894
Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)
mesh = (DP=2, TP=2)
original = torch.arange(5)
**Applying Sharding**
Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`
local_tensors:
rank0: [0,1,2] rank1: [3,4]
rank2: [0,1,2] rank3: [3,4]
Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1] rank1: [3]
rank2: [2] rank3: [4]
**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.
Note the discrepancy with the 'tp sharded tensor' line above! We'll fix it by locally shuffling data.
local_tensors:
rank0: [0,1,3] rank1: [0,1,3]
rank2: [2,4] rank3: [2,4]
Step 1: replicate over the DP dimension. Afterwards, each rank can locally sort the values.
note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
local_tensors:
rank0: [0,1,3,2,4] rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4] rank3: [0,1,3,2,4]
Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back
01324# <- our allgather includes padding, if padding was applied in step 1
01324 <- Remove the padding
013, 24 <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4 <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34 <- interleave with stride=TP mesh dim size
01234 <- concatenate
Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
## Improvements to `docstring_linter`
* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)
## Help text
```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
[--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
[--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
[files ...]
`docstring_linter` reports on long functions, methods or classes without docstrings
positional arguments:
files A list of files or directories to lint
optional arguments:
-h, --help show this help message and exit
-l, --lintrunner Run for lintrunner and print LintMessages which aren't edits
-v, --verbose Print more debug info
--grandfather GRANDFATHER, -g GRANDFATHER
Set the grandfather list
--grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
Tolerance for grandfather sizes, in percent
--lint-init, -i Lint __init__ and class separately
--lint-local, -o Lint definitions inside other functions
--lint-protected, -p Lint functions, methods and classes that start with _
--max-class MAX_CLASS, -c MAX_CLASS
Maximum number of lines for an undocumented class
--max-def MAX_DEF, -d MAX_DEF
Maximum number of lines for an undocumented function
--min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
Minimum number of characters for a docstring
--no-grandfather, -n Disable the grandfather list
--report, -r Print a report on all classes and defs
--write-grandfather, -w
Rewrite the grandfather list
```
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145834
Approved by: https://github.com/amjames, https://github.com/eellison
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`
This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first
This also updates the gloo submodule to include the required changes.
Test plan:
added lazy init test variants
```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.
```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```
Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```
Differential Revision: D72667430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.
https://fburl.com/code/9qzytl6q
torch._functorch.config.generate_fake_kernels_from_real_mismatches
If we set this flag to True, then aoti compilation would work.
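A minimal sketch (not from the PR) of turning on the same fallback before AOTI compilation of a draft-exported program:
```python
import torch

# enable the fallback that draft export relies on when real/fake kernel outputs mismatch
torch._functorch.config.generate_fake_kernels_from_real_mismatches = True
# ... AOTI compilation of the draft-exported program would follow here
```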
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```
Differential Revision: D72345085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150651
Approved by: https://github.com/angelayi
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right (from dim 0 to higher dims))
to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process.
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0))], where the first
item in each tuple is the mesh_dim and the ordering of the tuples is the
sharding order.)
This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR- but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.
This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
This PR creates two utils for generating a schema for HOPs from example inputs, using base hop as an example.
1. HopArgumentInfoGen creates an argument or output schema with mutation information.
2. CFunctionSchemaGen pieces together the argument info of inputs and outputs and produces a torch._C.FunctionSchema.
The is_write attribute of the argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. it cannot support mutation inside a tuple). We need special handling for the case where we have tuple inputs, like cond.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149688
Approved by: https://github.com/zou3519
While looking at enabling FR analysis for coalesced collectives, I found that for slow-path coalescing (collectives which are not all-gather, all-reduce or reduce-scatter), we still record a start event for them. This is wrong, and we should do the same thing as endEvent recording.
I also made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
Currently the HPU device has no support for the _fused_sdp_choice_stub dispatcher function, so `scaled_dot_product_attention` selects the `MATH` backend by default via `_fused_sdp_choice_stub` on HPU. With this PR we enable support for the `_fused_sdp_choice_stub` dispatcher function, so that any backend (for example math, flash_attention, efficient_attention, cudnn_attention, overrideable) can be invoked according to user choice on the HPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149512
Approved by: https://github.com/drisspg
Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374
```
Traceback (most recent call last):
File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
from wheel.bdist_wheel import bdist_wheel
File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007
``modernize-use-default-member-init`` prefers initialization at the member declaration, which makes more ``= default`` constructors possible. Some violations of other modernize rules have also been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149046
Approved by: https://github.com/zou3519
Summary:
As titled.
A regular weight has the type torch.nn.parameter.Parameter, while a buffer or tensor constant has the type torch.Tensor.
Both types are valid.
Test Plan: CI
Differential Revision: D72657275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150867
Approved by: https://github.com/zhxchen17
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.
Here's a short TLDR of the bug:
- Due to D71983456(OSS: https://github.com/pytorch/pytorch/pull/149910), we cache CachingAutotuners in memory.
- Importantly: **Saving stuff in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving it in memory, CachingAutotuners do not reset global state.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times, but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code)
- CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: aafc4b6188/torch/_inductor/runtime/coordinate_descent_tuner.py (L69))
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
(OSS: aafc4b6188/torch/_inductor/runtime/triton_heuristics.py (L933))
- `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with:
```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```
This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now.
Note that this error only reproduces if:
- None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
- The autotune cache doesn't already have the best config stored in it
So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.
Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150860
Approved by: https://github.com/oulgen
To work around the limitation of 32 arguments per kernel and eventually be able to compile something like
```python
import torch
def foo(*args):
    rc = torch.empty_like(args[0])
    for arg in args:
        rc += arg
    return rc
tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```
For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both the C++ test and compile_shader to convert a list of tensors into a list of pointers valid on the GPU.
Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier2-only devices (which is true of all devices on MacOS-13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.
Added both C++ and Python unit tests that validate that this works.
Please note that neither using an ArgumentEncoder nor directly encoding the data guarantees that a buffer will not be freed before shader execution completes. On the other hand, this should already be guaranteed by the MPSCachingAllocator, which only frees memory after all streams have completed execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
Summary: Added a field `protocol` to `ExternKernelNodes` and all the lowering pass will always use the oss schema to serialize external kernel nodes from now on.
Test Plan: CI
Differential Revision: D72020444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150197
Approved by: https://github.com/zhxchen17
Includes ATen native transformers hipified sources in the ROCm+Windows build. This was previously removed because Triton is not available on Windows, but that causes further linker errors. Setting `USE_FLASH_ATTENTION=0` and `USE_MEM_EFF_ATTENTION=0` during the build will mitigate the missing headers without causing any linker errors, so we will use this approach for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150521
Approved by: https://github.com/jeffdaily
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.
I'm reworking the code a bit, by using a few helper functions, whose name should make this behavior clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
This PR is to resolve issue reported in https://github.com/intel/torch-xpu-ops/issues/1478
There are two cases failing in our Windows CI enabling effort.
- **test_xpu.py::TestXpuXPU::test_lazy_init_xpu** needs to add an `if __name__ == '__main__':` guard for Windows when using multiprocessing (see the sketch after this list). Refer to https://stackoverflow.com/a/18205006
```
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 24, in <module>
test_multi_process(model, input)
File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 16, in test_multi_process
assert p.exitcode == 0
AssertionError
```
- **test_xpu.py::TestXpuXPU::test_wrong_xpu_fork_xpu** is a Linux-only test case; we should skip it on Windows. Refer to 248487f455/test/test_multiprocessing.py (L609)
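A minimal sketch (not from the test suite) of the Windows-safe idiom the first case needs: multiprocessing entry points must live under an explicit main guard because Windows spawns rather than forks child processes.
```python
import torch
import torch.multiprocessing as mp

def _worker(rank):
    print(rank, torch.xpu.is_available())  # child-side work touching the XPU runtime

if __name__ == "__main__":  # required on Windows
    mp.spawn(_worker, nprocs=1)
```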
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150520
Approved by: https://github.com/guangyey, https://github.com/EikanWang
This PR extracts some test cases from TestPatternMatcher into a newly created TestPatternMatcherGeneric, and uses instantiate_device_type_tests to make them reusable across multiple devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150286
Approved by: https://github.com/jansel
# Changes over the previous PR
This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.
Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.
This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:
```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
>
__global__
void
+__launch_bounds__(block_dim_x * block_dim_y)
GammaBetaBackwardCUDAKernelTemplate(
int64_t M,
int64_t N,
```
I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg
<details>
<summary> Repro script that fails on Blackwell </summary>
```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path
# init_logging()
class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers: int, conv_stride: int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0, 2, 1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0, 2, 1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100, 2048, 512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")
# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)
print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```
</details>
I also re-ran my performance benchmark and found no regressions over the previous PR.
# Full description of the old PR
Original PR: https://github.com/pytorch/pytorch/pull/148605
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.
To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:
1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).
In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.
Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:
M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.35
Elapsed Cycles cycle 27,526
Memory Throughput % 2.21
DRAM Throughput % 0.54
Duration us 20.42
L1/TEX Cache Throughput % 4.31
L2 Cache Throughput % 2.62
SM Active Cycles cycle 1,475.02
Compute (SM) Throughput % 0.29
----------------------- ----------- ------------
```
M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.34
Elapsed Cycles cycle 10,920
Memory Throughput % 5.64
DRAM Throughput % 1.35
Duration us 8.13
L1/TEX Cache Throughput % 1.92
L2 Cache Throughput % 6.89
SM Active Cycles cycle 3,554.41
Compute (SM) Throughput % 0.67
----------------------- ----------- ------------
```
Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />
There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

For dtype=float32, we get a similar chart:
<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />
The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).
The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
I am including the regressions here for completeness' sake:
<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />
To see this better:
1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in
If you want to see the full data, here it is:

I also measured binary size and compile time since those are important for developers:
Binary size comparison

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
```
The diff in bytes is 302kB which is about a 0.1% increase.
Compile time difference:
```
# Original
real 0m10.931s
user 0m9.676s
sys 0m1.004s
# this PR
real 0m16.720s
user 0m15.514s
sys 0m1.066s
# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
```
So the new PR adds about 6 seconds of compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel, https://github.com/atalman
Hello,
I was going over the documentation to build PyTorch from source.
Unfortunately, the first thing that comes up is that you strongly recommend using Anaconda, which shouldn't be used because it's no longer free to use.
Could you please remove that from the doc?
I don't know if you are aware, but Anaconda is no longer free.
They changed their terms of service in 2020 to restrict commercial usage.
They changed their terms of service in 2024 to forbid downloading Anaconda and to forbid educational and non-profit usage too.
The download is open and doesn't require any registration, but if you download Anaconda they will sue you ^^
They have been raining lawsuits on users since last year. You may have heard about Anaconda vs Intel in the news. They started another 5 or so in the last few months.
https://www.reuters.com/legal/litigation/intel-sued-copyright-infringement-over-ai-software-2024-08-09/
You may need to adjust more of the docs and adjust your build system. The free-to-use alternative is miniforge with the conda-forge channel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150619
Approved by: https://github.com/seemethere
The following modifications were made to cpp_extension.py:
1) Changed the compiler flag to use --version.
2) Added a feature to convert an alpha-numeric version string returned by the compiler into a numeric string. This was the source of the error, as the parser was failing to parse the alpha-numeric version string.
Built with the following PyTorch extensions: Apex, TorchVision, TorchAudio & DeepSpeed.
Unit tested with the following PyTorch extensions: Apex, TorchVision.
(cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd)
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451
Approved by: https://github.com/jeffdaily
Detail of the issue:
If PyTorch issues send/recv to each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to be aborted either in sequence or as a group.
I.e. the following abort sequence will cause a hang in NCCL: recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);
Fixes #119797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
Summary: If there is only one safetensors file, we don't need users to have a metadata file and we can just construct it from the keys of that file. This is a use-case for some HuggingFace models, so adding support for it
Test Plan:
ensure existing tests pass
tested e2e in a notebook
Differential Revision: D72472490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
Change the loop unrolling strategy. Previously, the code only unrolled the inner loop over block_size when the block size was a multiple of the vector length. This version instead unrolls the outer loop, which reduces the number of loads/stores for accumulation into the output array and improves performance when the block size is not a multiple of the vector length.
Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys
np.random.seed(0)
torch.manual_seed(0)
num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000
class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()
        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)
        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32
# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()
# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)
offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))
with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on NEOVERSEV1 with 1 thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
Summary: Introduce barrier util in the DistWrapper for rank local checkpointing. This barrier will be used at the end of the rank local checkpointing to ensure all ranks synchronize.
Test Plan: UTs
Differential Revision: D72541431
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
While debugging FR missing dumps and missing dump logs, I have a couple of initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG (or subPG), that watchdog thread will immediately throw an exception instead of sleeping. We want to fix that by still making the watchdog thread wait for 1 min.
2. The FR dump takes about 900ms to 1200ms, so we are not checking the store frequently enough. But instead of changing the frequency from 1 sec to 300 ms, we finally decided to just let all ranks sleep for 1 min universally rather than using a promise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
Summary:
Profiler side of memory snapshot.
1. Add an API to actually take the snapshot when the client interface is called.
2. Add ifdefs to the builds so that Kineto hooks the snapshot correctly.
Design philosophy: there is one interesting part of this implementation, and it is during export. For export we call the Python impl of the export rather than C++, even though we are already in C++. This is because it is better to have a single export path rather than two. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths we will have an easier time maintaining this relationship.
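For reference, a minimal sketch (not from this diff) of the snapshot workflow the profiler integration hooks into, using the existing `torch.cuda.memory` APIs (CUDA build assumed):
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100000)
x = torch.randn(1024, 1024, device="cuda")
del x
torch.cuda.memory._dump_snapshot("snapshot.pickle")      # view with the memory_viz tool
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording
```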
Test Plan: {F1976563426}
Reviewed By: sanrise
Differential Revision: D70733247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper
Differential Revision: D72271275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
Summary: To preserve global state guards we need to make the C++ type serializable. We use JSON because it's easy and we don't have a lot of data in the global state.
Test Plan: test_dynamo -k test_global_state_guard_serialization
Differential Revision: D72410611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150636
Approved by: https://github.com/williamwen42
Summary:
`-Wambiguous-reversed-operator` warns about ambiguous reversed operators, e.g. `a < b` and `b > a` are both valid. Such operators are disallowed in C++20. This codemod fixes the warnings.
#buildsonlynotests - If this diff compiles, it works.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D72535527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150744
Approved by: https://github.com/drisspg
There is some sort of bug in `pytype` where if this function doesn't have type hints, `pytype` will spend 10 minutes inferring the types. Not that this matters much for a project not using `pytype`, but it led me to realize that this function could easily be type hinted and is not, so here is a PR adding some type hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150715
Approved by: https://github.com/Skylion007
I noticed that I couldn't use `vec::Vectorized` operations with scalars, even though there is an implicit conversion from `T` to `vec::Vectorized<T>`, so I made it work.
Test Plan: Added tests. Reverted vec_base.h, left the new tests in place, and confirmed that new tests don't compile in that state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150380
Approved by: https://github.com/Skylion007
Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute the AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weights, which causes an artificially high memory compression ratio. This PR removes that extra warmup run and calls reset_peak_memory_stats in the proper place instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695
Approved by: https://github.com/yushangdi
Fixes#144196
Extends #144106 and #144110
## Open Problems:
- [ ] Annotating with `numbers.Number` is a bad idea, should consider using `float`, `SupportsFloat` or some `Protocol`. https://github.com/pytorch/pytorch/pull/144197#discussion_r1903324769
# Notes
- `beta.py`: needed to add `type: ignore` since `broadcast_all` is untyped.
- `categorical.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`dirichlet.py`: replaced `axis` with `dim` arguments.~~ #144402
- `geometric.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`independent.py`: fixed bug in `Independent.__init__` where `tuple[int, ...]` could be passed to `Distribution.__init__` instead of `torch.Size`.~~ **EDIT:** turns out the bug is related to typing of `torch.Size`. #144218
- `independent.py`: made `Independent` a generic class of its base distribution.
- `multivariate_normal.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- `relaxed_bernoulli.py`: added class-level type hint for `base_dist`.
- `relaxed_categorical.py`: added class-level type hint for `base_dist`.
- ~~`transforms.py`: Added missing argument to docstring of `ReshapeTransform`~~ #144401
- ~~`transforms.py`: Fixed bug in `AffineTransform.sign` (could return `Tensor` instead of `int`).~~ #144400
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.log_abs_det_jacobian`[^1]; replaced `torch.abs(scale)` with `scale.abs()`.
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.__eq__`[^1].
- `transforms.py`: Fixed type hint on `CumulativeDistributionTransform.domain`. Note that this is still an LSP violation, because `Transform.domain` is defined as `Constraint`, but `Distribution.domain` is defined as `Optional[Constraint]`.
- skipped: `constraints.py`, `constraints_registry.py`, `kl.py`, `utils.py`, `exp_family.py`, `__init__.py`.
## Remark
`TransformedDistribution`: `__init__` uses the check `if reinterpreted_batch_ndims > 0:`, which can lead to the creation of `Independent` distributions with only 1 component. This results in awkward code like `base_dist.base_dist` in `LogisticNormal`.
```python
import torch
from torch.distributions import *
b1 = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
b2 = MultivariateNormal(torch.tensor([0.0]), torch.eye(1))
t = StickBreakingTransform()
d1 = TransformedDistribution(b1, t)
d2 = TransformedDistribution(b2, t)
print(d1.base_dist) # Independent with 1 dimension
print(d2.base_dist) # MultivariateNormal
```
One could consider changing this to `if reinterpreted_batch_ndims > 1:`.
[^1]: Usage of `isinstance(value, numbers.Real)` leads to problems with static typing, as the `numbers` module is not supported by `mypy` (see <https://github.com/python/mypy/issues/3186>). This results in us having to add type-ignore comments in several places
[^2]: Otherwise, we would have to add a bunch of `type: ignore` comments to make `mypy` happy, as it isn't able to perform the type narrowing. Ideally, such code should be replaced with structural pattern matching once support for Python 3.9 is dropped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144197
Approved by: https://github.com/malfet
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Fixes#142397
Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests
Regarding tests, I found some tests in `test/test_torch.py` for GradScaler, but I couldn't figure out the best way to enable them for the MPS device.
By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`
this enables lots of tests in that class for MPS which were not(?) being tested before. This part needs some clarification.
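A minimal sketch (not from the PR) of the newly supported path: GradScaler driving an optimizer step on the MPS device.
```python
import torch

device = "mps"
model = torch.nn.Linear(4, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler(device)

x = torch.randn(8, 4, device=device)
with torch.autocast(device_type=device, dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```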
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
When replacing placeholders with getattrs during constant folding, we can have an argument and parameter name mismatch. In fact, there is no guarantee that the parameter name is equivalent to the argument name used in the module call.
Differential Revision: D72415970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150692
Approved by: https://github.com/jfix71
Summary:
We used RAIIAtenTensorHandle for ConstantMap; RAIIAtenTensorHandle
is a unique_ptr, meaning that all memory handling is done internally by
AOTInductor.
In this PR, we introduce ConstantAtenTensorHandle, which replaces
RAIIAtenTensorHandle. This class holds a raw AtenTensorHandle, and also
owns a RAIIAtenTensorHandle if the user decides to delegate memory
management to AOTInductor.
This is a prerequisite for user-managed buffers. This PR, however, only
introduces the class and makes sure it works with the existing AOTInductor,
with default behavior identical to using RAIIAtenTensorHandle.
Test Plan:
Existing tests. No change should be introduced within this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also allows us to significantly reduce the amount of shared memory needed to perform those reductions.
Using such reductions increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes the perf of torch.rand as follows:
|size| before | after |
|------------------------|------------|-------------|
| 512x512 | 202.1 | 131.8 |
| 1024x1024 | 780.6 | 176.9 |
| 2048x2048 | 1423.4 | 339.9 |
| 4096x4097 | 2982.2 | 1047.2 |
Unfortunately, none of the SIMDgroup operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` on 64-bit values represented as `int2` types, which yields a reduction in $log_2(threadgroup\\_size)$ steps. [`mlx/kernels/reduction/ops.h`](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
which returns the following on M4
```
tensor([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 0, 0, 0, 0, 0, 0, 0, 0], device='mps:0', dtype=torch.int32)
```
but the same kernel running on M1 returns
```
tensor([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernel using `simd_shuffle_and_fill_down` causes an internal compiler error on macOS 13.2. Considering that this OS will be EOL soon, the offending tests are skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
Summary: The previous diff broke a few tests that didn't run on internal or GH CI (T220169086); this fixes that issue. The {% if } block is only supposed to support autotuned parameters (constexpr) and, based on other examples, should not be used for locals.
Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_tensorwise_scaling_bfloat16_shape_16,32,32_has_bias_False_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
Reviewed By: NikhilAPatel
Differential Revision: D72460516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150686
Approved by: https://github.com/eellison, https://github.com/NikhilAPatel
Summary: add a param to specify to the storage writer how to save tensors. Right now the only options are safetensors and torch.save.
Test Plan:
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB Down: 1.9KiB (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining 0/4 3.4s exec time total
Command: test. Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
Reviewed By: saumishr
Differential Revision: D70271943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
Summary:
Splitting the type definition of ConstantType into a separate header, because it's needed by Sigmoid OSS but including the entire model.h header causes the following compilation error:
```
2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o
2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. -I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/
2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5:
2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6:
2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5:
2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) {
```
model.h defines the RAII_malloc functions directly in an anonymous namespace, which seems pretty sad. We should do something about it, but maybe not in the current diff.
Test Plan: CI
Differential Revision: D72320413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545
Approved by: https://github.com/desertfire
This PR is related to https://github.com/pytorch/pytorch/pull/145476. That PR had two files (test_functions.py and test_misc.py); test_functions was causing CI/rebase/merge issues and hence was removed for now. This PR contains only test_misc.py.
This is a continuation of https://github.com/pytorch/pytorch/pull/144387 .
## MOTIVATION
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
Other accelerators can also extend the functionality by adding their device to the devices list (e.g. xpu).
## CHANGES
- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (see the sketch after this list)
- Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
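A minimal sketch of the pattern described above (illustrative names, not the actual test classes from this PR):
```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestFooDeviceType(TestCase):
    # Each test receives the device string and runs on every targeted device.
    def test_add(self, device):
        x = torch.ones(2, device=device)
        self.assertEqual((x + x).sum().item(), 4.0)

# Generate device-specific instances only for the targeted device types.
instantiate_device_type_tests(TestFooDeviceType, globals(), only_for=["cuda", "hpu"])

if __name__ == "__main__":
    run_tests()
```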
PS: Most of these changes were initially part of https://github.com/pytorch/pytorch/pull/147609, but that PR was closed due to merge conflicts. The review comments were handled in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149499
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever
Fixes #129673
### Summary:
Modifying a tensor by reshaping it in place (such as `unsqueeze_`) should cause a graph break; however, when the method was accessed through the `torch.Tensor` API, as opposed to as a `self` attribute, the code crashed with an error (see attached issue).
The paths differed when traced due to the stack variable that was popped:
* `self.unsqueeze_` pops a `LazyVariableTracker` which gets resolved to `TensorVariable`, so when looking for the method, triggers the fn call `var_getattr` in `_dynamo/variables/tensor.py`; since this is an inplace view (metadata mutation) on graph input, it is not well supported so should fall back (see [L446](1017927c83/torch/_dynamo/variables/tensor.py (L446)) in that file)
* `torch.Tensor.unsqueeze` pops a `UserDefinedClassVariable` so when looking for the method, triggers the fn call `var_getattr` in `_dynamo/variables/user_defined.py` on [L273](a8f6b40e36/torch/_dynamo/variables/user_defined.py (L273)). This path tries to build a variable tracker from the obj popped, which resolves to a trace_rule , and as a Tensor method, is resolved to `TorchInGraphFunctionVariable` on [L3767](a8f6b40e36/torch/_dynamo/trace_rules.py (L3767))
So one straightforward option is to check in `torch.py` whether the fn is an inplace_view on an input tensor when we resolve the `__call__` function for the `TorchInGraphFunctionVariable` instead, which resolves the bug by providing a graph break.
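A minimal sketch of the two call paths (in the spirit of the new test; with this fix, both now graph break and fall back instead of crashing):
```python
import torch

@torch.compile(backend="eager")
def via_attribute(x):
    x.unsqueeze_(0)                 # resolved via TensorVariable -> already fell back
    return x + 1

@torch.compile(backend="eager")
def via_class(x):
    torch.Tensor.unsqueeze_(x, 0)   # resolved via TorchInGraphFunctionVariable -> used to crash
    return x + 1

print(via_attribute(torch.randn(3)).shape)  # torch.Size([1, 3])
print(via_class(torch.randn(3)).shape)      # torch.Size([1, 3])
```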
### Test
```
pytest test/dynamo/test_functions.py::FunctionTests::test_unsqueeze_inplace
```
Results in
```
Running 1 items in this shard
test/dynamo/test_functions.py . [100%]
=========================================================================================== 1 passed in 9.16s ==========================================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150573
Approved by: https://github.com/anijain2305
This PR makes it so that we don't crash due to logging if we invoke AOTAutogradCache/FXGraphCache without using dynamo. This is preparation for supporting certain vLLM use cases where they store graph modules and have special handling in conjunction with the caches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150423
Approved by: https://github.com/oulgen
This PR fixes two race conditions that occur when unit tests are run:
- in a particular order within a single shard, or
- concurrently in multiple shards.
Each test now gets a unique filename that depends on the test name.
There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs; the criteria were relaxed.
- bmm_tunableop_rocm now checks that the rotating buffer is not zero. Otherwise, the test is not useful.
Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463
Approved by: https://github.com/jeffdaily
Summary: Add an experimental feature to defer the PyTorch library initialization cost to post startup. As noted, this feature is not thread safe; it requires the client to maintain thread safety at library load time.
Reviewed By: zou3519
Differential Revision: D71917841
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150537
Approved by: https://github.com/zou3519
Allocations made with cudaHostRegister should be freed with the corresponding cudaHostUnregister, and similarly for cudaHostAlloc / cudaFreeHost. In test_cuda.py, the allocator config changes from test to test but the cache is not emptied before changing the config. This results in the wrong free being called later. Unit test sharding has been masking this issue, but running test_cuda.py as a single shard will fail.
The following reproducer demonstrates the problem.
```C++
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
  void *ptr;
  // Allocate with cudaHostAlloc, then free with the mismatched
  // cudaHostUnregister / std::free pair to reproduce the failure.
  assert(cudaSuccess == cudaHostAlloc(&ptr, 1024, cudaHostAllocDefault));
  assert(cudaSuccess == cudaHostUnregister(ptr));
  std::free(ptr);
  return 0;
}
```
The above code results in the following failure because the ptr is an invalid argument to cudaHostUnregister.
```
a.out: test.cpp:53: int main(int, char**): Assertion `cudaSuccess == cudaHostUnregister(ptr)' failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146520
Approved by: https://github.com/ngimel
# Changes over the previous PR
This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.
Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.
This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:
```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
>
__global__
void
+__launch_bounds__(block_dim_x * block_dim_y)
GammaBetaBackwardCUDAKernelTemplate(
int64_t M,
int64_t N,
```
I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg
<details>
<summary> Repro script that fails on Blackwell </summary>
```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path
# init_logging()
class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation

    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")
# with profiler(Path("conv")):
# # print(f"layers=1, stride=1")
# # test(n_layers=1, conv_stride=1)
# # print(f"layers=2, stride=1")
# # test(n_layers=2, conv_stride=1)
# # print(f"layers=1, stride=2")
# # test(n_layers=1, conv_stride=2)
# print(f"layers=2, stride=2")
# test(n_layers=2, conv_stride=2)
print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```
</details>
I also re-ran my performance benchmark and found no regressions over the previous PR.
# Full description of the old PR
Original PR: https://github.com/pytorch/pytorch/pull/148605
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.
To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:
1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
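A rough sketch of that kind of sweep (CUDA-event timing only; the L2-cache flushing and the full shape grid from the actual benchmark are omitted):
```python
import itertools
import torch

def bench_layernorm_bwd(M, N, dtype, iters=10):
    x = torch.randn(M, N, device="cuda", dtype=dtype, requires_grad=True)
    ln = torch.nn.LayerNorm(N, device="cuda", dtype=dtype)
    y = ln(x)
    grad = torch.randn_like(y)
    for _ in range(3):                      # warmup
        y.backward(grad, retain_graph=True)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        y.backward(grad, retain_graph=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds

for dtype, k in itertools.product((torch.half, torch.float), range(5, 13)):
    for M, N in ((2**k, 512), (512, 2**k)):
        print(dtype, M, N, f"{bench_layernorm_bwd(M, N, dtype):.3f} ms")
```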
Summary: The new code performs better than the old code, especially for powers of 2. For the M >> N case it performs very well (the kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).
In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.
Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:
M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.35
Elapsed Cycles cycle 27,526
Memory Throughput % 2.21
DRAM Throughput % 0.54
Duration us 20.42
L1/TEX Cache Throughput % 4.31
L2 Cache Throughput % 2.62
SM Active Cycles cycle 1,475.02
Compute (SM) Throughput % 0.29
----------------------- ----------- ------------
```
M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.34
Elapsed Cycles cycle 10,920
Memory Throughput % 5.64
DRAM Throughput % 1.35
Duration us 8.13
L1/TEX Cache Throughput % 1.92
L2 Cache Throughput % 6.89
SM Active Cycles cycle 3,554.41
Compute (SM) Throughput % 0.67
----------------------- ----------- ------------
```
Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />
There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

For dtype=float32, we get a similar chart:
<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />
The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).
The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
I am including the regressions here for completeness' sake:
<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />
To see this better:
1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in
If you want to see the full data, here it is:

I also measured binary size and compile time since those are important for developers:
Binary size comparison

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
```
The size difference is about 302 kB, which is roughly a 0.1% increase.
Compile time difference:
```
# Original
real 0m10.931s
user 0m9.676s
sys 0m1.004s
# this PR
real 0m16.720s
user 0m15.514s
sys 0m1.066s
# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
```
So the new PR adds about 6 seconds of compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel
Summary: Add support for caching of CUDA (nvcc) compilation errors to codecache.py
Test Plan: CI ( for example Cutlass backend unit tests )
Reviewed By: ColinPeppler
Differential Revision: D71562040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149716
Approved by: https://github.com/ColinPeppler
This PR is a duplicate of https://github.com/pytorch/pytorch/pull/139975.
This PR adds torch._scaled_mm for the CPU backend.
_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
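A minimal usage sketch of the newly supported CPU path (shapes, scales, and dtypes here are illustrative; whether the oneDNN path or the emulated fallback is used depends on the platform):
```python
import torch

a = torch.randn(16, 32).to(torch.float8_e4m3fn)      # row-major FP8 input
b = torch.randn(64, 32).to(torch.float8_e4m3fn).t()  # column-major FP8 weight
scale_a = torch.tensor(1.0)                           # per-tensor scales (float32)
scale_b = torch.tensor(1.0)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([16, 64])
```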
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
## Summary
Remove the `tp_parallelize_plan` assignment that accidentally overwrites the previous assignments in `test_fsdp_dsd.py`.
## Test
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150354
Approved by: https://github.com/wconstab
Summary:
**Context**: https://github.com/pytorch/pytorch/pull/150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes.
In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...`). And in versions of torch that don't include the WS diff, you generally don't expect to see this kwarg.
But if a program is run with WS-torch (i.e. torch w/ the WS diff), and then later you run the same program with non-WS-torch, then non-WS-torch is going to find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because there's no special handling for num_consumer_groups in this version of torch. Then the program crashes with a triton failure message.
**The fix**: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).
Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but this doesn't repro in OSS because this version of warp specialization is not available in oss triton) can repro the failure, and the failure is fixed after this PR is patched.
Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails).
Differential Revision: D72285303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150494
Approved by: https://github.com/oulgen
An internal model was serialized in 2023, and is now breaking while loading with the following error:
```
File "<eval_with_key>.1675", line 4
def forward(self, arg1163_1, arg1164_1, , arg1166_1, , arg1168_1, arg1169_1, arg1170_1, , arg1172_1, arg1173_1, arg1174_1, arg1175_1, arg1176_1, arg1177_1, arg1178_1, arg1179_1, arg1180_1, arg1181_1, arg1182_1, arg1183_1, arg1184_1, arg1185_1, arg1186_1, arg1187_1, arg1188_1, arg1189_1, arg1190_1, arg1191_1, arg1192_1, arg1193_1, arg1194_1, arg1195_1, arg1196_1, arg1197_1, arg1198_1, arg1199_1, arg1200_1, arg1201_1, arg1202_1, arg1203_1, arg1204_1, arg1205_1, arg1206_1, arg1207_1, arg1208_1, arg1209_1, arg1210_1, arg1211_1, arg1212_1, arg1213_1, arg1214_1, arg1215_1, arg1216_1, , arg1218_1, arg1219_1, arg1220_1, arg1221_1, arg1222_1, arg1223_1, arg1224_1, , arg1226_1, arg1227_1, arg1228_1, , arg1230_1, , , , , , , , , , , , , , , ):
^
SyntaxError: invalid syntax
```
The syntax errors are due to inputs that are `None` when exporting. Prior to changes in https://github.com/pytorch/pytorch/pull/123590 (landed 4/2024), input specs for None inputs look like `InputSpec(userInput=UserInputSpec(arg=Argument(asNone=True)))`, and during deserialization when creating a node, we would just use a dummy name `arg`. After those changes, the input specs for None inputs look like `InputSpec(constantInput=InputToConstantInputSpec(name='y', value=ConstantValue(asNone=True)))`, and when creating a node we would use the name `y` as the name. However, the PR didn't handle the case of loading an old package which doesn't have this name, so it ended up putting empty names in the placeholder nodes.
This error was uncovered after https://github.com/pytorch/pytorch/pull/149717, where we now use the GraphModule's python codegen to run the UnflattenedModule instead of going through the interpreter path. The placeholder nodes having empty names caused the python codegen to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150515
Approved by: https://github.com/yushangdi
Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update the AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at run time instead of at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534
Approved by: https://github.com/yushangdi
# Motivation
Currently, in Pytorch XPU, `cudaStream_t` is mapped to `sycl::queue&`, so an implicit cast from `XPUStream` to `sycl::queue&` is provided just like `CUDAStream` has an implicit cast to `cudaStream_t`.
But on the SYCLomatic side, we migrate `cudaStream_t` to `sycl::queue*` rather than `sycl::queue&`. (One reason is that `cudaStream_t` is actually a pointer, so users can do anything with that integer. Another reason is that the early `sycl::queue` was not implemented as a pointer, so copying by value is not desirable.)
Without this PR:
```
cudaStream_t a = getCurrentCUDAStream();
cudaStream_t b = getCurrentCUDAStream().stream();
```
needs to be migrated to:
```
queue_ptr a = &(sycl::queue&)getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
With this PR:
```
queue_ptr a = getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148646
Approved by: https://github.com/guangyey, https://github.com/EikanWang
Changes decomposition behavior of `aten.to` to respect the aliasing/non-aliasing behavior in eager, and to specialize to the input/conversion dtype & device.
Before change: we always decompose `aten.to` into `_to_copy`, regardless of aliasing behavior. This leads us to ban mutations on the result of `_to_copy` when aliased, since we can't guarantee correct program semantics. This meant users had to explicitly call `.clone()` before mutating. In the special cases where we don’t ban mutations (e.g. dtype conversion), we add runtime assertions on the input & conversion dtype/devices in the decomposed program (see https://github.com/pytorch/pytorch/pull/142420).
After change: we decompose to the aliasing/non-aliasing behavior that matches eager, allowing mutations in all cases. We also add dtype/device assertions for all `aten.to` ops, starting in the pre-dispatch graph, basically specializing the program to the dtype/devices.
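A minimal sketch (not from this PR's tests) of the aliasing case this unblocks: a no-op `.to()` aliases its input in eager, and mutating the result no longer requires an explicit `.clone()`:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        y = x.to(torch.float32)  # x is already float32, so eager returns an alias
        y.mul_(2)                # previously banned; users had to .clone() first
        return y

ep = torch.export.export(M(), (torch.randn(3),))
print(ep.module()(torch.ones(3)))  # expected: a tensor of 2s
```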
Differential Revision: D71229547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149235
Approved by: https://github.com/tugsbayasgalan
Summary: make a check function for each input to avoid a "too large to optimize" error on `__check_inputs_outputs`
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r runtime_checks
```
Differential Revision: D72286280
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150553
Approved by: https://github.com/desertfire
Summary:
Codegen used to generate tmp_arg_{index} as temporary args, where index is the position in the caller.
We changed the codegen logic so that we can reuse previously generated samples, and only delete an arg after it is no longer used. In this case, we need to make {index} unique, since different functions could reuse the same "tmp_arg_{index}" name string while it corresponds to different args.
Test Plan: `python test/inductor/test_aot_inductor.py -k test_autotuning_args_reuse`
Differential Revision: D72297084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150522
Approved by: https://github.com/desertfire, https://github.com/22quinn
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102
Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.
Contributors: @arjun-choudhry
Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.
Differential Revision: D72207570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
subclass, by just upgrading pytorch, without having to change legacy
library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
can get signals and improve them.
As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
`torch._C._disabled_torch_function_impl` because we have a
`Parameter` subclass without `__torch_function__` override or if we
have a tensor subclass with `__torch_dispatch__` override. We graph
break on this for now, and plan to add support -- the logic for
simulating `torch._C._disabled_torch_function_impl` is already in
`SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
guards installed on it, but we only removed the ones whose source
_is_ the created synthetic source `s`, but forgot about chained
source like `s.foo`, this showed up as
`SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).
A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
`clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
to support that.
2. it overrides the `shape` property, so this patch updates
`TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
which gets reconstructed in `torch.Tensor.__torch_function__`, so
this patch adds support for calling `torch.Size(...)` on non-constant
inputs.
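A hedged, self-contained sketch (not the actual ComfyUI-GGUF class) of a subclass with the three characteristics listed above, which Dynamo should now be able to trace:
```python
import torch

class GGUFLikeTensor(torch.Tensor):
    @property
    def shape(self):                      # overridden property
        return super().shape

    def clone(self, *args, **kwargs):     # overridden torch.Tensor method
        return super().clone(*args, **kwargs)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        return super().__torch_function__(func, types, args, kwargs or {})

@torch.compile(backend="eager")
def f(x):
    # exercises the overridden method, the property, and torch.Size reconstruction
    return x.clone() * x.size()[0]

t = torch.randn(4).as_subclass(GGUFLikeTensor)
print(f(t))
```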
Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.
The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).
There are two things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
`torch._C._disabled_torch_function_impl`, so this patch updates
`SuperVariable.call_method` to handle it (we can't do a simpler
polyfill due to some bug with `var_getattr` raising
`NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
`TensorSubclassVariable.call_function`.
Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
This was failing due to pybind being strict about their cmake version
requirements.
This resolves errors like:
```
652.1 Compatibility with CMake < 3.5 has been removed from CMake.
652.1
652.1 Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
652.1 to tell CMake that the project requires at least <min> but has been updated
652.1 to work with policies introduced by <max> or earlier.
652.1
652.1 Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
652.1
652.1
652.1 -- Configuring incomplete, errors occurred!
```
Tested this locally with the following command:
```
./build.sh pytorch-linux-jammy-py3.12-halide -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-jammy-py3.12-halide:8a8989876ff1aa1d5b0e465177afebbc7a9da921
```
Closes https://github.com/pytorch/pytorch/issues/150420
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150560
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds
It takes some time to build nccl and it doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes.
The other changes in this PR are because there is an install_cuda script and an install_cuda_aarch64 script, and they both build nccl from source and define their own pins for the nccl version. There is also a .ci/docker/nccl-cu11.txt and cu12.txt that define the pins, and this is an attempt to unify them. Unfortunately this leads to a lot of files needing to be copied to the docker build.
Generally this seems to increase docker pull times by <1 min, P1768456379, but it's hard to tell what the real increase is.
15761 mib -> 16221 [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs)
`jq '[.layers[].size, .config.size] | add / 1024 / 1024'`
Example 6eb3c2e282 (39520169577-box)

TODO:
* Figure out a way to verify that nccl was built + works properly when it is expected (this time I just checked torch.distributed.is_nccl_available)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either always split the nccl, cuda, and cusparse installations, or always keep them together in one bash script
distributed/test_distributed_spawn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150226
Approved by: https://github.com/seemethere, https://github.com/atalman
1. Fixes the CMake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 has been removed from CMake.
Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
to tell CMake that the project requires at least <min> but has been updated
to work with policies introduced by <max> or earlier.
Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
2. Removes deprecated CUDA 12.4 build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150549
Approved by: https://github.com/clee2000
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
subclass, by just upgrading pytorch, without having to change legacy
library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
can get signals and improve them.
As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
`torch._C._disabled_torch_function_impl` because we have a
`Parameter` subclass without `__torch_function__` override or if we
have a tensor subclass with `__torch_dispatch__` override. We graph
break on this for now, and plan to add support -- the logic for
simulating `torch._C._disabled_torch_function_impl` is already in
`SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
guards installed on it, but we only removed the ones whose source
_is_ the created synthetic source `s`, but forgot about chained
source like `s.foo`, this showed up as
`SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).
A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
`clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
to support that.
2. it overrides the `shape` property, so this patch updates
`TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
which gets reconstructed in `torch.Tensor.__torch_function__`, so
this patch adds support for calling `torch.Size(...)` on non-constant
inputs.
Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.
The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).
There are two things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
`torch._C._disabled_torch_function_impl`, so this patch updates
`SuperVariable.call_method` to handle it (we can't do a simpler
polyfill due to some bug with `var_getattr` raising
`NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
`TensorSubclassVariable.call_function`.
Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.
When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we
special-case the pattern matcher to propagate arg_kwarg_vals when
it sees triton_kernel_wrapper_functional.
The strategy is:
1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata)
2) trace out the replacement graph with vals (which have the accurate Inductor metadata)
3) Propagate the arg_kwarg_vals from the first graph to the second.
4) Use the second graph as the replacement graph.
The strategy is this because we want to extend this to handle
auto_functionalized later up in the stack.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148046
Approved by: https://github.com/eellison
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.
Closes #142005.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
Previously, cudagraphs were skipped if the graph contained any meta tensor. However, we should not skip, since meta tensors do not involve any actual computation. This PR fixes the issue.
### Example
```python
import torch
def foobar(x, y):
return x * 2, y * 3
foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")
eager_out = foobar(t, y)
for _ in range(3):
compiled_out = foo_c(t, y)
```
Prior to this PR, the above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```
With this PR, we don't skip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478
Approved by: https://github.com/eellison
Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.
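An illustrative sketch of the annotation pattern (the class name here is hypothetical, not the actual `torch/_ops.py` class):
```python
from typing import Any

class _Namespace:
    """Hypothetical stand-in for the kind of class annotated in torch/_ops.py."""

    def __init__(self) -> None:
        self._cache: dict[str, Any] = {}

    def __getattr__(self, name: str) -> Any:
        # The real method can return an op packet, a module, or a callable;
        # returning Any keeps callers from having to narrow that union.
        try:
            return self._cache[name]
        except KeyError:
            raise AttributeError(name) from None
```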
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204
Approved by: https://github.com/malfet
Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after all memory-related operations, and
make sure each operation does indeed capture the consumed memory both in
CUDA and in torch's CUDACachingAllocator.
(2) Keep track of memory reserved by the CUDACachingAllocator in
torch and its relationship with global CUDA memory consumption.
Test Plan:
This PR is adding tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102
Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.
Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.
Differential Revision: D72207570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
Summary:
Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423
This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around.
To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo.
This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic.
This change will land simultaneously in PyTorch and Gloo repos.
Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```
Differential Revision: D72084111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD.
Test Plan: Tested manually with a simple matmul run with torch.compile(f, mode="max-autotune") and the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1.
Inspecting the generated AMDGCN shows that all loads/stores use bufferops.
Note: Since Inductor loads constants for many of the shape values, assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability.
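A hedged sketch of the kind of annotation the template now emits (not the actual GEMM template; `tl.assume` availability depends on the Triton version):
```python
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # Tell the compiler the pid math is non-negative so AMD buffer ops can be used.
    tl.assume(pid >= 0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2, mask=mask)
```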
Differential Revision: D71922698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373
Approved by: https://github.com/eellison
The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436
Approved by: https://github.com/BoyuanFeng
Summary:
Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process.
However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization.
Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now.
Test Plan: added test (positive and negative cases)
Differential Revision: D72001771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144
Approved by: https://github.com/bobrenjc93
Smoke Test - disable pypi package validation for binaries that package cuda libs. These binaries do not install packages via pypi.
Should Resolve this from `linux-binary-manywheel / manywheel-py3_11-cuda12_6-full-test / test`:
```
Traceback (most recent call last):
File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 468, in <module>
main()
File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 462, in main
smoke_test_cuda(
File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 274, in smoke_test_cuda
compare_pypi_to_torch_versions(
File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 220, in compare_pypi_to_torch_versions
raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")
RuntimeError: Can't find cudnn in PyPI for Torch: 9.5.1
```
Link: https://github.com/pytorch/pytorch/actions/runs/14101221665/job/39505479587#step:15:982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150194
Approved by: https://github.com/ZainRizvi
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time.
The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs.
I had another implementation which put all the code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit from the monolithic `PrepareModuleInputOutput`. But it is
1. less clean, and
2. conceptually abusing inheritance, because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa.
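A hedged usage sketch (the constructor arguments below are assumptions based on the existing `PrepareModuleInput`/`PrepareModuleOutput` APIs, not confirmed by this description):
```python
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import PrepareModuleInputOutput, parallelize_module

plan = {
    "block": PrepareModuleInputOutput(
        input_layouts=(Shard(0),),
        desired_input_layouts=(Replicate(),),
        output_layouts=(Replicate(),),
        desired_output_layouts=(Shard(0),),
    ),
}
# parallelize_module(model, device_mesh, plan)  # model / device_mesh assumed to exist
```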
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
The default workspace for hipblaslt is larger than for cublas/cublaslt which requires a slight increase to the buffer needed.
Forward-fix for #150227 that broke ROCm distributed tests but wasn't part of initial CI signal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150348
Approved by: https://github.com/jeffdaily
When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209
Approved by: https://github.com/anijain2305
**Summary**
This PR adds a lowering pass for the x86 backend (a rough flow sketch follows the list below):
- Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops.
- Weights are prepacked ahead of time.
- Post ops of conv/linear are fused if supported.
- The pass returns a `GraphModule` with the modifications mentioned above.
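For context, here is a rough sketch of where such a pass sits in the PT2E flow. The prepare/convert APIs and the X86InductorQuantizer are the standard flow; the name of the new lowering entry point is an assumption (inferred from the test name) and is left commented out.
```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(2, 16),)
exported = torch.export.export_for_training(M().eval(), example_inputs).module()

quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)           # calibration
converted = convert_pt2e(prepared)  # reference quantized GraphModule

# Hypothetical entry point for the new pass described above: it would fuse
# dequantize -> linear (-> quantize) into onednn ops, prepack weights, and
# return the modified GraphModule.
# lowered = lower_pt2e_quantized_to_x86(converted, example_inputs)
```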
**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
Before this change:
```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.12"
$ source venv/bin/activate
$ python3 -c 'import torch'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/PanXuehai/Projects/pytorch/torch/__init__.py", line 379, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
```
This PR adds `site-packages/nvidia/**/lib` to `LD_LIBRARY_PATH` in the `venv/bin/activate` script so that NVIDIA PyPI packages can be loaded correctly.
See also:
- #141837
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143262
Approved by: https://github.com/malfet
Alter the flag used to select the correct streamType in the CUDAPluggableAllocator class on ROCm GPUs. The TORCH_HIP_VERSION flag does not work as intended for ROCm, so it is replaced with USE_ROCM. This was impacting Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature. The change has been tested with rocm/apex.
See PR https://github.com/ROCm/apex/pull/184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
Relanding #148590 due to merge conflict.
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels now show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
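A minimal sketch of the user-facing contract described above (it assumes torch.distributed has already been initialized with the NCCL backend, e.g. via torchrun):
```python
import torch
import torch.distributed as dist

t = torch.ones(1024, device="cuda")

# async_op=False: the collective is launched on the current stream, so no
# extra event syncs are needed and the tensor is safe to use right after.
dist.all_reduce(t, async_op=False)

# async_op=True: a Work handle is returned; call wait() before consuming the
# result so tensor lifetimes are managed for you (the watchdog unstashing is
# only a safety net if you forget).
work = dist.all_reduce(t, async_op=True)
work.wait()
```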
Squashed contents:
* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU/GPU time from **230us to 195us (by 15%)**
* [PGNCCL] Make avoid-record-stream default
* [c10d] Add asyncOp argument to Ops
* Change python side wait
* Pass asyncOp at ProcessGroup level
* Watchdog unstashing tensors as a safety net
* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753
* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079
* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
In deep learning models, the tanh (hyperbolic tangent) function is a widely used activation function, primarily in feedforward networks, recurrent neural networks (RNNs), and various other architectures.
Also, the tanh (hyperbolic tangent) function is commonly used in **Physics-Informed Neural Networks (PINNs).** PINNs are a class of machine learning models designed to solve partial differential equations (PDEs) by incorporating the governing physics directly into the loss function, along with data-driven terms.
In PINNs, activation functions like tanh are used in the neural network architecture to enable the model to learn complex mappings between inputs (such as spatial and temporal coordinates) and outputs (such as field variables).
**Operator: tanh()**
**Current Implementation in OSS in ATen Backend:**
**SVE Flow:** Uses the SVE Sleef implementation when available, else the std implementation.
**With this PR :**
**SVE Flow:** Uses the SVE ACLE implementation (faster).
**Here are the performance improvements.**
**Single core perf numbers:**

**Metric:** CPU time avg time per iteration (In ms)
With both gcc and clang compilers, we see a significant performance gain with the SVE ACLE implementation over the current OSS implementation (Sleef) and also over Neon.
**Hardware:** m7g.8xlarge (Graviton 3 Instance)
**Script used in benchmarking:**
```python
import os

#os.environ["ATEN_CPU_CAPABILITY"] = "default"
os.environ["ATEN_CPU_CAPABILITY"] = "sve256"

import torch
import torch.nn as nn

# Set the random seed for reproducibility
torch.manual_seed(1)

# Create a tensor of shape (8521, 50)
x = torch.randn(8521, 50)

# Warm-up iterations
for i in range(10):
    output = x.tanh()

# Perform the tanh operation 1000 times and profile the performance
print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(1000):
        output = x.tanh()

# Print the profiling results sorted by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

# Optionally print the final output (if needed, uncomment the following line)
print(output)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143741
Approved by: https://github.com/malfet
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, just as `namedtuple` is for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were treated as namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class as well (see the example below).
Resolves#75982. New tests are included in this PR.
- #75982
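To illustrate the namedtuple-subclass and PyStructSequence distinction with plain Python (no new torch APIs are called here; `time.struct_time` is a standard-library PyStructSequence):
```python
import collections
import time

Point = collections.namedtuple("Point", ["x", "y"])

class LabeledPoint(Point):
    """A namedtuple subclass: previously not treated as a namedtuple by pytree."""
    __slots__ = ()

print(isinstance(LabeledPoint(1, 2), tuple), LabeledPoint._fields)  # True ('x', 'y')

# time.struct_time is a PyStructSequence: tuple-like with named fields, but it
# is NOT created via collections.namedtuple, which is why a separate
# is_structseq check / structseq registration key is useful.
print(isinstance(time.localtime(), tuple), time.struct_time.n_fields)
```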
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
Summary:
dynamo_compile for the most part has been accounting for compile time except autotuning.
all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect.
Add autotuning to dynamo and deprecate the all_compilation_types counter.
Differential Revision: D72145447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
Fix https://github.com/pytorch/pytorch/issues/148639.
Summary:
Optimize the heuristics of parallel reduction: when the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU: the timm model poolformer_m36 (BF16) shows about a 25% performance improvement, and no performance regression is seen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Fixes#143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.
While an operation using `numpy` like
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/malfet
Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor.
Currently, there is still a separate template for the TMA+persistent version of scaled_mm, mirroring the separate TMA+persistent template that the current mm lowering has. We will hopefully consolidate the extra scaled_mm TMA+persistent template once the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
As is the case with many inductor tests, this test adapts test criteria based on device type, where it should be adjusting for the backend registered for that device.
In this particular case, using the upstream triton CPU backend would lead to failures, as reference_in_float would be true as this is required for the C++/OpenMP backend which does not have float16 support. However most triton backends do, and as such should be tested in float16. Similarly a triton backend with a device not described as a GPU would get skipped from testing entirely.
A more generic solution would be ideal, but this would require a lot of work across many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146911
Approved by: https://github.com/masnesral
`tuple[int]` means only a tuple of length 1, which is not what was intended.
```python
loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2)) # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims"
```
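A minimal sketch of the annotation change; `DimOrDims` is the alias named in the type-checker message above, while the exact alias definition here is an assumption:
```python
from typing import Optional, Union

# Before: tuple[int] only accepts 1-tuples, so dim=(-1, -2) is rejected.
DimOrDimsBefore = Optional[Union[int, tuple[int]]]

# After: tuple[int, ...] accepts a tuple of any length.
DimOrDims = Optional[Union[int, tuple[int, ...]]]

def mean_stub(dim: DimOrDims = None) -> None:
    """Stand-in for torch.masked.mean's signature, for type-checking only."""

mean_stub(dim=(-1, -2))  # now accepted by the type checker
```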
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870
Approved by: https://github.com/Skylion007
This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`.
I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces).
There is another place in the function that also deals with escaping spaces. My fix follows the same style. 0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)
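The gist of the fix, sketched with a hypothetical helper (the real change lives in `cpp_builder.py` and may differ in detail):
```python
# Sketch of the idea: wrap an include path containing spaces in double quotes
# on Windows so `cl /I "C:/Program Files/..."` is parsed as a single argument.
def format_include_dir(path: str, is_windows: bool) -> str:
    if is_windows:
        return f'/I "{path}"' if " " in path else f"/I {path}"
    return f"-I{path}"

print(format_include_dir("C:/Program Files/Python311/Include", is_windows=True))
```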
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes#149876
## Stack
- [previous PR in stack] https://github.com/pytorch/pytorch/pull/149247
## TL;DR
This PR implements support in async TP for saving the reduce-scatter result for backward, which previously would break the torchtitan AC policies: no AC, per op SAC, and per layer SAC.
## Context
In torchtitan's Llama3 per op SAC policy, we want to save the output of `reduce_scatter` ops for backward, which is useful for TP. The reduce_scatter op is also saved for no AC (since all activations are saved) and per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP).
However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons:
1) The graph pattern matching specifically only matches on reduce scatter nodes with 1 user, but reduce_scatter nodes saved for backwards will have 2 users (the 2nd one being the return/output node, which saves it for backward).
2) The subgraph replacement logic which replaces the users of the `wait_tensor` after the reduce-scatter with the new fused node has no mechanism to save the fused_node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node).
To fix this, we do 2 things:
1) Add additional pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when reduce-scatter is saved for backward.
2) When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. This enables us to properly erase the subgraph and prevent the memory leak which occurred in #149876
## Other changes
- Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm).
## Test plan
1. All unit tests are passing
2. Visualized the graphs and verified the fusion is occurring properly.
3. Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149946
Approved by: https://github.com/fegin
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.
Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.
Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.
Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.
The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.
Fixes#149449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
Fixes#143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.
While an operation using `numpy` like
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
Co-authored-by: albanD <desmaison.alban@gmail.com>
Summary:
This will log a wait counter for backward compile and fixes weirdness with nested context managers, since the old wait counters added through dynamo_timed never handled nesting correctly. I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`; we want to use the same nomenclature to make it easy to find keys.
Reviewed By: jamesjwu
Differential Revision: D72032055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
When generating reduction kernels, prevent the compiler from unrolling loops so much that the kernel could not be launched for the intended threadgroup size.
Extend `c10::metal::max` to accept different dtypes
Together this fixes `test_large_broadcast_reduction`
TODO:
- Explore different threadgroup_sizes for best perf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150247
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150246
The intent of the existing code is to
> // Assign system TIDs to start events based on the system TID of the next
> // observed event with the same Python TID.
However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device.
This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference.
This issue was discovered while working to implement a custom backend and some CPU start events were appearing on the same process and thread as the device in the trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757
Approved by: https://github.com/sraikund16
Remove the "special linking" that involves listing `BLAS_LIBRARIES` thrice if `TH_BINARY_BUILD` is set, as it should not be any different from listing it just once.
The code seems to date back to commit cfcf2af95f91a88ec61cbcac8b30a718e7332aa5. The original code already listed `BLAS_LIBRARIES` thrice, but it provided no explanation for doing that — and without `TH_BINARY_BUILD`, BLAS was not linked at all. The current version seems to originate in d6a8d28d6529a4f0b80a8c046ca9c36ca6c8b347 — and it already provided an `ELSE` clause listing `BLAS_LIBRARIES` only once. From this, I suspect that it is probably an unnecessary leftover.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145487
Approved by: https://github.com/malfet
#149548 fixed the arbitrarily missing parallelism for NLL, but it also added an arbitrary #ifdef ROCM guard around this fix to prevent its use on CUDA GPUs. There is also a problem with the way the kernel does the reduction from the intermediate shared memory, using only thread 0 walking linearly. This has been changed to a simple parallel reduction algorithm.
Tested changes with `python3 test/test_nn.py`
```
Ran 3551 tests in 200.554s
OK (skipped=998, expected failures=4)
```
Performance before and after, measured with the script below on an RTX 3090 (batch size on the x-axis, time in seconds on the y-axis). This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.
## Before
*(plot omitted)*
## After ifdef removal
*(plot omitted)*
## After Parallel SMEM reduction
*(plot omitted)*
```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches = list(range(32, 4096, 32))
for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))

# Drop the leading warm-up measurement so timing aligns with batches
timing = timing[1:]
plt.plot(batches, timing)
plt.show()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149779
Approved by: https://github.com/jeffdaily
Summary:
The original Diff D69500038 is reverted due to a false alarm on trunk health.
Implement torchbind support in OSSProxyExecutor.
Exactly the same as the implementation in FbProxyExecutor.
D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
D70746626 - Support None output type
Other changes:
- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).
- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to the `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally (more details in the internal Diff summary).
Note on using `filesystem`:
Seems like there'll be [issues](https://github.com/pytorch/pytorch/pull/137209) with using the `filesystem` header on Linux, so here I use string manipulation instead of `filesystem::path`.
Test Plan:
```
test/inductor:torchbind -- -r torchbind_aoti
test/inductor:torchbind -- -r aot_compile
```
Differential Revision: D72063691
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196
Approved by: https://github.com/hl475, https://github.com/desertfire
Summary:
Currently only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`.
In order to allow warp specialization, the kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`.
NOTE: Currently gating changes to FBCODE using HAS_WARP_SPEC which is only available on triton/release-3.3.x
Test Plan:
## Unit test
Added tests for `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.
## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench
make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()
flex_compiled = torch.compile(flex_attention, fullgraph=True)
print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```
triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422
## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685
Differential Revision: D71982587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150122
Approved by: https://github.com/eellison, https://github.com/zou3519, https://github.com/jansel
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor
Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`
`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`
Differential Revision: D72020400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine.
This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.
Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve the conj property when gathering the views, which fixes the `cov` operator
Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
Enable TD on distributed CPU; I think the only reason it's not enabled is that I forgot to enable it.
Get rid of some of the statements that are no-ops:
* asan uses default shard
* nogpu got moved to periodic
* no windows cuda testing anymore
The only thing on pull and trunk that doesn't use TD is dynamo_wrapped, but I think it's fast enough to be OK for now; we can take another look after this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150028
Approved by: https://github.com/ZainRizvi
Sometimes I would find a log artifact that only has usage_logs.txt in it, even though there are other logs created by tests. I think this is somehow caused by output buffering with find. I don't understand how, but at the very least, I can see that all the jobs on this PR have the logs from the test runs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149577
Approved by: https://github.com/ZainRizvi
Seems to reduce docker pull times by ~3 min if triton is requested; some compressed docker sizes seem to have decreased by about a third.
Also add check that triton is installed/not installed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149413
Approved by: https://github.com/malfet
I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on PyPI and our builds are failing with it.
Example:
aa70d62041 (39555975425-box)
I guess we have to go change all the cmake_minimum_required to >=3.5?
backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least the manywheel binary builds got past where they were originally failing
Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet
The cpp contexts are only supported on x86 Linux.
The tests requiring them are skipped on non-Linux but not if the architecture is not x86.
In most places the check is for ARM64, which is not enough; a check for x86 is required instead.
Fix the test decorators and factor out a common one in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445
Approved by: https://github.com/eellison
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows
Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic
Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....
Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.
We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
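A toy sketch of the stable-assignment idea (hashing plus linear probing); the helper and table size are illustrative, not the actual implementation:
```python
import hashlib

def stable_symbol_id(source_name: str, taken: set, table_size: int = 2**16) -> int:
    # Deterministic hash of the source name so symbol ids are stable across runs.
    idx = int(hashlib.sha256(source_name.encode()).hexdigest(), 16) % table_size
    while idx in taken:  # linear probing to resolve collisions
        idx = (idx + 1) % table_size
    taken.add(idx)
    return idx

taken: set = set()
# The same source name maps to the same id across runs, so symbols that were
# over-allocated and later specialized no longer shift the ids of the rest.
print(stable_symbol_id("L['x'].size()[0]", taken))
print(stable_symbol_id("L['y'].size()[0]", taken))
```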
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
This counter is designed to include all compilation pytorch does (triton +
dynamo_compile). However this wasn't including all of dynamo compilation, since
it was put in at the fx_codegen_and_compile spot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
Summary: A couple follow-ups noted in review from https://github.com/pytorch/pytorch/pull/149700:
1. Make sure we correctly signal _all_ subprocesses to shut down, even in the case where some processes are currently benchmarking.
2. Change how the pool singleton is created. That also allows us to fully initialize the object in the ctor and remove a bunch of asserts.
Test Plan: existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149890
Approved by: https://github.com/aorenste
ghstack dependencies: #149700
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm going to argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`
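To make the Popen-over-pipes pattern concrete, here is a self-contained toy sketch; the message framing and the inline echo child are illustrative stand-ins for the real `__autotune_main__` entry point, not the actual implementation:
```python
import pickle
import struct
import subprocess
import sys

# Stand-in child that echoes length-prefixed pickled messages back.
CHILD = r"""
import pickle, struct, sys
while True:
    header = sys.stdin.buffer.read(8)
    if not header:
        break
    (n,) = struct.unpack("<Q", header)
    msg = pickle.loads(sys.stdin.buffer.read(n))
    out = pickle.dumps(("done", msg))
    sys.stdout.buffer.write(struct.pack("<Q", len(out)) + out)
    sys.stdout.buffer.flush()
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def request(obj):
    payload = pickle.dumps(obj)
    proc.stdin.write(struct.pack("<Q", len(payload)) + payload)
    proc.stdin.flush()
    (n,) = struct.unpack("<Q", proc.stdout.read(8))
    return pickle.loads(proc.stdout.read(n))

print(request({"benchmark": "choice_0"}))  # ('done', {'benchmark': 'choice_0'})
proc.kill()   # on timeout: kill immediately rather than attempting graceful shutdown
proc.wait()
```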
List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).
Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
Summary:
Implement torchbind support in OSSProxyExecutor.
Exactly the same as the implementation in FbProxyExecutor.
D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
Other changes:
- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).
- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to the `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r torchbind_aoti
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```
Differential Revision: D69500038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149747
Approved by: https://github.com/desertfire
Fixes:
- https://github.com/pytorch/pytorch/issues/101138
**Description**
The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration.
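A minimal sketch of the check (the helper name and exact message here are illustrative, not the code in `torch.utils.cpp_extension`):
```python
import os
import subprocess

def read_nvcc_version(cuda_home: str) -> str:
    # Verify nvcc exists before shelling out to it, so the failure mode is a
    # clear FileNotFoundError instead of an opaque subprocess error.
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.exists(nvcc):
        raise FileNotFoundError(
            f"nvcc was not found at {nvcc}. Check your CUDA installation and "
            "that CUDA_HOME / PATH point to it."
        )
    return subprocess.check_output([nvcc, "--version"], text=True)
```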
**Testing**
Manually tested with and without `nvcc` present in the expected path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671
Approved by: https://github.com/malfet
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.
Closes#142005.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #147225
Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. This also includes a fix for making a single, contiguous dep for prologues.
For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.
```
import torch
import torch.nn.functional as F

torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100
K, N = 2048, 4096
tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(tensor1, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
During tracing it is possible for a `s1: VR[2, inf]` to be replaced by a `s0: VR[3, inf]` (note smaller range) by the shape env. But after export, unfortunately we'd previously record `range_constraints[s0] = VR[2, inf]` (note larger range), which is incorrect.
This is because we'd map `s1.node.expr` (`s0`) to the `var_to_range` of `s1.node._expr` (`s1`) when creating `range_constraints`. The comment surrounding this code suggests this predated `bound_sympy`, but now we can do better.
For users, this means that when using `Dim.DYNAMIC` previously they wouldn't get input constraints checked sufficiently, now they do (shifting errors early).
Differential Revision: D71962694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150103
Approved by: https://github.com/zhxchen17
Summary: Give an explicit error when a torchbind object is used as input to AOTI
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```
Differential Revision: D69490915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire
Summary:
When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes
```python
c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16)
```
call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types
note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up.
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792
Approved by: https://github.com/eqy
ghstack dependencies: #148791
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.
Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.
Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.
Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.
The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.
Fixes#149449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
ghstack dependencies: #149657
Summary:
To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` is added to opt some enums out.
After the related customer code logic changes, we can now safely remove the annotations that were added earlier.
Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums.
Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```
Reviewed By: ahilger
Differential Revision: D71446522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
Summary:
The _load_state_dict_from_keys method specifies that it `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.`
But this isn't happening right now, because an empty keys arg is passed in as a `set()` to `_load_state_dict`, and keys is expected to be None for everything to actually be included in the state_dict (https://fburl.com/code/l8yzojyx). So with the `set()` argument, the state_dict is always going to be empty.
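A minimal sketch of the intended normalization (names and signature are assumptions based on the description above, not the exact DCP code):
```python
from typing import Optional, Set

def normalize_keys(keys: Optional[Set[str]]) -> Optional[Set[str]]:
    # Treat an empty set the same as "no keys specified" so the entire
    # checkpoint is loaded, matching the documented behavior.
    return set(keys) if keys else None

print(normalize_keys(set()))        # None -> entire checkpoint is loaded
print(normalize_keys({"model.x"}))  # {'model.x'} -> only that key is loaded
```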
Test Plan: ensure existing tests pass
Differential Revision: D71930712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058
Approved by: https://github.com/saumishr
Summary:
Currently only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`. In order to allow warp specialization, the kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`.
Test Plan:
## Unit test
Added tests for `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.
## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench
make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()
flex_compiled = torch.compile(flex_attention, fullgraph=True)
print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```
triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422
## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685
Differential Revision: D70212243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148503
Approved by: https://github.com/eellison
Currently, as is the case with much of inductor, devices are assumed to be one of:
- CPU with Cpp codegen, or
- GPU with triton codegen
This is not always the case, a CPU backend may be using the triton CPU backend, or some other codegen entirely. This goes some way to fixing it in the case where a CPU backend can use triton scheduling.
A more general solution could be implemented, but this would need to be quite robust, and is probably best done more centrally and by someone who can do more testing with CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146830
Approved by: https://github.com/eellison, https://github.com/albanD, https://github.com/guangyey
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj
But TLDR;
`guard_or_true` and `guard_or_false` are better than `guard_size_oblivious` due to:
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime.
- Fewer data dependent errors in some cases: e.g., when doing `guard_size_oblivious(a==1)` where we know `a` is a tensor size, if it's traced with `a=u1-u2` then `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`.
### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data dependent error, it will throw an exception (that could be converted to a graph break in some situations).
**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.
**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization; I do not want to fail just because the optimization could not be performed.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).
**`definitely_true(cond)`**: same as `guard_or_false(cond)` except it does not try to do static eval for unbacked symbols (planning to deprecate it and replace uses with `guard_or_false`, or make it an alias of `guard_or_false`).
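A small usage sketch of `guard_or_false` (the surrounding helper is made up; with a concrete int the call simply returns the plain boolean):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def can_skip_broadcast(size) -> bool:
    # Prefer to guard and learn the answer; if `size` is an unbacked SymInt and
    # the question is data dependent, fall back to False instead of raising.
    return guard_or_false(size == 1)

print(can_skip_broadcast(1))  # concrete int: plain True, no guard needed
```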
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
Summary:
This diff allows the folded constants created by AOTInductor to be allocated and freed through the CUDACachingAllocator instead of through the constant blob from cudaMalloc directly.
Test Plan: `LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
Part of https://github.com/pytorch/torchtitan/issues/866
## Context
- Async TP needs to support the "reshape -> scaled_mm -> reshape" pattern because scaled mm only supports 2D input tensors and 2D scales.
- (a,b,c) => (a*b,c)
- (a\*b,c) @ (c,d) = (a\*b,d)
- (a\*b,d) => (a,b,d)
- Currently the implementation does not support scaled mm with rowwise scales **for all cases** of the reshape -> scaled_mm -> reshape pattern. The minimal example of this pattern is confirmed to work via this [unit test](00a2c68f67/test/distributed/tensor/parallel/test_micro_pipeline_tp.py (L406)), but more involved e2e examples in torchtitan fail silently (more context in final bullet point).
- Previously, the "A tensor" **node** referenced in the async TP graph manipulation code is the 3D+ node before the reshape, but the "A_scale" node is the 2d node from after the reshape, so they are incompatible.
- I previously implemented a simpler solution to this problem in https://github.com/pytorch/pytorch/pull/148001, with a [unit test](https://github.com/pytorch/pytorch/pull/148001/files#diff-115f1d0852382c9b58f22640d80999d879b33618e5f6c633fc9e4d0ca9781cecR406) confirming the fused node is indeed in the graph for the minimal example of the reshape->mm->reshape pattern. I also confirmed via manual e2e testing w/ torchtitan that the crash I was fixing no longer occurred. However, it turns out due to this [bug in torchtitan](https://github.com/pytorch/torchtitan/issues/866) it was causing async TP to fail silently and fall back to vanilla TP, hiding the fact that this original solution fixed the crash but the fusion would not occur for rowwise scales. Thus, more robust solution is needed to support all cases.
## Solution TL;DR
- Use the 2D 'A' tensor and corresponding 2D scales as input to the fused_matmul_reduce_scatter implementation, instead of the 3D+ tensor/scales.
- Track the "pre mm reshape" and "post mm reshape" separately, to be referenced in the `fused_scaled_matmul_reduce_scatter` implementation, to update the scatter dim through the pre-mm reshape, and apply the post-mm reshape before applying the reduce scatter and returning the output tensor.
- Separate the `fused_matmul_reduce_scatter` and the `fused_scaled_matmul_reduce_scatter` code paths, to simplify them both.
- By fixing the bug in torchtitan (PR https://github.com/pytorch/torchtitan/pull/965) and implementing support for rowwise scales in pytorch in this PR, together these changes will solve the problem of how to support rowwise scales with all types of AC.
## Additional details for reviewers
To use the 2D A tensor while also supporting the "reshape -> mm -> reshape" pattern, the following other changes were needed:
- Track the pre-mm reshape, as it will affect the scatter dim used in the fused_matmul_reduce_scatter implementation.
- Track the post-mm reshape, as it will affect the output shape used in the fused_matmul_reduce_scatter implementation.
- Based on the pre-mm reshape and the original scatter dim, calculate the new scatter dim for the 2D tensor. This is needed because during the pipelined producer mm implementation, the scatter dim is moved to dim 0 (so it can be sharded along the first dim and then get chunks to do mm ops on by indexing into the first dim), then moved back to it's original place before the reduce-scatter.
- Use the tracked post-mm reshape to reshape the stacked partial 2D outputs of the mm ops into 3D outputs needed for 1) the reduce-scatter w/ the original scatter dim, and 2) the expected output shape to prevent shape errors with subsequent ops.
## Test plan
- All existing unit tests passing.
- Expand unit tests for rowwise scales to test more scatter dims
- Added unit tests enforcing that async TP fails fast / throws an error if it fails to perform any fusions. Previously it just "failed silently" (fell back to vanilla TP without the user knowing) which has led to confusion, so this will improve the UX.
- Compared loss curves of bf16 vs float8 w/ rowwise scales to confirm integrity of numerics
- Confirmed via manual testing with torchtitan and inspecting the compile graph that the fusion is working as intended for:
- bfloat16
- float8 with tensorwise scales
- float8 with rowwise scales
## Loss curves
Loss curves are virtually identical for bf16 + vanilla TP versus float8 with rowwise scales + async TP:
<img width="1017" alt="loss_async_tp" src="https://github.com/user-attachments/assets/4995db78-7012-490f-a370-f4fecc289a22" />
## Performance
#### Per op SAC
Performance benchmarks for torchtitan Llama3 8b training runs on 4 H100s with per op SAC, using FSDP degree=2, TP degree=2:
- bf16 (vanilla TP): TPS 5161.5, peak memory 50.53 GB
- bf16 (async TP): TPS 5229.5, peak memory 50.68 GB
- float8 tensorwise (vanilla TP): TPS: 5959.5, peak memory: 50.47 GB
- float8 tensorwise (async TP): TPS 5964.5, peak memory 50.47 GB
- float8 rowwise (vanilla TP): TPS: 4962.0, peak memory: 50.55 GB
- float8 rowwise (async TP): TPS 4966.5, peak memory 50.65 GB
#### Full AC
Llama3 70b training runs on 128 H100s with full AC, using FSDP=16, TP=8
- bf16 (vanilla TP): 598 TPS, peak memory 71.51 GB
- bf16 (async TP): TPS 673, peak memory 71.08 (+12.54% TPS vs vanilla TP)
- float8 tensorwise (vanilla TP): 820 TPS, peak memory 55.26 GB
- float8 tensorwise (async TP): 950 TPS, peak memory 55.91 GB (+15.85% TPS vs vanilla TP)
- float8 rowwise (vanilla TP): 540 TPS, peak memory 71.46 GB
- float8 rowwise (async TP): 560 TPS, peak memory 70.65 GB (+3.7% TPS vs vanilla TP but still unexpectedly lower than bf16)
As you can see, float8 rowwise is working but performance needs to be improved further.
## Other changes
- Added logging so the user will know why fusion failed if it does.
- Remove logic which inserted a reshape node targeting "A scale" to get it to be in 3D like the "A tensor" since it's no longer needed.
## Long term plan
- Add a `scaled_matmul` op in pytorch, which will natively support a 3D+ "A tensor" and allow us to simplify the async TP implementation by avoiding the reshape -> scaled_mm -> reshape pattern and the special handling for it.
## Visualizing fused nodes in graphs for torchtitan training runs
Below are examples of the visualized graph generated by torch compile for torchtitan llama3 8b training runs with per op SAC. These graphs provide additional evidence (beyond the new unit tests added) that the implementation is working correctly.
### bf16
<img width="900" alt="bf16-fusion" src="https://github.com/user-attachments/assets/a3bed917-28eb-4a56-8d6e-2d2bf498385c" />
### float8 with tensorwise scales
<img width="900" alt="tensorwise-node" src="https://github.com/user-attachments/assets/b212ec4a-1899-44de-a4de-18c74e1de68a" />
### float8 with rowwise scales
<img width="900" alt="rowwise" src="https://github.com/user-attachments/assets/ed3354a3-894b-4ec9-86d0-f80364bf3d83" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149247
Approved by: https://github.com/kwen2501
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.
To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables (a sketch of such a sweep follows the list):
1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
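A hedged sketch of such a sweep (not the actual benchmark script; sizes, ranges, and the L2-flush buffer are illustrative):
```python
import itertools
import torch

def bench_backward_ms(M, N, dtype, flush_l2):
    x = torch.randn(M, N, device="cuda", dtype=dtype, requires_grad=True)
    ln = torch.nn.LayerNorm(N, device="cuda", dtype=dtype)
    y = ln(x)
    grad = torch.randn_like(y)
    if flush_l2:
        # overwrite a buffer much larger than L2 to evict cached lines
        torch.empty(64 * 1024 * 1024, device="cuda", dtype=torch.float32).zero_()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    y.backward(grad)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

for dtype, k, flush in itertools.product((torch.half, torch.float), range(5, 13), (False, True)):
    for m in (2**k - 1, 2**k, 2**k + 1):
        for n in (2**k - 1, 2**k, 2**k + 1):
            print(dtype, m, n, flush, bench_backward_ms(m, n, dtype, flush))
```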
Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).
In order to visualize results of the kernel when choosing different values of M, N, and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis, and color-coded points where green shows a performance improvement and red shows a regression. For example, `m=32 n=2048 1.42x` in the heatmap would indicate that the normalized shape had 32 elements, the leading dimensions' product was 2048 elements, and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.
Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:
M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.35
Elapsed Cycles cycle 27,526
Memory Throughput % 2.21
DRAM Throughput % 0.54
Duration us 20.42
L1/TEX Cache Throughput % 4.31
L2 Cache Throughput % 2.62
SM Active Cycles cycle 1,475.02
Compute (SM) Throughput % 0.29
----------------------- ----------- ------------
```
M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 1.59
SM Frequency Ghz 1.34
Elapsed Cycles cycle 10,920
Memory Throughput % 5.64
DRAM Throughput % 1.35
Duration us 8.13
L1/TEX Cache Throughput % 1.92
L2 Cache Throughput % 6.89
SM Active Cycles cycle 3,554.41
Compute (SM) Throughput % 0.67
----------------------- ----------- ------------
```
Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />
There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

For dtype=float32, we get a similar chart:
<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />
The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).
The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
I am including the regressions here for completeness' sake:
<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />
To see this better:
1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in
If you want to see the full data, here it is:

I also measured binary size and compile time since those are important for developers:
Binary size comparison

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so
```
The diff in bytes is 302kB which is about a 0.1% increase.
Compile time difference:
```
# Original
real 0m10.931s
user 0m9.676s
sys 0m1.004s
# this PR
real 0m16.720s
user 0m15.514s
sys 0m1.066s
# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
```
So this PR adds about 6 seconds of compile time for this file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605
Approved by: https://github.com/ngimel
Somehow torch._dynamo.is_compiling was changed to torch.compiler.is_compiling(), which also checks whether we're exporting. This was not caught by CI because we don't have an export test for scan.
Changed to torch.compiler.is_dynamo_compiling and added a test.
edit: piggybacking the re-tracing support onto this PR. Related code is in combine_fn_is_normalized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149903
Approved by: https://github.com/zou3519
Summary:
Adds the meta registration logic for torch.compile to work with
`torch._scaled_mm` with mxfp8. Thanks to @eellison for the pointer to make inductor work with this.
Test Plan:
```
pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461
Approved by: https://github.com/drisspg, https://github.com/eellison
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj
But TL;DR:
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` due to :
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime.
- Fewer data-dependent errors in some cases: e.g., when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2`, `guard_size_oblivious(a==1)` will throw a data-dependent error but `guard_or_false` will just return `False`.
### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data-dependent error, it will throw an exception (which could be converted to a graph break in some situations).
**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.
**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization; I do not want to fail just because the optimization could not be performed.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).
**`definitely_true(cond)`**: same as `guard_or_false(cond)`, except it does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false`, or make it an alias of `guard_or_false`).
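To make the distinction concrete, here is a hedged sketch of how these helpers might be used in a decomposition-style optimization (assuming they are importable from `torch.fx.experimental.symbolic_shapes`; exact import paths and call sites may differ):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false, statically_known_true

def maybe_expand_dim0(x, target_len):
    # Best-effort check: never adds a guard; only folds if existing constraints already prove it.
    if statically_known_true(x.shape[0] == target_len):
        return x  # the expand would be a no-op
    # Willing to guard here, but if the size is unbacked and data-dependent,
    # assume "not 1" and take the general path instead of raising.
    if guard_or_false(x.shape[0] == 1):
        return x.expand(target_len, *x.shape[1:])
    return x
```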
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
Allows subclasses of `TritonTemplate` to override the kernel type, e.g.
```
class MyTritonTemplate(TritonTemplate):
    kernel_type = MyTritonTemplateKernel
```
This means that all of the logic in `TritonTemplate` class doesn't need to be duplicated in subclasses if the only required change is the kernel type.
Note that there is precedent for doing this - see `SIMDScheduling` in `torch/_inductor/codegen/simd.py`:
```
class SIMDScheduling(BaseScheduling):
    kernel_type: type[Any] = SIMDKernel  # override in subclass
    ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150018
Approved by: https://github.com/jansel
I am splitting caching of module loading from caching of codegen, since it's trivial and much easier.
Module loading is 50% of the cost and codegen is the other 50% of maybe_append_choice on a full-graph model, which is 40% of total compile time.
<img width="434" alt="Screenshot 2025-03-24 at 4 35 12 PM" src="https://github.com/user-attachments/assets/aa851c6a-bde9-43f8-b12d-e439504ef62c" />
running mm_loop benchmark,
before this change:
67947323682
after this change:
25845073249
2.6X faster.
It seems that the cache was there but then got dropped. I added a benchmark so it won't be dropped again by mistake.
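A minimal sketch of the caching idea, using a hypothetical `load_generated_module` helper (the real Inductor cache is keyed and invalidated differently; names here are illustrative only):
```python
import functools

@functools.lru_cache(maxsize=None)
def load_generated_module(source_code: str, module_name: str) -> dict:
    # Compiling and exec'ing the generated source is the expensive step,
    # so memoize it on the source text itself.
    namespace: dict = {"__name__": module_name}
    exec(compile(source_code, module_name, "exec"), namespace)
    return namespace
```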
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910
Approved by: https://github.com/eellison, https://github.com/aorenste
ghstack dependencies: #149932
Add `None` to type annotations of `torch.onnx.ops.symbolic*` ops and improve tests to test support for optional inputs. Previously it was omitted mistakenly even though the implementation supports it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150038
Approved by: https://github.com/titaiwangms
1. Add config selection for SM89.
2. Only build kernels if compiling for given arch.
3. Factor out CMake code to enforce compiling for needed archs for individual files into a function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149978
Approved by: https://github.com/drisspg
Summary:
X-link: https://github.com/pytorch/executorch/pull/8703
Originally we created a bunch of empty `TARGETS` files to allow us to enable `BUCK` files in fbcode by hiding the existing BUCK file. These files were subsequently merged together using `non_fbcode_target` so these tombstones are no longer necessary.
This diff fixes all files that WOULD have had the useless tombstone merged into them. To create this diff, I just ran the merger script that Codemod Service is using, then deleted the "merged from" and tombstone lines with `sed` and `arc f`, and reverted any lines that didn't make sense.
Test Plan: CI
Differential Revision: D69994481
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147897
Approved by: https://github.com/izaitsevfb
Summary: Triton-MTIA expects the codename of the device as the arch when querying the module map, not the compute capability. This diff gets rid of the following error: `No libdevice is provided for arch (0, 0)`
Test Plan: CI
Reviewed By: Myrthan
Differential Revision: D70072095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149860
Approved by: https://github.com/jansel
This PR supports symbol inputs to graph partition functions. Before this PR, we rely on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs.
In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully perform codegen (see the sketch after this list), including:
- free symbols used in partition nodes.
- free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes.
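A hedged sketch of the second bullet, written against plain sympy rather than the actual Inductor IR accessors:
```python
import sympy

def free_symbols_of(size, stride, offset):
    syms = set()
    for expr in (*size, *stride, offset):
        if isinstance(expr, sympy.Expr):
            syms |= expr.free_symbols
    return syms

# A tensor of shape (s2, s3) with strides (s3, 1) and offset 0 needs {s2, s3} in scope.
s2, s3 = sympy.symbols("s2 s3", positive=True, integer=True)
assert free_symbols_of((s2, s3), (s3, 1), 0) == {s2, s3}
```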
### Note1: MutationLayout
In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, or strides of node.layout; it appears in node.layout.target, so we need extra handling for it.
```python
x = torch.zeros(7, device="cuda")

def fn(n, a):
    a[n] = -1
    return a

opt_fn = torch.compile(fn, fullgraph=True)
for n in range(2, x.shape[0]):
    opt_fn(n, x)
```
### Note2: Composability with Padded Tensor Subclass
W/o graph partition, Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad.
```
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32)
        # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul]
        triton_poi_fused_add_mul_0_xnumel = s2*s3
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0)
        del arg2_1
    return (buf0, s0, s1, s1, )
```
w/ graph partition, the partition function only includes tensor and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743)
```python
def call(self, args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    partition0_args = [arg2_1, s2, s3]
    del arg2_1
    (buf0,) = self.partitions[0](partition0_args)
    del partition0_args
    return (buf0, s0, s1, s1, )
```
The number of cudagraphs is validated below: (also added to test)
```python
import torch
from padded_tensor import PaddedTensor
# Turning off graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6
# at the end, which is wrong.
# torch._inductor.config.graph_partition = False
# Turning on graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4
# at the end, which is correct.
torch._inductor.config.graph_partition = True
def f(x):
    x1 = x + 1
    return x1 * 2

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(shape):
    x = torch.randn(*shape, device="cuda")
    pad_x = PaddedTensor.from_tensor(x, multipliers={0: 4, 1: 4})
    assert hasattr(pad_x, "multipliers"), breakpoint()
    eager_out = f(pad_x)
    for _ in range(3):
        compiled_out = compiled_f(pad_x)
    compiled_out = compiled_f(pad_x)
    assert eager_out.shape == compiled_out.shape
    assert eager_out.tensor.shape == compiled_out.tensor.shape
    assert torch.allclose(eager_out.tensor, compiled_out.tensor)
# static shape. record a NEW cudagraph. 1 cudagraph in total now.
run((2,3))
# outer shape is dynamic, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now
run((3,4))
# outer shape changed but inner shape does not change
# so NO new cudagraph is recorded
run((2,2))
# inner shape is dynamic now, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now
run((5,6))
# does NOT record a new cudagraph
run((7,8))
# record a NEW cudagraph. 4 cudagraphs in total now
run((10,11))
assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458
Approved by: https://github.com/eellison
* When we try to install [libstdcxx-ng 12.3.0 from conda-forge](595293316d/.ci/docker/common/install_conda.sh (L65)), conda 24.7.1 updates the dependencies of that package, including libgcc-ng package to the following: `libgcc-ng-14.2.0 | h69a702a_2 52 KB conda-forge`
* However, conda updated their installer script on Feb 6 2025 to version 25.1.1, which behaves differently from previous versions when installing conda packages.
* conda 25.1.1 does *not* update any dependencies in the above step, and hence the same installation of libgcc-ng from "defaults" channel is present: `libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1`
* Adding the "--update-deps" flags to the conda install command installs a newer libgcc-ng package from the "conda-forge" conda channel: `libgcc-ng-12.3.0 | h77fa898_13 762 KB conda-forge`, which is compatible with the libstdcxx-ng 12.3.0 package
* Compare this [Feb 4 docker build](https://github.com/pytorch/pytorch/actions/runs/13148456164/job/36691412387#step:6:5179) to this [Feb 10 docker build](https://github.com/pytorch/pytorch/actions/runs/13247023578/job/36975931849#step:6:5451), which shows that the latter does *not* update libgcc-ng.
* This creates linking issues when trying to use a library that was built with a newer libgcc_s.so.1 (from the libgcc-ng package) in the PyTorch conda environment. Eg. ONNX-RT:
```
[0;93m2025-02-13 10:18:38.492434704 [W:onnxruntime:Default, migraphx_execution_provider.cc:167 get_flags_from_env]
[MIGraphX EP] MIGraphX ENV Override Variables Set:[m
[1;31m2025-02-13 10:18:38.628064251 [E:onnxruntime:Default, provider_bridge_ort.cc:2028 TryGetProviderInfo_ROCM] /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_rocm.so with error: /opt/conda/envs/py_3.10/bin/../lib/libgcc_s.so.1: version `GCC_12.0.0' not found (required by /opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_rocm.so)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149599
Approved by: https://github.com/malfet
Summary:
User-defined Triton kernels sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user-defined Triton kernels (see the example in the test case,
where we have an early return for random inputs).
Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: This diff adds the ability for HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank.
Test Plan:
ensure existing tests pass
I also ran a full end to end test on my devserver to read/write from my HF repo
Differential Revision: D70096439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189
Approved by: https://github.com/joecummings, https://github.com/saumishr
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that test `fmax`/`fmin` for mixed dtypes.
I've been dreading writing this PR for a while, as it ended up being pretty bulky:
- Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
- Add `c10::metal::cast_to` to `c10/metal/utils.h`, which can be used to cast any scalar Metal dtype to any other one, including complex values
- Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
- Add `binary_strided_cast` and `binary_dense_cast` that are invoked for output dtype and cast both inputs to that output before performing the op
Benchmarks collected on an M2 Pro, running `fmax` on 1M-element tensors (times are in microseconds):
| | dense-dense | transp-transp | dense-transp | transp-dense | dense-scalar | dense-bcast |
|-------------------------|---------------|----------------|----------------|----------------|---------------|--------------- |
| fmax (torch.float16, torch.float16) | 160.9 | 159.9 | 270.5 | 270.9 | 236.6 | 293.0
| fmax (torch.float32, torch.float32) | 176.9 | 171.0 | 273.7 | 293.5 | 242.6 | 294.2
| fmax (torch.float32, torch.float16) | 171.4 | 170.9 | 283.6 | 303.0 | 253.7 | 302.3
| add (torch.float16, torch.float16) | 218.0 | 223.6 | 221.0 | 222.0 | 214.9 | 218.3
| add (torch.float32, torch.float32) | 227.4 | 233.9 | 228.8 | 231.9 | 218.9 | 221.4
| add (torch.float32, torch.float16) | 226.1 | 227.5 | 227.5 | 226.9 | 177.0 | 190.8
TODOS:
- Include input and output dtype in non-cast kernel name
- Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
- Extend mixed-dtypes testing via OpInfo
Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
Summary:
- Add more tests for torchbind in aoti
**FallBackKernel**
- In FallbackKernel.find_device, do not check the device of torchbind obj because they don't have a fixed "device"
- If no device found for CallTorchBindObject, use cpu
- handle None output in `export_extern_kernel_node`
Test Plan:
```
buck run //sigmoid/inference/test:e2e_test_cpu -- -r CustomClassHolderConstantDynamic
```
Differential Revision: D70746626
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149749
Approved by: https://github.com/desertfire
This PR is cleanup only. There are no feature changes or bug fixes.
We create a TunableOp context manager for setup and cleanup, and re-write the TunableOp unit tests in terms of this context manager. This ultimately reduces the amount of copy-paste code.
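A hedged sketch of the kind of helper this describes, built on the public `torch.cuda.tunable` API (the helper actually used in the test suite may differ):
```python
import contextlib
import torch.cuda.tunable as tunable

@contextlib.contextmanager
def tunableop_enabled(filename="tunableop_results.csv"):
    # Enable TunableOp and online tuning for the duration of the block,
    # then flush results and restore the disabled state.
    tunable.set_filename(filename)
    tunable.enable(True)
    tunable.tuning_enable(True)
    try:
        yield
    finally:
        tunable.write_file()
        tunable.enable(False)

# Usage: GEMMs executed inside the block get tuned and their results recorded.
# with tunableop_enabled():
#     torch.mm(a, b)
```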
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930
Approved by: https://github.com/jeffdaily
Fixes#148938
Context:
In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.
But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the kernel that uses the TMA descriptor runs.
This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).
Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg
Summary:
Refine the error message if dlopen failed in AOTInductor.
The original error message was ominous, modified to recommend user to
rebuild AOTInductor if needed, otherwise it's fine.
Test Plan:
None. Error message change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149812
Approved by: https://github.com/chenyang78, https://github.com/jingsh
Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587
Approved by: https://github.com/eqy
PYNVML-related tests in test/test_cuda.py are failing in NVIDIA internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of the TEST_{something} conditions in test/test_cuda.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578
Approved by: https://github.com/janeyx99, https://github.com/eqy
Summary:
We might free the active buffer if we free the buffer twice.
Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
This will improve docker image build times by not having to rebuild magma rocm for unrelated changes. This PR is step 1 of 2. The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902
Approved by: https://github.com/malfet
Fixes#139167
This PR:
* uses `named_buffers` to mark buffers as static
* checks that `named_buffers` is of the expected type (callable, iterator) before trying to iterate over it; if not, we skip this pass (see the sketch below)
These changes fix the previous errors that caused dynamo to crash (as shown in the issue above).
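A hedged sketch of the shape of this check (not the exact dynamo code):
```python
import torch

def mark_static_buffers(mod: torch.nn.Module) -> None:
    named_buffers = getattr(mod, "named_buffers", None)
    # If a user overrode `named_buffers` with something non-callable, or with a
    # method that doesn't yield an iterable, skip the pass instead of crashing.
    if not callable(named_buffers):
        return
    try:
        buffers = list(named_buffers())
    except TypeError:
        return
    for _, buf in buffers:
        if isinstance(buf, torch.Tensor):
            torch._dynamo.mark_static_address(buf)
```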
### Unit Test
```
python test/dynamo/test_buffers_override.py
```
Results in:
```
.
----------------------------------------------------------------------
Ran 2 tests in 5.344s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149882
Approved by: https://github.com/anijain2305
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm going to argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`
List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because
the attribute is a `GetSetDescriptorType` that has special setter logic.
See added test and comments for more explanations.
This patch makes it so that we graph break in those cases, rather than
resulting in silent incorrectness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791
Approved by: https://github.com/mlazos
ghstack dependencies: #149481
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`.
For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`.
Fixes#147799.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146706
Approved by: https://github.com/desertfire
The "PendingUnbackedSymbolNotFound" error is when an unbacked symbol is created within a piece of code, but this symbol never appears in any of the outputs. I believe the original intention is to help catch incorrectly written meta kernels, where users might've unintentionally created an unbacked symbol but never used it anywhere, but in our case this is intentional. An example is the following test case:
```python
def test_pending_unbacked(self):
    class M(torch.nn.Module):
        @mark_compile_region
        def gn(self, x):
            u = x[0].item()
            return x * u

        def forward(self, x):
            for _ in range(4):
                x = self.gn(x)
            return x

    torch._dynamo.config.capture_scalar_outputs = True
    torch.compile(M())(torch.randn(8))
```
This fails with the error:
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {zuf1} not in returned outputs (FakeTensor(..., size=(8,)),) .
```
In this case, creating the unbacked symbol is intentional, so we can bypass this using `fake_mode.shape_env.ignore_fresh_unbacked_symbols()`.
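A standalone hedged sketch of that bypass (in the actual fix the context manager wraps the HOP's fake implementation; the plumbing here is simplified):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv

fake_mode = FakeTensorMode(shape_env=ShapeEnv())
with fake_mode:
    x = torch.randn(8)
    # The .item() call creates a fresh unbacked symbol that intentionally never
    # escapes into any output shape, so suppress the pending-unbacked check.
    with fake_mode.shape_env.ignore_fresh_unbacked_symbols():
        u = x[0].item()
        out = x * u
```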
Differential Revision: [D71298926](https://our.internmc.facebook.com/intern/diff/D71298926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149297
Approved by: https://github.com/zou3519
ghstack dependencies: #149296
These were accidentally deleted in the refactor of DEVTOOLSET +
cxx11abi.
This happened because the `build_environment` variable wasn't aware of the `build_variant` for libtorch and subsequently overwrote the original file twice, leaving the last written as the actual workflow (which in this case was the debug builds).
One thing this has made me curious about is whether we actually need `debug` builds for Windows at all. We don't release them for Linux and I'd probably bet that they have low download numbers anyway, so maybe it makes sense to cut them.
Adds a build_variant parameter to the dataclass so that we can extend
these easily in the future if we want.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149863
Approved by: https://github.com/malfet, https://github.com/atalman
Summary:
Cache save plan metadata to reduce the collective overhead.
Global plan dedupe and metadata creation are the main overheads on rank 0. This change saves all of this cost for subsequent saves if the plans do not change. In a quick experiment with a 256-rank job, global step overhead drops by ~99%, from 90s+ to a mere 1.5s. That 1.5s was mostly spent on creating the checkpoint module directories and a near-empty collective.
Differential Revision: D71631441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785
Approved by: https://github.com/MeetVadakkanchery
Summary: For Scalar variant resolution, we didn't handle a corner case of "Tensor_mode" variant (from aten::div). Adding the missing case to the graph pass.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_operator_aten_tensor_mode_variant_cpp_runtime
Differential Revision: D71638433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149755
Approved by: https://github.com/yushangdi
This adds a `reduce_scatter` implementation for ProcessGroupGloo. This is a pretty naive implementation as it does 1 allreduce per rank, but it may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that is very similar but requires a fixed tensor size per rank.
If users find these functions to be too slow we can address them as issues arise.
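A hedged sketch of the naive strategy (one allreduce per rank's chunk), written against the public `torch.distributed` API rather than the actual Gloo C++ implementation:
```python
import torch
import torch.distributed as dist

def naive_reduce_scatter(output: torch.Tensor, input_list: list, group=None) -> torch.Tensor:
    # Reduce every rank's chunk across the group, but only keep the chunk
    # that belongs to this rank; chunks may have different sizes.
    rank = dist.get_rank(group=group)
    for i, chunk in enumerate(input_list):
        reduced = chunk.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=group)
        if i == rank:
            output.copy_(reduced)
    return output
```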
Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart.
Test plan:
```
pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869
Approved by: https://github.com/fduwjj
Summary:
`--no-as-needed` is not available in ld64.lld
Applying this on all macOS is potentially too broad? I am not sure if `fbcode//mode/mac` uses a different linker, but arvr mode for sure uses ld64.lld.
Test Plan: CI / used for a macOS build on top of the stack.
Differential Revision: D71315125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149421
Approved by: https://github.com/colesbury
Summary: adding operator.truediv and operator.neg support to the runtime
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_sym_float_operators_cpp_runtime_nonstrict
Differential Revision: D71637267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149754
Approved by: https://github.com/pianpwk
### Summary
When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`).
This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`.
This PR also fixes the case of `block_n` being 16.
I had introduced [this bug ](ca9813ea14) after misreading GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32).
### Future work
While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
This adds a new flag, `autograd_cache_allow_custom_autograd_functions` (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`), which allows custom autograd functions into AOTAutogradCache.
@hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this *should* be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default.
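A hedged usage sketch; the env var name comes from this PR, while everything else (the toy autograd function, setting the env var before import) is illustrative:
```python
import os

# Opt in before torch reads its config.
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD"] = "1"

import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad):
        (x,) = ctx.saved_tensors
        return 2 * x * grad

@torch.compile
def f(x):
    return Square.apply(x).sum()

f(torch.randn(4, requires_grad=True)).backward()
```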
Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
This patch updates existing `test_return_..._subclass` tests in
`test/dynamo/test_subclasses.py`, so that they end up invoking the
`__torch_function__` method of the newly constructed tensor subclass
instances.
This exposes a bug in `TensorVariable.method_as_subclass`, where it
forgot to grab the `__func__` out of `__torch_function__`, which led to
an error down the line.
This patch fixes `TensorVariable.method_as_subclass` by centralizing how
we extract and wrap torch function, in `build_torch_function_fn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149481
Approved by: https://github.com/jansel
This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion.
This applies to both reduce and allreduce.
This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead.
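A hedged sketch of the equivalent computation at the Python level (the PR implements this inside ProcessGroupGloo itself):
```python
import torch
import torch.distributed as dist

def allreduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    # AVG as SUM followed by a division by world size; for low-precision dtypes
    # a pre/post divide (as FSDP does) is more numerically robust.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    tensor.div_(dist.get_world_size(group=group))
    return tensor
```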
Test plan:
```
pytest -v test/distributed/test_c10d_gloo.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781
Approved by: https://github.com/fduwjj
This renders non-contiguous operations much faster for larger tensors; for example, `fmax` over 1000x1000 strided tensors takes 270ms with the new algorithm versus 430ms with the old one, which needed an additional tensor of 3e6 elements to function.
TODO: Add 64-bit indexing logic, as the current implementation has the same limitation as `generateKernelDataOffsets`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149730
Approved by: https://github.com/dcci, https://github.com/manuelcandales
In ROCm 6.4 and newer, when building Triton in the Triton-ROCm wheel build flow, newer releases of ROCm no longer have **libamd_comgr.so.2** as the .so file has been updated to **libamd_comgr.so.3** in ROCm 6.4 and newer. We conditionalize on which ROCm the wheel build is for, and choose the .so accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149855
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
Lazos correctly pointed out this doesn't make sense for compile since we graph break in compile. This results in tons of unwanted user log spew. We do want this in export though, since it has drastically reduced the support load for DDEs. This PR does the refactor to keep it in export but remove it from compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149831
Approved by: https://github.com/mlazos
This preserves graph breaks in the case that one graph break directly causes another, e.g. graph breaks in generic context managers.
```python
import torch
class CtxMgr:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with CtxMgr():
        with CtxMgr():
            pass
        with CtxMgr():
            with CtxMgr():
                pass
            torch._dynamo.graph_break()
fn()
```
Output:
```
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/williamwen/pytorch/playground.py", line 23, in <module>
fn()
File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 664, in _fn
raise e.with_traceback(None) from e.__cause__
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
Developer debug context: Active generic context managers: [GenericContextWrappingVariable(CtxMgr), GenericContextWrappingVariable(CtxMgr)]
from user code:
File "/data/users/williamwen/pytorch/playground.py", line 20, in fn
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Note in particular that both graph breaks (torch._dynamo.graph_break and graph break in context manager) are present in the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls. E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops. (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790
Approved by: https://github.com/jansel
Before, we would take the first argument with the largest number of shards, even if another argument had the same number of shards but more dimensions. This would lead to potentially fewer sharding options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
Fixes#149450
This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead.
For 0 arguments, triton technically does some undefined behavior by allocating a 0 byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the signature of the cubin has no parameters, so we can just pass nullptr directly.
We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't seem to affect runtime performance on benchmarks particularly impressively, and alloca has portability issues, so I'd rather just stick with something simpler for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442
Approved by: https://github.com/jansel
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.
We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.
Testing:
```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```
Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values
I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis
So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful
Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile
1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s)
2. Just using Tensor.h (down to 15s)
3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda)
This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s
There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel
Also, in a future PR I'd like to fix some issues I've noticed with load_inline:
1. It needs a force recompilation mode, I was using this quite a bit myself
2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file
3. Instead of relying on pybind, can we use TORCH_LIBRARY instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480
Approved by: https://github.com/malfet
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.
This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:
+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
Summary:
`use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE`
code is mostly the same, some minor changes to support aten IR
Test Plan:
```
scripts/aetk/aetk -L
%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```
will verify the qps after everything done in the stack
Reviewed By: frank-wei
Differential Revision: D68909857
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702
Approved by: https://github.com/frank-wei
Similar to #140425, we are making the implementation usable via header-only code sharing.
Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why, and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.
Testing: existing correctness tests should cover.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch.
"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.
It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.
The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:
```
def fast_function():
    return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```
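A hedged sketch of how the check and the decomposition can be wired into a subclass `__torch_dispatch__` (the actual DTensor code differs):
```python
import torch

def dispatch_with_decompose(op_call, types, args=(), kwargs=None):
    kwargs = kwargs or {}
    if torch._C._dispatch_has_kernel_for_dispatch_key(
        op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    ):
        # Let the op decompose into its constituent ops; those re-enter
        # __torch_dispatch__ and hit the subclass's normal (sharding) handling.
        out = op_call.decompose(*args, **kwargs)
        if out is not NotImplemented:
            return out
    # ... otherwise fall through to the subclass's regular op handling ...
    raise NotImplementedError(f"no handler for {op_call}")
```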
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`.
I also snuck in the removal of an outdated comment and a duplicate line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
Split from #148186
The diff can be re-generated with the following code in the repo root directory on main branch:
```python
import re
from pathlib import Path
def replace(m: re.Match) -> str:
    s = m.group()
    if '\n' not in s:
        return s
    indent = m.group("indent")
    varnames = s.removesuffix("None").replace("=", "").replace("(", "").replace(")", "").split()
    return "\n".join(
        [
            f"{indent}(",
            *(f"{indent} {varname}," for varname in varnames),
            f"{indent}) = (None,) * {len(varnames)}",
        ]
    )

file = Path('test/inductor/s429861_repro.py')
content = file.read_text(encoding='utf-8')
new_content = re.sub(
    r"^(?P<indent> *)\w+ *=(\s*(\(\s*\w+\s*\)|\w+)\s*=\s*)+None$",
    replace,
    content,
    flags=re.MULTILINE,
)
file.write_text(new_content, encoding='utf-8')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148554
Approved by: https://github.com/jansel
Summary: When we call torch.inference_mode, we seem to skip the Autograd key, causing the custom op that export uses to not be decomposed properly before subclass dispatching starts. We fix this by force-desugaring this op at the Python key.
Test Plan: test
Differential Revision: D71599541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698
Approved by: https://github.com/bdhirsh
This is an attempt to fix #119698.
I was unable to reproduce the original described problem on the latest trunk, but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix, I changed a few of the cache writes to be atomic file swaps (write to a temp file, then rename it into place), which should have the same effect without blocking reads.
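For illustration only, the atomic-swap pattern described above looks roughly like this (a minimal sketch, not the actual cache code; the helper name and path are made up):
```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    # Write to a temp file in the destination directory, then atomically
    # rename it over the target so readers never observe a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```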
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654
Approved by: https://github.com/eellison
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly.
- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`
Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.
Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.
Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.
Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind
buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc
```
Differential Revision: D70013257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. The planner item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to:
- first do a lookup into the actual tensor/bytes,
- then calculate the nbytes.
This change introduces a way to capture the non-tensor data size within a write-plan item.
Test Plan: Existing UT.
Differential Revision: D71599725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
In a similar vein to https://github.com/pytorch/pytorch/pull/149517
When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily
Summary:
As title. Follow up of D71181284. and some minor refactoring
Context : D69609685 (update test runner to use new api) / https://github.com/pytorch/pytorch/pull/147105
Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_cpu
```
Differential Revision: D71375725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149387
Approved by: https://github.com/yushangdi
That could be used to safely cast floating values to int by adding an ULP, which is a followup after https://github.com/pytorch/pytorch/pull/146456
Fixes https://github.com/pytorch/pytorch/issues/149591
(Not adding unittest as it's just going to be too slow)
Test plan:
```
% python3 -c "import torch; torch.pinverse(torch.rand(50000, 8193))"
```
Before the change errored out with
```
RuntimeError: false INTERNAL ASSERT FAILED at "pytorch/pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp":1605, please report a bug to PyTorch. linalg.svd: Argument 12 has illegal value. Most certainly there is a bug in the implementation calling the backend library.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149682
Approved by: https://github.com/wdvr
Create draft_export strategy.
The strategy is added before jit and after strict=True, as the third fallback. Since it specializes tensors, it should be no less robust than the JIT trace strategy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529
Approved by: https://github.com/titaiwangms
Redundant exception types in `except (PermissionError, OSError):`. Write `except OSError:`, which catches exactly the same exceptions.
https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991
When hipifying files or writing cprofile files, catching PermissionError alone is not enough when the file is located in a place that is not writable at all, or when other OS errors happen while writing files.
This fix makes the code more robust.
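For reference, a small illustration of why the two spellings are equivalent (PermissionError is a subclass of OSError; the path below is hypothetical):
```python
# PermissionError is a subclass of OSError, so `except OSError:` already
# covers it; listing both in the same except clause is redundant.
assert issubclass(PermissionError, OSError)

try:
    with open("/some/read-only/location/file.txt", "w") as f:  # hypothetical path
        f.write("data")
except OSError as e:  # catches PermissionError, read-only-filesystem errors (errno 30), etc.
    print(f"could not write: {e}")
```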
Example error log:
```log
File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/ops/op_builder/builder.py", line 540, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "torch/utils/cpp_extension.py", line 1597, in load
return _jit_compile(
^^^^^^^^^^^^^
File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
hipify_result = hipify_python.hipify(
^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 919, in repl
preprocess_file_and_save_result(output_directory,
File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/utils/hipify/hipify_python.py", line 123, in open
return open(fn, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
Adds sccache to our manylinux images; these are purposefully built
without the sccache-dist binary since we're not expecting to use that.
Another caveat of these builds is that they are built with the vendored
version of openssl.
This is to set the stage for us to be able to build binaries
sequentially.
Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148419
Approved by: https://github.com/atalman
Summary: This diff ports a technique from torch.fx symbolic trace to trace through Python asserts when we run into data-dependent symbolic shape assertions, so that we can achieve the same effect as torch dynamo and automatically turn asserts into torch._check() calls.
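For context, the effective rewrite is roughly the following (a hand-written sketch of the behavior under standard `torch._check` semantics, not the pass itself; the function names are made up):
```python
import torch


def before(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    n = idx.item()        # data-dependent (unbacked) SymInt during export
    assert n >= 0         # plain Python assert: hard to evaluate on an unbacked value
    return x[:n]


def after(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    n = idx.item()
    torch._check(n >= 0)  # recorded as a runtime assertion instead
    return x[:n]
```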
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int
Differential Revision: D71425360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444
Approved by: https://github.com/tugsbayasgalan
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.
Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).
For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.
Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.
Fixes#149448
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
AOTDispatch, while preparing the AOT backward graph, does not know the real tangents that the user will specify when running backward.
So AOTD guesses the tangents. Before, we guessed that the memory format of the tangents would match the memory format of the corresponding outputs, and if the tangents specified at runtime did not match the memory format we guessed during compilation, AOTD coerced (copied) them to the guessed memory_format.
But as Horace found, there are popular use cases where the outputs of the compiled region are in a specific memory format, e.g. a 4D tensor with dims 1 and 2 transposed.
https://github.com/karpathy/nanoGPT/blob/master/model.py#L57
This PR changes the logic so that AOTD expects the same "strideness" of tangents as outputs. As a result, it avoids the coercion for the case of transposed dims.
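A rough illustration of the case this helps (sketch only, not taken from the PR):
```python
import torch


@torch.compile
def f(x):
    # The compiled region's output is a dense-but-transposed 4D tensor.
    return x.transpose(1, 2) * 2


x = torch.randn(2, 8, 4, 16, requires_grad=True)
out = f(x)
# randn_like preserves the permuted-but-dense strides of `out`, so the
# runtime tangent matches the guessed strideness and needs no coercion copy.
tangent = torch.randn_like(out)
out.backward(tangent)
```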
Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)
Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`; changed them to `torch.randn_like()` so the comparison uses the same strideness (e.g. for cuda float16, strideness affects numerics for fft ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
# Feature
Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`.
As a bonus refactor, block pointers no longer needed the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.
# Perf impact
It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)
Training (all neutral or positive):

Inference (one positive, one very small negative):

As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):
> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).
I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.
> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.
The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.
# Test plan
- Updated expected code on CI tests.
- Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339
Approved by: https://github.com/jansel
Summary:
The `prop_kind` of `mkldnn._linear_pointwise`, `mkldnn._linear_pointwise.binary`, `mkldnn._convolution_pointwise.binary` and `mkldnn._convolution_pointwise_.binary` is always `dnnl_forward`, i.e., `dnnl_forward_training`, regardless of whether `grad` is needed. Setting `prop_kind` to `dnnl_forward_inference` for these ops when `grad` is not needed can give better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147072
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jansel
Summary:
Fix logging error like:
```
in combinable_nodes
log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
msg = self.format(record)
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
record.message = record.getMessage()
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```
encountered when running a prod model with the combo kernel feature enabled.
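Minimal repro of the failure mode and one way to avoid it (a generic sketch, not the exact fix landed here):
```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

filtered = {8}  # stands in for the OrderedSet in the real code

# Buggy: %d needs a real number, so formatting the record fails and the
# logging module prints "--- Logging error ---" with a traceback.
log.debug("ComboKernels: %d template nodes are filtered", filtered)

# Fixed: log the count (or switch the placeholder to %s).
log.debug("ComboKernels: %d template nodes are filtered", len(filtered))
```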
Test Plan: CI
Differential Revision: D71512220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
Implements nanmedian on MPS. This implementation only covers `torch.nanmedian(tensor)`, without `keepdim` or `dim`.
Will implement nanmedian with dim and keepdim in a followup
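Quick usage illustration (requires an MPS-capable build):
```python
import torch

x = torch.tensor([1.0, float("nan"), 2.0, 3.0], device="mps")
print(torch.nanmedian(x))  # tensor(2., device='mps:0') -- the NaN is ignored
# The dim/keepdim overload is not covered by this PR; see the follow-up note above.
```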
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407
Approved by: https://github.com/malfet
Previously, `python benchmarks/transformer/score_mod.py --dynamic --max-autotune` would crash with
```
"/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of
node.get_device().type,
```
but with this change no longer does
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991
Approved by: https://github.com/drisspg
For #149075
* Add a graceful cmake error instead of a cryptic one if the SYCL runtime is not found:
```
The link interface of target "c10_xpu" contains:
torch::xpurt
but the target was not found.
```
* Suppress unclear cmake error if SYCL compiler is not available and further version query fails:
```
CMake Error at /home/dvrogozh/pytorch/torch/share/cmake/Caffe2/FindSYCLToolkit.cmake:37 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
```
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149353
Approved by: https://github.com/guangyey, https://github.com/malfet
When we create constraints, we look at the ordering of kwargs according to the model signature. But when we trace, we use the ordering that is created based on how the user passes in their kwargs. As a result, constraints and dynamic shapes end up having a different order, causing issues when they have different dynamic tensor specs.
Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May 4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. The planner item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to:
- first do a lookup into the actual tensor/bytes,
- then calculate the nbytes.
This change introduces a way to capture the non-tensor data size within a write-plan item.
Reviewed By: daulet-askarov
Differential Revision: D70497442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing it, and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshots use the same backend, we currently don't have the infra in place to split these snapshots.
As a solution, we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record of memory history.
A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.
Test Plan:
See diff below
Differential Revision: D71159720
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
Summary: Remove torch.export.export_for_inference; it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions().
Test Plan: unit tests
Differential Revision: D71069057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
Summary: - For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of the list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r test_custom_obj
```
Reviewed By: angelayi
Differential Revision: D71346024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510
Approved by: https://github.com/zou3519
When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149517
Approved by: https://github.com/atalman
Differential Revision: D70022208
- When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do.
- Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token
- For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of the list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary:
We add an aten pattern to optimize big cat nodes with an arbitrary order of inputs, to support APS jobs.
context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0
Test Plan:
### how to enable
Add the following patterns to the post grad
```
post_grad_fusion_options={
"normalization_aten_pass": {},
"split_cat_aten_pass": {"threshold_to_cat": 10},
},
```
You can tune threshold_to_cat to achieve the best performance. If nothing is given, the default value of 10 will be used.
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```
Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752
Network: Up: 112KiB Down: 132KiB (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a)
Executing actions. Remaining 0/3 1.0s exec time total
Command: test. Finished 2 local
Time elapsed: 4:57.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
baseline
f691990503
proposal
Differential Revision: D71017436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027
Approved by: https://github.com/Yuzhen11
Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding this now allows us to not branch out just for MTIA and generate different code.
Test Plan: CI
Reviewed By: chaos5958
Differential Revision: D70072057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436
Approved by: https://github.com/chaos5958
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.
Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.
This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.
Test Plan: added some tests, ensured existing tests pass
Differential Revision: D71203469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
PR does following
* Turns `inference_mode` to False and uses `no_grad` for `convert_frame` if inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
  ...
} catch (exception& e) {
  // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: dtolnay
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
Summary:
Change %d to %ld in printf format specifier to correctly handle int64_t variables n, m, k.
This fixes compilation errors in HIP builds where the format string didn't match the argument type.
forward fix for D71412006
```
In file included from fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_bfloat16.hip:4:
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:28: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
385 | printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
| ~~
| %ld
386 | n, m, k,TRANSA, TRANSB);
| ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:31: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
385 | printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
| ~~
| %ld
386 | n, m, k,TRANSA, TRANSB);
| ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:25: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
385 | printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
| ~~
| %ld
386 | n, m, k,TRANSA, TRANSB);
| ^
```
Test Plan:
```
buck2 build --flagfile fbcode//mode/opt-amd-gpu fbcode//torchrec/sparse/tests:test_jagged_tensor_gpu
```
Differential Revision: D71418611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149438
Approved by: https://github.com/ZainRizvi
Summary: Main XNNPack target code uses symbols from subgraph, so they need to be exported; this was uncovered on macOS, where the symbols were not visible after linking.
Test Plan: CI / used for a macOS build on top of the stack.
Differential Revision: D71315023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
- Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Not sure how it worked in the past, but the fence should be before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969 which removed unnecessary barrier before calling `threadgroup_reduce` functions
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before that it produced gibberish, now it works fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci
Summary: Right now we get overload names and forward them to the Event List frontend for the profiler, but we do not forward anything to Kineto. This diff checks whether there is an overload name for each cpu op and appends it to the name if necessary.
Test Plan: Added test in CI
Differential Revision: D71326670
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
* Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
In the old exporter we allowed users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export.
This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create onnx nodes symbolically with pt2 & fx. The custom pytorch ops were designed such that the attributes are encoded to be part of a valid fx op. Users provide shape and dtype for the meta function to produce the correct fake tensor during export.
An example is

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
Improves a bunch of readability/grammatical issues in release.md.
Note: This was a claude code experiment, with all changes automatically generated. But it turns out minor edits like this are _not_ a good use of claude code, since it asked for approval on every single changed line. Probably way more efficient to toss this entire thing into a simple LLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149402
Approved by: https://github.com/atalman
Fixes https://github.com/ROCm/hip/issues/3764.
Fixes and improvements to CUDA->HIP flag conversion for CPP extensions
- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Fix for https://github.com/pytorch/pytorch/issues/144431.
Improves perf from 0.29963893827160504 -> 0.0396331632970453.
In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.
This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative reduction handling of this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models having a huge number of FQNs. This was manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:
#### Control job with deepcopy regression:
First save ~24.8s
Global step latency is ~7-8s
#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency ~2s
Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822
Differential Revision: D71245218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.
Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`
Reviewed By: chaos5958
Differential Revision: D70072063
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
This is a trivial rule that for most cases isn't needed, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()` as it is currently assumed), then we need this rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important because IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!
(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves readability with recompilations; earlier, the recompile message would just show the `id`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
Found an issue while running `python torchgen/fuse/gen_patterns.py`.
Exact error:
```shell
Traceback (most recent call last):
File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
joint_graph.lazy_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
result = fn()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
_pad_mm_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
gen_register_replacement(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
pat = _serialize_pattern(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
file_template = get_file_template()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@meta.com>
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.
Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`
Reviewed By: chaos5958
Differential Revision: D70072059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.
Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.
- **For `qlinear_dynamic` (dynamically quantized matmuls):**
This PR yields an **average speedup** (averaged over context_lengths of 2^3 up to 2^9) of ~**50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser


class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                          help="context length - number of input tokens",
                          type=int,
                          default=64
                          )
        self.add_argument("--model",
                          help="model checkpoint - i.e. 'bert-base-uncased'",
                          type=str,
                          default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)


if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()

    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```
- **For `qlinear` (statically quantized matmuls):**
This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time


class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                          help="M dimension",
                          type=int,
                          default=64
                          )
        self.add_argument("--K",
                          help="K dimension",
                          type=int,
                          default=64
                          )
        self.add_argument("--N",
                          help="N dimension",
                          type=int,
                          default=64
                          )
        self.add_argument("--signed_input",
                          help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                          action="store_true"
                          )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
                          )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)


def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x


def quantize_model(model, args):
    qconfig = QConfig(
        activation=HistogramObserver.with_args(reduce_range=False,
                                               dtype=torch.qint8 if args.signed_input else torch.quint8),
        weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model_fp32)
    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized


if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fix on non-rocm:
```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
raise rte
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
result = test(self, **param_kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
return fn(slf, *args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
fn(*args, **kwargs)
File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
torch.backends.cuda.preferred_blas_library('ck')
File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.
To execute this test, run the following from the base repo dir:
python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.346s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
Logs of trymerge don't match up with timestamps, ex
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591
Ex:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```
Either buffering prints or github actions logs are being weird?
Print with flush to see if it helps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:
- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.
Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.
The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].
The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *].
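In other words, the regrouping relies on the identity (restating the algebra above):
```latex
(x - \mu)\, r\, w + b \;=\; x\,(r\,w) + \bigl(b - \mu\, r\, w\bigr),
\qquad \mu = \mathrm{mean},\; r = \mathrm{rstd},\; w = \mathrm{weight},\; b = \mathrm{bias}.
```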
I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about a 2% performance improvement. On A100, no performance gains or regressions were seen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.
This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.
Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66
Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.
The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
This diff does not add the launcher to torch, but introduces a basic test suite.
A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.
Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
We found that in compiled_autograd, when defining a custom op, the custom op would be DCE'd in the backward graph. We added a side-effect condition to the DCE function to prevent eliminating custom ops with side effects in the CA graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
We use dummy tensors in our initial trace, so we should never inline: the subclass dispatch might not support the dummy tensor, e.g. DTensor's accumulate-grad will check that both the param and the grad are DTensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.
**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang promote it correctly, which resolves the compatibility issue with ARMv8 while still working correctly for SVE as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.
Test plan:
- Allocate `a1.4xlarge` on AWS
- Run following script using wheel produced by this PR
```python
import torch
def f(x):
return x.sin() + x.cos()
print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```
Fixes#146792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
And delete some duplicated glue code by relying on the stub.
After this change, `torch.arange(10, device='mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints.
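Illustration (needs an MPS device; values chosen to avoid dividing by zero):
```python
import torch

a = torch.arange(1, 5, device="mps")    # int64
b = torch.arange(1., 5., device="mps")  # float32
print((a // b).dtype)  # torch.float32: the common dtype, rather than an integer dtype
```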
Checked by `test_div2` inductor testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
Summary:
## Context
This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198
In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.
Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok
Currently, `linear` layers using BF16 are dispatched to OpenBLAS, provided that sbgemm_ is available.
However, profiling on AArch64 shows that dispatching to oneDNN results in a significant speedup. This PR updates the dispatch logic to leverage oneDNN for improved performance.
Attaching some benchmark results. Instance: Neoverse-V1, with 16 threads.
<img width="482" alt="Screenshot 2025-02-28 at 17 18 38" src="https://github.com/user-attachments/assets/b84e7455-af6e-417f-920d-bdd2bec2e8f9" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148197
Approved by: https://github.com/malfet
Fixes #ISSUE_NUMBER
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │
│ ed_c10d.py:1215 in get_backend │
│ │
│ 1212 │ if _rank_not_in_group(pg): │
│ 1213 │ │ raise ValueError("Invalid process group specified") │
│ 1214 │ pg_store = _world.pg_map[pg] if pg in _world.pg_map else None │
│ ❱ 1215 │ return Backend(not_none(pg_store)[0]) │
│ 1216 │
│ 1217 │
│ 1218 def _get_process_group_uid(pg: ProcessGroup) -> int: │
│ │
│ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │
│ y:13 in not_none │
│ │
│ 10 │
│ 11 def not_none(obj: Optional[T]) -> T: │
│ 12 │ if obj is None: │
│ ❱ 13 │ │ raise TypeError("Invariant encountered: value was None when it should not be") │
│ 14 │ return obj │
│ 15 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can cause confusion for developers, this PR adds additional detail to help clarify the situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
Fixes#103425
## Changes
- Add a doc note that the size values `must be > 0`
- Add validation for the `in1_features` param
Currently, only `in1_features` causes a runtime error; adding checks for `in2_features` and `out_features` as well might be somewhat BC-breaking.
```python
import torch
from torch import nn
class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)
        # Error, `in1_features=1, in2_features=0, out_features=0` no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)
        return x


if __name__ == '__main__':
    net = lenet()
```
## Test Result
```bash
pytest test/test_nn.py -k test_bilinear -vv
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.
This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.
Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66
Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.
The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
This diff does not add the launcher to torch, but introduces a basic test suite.
A list of TODOs that are not yet complete, will do in separate diff:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation.
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to inductor
- Testing harness to test against torch generated triton kernels
Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, analogous to `namedtuple` for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple classes. Before this PR, only classes directly created by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes as well (see the sketch below).
Resolves #75982. New tests are included in this PR.
- #75982
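A minimal sketch of the subclass behavior described in item 3; `looks_like_namedtuple` below is an illustrative stand-in, not the actual `is_namedtuple` implementation:
```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

class LabeledPoint(Point):
    """A subclass of a namedtuple class."""
    __slots__ = ()

def looks_like_namedtuple(cls) -> bool:
    # structural check: a tuple subclass carrying the namedtuple protocol
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and hasattr(cls, "_fields")
        and hasattr(cls, "_make")
        and hasattr(cls, "_asdict")
    )

assert looks_like_namedtuple(Point)
assert looks_like_namedtuple(LabeledPoint)  # subclasses now count as namedtuples too
```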
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
Summary: This DIFF https://www.internalfb.com/diff/D70471332 removed the "grid" input when calling Triton kernels. The PyTorch execution trace needs to make the corresponding change, which covers both capturing and replaying ET.
Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda
buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay
Differential Revision: D71152464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel
Ubuntu 20.04 is getting deprecated soon, so we might as well proactively
move to the latest LTS, which is 24.04.
> [!NOTE]
> The oldest supported version of python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test we need to have this particular job stick with 20.04 for now until we decide to upgrade it to a newer python version.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing.
Test Plan: CI
Differential Revision: D71144456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
Summary: In TS converter, tensor constants are traced as BUFFER and later we will convert them back to CONSTANT_TENSOR. So we need to prevent naming conflicts during lift constant pass.
Test Plan: CI
Differential Revision: D70826426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803
Approved by: https://github.com/angelayi
This patch improves the performance of softmax for 2D tensors by:
- using a softmax calculation whose shared memory usage stays constant instead of growing with tensor size: tensor data is accessed through global memory, while shared memory is still used for the actual reduction step.
- replacing the division by the sum with a multiplication by 1/sum in the final computation; the 1/sum is computed as the last step of the warp reduction.
- replacing the use of the exp function with the __expf function.
The impact on numerical accuracy is within 1e-5 for half precision and 1e-7 for full precision.
On MI300X, performance improves by between 22% and 50% over current runtimes.
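A toy, CPU-only sketch (plain PyTorch, not the HIP kernel) of the multiply-by-1/sum step together with the kind of accuracy bound quoted above; shapes are arbitrary:
```python
import torch

x = torch.randn(8, 1024, dtype=torch.half)
ref = torch.softmax(x.float(), dim=-1)

# final step: multiply by a precomputed 1/sum rather than dividing by the sum
shifted = (x.float() - x.float().amax(dim=-1, keepdim=True)).exp()
out = shifted * shifted.sum(dim=-1, keepdim=True).reciprocal()

print(torch.allclose(out, ref, atol=1e-5))  # True
```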
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149076
Approved by: https://github.com/jeffdaily
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.
Differential Revision: D70539649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy
Summary:
Pytorch unitest hangs when jitting the Tensor kernel. The problem exists for LLVM version >= 18 due to this upstream change: 45bb45f2ae
`IRBuilderBase::CreateCall` will insert the instruction into the BasicBlock by default. And we don't need to explicitly insert the instruction when compiling the tensor kernel.
Test Plan:
## Test with the release toolchain
```
buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)'
```
## Test with the Buckified toolchain
Apply this D71046097 to select the LLVM libraries.
```
# Build tests
buck build 'mode/dev-asan' //caffe2/test:jit --show-output
```
```
# Run test (Change HASH and paths accordingly)
HASH="b755f1c435832a1e"
ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \
buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant
```
Differential Revision: D71046799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058
Approved by: https://github.com/dcci, https://github.com/Skylion007
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
FIXES https://github.com/pytorch/pytorch/issues/137372
Sometimes the AOT backward is lowered lazily, so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through post-grad passes, specifically the view_to_reshape pass. Running it directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and that is reflected in the gradients we see in CA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
I'm changing the CA initial trace to always trace as dynamic, which fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.
Test Plan: CI
Differential Revision: D71082829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.
## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x, y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")
for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```
w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
y_cpu = y1.cpu() + 1 # 3
```
w/ graph partition, we can see two cudagraphify under the same torch-compiled region:

## Design
PR #147038 splits `def call(args)` function into multiple `def partition_id(args)`. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is, `cudagraphify` takes metadata such as static_input_idxs and we need to provide such metadata for each graph partition. However, we previously only have such metadata for the original graph instead of graph partitions.
The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.
#### Q: How does it work with codecache?
While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.
## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.
## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
Looks like after https://github.com/pytorch/pytorch/pull/148924
We are seeing this error in nightly test:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623
```
File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
from .lowering import fallback_node_due_to_unsupported_type
File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
from . import kernel
File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
from . import mm, mm_common, mm_plus_mm
File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```
Hence removing runtime dependency on packaging since it may not be installed by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98
Summary:
1. Check against the "0" char instead
2. We got the following error when using anything other than the O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]`. So we use the O0 flag in the wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.
Test Plan:
```
buck run 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```
Differential Revision: D70670957
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
The test asserts that `aten.pow` is not present in the generated kernel code. When using a CPU backend other than cpp, the kernel contains comments referencing the aten ops that produced the kernel, in this case `aten.pow`.
This PR skips that test case if the CPU backend is not cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146595
Approved by: https://github.com/williamwen42
### `set_linter` only
* Fix gnarly [bug](dbed747aae/tools/test/set_linter_testdata/python_code.py.txt.python (L42)) which would have garbled Python files involving sets contained in sets.
* Better handling of new Python3.12 token types
### Both linters.
* Recover from and report on unparseable Python files
* Remove `ParseError.check()` (it made it harder to read the code)
* FileLinter is now generic on `PythonFile`
### Notes
As I started working on new docstring features, I found a nasty bug and an edge case bug in set linter, and realized both the linters crash when there is a badly-formed Python file in the repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144620
Approved by: https://github.com/amjames, https://github.com/jansel
The sub-gradient of minimum norm is the least steep descent direction.
```python
import torch
x = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad) # tensor([0., 0., 0., 1., 1.])
y = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.abs(y).sum().backward()
print(y.grad) # tensor([-1., -1., 0., 1., 1.])
```
(How can I request a reviewer? I don't have the button on the right)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148658
Approved by: https://github.com/lezcano
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices
in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to
make sure cases are only enabled for the devices specified by PYTORCH_TESTING_DEVICE_ONLY_FOR,
e.g., only enable GPU cases, not CPU cases, even when HAS_CPU is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/
The argument here is that:
(1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above)
(2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks)
(3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks
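A minimal sketch of the user-level override mentioned in (3): wrapping a region in `torch.utils.checkpoint` requests recomputation explicitly, independent of what the partitioner decides (the tiny function and shapes here are made up for illustration):
```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x, w):
    # the user explicitly asks for this region to be recomputed in the backward pass
    return torch.relu(x @ w)

x = torch.randn(16, 64, requires_grad=True)
w = torch.randn(64, 64, requires_grad=True)
out = checkpoint(block, x, w, use_reentrant=False)
out.sum().backward()
```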
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561
Approved by: https://github.com/zou3519
PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking whether a tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when a tensor is first created and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.
With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures the device is initialized before we pin a tensor.
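The problematic pattern, as a minimal sketch (assumes a build with an accelerator available, since pinning requires one):
```python
import torch

t = torch.empty(1024)   # plain CPU tensor, allocated without pinned memory
p = t.pin_memory()      # pinning happens in a separate call
print(p.is_pinned())    # expected True; before this fix it could report False
```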
Fixes #149032
@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD
Summary:
- Flip the default value of strict argument in torch.export.export from True to False
- Update test infra to cope with the change; some tests assumed strict mode as the default
- Disabled some tests that fail in non-strict mode
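A small illustration of the flipped default with a toy module (the module and inputs are assumptions; only the `strict` handling reflects the change above):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# with this change, export defaults to non-strict tracing;
# pass strict=True explicitly to keep the previous behavior
ep_default = torch.export.export(M(), (torch.randn(3),))
ep_strict = torch.export.export(M(), (torch.randn(3),), strict=True)
```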
Test Plan: Sandcastle
Differential Revision: D70228628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
Fixes #138842
`device` is always the device of the `local_state_dict`, which may or may not be CPU; CPU is not supported by the NCCL backend.
Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Test Plan: Sandcastle
Reviewed By: dtolnay
Differential Revision: D70939306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148996
Approved by: https://github.com/Skylion007
Summary:
Do not fold torchbind objects in constant folding
Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```
Reviewed By: angelayi
Differential Revision: D69946541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential [disconnected] Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions since test_cpp_extensions_aot_ninja isn't an actual file you can edit, but a wrapper for files in test/cpp_extensions.
Idk if this is a good idea, feels very manual. Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out the tests it needs to run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99
Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available due to recent update of Clang-Tidy 19.
The check marks functions and variables used only in the translation unit as static. Therefore undesired symbols are not leaked into other units, more link time optimisations are possible and the resulting binaries may be smaller.
The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by others files, then their declaring headers were included. Still some declarations were wrong and have been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
threadgroup_argmin used to return the input type, which is wrong; it should have returned `int` or `long`.
Change the signatures of both threadgroup_argmin and threadgroup_argmax to return int; since the group size is small, there is no need to carry over large integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004
We correctly handled different Python versions in the explicit ir_nodes test, but
didn't handle them in the dynamo_timed test. Just explicitly deleting the fields
there so the dynamo_timed test passes on all Python versions.
(I noticed it breaking on 3.13).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
Right now we are susceptible to a race condition: if torch.compiler.config has not been implicitly imported via dynamo/builder.py, we throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.
Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```
Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size backed symbols, and opts into size-oblivious semantics for them.
Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols, when we make a unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)
May look towards turning this on globally for export.
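A hedged sketch of how the flag could be used together with export (the toy module, shapes, and Dim name are assumptions; the config attribute is the one named above):
```python
import torch
import torch.fx.experimental._config as fx_config
from torch.export import Dim, export

fx_config.backed_size_oblivious = True  # opt into [0, inf] ranges for backed size symbols

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# with the flag on, runtime batch sizes of 0 or 1 no longer clash with the
# usual [2, inf] assumption attached to named Dims
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim("batch")}})
```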
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with the 2d input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiples of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
Summary:
**Codegen**
- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in the proxy executor, so we do not need to declare torchbind args in the cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.
**Serialization**
Add support for torchbind object in serialization
- For the CallTorchBind HOP, we need to handle it specially because of its schema. The serialized output args are in the format `(obj, method, *args, **kwargs)`.
- it.TorchBindObject inputs are serialized to `as_custom_obj` Argument.
**Packaging**
Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.
The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.
The torchbind objects are stored in data/constants/ folder in pt2 archive.
The format of torchbind objects are `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`. e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.
Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.
This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.
**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```
Differential Revision: D69490718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
This change allows defining python functions in non-python source and having them be able to compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-python source code too.
Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)
(defn f {:decorators [torch/compile]} [x]
(* (F/relu x) x))
(f (-> (torch/randn 100)
(.cuda)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained
Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.
Subplots resolved:
- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
- Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32-bit architectures better?
- Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
- I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
- Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
Implements CK as the backend for memory efficient attention with a couple caveats:
- Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")`
- Does NOT support Nested Tensors
Using the mem_eff path allows us to use attention bias with a CK sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778
Approved by: https://github.com/houseroad
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
----
- Move reduction variable initialization from `loads` to `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations
Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
Fix: https://github.com/pytorch/xla/issues/8755
This PR introduces `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE`
environment variable. Setting this variable makes it so the
functionalization kernels won't run the meta reference, which is used to
propagate expected sizes and strides.
Currently, PyTorch/XLA doesn't actually propagate the correct strides
to its tensors. It was also shown that calling these meta functions may
incur significant overhead.
Running the provided minimal reproducer (see issue), we see a speedup
close to 4.3x:
- Baseline: 0.0747s
- `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s
- `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s
In summary, this PR:
- Creates the `disable_meta_reference()` function, which checks whether
the environment variable is set
- Modifies codegen for functionalization kernels, adding the call to
`disable_meta_reference()` function to the appropriate conditions
- Creates a new bash function for running `lazy/test_ts_opinfo.py` with
the environment variable set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822
Approved by: https://github.com/bdhirsh
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"
So we revert D70075331 as a workaround now.
Test Plan: The model could be lowered and published successfully. e.g. 702869739_16
Differential Revision: D70823254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.
This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with the 2d input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiples of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
Fixes https://github.com/pytorch/pytorch/issues/144095
open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:
(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.
I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?
Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] == y.shape[0] - 1
    ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
Fixes #145874
This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.
Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93
Co-authored-by: bobrenjc93 <bobren@meta.com>
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) to be provided, and clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored because they are interpreted as unrecognized compiler options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copies over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior for existing classes does not change
Test Plan: CI
Differential Revision: D70799942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
Previously, flex decoding errored when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.
This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel.
When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to flex attention kernel so user don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.
Why fall back? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA would require supporting different kv num blocks for different query heads in the same thread block, leading to lots of redundant work. So it is better to use the main flex_attention kernel, where each query head is handled by a separate block.
Fixes #148527, fixes #147267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
Summary: Currently the `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, which is returned at the end.
Test Plan: CI. Existing tests.
Differential Revision: D70649316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
Fixes #148877
---
On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version and it is causing an issue on `pytorch` with the following error:
```
AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'?
```
Last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/)
Currently it is affecting the Windows ARM64 nightly build; however, it might soon also affect Windows x64 builds. (The conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools).)
Locally, both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2).
---
This PR is pinning `setuptools` version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879
Approved by: https://github.com/seemethere
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.
1. release the IpcMutex when deleting the `ExpandableSegments` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there
Test plan:
Run with torchft + torchtitan
```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096
...
[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65%
```
Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
This gives us a decent proxy for how big of a graph we functionally had to parse.
Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
As titled: previously, shard_dim_alltoall used `all_to_all`, which could incur lots of copies if the tensor becomes non-contiguous during splits, and alltoall itself also incurs copies.
This PR uses alltoall_single instead, so that we can minimize tensor copies.
Tested on all the shard-dim-change tests and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
For timeout reasons, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
And without the UTs, we can't ensure Windows inductor quality.
The Intel team will do some local testing for Windows inductor, but we still need a switch to turn on the full Windows inductor UTs.
The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setting this environment variable, we can turn on all Windows inductor UTs. It will not affect PyTorch CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.
The one question I'm not sure of is whether or not we need to explicitly disable Triton template fusion, since it would fuse in these dtypes as uint8.
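A small sketch of the view/bitcast behavior that is supported (assumes a PyTorch build where `torch.float8_e8m0fnu` exists):
```python
import torch

# round trip: e8m0 data can be viewed as uint8 for loads/stores and memory ops,
# even though arithmetic on the dtype can't be code-generated yet
raw = torch.randint(0, 256, (8,), dtype=torch.uint8)
e8m0 = raw.view(torch.float8_e8m0fnu)  # reinterpret the bytes
back = e8m0.view(torch.uint8)          # and back
assert torch.equal(raw, back)
```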
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
Because Clang-tidy 19 has more powerful clang-analyzer checks to detect subtle bugs. New checks such as misc-use-internal-linkage can help identify potential static variables or functions, thus reducing binary sizes.
Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648
Approved by: https://github.com/Skylion007
This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv`
The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml`
`uv` expects consistency:
https://github.com/astral-sh/uv/issues/10693
>Frankly, it's really not ideal that they change their dependencies from wheel to wheel.
>They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷♀
https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792
>I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent.
To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions.
I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs.
(The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021
Approved by: https://github.com/seemethere, https://github.com/atalman
Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves this problem by utilising `<hip/hip_version.h>` from HIP SDK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560
Approved by: https://github.com/jeffdaily
If dynamic shapes are enabled, block analysis may create new precomputed size replacements from the index, which can lead to an assertion failure when the matched index is compared with the original index. For example, the assertion below fails despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested for equality - the latter option is implemented in this PR.
```
torch._inductor.exc.InductorError: AssertionError:
E Invalid match!
E Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```
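A toy sympy illustration of the equivalence involved (symbol names echo the error above; this is not Inductor's actual matching code):
```python
import sympy

ps0, ps2, y = sympy.symbols("ps0 ps2 yindex", integer=True, nonnegative=True)

index = 3 * ps0 * (y // 3) + sympy.Mod(y, 3)  # original index
matched = ps2 * (y // 3) + sympy.Mod(y, 3)    # matched expression using precomputed ps2

print(index == matched)                                         # False: structurally different
print(sympy.simplify(matched.subs(ps2, 3 * ps0) - index) == 0)  # True once ps2 = 3*ps0 is applied
```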
This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.
Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.
# Solution
Just use whichever. Since they are the same.
# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
Fixes the following warning:
```
warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list]
662 | return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839
Approved by: https://github.com/Skylion007
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra tests are needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
The latest LLVM introduced two changes related to `Triple` usage that cause build failures when building PyTorch.
## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of the string: 979c275097
## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
The mm Triton template/configs have not been tuned for XPU; we observe that the epilogue fusion cannot speed things up on XPU because of register spills. So XPU fails on the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip after #146568 is resolved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734
Approved by: https://github.com/jansel
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained
Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.
Subplots resolved:
- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it?
- Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32-bit architectures better?
- Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
- I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
- Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).
Also collapses resume frames together (use config.verbose to see full stack trace - for developers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels now show up on the same line as compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj
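To make the user-facing difference concrete, here is a minimal sketch using the standard torch.distributed API (assuming a NCCL process group has already been initialized):
```python
import torch
import torch.distributed as dist

x = torch.ones(1024, device="cuda")

# async_op=False: the collective is now launched directly on the current stream,
# so no trampoline stream, extra event syncs, or work.wait() are involved.
dist.all_reduce(x, async_op=False)
y = x * 2

# async_op=True: a Work handle is returned; call work.wait() before consuming the
# tensor -- the watchdog-based unstashing described above is only a safety net.
work = dist.all_reduce(x, async_op=True)
work.wait()
z = x + 1
```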
Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.
This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call the internal method `_abort` on the PGNCCL object directly, but with this change we can now just call `.abort()` and have it work for any PG implementation.
Test plan:
```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```
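For illustration, a minimal sketch of the intended call pattern (method names are taken from this PR; the default implementation is a no-op for backends that don't override it):
```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

pg = dist.group.WORLD
pg.shutdown()   # graceful teardown; no-op for backends without a real implementation
# pg.abort()    # tear down without waiting for pending work (e.g. from torchft)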
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.
It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.
For the memory tracking, we now enable tracking of the DTensor dispatcher for custom dispatch functions like `entropy_loss`.
The dispatcher is only enabled for the memory tracking part and is disabled as soon as it is done.
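A rough sketch of the setup this enables (the import path for the fake process group is an assumption based on the existing test utilities):
```python
import torch
import torch.distributed as dist
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.testing._internal.distributed.fake_pg import FakeStore  # assumed location

# A fake process group: collectives are dispatched but no communication happens.
dist.init_process_group("fake", rank=0, world_size=8, store=FakeStore())

with FakeTensorMode():
    t = torch.randn(1024, 1024)
    # A non-functional collective now runs under fake tensors, so memory/runtime
    # estimation no longer needs to patch it out.
    dist.all_reduce(t)
```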
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
Fix
```
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
91 | static_assert(sizeof(_Tp)>0,
| ^~~~~~~~~~~
/usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
399 | get_deleter()(std::move(__ptr));
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
200 | AliasDb::~AliasDb() = default;
| ^
../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
298 | struct WriteRegistry;
| ^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
Fixes#134106. This PR moves the `upcasted_result` down-casting after all computation is done.
Since the multiplication with the `weight_opt` input is not done in half precision, the current code path does the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid intermediate down-casting, so this PR proposes fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncation.
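A small standalone illustration (not the kernel code itself) of why the extra round trip loses precision:
```python
import torch

x = torch.randn(4, 1024, dtype=torch.float16)
w = torch.randn(1024, dtype=torch.float16)

# Upcasted intermediate, e.g. the normalized activations kept in fp32.
normalized = torch.nn.functional.layer_norm(x.float(), (1024,))

# Old path: fp16 -> fp32 -> fp16 -> fp32 -> fp16 (truncate before the weight multiply).
old = (normalized.half().float() * w.float()).half()

# New path: fp16 -> fp32 -> fp16 (downcast only once, after all computation).
new = (normalized * w.float()).half()

print((old - new).abs().max())  # typically nonzero: error from the extra truncation
```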
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.
When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```
@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.
**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
* This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
* `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
* Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.
I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
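A rough sketch of the intended Python-side usage (the flag name comes from this PR; the exact loader signature and package path are illustrative):
```python
import torch

# Load a previously exported + AOT-compiled package with the single-threaded runner.
compiled = torch._inductor.package.load_package("model.pt2", run_single_threaded=True)

inp = torch.randn(8, 64, device="cuda")
compiled(inp)  # warm up outside of capture

# With the synchronization logic removed from the runner, manual CUDA graph capture works.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = compiled(inp)
g.replay()
```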
**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
* There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
Use torch.export to get dynamic shapes for a JIT-converted graph. I just realized we can retrace a converted JIT graph with `torch.export` and produce dynamic shapes that way.
- **Prior:** The exporter would produce a **static graph silently** even when dynamic_shapes were provided.
- **Proposed:** When `dynamic_shapes` is provided and the strategy is able to handle it, it will succeed.
## Why are we still keeping the JIT strategy?
It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
See https://github.com/pytorch/pytorch/issues/148764.
Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.
As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.
This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.
Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:
* Optimized these non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
+ However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`
Which accidentally fixes undefined symbol references errors namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
Which happens because `libmagma.a`, which was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135), contains symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.
Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
.file ""
.text
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4
.globl _Z3fooi
.type _Z3fooi, @function
_Z3fooi:
.LFB0:
.cfi_startproc
endbr64
movslq %edi, %rdi
subq $8, %rsp
.cfi_def_cfa_offset 16
movabsq $2305843009213693950, %rax
cmpq %rax, %rdi
ja .L2
salq $2, %rdi
addq $8, %rsp
.cfi_def_cfa_offset 8
jmp _Znam@PLT
.cfi_endproc
.section .text.unlikely
.cfi_startproc
.type _Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
.cfi_def_cfa_offset 16
call __cxa_throw_bad_array_new_length@PLT
.cfi_endproc
```
Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
which suppresses the following warning (when building against ancient LLVM-9):
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
1581 | return Insert(new LoadInst(Ty, Ptr), Name);
| ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```
Probably a reasonable follow-up would be to disable NNC testing altogether, as the project has been in maintenance mode for a while now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
Use a simple try/catch to handle ONNX Runtime errors in the verification interpreter when they happen. One example is that ORT will sometimes produce a list of None values for some nodes; I am not sure how that happens yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`
Add regression tests
Please note, that for some undefined values MPS and CPU behaviors are different, for example
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127, 63, 31, 15, 7, 3, 1],
device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127, 63, 31, 15, 7, 3, 1, 0, 0, 0, 0],
dtype=torch.uint8)
```
Because on CPU the scalar is cast to the output dtype before the operation is performed, but on MPS this happens after the op is done.
Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).
Added a test which runs the test_torchinductor tests with subprocess compiling turned on.
Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
In `fresh_inductor_cache`, removing pyd files will raise a permission error on Windows because they are still in use by the process. So we clear the references to the loaded pyd library objects and unload them from the process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.
However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings.
Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points); I also add the rank to the filename and store the plot in a user-defined directory.
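A minimal sketch of the recursive-bisection idea (assuming the memory-vs-budget curve is a monotone step function; `measure` stands in for the actual estimator):
```python
def find_thresholds(measure, lo=0.0, hi=1.0, eps=1e-3):
    """Return the budget values where measure() changes, up to resolution eps."""
    thresholds = []

    def search(lo, hi, f_lo, f_hi):
        if f_lo == f_hi:
            return  # no change in this interval (monotone step-function assumption)
        if hi - lo <= eps:
            thresholds.append(hi)  # a threshold lies within [lo, hi]
            return
        mid = (lo + hi) / 2
        f_mid = measure(mid)
        search(lo, mid, f_lo, f_mid)
        search(mid, hi, f_mid, f_hi)

    search(lo, hi, measure(lo), measure(hi))
    return sorted(thresholds)

# e.g. find_thresholds(lambda b: round(b * 5) / 5) locates the jump points of that step function
```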
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
This PR supports a logging feature that is being requested.
```
PYTORCH_TUNABLEOP_BLAS_LOG=1
```
Enables the logging of BLAS parameters with either offline or online (in-situ) tuning.
The BLAS parameters are written to the CSV file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034
Approved by: https://github.com/jeffdaily
Summary:
Previously the dynamo counters did not print the count information automatically.
Explicitly added a log message after lowering to print overview info for Inductor aten mms.
It will look like the following, where the name is in `{aten_op_name}_{m}_{n}_{k}` form:
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```
Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```
Differential Revision: D70739912
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with HPU backend.
Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:
RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu
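For illustration (assuming an HPU build/plugin is available), the user-visible effect is simply that this no longer raises:
```python
import torch

params = [torch.nn.Parameter(torch.randn(16, device="hpu"))]
opt = torch.optim.AdamW(params, lr=1e-3, fused=True)  # previously rejected for hpu tensors
```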
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.
There are few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM
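For reference, a hedged sketch of the two scaling modes being tuned here, using `torch._scaled_mm` (shapes and dtypes are illustrative; TunableOp itself is enabled separately, e.g. via `PYTORCH_TUNABLEOP_ENABLED=1`):
```python
import torch

M, K, N = 64, 128, 32
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major B

# Tensorwise: one scale per operand.
out_tw = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0, device="cuda"),
    scale_b=torch.tensor(1.0, device="cuda"),
    out_dtype=torch.bfloat16,
)

# Rowwise: one scale per row of A and per column of B.
out_rw = torch._scaled_mm(
    a, b,
    scale_a=torch.ones(M, 1, device="cuda"),
    scale_b=torch.ones(1, N, device="cuda"),
    out_dtype=torch.bfloat16,
)
```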
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
The Intel GPU user-mode driver may generate kernel.errors.txt files in the current working directory in certain scenarios. These files include diagnostic information but do not necessarily indicate an issue with the application. This is a known issue and will be fixed in a newer version of the driver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #148534
Fixes #148208. There are solutions for exposing symbols implicitly used by inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL).
Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop inline keyword and move implementation to .cc file.
Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.
Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
# Motivation&Details
This PR fixes a bug that blocked quantized group convolution before. The bug is caused by the fact that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT.
# UT
` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`
# Runtime exemplification
```
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785
```
The verbose output shows that we successfully hit the quantized convolution, where the weight is in `abcde` format (group conv).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
# Motivation
During the `qlinear_pointwise_binary` lowering pass, dim collapsing only occurs when the post-op is `add`. It is the responsibility of the C++ kernels to handle the dimensions for the post-op `sum`.
# Details
This PR explicitly reshapes the input from 3D to 2D in the op `qlinear_pointwise_binary`. Besides, we refactor the implementation of `qlinear_pointwise_binary.tensor` to call `qlinear_pointwise_binary` to remove duplicated code.
# UT testing
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423
Approved by: https://github.com/EikanWang, https://github.com/jansel
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.
I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: dtolnay
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
A recent PR, #143049, attempted to increase tolerances to make the test passable. However, we are still seeing errors like:
```
Traceback (most recent call last):
File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!
Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute the amax of each row
- compute (x - amax).exp().sum() for each row
When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.
Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).
Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
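For reference, a scalar sketch of the one-pass max+sum recurrence from that paper (the generated Triton kernel applies the same idea block-wise):
```python
import math

def online_max_and_sum(row):
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    for x in row:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d  # softmax(x_i) = exp(x_i - m) / d
```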
## Microbenchmark
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
- eager_ms=6.671296119689941
- opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
- eager_ms=6.634047985076904
- opt_ms=6.230591773986816
Ideally, online softmax should save about 2 ms here. We save about 1.84 ms in practice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` is for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were treated as namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return True for subclasses of namedtuple classes.
Resolves #75982. New tests are included in this PR.
- #75982
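For illustration, the two kinds of types this change deals with (the new helpers themselves are not imported here, since their exact public location isn't shown):
```python
import collections
import time

# A PyStructSequence: a C-defined tuple-with-fields type, e.g. time.struct_time
# (the torch.return_types.* result types follow the same pattern).
st = time.localtime()
print(type(st), st.tm_year)

# A namedtuple subclass: previously not treated as a namedtuple by the pytree
# utilities; after this change it is.
Point = collections.namedtuple("Point", ["x", "y"])

class Point3(Point):
    pass

print(isinstance(Point3(1, 2), tuple), Point3._fields)  # True ('x', 'y')
```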
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64; this PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this, as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
This is a forward fix for #135338.
It hits error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```
When users call `init_process_group()` with no arguments, the default backend is not set, or is set to `undefined`, hence the error above, triggered by the `_has_hooks()` call.
The fix wraps `getDefaultBackend` with a try-catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
Summary:
as title.
When you enable `TORCH_LOGS="+inductor"`, you get logs at the end such as:
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), **(('aten_addmm', (16, 6, 16)), 1)**, ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []
Test Plan: follow up to add proper logging test.
Differential Revision: D70665104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
Prior to this PR, if the export passes ran run_decomposition(), the report still showed the exported_program from before decomposition, which made it harder for users to check the exported program that is actually used to translate to the ONNX graph.
The following example is what we see before this PR:
```
# PyTorch ONNX Conversion Report
```
✅ Obtain model graph with `torch.export.export(..., strict=False)`
⚪ Obtain model graph with `torch.export.export(..., strict=True)`
⚪ Obtain model graph with `torch.jit.trace`
✅ Decompose operators for ONNX compatibility
❌ Translate the graph into ONNX
⚪ Run `onnx.checker` on the ONNX model
⚪ Execute the model with ONNX Runtime
⚪ Validate model output accuracy
```
## Error messages
```pytb
Traceback (most recent call last):
File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
_handle_call_function_node_with_lowering(
File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
raise _errors.DispatchError(
torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
onnx_program = _exported_program_to_onnx_program(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
values = _translate_fx_graph(
^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
raise _errors.ConversionError(
torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.
```
## Exported program
```python
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[3, 4]"):
# File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64); x = None
# File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807); to = None
slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2); slice_1 = None
return (slice_2,)
Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}
```
## Analysis
PyTorch ONNX Conversion Analysis
## Model Information
The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```
Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`
Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`
The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1
Of the call_function nodes, the counts of operators used are:
- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1
## ONNX Conversion Information
The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.
Errors grouped by operator:
- `aten.to.dtype`: No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`: No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`
## Decomposition comparison
Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`
Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
Previously, the comparison of complex numbers was not supported when `verify=True`.
NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience caused. Below are the details of this PR.
Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce:
```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])

origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)     # tensor([2.])
print(optimized_out)  # tensor([])
```
According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me.
Following the CPU implementation of this API (e97b97af56/aten/src/ATen/native/Blas.cpp, L62),
I add an additional branch to handle the empty matrix case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.
Test Plan: CI
Reviewed By: mortzur
Differential Revision: D70072064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
Summary:
# Why
enable testing and users to specify a set of kBatches to try rather than relying on our hand written heuristic
# What
add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime
Test Plan: n/a
Reviewed By: chenyang78
Differential Revision: D70226055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true.
In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool.
Fixes#139972
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Sometimes the `eager_then_compile` stance isn't enough: some models are so close to the memory limit that going to eager will OOM because we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile`, which avoids the expensive Inductor compile but still runs aot_eager to get the memory-reduction benefits on the first invocation.
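A hedged usage sketch, assuming the new stance is selected the same way as the existing `eager_then_compile` stance:
```python
import torch

torch.compiler.set_stance("aot_eager_then_compile")

@torch.compile
def f(x):
    return torch.nn.functional.gelu(x @ x)

x = torch.randn(128, 128)
f(x)  # first invocation: aot_eager backend (keeps activation-checkpointing memory savings)
f(x)  # subsequent invocations: full Inductor compile
```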
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509
Approved by: https://github.com/williamwen42
Previously, the strategy used for obtaining the exported program was not asserted. This led to silent errors if torch.export broke something and a fallback strategy was used. This change adds a _capture_strategy field to ONNXProgram and enables unit tests to assert the strategy used, to prevent fallbacks from happening.
Fixes#147674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
Summary:
title - Add new hf storage class to torch.distributed package so that it can be imported by customers.
The HF storage reader/writer was added as DCP storage components so that DCP load and save can directly interact with hugging face format and storage.
Test Plan: ensure signals pass
Differential Revision: D70495399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
As title, this enables a `nonstrict_trace`-ed function to take in an object whose type has been `pytree.register_constant`-ed, as long as the object was created outside the `torch.compile` region. This also forces Dynamo to emit an `EQUALS_MATCH` guard on the object.
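A rough sketch of what this enables (the import locations are assumptions; the exact registration API may differ slightly):
```python
import torch
import torch.utils._pytree as pytree
from torch._dynamo import nonstrict_trace  # assumed import path

class Config:
    def __init__(self, scale):
        self.scale = scale

pytree.register_constant(Config)  # treat Config instances as compile-time constants

cfg = Config(2.0)  # created outside the torch.compile region

@nonstrict_trace
def scale_it(x, cfg):
    return x * cfg.scale

@torch.compile(fullgraph=True)
def f(x):
    return scale_it(x, cfg)  # guarded with EQUALS_MATCH on cfg

f(torch.randn(4))
```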
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007
Approved by: https://github.com/zou3519
ghstack dependencies: #148385
Use onnxscript apis for 2.7.
Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with unit test. This should allow Context Parallel and Tensor Parallel to use cudnn SDPA.
### Test
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.
As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
gfx1200 causes the CK-based GEMM to fail to compile because CK is choosing an incorrect FP8 interpretation. CK assumes FP8 interpretation is static and chosen prior to compilation. This PR is a work-around that makes the selection dynamic during hipclang compilation passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496
Approved by: https://github.com/jeffdaily
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[ 0.0000, 1.4142, 2.0000, 2.4495],
[ 80.0000, 82.0000, 84.0000, 86.0000],
[ 96.0000, 98.0000, 100.0000, 102.0000],
[112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
[4.0000, 4.2426, 4.4721, 4.6904],
[5.6569, 5.8310, 6.0000, 6.1644],
[6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
One cannot avoid copies just because both input and output tensors have the same strides; one also needs to make sure that they are dense in storage (a transposed tensor would be dense, but selecting, say, every other column wouldn't be).
Add regression test to prevent those from happening again
Also, no need to check that sizes match, luckily it is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)
Revived needs_copy_logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 is landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.
<details>
<summary>The repro code</summary>
```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
        print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>
Fails with an error stating that the work handle is None:
```
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/repro.py", line 54, in <module>
[rank1]: main()
[rank1]: File "/workspace/repro.py", line 51, in main
[rank1]: run(local_rank)
[rank1]: File "/workspace/repro.py", line 38, in run
[rank1]: y = test_func(x, rank)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/repro.py", line 22, in test_func
[rank1]: a.wait()
[rank1]: ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
Due to the introduction of new CUDA versions, the branching has become more complicated. This PR simplifies the branching in `test_cusparselt_backend` in order to avoid checking each and every CUDA version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318
Approved by: https://github.com/jcaip
This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label.
* inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300.
* rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)) on many PRs.
* inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904
Approved by: https://github.com/huydhn
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for logging purposes.
This caused a regression that we thought was due to calling id() on the symnode.
I dug deeper and found that adding that argument, although it does not affect the results of evaluate_expr, messes up the cache lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup; I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
Summary:
As title. It would be beneficial for judging e2e perf improvements.
An easy first step is to dump mm info at the lowering stage, e.g.:
```
fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1])
```
Next step:
Dump overview info at `post_grad_graph` stage such as
overall count of `aten.mm` in the graph & visualize to a table structure.
Test Plan: by looking very hard in aot inductor bmm and mm UTs.
Differential Revision: D70507880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363
Approved by: https://github.com/henrylhtsang
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
[call] op=[aten::add.Tensor], key=[AutogradCPU]
[redispatch] op=[aten::add.Tensor], key=[Undefined]
[call] op=[aten::empty.memory_format], key=[BackendSelect]
[redispatch] op=[aten::empty.memory_format], key=[CPU]
[call] op=[aten::add.out], key=[CPU]
```
In this case, aten::add.out is a child of aten::add.Tensor; however, the current profiler trace provides no way to differentiate these aten op calls.
See the added unit test for a more detailed example.
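A minimal way to see the recorded events (the overload suffix in the event names is what this change adds):
```python
import torch
from torch.profiler import profile

with profile() as prof:
    torch.add(torch.ones(4), torch.ones(4))

# With this change the recorded aten events also carry their overload name
# (e.g. add.Tensor vs add.out) rather than just the base op name.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```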
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
### Important
- Previous PR in stack https://github.com/pytorch/pytorch/pull/148274
- Despite the changes between sm90 vs sm100 being fairly minimal, I created a separate kernel since we'll be making various arch specific perf optimizations to the sm100 kernel next.
- This kernel has not been optimized yet. However, initial perf testing shows numbers which indicate the tensor cores are being utilized as expected (not just CUDA cores).
### Summary of changes
- This PR adds a new cutlass kernel for rowwise GEMM on sm100.
- sm100 kernel is based on sm90 kernel, with the following changes:
- Use new arch tag `cutlass::arch::Sm100`
- Do not use [large tile](4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors)
- SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/
### Next steps
- Arch specific performance optimization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421
Approved by: https://github.com/drisspg
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
I was a bit concerned when I saw in #148272 that the Metal unary kernel was 0.02x the performance of what we had with MPS Graphs for sqrt (for non-contiguous tensors). This change makes it so that copying is only done if the input/output tensors are not same-strided. So if an out tensor is not provided, then we don't do a copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script that I listed at the end of the above PR has the same execution time as the non-transposed one.
Times for reference (on a transposed tensor, where the matrix is an NxN matrix):
| N | time_old | time_new |
|-------|--------------------|--------------------|
| 100 | 0.0002241021 | 0.0001548659 |
| 1000 | 0.0005934822 | 0.0002150342 |
| 10000 | 0.3242016407 | 0.0045755033 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
Restrict the scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make the complex implementation a template for `is_complex_v` types.
This makes the eager kernel implementation for both real and complex types a trivial call to the template.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471
Approved by: https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448, #148449
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`
Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).
For de/serialization, we don't store unbacked bindings and just rerun the pass.
Involved a refactor of compute_unbacked_bindings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
Low-hanging fruit: all ops needed for these are already implemented, so just adding them to native functions enables the functionality on MPS. The next op I should probably add is lu_solve, seeing how many ops need it for the grad calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
as titled, this PR moves the same mesh check from the sharding propagation level to each individual operator level.
This is to allow more flexibility for each individual operator to check whether it can be run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run the `for_each` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there could be DTensors that live on different meshes, as long as the meshes are the same in a "zipped" way.
This should also fix https://github.com/pytorch/pytorch/issues/134212
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:
In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration.
### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.
Test Plan: Added E2E UTs for process based async save.
Differential Revision: D69272583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
1. My company uses privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function (a usage sketch follows this list). However, `batch_isend_irecv` is currently only enabled for CUDA, so I added a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
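A minimal sketch of the `batch_isend_irecv` usage this change gates on `supports_coalescing`; it assumes the process group has already been initialized (e.g. with the out-of-tree backend):
```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called for the custom backend
rank = dist.get_rank()
peer = (rank + 1) % dist.get_world_size()

send_t = torch.ones(4)
recv_t = torch.empty(4)
ops = [
    dist.P2POp(dist.isend, send_t, peer),
    dist.P2POp(dist.irecv, recv_t, peer),
]
# coalesces the point-to-point ops; previously this path was only allowed for CUDA
for req in dist.batch_isend_irecv(ops):
    req.wait()
```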
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
Modified TorchInductor’s autotuning flow so that each `best_config` JSON file also includes the Triton “base32” (or base64) cache key.
**Motivation**
Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.
Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`, since they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019
Approved by: https://github.com/davidberard98
The distributed tests are executed once for each backend and for each init method.
`$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files.
The same needs to be done for the init method.
Move the setting of the variable into `test_distributed` and incorporate the init method into the name.
Useful for e.g. https://github.com/pytorch/pytorch/issues/126523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325
Approved by: https://github.com/clee2000
Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic).
Another benefit of this is most users no longer need to annotate their inputs with mark_dynamic and mark_unbaked calls since we can derive the dynamism on the very first call. Additionally we get dynamic ints out of the box in this new regime.
This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance.
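A rough usage sketch, assuming a build where the `"eager_then_compile"` stance is available via `torch.compiler.set_stance`:
```python
import torch

# opt in to the new stance; the first call runs eagerly and records example inputs
torch.compiler.set_stance("eager_then_compile")

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))   # runs in eager, example inputs are recorded
f(torch.randn(8))   # compiles, with dynamism derived from the observed inputs
```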
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983
Approved by: https://github.com/williamwen42
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.
**NOTE**: performance optimization will be done in separate follow up PRs
## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA tookit 12.8
- e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
- e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below
**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.
## Test plan
- `python test/distributed/tensor/test_matrix_ops.py -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.
This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.
To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: dtolnay
Differential Revision: D70472663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393
Approved by: https://github.com/Skylion007
This commit just aligns the description of the `py_limited_api` feature in SyclExtension with CPP/CUDA. We missed this change when doing SyclExtension due to parallel work on the changes. For CPP/CUDA the change was done in 515e55e6927ad5f57ec222d7779712630341acf3.
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988
Approved by: https://github.com/janeyx99, https://github.com/guangyey
Motivation
===
This PR is part of the plan of OneDNN Upstreaming, as #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203) stated. The support of SDPA is via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for the OneDNN graph, including those for kernel/compile graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.
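A minimal usage sketch, assuming a PyTorch build with XPU support and an Intel GPU available:
```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention on an "xpu" device dispatches to the overridable
# XPU backend, which this PR backs with a OneDNN graph implementation
q = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)
```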
Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
Summary:
Generate AOTI size and stride input check by default. But the checks are only run if `AOT_INDUCTOR_DEBUG_COMPILE` env variable is set (to avoid slowing down the performance).
Example output:
```cpp
bool _check_aoti_runtime_check_inputs_env() {
    const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS");
    const static bool result = env_var_value != nullptr && env_var_value[0] != '\0';
    return result;
}

AOTI_NOINLINE static void __check_inputs_outputs(
    AtenTensorHandle* input_handles,
    AtenTensorHandle* output_handles) {
    if (!_check_aoti_runtime_check_inputs_env()) {
        return;
    }
    // rest of the check
}
```
Test Plan: CI
Differential Revision: D70260490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005
Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh
Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851
Error:
```
#10 73.62 Segmentation fault (core dumped)
#10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
#10 73.85 Segmentation fault (core dumped)
#10 73.85 dpkg: error processing package libc-bin (--configure):
#10 73.85 installed libc-bin package post-installation script subprocess returned error exit status 139
```
Looks like we are hitting: https://github.com/moby/buildkit/issues/5783
Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0
Please note: CUDA 12.8 error is not related to this failure in nightly cpu arm64. Looks like we are trying to install release torch when running on PR. Cuda 12.8 build is not released yet, hence a failure. Will send followup to make sure we are using nightly torch when running on PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372
Approved by: https://github.com/seemethere
Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"`
Also limit the scope to just `Vectorized::map` that has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods
Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16 as it results in an unbalanced pop warning, for push that is defined in vec256_16bit_float, which will be included only once
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
232 | #pragma GCC diagnostic pop
| ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354
Approved by: https://github.com/izaitsevfb
This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}.
There are some parts that are less tested. For example:
* different layout combo
* shapes that are less aligned
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229
Approved by: https://github.com/chenyang78
module: distributed_checkpoint is redundant with oncall: distributed checkpointing.
@fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352
Approved by: https://github.com/fegin
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key `hpu`.
With this change we add entries for Gaudi's distributed backend `hccl` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.
The Out-of-tree backends are registered calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)
Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)
We are binding the process group creator constructs at run-time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary
fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)
And add another entry to the dictionary with the same backend name ( but different device name )
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)
In addition, the out-of-tree devices can utilize the `backend_list` to check for successful backend registration, e.g. APIs like `is_hccl_available`
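A hypothetical sketch of the registration path this relies on; `create_hccl_pg` is a placeholder for the vendor-provided process-group constructor (not an actual PyTorch symbol), and the exact constructor signature depends on whether the extended API is used:
```python
import torch.distributed as dist

def create_hccl_pg(store, rank, world_size, timeout):
    # placeholder: the real constructor would build and return the hccl ProcessGroup
    raise NotImplementedError

# registers the out-of-tree backend name and ties it to the "hpu" device type,
# which also appends "hccl" to Backend.backend_list
dist.Backend.register_backend("hccl", create_hccl_pg, devices=["hpu"])
```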
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).
Added a test based which runs the test_torchinductor tests with subprocess compiling turned on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
HSDP custom hook UTs are multi-threaded and use a single physical GPU. If we set the rank in each thread, then we are referencing the same GPU with multiple ranks, which isn't right. Therefore, this removes the rank setting from these UTs. Now, they pass with 1, 2, and 4 GPUs.
Fixes #147767 and #147769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099
Approved by: https://github.com/jeffdaily
The same code is repeated multiple times with slightly different implementations.
Use the existing function for brevity and consistency.
In the function, the code from `test_export` is used, which does a single `load_library` with cleaner conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082
Approved by: https://github.com/angelayi
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
Enables support for this:
```python
from torch.distributed.launcher.api import LaunchConfig
config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```
These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.
Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
For int8 dynamically quantized activation & int8 quantized weights, add a workaround for an indexing issue in the epilogue creator that expected an empty index (i.e. a 0D tensor) when the activation scale was sized [1, 1], by converting it into a 0D tensor.
The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).
The final hidden-states tensor that is the activation input to the LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, the sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, the Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).
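A minimal sketch of the shape conversion the workaround performs (the real code lives in the Inductor lowering, not user code):
```python
import torch

scale = torch.tensor([[0.02]])     # activation scale arrives as shape [1, 1]
scale_0d = scale.reshape(())       # converted to a 0D tensor holding the same value
assert scale_0d.dim() == 0 and scale_0d.item() == scale.item()
```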
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning the serialization algorithm.
Essentially what we want is to be able to call serialize many times without incurring costs from scratch.
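A rough sketch of the call pattern being optimized, assuming the `torch.compiler.save_cache_artifacts` / `torch.compiler.load_cache_artifacts` entry points (check the docs of your build):
```python
import torch

@torch.compile
def step(x):
    return x.sin() + 1

step(torch.randn(8))

# called after every batch in the workloads mentioned above; this PR makes the
# repeated serialization incremental instead of re-pickling everything from scratch
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    blob, info = artifacts                       # serialized bytes plus cache info
    torch.compiler.load_cache_artifacts(blob)    # can be reloaded in a later process
```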
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
failing with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```
As sqrt is not implemented for CPU, add explicit test to `test_sqrt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
**Summary**
Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case after `tanh`, the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
#147620 enabled `force_shape_pad` for triton kernel benchmark. Intel GPU supports this scenario. Hence, we need to enable the case in this PR. Otherwise, there would be a test case regression for Intel GPU as #147620 has been landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237
Approved by: https://github.com/jansel
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.
This PR adds GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten_weight_int4pack_mm_for_cpu` with minor changes.
Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.
**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Introduced by https://github.com/pytorch/pytorch/pull/146596
I.e. while building locally my log was littered with
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
228 | LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16);
| ^
2 warnings generated.
[230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
228 | LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284
Approved by: https://github.com/Skylion007
First of all, the perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true (due to a bug in the benchmark script that did not call `torch.mps.synchronize` at the end), though the results are still slightly better than MPS Graphs, probably due to lower launch overhead.
And while measuring performance correctly, I've noticed that a lot of time is spent on 64-bit integral division of thread_index to get spatial coordinates. Simply downcasting the divisor to a 32-bit integer (which is also the thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script:
```python
import torch
import time
import subprocess
import itertools


def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    # define kwargs
    kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf}
    # Skip for unimplemented flavors
    if antialias and mode == "bicubic" and device == "mps":
        return None, "Skip"
    elif antialias and dtype != torch.float32:
        if device == "cpu":
            return None, "Skip"
        outputs_match = None
    else:
        # Check output
        y = torch.nn.functional.interpolate(x, **kwargs)
        z = torch.nn.functional.interpolate(x.cpu(), **kwargs)
        outputs_match = torch.allclose(y.cpu(), z)
        if not outputs_match:
            atol = (y.cpu() - z).abs().max()
            rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
            print(f"atol={atol} rtol={rtol}")
    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, **kwargs)
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"
    return "True " if outputs_match else "False", average_time


brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
for mode, antialias in itertools.product(["bilinear", "bicubic"], [False, True]):
    outputs_match_list = []
    average_time_list = []
    for device in ["mps", "cpu"]:
        for dtype in [torch.float32, torch.float16, torch.bfloat16]:
            outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias)
            outputs_match_list.append(str(outputs_match))
            average_time_list.append(average_time)
    print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:")
    print("-"*40)
    print("Device : MPS | CPU")
    print("Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16")
    print(f"Outputs Match : ", " | ".join(outputs_match_list))
    print(f"Average Time (us) :", " |".join(average_time_list))
```
Before
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : True | True | True | True | True | True
Average Time (us) : 292.0 | 264.7 | 267.9 | 289.1 | 230.9 | 309.1
atol=1.430511474609375e-06 rtol=0.11363636702299118
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : False | False | False | True | None | None
Average Time (us) : 698.3 | 684.2 | 683.8 | 851.0 |Skip |Skip
atol=2.086162567138672e-06 rtol=0.019750799983739853
Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : False | True | True | True | True | True
Average Time (us) : 314.3 | 301.0 | 298.8 | 681.5 | 616.7 | 833.7
```
After
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : True | True | True | True | True | True
Average Time (us) : 119.9 | 98.9 | 98.6 | 289.8 | 231.9 | 308.5
atol=1.430511474609375e-06 rtol=0.05681818351149559
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : False | False | False | True | None | None
Average Time (us) : 541.9 | 531.1 | 531.0 | 846.8 |Skip |Skip
atol=2.0265579223632812e-06 rtol=0.008604463189840317
Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16
Outputs Match : False | True | True | True | True | True
Average Time (us) : 314.3 | 301.0 | 298.8 | 681.5 | 616.7 | 833.7
```
TODO:
- Figure out if these ops make more sense as 3D jobs, with the n and c channels dispatched as one more dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277
Approved by: https://github.com/Skylion007
A recent user experience is like this:
* User runs AOTI lowering, it's successful.
* They take AOTI model and run it with some sample inputs. Everything runs well
* Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests.
* They see that some of the requests fail. The logs show them this:
* AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ...
* Error: u45 >= 2
* To the untrained eye, "AOTInductorModel run failed" is all they see. But, the true reason is Error: u45 >= 2
However, the assertion isn't always correct.
* In fact, u45 can actually be 0.
* So, why did AOTI say u45 ≥ 2? It's a two-piece combo:
* With 0/1 Specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value-range of [2, inf]
* In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45).
* Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion.
So, the motivation for this PR is to add an option to disable the logging if you run into a situation like this. Another way to avoid this is to call `mark_unbacked()` on all the dynamic dims.
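A sketch of the `mark_unbacked()` workaround mentioned above (the `torch._dynamo.mark_unbacked` path is assumed here; shapes are illustrative):
```python
import torch

x = torch.randn(3, 32, dtype=torch.bfloat16)
# mark dim 0 as unbacked so no [2, inf] 0/1-specialized range is assumed for it
torch._dynamo.mark_unbacked(x, 0)

compiled = torch.compile(lambda t: t * 2)
compiled(x)
```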
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462
Approved by: https://github.com/desertfire, https://github.com/22quinn
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular.
[Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
Fixes https://github.com/pytorch/torchtitan/issues/864
## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.
My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.
## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.
## Example
- Concrete example:
- `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
- During post grad pass in async TP:
- `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
- `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).
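To make the shapes above concrete, here is a small standalone sketch (plain tensors standing in for Float8Tensor and its rowwise scale):
```python
import torch

A = torch.randn(1, 8192, 2048)        # "A tensor"
scale = torch.randn(1, 8192, 1)       # rowwise scale

A_2d = A.reshape(-1, A.shape[-1])     # (8192, 2048) -- what scaled_mm sees
scale_2d = scale.reshape(-1, 1)       # (8192, 1)    -- scale reshaped alongside it

# the fix reshapes the (reciprocal of the) scale back to the pre-reshape 3D shape
scale_3d = scale_2d.reshape(1, -1, 1)
assert scale_3d.shape == scale.shape
```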
## Solution
**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.
- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape.
- reshape is just a view, so there should be no impact on performance
```
Before:
reshape (a,b,c) to (a*b,c) -> reciprocal
After:
reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```
- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
Issue #148219 highlighted the high dispatch times of ops which ran with MPS Graph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue.
## Speedups:
Matrix size means NxN matrix here.

Code to generate the times (requires building torch before and after this change):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3


def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start


results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")
    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")
        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)
        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)
        mean_time = np.mean(times)
        std_time = np.std(times)
        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)
        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk, https://github.com/wdvr
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D69945678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```
Differential Revision: D69998314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015
Approved by: https://github.com/XilunWu
Fixes https://github.com/pytorch/torchtitan/issues/864
## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.
My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.
## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.
## Example
- Concrete example:
- `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
- During post grad pass in async TP:
- `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
- `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).
## Solution
**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.
- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape.
- reshape is just a view, so there should be no impact on performance
```
Before:
reshape (a,b,c) to (a*b,c) -> reciprocal
After:
reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```
- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559
Approved by: https://github.com/eellison
test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230
Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
In this case, the parameters have already been filtered [here](201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`.
For example:
```python
(Pdb) triton_meta["signature"]
{'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'}
(Pdb) call_args
['arg0_1', 'arg0_1', '256L', 'buf0']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011
Approved by: https://github.com/davidberard98
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8
Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
- First, by not inverting sizes and strides, i.e. passing them as is, but reading them in inverse order in the shader, as the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels, and the 3rd and 4th for spatial coordinates
- Pass `scales` as float2 even for a linear tensor
The above allows collapsing the two flavors of `upsample_kernel_out_template` into one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211
Approved by: https://github.com/dcci
ghstack dependencies: #148154, #148187
It's not fully clear why these are not being created, but you can definitely
reproduce this in code. `__name__` is fun, since there appears to be no way to
explicitly set it on the pybind11 layer or c++ layer. I've set this in the
python wrapper code (which works correctly). But let me know if people feel
strongly and want us to explicitly cast to Python within the cpp functions
and set it there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906
Approved by: https://github.com/jansel
ghstack dependencies: #147894
Fixes the issue:
```python
import torch.utils.tensorboard
torch.utils.tensorboard.FileWriter # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.RecordWriter # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.SummaryWriter # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard"
```
See the [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html).
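A quick usage sketch showing the now-recognized export (assuming the `tensorboard` package is installed):
```python
from torch.utils.tensorboard import SummaryWriter  # type checkers now resolve this

writer = SummaryWriter(log_dir="runs/demo")
writer.add_scalar("loss", 0.5, global_step=0)
writer.close()
```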
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550
Approved by: https://github.com/albanD
Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` to use a common `INSTANTIATE_UPSAMPLE2D`
Then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`
I.e. functionally it's a no-op, but achieves the same with fewer lines of code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187
Approved by: https://github.com/Skylion007
ghstack dependencies: #148154
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True, since it indicates the operation doesn't mutate the input value.
Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel
```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok
----------------------------------------------------------------------
Ran 2 tests in 0.706s
OK
```
Differential Revision: D69878591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
Summary:
# Why
Leverage kBatch parameter for large splitK examples for CK for better than ATEN performance
# What
replace default kBatch = 1 with a manual heuristic
- if K > 16 * max (M,N)
- leverage k_per_block, and K and number of SMs on the chip
- upper bound to 128, lower bound to 1
This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN
This is of course subject to change and improvement
Test Plan:
with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288`
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```
```
AUTOTUNE mm(2048x524288, 524288x2048)
rocm_ck_gemm_template_49 10.4972 ms 100.0%
rocm_ck_gemm_template_8 10.6132 ms 98.9%
rocm_ck_gemm_template_9 10.6907 ms 98.2%
[...]
mm 18.9880 ms 55.3%
```
Reviewed By: ColinPeppler
Differential Revision: D70224591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118
Approved by: https://github.com/ColinPeppler
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`
The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA.
`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.
@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
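A sketch of how the new counter can be inspected in a test, following the existing `torch._dynamo.utils.counters` pattern (the model run itself is elided):
```python
from torch._dynamo.utils import counters

counters.clear()
# ... compile and run the WoQ model with max-autotune here ...
compiled_kernels = counters["inductor"]["cpp_templated_kernel_counter"]
# in the UT, a nonzero value proves a templated kernel actually compiled,
# so a silent compilation failure can no longer slip through
print(f"templated kernels compiled: {compiled_kernels}")
```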
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
We've root caused this to correctly throwing AttributeError on ScriptFunction
when missing attributes are accessed. This PR will fix the crashes that are showing
up. I'm going to stack a second PR to fix torch._C.ScriptFunction just being a
very badly behaving Python object (which should also fix this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/torchtitan/issues/864
## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.
My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.
## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.
## Example
- Concrete example:
- `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
- During post grad pass in async TP:
- `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
- `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).
## Solution
**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.
- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape.
- reshape is just a view, so there should be no impact on performance
```
Before:
reshape (a,b,c) to (a*b,c) -> reciprocal
After:
reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```
- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.
As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on CPU while others are on GPU, e.g. keeping a loop_idx on CPU saves host-to-device copies. This PR relaxes the constraint and only checks that the carry and input at the same position have the same device.
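A minimal sketch of the newly allowed mixed-device usage, assuming a CUDA device and the `while_loop` HOP import path below:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop

def cond_fn(i, x):
    return i < 5                      # loop-index comparison stays on CPU

def body_fn(i, x):
    return i + 1, x + 1.0             # the GPU carry is updated on device

i0 = torch.tensor(0)                  # CPU loop counter
x0 = torch.zeros(4, device="cuda")    # GPU carry
i_final, x_final = while_loop(cond_fn, body_fn, (i0, x0))
```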
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in the wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for others (e.g. while_loop) as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/executorch/issues/8711
In ExecuTorch when we try to parse the following schema:
```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:
```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it as an inplace operator.
I understand we are not supposed to pass in namespace "aten::" into
`FunctionSchema.parse()` but unfortunately ExecuTorch requires this
feature to work.
This PR adds a new `namespace` attribute to `BaseOperatorName` and makes
sure the rest of the stack works as before, if a schema without
namespace is passed in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic.
NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (e.g. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979
Approved by: https://github.com/Mingming-Ding
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so as to quickly adopt the optimizations and bug fixes provided by the oneDNN team.
This PR does not change the behavior of other backends like Intel CPU and ARM; they can keep using the stable version contained in `third_party/ideep`.
# Detail
At compilation time, we will `git clone` oneDNN via URL `https://github.com/oneapi-src/oneDNN` and checkout to the tag/commit that Intel GPU backend prefers. This feature is supported by CMake `Externalproject_add` command.
Following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source is located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`
# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```
Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```
We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang
Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py so that HFStorage components don't get imported internally and in turn there is no fsspec import that happens. I did the removal from __init__.py in D70286926 to fix the failing tests but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere
Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
Differential Revision: D70324090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
Earlier, with the inline flag, we were lifting id-guarded tensors to be inputs to the FX graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.
This PR installs the parameters back into the graph. The benefit is the removal of all pre-graph bytecode that extracts the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has a large number of optimizer parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@meta.com>
TODO:
- [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x] Tests
- [x] handling of retain_graph
- [x] respect fallback random
Fix for https://github.com/pytorch/pytorch/issues/130123.
Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.
We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.
```
===== Forward graph 1 =====
/data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)
# No stacktrace found for following nodes
graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None
...
===== Backward graph 1 =====
def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)
# No stacktrace found for following nodes
graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None
```
There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0
Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary.
Other notes:
Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.
Questions for reviewers:
This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer to not have an rng divergence just for cudagraph. I am making it respect `fallback_random`.
Edit: decided to apply to non-cudagraphs as well, so long as fallback_random is not set.
I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position. Not sure. Let me know your thoughts.
Edit: updated to be taken from randint().
Update: initializing rng states from torch.randint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
Fixes: https://github.com/pytorch/pytorch/issues/148120
Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
from torch._C import * # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing. This makes the script think that there are fewer retries than there actually are and that the job is not a rerun-disabled-tests job. Instead, query for the job name to see if it contains "rerun disabled tests", and fall back to counting the number of retries if the query fails.
Alternate options: relax the check for the number of tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027
Approved by: https://github.com/huydhn
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
Summary:
# Why
not all choices of kBatch are valid; invalid ones lead to a runtime error (when CK checks the validity of the args)
c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)
# What
- move kBatch inside the gen_ops to have more control over it, and be able to filter it
- expand filtering based on the cpp logic
- refactor the padding checks to be more readable
Test Plan:
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```
with
kBatch = 128: some filtering
kBatch = 1: no filtering
kBatch = 1738: all options filtered out
Reviewed By: henrylhtsang
Differential Revision: D70211442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004
Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent
The problem is that the new Triton takes the following code branch, which does not filter out call parameters that may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)
Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499):
```bash
Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b
dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module>
compiled_module_main('None', benchmark_compiled_module)
File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main
wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000
File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module
return print_performance(fn, times=times, repeat=repeat)
File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance
[timed(model, example_inputs, times, device) for _ in range(repeat)]
File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp>
[timed(model, example_inputs, times, device) for _ in range(repeat)]
File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed
result = model(*example_inputs)
File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda>
fn = lambda: call([arg0_1, arg1_1])
File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call
triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1)
File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run
bound_args, specialization, options = binder(*args, **kwargs)
TypeError: dynamic_func() got multiple values for argument 'XBLOCK'
```
Reproduce:
`python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps`
Triton: c4a79a1960
Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732
@davidberard98 @etaf please take a look
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746
Approved by: https://github.com/jansel
Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
Summary:
Add unique_user_kernel_names, which mimics what unique_kernel_names does, but for user-defined Triton kernels.
This does rewrite the copied kernel src, and modifies non-Inductor generated code, so we split it out from unique_kernel_names, where we have more control over all namings and generations.
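A small sketch of toggling the new flag for debugging, assuming it sits next to the existing `unique_kernel_names` option in `torch._inductor.config`:
```python
import torch._inductor.config as inductor_config

inductor_config.unique_kernel_names = True        # Inductor-generated kernels
inductor_config.unique_user_kernel_names = True   # user-defined Triton kernels (this change)
```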
Test Plan: Only used for debug purpose
Differential Revision: D69966608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587
Approved by: https://github.com/desertfire
Summary: This was causing HFStorageReader/Writer to be imported, which imports fsspec; since internal dependencies don't have fsspec, builds were failing.
Differential Revision: D70286926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation is no longer needed.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
# summary
Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices. If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.
This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).
- Scales are flipped based on transpose_result
- Handles boundary conditions
Note that MXFP4 is not added in this PR - we can tackle that in a future PR.
This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.
# test plan
```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg
Co-authored-by: drisspg <drisspguessous@gmail.com>
Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/)
AlgorithmSelectorCache is a cache. The expectation is that when we force-disable caches and clear inductor caches, it would be cleared. However, that is not the case.
The reason why this is a problem can be seen by following this repro:
What we will see is
```
SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices
SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices
```
The root cause is that precompilation is skipped because this cache is still populated, while autotuning isn't skipped since we force-disable caching.
repro:
```
import logging
import os
os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor"
import torch
import torch._inductor.config
from torch._inductor.utils import clear_inductor_caches
torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.autotune_num_choices_displayed = None
torch._inductor.config.max_autotune_gemm_backends = "CUTLASS"
torch._inductor.config.autotune_fallback_to_aten = False
torch._inductor.config.cuda.cutlass_instantiation_level = "0001"
def main():
M, N, K = 2048, 2048, 2048
dtype = torch.bfloat16
A = torch.randn(M, K, device="cuda", dtype=dtype)
B = torch.randn(K, N, device="cuda", dtype=dtype)
for _ in range(2):
torch._dynamo.reset()
clear_inductor_caches()
compiled_model = torch.compile(torch.mm, fullgraph=True)
_ = compiled_model(A, B)
print("done")
if __name__ == "__main__":
main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590
Approved by: https://github.com/eellison, https://github.com/chenyang78
As title. Without this patch we get the following error:
Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite
work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.
So we can't easily just temporarily enable the `allow_non_fake_inputs`
flag on the current fake mode when AOTAutograd processes a `flat_apply`
node from Dynamo's `nonstrict_trace` handling. After discussing
with zou3519, I decided to add a global `FakeTensorTLS` that contains an
`allow_non_fake_inputs_override` flag, and to patch the `nonstrict_trace`-ed
function to temporarily tweak this flag during its execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.
1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)
## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.
The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards). This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
`torch._higher_order_ops.flat_apply` (which unflattens the inputs and
invokes the underlying function).
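A minimal usage sketch for the simple (proxy-able inputs) case, assuming the decorator behaves like an enhanced `allow_in_graph` as described above; non-proxy-able input types would additionally need the pytree registration mentioned in step 1:
```python
import torch

@torch._dynamo.nonstrict_trace
def renorm(x):
    # traced non-strictly; its call is represented via flat_apply in the Dynamo graph
    return x / (x.norm() + 1e-6)

@torch.compile(fullgraph=True)
def f(x):
    return renorm(x) + 1

f(torch.randn(4))
```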
## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contains `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.
The patch also exposes a `func_to_graphable` rather than having the
users calling the lower level `pytree.flatten(ConstantFunction(...))`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
A bug was reported by an internal user.
AOTD classified outputs that are aliases of intermediates of the graph into different categories.
...
- output is alias of intermediate which base is already output
- output is alias of intermediate which base is not in output
If we look at the fn:
```
def fn(x):
ix = x + 1
a = ix.transpose(0, 1)
return a.detach(), a
```
output 0: detach view of alias a, where a is already output
output 1: alias of intermediate ix, then additional output ix will be added internally
output 0 base is TensorAlias(a) in this case, but could be Tensor.
Adding runtime unwrapping solves this problem.
Alternatively, we could track the base of a.detach() all the way to ix; in that case the base would always be a Tensor, not a TensorAlias.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
Summary:
Support the same functionality with acc_tracer disabled; add a new config for pre_grad add/remove_passes. At the front end it still uses the same interface.
Some minor updates in the pre_grad passes make sure the passes run in the desired order; after the added passes, we still run passes like remove_noops at the end.
Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)
Differential Revision: D68909278
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
StridedTensor<T> input(input_data, sizes, input_strides, ndim);
input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow
- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, which is used when both input and mask are dense tensors of the same size; `broadcast`, which is used when `mask`'s leading dimensions are expandable into the input tensor; and `strided`, which is a general-purpose fallback that still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.
Performance measured on M2Pro thru different iterations of the same shader
| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------|------| -----| ---- | --- | ---- |
| float32 | 2.8 msec | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec| 26.6 msec | 4.6 msec | 1.9 msec |
|bfloat16|1.86 msec |38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |
And benchmark script
```python
import torch
from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer
def bench_mask_fill(
n,
binary_func,
dtype=torch.float32,
) -> Measurement:
t = Timer(
stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
globals = {'f': binary_func},
language="python", timer=default_timer
)
return t.blocked_autorange()
if __name__ == "__main__":
n = 1024
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
eager_t = bench_mask_fill(n, torch.fmax, dtype)
use_msec = eager_t.mean > 1e-4
multiplier = 1e3 if use_msec else 1e6
uname = "msec" if use_msec else "usec"
print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
Fixes #147924
We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
The default action doesn't use more processes, possibly because most GitHub-provided runners only have 2 CPUs, but we have more than that, so we might as well use them
Generally cuts maybe 1 min off of checkout time?
Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
Resolves https://github.com/pytorch/pytorch/issues/146767.
May also resolve https://github.com/pytorch/pytorch/issues/147584.
### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:
1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG which currently requires CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.
Besides, `OffsetBasedRNGTracker` only accepts `DeviceMesh` argument to its constructor method.
### Consequence
DTensor RNG initialization is delayed till the first DTensor random ops call or `torch.distributed.tensor.random.manual_seed`.
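A single-rank sketch of the new behavior on a CPU-only setup (where the old eager RNG-tracker init would have required CUDA); the gloo/port details are only there to keep the example self-contained:
```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

# With this change, distributing a tensor does not touch the RNG tracker;
# it is only initialized on the first DTensor random op or manual_seed call.
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

dist.destroy_process_group()
```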
### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
## Before
Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order wrt other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
All tests pass when running the CA graph directly; the remaining issues are in Dynamo.
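A hedged end-to-end sketch of the user-facing setup this enables: non-reentrant checkpointing executed under compiled autograd, with the captured CA graph run directly (identity "compiler"), which is the configuration described above as fully working:
```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.nn.functional.gelu(x @ x.mT)

x = torch.randn(64, 64, requires_grad=True)
loss = checkpoint(block, x, use_reentrant=False).sum()

# Unpack hooks (here, checkpoint's recomputation) now fire progressively as
# the CA graph executes, instead of all up front.
with compiled_autograd.enable(lambda gm: gm):
    loss.backward()
```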
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
Resubmission of #144974 which was reverted for unrelated reasons.
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This allows eliminating the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.
Persistent kernels become an issue when other kernels are running concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.
While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.
I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
Summary:
# Why
Enable us to set the kBatch parameter, rather than bake it in
Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests)
## Why like this
The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code.
The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, as now multiple choices of kBatch can be used with the exact same `.so` (as the shared library does not depend on kBatch, but takes it as a parameter)
# What
- copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent
- pipe through everything from template and kernel
- hard-code it to be kBatch=1 for now (same as before, just now settable)
This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch that don't work
2. set this better for splitK scenarios (hand written heuristic)
Test Plan:
(with minor modifications)
```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```
```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```
Differential Revision: D70200008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler
Improve performance for shapes that use block radix sort by decreasing the item_per_thread to 8.
This will increase the thread block size leading to higher occupancy.
Co-author: @amd-sushetty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657
Approved by: https://github.com/jeffdaily
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.
The original decomposition:
1. compute `mean `and `rstd`,
2. out = (x - mean) * rstd, compute in the range [N, C, *],
3. out = out * weight + bias, compute in the range [N, C, *],
The new decomposition:
1. compute `mean `and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C],
3. out = x * new_weight + new_bias, compute in the range [N, C, *],
I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models(functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models(Stable Diffusion and Latent Consistency Model(LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen.
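A quick numerical check of the refactoring above, i.e. that (x - mean) * rstd * weight + bias equals x * new_weight + new_bias with the folded affine parameters computed in the [N, C] range:
```python
import torch
import torch.nn.functional as F

N, C, H, W, groups, eps = 2, 6, 8, 8, 3, 1e-5
x = torch.randn(N, C, H, W)
w, b = torch.randn(C), torch.randn(C)

# per-(N, group) stats, broadcast back to per-channel [N, C]
xg = x.reshape(N, groups, -1)
mean = xg.mean(-1).repeat_interleave(C // groups, dim=1)
rstd = (xg.var(-1, unbiased=False) + eps).rsqrt().repeat_interleave(C // groups, dim=1)

# step 2 of the new decomposition: fold stats into the affine params in [N, C]
new_w = rstd * w
new_b = b - mean * rstd * w

out = x * new_w[:, :, None, None] + new_b[:, :, None, None]
torch.testing.assert_close(out, F.group_norm(x, groups, w, b, eps), rtol=1e-4, atol=1e-4)
```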
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Fix a link to numpy documentation that has moved and now 404's.
I've checked other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697
Approved by: https://github.com/soulitzer
Summary:
# Summary
### Sticky points
Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC
## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419
### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD
imported-using-ghimport
Test Plan:
Imported from OSS
Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output `
Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```
Reviewed By: vkuzo
Differential Revision: D68502879
Pulled By: drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls.
The reason I didn't land that was because in `legacy_load`, the constructor would be called before the storages of indices/values are set, so the tensor would not actually be validated.
Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process, since afaict https://github.com/pytorch/pytorch/pull/27062 was the first PR that added support for sparse tensor serialization and it already uses `_rebuild_sparse_tensor` (which would add the rebuilt tensor to the list to validate), but torch.sparse.FooTensor is allowlisted.
This PR adds tensors constructed as such to the list to validate at the end of torch.load.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759
Approved by: https://github.com/albanD
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high level approach:
- Create the local plan and cache the same.
- In the next iteration, compare the local plan with the cached plan metadata. If there is no change, do not send that local plan in the collective.
- The global plan step will only create the global plan with the new delta plans, and empty plans for the cached ones.
- The finish plan step will check for empty plans. If a plan is empty, it will grab the cached plan; if not, it will use the new plan provided.
Test Plan: UTs
Differential Revision: D69224491
## How to enable the caching:
DefaultSavePlanner introduces the enable_plan_caching which is set to False by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set this to True to enable the caching and we should see significant speed up in the subsequent checkpoint save attempts, specially for larger scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695
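A hedged sketch of opting in (the `enable_plan_caching` flag name is taken from the description above; the rest is a standard DCP save call, shown single-process for brevity):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

state_dict = {"weight": torch.randn(16, 16)}
planner = DefaultSavePlanner(enable_plan_caching=True)

# Repeated saves can reuse the cached local plan instead of re-sending it in the collective.
for step in range(2):
    dcp.save(state_dict, checkpoint_id=f"/tmp/ckpt_{step}", planner=planner)
```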
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
The `export` API takes a `nn.Module` and traces its `forward` method. However sometimes it is useful to export different methods of a `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance.
This PR adds a couple of utils in `torch._export.utils` for this workflow.
The `wrap_method` util wraps a method as a `nn.Module` that can then be exported. See included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state.
These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde.
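A hedged sketch of the intended workflow (the exact signatures of `wrap_method` / `sync_state` are assumptions based on the description above):
```python
import torch
from torch._export.utils import wrap_method  # assumed to accept a bound method

class Codec(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def encode(self, x):
        return self.proj(x)

    def decode(self, z):
        return z @ self.proj.weight

m = Codec()
x = torch.randn(2, 4)
# Exporting two methods of the same instance; the two programs share m's state.
ep_encode = torch.export.export(wrap_method(m.encode), (x,))
ep_decode = torch.export.export(wrap_method(m.decode), (x,))
```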
Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573
Approved by: https://github.com/tugsbayasgalan
For a custom op that returns a list of a single tensor with unbacked symint shape:
```python
@torch.library.custom_op(
"aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={}
)
def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]:
s = x.sum().to(torch.int64)
return [torch.randn(s.item())]
@fn_ret_list_of_single_tensor.register_fake
def _(x):
ctx = torch._custom_op.impl.get_ctx()
i0 = ctx.new_dynamic_size()
return [torch.randn(i0)]
```
Before the fix, we have the following error:
```
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’
456 | auto u0 = std::get<0>(buf1).size(0);
| ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note: expected a type, got ‘0’
In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14,
from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5,
from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4,
from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3,
from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31,
from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8,
from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4,
from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6,
from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3,
from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39,
from /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366:
/usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
1145 | constexpr const _Tp&& get(const variant<_Types...>&& __v)
| ^~~
/usr/include/c++/11/variant:1145:27: note: template argument deduction/substitution failed:
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
456 | auto u0 = std::get<0>(buf1).size(0);
| ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note: expected a type, got ‘0’
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649
Approved by: https://github.com/angelayi
ghstack dependencies: #147130
Fixes #144203.
We build a custom libdrm when preparing our docker image. We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink. Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250". The libdrm warning is noisy, so we are removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791
Approved by: https://github.com/jeffdaily
We have a failing unit test on Aarch64
```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')
To execute this test, run the following from the base repo dir:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
After debugging it I found that the `ex` variable is not being reset to None on each loop iteration inside _ref_test_helper. After fixing that, another expectedFailure was highlighted to re-enable - `nn.functional.hinge_embedding_loss` - which was incorrectly being skipped due to the same problem.
4a545eb85d/test/test_ops.py (L546)
The `ex` variable is not reset after this point for the next loop iteration.
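A minimal, self-contained illustration of the bug pattern described above (not the actual test code): without resetting `ex` per iteration, a failure recorded for one sample leaks into the check for the next one:
```python
def check(sample):
    if sample == 0:
        raise ValueError("expected failure for sample 0")

ex = None
for sample in (0, 1):
    # ex = None  # <-- the fix: reset at the top of each iteration
    try:
        check(sample)
    except Exception as e:
        ex = e
    if ex is not None:
        print(f"sample {sample} treated as failing: {ex}")  # sample 1 is wrongly flagged
```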
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
There is a naive matmul kernel written for MPS matmul which is used when the input types are integer (and in some other cases on older macOSes). The old kernel is naive, with global memory accesses that really tank performance, especially when the matrix is sufficiently large.
This PR optimizes it (even though there might be more optimizations from using simdgroup matrices, which I'll cover in a follow-up since writing that kernel will take more time).
## Performance comparison on M1 Pro:

You can get these numbers by running this script with the old kernel compiled and then the new kernel compiled (make sure to change the csv where each output is written):
```python
import torch
import numpy as np
import time
import csv
matrix_sizes = [32, 128, 512, 1024, 2048, 4096]
num_runs = 10
warmup_runs = 3
def run_int_mm(A, B):
torch.mps.synchronize()
start = time.perf_counter()
c = A @ B
torch.mps.synchronize()
end = time.perf_counter()
return c, end - start
results = {
'N': [],
'mean_time': [],
'std_time': []
}
for n in matrix_sizes:
print(f"\nBenchmarking N={n}")
try:
A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
for _ in range(warmup_runs):
_, _ = run_int_mm(A_mps, B_mps)
times = []
for _ in range(num_runs):
_, t = run_int_mm(A_mps, B_mps)
times.append(t)
mean_time = np.mean(times)
std_time = np.std(times)
results['N'].append(n)
results['mean_time'].append(mean_time)
results['std_time'].append(std_time)
print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
except RuntimeError as e:
print(f"Error for N={n}: {e}")
continue
with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['N', 'mean_time', 'std_time'])
for i in range(len(results['N'])):
writer.writerow([
results['N'][i],
results['mean_time'][i],
results['std_time'][i]
])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This allows eliminating the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.
Persistent kernels become an issue when other kernels are running concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.
While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.
I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.
**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)
```python
@triton.jit
def branch_with_multiple_yield_args(
in_ptr0,
in_ptr1,
out_ptr,
conditional_ptr,
n_elements,
BLOCK_SIZE: "tl.constexpr",
):
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
conditional = tl.load(conditional_ptr)
if conditional:
in0 = in_ptr0 + 1
in1 = in_ptr1 + 1
out = out_ptr + 1
else:
in0 = in_ptr0
in1 = in_ptr1
out = out_ptr
x = tl.load(in0 + offsets, mask=mask)
y = tl.load(in1 + offsets, mask=mask)
tl.store(out + offsets, x + y, mask=mask)
```
The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters. When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.
The if branch gets converted to TTIR like this:
```mlir
%21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
...
scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
} else {
scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
} loc(#loc7)
```
and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).
**This PR** duplicates the `scf.yield`, adding one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some of them are.
Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight-only mm in the new path against int8 mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There are a couple of reasons for the speedup:
- Prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- Similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to instead of the deferred epilogue tuning in the current path.
It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.
The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, but not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). However, the current impl does not compare against a cublas baseline, and I found that it makes things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100), so this should be fine to delete.
Future optimizations could include:
- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.
Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
Fixes #147208
**Summary**
The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs. This is because the TensorIterator-based implementation does not support multiple elements per byte, and `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch. So, we add a check to throw a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`.
**Test plan**
```
pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430
Approved by: https://github.com/mingfeima, https://github.com/ngimel
# summary
Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices. If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.
This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).
- Scales are flipped based on transpose_result
- Handles boundary conditions
Note that MXFP4 is not added in this PR - we can tackle that in a future PR.
This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.
# test plan
```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg
Co-authored-by: drisspg <drisspguessous@gmail.com>
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677
This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.
### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)
### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.
### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)
These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel across all the lanes in the warp; this means that the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: this optimization ensures that for a large number of duplicates we can skip 32 values at a time to speed up the count (see the sketch after this list).
- for a low number of duplicates, i.e. fewer than `warp-size` duplicates, we just perform the tail reduction, which avoids the wasteful parallel reduction across the warp for this case (it would only add zero values).
- for a high number of duplicates, i.e. more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` with a large number of duplicates (i.e. a duplicate spike). For this to happen we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.
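A pure-Python sketch of the skip-by-32 counting idea referenced in the list above (this mirrors the logic conceptually and is not the CUDA implementation; `warp_size` stands in for the GPU warp width):
```python
def count_duplicates(sorted_indices, start, warp_size=32):
    """Count how many times sorted_indices[start] repeats in the sorted array."""
    idx = sorted_indices[start]
    end = start
    # coarse phase: jump a full warp width while the run is known to continue
    while end + warp_size <= len(sorted_indices) and sorted_indices[end + warp_size - 1] == idx:
        end += warp_size
    # tail phase: finish one element at a time
    while end < len(sorted_indices) and sorted_indices[end] == idx:
        end += 1
    return end - start

assert count_duplicates([3, 3, 3, 5, 7], 0) == 3
```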
Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.
Co-authored-by: Hashem Hashemi <Hashem.Hashemi@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
This pull request reverts the changes to `torch/_inductor/ir.py` file that were added in #146917.
Where I tested, there were changes only in `torch/_inductor/codegen/cpp_wrapper_gpu.py`; it turns out that the changes in the `torch/_inductor/ir.py` file are not really needed. It's my fault: I didn't sync the environments (between several machines) correctly.
@davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639
Approved by: https://github.com/etaf, https://github.com/davidberard98
This PR has a UT speed-up and some refactoring of tests.
A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed the matmul_small_brute_force_tunableop UT for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR *reduces* the execution time by 20+ minutes.
We also move a hipBLASLt version check to a different tunableop UT for simplicity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 STL implementation, the HIP clang compiler on Windows cannot evaluate the following memcpy at compile time. To resolve this, `__builtin_memcpy` is used instead, which doesn't have this limitation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316
Approved by: https://github.com/jeffdaily
This PR aims to fix the invalid path for windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock`
Windows does not allow chars `\ / : * ? " < > |` in a path.
This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because `os.replace` allows the target file to exist on Windows while `os.rename` does not (illustrated in the sketch after the table).
| Function | `os.rename()` | `os.replace()` |
|--------------------------------|----------------------------|----------------------------|
| Rename a file | ✅ | ✅ |
| Move a file | ✅ | ✅ |
| Overwrite an existing file | ❌ (Error on Windows) | ✅ (Will overwrite) |
| Overwrite an existing directory | ❌ (Error on Windows) | ❌ (Error on Windows) |
| Move across disks | ❌ | ❌ |
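A minimal, self-contained illustration of the difference (plain standard-library behavior, not code from this PR):
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "code_state_new.pkl")
    dst = os.path.join(d, "code_state_old.pkl")
    for path in (src, dst):
        with open(path, "w") as f:
            f.write(path)
    # os.rename(src, dst) would raise FileExistsError on Windows because dst
    # already exists; os.replace overwrites it on all platforms.
    os.replace(src, dst)
    assert os.path.exists(dst) and not os.path.exists(src)
```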
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708
Approved by: https://github.com/jansel
Summary:
Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
and mat1.shape[0] == 1
and mat2.shape[0] <= 64
and mat2.shape[1] <= 16
```
We have a new case where `mat2.shape[1] = 304`, and the benchmark shows that it will be beneficial if we decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
and mat1.shape[0] == 1
and mat2.shape[0] <= 64
and mat2.shape[1] <= 512
```
Differential Revision: D70033166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673
Approved by: https://github.com/houseroad
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.
This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:
+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
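A small usage sketch of the single-variable configuration mentioned in the last bullet (the `:4096:8` value follows the documented `CUBLAS_WORKSPACE_CONFIG` format, i.e. 8 workspaces of 4096 KiB; applying it to `cuBLASLt` as well is the new behavior described above):
```python
import os

# Must be set before the first cuBLAS/cuBLASLt call in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b  # uses the per-stream workspace sized by the variable above
```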
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
I saw that their disable issues were getting spammed with comments, meaning that the tests were still running in CI despite having a disable issue. So I added the missing super().setUp() call, which checks whether there's a disable issue for them.
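A sketch of the pattern, assuming the common `TestCase` base from `torch.testing._internal.common_utils`, whose `setUp` performs the disable-issue lookup:
```python
from torch.testing._internal.common_utils import TestCase, run_tests

class MyTests(TestCase):
    def setUp(self):
        # Without this call the disable-issue check never runs, so a test with
        # an open "DISABLED ..." issue keeps executing (and commenting) in CI.
        super().setUp()

    def test_something(self):
        self.assertTrue(True)

if __name__ == "__main__":
    run_tests()
```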
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651
Approved by: https://github.com/huydhn
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, for batch size 2 and above, an assertion is hit because non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.
### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it is contiguous in the last dimension, and remove the assertion that requires contiguous activation.
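For illustration, a sketch of the activation layout in question (the shapes are made up; the point is a tensor that is non-contiguous overall but contiguous in its last dimension, as produced by slicing the final hidden state):
```python
import torch

hidden = torch.randn(2, 8, 64, dtype=torch.bfloat16)  # [batch, seq, hidden]
a = hidden[:, -1, :]  # last-token slice fed to the LM head

print(a.is_contiguous())    # False: the 2D slice keeps the batch stride of 512
print(a.stride()[-1] == 1)  # True: contiguous along the last dim, now accepted
```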
### Alternative solutions considered
Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code.
Another aspect to this issue is - is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? I guess we need some data-points to analyze this part, although I think the performance should be good enough with this patch, since the first cache lines of rows of A are being explicitly prefetched in the existing code (and it also avoids copy, which a `contiguous` call would do).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
running python test/strobelight/examples/compile_time_profile_example.py
```
strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com
strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0
strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996
strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running
strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds
strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e
strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #147547
This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows.
Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347
Approved by: https://github.com/jansel, https://github.com/EikanWang
I got complaints while irangeifying some files in ExecuTorch that irange could not be used in a constexpr function. This made the complaints go away.
I added a constexpr function in irange_test that used to fail to build with `error: variable of non-literal type 'iterator' (aka 'integer_iterator<int, true>') cannot be defined in a constexpr function before C++23` and now builds fine.
Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633
Approved by: https://github.com/albanD
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.
Test Plan: CI
Reviewed By: SherlockNoMad
Differential Revision: D69689154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677
This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.
### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)
### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.
### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)
These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Currently the install_triton.sh script uses "pip install -e ." to install Triton.
Using the -e flag is sometimes appropriate for development work but is less appropriate for delivery.
To make matters worse, it seems the behavior of -e varies depending on the version of pip involved.
This PR removes the -e and installs Triton normally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge. Currently it's set to 2 because it would oom, but I'm curious as to how often people's builds oom. I can't test this on my own because of caching, so it has to run on pull request
This might result in a failing job on many people's PRs and I'm not sure how to get around it. I named it stable to make it automatically get sorted into the stable group for Dr. CI, but it'll still show up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487
Approved by: https://github.com/huydhn
Currently, the bfloat16 microkernel that uses AMX vectorization requires that the weights are in an interleaved VNNI format. For GEMM code, this hasn't been an issue because GEMM currently only supports constant weights, so the VNNI weight packing is done during compile-time and saved as a constant tensor to the graph. But for BMM ops where weights are not required to be constant, current code does an expensive reshape/VNNI packing for all BMM weights.
This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights.
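A small sketch of the two-row interleave (an illustration of a VNNI-style bf16 layout under the assumptions stated in the comments, not the microkernel code itself):
```python
import torch

K, N = 4, 3  # hypothetical block shape; K is assumed even for pairwise packing
w = torch.arange(K * N, dtype=torch.bfloat16).reshape(K, N)

# Interleave consecutive pairs of K-rows so the two K-values feeding the same
# output column sit next to each other: row i of `packed` holds
# [w[2i,0], w[2i+1,0], w[2i,1], w[2i+1,1], ...].
packed = w.reshape(K // 2, 2, N).permute(0, 2, 1).reshape(K // 2, 2 * N)
print(packed)
```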
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
- Fixes #146814
Change
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__name__
```
to
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__qualname__
```
to avoid entries with the same key string overwriting each other.
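A minimal illustration of why `__qualname__` avoids the collision for nested classes:
```python
class OuterA:
    class Inner:
        pass

class OuterB:
    class Inner:
        pass

# __name__ is identical for both nested classes, so a dict keyed on it would
# silently overwrite one entry; __qualname__ keeps the keys distinct.
print(OuterA.Inner.__name__, OuterB.Inner.__name__)          # Inner Inner
print(OuterA.Inner.__qualname__, OuterB.Inner.__qualname__)  # OuterA.Inner OuterB.Inner
```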
A test is also added.
```
python test/test_serialization.py TestSerialization.test_serialization_nested_class
```
- Fixes #146886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815
Approved by: https://github.com/mikaylagawarecki
Summary:
Skip adding the unrecognized option optimize("-fno-tree-loop-vectorize") when building with clang.
This piece of code began to be compiled after armv9a was set as the default compilation profile.
Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee aimp_697790515.log
Reviewed By: andrewjcg
Differential Revision: D69947027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556
Approved by: https://github.com/janeyx99
Summary: When we print the addr we append an "s" or a "b" to the beginning of the addr. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up.
Test Plan: CI
Differential Revision: D69828538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461
Approved by: https://github.com/zdevito
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight-only mm in the new path against int8 mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There are a couple of reasons for the speedup:
- Prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- Similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to instead of the deferred epilogue tuning in the current path.
It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.
The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, but not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). However, the current impl does not compare against a cublas baseline, and I found that it makes things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100), so this should be fine to delete.
Future optimizations could include:
- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable. For now revert that change until it can be dealt with on its own.
Test Plan:
failures from D69920347 pass locally
unit tests pass
Reviewed By: oulgen
Differential Revision: D69936518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536
Approved by: https://github.com/jeanschmidt
Summary:
Seeing errors like the following when testing sigmoid for inline_cvr and perevent_cvr models.
```
terminate called after throwing an instance of 'c10::Error'
what(): forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let empty dict pass type check.
Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
--loadMode=BenchmarkAB \
--inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
--otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
--moduleName=${module} \
--submodToDevice "" \
--benchmarkDontRebatchSamples=true \
--sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```
Reviewed By: yjhao
Differential Revision: D69871393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`
Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case.
Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage
N6476188 --> able to save and load tensor in hf format
Differential Revision: D68444967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error
"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note.
Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support.
https://github.com/pytorch/pytorch/issues/145570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265
Approved by: https://github.com/atalman
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.
This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical.
This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.
https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
# Motivation
This PR intends to enable quantized fusion `qlinear+add` at Intel GPU backend.
At the backend level, we register the op via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are the ones already defined for `x86InductorQuantizer`.
At the Inductor level, we make a small modification to `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the existing code in x86InductorQuantizer.
# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qlinear_add_xpu
```
# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose is collected from UT. We can see the attribute ` attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu`, the post add and ReLU is successfully fused on GEMM computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189
Co-authored-by: guangyey <guangye.yu@intel.com>
This PR corrects the behavior of the TunableOp warmup variables:
```
PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
```
See the updated comments, which describe how the environment variables are intended to work. Previously, if you set only one of the two environment variables, the warmup iters would always be zero.
Manually tested the four possible combinations to make sure things still behave as intended.
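A usage sketch (the two warmup variables are the ones described above; the enable flag name is an assumption and is shown only to make the snippet self-contained):
```python
import os

# Configure before the first tuned GEMM. With this fix, setting only one of
# the two warmup limits no longer silently results in zero warmup iterations.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"  # assumed enable flag
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS"] = "3"
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS"] = "30"

import torch

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c = a @ b  # candidate kernels are warmed up within the limits above before timing
```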
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412
Approved by: https://github.com/jeffdaily
Summary: There are ~260 tests covering all the corner cases of export in test_export.py; we utilize them to test sigmoid in the OSS setting.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid
Differential Revision: D69937387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
#143063 missed handling a couple of UCS cases and had some bugs in the way it dealt with errors.
- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D69755123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496
Approved by: https://github.com/Skylion007
Summary:
Continuing the work from https://github.com/pytorch/pytorch/pull/146427
Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality:
```python
import torch
# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding
x2 = x1.to(torch.float32) # 2 ** exponent
# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)
# printing
print(x0)
```
Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works
For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)
Test Plan:
```
pytest test/quantization/core/experimental/test_float8.py -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.
Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.
I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.
Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Summary:
This PR adds a _is_script_object method to differentiate ScriptModule and ScriptObject; the former inherits from ScriptObject in C++, so they both pass the isinstance(obj, torch.ScriptObject) check.
The qualified name of a ScriptObject (i.e. a custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base.
Test Plan: Add new test.
Differential Revision: D69685316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399
Approved by: https://github.com/yushangdi
As title.
Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.
Also, this diff only covers *static* methods of torchbind *attributes*. Some cases that are not supported/tested:
- dynamic torchbind objects
- torchbind objects as an input to the module.
Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.
Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):
```python
async_compile.wait(globals())
del async_compile
def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
Our three main users are OK with this, with two of them (foreach_map, invoke_quant) preferring it like this.
I was originally worried about BC issues (this now means you cannot add any positional args) but I think that's not a concern -- one can always add kwonly args.
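A generic illustration of the backward-compatibility point (plain Python, not the HOP code itself):
```python
def hop(fn, *, new_option=None):
    # New behavior is reachable only through the keyword-only parameter,
    # so existing positional call sites are unaffected.
    result = fn()
    return result if new_option is None else result + new_option

print(hop(lambda: 1))                 # 1  -- old call sites keep working
print(hop(lambda: 1, new_option=41))  # 42 -- opt in via the kw-only arg
```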
Test Plan
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
# Motivation
This PR intends to enable quantized fusion `qconv+add` and `qconv+add+relu` at Intel GPU backend.
At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined for `x86InductorQuantizer`.
At the Inductor level, we make a small modification to `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the existing code in x86InductorQuantizer.
# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qconv2d_add_xpu \
-k test_qconv2d_add_relu_xpu 2>&1
```
# Runtime exemplification
Following is the oneDNN verbose collected from UT
```bash
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168
ghstack dependencies: #133307
Co-authored-by: guangyey <guangye.yu@intel.com>
Reland of https://github.com/pytorch/pytorch/pull/146877
incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185
Summary:
I think this is a change in the right direction.
Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.
However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.
I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" failures. This is probably due to using the right template?
Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment
with my change
https://www.internalfb.com/intern/paste/P1729604119/
without my change
https://www.internalfb.com/intern/paste/P1729624806/
Differential Revision: D69825865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434
Approved by: https://github.com/ColinPeppler
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land.
Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.
Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.
Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).
```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```
Differential Revision: D69625112
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
The original MatchState type was declared as a Python Enum. Although we did make it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which is not supported by Python enums. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the FR analysis script.
Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.
Test Plan: added unit tests
Differential Revision: D69813991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
This infrastructure has been up for a while so add a workflow to actually run things on it.
> [!IMPORTANT]
> We only have **14** linux.aws.h100 runners, so it might be beneficial for us to pare this list down.
> Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868
Approved by: https://github.com/eellison, https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
Triton introduced checks for bitcasts where the value being cast does not fit into the target type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type used for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare-and-swap-with-index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed, assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301).
The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list.
After the migration, the hooks for loading ops from onnx script will be removed.
1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts
2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure
3. Update the registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396
Approved by: https://github.com/titaiwangms
Summary: Disable warnings on unused command line arguments for ukernels_asm.
Test Plan:
On top of D69602077:
```
$ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac
```
Differential Revision: D69807977
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411
Approved by: https://github.com/kimishpatel
The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we silently fall back to the C++ reducer with no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.
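A sketch of the configuration in question (the `optimize_ddp` knob name is an assumption based on current `torch._dynamo.config`; the always-on behavior is what this PR describes):
```python
import torch
import torch._dynamo

# Assumed spelling of the "python_reducer" config. With this change it takes
# effect whenever it is set, even if the DDP forward itself is never compiled.
torch._dynamo.config.optimize_ddp = "python_reducer"

model = torch.nn.Linear(8, 8)
# ddp = torch.nn.parallel.DistributedDataParallel(model)  # requires an initialized process group
```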
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
Summary: Add support for inputs that no longer exist in `input_fields` but are not actually used by the original program. In this case, we just give them a dummy input based on the node's metadata.
Test Plan: Verified for S488841
Differential Revision: D69328093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
Summary: When testing, I tried to pass a string argument to the FileSystem class' methods, which is a valid input, but the cast() that converted the string to a path wasn't working as likely expected, leading all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.
Test Plan: N6475361 methods don't throw an error with a string arg like they were previously
Differential Revision: D68713937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
Found in `_check_dynamic_shapes` that int and None are valid input types for dynamic_shapes.
This PR adds support for these two types and adds tests to guard the sync between the ONNX flatten logic and the one in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407
Approved by: https://github.com/justinchuby
Fixes #145775
This is the first step in introducing a "strict" mode where we don't silently specialize and don't silently graph break. At a high level, when we do mark_unbacked(..., strict=True), anytime we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value.
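A usage sketch (the import path is an assumption; `strict=True` is the flag this PR adds):
```python
import torch
from torch._dynamo.decorators import mark_unbacked  # assumed location

x = torch.randn(8, 4)
# Treat dim 0 as unbacked; with strict=True any specialization of that
# dimension raises loudly instead of silently pinning it to one value.
mark_unbacked(x, 0, strict=True)

@torch.compile(fullgraph=True)
def f(t):
    return t * 2

f(x)
```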
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333
Approved by: https://github.com/laithsakka
This adds a strict mode `TORCHDYNAMO_UNBACKED_STRICT` to prevent graph breaking when we guard on data-dependent expressions. This is a better UX for those who are actively trying to make their model more dynamic, but aren't close enough to full graph to use that flag directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342
Approved by: https://github.com/laithsakka
Summary:
There are 2 issues:
- `skip_folding_node_fn` isn't considered when propagating constant values. So given a skipped node with constant inputs, it outputs a constant, its users can then output constant values, and they are included in the constant graph. However, the skipped node itself is not included when extracting the constant graph. This issue is fixed by checking for skipped nodes when propagating constant values and making skipped nodes output an unknown value (not a constant), so that their users cannot output constants.
- The `fba_linear` op can be included in the constant graph, but it is not implemented for CPU, so the constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`.
- A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops).
Reviewed By: StellarrZ
Differential Revision: D68716393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948
Approved by: https://github.com/zhxchen17
In Fusion models, users might change the state_dict keys via a state_dict_hook.
The load_state_dict APIs here don't call model.state_dict(), so the hooks won't be called to change the keys, causing a mismatch between FQNs and state_dict keys.
This PR suggests that users declare how they change the state_dict key prefix via an attribute (named "fqn_modifiers" by default).
During state_dict loading, we apply the prefix change while computing the FQN so that the keys can be processed the same way as through the state_dict hook.
For example:
There's a state_dict_hook:
```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```
In DSD (distributed state dict) after this PR, we skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New ROCm versions can be added by creating a new make target with the next desired version. For ROCm version N.n, the target should be named `magma-rocmNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_ROCM) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX"#Ripping out 5.0 and 6.0 due to ld error
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"#removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
- Don't compare indices of max/min etc, because that avoids the above requirement
- If comparing eager and torch.compile at fp16/bf16, you should use fp32 as baseline
- When comparing eager and torch.compile, use a higher precision result as a baseline. `torch._dynamo.utils.same` with fp64_ref will handle this comparison.
- Ensure rng state used to compare results is equivalent. Use `torch._inductor.config.fallback_random=True` and reset the torch rng seed between comparisons