pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-28 02:04:53 +08:00

Author	SHA1	Message	Date
Michael Lazos	98125eba66	[Hierarchical Compile] Handle autocast ctx manager	2025-04-04 15:18:39 -07:00
Michael Lazos	7766c61783	[Hierarchical Compile] Fix small bug	2025-04-04 15:18:39 -07:00
Michael Lazos	21aae3f096	Disable optimizer and enable graph deduplication	2025-04-04 15:18:39 -07:00
Ankita George	861d2cc02c	Add a param for save format in Storage Writer (#150025 ) Summary: add a param to specify to the storage writer how to save tensors. Right now the only options are safetensors and torch.save. Test Plan: (lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569 Network: Up: 32KiB Down: 1.9KiB (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422) Executing actions. Remaining 0/4 3.4s exec time total Command: test. Finished 2 local Time elapsed: 19.6s Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0 Reviewed By: saumishr Differential Revision: D70271943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025 Approved by: https://github.com/saumishr	2025-04-04 17:52:53 +00:00
Eric Griffith	c53bc616d5	caffe2: Fix lint errors in native/xnnpack/Linear.cpp (#150508 ) Summary: See title Test Plan: Sandcastle Differential Revision: D72275403 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150508 Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/cyyever	2025-04-04 17:14:43 +00:00
PyTorch MergeBot	c93e34d7b5	Revert "bound sympy accuracy (#150383 )" This reverts commit 1bc2b2b12ae1ddd27b0401a1baac3b8099b6fc50. Reverted https://github.com/pytorch/pytorch/pull/150383 on behalf of https://github.com/laithsakka due to big regression ([comment](https://github.com/pytorch/pytorch/pull/150383#issuecomment-2779227548))	2025-04-04 16:26:00 +00:00
PyTorch MergeBot	f443035f10	Revert "[cuda] Add new faster gammabeta backward kernel (#148605 ) (Reapply with launch bounds) (#150625 )" This reverts commit c6defa9443d241dd7a0baac4e708b6e906bd012c. Reverted https://github.com/pytorch/pytorch/pull/150625 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/150625#issuecomment-2779183414))	2025-04-04 16:05:18 +00:00
Zhengxu Chen	07d439e782	[aoti] Split ConstantType definition out of model.h (#150545 ) Summary: Splitting the type definition of ConstantType into a separate header because it's needed by Sigmoid OSS but the entire model.h header include cause the following compilation error: ``` 2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o 2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib 
-I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. -I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/ 2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5: 2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6: 2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5: 2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration] 2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) { ``` model.h defines RAII_malloc functions directly into anonymous namespace which seems pretty sad. we should do something about it but may not in the current diff. Test Plan: CI Differential Revision: D72320413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545 Approved by: https://github.com/desertfire	2025-04-04 15:48:45 +00:00
Yuanhao Ji	1b0a023dde	[Dynamo][Misc] Apply typing hints for `codegen` (#150289 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/150289 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-04-04 14:26:22 +00:00
Davide Italiano	295b7e21eb	[MPS/inductor] Add support for hermite_polynomial_h. (#150664 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150664 Approved by: https://github.com/malfet	2025-04-04 13:14:52 +00:00
Eddie Yan	09c4da9325	[CUDA][avgpool2d] Fix backward launch bounds again for `sm100`, `sm120` (#150640 ) `__CUDA_ARCH__` is not visible in host code, which causes incorrect launch bounds and `too many resources requested for launch` on Blackwell. CC @atalman @malfet as we would want this in 2.7 @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/150640 Approved by: https://github.com/malfet, https://github.com/drisspg, https://github.com/atalman	2025-04-04 13:05:40 +00:00
Jakub Grzybek	73358d37da	Fix codegen, change str comparison operator to == for proper equality … (#150611 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150611 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-04-04 09:59:59 +00:00
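The entry above gives only the commit title, so what follows is a sketch of the general pitfall it names, not the actual diff. It assumes the original comparison relied on object identity (`is`) where value equality (`==`) was intended; the helper `make_name` and the sample strings are hypothetical.

```python
# Illustration only: identity vs. equality for strings built at runtime.
# `make_name` and the sample values are hypothetical, not taken from the PR.

def make_name(parts):
    # Joining at runtime produces a new str object.
    return "".join(parts)

a = make_name(["at", "en"])
b = "aten"

# `==` compares the contents of the two strings.
equal_by_value = a == b

# `is` compares object identity; whether two equal strings share one object
# is an interpreter implementation detail and should not be relied on.
same_object = a is b
```

Only `equal_by_value` is guaranteed to be true here, which is why `==` is the safe choice for string comparison.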
PyTorch MergeBot	4854926aeb	Revert "Add torch._scaled_mm for CPU (#150410 )" This reverts commit 3b02f795c5ad2339794b15b370c0e4a235d36adf. Reverted https://github.com/pytorch/pytorch/pull/150410 on behalf of https://github.com/malfet due to It breaks ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/150410#issuecomment-2777704212))	2025-04-04 06:52:54 +00:00
PyTorch UpdateBot	f3cb3557d6	[executorch hash update] update the pinned executorch hash (#149817 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149817 Approved by: https://github.com/pytorchbot	2025-04-04 05:21:44 +00:00
Yuanhao Ji	98d06b401b	[Dynamo] Fix `dict.items()` return type (#150112 ) Fixes #150110 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150112 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-04-04 04:32:13 +00:00
PyTorch UpdateBot	e6e1f8c272	[audio hash update] update the pinned audio hash (#150589 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150589 Approved by: https://github.com/pytorchbot	2025-04-04 04:29:45 +00:00
Pian Pawakapan	c6d79c163c	[dynamic shapes] allow duck typing for 0/1 (#150222 ) Fixes #150184 e.g. for config.backed_size_oblivious=True and compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/150222 Approved by: https://github.com/laithsakka	2025-04-04 03:24:46 +00:00
Aby Mathew C	7df6f930e8	Adapt test_misc.py for HPUs (#149499 ) This PR is related to https://github.com/pytorch/pytorch/pull/145476 . That PR had two files (test_functions.py and test_misc.py). test_functions.py was causing CI/rebase/merge issues and has hence been removed for now. This PR contains only test_misc.py. This is a continuation of https://github.com/pytorch/pytorch/pull/144387 . ## MOTIVATION We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices. Other accelerators can also extend the functionality by adding the device to the devices list (e.g., xpu). ## CHANGES Create a separate class for test functions running on CUDA devices. Extend the functionality of these tests to include HPUs. Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes. Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices. PS: Most of these changes were initially part of https://github.com/pytorch/pytorch/pull/147609 , but that PR was closed due to merge conflicts. The review comments were handled in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149499 Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever	2025-04-04 02:47:43 +00:00
Scott Wolchok	ed0fd2fa7a	clang-format aten/src/ATen/cpu/vec/*.h (#150426 ) I got a complaint about indentation on #150380. Make the machines fix it for us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150426 Approved by: https://github.com/aditew01, https://github.com/cyyever, https://github.com/frost-intel, https://github.com/Skylion007	2025-04-04 02:41:11 +00:00
fduwjj	bd9c42ebfb	[c10d] Surface error type when we unlink and create named pipe for DumpPipe (#150648 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150648 Approved by: https://github.com/fegin, https://github.com/kwen2501	2025-04-04 02:12:32 +00:00
Lucas Kabela	a9e2f22405	[Bugfix] Fix compile error with `torch.Tensor.unsqueeze_` and inplace views called from Tensor Class (#150573 ) Fixes #129673 ### Summary: Modifying a tensor by reshaping in place (such as `unsqueeze_`) should cause a graph break; however, when accessed through `torch.Tensor` api as opposed to as self attribute caused the code to crash with an error (see attached issue) Paths differed when traced due to the stack variable popped, as: * `self.unsqueeze_` pops a `LazyVariableTracker` which gets resolved to `TensorVariable`, so when looking for the method, triggers the fn call `var_getattr` in `_dynamo/variables/tensor.py`; since this is an inplace view (metadata mutation) on graph input, it is not well supported so should fall back (see [L446](`1017927c83/torch/_dynamo/variables/tensor.py (L446)`) in that file) * `torch.Tensor.unsqueeze` pops a `UserDefinedClassVariable` so when looking for the method, triggers the fn call `var_getattr` in `_dynamo/variables/user_defined.py` on [L273](`a8f6b40e36/torch/_dynamo/variables/user_defined.py (L273)`). This path tries to build a variable tracker from the obj popped, which resolves to a trace_rule , and as a Tensor method, is resolved to `TorchInGraphFunctionVariable` on [L3767](`a8f6b40e36/torch/_dynamo/trace_rules.py (L3767)`) So, one straightforward option is to check if the fn is an inplace_view on a input tensor in `torch.py` when we resolve the `__call__function` for the `TorchInGraphFunctionVariable` instead, which resolves the bug by providing a graph break ### Test ``` pytest test/dynamo/test_functions.py::FunctionTests::test_unsqueeze_inplace ``` Results in ``` Running 1 items in this shard test/dynamo/test_functions.py . 
[100%] =========================================================================================== 1 passed in 9.16s ========================================================================================== ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150573 Approved by: https://github.com/anijain2305	2025-04-04 01:58:34 +00:00
James Wu	1979a409e9	Make CompileEventLogger more defensive w.r.t. AOTAutogradCache and FXGraphCache (#150423 ) This PR makes it so that we don't crash due to logging if we invoke AOTAutogradCache/FXGraphCache without using dynamo. This is preparation for supporting certain VLLM use cases where they store graph modules and have special handling in conjunction with the caches. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150423 Approved by: https://github.com/oulgen	2025-04-04 01:55:13 +00:00
Laith Sakka	f9f6c080d8	support guard or false/true in user code and add tests (#150178 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150178 Approved by: https://github.com/pianpwk	2025-04-04 01:19:14 +00:00
Nichols A. Romero	d0026fa138	[ROCm][TunableOp] Fix UT race condition and reduce UT duration. (#150463 ) This PR fixes two race conditions that occur when UT tests are run: - In a particular order within a single shard. - Concurrently in multiple shards. Each test now gets a unique filename that depends on the test name. There were two other minor improvements to the UTs: - matmul_offline_mgpu could occasionally fail if run on 8 GPUs. Criteria was relaxed. - bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful. Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463 Approved by: https://github.com/jeffdaily	2025-04-04 01:12:03 +00:00
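The race-condition fix described above ("each test now gets a unique filename that depends on the test name") can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the helper `results_filename_for_test`, the `tunableop_results_` prefix, and the `.csv` suffix are all hypothetical.

```python
import os
import re
import tempfile


def results_filename_for_test(test_name: str, directory: str) -> str:
    """Hypothetical helper: derive a per-test results file path from the test name."""
    # Replace characters that are awkward in filenames so any test id is usable.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", test_name)
    return os.path.join(directory, f"tunableop_results_{safe}.csv")


tmp_dir = tempfile.gettempdir()
# Test names taken from the commit message; the paths themselves are illustrative.
path_a = results_filename_for_test("bmm_tunableop_rocm", tmp_dir)
path_b = results_filename_for_test("matmul_offline_mgpu", tmp_dir)
```

Because the path is a function of the test name, two tests running in the same shard or in concurrent shards no longer read or overwrite each other's file.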
Avik Chaudhuri	1bc2b2b12a	bound sympy accuracy (#150383 ) Differential Revision: D72215735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150383 Approved by: https://github.com/pianpwk	2025-04-04 00:15:32 +00:00
PyTorch MergeBot	b0e28f60df	Revert "add unit test for preferred_blas_library settings (#150581 )" This reverts commit 781d28e2655f88ae2fef827ed110f22ed553a0ab. Reverted https://github.com/pytorch/pytorch/pull/150581 on behalf of https://github.com/clee2000 due to new test broken internally D72395624 ([comment](https://github.com/pytorch/pytorch/pull/150581#issuecomment-2777228731))	2025-04-03 23:51:49 +00:00
Yanan Cao (PyTorch)	1ab6c4ff04	[Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595 ) internal diff: D71497480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149595 Approved by: https://github.com/Skylion007	2025-04-03 23:50:13 +00:00
Zhao Zhu	8878289f89	[aten] 8 bytes aligned vector loads for bf16 and fp16 dtypes in torch.cat (#150233 ) Enable aligned vector loading for 2 bytes datatypes in torch.cat. Specifically: 1. reduce the vector length to 8 bytes for 2-byte types (fp16, bf16 etc) 2. enable through a conditional template The reason why 8-byte vector loading was chosen for fp16 and bf16: 16-byte load results in heavier register overheads (i.e. 4 register per load for fp32 -> 8 register per load for fp16). Therefore, to employ the benefits of vectorized loading, we reduced ALIGNED_VEC_LOAD_BYTES to 8 for fp16 and bf16 ### perf testing: before: ``` torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32: B pt_eager copy 0 100.0 0.022621 0.036162 1 1000.0 0.133616 0.207051 2 10000.0 1.326848 1.848768 3 20000.0 2.744544 3.692128 torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16: B pt_eager copy 0 100.0 0.022434 0.035477 1 1000.0 0.140608 0.144518 2 10000.0 1.303792 1.229584 3 20000.0 2.668288 2.436160 ``` after: ``` torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32: B pt_eager copy 0 100.0 0.022608 0.036328 1 1000.0 0.133861 0.207399 2 10000.0 1.325120 1.847136 3 20000.0 2.726528 3.693184 torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16: B pt_eager copy 0 100.0 0.019942 0.035482 1 1000.0 0.084858 0.144544 2 10000.0 0.924384 1.230672 3 20000.0 1.944448 2.436480 ``` ### bw analysis: bw on fp16/bf16 got increased by 40%-50% for large tensors before: ``` Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long\|869.87\|1382.74\|1956.46\|1952.73\|1969.03\|1963.66 Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long\|568.43\|926.53\|1589.20\|1567.52\|1771.54\|1783.68 Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long\|752.07\|1269.50\|1894.86\|1900.85\|1954.10\|1955.08 Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long\|807.08\|1354.69\|1960.48\|1962.45\|1972.73\|1973.85 Bandwidth (GB/s) for ((134217728,), 0) 
int8;fp16;fp32;int32;fp64;long\|864.02\|1398.02\|1963.43\|1955.32\|1963.37\|1969.96 ``` after: ``` Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long\|873.08\|1892.16\|1954.35\|1962.51\|1962.03\|1965.98 Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long\|575.13\|1242.45\|1576.37\|1571.30\|1769.94\|1790.22 Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long\|742.92\|1734.57\|1887.99\|1897.62\|1940.99\|1959.25 Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long\|802.60\|1865.45\|1952.64\|1947.53\|1974.47\|1973.48 Bandwidth (GB/s) for ((134217728,), 0) int8;fp16;fp32;int32;fp64;long\|865.32\|1939.07\|1965.72\|1963.25\|1969.06\|1968.72 ``` ### Perf testing code: ``` # pyre-strict from typing import List, Optional, Tuple import click import pandas as pd import torch # @manual=//triton:triton import triton # CUDA_VISIBLE_DEVICEs=7 buck2 run @mode/opt //scripts/zhaozhu:cat_bench @click.command() @click.option("--data-type", type=str, default="bf16") @click.option("--return-result", type=bool, default=False) def main( data_type: str, return_result: bool, ) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]: torch.backends.cudnn.allow_tf32 = True torch.backends.cuda.matmul.allow_tf32 = True if data_type == "fp32": dtype = torch.float32 elif data_type == "fp16": dtype = torch.float16 elif data_type == "bf16": dtype = torch.bfloat16 else: raise ValueError(f"Unsupported data type: {data_type}.") D1 = int(torch.randint(low=10000, high=50000, size=(1,)).item()) D2 = int(torch.randint(low=100, high=1000, size=(1,)).item()) D3 = int(torch.randint(low=500, high=1000, size=(1,)).item()) configs: List[triton.testing.Benchmark] = [ triton.testing.Benchmark( x_names=["B"], x_vals=[100, 1000, 10000, 20000], line_arg="provider", line_vals=["pt_eager", "copy"], line_names=["pt_eager", "copy"], styles=[("blue", "-"), ("green", "-"), ("red", "-")], ylabel="ms", 
plot_name=f"torch-cat-D1-{D1}-D2-{D2}-D3-{D3}-dtype-{dtype}", args={ "D1": D1, "D2": D2, "D3": D3, "dtype": dtype, }, ) ] @triton.testing.perf_report(configs) def bench_cat( B: int, D1: int, D2: int, D3: int, dtype: torch.dtype, provider: str, ) -> float: warmup = 10 rep = 3 tensors = [] a = torch.empty( # (B, 30108), (B, D1), dtype=dtype, device=torch.device("cuda"), ).uniform_(-1.0, 1.0) b = torch.empty( # (B, 624), (B, D2), dtype=dtype, device=torch.device("cuda"), ).uniform_(-1.0, 1.0) c = torch.empty( # (B, 772), (B, D3), dtype=dtype, device=torch.device("cuda"), ).uniform_(-1.0, 1.0) tensors = [a, b, c] total_cols: int = int(a.shape[1] + b.shape[1] + c.shape[1]) def torch_copy( tensors: List[torch.Tensor], is_inplace: bool = True ) -> torch.Tensor: f = torch.zeros([B, total_cols], dtype=dtype, device=torch.device("cuda")) col_idx = 0 for t in tensors: temp = f[:, col_idx : col_idx + t.shape[1]] if is_inplace: temp.copy_(t) else: f[:, col_idx : col_idx + t.shape[1]] = t col_idx += t.shape[1] return f def torch_cat(tensors: List[torch.Tensor]) -> torch.Tensor: return torch.cat(tensors, dim=1) ref = torch_cat(tensors) real = torch_copy(tensors, is_inplace=False) torch.testing.assert_allclose(ref, real) if provider == "pt_eager": fn = lambda: torch_cat(tensors) # noqa E731 ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep) return ms elif provider == "stack": def torch_stack(tensors: List[torch.Tensor]) -> torch.Tensor: return torch.stack(tensors, dim=1).view(-1, total_cols) fn = lambda: torch_stack(tensors) ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep) return ms elif provider == "copy": fn = lambda: torch_copy(tensors) ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep) return ms else: raise ValueError(f"unsupported provider: {provider}") df = bench_cat.run(print_data=True, return_df=return_result) if return_result: return configs, df if __name__ == "__main__": main() ``` and bw analysis code is from: 
https://github.com/pytorch/pytorch/pull/102815?fbclid=IwZXh0bgNhZW0CMTEAAR1Rwclp_O1fknl1Litpm9GeY0ZZZovdCv8_kQfGf6Zy8LaoP9JhO0ZsutM_aem_BPCZEZda5OOMnzI9Mrlapg#issue-1737409146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150233 Approved by: https://github.com/ngimel	2025-04-03 23:40:18 +00:00
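The register argument in the entry above comes down to simple arithmetic, sketched here in Python. It assumes the default vector load width is 16 bytes (inferred from the "16-byte load" wording) and that only 2-byte dtypes get the reduced 8-byte width; the function names are hypothetical and do not correspond to the C++ template in the PR.

```python
# Assumed widths, inferred from the commit message rather than read from the source.
DEFAULT_VEC_LOAD_BYTES = 16
HALF_PRECISION_VEC_LOAD_BYTES = 8


def aligned_vec_load_bytes(itemsize: int) -> int:
    """Pick the vector load width in bytes for an element of `itemsize` bytes."""
    return HALF_PRECISION_VEC_LOAD_BYTES if itemsize == 2 else DEFAULT_VEC_LOAD_BYTES


def elements_per_load(itemsize: int) -> int:
    """Number of elements moved by one vectorized load."""
    return aligned_vec_load_bytes(itemsize) // itemsize


fp32_elems = elements_per_load(4)                # 16-byte load of 4-byte elements
bf16_elems = elements_per_load(2)                # 8-byte load of 2-byte elements
bf16_elems_at_16 = DEFAULT_VEC_LOAD_BYTES // 2   # what a 16-byte load would need
```

Under these assumptions a 16-byte load of bf16 would carry 8 elements against 4 for fp32, which matches the "4 register per load" versus "8 register per load" figures quoted in the message; the 8-byte width brings 2-byte dtypes back to 4 elements per load.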
Henry Hu	5cf3029503	Remove unused rand call if not fallback to eager for rand (#147790 ) Fixes #147171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147790 Approved by: https://github.com/eellison	2025-04-03 23:27:03 +00:00
William Wen	118e3862bc	[dynamo] disable new test_assert_failure_in_generic_ctx_mgr internally (#150631 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150631 Approved by: https://github.com/clee2000 ghstack dependencies: #150471	2025-04-03 23:08:25 +00:00
Tovly Deutsch	a2dce42654	Split up cub-RadixSortPairs.cu to parallelize compilation (#148936 ) Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes. Differential Revision: D70539649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936 Approved by: https://github.com/suo, https://github.com/eqy, https://github.com/malfet	2025-04-03 23:04:21 +00:00
Jane Xu	c0618a3957	Update commitlist.py instructions for the GitHub repo regime (#149535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149535 Approved by: https://github.com/albanD	2025-04-03 22:43:00 +00:00
Richard Howell	76994d48f4	[pytorch] add experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150537 ) Summary: Add an experimental feature to defer pytorch library initialization cost to post startup. As noted, this feature is not thread safe; it requires the client to maintain thread safety at library load time. Reviewed By: zou3519 Differential Revision: D71917841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150537 Approved by: https://github.com/zou3519	2025-04-03 22:36:17 +00:00
Jeff Daily	9e55dae2a6	CUDA CachingHostAllocator tracks registrations to call correct free (#146520 ) Allocations using cudaHostRegister should use corresponding cudaHostUnregister and similarly for cudaHostAlloc / cudaFreeHost. In test_cuda.py, the allocator config will change from test to test but the cache is not emptied prior to changing the config. This results in the wrong free being called later. Unit test sharding is avoiding this issue, but running the test_cuda.py with a single shard will fail. The following reproducer demonstrates the problem. ```C++ int main(int argc, char **argv) { void *ptr; assert(cudaSuccess == cudaHostAlloc(&ptr, 1024, cudaHostAllocDefault)); assert(cudaSuccess == cudaHostUnregister(ptr)); std::free(ptr); return 0; } ``` The above code results in the following failure because the ptr is an invalid argument to cudaHostUnregister. ``` a.out: test.cpp:53: int main(int, char**): Assertion `cudaSuccess == cudaHostUnregister(ptr)' failed. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146520 Approved by: https://github.com/ngimel	2025-04-03 22:33:48 +00:00
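The idea in the entry above, remembering how each host pointer was obtained so that the matching release call is used later, can be modelled in a few lines of Python. This is a bookkeeping sketch only, not the allocator's C++ implementation: `HostAllocKind` and `HostAllocationTracker` are hypothetical names, and the CUDA function names appear purely as strings.

```python
from enum import Enum


class HostAllocKind(Enum):
    ALLOC = "cudaHostAlloc"        # pinned memory allocated by the CUDA runtime
    REGISTER = "cudaHostRegister"  # existing host memory pinned after the fact


# Each acquisition path has exactly one matching release call.
RELEASE_CALL = {
    HostAllocKind.ALLOC: "cudaFreeHost",
    HostAllocKind.REGISTER: "cudaHostUnregister",
}


class HostAllocationTracker:
    """Hypothetical tracker: records the kind per pointer at allocation time."""

    def __init__(self):
        self._kind_by_ptr = {}

    def record(self, ptr: int, kind: HostAllocKind) -> None:
        self._kind_by_ptr[ptr] = kind

    def release_call_for(self, ptr: int) -> str:
        # Decide from the recorded kind, not from whatever config is active now.
        kind = self._kind_by_ptr.pop(ptr)
        return RELEASE_CALL[kind]


tracker = HostAllocationTracker()
tracker.record(0x1000, HostAllocKind.ALLOC)
tracker.record(0x2000, HostAllocKind.REGISTER)
first = tracker.release_call_for(0x1000)
second = tracker.release_call_for(0x2000)
```

Keying the decision on the recorded kind means a config change between tests cannot cause a cached block to be released through the wrong API.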
Ahmad Sharif	c6defa9443	[cuda] Add new faster gammabeta backward kernel (#148605 ) (Reapply with launch bounds) (#150625 ) # Changes over the previous PR This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel. Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266. This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added: ``` git diff HEAD^ diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 0d63a2f979c..3ce2c24c18e 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -657,6 +657,7 @@ bool aligned_grid > __global__ void +__launch_bounds__(block_dim_x * block_dim_y) GammaBetaBackwardCUDAKernelTemplate( int64_t M, int64_t N, ``` I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg <details> <summary> Repro script that fails on Blackwell </summary> ``` import torch from torch.nn import init # from transformer_nuggets import init_logging # from transformer_nuggets.utils.benchmark import profiler # from pathlib import Path # init_logging() class PermuteModule(torch.nn.Module): def __init__(self, permutation): super(PermuteModule, self).__init__() self.permutation = permutation def forward(self, x:torch.Tensor) -> torch.Tensor: assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!" return x.permute(self.permutation) def test(n_layers:int, conv_stride:int): _sequence = [] for _ in range(n_layers): # Conv1d inputs are (N x C x L), LayerNorm expects ( x C). Dims must be permuted between modules. 
_sequence += [ PermuteModule((0,2,1)), torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False), PermuteModule((0,2,1)), torch.nn.LayerNorm(512), torch.nn.ReLU() ] model = torch.nn.Sequential(*_sequence).to(device="cuda") data = torch.randn((100,2048,512), device="cuda") out = model(data) loss = torch.nn.functional.mse_loss(out, torch.rand_like(out)) loss.backward() torch.autograd.set_detect_anomaly(True) print(f"Torch version: {torch.__version__}") # with profiler(Path("conv")): # # print(f"layers=1, stride=1") # # test(n_layers=1, conv_stride=1) # # print(f"layers=2, stride=1") # # test(n_layers=2, conv_stride=1) # # print(f"layers=1, stride=2") # # test(n_layers=1, conv_stride=2) # print(f"layers=2, stride=2") # test(n_layers=2, conv_stride=2) print(f"layers=2, stride=2") test(n_layers=2, conv_stride=2) # we will not reach this print statement. print("DONE.") ``` </details> I also re-ran my performance benchmark and found no regressions over the previous PR. # Full description of the old PR Original PR: https://github.com/pytorch/pytorch/pull/148605 This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way. To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables: 1. dtype in {half, float} 2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)` 3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)` 4. Whether we flush the L2 cache before running the backward pass Summary: The new code performs better than the old code, especially for powers of 2. For the M >> N case, it performs very well (the kernel itself can be 30x faster and the overall backward pass can be 5-10x faster). In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. 
The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the backward pass* being 1.42x faster than the old backward pass. Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old: M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.35 Elapsed Cycles cycle 27,526 Memory Throughput % 2.21 DRAM Throughput % 0.54 Duration us 20.42 L1/TEX Cache Throughput % 4.31 L2 Cache Throughput % 2.62 SM Active Cycles cycle 1,475.02 Compute (SM) Throughput % 0.29 ----------------------- ----------- ------------ ``` M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.34 Elapsed Cycles cycle 10,920 Memory Throughput % 5.64 DRAM Throughput % 1.35 Duration us 8.13 L1/TEX Cache Throughput % 1.92 L2 Cache Throughput % 6.89 SM Active Cycles cycle 3,554.41 Compute (SM) Throughput % 0.67 ----------------------- ----------- ------------ ``` Let's look at some rows from the heatmap. 
For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following: <img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" /> There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made: ![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738) For dtype=float32, we get a similar chart: <img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" /> The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension). The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough. I am including the regressions here for completeness' sake: <img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" /> To see this better: 1. Click the image 2. Right click the expanded image and open in a new tab 3. 
Go to that tab and left click once to zoom in If you want to see the full data, here it is: ![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1) I also measured binary size and compile time since those are important for developers: Binary size comparison ![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51) ``` # Original -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so # This PR -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so ``` The diff in bytes is 302kB which is about a 0.1% increase. Compile time difference: ``` # Original real 0m10.931s user 0m9.676s sys 0m1.004s # this PR real 0m16.720s user 0m15.514s sys 0m1.066s # Command I ran time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC 
-I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS 
-D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o ``` So the new PR is 6 seconds longer compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625 Approved by: https://github.com/ngimel	2025-04-03 22:07:43 +00:00
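The figures quoted in this commit message can be re-derived with a few lines of arithmetic. This is a minimal sketch that uses only numbers from the message above; the helper name `ceil_div` is an illustrative assumption (not a PyTorch function), and the old block size of 16 is assumed because it matches the 3 blocks quoted for N=33.

```python
import math


def ceil_div(n: int, block: int) -> int:
    """Number of blocks of size `block` needed to cover `n` elements."""
    return math.ceil(n / block)


# Kernel-only speedup from the two NCU "Duration" rows (microseconds).
old_us, new_us = 20.42, 8.13
kernel_speedup = old_us / new_us

# Wall-clock ("real") compile time for layer_norm_kernel.cu, in seconds.
compile_delta_s = 16.720 - 10.931

# N=33: block size 16 (assumed for the old code) versus threads.x = 32 (new code).
old_blocks = ceil_div(33, 16)
new_blocks = ceil_div(33, 32)

print(f"kernel speedup: {kernel_speedup:.2f}x")
print(f"extra compile time: {compile_delta_s:.3f}s")
print(f"blocks for N=33: old={old_blocks}, new={new_blocks}")
```

The kernel-only ratio comes out at roughly 2.5x, in line with the NCU comparison, while the heatmap's 1.42x reflects the total backward pass time as seen by the user.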
Andrey Talman	2abd81402f	[validations] Run nccl version check on Linux only (#150635 ) Follow-up to https://github.com/pytorch/pytorch/pull/150194 to disable the nccl version print on OSes other than Linux. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150635 Approved by: https://github.com/clee2000	2025-04-03 22:06:58 +00:00
Shangdi Yu	941090a791	Make sure torch.compiler._is_compiling_flag=True in aoti (#150588 ) Summary: See internal Diff summary Differential Revision: D72355449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150588 Approved by: https://github.com/angelayi	2025-04-03 22:02:29 +00:00
PyTorch MergeBot	5a654deb40	Revert "Enable C++ dynamic shape guards by default (#140756 )" This reverts commit c1d503529d23f33bc0819286df8d0ecbe31b559f. Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/isuruf due to new test test_runtime_checks_large hangs on CI ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2776979814))	2025-04-03 21:44:41 +00:00
Jason Ansel	d41c22b578	Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 )" (#150542 ) Reverts #148261 due to possible memory leak This reverts commit 5d4e7d58b42623a9024a84f0050967ff0318dcdb. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542 Approved by: https://github.com/clee2000	2025-04-03 21:15:38 +00:00
Svetlana Karslioglu	277369ac16	Move formulas on separate line in loss.py (#150565 ) Move formulas on separate line in loss.py for better readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150565 Approved by: https://github.com/mikaylagawarecki	2025-04-03 20:47:35 +00:00
Yiming Zhou	a3f9e04656	[export] Make aoti_call_delegate hop traceable (#148804 ) Summary: The `aoti_call_delegate` hop now uses a stateless `original_gm` for tracing with fake tensors and the OSS AOTI Runner for running with real tensors Differential Revision: D70738393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148804 Approved by: https://github.com/SherlockNoMad	2025-04-03 20:44:31 +00:00
Shangdi Yu	51da241c0a	[aoti] Fix cannot determine truth value of Relation error when propagating unbacked symint in lowering (#150570 ) Summary: Fix cannot determine truth value of Relation error when propagating unbacked symint in lowering Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts ``` Differential Revision: D72331070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150570 Approved by: https://github.com/angelayi, https://github.com/henryoier	2025-04-03 20:06:15 +00:00
Isuru Fernando	c1d503529d	Enable C++ dynamic shape guards by default (#140756 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756 Approved by: https://github.com/anijain2305 ghstack dependencies: #149149, #149197, #149211	2025-04-03 20:03:52 +00:00
Kai Londenberg	1843ad458d	[Inductor] Cache CUDA compilation errors (#149716 ) Summary: Add support for caching of CUDA (nvcc) compilation errors to codecache.py Test Plan: CI ( for example Cutlass backend unit tests ) Reviewed By: ColinPeppler Differential Revision: D71562040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149716 Approved by: https://github.com/ColinPeppler	2025-04-03 19:47:27 +00:00
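The idea of caching compilation errors (so a known-bad source is not recompiled on every lookup) can be sketched as follows. This is an illustration only, not the code in codecache.py: `CompileError`, `compile_cached`, and the in-memory dictionary are all hypothetical names chosen for the example.

```python
import hashlib


class CompileError(Exception):
    """Stand-in for a compiler failure (hypothetical, for illustration)."""


# Maps a source hash to either a successful result or the CompileError raised.
_cache: dict = {}


def compile_cached(source: str, compile_fn):
    """Compile `source` once; replay the cached result or the cached error."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    if key in _cache:
        hit = _cache[key]
        if isinstance(hit, CompileError):
            raise hit  # replay the failure without invoking the compiler again
        return hit
    try:
        result = compile_fn(source)
    except CompileError as err:
        _cache[key] = err  # remember the failure as well as successes
        raise
    _cache[key] = result
    return result
```

With this shape, a second request for the same failing source raises immediately instead of paying for another compiler invocation.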
Jiang, Yanbing	3b02f795c5	Add torch._scaled_mm for CPU (#150410 ) This PR duplicates https://github.com/pytorch/pytorch/pull/139975. It adds torch._scaled_mm for the CPU backend. _scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410 Approved by: https://github.com/atalman	2025-04-03 19:43:45 +00:00
ZhaoqiongZ	96f35f55e2	update get start xpu document for v2.7 (#150397 ) update get start xpu document for v2.7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150397 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-03 18:17:08 +00:00
Xilun Wu	78d1165d76	[DTensor][tp] fix errors in FSDP+TP checkpointing test (#150354 ) ## Summary remove the `tp_parallelize_plan` assignment that accidentally rewrites the previous assignments in `test_fsdp_dsd.py`. ## Test `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150354 Approved by: https://github.com/wconstab	2025-04-03 17:41:46 +00:00
FFFrog	5d36253a7d	Refactoring: fix the python constant check (#150608 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150608 Approved by: https://github.com/Skylion007	2025-04-03 17:33:45 +00:00
Jeff Daily	fa0fdc0cca	if blaslt fails, fall back to blas (#150147 ) Fixes #150016. This is implemented for both cublaslt and hipblaslt. gemm_and_bias on failure will fall back to unfused path. lt gemm on failure falls back to gemm even if gemm preference is set to lt. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147 Approved by: https://github.com/malfet	2025-04-03 16:18:59 +00:00
David Berard	5be5cfe4cb	[inductor][autotune cache] add torch_key() to configs hash (#150494 ) Summary: Context: https://github.com/pytorch/pytorch/pull/150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes. In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...`). And in versions of torch that don't include the WS diff, you generally don't expect to see this kwarg. But if a program is run with WS-torch (i.e. torch w/ the WS diff), and then later you run the same program with non-WS-torch, then non-WS-torch is going to find this autotune cache entry, and interpret `num_consumer_groups` as a kwarg, because there's no special handling for num_consumer_groups in this version of torch. Then the program crashes with a triton failure message. The fix: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues). Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but this doesn't repro in OSS because this version of warp specialization is not available in oss triton) can repro the failure, and the failure is fixed after this PR is patched. Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails). Differential Revision: D72285303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150494 Approved by: https://github.com/oulgen	2025-04-03 16:01:57 +00:00
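The fix described above (mixing a build-specific key into the cache hash so that entries written by one torch build are never read by another) can be illustrated with a small sketch. This is not the inductor implementation: `configs_hash` and the byte-string keys are hypothetical stand-ins for the real hashing code and for the value returned by `torch_key()`.

```python
import hashlib
import json


def configs_hash(configs: list, torch_key: bytes) -> str:
    """Hash autotune configs together with a build-specific key (hypothetical helper)."""
    h = hashlib.sha256()
    h.update(torch_key)  # any change in the build changes every cache key
    h.update(json.dumps(configs, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


# The same configs hashed under two different (made-up) build keys.
configs = [{"kwargs": {"XBLOCK": 64}, "num_warps": 4, "num_stages": 2}]
key_a = configs_hash(configs, b"torch-build-A")
key_b = configs_hash(configs, b"torch-build-B")

print(key_a != key_b)  # different builds never share a cache entry
```

The trade-off is that every change to the build invalidates the autotune cache, which is exactly the behavior the commit message asks for.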
Luca Wehrstedt	440c07e56a	Fix detection of GPU multicast (#150563 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150563 Approved by: https://github.com/kwen2501	2025-04-03 15:31:15 +00:00
angelayi	5314a6fe82	[export] Fix deserialization issue (#150515 ) An internal model was serialized in 2023, and is now breaking while loading with the following error: ``` File "<eval_with_key>.1675", line 4 def forward(self, arg1163_1, arg1164_1, , arg1166_1, , arg1168_1, arg1169_1, arg1170_1, , arg1172_1, arg1173_1, arg1174_1, arg1175_1, arg1176_1, arg1177_1, arg1178_1, arg1179_1, arg1180_1, arg1181_1, arg1182_1, arg1183_1, arg1184_1, arg1185_1, arg1186_1, arg1187_1, arg1188_1, arg1189_1, arg1190_1, arg1191_1, arg1192_1, arg1193_1, arg1194_1, arg1195_1, arg1196_1, arg1197_1, arg1198_1, arg1199_1, arg1200_1, arg1201_1, arg1202_1, arg1203_1, arg1204_1, arg1205_1, arg1206_1, arg1207_1, arg1208_1, arg1209_1, arg1210_1, arg1211_1, arg1212_1, arg1213_1, arg1214_1, arg1215_1, arg1216_1, , arg1218_1, arg1219_1, arg1220_1, arg1221_1, arg1222_1, arg1223_1, arg1224_1, , arg1226_1, arg1227_1, arg1228_1, , arg1230_1, , , , , , , , , , , , , , , ): ^ SyntaxError: invalid syntax ``` The syntax errors are due to inputs that are `None` when exporting. Prior to changes in https://github.com/pytorch/pytorch/pull/123590 (landed 4/2024), input specs for none inputs look like `InputSpec(userInput=UserInputSpec(arg=Argument(asNone=True)))`, and during deserialization when creating a node, we would just use a dummy name `arg`. After those changes, the input specs for none inputs look like `InputSpec(constantInput=InputToConstantInputSpec(name='y', value=ConstantValue(asNone=True)))`, and when creating a node we would use the name `y` as the name. However, the PR didn't handle the case of loading an old package that doesn't have this name, so it ended up putting empty names in the placeholder nodes. This error was uncovered after https://github.com/pytorch/pytorch/pull/149717, where we now use the GraphModule's python codegen to run the UnflattenedModule instead of going through the interpreter path. The placeholder nodes having empty names caused the python codegen to fail. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150515 Approved by: https://github.com/yushangdi	2025-04-03 15:27:45 +00:00
Isuru Fernando	a72b4eb806	Support windows in C++ shape guards (#149211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149211 Approved by: https://github.com/anijain2305 ghstack dependencies: #149149, #149197	2025-04-03 14:42:08 +00:00
Isuru Fernando	f9a7eac718	use python fallback if there are overflows (#149197 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149197 Approved by: https://github.com/anijain2305 ghstack dependencies: #149149	2025-04-03 14:39:03 +00:00
Isuru Fernando	ff783f062a	Fix shape guard failure to be valid python (#149149 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149149 Approved by: https://github.com/anijain2305	2025-04-03 14:36:17 +00:00
FFFrog	70b34a42c1	Add new dependences for gen_pyi.py (#150391 ) As the title stated. When we update some functions in _torch_docs.py or _tensor_docs.py, and execute some commands (like ``python setup.py develop``) to install the latest version, the description of the function we just changed is not updated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150391 Approved by: https://github.com/Skylion007, https://github.com/peterbell10	2025-04-03 14:18:18 +00:00
Jeff Daily	781d28e265	add unit test for preferred_blas_library settings (#150581 ) Follow up to #150212 that was committed without a unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581 Approved by: https://github.com/atalman	2025-04-03 13:27:50 +00:00
Guilherme Leobas	cbc901fac3	Implement `raise ... from ...` (#148766 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148766 Approved by: https://github.com/zou3519	2025-04-03 13:15:31 +00:00
LifengWang	e0d19cf6cc	Enable weekly test for operator benchmark (#150502 ) To regularly track the performance of the operator benchmark, enable the weekly test. Hi, @huydhn, as you mentioned in https://github.com/pytorch/pytorch/pull/143733#issuecomment-2578317520, we could integrate the performance data from the weekly test into the OSS benchmark database for the dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150502 Approved by: https://github.com/huydhn	2025-04-03 12:17:19 +00:00
Danfeng Wang	5d9c7f78e7	[fbcode]Removing `@NoIntBaseDeprecated` annotation in `evaluation.thrift` file (#150271 ) Summary: #buildall Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)' ``` Differential Revision: D72028940 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150271 Approved by: https://github.com/huydhn	2025-04-03 12:01:59 +00:00
Bin Bao	d4c30b4599	[AOTI][dashboard] Update how peak memory is measured (#150534 ) Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at the run time instead of the compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534 Approved by: https://github.com/yushangdi	2025-04-03 12:01:43 +00:00
Jagadish Krishnamoorthy	6fa1b17195	ROCm: Add trailing comma for consistency in gfx architecture list (#150250 ) Adding trailing comma for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150250 Approved by: https://github.com/petrex, https://github.com/jeffdaily, https://github.com/cyyever	2025-04-03 10:58:48 +00:00
Arash Pakbin	e6e07ec1cf	[ROCm] code cleanup of architecture checks (#150473 ) This PR replaces several calls to `at::cuda::getCurrentDeviceProperties()->gcnArchName` and `at::cuda::getDeviceProperties(device_index)->gcnArchName` when checking to see if the GPU architecture is in a certain list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150473 Approved by: https://github.com/jeffdaily, https://github.com/cyyever	2025-04-03 09:51:06 +00:00
Jiang, Zhiwei	9e106019f6	[XPU] Add an implicit conversion from XPUStream to sycl::queue* (#148646 ) # Motivation Currently, in PyTorch XPU, `cudaStream_t` is mapped to `sycl::queue&`, so an implicit cast from `XPUStream` to `sycl::queue&` is provided just like `CUDAStream` has an implicit cast to `cudaStream_t`. But on the SYCLomatic side, we migrate `cudaStream_t` to `sycl::queue*` but not `sycl::queue&` (One reason is that `cudaStream_t` is actually a pointer so users can do anything with that integer. Another reason is that the early `sycl::queue` was not impl-ed by a pointer, so copy by value is not desirable.) Without this PR: ``` cudaStream_t a = getCurrentCUDAStream(); cudaStream_t b = getCurrentCUDAStream().stream(); ``` needs to be migrated to: ``` queue_ptr a = &(sycl::queue&)getCurrentXPUStream(); queue_ptr b = &(getCurrentXPUStream().queue()); ``` With this PR: ``` queue_ptr a = getCurrentXPUStream(); queue_ptr b = &(getCurrentXPUStream().queue()); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148646 Approved by: https://github.com/guangyey, https://github.com/EikanWang	2025-04-03 08:12:38 +00:00
Saagar Jha	c067127d47	Ensure cuda_dlink_post_cflags are quoted as well (#150151 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150151 Approved by: https://github.com/janeyx99	2025-04-03 06:50:22 +00:00
Junjie Wang (PyTorch)	fc674b45d4	[c10d] Add logging for desync debug report (#150513 ) Summary: We want to add logging to first understand the distribution of desync debug reports. Test Plan: Test with logger staging Differential Revision: D72249281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150513 Approved by: https://github.com/kwen2501	2025-04-03 06:42:06 +00:00
Pian Pawakapan	90ddb33141	[export] specialize for aten.to (#149235 ) Changes decomposition behavior of `aten.to` to respect the aliasing/non-aliasing behavior in eager, and to specialize to the input/conversion dtype & device. Before change: we always decompose `aten.to` into `_to_copy`, regardless of aliasing behavior. This leads us to ban mutations on the result of `_to_copy` when aliased, since we can't guarantee correct program semantics. This meant users had to explicitly call `.clone()` before mutating. In the special cases where we don’t ban mutations (e.g. dtype conversion), we add runtime assertions on the input & conversion dtype/devices in the decomposed program (see https://github.com/pytorch/pytorch/pull/142420). After change: we decompose to the aliasing/non-aliasing behavior that matches eager, allowing mutations in all cases. We also add dtype/device assertions for all `aten.to` ops, starting in the pre-dispatch graph, basically specializing the program to the dtype/devices. Differential Revision: D71229547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149235 Approved by: https://github.com/tugsbayasgalan	2025-04-03 05:20:10 +00:00
drisspg	2e5d95a082	[FlexAttention] Remove dead code (#150575 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150575 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng	2025-04-03 01:46:19 +00:00
Shangdi Yu	77dca3947e	[aoti] make a check function for each input (#150553 ) Summary: make a check function for each input to avoid too large to optimize error on `__check_inputs_outputs` Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r runtime_checks ``` Differential Revision: D72286280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150553 Approved by: https://github.com/desertfire	2025-04-03 00:55:35 +00:00
rzou	13f48197d2	Add Chillee as core reviewer (#150579 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150579 Approved by: https://github.com/albanD, https://github.com/drisspg, https://github.com/malfet	2025-04-03 00:40:06 +00:00
Mu-Chu Lee	f363fe616d	[AOTInductor] Fix autotuning code's codegen (#150522 ) Summary: Codegen used to generate tmp_arg_{index} as temporary args, where index is the position of the caller. We changed the logic of codegen such that we can reuse previously generated samples, and only delete an arg after it is no longer used. In this case, we need to make {index} unique, since different functions could reuse the same "tmp_arg_{index}" name string, but correspond to different args. Test Plan: `python test/inductor/test_aot_inductor.py -k test_autotuning_args_reuse` Differential Revision: D72297084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150522 Approved by: https://github.com/desertfire, https://github.com/22quinn	2025-04-03 00:08:19 +00:00
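The naming problem described above (two call sites both producing `tmp_arg_0` for different arguments) is commonly avoided with a single shared counter. The following is a minimal sketch of that idea only; `TmpArgNamer` is a hypothetical class, not the AOTInductor codegen.

```python
import itertools


class TmpArgNamer:
    """Hands out temporary argument names that are unique across all callers."""

    def __init__(self) -> None:
        self._counter = itertools.count()  # one counter shared by every call site

    def fresh(self) -> str:
        return f"tmp_arg_{next(self._counter)}"


namer = TmpArgNamer()
# Two different kernels each request two temporaries; no name is reused.
kernel_a_args = [namer.fresh() for _ in range(2)]
kernel_b_args = [namer.fresh() for _ in range(2)]

print(kernel_a_args, kernel_b_args)
```

Because the index no longer encodes the position within a single call, a temporary can safely outlive the call that created it and be reused later.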
Gabriel Ferns	24f50653c8	fix bug in logging code (#150518 ) Fixes https://github.com/pytorch/pytorch/issues/150379 ```python >>> key = "aten._int_mm_1_2_3" >>> m, n, k = key.split("_")[-3:] >>> m, n, k ('1', '2', '3') >>> name = "_".join(key.split("_")[:-3]) >>> name 'aten._int_mm' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150518 Approved by: https://github.com/xmfan	2025-04-02 23:39:06 +00:00
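The REPL session in the message above splits a key such as `aten._int_mm_1_2_3` into an op name and three trailing sizes. The same parsing can be sketched with `str.rsplit`, which splits from the right so that underscores inside the op name are left alone. The function name `split_autotune_key` is a hypothetical choice for this example, and `rsplit` is swapped in here for the `split`/`join` pair shown in the commit.

```python
def split_autotune_key(key: str) -> tuple:
    """Split 'name_m_n_k' into (name, m, n, k), keeping underscores in the name."""
    name, m, n, k = key.rsplit("_", 3)  # split at most 3 times, from the right
    return name, int(m), int(n), int(k)


print(split_autotune_key("aten._int_mm_1_2_3"))
```

A key with fewer than three trailing fields raises `ValueError` at the unpacking step, which surfaces malformed keys early rather than silently producing a wrong name.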
PyTorch MergeBot	61a1f09b5b	Revert "[cuda] Add new faster gammabeta backward kernel (#148605 )" This reverts commit 114d404b0720e8073748690faeb96449e5c0b229. Reverted https://github.com/pytorch/pytorch/pull/148605 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/150266#issuecomment-2773907902 for more details ([comment](https://github.com/pytorch/pytorch/pull/148605#issuecomment-2773928838))	2025-04-02 23:14:11 +00:00
Animesh Jain	de15ef0ee8	[invoke_subgraph] Force grad_outs to be contiguous at tracing time (#150561 ) I am unable to come up with a testcase. It passes many end-to-end tests that fail with ReshapeError at https://ossci-raw-job-status.s3.amazonaws.com/log/39717218372 ![image](https://github.com/user-attachments/assets/8509b485-3897-4538-968b-bbe05af63a59) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150561 Approved by: https://github.com/zou3519, https://github.com/bdhirsh ghstack dependencies: #150082, #150450, #150486, #150556	2025-04-02 22:59:08 +00:00
Wang, Chuanqi	0198e44f37	Update torch-xpu-ops commit pin to 98c808d (#150554 ) Update the torch-xpu-ops commit to 98c808dea6de7330c415aa777d6921944cf79887, which includes: - Fixes #150001 by removing pre-CXX11 ABI logic from build script for XPU - Fixes #150430 - Fixes XCCL build issue caused by PR #150398 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150554 Approved by: https://github.com/EikanWang, https://github.com/malfet	2025-04-02 22:42:18 +00:00
PaulZhang12	8667a00979	Add stride + dtype to autotune results (#150419 ) Add stride/dtype info to autotune gemm results. New output header: `AUTOTUNE mm(1024x1024, 1024x7680)` `strides: [1, 1024], [7680, 1]` `dtypes: torch.bfloat16, torch.bfloat16` Differential Revision: [D72253313](https://our.internmc.facebook.com/intern/diff/D72253313) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150419 Approved by: https://github.com/eellison	2025-04-02 22:36:38 +00:00
Animesh Jain	0bacb90a9c	[invoke_subgraph][min-cut partitioner] Fix bug to use the correct root module (#150556 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150556 Approved by: https://github.com/bdhirsh, https://github.com/zou3519 ghstack dependencies: #150082, #150450, #150486	2025-04-02 22:35:00 +00:00
Shivam Raikundalia	a677b491c9	[Profiler] Fix Empty C Call Queue (#150370 ) Summary: My commandeer of https://github.com/pytorch/pytorch/pull/150102 Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging. Contributors: @arjun-choudhry Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues. Differential Revision: D72207570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370 Approved by: https://github.com/aaronenyeshi	2025-04-02 22:25:46 +00:00
Eli Uriegas	74aa9f571c	ci: Use cache / progress when local docker build (#150551 ) It's a bit annoying to try and work on these locally when the cache / progress isn't being used so let's just set it so that those flags are only valid when in CI directly. `${CI}` is a default environment variable that's defined by actions itself. See https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/150551 Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman	2025-04-02 22:08:57 +00:00
Avik Chaudhuri	1017927c83	multidimensional slicing (#150104 ) Differential Revision: D71962884 Fixes #150057 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150104 Approved by: https://github.com/angelayi	2025-04-02 20:57:16 +00:00
Ryan Guo	bb98749230	[dynamo] Always trace into tensor subclass `__torch_function__` (#149792 ) This patch effectively ignores traceable_tensor_subclasses, allowing Dynamo to always try tracing into the `__torch_function__` of tensor subclass. This helps us with 2 things: 1. allowing users to directly benefit from better compilation of tensor subclass, by just upgrading pytorch, without having to change legacy library code (see earlier patches in the stack for examples). 2. potentially exposing more issues in compiling tensor subclass, so we can get signals and improve them. As a consequence, it exposed and fixed 2 subtle bugs: 1. In `build_torch_function_fn`, we could get `torch._C._disabled_torch_function_impl` because we have a `Parameter` subclass without `__torch_function__` override or if we have a tensor subclass with `__torch_dispatch__` override. We graph break on this for now, and plan to add support -- the logic for simulating `torch._C._disabled_torch_function_impl` is already in `SuperVariable`, we just need to reuse it. 2. Sometimes we create `SyntheticLocalSource` and need to remove all the guards installed on it, but we only removed the ones whose source _is_ the created synthetic source `s`, and forgot about chained sources like `s.foo`; this showed up as `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`. Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #149482, #149483, #149484	2025-04-02 20:57:00 +00:00
Ryan Guo	3463ea1059	[dynamo] Support tensor subclass with overridden tensor methods and properties (#149484 ) This fixes most of the "torch.compile X tensor-subclass" issues encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The relevant tensor subclass definition is here: `298192ed60/ops.py (L18-L65)`. A few things to note about the tensor subclass: 1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`, `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support that. 2. it overrides the `shape` property, so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support properties as well. 3. it has calls to `torch.Tensor.size`, which returns `torch.Size`, which gets reconstructed in `torch.Tensor.__torch_function__`, so this patch adds support for calling `torch.Size(...)` on non-constant inputs. Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #149482, #149483	2025-04-02 20:57:00 +00:00
Ryan Guo	0d4dbfd9ed	[dynamo] Support `torch.Tensor._make_subclass` and tracing through tensor subclass `__new__` (#149483 ) This builds off the previous patch in the stack, and fully fixes https://github.com/huggingface/diffusers/issues/10795. Essentially, the tensor subclass in the issue uses `torch.Tensor._make_subclass`, which has pretty simple shallow-copy plus type-change semantics, as far as Dynamo is concerned. So this patch adds a polyfill for it. As a result, this allows us to trace through many user-defined `__new__` in tensor subclass (it's similar to how we trace through user-defined `__new__` for `UserDefinedClassVariable`), so this patch also faithfully traces through these `__new__` methods. Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483 Approved by: https://github.com/zou3519, https://github.com/mlazos ghstack dependencies: #149482	2025-04-02 20:56:52 +00:00
Ryan Guo	33535b3eee	[dynamo] Support Tensor subclass that has dynamic attributes or calls `Parameter.__torch_function__` (#149482 ) This fixes most of https://github.com/huggingface/diffusers/issues/10795, except for `torch.Tensor._make_subclass`, which will be fixed in a subsequent patch. The relevant tensor subclass from the aforementioned issue is defined here: `fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435)`. There are two things to note about the tensor subclass: 1. it calls `super().__torch_function__`, which is `torch._C._disabled_torch_function_impl`, so this patch updates `SuperVariable.call_method` to handle it (we can't do a simpler polyfill due to some bug with `var_getattr` raising `NotImplementedError`, which forgot to restore symbolic context). 2. it sets and reads attributes (`quant_type`), and defines new methods (`as_data`), so this patch adds support for those. 3. it has a `__init__`, which Dynamo needs to trace through in `TensorSubclassVariable.call_function`. Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-04-02 20:56:43 +00:00
William Wen	85df0dc246	[dynamo] emit only 1 graph break message on unrecoverable data-dependent assert fail (#150471 ) Addresses https://fb.workplace.com/groups/1075192433118967/permalink/1625299684774903/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/150471 Approved by: https://github.com/jansel	2025-04-02 20:42:43 +00:00
Colin Peppler	a8f6b40e36	[inductor] skip non-trivial tiling if unbacked symints are present (#150225 ) Take two of https://github.com/pytorch/pytorch/pull/149994. This time we just skip `convert_tiling_to_3d` and `candidate_tilings` if there exists unbacked symints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150225 Approved by: https://github.com/eellison	2025-04-02 20:36:02 +00:00
PyTorch MergeBot	03c879d59b	Revert "[dynamo] Support Tensor subclass that has dynamic attributes or calls `Parameter.__torch_function__` (#149482 )" This reverts commit 98453c135a7778d12ff881d8b0a717257be9fc38. Reverted https://github.com/pytorch/pytorch/pull/149482 on behalf of https://github.com/malfet due to Broke trunk, see `b03c42109c/1` ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))	2025-04-02 20:30:33 +00:00
PyTorch MergeBot	18908c8ced	Revert "[dynamo] Support `torch.Tensor._make_subclass` and tracing through tensor subclass `__new__` (#149483 )" This reverts commit 203e1d681d1a4eb7794dfaeaebfa497242dde17d. Reverted https://github.com/pytorch/pytorch/pull/149483 on behalf of https://github.com/malfet due to Broke trunk, see `b03c42109c/1` ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))	2025-04-02 20:30:33 +00:00
PyTorch MergeBot	01411c739f	Revert "[dynamo] Support tensor subclass with overriden tensor methods and properties (#149484 )" This reverts commit 7e53c58687482d58461e1dd8e09f59a9daf8f7b3. Reverted https://github.com/pytorch/pytorch/pull/149484 on behalf of https://github.com/malfet due to Broke trunk, see `b03c42109c/1` ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))	2025-04-02 20:30:33 +00:00
PyTorch MergeBot	e545567340	Revert "[dynamo] Always trace into tensor subclass `__torch_function__` (#149792 )" This reverts commit 238109ad3245c5485f9e83b4b02d258b09329042. Reverted https://github.com/pytorch/pytorch/pull/149792 on behalf of https://github.com/malfet due to Broke trunk, see `b03c42109c/1` ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))	2025-04-02 20:30:32 +00:00
Eli Uriegas	af5c1b96e2	ci: Set minimum cmake version for halide build (#150560 ) This was failing due to pybind being strict about their cmake version requirements. This resolves errors like: ``` 652.1 Compatibility with CMake < 3.5 has been removed from CMake. 652.1 652.1 Update the VERSION argument <min> value. Or, use the <min>...<max> syntax 652.1 to tell CMake that the project requires at least <min> but has been updated 652.1 to work with policies introduced by <max> or earlier. 652.1 652.1 Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway. 652.1 652.1 652.1 -- Configuring incomplete, errors occurred! ``` Tested this locally with the following command: ``` ./build.sh pytorch-linux-jammy-py3.12-halide -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-jammy-py3.12-halide:8a8989876ff1aa1d5b0e465177afebbc7a9da921 ``` Closes https://github.com/pytorch/pytorch/issues/150420 Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/150560 Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet	2025-04-02 20:27:24 +00:00
James Wu	b03c42109c	Proactively remove CompiledTritonKernels before loading from cache/starting inductor compile (#150453 ) We're still running into this issue intermittently and it's hard to debug, so I thought a more aggressive cache clear strategy may fix it as a stopgap until we can statically launch CUDA kernels and avoid some of this stuff. Differential Revision: [D72257973](https://our.internmc.facebook.com/intern/diff/D72257973/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150453 Approved by: https://github.com/oulgen	2025-04-02 20:08:32 +00:00
Yidi Wu	22030efb64	expect fail scan test in sigmoid (#150475 ) Summary: as titled. Test Plan: see modified test. Differential Revision: D72271976 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150475 Approved by: https://github.com/zhxchen17	2025-04-02 19:56:50 +00:00
Catherine Lee	d4298f2136	[CI] Use system nccl in build (#150226 ) Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds. It takes some time to build nccl and doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes. The other changes in this PR are because there is an install_cuda script and an install_cuda_aarch64 script and they both build nccl from source and define their own pins for the nccl version. There is also a .ci/docker/nccl-cu11.txt and cu12.txt that define the pins, and this is an attempt to unify them. Unfortunately this leads to a lot of files needing to be copied to the docker build. Generally seems to increase docker pull times by <1 min, P1768456379, but it's hard to tell what the real increase is: 15761 MiB -> 16221 MiB [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs) `jq '[.layers[].size, .config.size] \| add / 1024 / 1024'` Example `6eb3c2e282 (39520169577-box)` ![image](https://github.com/user-attachments/assets/d44ef415-6e48-41ef-ac83-f19bab47560c) TODO: * Figure out a way to verify that nccl was built + works properly when it is expected (this time I just checked torch.distributed.is_nccl_available) * Merge the cusparse installation scripts * Merge the cuda installation scripts * Either always split the nccl, cuda, and cusparse installations, or always keep them together in one bash script distributed/test_distributed_spawn Pull Request resolved: https://github.com/pytorch/pytorch/pull/150226 Approved by: https://github.com/seemethere, https://github.com/atalman	2025-04-02 19:42:43 +00:00
atalman	cb4cd6166e	Address Cmake update issue in windows magma builds (#150549 ) 1. Fixes Cmake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864 ``` CMake Error at CMakeLists.txt:1 (cmake_minimum_required): Compatibility with CMake < 3.5 has been removed from CMake. Update the VERSION argument <min> value. Or, use the <min>...<max> syntax to tell CMake that the project requires at least <min> but has been updated to work with policies introduced by <max> or earlier. Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway. ``` 2. Removes deprecated CUDA 12.4 build Pull Request resolved: https://github.com/pytorch/pytorch/pull/150549 Approved by: https://github.com/clee2000	2025-04-02 19:13:44 +00:00
PaulZhang12	e62d958f02	[Inductor] Reland Merge Triton ScaledMM as epilogue to MM template #150045 (#150441 ) Merges https://github.com/pytorch/pytorch/pull/150438 and https://github.com/pytorch/pytorch/pull/150045. https://github.com/pytorch/pytorch/pull/150045 was already landed, but did not include a change that makes it unable to land internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150441 Approved by: https://github.com/clee2000	2025-04-02 17:49:32 +00:00
Ryan Guo	238109ad32	[dynamo] Always trace into tensor subclass `__torch_function__` (#149792 ) This patch effectively ignores traceable_tensor_subclasses, allowing Dynamo to always try tracing into the `__torch_function__` of tensor subclasses. This helps us with 2 things: 1. allowing users to directly benefit from better compilation of tensor subclasses, by just upgrading pytorch, without having to change legacy library code (see earlier patches in the stack for examples). 2. potentially exposing more issues in compiling tensor subclasses, so we can get signals and improve them. As a consequence, it exposes and fixes 2 subtle bugs: 1. In `build_torch_function_fn`, we could get `torch._C._disabled_torch_function_impl` because we have a `Parameter` subclass without a `__torch_function__` override or because we have a tensor subclass with a `__torch_dispatch__` override. We graph break on this for now, and plan to add support -- the logic for simulating `torch._C._disabled_torch_function_impl` is already in `SuperVariable`, we just need to reuse it. 2. Sometimes we create `SyntheticLocalSource` and need to remove all the guards installed on it, but we only removed the ones whose source _is_ the created synthetic source `s`, and forgot about chained sources like `s.foo`; this showed up as `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`. Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #149482, #149483, #149484	2025-04-02 17:05:25 +00:00
Ryan Guo	7e53c58687	[dynamo] Support tensor subclass with overriden tensor methods and properties (#149484 ) This fixes most of the "torch.compile X tensor-subclass" issues encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The relevant tensor subclass definition is here: `298192ed60/ops.py (L18-L65)`. A few things to note about the tensor subclass: 1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`, `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support that. 2. it overrides the `shape` property, so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support property as well. 3. it has calls to `torch.Tensor.size`, which returns `torch.Size`, which gets reconstructed in `torch.Tensor.__torch_function__`, so this patch adds support for calling `torch.Size(...)` on non-constant inputs. Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #149482, #149483	2025-04-02 17:05:25 +00:00
Ryan Guo	203e1d681d	[dynamo] Support `torch.Tensor._make_subclass` and tracing through tensor subclass `__new__` (#149483 ) This builds off the previous patch in the stack, and fully fixes https://github.com/huggingface/diffusers/issues/10795. Essentially, the tensor subclass in the issue uses `torch.Tensor._make_subclass`, which has pretty simple shallow-copy plus type-change semantics, as far as Dynamo is concerned. So this patch adds a polyfill for it. As a result, this allows us to trace through many user-defined `__new__` methods in tensor subclasses (it's similar to how we trace through user-defined `__new__` for `UserDefinedClassVariable`), so this patch also faithfully traces through these `__new__` methods. Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483 Approved by: https://github.com/zou3519, https://github.com/mlazos ghstack dependencies: #149482	2025-04-02 17:05:19 +00:00
Ryan Guo	98453c135a	[dynamo] Support Tensor subclass that has dynamic attributes or calls `Parameter.__torch_function__` (#149482 ) This fixes most of https://github.com/huggingface/diffusers/issues/10795, except for `torch.Tensor._make_subclass`, which will be fixed in a subsequent patch. The relevant tensor subclass from the aforementioned issue is defined here: `fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435)`. There are three things to note about the tensor subclass: 1. it calls `super().__torch_function__`, which is `torch._C._disabled_torch_function_impl`, so this patch updates `SuperVariable.call_method` to handle it (we can't do a simpler polyfill due to some bug with `var_getattr` raising `NotImplementedError`, which forgot to restore symbolic context). 2. it sets and reads attributes (`quant_type`), and defines new methods (`as_data`), so this patch adds support for those. 3. it has an `__init__`, which Dynamo needs to trace through in `TensorSubclassVariable.call_function`. Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-04-02 17:05:12 +00:00
PyTorch MergeBot	532530be34	Revert "[Profiler] Fix Empty C Call Queue (#150370 )" This reverts commit 5734909f343ab1de44ed5ab23311d43a9c6afaed. Reverted https://github.com/pytorch/pytorch/pull/150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](`3ac5a499dd`) ([comment](https://github.com/pytorch/pytorch/pull/150370#issuecomment-2773146070))	2025-04-02 16:40:54 +00:00
Manuel Candales	f38566dfe4	[MPSInductor] Disable mm/bmm decompositions (#150541 ) Disables mm/bmm decompositions. torch.compile on MPS was speeding up stories15M (~4x) but it was making stories110M much slower. Self-contained reproducer to demonstrate the difference (before the change, after it should be identical) ```python import torch import timeit def bench_mm(f, x, y): from torch.utils.benchmark import Timer return Timer(stmt="f(x, y); torch.mps.synchronize()", globals={"x": x, "y": y, "f": f}, language="python", timer=timeit.default_timer).blocked_autorange() x = torch.rand(1024, 512, device='mps') y = torch.rand(512, 1, device='mps') mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False}) mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True}) print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and 512x1 matrices are {bench_mm(mm_c, x, y).median}") print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and 512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}") ``` Disabling the inductor mm decomposition speeds up stories15M further (~6x) and speeds up stories110M (~7x). The table below shows average tokens/sec across 5 runs on M1 Pro for stories15M and stories110M: \| \| stories15M \| stories110M \| \|------------------------\|------------\|-------------\| \| without compile \| 99.40 \| 53.11 \| \| compile before change \| 367.68 \| 19.43 \| \| compile after change \| 582.96 \| 355.07 \| stories110M (without compile) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps [...] Average tokens/sec: 53.11 ``` stories110M (compile before change) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile [...] Average tokens/sec: 19.43 ``` stories110M (compile after change) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile [...] Average tokens/sec: 355.07 ``` stories15M (without compile) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps [...] Average tokens/sec: 99.40 ``` stories15M (compile before change) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile [...] Average tokens/sec: 367.68 ``` stories15M (compile after change) ``` (gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile [...] Average tokens/sec: 582.96 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150541 Approved by: https://github.com/malfet	2025-04-02 16:07:18 +00:00
Wang, Chuanqi	8102272d8c	[BE] Fix triton windows build (#150512 ) Fixes #150480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150512 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2025-04-02 15:48:11 +00:00
Animesh Jain	42c7c7f15f	[invoke_subgraph] Filter out grad_out where fw_out requires_grad is False (#150486 ) I am not sure if this is the right way. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150486 Approved by: https://github.com/zou3519 ghstack dependencies: #150082, #150450	2025-04-02 14:40:08 +00:00
Isuru Fernando	82ceebce58	[inductor] Lowerings for max_pool3d (#148210 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148210 Approved by: https://github.com/eellison	2025-04-02 14:13:01 +00:00
Isuru Fernando	5f62d07ec6	Fix log2, PowByNatural printing (#147592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147592 Approved by: https://github.com/eellison	2025-04-02 14:12:15 +00:00
rzou	aae36929ed	Rename node.meta["arg_kwarg_vals"] to node.meta["eager_input_vals"] (#148092 ) And added a comment about it. Otherwise it might be confusing Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/148092 Approved by: https://github.com/eellison ghstack dependencies: #148046, #148063, #148091	2025-04-02 13:18:04 +00:00
rzou	4d121d2b02	Implement needs_exact_strides for mutable custom operators (#148091 ) Mutable custom operators get wrapped into an auto_functionalized HOP, so we need to store the arg_kwarg_vals on the auto_functionalized HOP itself. When Inductor does the re-inplacing, it'll use the pattern matcher to decompose the auto_functionalized HOP back into the original op (and 0+ other view or clone operations). The pattern matcher uses the arg_kwarg_vals to trace the subgraph to do the decomposition, so it ultimately sets arg_kwarg_vals on the original op's node correctly. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091 Approved by: https://github.com/eellison ghstack dependencies: #148046, #148063	2025-04-02 13:18:04 +00:00
rzou	c69c3c885e	Add needs_exact_strides operator tag for Inductor to force exact strides (#148063 ) Inductor will force exact strides on a custom operator tagged with needs_exact_strides. I'll make this the default in a follow-up PR. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/148063 Approved by: https://github.com/eellison ghstack dependencies: #148046	2025-04-02 13:17:58 +00:00
rzou	c41fbb4f78	Change arg_kwarg_vals propagation strategy (#148046 ) Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we special-case the pattern matcher to propagate arg_kwarg_vals when it sees triton_kernel_wrapper_functional. The strategy is: 1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata) 2) trace out the replacement graph with vals (which have the accurate Inductor metadata) 3) Propagate the arg_kwarg_vals from the first graph to the second. 4) Use the second graph as the replacement graph. The strategy is this because we want to extend this to handle auto_functionalized later up in the stack. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/148046 Approved by: https://github.com/eellison	2025-04-02 13:17:52 +00:00
Bin Bao	03138733ba	[AOTI] Emit Triton kernels as comment (#150188 ) Summary: Emit the corresponding Triton kernel code as comment in each call_triton_ wrapper function, for easier debugging. Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150188 Approved by: https://github.com/yushangdi	2025-04-02 12:41:54 +00:00
Benjamin Glass	75f38dfd4e	cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350 ) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes #142005. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350 Approved by: https://github.com/desertfire	2025-04-02 09:54:27 +00:00
Boyuan Feng	3f54b14c75	[CUDAGraph] support meta tensor (#150478 ) Previously, cudagraph is skipped if the graph contains any meta tensor. However, we should not skip since meta tensor does not have actual computation. This PR fixes the issue. ### Example ```python import torch def foobar(x, y): return x * 2, y * 3 foo_c = torch.compile(mode="reduce-overhead")(foobar) t = torch.empty((1, 16, 128, 128), device="meta") y = torch.rand([64], device="cuda") eager_out = foobar(t, y) for _ in range(3): compiled_out = foo_c(t, y) ``` Prior to this PR, above code leads to ``` skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta') ``` With this PR, we don't skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478 Approved by: https://github.com/eellison	2025-04-02 07:21:50 +00:00
Sukchul Cho	0da8127f77	Compare device name of profiler dynamically (#150396 ) Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150396 Approved by: https://github.com/sraikund16	2025-04-02 06:06:06 +00:00
Rebecca Chen	c65de03196	Add `Any` return annotation to `__getattr__` methods that return a union of types. (#150204 ) Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204 Approved by: https://github.com/malfet	2025-04-02 05:25:07 +00:00
Nikita Shulga	dee016ceb7	[MPSInductor] Add `store_reduce` method (#150457 ) This restricts the store operation to the 0th thread, which should be much better (though I don't observe it in the benchmark). Pull Request resolved: https://github.com/pytorch/pytorch/pull/150457 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #150452	2025-04-02 05:12:49 +00:00
William Wen	3ac5a499dd	[dynamo] add dynamo disable reasons to codebase (#150440 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150440 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #150341	2025-04-02 04:26:48 +00:00
William Wen	25eff6e991	[dynamo] add reason field to torch.compiler.disable (#150341 ) Implements https://github.com/pytorch/pytorch/issues/146445 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150341 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-04-02 04:26:48 +00:00
Mu-Chu Lee	063ea5d669	[AOTInductor] Modify test for Memory tracking for memory-related (#150269 ) operations Summary: Fix the test for memory tracking. This PR does: (1) Add tracking before and after for all memory-related operations. Make sure the operations do indeed capture consumed memory both in CUDA and torch's CUDACachingAllocator. (2) Keep track of memory being reserved by CUDACachingAllocator in torch and its relationship with global CUDA memory consumption. Test Plan: This PR is adding tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269 Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire	2025-04-02 04:18:18 +00:00
Shivam Raikundalia	5734909f34	[Profiler] Fix Empty C Call Queue (#150370 ) Summary: My commandeer of https://github.com/pytorch/pytorch/pull/150102 Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging. Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues. Differential Revision: D72207570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370 Approved by: https://github.com/aaronenyeshi	2025-04-02 02:44:50 +00:00
eqy	f09513e515	[CUDA]][SymmetricMemory] Interpret empty string as `std::nullopt` in `rendezvous` (#149793 ) This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`, e.g., `9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)`. This currently breaks `test_intra_node_comm_all_reduce`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793 Approved by: https://github.com/kwen2501, https://github.com/cyyever	2025-04-02 02:41:07 +00:00
Animesh Jain	61ebe999cc	[invoke_subgraph] Do not cache fake tensors for AOTDispatcher first pass (#150450 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150450 Approved by: https://github.com/zou3519 ghstack dependencies: #150082	2025-04-02 02:31:54 +00:00
Animesh Jain	b060fedfa8	[invoke_subgraph] Support None in the fwd output (#150082 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150082 Approved by: https://github.com/zou3519	2025-04-02 02:31:54 +00:00
Rithesh Baradi	0ae75ca2de	assert on all_reduce_event only if it's not CPU device. (#150316 ) Summary: For CPU based runs, `all_reduce_event` would be None since this is the result of the `all_reduce_stream.record_event()`, which does not do much other than returning None when device type is CPU. Test Plan: CI Differential Revision: D72176406 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150316 Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/mori360	2025-04-02 01:54:35 +00:00
cyy	e872c38eb3	Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638 Approved by: https://github.com/zou3519	2025-04-02 01:33:20 +00:00
vasiliy	c974b5322a	enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462 ) Summary: Updates the meta registration for `torch._scaled_mm` to work for the nvfp4 recipe. Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462 Approved by: https://github.com/eellison	2025-04-02 01:08:40 +00:00
Nikita Shulga	ee97299961	[MPS][Testing] Benchmark reduction ops (#150452 ) That compares eager vs compile On my M4Pro mini I'm getting the following now ``` [--------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------] \| eager-512x512 \| compile-512x512 \| eager-1024x1024 \| compile-1024x1024 \| eager-2048x2048 \| compile-2048x2048 \| eager-4096x4096 \| compile-4096x4096 1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- sum (torch.float32) \| 121.0 \| 201.5 \| 130.3 \| 772.3 \| 179.4 \| 1470.5 \| 476.1 \| 2980.0 max (torch.float32) \| 154.1 \| 165.9 \| 198.7 \| 211.6 \| 344.2 \| 386.9 \| 1326.6 \| 1345.6 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452 Approved by: https://github.com/dcci, https://github.com/manuelcandales	2025-04-02 01:06:27 +00:00
tvukovic-amd	db32093192	[ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180 ) Since hipcc files and calls are restructured in HIP SDK 6.4, this adds a case for calling hipcc.exe when building torchvision with HIP SDK 6.4 on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180 Approved by: https://github.com/malfet, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-04-02 00:35:47 +00:00
Junjie Wang (PyTorch)	d22e3d5efe	[fr] Add logger config for flight record in PGNCCL (#150356 ) Summary: We want to move from a scuba based direct logging to a logger config based logging. Mostly changes are internal but we need to change the exception to exception_msg. Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it. Differential Revision: D72198171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356 Approved by: https://github.com/fegin	2025-04-01 23:54:07 +00:00
Tristan Rice	6aea4d90fb	gloo: use shared Stores (#150230 ) Summary: X-link: https://github.com/facebookincubator/gloo/pull/423 This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around. To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo. This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic. This change will land simultaneously in PyTorch and Gloo repos. Test Plan: ``` buck2 test //gloo/... //caffe2/caffe2/contrib/gloo: ``` Differential Revision: D72084111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230 Approved by: https://github.com/fduwjj	2025-04-01 23:37:25 +00:00
Nick Riasanovsky	4934a83347	[AMD] [TRITON] [INDUCTOR] Add tl.assume to enable bufferops on AMD (#150373 ) Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD. Test Plan: Tested manually with a simple matmul run with torch.compile(f, mode="max-autotune") and the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1. Inspecting the generated AMDGCN shows that all loads/stores use bufferops. Note: Since inductor is loading constants for many of the shape values, assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability. Differential Revision: D71922698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373 Approved by: https://github.com/eellison	2025-04-01 23:29:39 +00:00
angelayi	60fe0922f6	[pytree] Register normal class to register_dataclass (#147752 ) Fixes https://github.com/pytorch/pytorch/pull/147532#discussion_r1964365330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147752 Approved by: https://github.com/zou3519	2025-04-01 23:28:20 +00:00
PyTorch MergeBot	203a27e0ce	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit 8f7fbe3d7d2cd301df48fcbe8a14f8aa1a9c1e48. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))	2025-04-01 23:07:28 +00:00
Will Feng	80ab233786	[Inductor] Hide reinplace_fsdp_all_gather pass behind skip_fsdp_hooks config (#150436 ) The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436 Approved by: https://github.com/BoyuanFeng	2025-04-01 22:56:06 +00:00
PyTorch MergeBot	9458460211	Revert "if blaslt fails, fall back to blas (#150147 )" This reverts commit 65139eb050817329ac8e541c377b2be3bb5ffe14. Reverted https://github.com/pytorch/pytorch/pull/150147 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150147#issuecomment-2770847320))	2025-04-01 22:52:22 +00:00
PyTorch MergeBot	76e1b3ba4c	Revert "[ROCm] use correct workspace for hipblaslt, silence warning (#150227 )" This reverts commit c158eac0de2afe38d68952ca401888ed5777f6b0. Reverted https://github.com/pytorch/pytorch/pull/150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150227#issuecomment-2770827563))	2025-04-01 22:31:13 +00:00
henrylhtsang	629c1bd2dd	[ez][inductor][tests] Skip triton backend only for CPU tests (#150343 ) Motivation: to unblock https://github.com/pytorch/pytorch/pull/148622 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150343 Approved by: https://github.com/chenyang78	2025-04-01 22:03:48 +00:00
Avik Chaudhuri	b70d105c77	infer dynamic shapes through additional inputs (#150144 ) Summary: Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process. However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization. Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now. Test Plan: added test (positive and negative cases) Differential Revision: D72001771 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144 Approved by: https://github.com/bobrenjc93	2025-04-01 21:13:39 +00:00
Michael Lazos	0d44a8aea1	[Hierarchical Compile] Apply deduplication after output node creation (#150306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150306 Approved by: https://github.com/anijain2305 ghstack dependencies: #150303, #150304, #150305	2025-04-01 20:54:18 +00:00
Michael Lazos	8740ffa760	[Hierarchical Compile] Add cycle detection to graph region expansion (#150305 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150305 Approved by: https://github.com/anijain2305 ghstack dependencies: #150303, #150304	2025-04-01 20:54:18 +00:00
Michael Lazos	a2300aff94	[Hierarchical Compile] Add cycle detection function for debug (#150304 ) Remove print Pull Request resolved: https://github.com/pytorch/pytorch/pull/150304 Approved by: https://github.com/anijain2305 ghstack dependencies: #150303	2025-04-01 20:54:10 +00:00
Michael Lazos	99fd96c10b	[Hierarchical Compile] Remove spammy debug log (#150303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150303 Approved by: https://github.com/williamwen42	2025-04-01 20:54:03 +00:00
atalman	295162ec3a	Smoke Test - disable pypi package validation for binaries that package cuda libs (#150194 ) Smoke Test - disable pypi package validation for binaries that package cuda libs. These binaries do not install packages via pypi. Should Resolve this from `linux-binary-manywheel / manywheel-py3_11-cuda12_6-full-test / test`: ``` Traceback (most recent call last): File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 468, in <module> main() File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 462, in main smoke_test_cuda( File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 274, in smoke_test_cuda compare_pypi_to_torch_versions( File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 220, in compare_pypi_to_torch_versions raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}") RuntimeError: Can't find cudnn in PyPI for Torch: 9.5.1 ``` Link: https://github.com/pytorch/pytorch/actions/runs/14101221665/job/39505479587#step:15:982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150194 Approved by: https://github.com/ZainRizvi	2025-04-01 19:18:44 +00:00
Tianyu Liu	d2ad9aa2f2	[dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372 ) Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time. The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs. I had another implementation which put all code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit the monolithic `PrepareModuleInputOutput`. But it is 1. less clean 2. conceptually abusing inheritance because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372 Approved by: https://github.com/wanchaol	2025-04-01 19:15:43 +00:00
Tianyu Liu	5d6ac2dced	[dtensor] add op support for select_backward and slice_backward (#150357 ) Inheriting and rebasing @awgu 's PR https://github.com/pytorch/pytorch/pull/149071 - fixed an issue for `select_backward` and an issue for `slice_backward` - removed `_experimental_ops.py` as it becomes empty Pull Request resolved: https://github.com/pytorch/pytorch/pull/150357 Approved by: https://github.com/awgu, https://github.com/XilunWu	2025-04-01 19:15:25 +00:00
IvanKobzarev	a37afd23fa	[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555 ) (benchmark for 1 call) Before: ``` └─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json ``` After: ``` └─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555 Approved by: https://github.com/zou3519	2025-04-01 18:45:48 +00:00
Ethan Wee	78300c8205	[ROCm] update test buffer fudge factor for hipblaslt (#150348 ) The default workspace for hipblaslt is larger than for cublas/cublaslt which requires a slight increase to the buffer needed. Forward-fix for #150227 that broke ROCm distributed tests but wasn't part of initial CI signal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150348 Approved by: https://github.com/jeffdaily	2025-04-01 18:31:25 +00:00
Jason Ansel	37ebb0b56a	[inductor] Fix inductor windows linker error (#150256 ) Fixes #149889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256 Approved by: https://github.com/anijain2305, https://github.com/eellison	2025-04-01 18:30:55 +00:00
eellison	15dbad2115	Update torch.compile issue template (#150192 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150192 Approved by: https://github.com/malfet ghstack dependencies: #149947	2025-04-01 18:16:16 +00:00
PyTorch MergeBot	f04cf13bdd	Revert "Merge Triton ScaledMM as epilogue to MM template (#150045 )" This reverts commit 981048854da154eae8ff0bd439e72e1256ae00da. Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))	2025-04-01 17:54:28 +00:00
Will Feng	b0c560ef2a	[dynamo][hooks] use wrap_top_frame config for functions (#150209 ) When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209 Approved by: https://github.com/anijain2305	2025-04-01 17:41:23 +00:00
Nikita Shulga	48af2cdd27	[BE] Move all lint runner to 24.04 (#150427 ) As Ubuntu-20 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101 This forces older python version to be 3.8 Delete all linux-20.04 runners from the lintrunner.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427 Approved by: https://github.com/seemethere	2025-04-01 17:33:15 +00:00
Xia, Weiwen	3b0cd9b542	[Quant][PT2E] add a lowering pass for x86 backend (#149708 ) Summary This PR adds a lowering pass for x86 backend - Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops. - Weights are prepacked ahead of time. - Post ops of conv/linear are fused if supported. - The pass returns a `GraphModule` with the modifications mentioned above. Test plan ``` pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708 Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel	2025-04-01 17:32:41 +00:00
Catherine Lee	783f045c4f	[ez] Remove dead lite interpreter CI code (#150424 ) There are no lite-interpreter build environments in CI I assume every mac build is arm64 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150424 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-04-01 17:14:32 +00:00
Catherine Lee	a17ee8181a	[CI] Fix log artifact not containing test logs attempt 2 (#150234 ) Take two of https://github.com/pytorch/pytorch/pull/149577 since it didn't work Pull Request resolved: https://github.com/pytorch/pytorch/pull/150234 Approved by: https://github.com/malfet, https://github.com/seemethere	2025-04-01 17:13:58 +00:00
Nikita Shulga	f94ac263af	[MPSInductor] Fix neg for unsigned types (#150412 ) By more-or-less copy-n-pasting the fix from https://github.com/pytorch/pytorch/pull/94035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150412 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #150382, #150386	2025-04-01 16:52:41 +00:00
Xuehai Pan	ae74ef9d53	Set proper `LD_LIBRARY_PATH` on Linux in nightly venv in nightly pull tool (#143262 ) Before this change: ```console $ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.12" $ source venv/bin/activate $ python3 -c 'import torch' Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/PanXuehai/Projects/pytorch/torch/__init__.py", line 379, in <module> from torch._C import * # noqa: F403 ^^^^^^^^^^^^^^^^^^^^^^ ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory ``` This PR adds `site-packages/nvidia/**/lib` to `LD_LIBRARY_PATH` in the `venv/bin/activate` script so that NVIDIA PyPI packages can be loaded correctly. See also: - #141837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143262 Approved by: https://github.com/malfet	2025-04-01 16:51:02 +00:00
Sriram Kumar	a19b667bca	[ROCm] Update CUDAPluggableAllocator.h (#1984 ) (#150010 ) Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex. See PR https://github.com/ROCm/apex/pull/184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010 Approved by: https://github.com/jeffdaily	2025-04-01 16:49:03 +00:00
Ke Wen	35c45a4a31	[Reland] Launch kernel on current stream & remove `record_stream` entirely (#150398 ) Relanding #148590 due to merge conflict. This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Squashed contents: * [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820) PTD current workflow: - PTD creates its own dedicated `ncclStream` for comm operation - it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective. Such stream synchronization becomes expensive in the inference world (cpu overhead: 70us vs GPU kernel time: 160us). 
This diff: - async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead - async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready - pass down async from c10d down to NCCL-PG this helps shave off 50% CPU overhead (70us -> 35us), which reduce total CPU/GPU from 230us to 195us by 15% * [PGNCCL] Make avoid-record-stream default * [c10d] Add asyncOp argument to Ops * Change python side wait * Pass asyncOp at ProcessGroup level * Watchdog unstashing tensors as a safety net * Stash tensors for reduce_scatter_v and all_gather_v Pull Request approved: https://github.com/pytorch/pytorch/pull/149753 * [c10d] Move unstashing from watchdog to main thread Pull Request approved: https://github.com/pytorch/pytorch/pull/150079 * [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation Pull Request approved: https://github.com/pytorch/pytorch/pull/150130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398 Approved by: https://github.com/atalman	2025-04-01 16:46:07 +00:00
Mergen Nachin	7382654ebc	Update ExecuTorch pin to latest viable/strict 3/28/2025 (#150308 ) From latest viable/strict: https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50 Fixes https://github.com/pytorch/pytorch/issues/144480 This commit has important CI stability fixes, such as https://github.com/pytorch/executorch/pull/9561 and https://github.com/pytorch/executorch/pull/9634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150308 Approved by: https://github.com/jathu, https://github.com/malfet	2025-04-01 16:30:09 +00:00
Nikita Shulga	428234bc28	[MPSInductor] torch.complex128 is unsupported on MPS (#150386 ) Same as torch.float64 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150386 Approved by: https://github.com/dcci ghstack dependencies: #150382	2025-04-01 15:19:10 +00:00
Nikita Shulga	1c6e88eb03	[MPS] Test bf16 perf of few unary and binary ops (#150382 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382 Approved by: https://github.com/Skylion007	2025-04-01 13:58:20 +00:00
Bin Bao	0d96c38b76	[AOTI] Skip test_buffer_mutation_and_force_mmap_weights for fbcode (#150340 ) Summary: Skip due to an older ideep version Differential Revision: D72190746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150340 Approved by: https://github.com/yushangdi	2025-04-01 13:24:21 +00:00
maajidkhann	84c21d2147	Enable SVE ACLE implementation for tanH Aten op for FP32 dType. (#143741 ) In deep learning models, the tanh (hyperbolic tangent) function is a widely used activation function, primarily in feedforward networks, recurrent neural networks (RNNs), and various other architectures. Also, the tanh (hyperbolic tangent) function is commonly used in Physics-Informed Neural Networks (PINNs). PINNs are a class of machine learning models designed to solve partial differential equations (PDEs) by incorporating the governing physics directly into the loss function, along with data-driven terms. In PINNs, activation functions like tanh are used in the neural network architecture to enable the model to learn complex mappings between inputs (such as spatial and temporal coordinates) and outputs (such as field variables). Operator: tanh() Current Implementation in OSS in ATen Backend: SVE Flow: Uses SVE sleef when available else std implementation. With this PR : SVE Flow: Uses SVE ACLE implementation. (Faster Implementation) Here are the performance improvements. Single core perf numbers: ![image](https://github.com/user-attachments/assets/c2f4bcb6-11bc-4af1-b5eb-278a4cc4a69d) Metric: CPU time avg time per iteration (In ms) As you can see with both gcc and clang compilers, we see a significant performance gain with SVE ACLE implementation over current OSS Implementation (Sleef) and also Neon. 
Hardware: m7g.8xlarge (Graviton 3 Instance) Script used in benchmarking: ```python import os #os.environ["ATEN_CPU_CAPABILITY"] = "default" os.environ["ATEN_CPU_CAPABILITY"] = "sve256" import torch import torch.nn as nn #Set the random seed for reproducibility torch.manual_seed(1) #Create a tensor of shape (8521, 50) x = torch.randn(8521, 50) for i in range(10): output = x.tanh() #Perform the tanh operation 1000 times and profile the performance print("### CPU tanh") with torch.autograd.profiler.profile(record_shapes=True) as prof: for i in range(1000): output = x.tanh() #Print the profiling results sorted by self CPU time print(prof.key_averages().table(sort_by="self_cpu_time_total")) #Optionally print the final output (if needed, uncomment the following line) print(output) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143741 Approved by: https://github.com/malfet	2025-04-01 11:54:58 +00:00
yucai-intel	bf4814eb6a	[Intel GPU] Allow XPU backend in Quantize operators (#150288 ) This modification is to support torch.quantize_per_channel() on XPU, otherwise it will cause a segmentation fault. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150288 Approved by: https://github.com/jerryzh168, https://github.com/guangyey	2025-04-01 11:27:26 +00:00
Xuehai Pan	a10b765bf1	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-04-01 10:40:43 +00:00
Prajesh Praveen Anchalia	48e9ffc873	Unify on dynamo_compile as the overall wait counter (#150293 ) Summary: dynamo_compile has, for the most part, been accounting for compile time except autotuning. all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect. Add autotuning to dynamo and deprecate the all_compilation_types counter. Differential Revision: D72145447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293 Approved by: https://github.com/masnesral, https://github.com/jamesjwu	2025-04-01 08:55:51 +00:00
FFFrog	36f2d0aaba	Add "xpu" to __all__ for torch/version.py (#149695 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149695 Approved by: https://github.com/desertfire, https://github.com/guangyey	2025-04-01 08:44:51 +00:00
Natalia Gimelshein	1700599266	Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129 ) Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129 Approved by: https://github.com/xw285cornell	2025-04-01 05:36:43 +00:00
Natalia Gimelshein	414b9ae016	enable out variant of 2-shot reduction (#150153 ) Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153 Approved by: https://github.com/xw285cornell	2025-04-01 05:36:04 +00:00
Tugsbayasgalan Manlaibaatar	7e7e5698cc	Suppress more warnings (#149833 ) Differential Revision: [D71702307](https://our.internmc.facebook.com/intern/diff/D71702307) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149833 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-04-01 05:33:04 +00:00
William Wen	790d459f85	[dynamo] add error message for unsupported LOAD_BUILD_CLASS (#150323 ) Improved error message for https://github.com/pytorch/pytorch/issues/128942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150323 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-04-01 05:03:50 +00:00
Stonepia	ce52674b76	[Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863 ) We found that the `pip install cmake` and `conda install cmake` has different behavior. The reason is that the pip installed one doesn't find the corresponding libs under conda env. So we need to set the `CMAKE_PREFIX_PATH` for alignment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863 Approved by: https://github.com/CuiYifeng, https://github.com/malfet Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>	2025-04-01 04:43:11 +00:00
Phillip Liu	31634b8c6a	[fr] Added protection against missing stack frames in fr cont. (#150133 ) Summary: Previously we had D70358287, which didn't fully resolve the issue. Test Plan: # FR `buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks` Confirm no error # FR analyzer `buck2 run @//mode/opt //investigations/dr_patternson/analyzers/ai_observability:ai_observability-all-analyzers-cli -- flight_recorder_analyzer --mast_job_name f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0` Confirm no error Differential Revision: D71998980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150133 Approved by: https://github.com/fduwjj	2025-04-01 03:07:59 +00:00
Nikita Shulga	827b730f4e	[CI] Skip test_copy_large_tensor on M2-15 runners (#150377 ) They have more than 12Gb memory, but running this test may be causing OOM in CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377 Approved by: https://github.com/atalman	2025-04-01 02:33:43 +00:00
Nikita Shulga	6470b373c1	`torch.backends.mkldnn.flags()` CM should not warn (#150358 ) By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined. Added regression test. Fixes https://github.com/pytorch/pytorch/issues/149829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358 Approved by: https://github.com/atalman	2025-04-01 01:33:40 +00:00
Sun, Jiayi	5cb5675f13	[Inductor] optimize the heuristics of parallel reduction (#149614 ) Fix https://github.com/pytorch/pytorch/issues/148639. Summary: Optimize the heuristics of parallel reduction: When the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. A timm model poolformer_m36 BF16 has about 25% performance improvement, and no performance regression is seen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-04-01 01:31:00 +00:00
Zhang, Jianyi	0f12951fc2	[Intel gpu] always set deterministic for xpu accuracy test (#149028 ) On Intel Max 1550, models like Super_SloMo can actually pass accuracy test after set deterministic, because we do not use atomic in upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic on these models is just avoiding errors. We should use warn_only = True. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028 Approved by: https://github.com/guangyey, https://github.com/desertfire Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-04-01 01:00:11 +00:00
Nikita Shulga	7ab8532cf1	[BE] Get rid of cross-compile and x86 build options for Mac (#150362 ) As both cross-compilation and x86 builds have been removed a while back. Remove stale TODO about building with OpenMP support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150362 Approved by: https://github.com/atalman, https://github.com/clee2000	2025-04-01 00:45:24 +00:00
Joshua Hamilton	4ce0b959ff	Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261 ) Fixes #143071 Operations performed on tensors with `requires_grad=True` such as ```python import torch x = torch.tensor(2.0, requires_grad=True) y = x ** 3 ``` and ```python x = torch.tensor(2.0, requires_grad=True) y = torch.pow(x,3) ``` are valid operations. While an operation using `numpy` like ```python import numpy as np x = torch.tensor(2.0, requires_grad=True) y = np.pow(x,3) # > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead. ``` leads to an error. However, an operation that uses `math` like ```python import math x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) ``` does not cause an error, and `y` is no longer a tensor with a gradient! This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models. To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with: ```python x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) # > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior. # Consider using tensor.detach() first. ``` Please let me know if you have any questions 👍 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261 Approved by: https://github.com/malfet Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-01 00:42:46 +00:00
Jack Taylor	49b7d0d84d	[ROCm] Enable more inductor UTs (#149513 ) Primarily enable inductor fp8 tests, also enable other inductor tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513 Approved by: https://github.com/jeffdaily	2025-04-01 00:30:36 +00:00
Nikita Shulga	c75dac5f5c	Fix typo (#150363 ) Fixes https://github.com/pytorch/pytorch/issues/150339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150363 Approved by: https://github.com/atalman, https://github.com/kwen2501	2025-03-31 23:58:37 +00:00
Davide Italiano	b48505a8a1	[MPS] Add support for hermite_polynomial_h. (#150279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-03-31 23:30:19 +00:00
Mu-Chu Lee	a2070e2fd5	[AOTInductor] Free tensors in test (#150274 ) Summary: This PR frees tensors that were new-ed within the test itself to prevent a memory leak. Test Plan: Fixing the tests themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274 Approved by: https://github.com/chenyang78	2025-03-31 23:28:13 +00:00
Shiyan Deng	982a7f7db0	[cachinghostallocator] remove the check on cudaHostRegister path (#150070 ) Summary: In the cudaHostAlloc path, the flag we used is `cudaHostAllocDefault` [0], which doesn't really have this strict enforcement (the devicePtr retrieved from `cudaHostGetDevicePointer()` points to the same addr as the hostPtr) according to the guide [1]. This diff removes the check so that the host register path works for ROCm. [0]`6aca002d82/aten/src/ATen/cuda/CachingHostAllocator.cpp (L97)` [1] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902 Test Plan: test_pinned_memory_with_cudaregister tests Differential Revision: D71932562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150070 Approved by: https://github.com/jeffdaily	2025-03-31 23:23:05 +00:00
PaulZhang12	981048854d	Merge Triton ScaledMM as epilogue to MM template (#150045 ) Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor. Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done. TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045 Approved by: https://github.com/drisspg	2025-03-31 23:20:14 +00:00
Nikita Shulga	91666eef60	Update gloo submodule (#150320 ) That updates its CMake minimum version(via https://github.com/facebookincubator/gloo/pull/424 ) and removes cmake-4.0.0 workarounds for gloo Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320 Approved by: https://github.com/atalman	2025-03-31 22:40:27 +00:00
PyTorch MergeBot	1526ff955e	Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261 )" This reverts commit 515b45e5693dbf9dd58d8472806cbe5f49e43074. Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/clee2000 due to failing internal tests D72135661 ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2767531682))	2025-03-31 22:19:08 +00:00
Faa Diallo	423e4a4568	[ROCm] cmake 4 workaround for hiprtc (#150324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324 Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet	2025-03-31 21:55:53 +00:00
Ethan Wee	4e2997db73	[ROCm][CI] Increase wheel build timeout from 210 to 240 (#150221 ) Fixes #150046. Increasing the timeout from 210 to 240. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150221 Approved by: https://github.com/jeffdaily	2025-03-31 21:46:09 +00:00
Pian Pawakapan	925fd4aa2e	[export] min/max ranges for dim hints (#149590 ) Differential Revision: D71522032 Adds min/max ranges to Dim.AUTO/DYNAMIC/STATIC, so users can do `Dim.AUTO(min=2, max=2048)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149590 Approved by: https://github.com/tugsbayasgalan	2025-03-31 21:32:20 +00:00
Eli Uriegas	dfcd98e684	cd: Fix naming for windows arm64 libtorch builds (#150310 ) Apparently the magical incantation to name these correctly lies in the build_variant variable otherwise it silently does nothing. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/150310 Approved by: https://github.com/atalman	2025-03-31 20:12:03 +00:00
Matthew Haddock	80b7f6b704	Adjust TestInductorOpInfo to depend on backend, not device (#146911 ) As is the case with many inductor tests, this test adapts test criteria based on device type, where it should be adjusting for the backend registered for that device. In this particular case, using the upstream triton CPU backend would lead to failures, as reference_in_float would be true as this is required for the C++/OpenMP backend which does not have float16 support. However most triton backends do, and as such should be tested in float16. Similarly a triton backend with a device not described as a GPU would get skipped from testing entirely. A more generic solution would be ideal, but this would require a lot of work across many tests. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146911 Approved by: https://github.com/masnesral	2025-03-31 18:24:16 +00:00
Aleksei Nikiforov	ab342d3793	Make PyTorch buildable by CMake-4.x on s390x (#150294 ) This is a continuation of https://github.com/pytorch/pytorch/pull/150203 that fixes nightly build on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150294 Approved by: https://github.com/malfet	2025-03-31 18:10:02 +00:00
angelayi	5e34758cef	[invoke_subgraph] Support unbacked (#149298 ) Differential Revision: [D71420641](https://our.internmc.facebook.com/intern/diff/D71420641) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149298 Approved by: https://github.com/zou3519	2025-03-31 17:25:09 +00:00
Pian Pawakapan	284b766898	[dynamic shapes] C++ bindings for guard_or_false/true (#150148 ) C++ version. Would like to add it in one place to prove it works, but couldn't find one that doesn't expose a chain of data-dependent changes... so just gonna put up the base implementation Pull Request resolved: https://github.com/pytorch/pytorch/pull/150148 Approved by: https://github.com/laithsakka, https://github.com/jingsh	2025-03-31 17:04:25 +00:00
Prachi Gupta	47cdad2995	[ROCm] Enable several fsdp related UTs (#149369 ) Enabling 26 UTs for ROCm in the following files: - distributed._shard.sharded_optim.test_sharded_optim - 2 UTs - distributed._shard.sharded_tensor.ops.test_binary_cmp - 4 UTs - distributed._shard.sharded_tensor.ops.test_init - 3 UTs - distributed._shard.sharded_tensor.ops.test_embedding - 2 UTs - distributed._shard.sharded_tensor.ops.test_embedding_bag - 2 UTs - distributed._composable.test_replicate_with_compiler - 4 UTs - distributed._composable.fsdp.test_fully_shard_grad_scaler - 1 UTs - distributed.tensor.test_attention - 4 UTs - distributed.tensor.test_matrix_ops - 1 UTs - distributed.tensor.test_tensor_ops - 1 UTs - distributed.fsdp.test_fsdp_grad_acc - 2 UTs Pull Request resolved: https://github.com/pytorch/pytorch/pull/149369 Approved by: https://github.com/jeffdaily	2025-03-31 16:15:57 +00:00
PyTorch MergeBot	7c858066ae	Revert "Enable TMA persistent GEMM Template by default (#149427 )" This reverts commit b8ef642f04874e13a9f2771902ddb7514f294015. Reverted https://github.com/pytorch/pytorch/pull/149427 on behalf of https://github.com/clee2000 due to failing tests internally D72116141 ([comment](https://github.com/pytorch/pytorch/pull/149427#issuecomment-2766672200))	2025-03-31 15:58:34 +00:00
PyTorch MergeBot	57fa99c5c3	Revert "enable out variant of 2-shot reduction (#150153 )" This reverts commit cdeb32d2d1c31b60c65133e83510977c5c180005. Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))	2025-03-31 15:43:24 +00:00
PyTorch MergeBot	e57fa18b40	Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129 )" This reverts commit 8a872261dcb3797557d1965af6832677a77efec1. Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))	2025-03-31 15:37:54 +00:00
Wang, Chuanqi	f74d5d576a	Update torch-xpu-ops commit pin to 3ee2bd2 (#150300 ) Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](`3ee2bd2f13`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150300 Approved by: https://github.com/EikanWang	2025-03-31 13:36:11 +00:00
Yichen Yan	bbb9b2476b	Unify use of `enableCollectiveHashDebug_` and trivial updates (#142865 ) Use `enableCollectiveHashDebug_` instead of checking env ad-hoc when `TORCH_DISTRIBUTED_DEBUG = DETAIL` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142865 Approved by: https://github.com/fegin, https://github.com/kwen2501	2025-03-31 12:23:30 +00:00
Ethan Wee	c158eac0de	[ROCm] use correct workspace for hipblaslt, silence warning (#150227 ) Follow up to #145130. That PR caused a warning on ROCm the first time hipblaslt was called for any workload, always. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/150227 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-31 09:49:43 +00:00
LifengWang	51f0403f46	Update the baseline for max_autotune ci workflow (#149107 ) Since the issue https://github.com/pytorch/pytorch/issues/148535 is fixed in PR https://github.com/pytorch/pytorch/pull/148923, update the baseline for max_autotune ci workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149107 Approved by: https://github.com/chuanqi129, https://github.com/leslie-fang-intel, https://github.com/desertfire	2025-03-31 09:45:44 +00:00
Kavya Govindarajan	4aded85e79	Fix space typo in warning message (#143473 ) Warning shows up like this (no space between willbe): ``` /home/xxx/.local/lib/python3.11/site-packages/torch/distributed/fsdp/_state_dict_utils.py:827: UserWarning: When using ``NO_SHARD`` for ``ShardingStrategy``, full_state_dict willbe returned. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143473 Approved by: https://github.com/mikaylagawarecki, https://github.com/kwen2501	2025-03-31 07:38:02 +00:00
Matthew Hoffman	c976321541	Use variadic length tuple for `torch.masked.DimOrDims` (#149870 ) `tuple[int]` means only a tuple of length 1, which is not what was intended. ```python loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2)) # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870 Approved by: https://github.com/Skylion007	2025-03-31 07:06:58 +00:00
Vlad K	f1b74037b1	Fix bug when Inductor include path contains spaces (#148271 ) This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`. I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces). There is another place in the function that also deals with escaping spaces. My fix follows the same style. `0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-03-31 06:46:05 +00:00
Youseok Yang	b99e0c5412	Fix mtia_extension.cpp setDevice() to correctly set current_device (#149398 ) We referred to this code and found that there was a minor bug. Fix for future reference for others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149398 Approved by: https://github.com/janeyx99	2025-03-31 06:07:22 +00:00
Yuanhao Ji	4f14224dc8	[Inductor] Fix `torch.polygamma()` when n == 1 (#147453 ) Fixes #147450 Be consistent with cpu kernel: `77dbd28535/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L433-L444)` Got this in the case: ``` Eager: tensor([1.2914e+15]), dtype: torch.float32 Compile: tensor([1.2914e+15]), dtype: torch.float32 Expected: tensor([6.5808e+32], dtype=torch.float64), dtype: torch.float64 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147453 Approved by: https://github.com/eellison	2025-03-31 05:27:46 +00:00
fduwjj	9456738edf	[c10d][fr] Allow multiple writer registration with warnings (#150232 ) The life span of writer is actually the whole program which is sub-optimal but it is a practical compromise so that the registration of writer can happen outside PG creation. So we decide to allow multiple writer registrations with warnings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150232 Approved by: https://github.com/d4l3k, https://github.com/kwen2501	2025-03-31 04:43:43 +00:00
redwrasse	ad54b3aae2	test 0-dim squeeze in basic.TestSqueeze (#147928 ) Replace TODO with 0-dim squeeze, checks scalar is unchanged in `basic.TestSqueeze` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147928 Approved by: https://github.com/janeyx99	2025-03-31 04:35:16 +00:00
Luca Arnaboldi	c3bb174bb2	SubsetRandomSampler - changed iteration over tensor to iteration over list (#149126 ) Digging further the problem at https://github.com/UKPLab/sentence-transformers/pull/3261, it boils down to this expensive loop over a torch tensor. Looping over a list, like in RandomSampler, solves the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149126 Approved by: https://github.com/divyanshk, https://github.com/cyyever	2025-03-31 04:33:35 +00:00
dscamiss	59abb8c7a2	Fix documentation build errors caused by unsupported section titles (#150205 ) Fixes #150134 Build with `make html` looks OK now: ```shell reading sources... [100%] torch.compiler_get_started .. xpu looking for now-outdated files... none found pickling environment... done checking consistency... done preparing documents... done writing output... [ 80%] generated/torch.nn.Softsign .. generated/torch.nn.modules.module.register_module_full_backward_writing output... [ 86%] generated/torch.nn.modules.module.register_module_module_registration_hook .. generated/torch.rwriting output... [100%] generated/torch.xpu.get_rng_state .. xpu generating indices... genindex done highlighting module code... [100%] typing writing additional pages... search done copying images... [100%] _static/img/torch_cuda_memory/allocator_state_history.png copying static files... done copying extra files... done dumping search index in English (code: en)... done dumping object inventory... done build succeeded. The HTML pages are in build/html. ``` New rendering looks like this: ![image](https://github.com/user-attachments/assets/af7e23a5-9dfd-4cb6-9333-a9e8cfe47ea0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150205 Approved by: https://github.com/albanD	2025-03-31 04:27:44 +00:00
Yuanhao Ji	32afecff8b	[PrivateUse1] Impl `isBuilt()` and `isAvailable()` (#149594 ) Follow-up: #146098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149594 Approved by: https://github.com/albanD	2025-03-31 04:18:38 +00:00
jj hunt	46c8f2e965	Update docstring to match code. (#148455 ) Very tiny fix to doc string. Pass grid_size=None results in an Exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148455 Approved by: https://github.com/mikaylagawarecki	2025-03-31 04:16:11 +00:00
Nichols A. Romero	ca2ffc23ab	[ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142 ) Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details: - Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig. - The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB. - Offline tuning does not support some tensor shapes. Emit warning and skip tuning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-31 04:12:08 +00:00
Daniel Vega-Myhre	157bff22f7	[Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node (#149946 ) Fixes #149876 ## Stack - [previous PR in stack] https://github.com/pytorch/pytorch/pull/149247 ## TL;DR This PR implements support in async TP for saving the reduce-scatter result for backward, which previously would break the torchtitan AC policies: no AC, per op SAC, and per layer SAC. ## Context In torchtitan's LLama3 per op SAC policy, we want to save the output of `reduce_scatter` ops for backward, which is useful for TP. The reduce_scatter op is also saved for No AC (since all activations are saved) and per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP. However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons: 1) The graph pattern matching specifically only matches on reduce scatter nodes with 1 user, but reduce_scatter nodes saved for backwards will have 2 users (the 2nd one being the return/output node, which saves it for backward). 2) The subgraph replacement logic which replaces the users of the `wait_tensor` after the reduce-scatter with the new fused node has no mechanism to save the fused_node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node). To fix this, we do 2 things: 1) Add additional pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when reduce-scatter is saved for backward. 2) When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. 
This enables us to properly erase the subgraph and prevent the memory leak which occurred in #149876 ## Other changes - Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm). ## Test plan 1. All unit tests are passing 2. Visualized the graphs and verified the fusion is occurring properly. 3. Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149946 Approved by: https://github.com/fegin	2025-03-30 19:05:47 +00:00
James Wu	cbc0964636	Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054 ) This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry. Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable. Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle. Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache. The upsides of this are many: - We no longer need to call into a separate process on cache hit - We can guarantee that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic - Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic. Fixes #149449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054 Approved by: https://github.com/oulgen	2025-03-30 17:51:11 +00:00
Aaron Gokaslan	e91f84c87d	[BE]: Update cudnn frontend submodule to 1.11.0 (#149759 ) Update CUDNN frontend submodule to 1.11.0. Adds some new features like score_mod from flex_attention and adds a lot of bugfixes and new feature knobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149759 Approved by: https://github.com/jansel	2025-03-30 17:14:26 +00:00
Joshua Hamilton	515b45e569	Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261 ) Fixes #143071 Operations performed on tensors with `requires_grad=True` such as ```python import torch x = torch.tensor(2.0, requires_grad=True) y = x ** 3 ``` and ```python x = torch.tensor(2.0, requires_grad=True) y = torch.pow(x,3) ``` are valid operations. While an operation using `numpy` like ```python import numpy as np x = torch.tensor(2.0, requires_grad=True) y = np.pow(x,3) # > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead. ``` leads to an error. However, an operation that uses `math` like ```python import math x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) ``` does not cause an error, and `y` is no longer a tensor with a gradient! This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models. To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with: ```python x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) # > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior. # Consider using tensor.detach() first. ``` Please let me know if you have any questions 👍 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-03-30 11:19:07 +00:00
Nikita Shulga	e8a11f175e	[BE] Use `auto` in MPS codebase more (#150000 ) Non-trivial (but still a no-op changes): - Replace `[mpsGraph broadcastTensor:[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32] toShape:inputTensor.shape name:nil]` with `[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32 shape:inputTensor.shape]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150000 Approved by: https://github.com/dcci, https://github.com/cyyever	2025-03-30 05:35:58 +00:00
Prajesh Praveen Anchalia	005c9b2f4f	Fix _Waitcounter decorator and dd backward pass wait counter (#150235 ) Summary: This will log a wait counter for backward compile and fixes weirdness with nested context managers. Since the old wait counters added through dynamo_timed were never created with the nesting issue. I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`. We want to use the same nomenclature, to make it easy to find keys. Reviewed By: jamesjwu Differential Revision: D72032055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235 Approved by: https://github.com/jamesjwu, https://github.com/masnesral	2025-03-30 05:20:12 +00:00
Shangdi Yu	cc58ecceea	Move dump location to avoid dumping twice (#150219 ) Summary: If we put the dumping code in codegen, we might get a separate node_mapping dump for the constant folded graph (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L1119). We move it into compile_fx.py so there's only one node_mapping dump. Test Plan: CI Reviewed By: YUNQIUGUO Differential Revision: D72068715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150219 Approved by: https://github.com/YUNQIUGUO	2025-03-30 03:35:38 +00:00
Horace He	3140565db6	Update type of `create_block_mask` to more accurately reflect things (#150244 ) Fixes some mypy issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/150244 Approved by: https://github.com/drisspg	2025-03-29 21:55:57 +00:00
sanshang	879a293db8	fix et trace collection of all_to_all (#149485 ) ![image](https://github.com/user-attachments/assets/1e602dec-24a4-4f47-88c0-9311737e217b) ![image](https://github.com/user-attachments/assets/c48a3273-43fb-4a7f-9341-b90cb6b10785) fix ET trace collection to all_to_all. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149485 Approved by: https://github.com/shengfukevin, https://github.com/kwen2501	2025-03-29 20:17:24 +00:00
Nikita Shulga	965784eb9b	[MPSInductor] Specify `max_total_threads_per_threadgroup` (#150247 ) When generating reduction kernel, otherwise compiler can unroll loops too much that kernel could not be launched for the intended threadgroup size Extend `c10::metal::max` to accept different dtypes Together this fixes `test_large_broadcast_reduction` TODO: - Explore different threadgroup_sizes for best perf Pull Request resolved: https://github.com/pytorch/pytorch/pull/150247 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #150246	2025-03-29 19:37:15 +00:00
Nikita Shulga	52135db69a	[BE] Fix signed/unsigned comparison warning (#150246 ) One will see them only if compilation fails, but still Pull Request resolved: https://github.com/pytorch/pytorch/pull/150246 Approved by: https://github.com/cyyever, https://github.com/jansel	2025-03-29 15:12:42 +00:00
PyTorch MergeBot	3b00ff8850	Revert "[Profiler] Give non-zero default values to start events (#149757 )" This reverts commit bc72420bcb37390af3fced885e019903e6e425bd. Reverted https://github.com/pytorch/pytorch/pull/149757 on behalf of https://github.com/malfet due to Broke windows builds, which were also the signal on the HUD ([comment](https://github.com/pytorch/pytorch/pull/149757#issuecomment-2763461365))	2025-03-29 15:08:55 +00:00
Irshad CC	f3c77b2458	Set requires grad in TensorMaker::make_tensor() (#148255 ) Fixes #146419 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255 Approved by: https://github.com/soulitzer	2025-03-29 08:06:42 +00:00
PaulZhang12	b8ef642f04	Enable TMA persistent GEMM Template by default (#149427 ) Previously, this was unable to be landed given there was limited H100 for CI testing. Benchmarking on H100 CI looks good now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149427 Approved by: https://github.com/drisspg	2025-03-29 07:32:42 +00:00
Max Calman	bc72420bcb	[Profiler] Give non-zero default values to start events (#149757 ) The intent of the existing code is to > // Assign system TIDs to start events based on the system TID of the next // observed event with the same Python TID. However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device. This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference. This issue was discovered while working to implement a custom backend and some CPU start events were appearing on the same process and thread as the device in the trace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757 Approved by: https://github.com/sraikund16	2025-03-29 06:29:25 +00:00
Michał Górny	ec6fa547a1	Remove unnecessary "special linking" for `BLAS_LIBRARIES` (#145487 ) Remove the "special linking" that involves listing `BLAS_LIBRARIES` thrice if `TH_BINARY_BUILD` is set, as it should not be any different from listing it just once. The code seems to date back to commit cfcf2af95f91a88ec61cbcac8b30a718e7332aa5. The original code already listed `BLAS_LIBRARIES` thrice, but it provided no explanation for doing that — and without `TH_BINARY_BUILD`, BLAS was not linked at all. The current version seems to originate in d6a8d28d6529a4f0b80a8c046ca9c36ca6c8b347 — and it already provided an `ELSE` clause listing `BLAS_LIBRARIES` only once. From this, I suspect that it is probably an unnecessary leftover. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145487 Approved by: https://github.com/malfet	2025-03-29 05:13:22 +00:00
Jane Xu	2c9e07ecd2	[BE] Remove outdated RPC benchmark (#146716 ) We have lots of outdated unused + uncalled code in our codebase, namely in our benchmarks and examples folders among others. The last change to this directory was 4 years ago and this code looks dead. cc @albanD @H-Huang for feedback Pull Request resolved: https://github.com/pytorch/pytorch/pull/146716 Approved by: https://github.com/Skylion007, https://github.com/H-Huang	2025-03-29 04:44:36 +00:00
Bryce Ferenczi	beea76020b	Removed ROCM ifdef that governs thread count + smem parallel reduction. (#149779 ) #149548 Fixed the arbitrarily missing parallelism for NLL, but they also added an arbitrary #ifdef ROCM guard around this fix to prevent its use on CUDA gpus. There is also a problem with the way the kernel does the reduction from the intermediate shared memory, using only thread 0 walking linearly. This has been changed to a simple parallel reduction algorithm. Tested changes with `python3 test/test_nn.py` ``` Ran 3551 tests in 200.554s OK (skipped=998, expected failures=4) ``` Performance before and after with the script below with an RTX 3090, batch size x axis, time (sec) y axis. This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples. ## Before ![before_nll](https://github.com/user-attachments/assets/c19044aa-7bc2-4223-b560-9be7acedef35) ## After ifdef removal ![after_nll](https://github.com/user-attachments/assets/4672f5ca-93b0-4c34-a257-81b2ab364995) ## After Parallel SMEM reduction ![after_reduction](https://github.com/user-attachments/assets/9607b68c-7d9d-4ee0-9f99-8989d134e4fd) ```python import torch from matplotlib import pyplot as plt from torch.nn import functional as F timing = [] batches= list(range(32, 4096, 32)) for batch in [32] + batches: samples = [] for _ in range(100): probs = torch.rand(batch, 10).cuda() labels = torch.randint(0, 10, (batch,)).cuda() start = torch.cuda.Event(enable_timing=True) end = torch.cuda.Event(enable_timing=True) start.record() F.nll_loss(probs, labels) end.record() torch.cuda.synchronize() elapsed = start.elapsed_time(end) samples.append(elapsed) timing.append(sum(samples) / len(samples)) timing = timing[1:] plt.plot(batches, timing) plt.show() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149779 Approved by: https://github.com/jeffdaily	2025-03-29 04:27:54 +00:00
Eddie Yan	a8dd9b6c27	[cuDNN][SDPA] abide by `enable_gqa` convention in cuDNN (#149976 ) long overdue Pull Request resolved: https://github.com/pytorch/pytorch/pull/149976 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2025-03-29 04:24:51 +00:00
Thanh Ha	340beb7f7c	Add .editorconfig (#149193 ) This adds an .editorconfig file to automatically configure devs local Editors / IDEs with the basic formatting rules of the project. List of supported editors: https://editorconfig.org/#pre-installed Pull Request resolved: https://github.com/pytorch/pytorch/pull/149193 Approved by: https://github.com/malfet	2025-03-29 04:07:21 +00:00
fzyzcjy	66a7a49d64	Super tiny fix typo (#149190 ) ... when checking the doc to build from source Pull Request resolved: https://github.com/pytorch/pytorch/pull/149190 Approved by: https://github.com/jingsh	2025-03-29 04:06:05 +00:00
Shangdi Yu	5e787bf3e5	[reland] Support torchbind in OSS proxy executor (#150196 ) Summary: The original Diff D69500038 is reverted due to a false alarm on trunk health. Implement torchbind support in OSSProxyExecutor. Exactly the same as the implementation in FbProxyExecutor. D69693697 - fbProxyExecutor D69887230 - fbProxyExecutor but for torchbind method D70746626 - Support None output type Other changes: - When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`). - In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally (more details in internal Diff summary). Note on using `filesystem`: Seems like there'll be [issues](https://github.com/pytorch/pytorch/pull/137209) with using the `filesystem` header in linux, so here I use string manipulation instead of `filesystem::path`. Test Plan: ``` test/inductor:torchbind -- -r torchbind_aoti test/inductor:torchbind -- -r aot_compile ``` Differential Revision: D72063691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196 Approved by: https://github.com/hl475, https://github.com/desertfire	2025-03-29 03:36:55 +00:00
Mandar Deshpande	0861af2596	[pytorch][triton] Warp specialization support in TritonTemplate for torchinductor (#148503 ) (#150122 ) Summary: Currently only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`. To allow warp specialization, kernel options should allow specifying `num_consumer_groups` and `num_buffers_warp_spec` as well. NOTE: Currently gating changes to FBCODE using HAS_WARP_SPEC which is only available on triton/release-3.3.x Test Plan: ## Unit test Added tests for `test_triton_template_warp_specialization` to verify generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`. ## Functional Testing Specific to flexattention. ``` import torch from torch.nn.attention.flex_attention import flex_attention from triton.testing import do_bench make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16) q, k, v = make_tensor(), make_tensor(), make_tensor() flex_compiled = torch.compile(flex_attention, fullgraph=True) print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4}))) ``` triton do_bench results: - default compile: 15.176783561706543 - with warp-spec: 9.452800750732422 ## Extra notes - generated triton kernel using `TORCH_LOGS=output_code`: P1740612877 - TTGIR for fused kernel: P1740614685 Differential Revision: D71982587 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150122 Approved by: https://github.com/eellison, https://github.com/zou3519, https://github.com/jansel	2025-03-29 03:36:50 +00:00
Mu-Chu Lee	03313c6619	[AOTInductor] Add function for users to extract constants in container (#150163 ) Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor Test Plan: `python test/inductor/test_aot_inductor.py -k extract_constants_map` `LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference` Differential Revision: D72020400 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163 Approved by: https://github.com/chenyang78	2025-03-29 03:36:12 +00:00
Nichols A. Romero	7a470c9320	[ROCm] change preferred blas lib defaults (#150212 ) Fixes #148883 Fixes #150155 Also adds at::BlasBackend::Default. Instinct cards prefer hipBLASLt, everything else prefers rocBLAS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212 Approved by: https://github.com/jeffdaily	2025-03-29 03:33:07 +00:00
Tristan Rice	29b3fdab01	TCPStoreLibUvBackend: support masterListenFd (#150215 ) This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine. This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend. Test plan: ``` pytest -v test/distributed/test_store.py -k test_take_over_listen_socket ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215 Approved by: https://github.com/fduwjj	2025-03-29 01:58:07 +00:00
Nikita Shulga	493c7fa66f	[Cmake] Make PyTorch buildable by CMake-4.x (#150203 ) By turning on compatibility mode for protobuf, nnpack, PSimd and FP16, ittapi, TensorPipe and Gloo Update CMake requirements Revert 0ece461ccafe5649d2d0f058ff5477765fd56499 and b0901d62ae2c2e909f91401eacebf3731df20cbe to test that it actually works TODO: - Update/get rid of those libraries Fixes https://github.com/pytorch/pytorch/issues/150149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150203 Approved by: https://github.com/clee2000	2025-03-29 01:39:13 +00:00
Nikita Shulga	edb6f1b7a8	Move MacOS inductor tests to M2-15 runner (#150228 ) To get more representative results (and be able to run more tests eventually) Also get pull_request for workflow dispatch if yml file is modified Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228 Approved by: https://github.com/clee2000	2025-03-29 01:36:07 +00:00
Jeff Daily	65139eb050	if blaslt fails, fall back to blas (#150147 ) Fixes #150016. This is implemented for both cublaslt and hipblaslt. gemm_and_bias on failure will fall back to unfused path. lt gemm on failure falls back to gemm even if gemm preference is set to lt. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147 Approved by: https://github.com/malfet	2025-03-28 23:39:53 +00:00
PyTorch MergeBot	ccfde4dadf	Revert "Move MacOS inductor tests to M2-15 runner (#150228 )" This reverts commit b1b58708b26a840f6bf0ccdd14a9916ff7291fb4. Reverted https://github.com/pytorch/pytorch/pull/150228 on behalf of https://github.com/malfet due to Should not have ignored lint signal ([comment](https://github.com/pytorch/pytorch/pull/150228#issuecomment-2762794366))	2025-03-28 23:05:27 +00:00
Nikita Shulga	b1b58708b2	Move MacOS inductor tests to M2-15 runner (#150228 ) To get more representative results (and be able to run more tests eventually) Also get pull_request for workflow dispatch if yml file is modified Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228 Approved by: https://github.com/clee2000	2025-03-28 22:15:40 +00:00
PyTorch MergeBot	7ac0658757	Revert "[CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220 )" This reverts commit 87549a65c96cd7e48f024c02e7daa3f227b2bf18. Reverted https://github.com/pytorch/pytorch/pull/150220 on behalf of https://github.com/clee2000 due to doesn't solve the problem since the installed cmake 4 stays on the system, resulting in failed pytorch builds later ([comment](https://github.com/pytorch/pytorch/pull/150220#issuecomment-2762623078))	2025-03-28 21:44:03 +00:00
Zain Rizvi	4271ebdbdc	Explicitly state that a test-infra branch cut is required (#150214 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150214 Approved by: https://github.com/atalman ghstack dependencies: #150210, #150211, #150213	2025-03-28 21:13:29 +00:00
Zain Rizvi	2b2286c4ec	Update reference for binary_build workflows (#150213 ) There hasn't been a circleci for a looooong time Pull Request resolved: https://github.com/pytorch/pytorch/pull/150213 Approved by: https://github.com/atalman ghstack dependencies: #150210, #150211	2025-03-28 21:13:29 +00:00
Zain Rizvi	4118d7307f	Update referenced PRs for ecosystem library branch cut (#150211 ) The old PRs had a lot of extra changes in them which are no longer needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/150211 Approved by: https://github.com/atalman ghstack dependencies: #150210	2025-03-28 21:13:22 +00:00
Zain Rizvi	f231500c50	Mention the cherry-picker bot in the release docs (#150210 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150210 Approved by: https://github.com/atalman	2025-03-28 21:13:15 +00:00
Catherine Lee	87549a65c9	[CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220 ) Set the CMAKE_POLICY_VERSION_MINIMUM env var to make executorch and halide docker builds pass (they install from those repos which don't have cmake pinned) This can be removed if executorch and halide update their builds and we update the hash? Pull Request resolved: https://github.com/pytorch/pytorch/pull/150220 Approved by: https://github.com/atalman, https://github.com/malfet	2025-03-28 20:55:04 +00:00
zeshengzong	cb83850a24	Fix docs format error in `torch.nn` (#150156 ) Fixes #150152 Fix format error in [torch.nn.CosineSimilarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity), [torch.nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss) and other pages. ## Test Result ### Before #### torch.nn.CosineSimilarity ![Image](https://github.com/user-attachments/assets/1ad633d9-dfaf-43f0-a536-9035a24bf858) #### torch.nn.KLDivLoss ![Image](https://github.com/user-attachments/assets/20a001b0-1f66-414e-b554-11934d65a4bf) ### After #### torch.nn.CosineSimilarity ![image](https://github.com/user-attachments/assets/a2d9ea8d-5637-4604-a0e4-9231a4deee44) #### torch.nn.KLDivLoss ![image](https://github.com/user-attachments/assets/d0e319f9-a3b3-47a7-b2f8-060d46d53bc7) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150156 Approved by: https://github.com/cyyever, https://github.com/malfet	2025-03-28 20:54:09 +00:00
Nikita Shulga	7c65911b11	[MPS] Fix dot/mm for conj_tensors (#150157 ) - Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key - For matmul or dot, add `conjugateWithTensor:name:` calls before running the op - Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo - Filter `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR) - Preserve conj property when gathering the views, that fixes `cov` operator Fixes https://github.com/pytorch/pytorch/issues/148156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157 Approved by: https://github.com/dcci	2025-03-28 20:36:44 +00:00
Catherine Lee	9092dd2e82	[CI] Disable some tests that are failing in periodic (#150059 ) Disabling some tests to restore periodic nogpu avx512 timeout: `59f14d19ae (38492953496-box)` profiler failure: `7ae0ce6360 (38461255009-box)` test_accelerator failure: `87bfd66c3c (39476723746-box)` origin: 146098 test_overrides failure: `bf752c36da (39484562957-box)` origin: 146098 inductor cpu repro: `bb9c426024 (38447525659-box)` functorch eager transforms: `8f858e226b (39488068620-box)` `f2cea01f71 (39555064878)` `b5281a4a18 (39599355600)` either 148288 or 148261? `2ec9aceaeb/1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150059 Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet	2025-03-28 20:31:32 +00:00
Jeff Daily	2bd5bfa3ce	[ROCm] use magma-rocm tarball for CI/CD (#149986 ) Follow-up to #149902. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149986 Approved by: https://github.com/malfet	2025-03-28 19:28:50 +00:00
Natalia Gimelshein	cdeb32d2d1	enable out variant of 2-shot reduction (#150153 ) Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153 Approved by: https://github.com/xw285cornell	2025-03-28 19:06:03 +00:00
Wang, Chuanqi	35ff5084e6	[CI] Remove the xpu env source for linux binary validate (#150138 ) No longer needed, since the xpu runtime pypi packages are now enabled directly as dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150138 Approved by: https://github.com/atalman	2025-03-28 17:25:37 +00:00
Catherine Lee	85079e4380	[TD] Enable TD on distributed cpu (#150028 ) Enable TD on distributed cpu, I think the only reason it's not is because I forgot to enable it Get rid of some of the statements that are no ops: * asan uses default shard * nogpu got moved to periodic * no windows cuda testing anymore Only thing on pull and trunk that doesn't use TD is dynamo_wrapped but I think it's fast enough to be ok for now, we can take another look after this Pull Request resolved: https://github.com/pytorch/pytorch/pull/150028 Approved by: https://github.com/ZainRizvi	2025-03-28 17:19:11 +00:00
PyTorch MergeBot	cf7447ae99	Revert "cpp_wrapper: Fix even more tests (#147225 )" This reverts commit d25acac357ff8663a7787e57e6bc5e69987a8f9a. Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))	2025-03-28 17:07:52 +00:00
PyTorch MergeBot	e691fcae0e	Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350 )" This reverts commit 2b20d1433f4e5c7556fe4679d89b8f795990d494. Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))	2025-03-28 17:07:52 +00:00
Andrey Talman	b0901d62ae	Pin cmake to 3.31.2 for windows conda install (#150185 ) Trying to fix nightly failures. The CMake 4.0 update https://pypi.org/project/cmake/4.0.0/ broke nightly builds. You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter= This fix is for Windows builds; Linux and macOS were already fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185 Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi	2025-03-28 17:03:02 +00:00
Animesh Jain	a469ddc663	[inductor] No type promotion for slice_scatter (#150090 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150090 Approved by: https://github.com/eellison, https://github.com/zou3519 ghstack dependencies: #149087, #149667, #150036, #148953	2025-03-28 17:02:01 +00:00
Catherine Lee	1bdf996e7a	[CI] Fix log artifact not containing test logs? (#149577 ) Sometimes I would find a log artifact that only has usage_logs.txt in it, even though there are other logs created by tests. I think this is somehow caused by output buffering with find. I don't understand how, but at the very least, I can see that all the jobs on this PR have the logs from the test runs Pull Request resolved: https://github.com/pytorch/pytorch/pull/149577 Approved by: https://github.com/ZainRizvi	2025-03-28 17:00:00 +00:00
Catherine Lee	d5a8bd0688	[CI][docker] Use multistage build for triton (#149413 ) Seems to reduce docker pull times by ~3 min if triton is requested; some compressed docker sizes seem to have decreased by about 1/3. Also add a check that triton is installed/not installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149413 Approved by: https://github.com/malfet	2025-03-28 16:07:19 +00:00
Catherine Lee	0ece461cca	Pin cmake==3.31.6 (#150158 ) I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on pypi and our builds are failing with it. Example: `aa70d62041 (39555975425-box)` I guess we have to go change all the cmake_minimum_required to >=3.5? Backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least manywheel binary builds got past where they were originally failing. Also pin the conda installation, but the most recent version on conda is 3.31.2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158 Approved by: https://github.com/cyyever, https://github.com/malfet	2025-03-28 15:49:17 +00:00
Alexander Grund	350a479146	Fix test failures on non-x86 Linux (#148445 ) The cpp contexts are only supported on x86 Linux. The tests requiring them are skipped on non-Linux but not if the architecture is not x86. In most places it is checked for ARM64 which is not enough as a check for x86 is required instead. Fix the test decorators and factor out a common one in test_cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445 Approved by: https://github.com/eellison	2025-03-28 15:27:44 +00:00
Michael Lazos	d2c0c65ea1	[Dynamo] Add debug linting option for graph dedupe (#150053 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/150053 Approved by: https://github.com/StrongerXi, https://github.com/anijain2305	2025-03-28 14:27:09 +00:00
IvanKobzarev	25309a17f0	[aotd] Config to guess_tangents_stride (#150035 ) Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035 Approved by: https://github.com/ilyas409, https://github.com/seemethere	2025-03-28 13:54:19 +00:00
PyTorch MergeBot	7c4e49750e	Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054 )" This reverts commit c16af5d7984872b6ae81476d6cae64bddb7ce664. Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/jamesjwu due to Sorry I forgot to fix one last test ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2761381443))	2025-03-28 13:35:07 +00:00
James Wu	c16af5d798	Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054 ) This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry. Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable. Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle. Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache. The upsides of this are many: - We no longer need to call into a separate process on cache hit - We can guarantee that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic - Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic. Fixes #149449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054 Approved by: https://github.com/oulgen	2025-03-28 13:28:05 +00:00
Yuanhao Ji	d4da0e955e	[Dynamo] Fix `is_compile_supported()` when `device_type` contains device index (#147837 ) Fixes #147826 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147837 Approved by: https://github.com/anijain2305	2025-03-28 07:16:29 +00:00
Pian Pawakapan	103bf64a3c	[export] refactor _Dim into Dim (#149891 ) Summary: forward fix T218515233 Test Plan: test_export Differential Revision: D71769231 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149891 Approved by: https://github.com/jingsh, https://github.com/angelayi	2025-03-28 06:19:03 +00:00
bobrenjc93	f649ee73ce	Use source hashing to generate consistent symbolic ids (#149665 ) This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how.... Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized. We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665 Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka	2025-03-28 05:36:32 +00:00
Tugsbayasgalan Manlaibaatar	c49315e645	Improve attr mismatch msg (#149576 ) Differential Revision: [D71513041](https://our.internmc.facebook.com/intern/diff/D71513041) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149576 Approved by: https://github.com/avikchaudhuri	2025-03-28 05:10:56 +00:00
Daniël de Kok	fdc4394b16	Do not fetch NCCL when system NCCL is used (#149607 ) We are compiling PyTorch in a sandbox without networking. Unconditionally fetching breaks the build and is not needed when a system NCCL is used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149607 Approved by: https://github.com/malfet	2025-03-28 05:06:49 +00:00
Animesh Jain	c9ebf517c2	[dynamo][invoke_subgraph] Input aliasing and mutation check in Dynamo (#148953 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148953 Approved by: https://github.com/zou3519 ghstack dependencies: #149087, #149667, #150036	2025-03-28 03:50:07 +00:00
eellison	c18e2ce53b	Ignore meta ops in inductor (#150137 ) Fix for https://github.com/pytorch/pytorch/issues/144607 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150137 Approved by: https://github.com/BoyuanFeng	2025-03-28 03:01:57 +00:00
PyTorch MergeBot	ddb1e97839	Revert "Support torchbind in OSS proxy executor (#149747 )" This reverts commit aa70d62041c28fe35c416aa932b32ef0e4d5bc33. Reverted https://github.com/pytorch/pytorch/pull/149747 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149747#issuecomment-2760040741))	2025-03-28 02:48:02 +00:00
Colin L. Rice	2f785ab208	dynamo_compile: Log all compilation time under all_compilation_types (#149664 ) This counter is designed to include all compilation pytorch does (triton + dynamo_compile). However this wasn't including all of dynamo compilation, since it was put in at the fx_codegen_and_compile spot. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664 Approved by: https://github.com/masnesral	2025-03-28 02:27:48 +00:00
Natalia Gimelshein	8a872261dc	Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129 ) Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129 Approved by: https://github.com/xw285cornell	2025-03-28 02:14:27 +00:00
Sam Larsen	1e55b9c0b5	Fix autotune pool shutdown (#149890 ) Summary: A couple follow-ups noted in review from https://github.com/pytorch/pytorch/pull/149700: 1. Make sure we correctly signal _all_ subprocesses to shut down, even in the case where some processes are currently benchmarking. 2. Change how the pool singleton is created. That also allows us to fully initialize the object in the ctor and remove a bunch of asserts. Test Plan: existing unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/149890 Approved by: https://github.com/aorenste ghstack dependencies: #149700	2025-03-28 02:09:51 +00:00
Sam Larsen	266bd22b44	Improve subproc autotuning implementation (#149700 ) Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc. One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., `ResourceWarning: subprocess 2987680 is still running` List of changes: * Use Popen instead of spawn for the autotuning subprocess. * Introduced a new entry point `__autotune_main__.py` * Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill. * Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout. * Deprecated the unused timeout configs in `_inductor/config.py` * Moved `get_ld_library_path` helper to a common utils file. * Added more unit tests for subproc crashes / timeouts / exceptions, etc. Test plan: * New unit tests * Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly. * Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to). Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700 Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison	2025-03-28 01:06:39 +00:00
Shivam Raikundalia	8b04364914	[Easy/Profiler] Set Duration to -1 for unfinished CPU events (#150131 ) Summary: Some OSS Kineto users were requesting that we allow for 0 duration events in Kineto even though they won't be seen on the trace. To allow this we changed the handling of said events in D71510383. However this causes unfinished events in collection to never be post processed; this diff fixes said issue. Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1743102222/localhost/libkineto_activities_631490.json.gz&bucket=gpu_traces Differential Revision: D71993609 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150131 Approved by: https://github.com/chuanhaozhuge, https://github.com/xw285cornell	2025-03-28 00:29:22 +00:00
Shangdi Yu	aa70d62041	Support torchbind in OSS proxy executor (#149747 ) Summary: Implement torchbind support in OSSProxyExecutor. Exactly the same as the implementation in FbProxyExecutor. D69693697 - fbProxyExecutor D69887230 - fbProxyExecutor but for torchbind method Other changes: - When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`). - In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r torchbind_aoti buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile ``` Differential Revision: D69500038 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149747 Approved by: https://github.com/desertfire	2025-03-28 00:04:19 +00:00
Taras	d670df356c	Improve error handling when checking CUDA version in case nvcc is not found (#148671 ) Fixes: - https://github.com/pytorch/pytorch/issues/101138 Description The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration. Testing Manually tested with and without `nvcc` present in the expected path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671 Approved by: https://github.com/malfet	2025-03-27 23:04:59 +00:00
Benjamin Glass	2b20d1433f	cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350 ) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes #142005. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350 Approved by: https://github.com/desertfire ghstack dependencies: #147225	2025-03-27 23:00:01 +00:00
Nikita Shulga	ef1cb6b646	[BE] Suppress user_warnings while running opinfo tests (#150115 ) Some of the samples are constructed in a way that are expected to trigger those, but what's the point displaying them Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115 Approved by: https://github.com/dcci ghstack dependencies: #150060	2025-03-27 22:36:27 +00:00
PyTorch MergeBot	1a3bd894ff	Revert "[fbcode]Removing `@NoIntBaseDeprecated` annotation in `caffe2.thrift` file (#149742 ) (#149744 )" This reverts commit 6eac3a0068f028d03897ce38e0cfec11812591fe. Reverted https://github.com/pytorch/pytorch/pull/149744 on behalf of https://github.com/malfet due to Broke tests, see `80aa88f907/1` ([comment](https://github.com/pytorch/pytorch/pull/149744#issuecomment-2759676260))	2025-03-27 22:31:54 +00:00
eellison	4c57aec5b9	Dont exclude constant_pad_nd in prologue fusion (#149947 ) Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix for making single, contiguous dep for prologues. For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd. ``` import torch import torch.nn.functional as F torch._inductor.config.force_disable_caches = True padded_N = 2048 n_pad_rows = 100 K, N = 2048, 4096 tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16) tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16) @torch.compile(mode='max-autotune-no-cudagraphs') def masked_linear(input, weight, n_pad_input_rows): """ Linear layer with input padded by `n_pad_input_rows` rows """ # Use constant_pad_nd to pad with zeros for the invalid rows padded_input = F.pad(tensor1, (0, 0, 0, n_pad_input_rows), "constant", 0) return F.linear(padded_input, weight) # Invoke the function masked_linear(tensor1, tensor2, n_pad_rows) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947 Approved by: https://github.com/drisspg	2025-03-27 22:26:30 +00:00
PyTorch MergeBot	80aa88f907	Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054 )" This reverts commit ac91f8765ba7817a0853f0520e7f9c94768babc2. Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/yangw-dev due to This is breaking ROCM tests on trunk. hud.pytorch.org/ ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2759604301))	2025-03-27 22:15:40 +00:00
Avik Chaudhuri	21bcbbfb5e	fix range constraints for expr (#150103 ) During tracing it is possible for a `s1: VR[2, inf]` to be replaced by a `s0: VR[3, inf]` (note smaller range) by the shape env. But after export, unfortunately we'd previously record `range_constraints[s0] = VR[2, inf]` (note larger range), which is incorrect. This is because we'd map `s1.node.expr` (`s0`) to the `var_to_range` of `s1.node._expr` (`s1`) when creating `range_constraints`. The comment surrounding this code suggests this predated `bound_sympy`, but now we can do better. For users, this means that when using `Dim.DYNAMIC` previously they wouldn't get input constraints checked sufficiently, now they do (shifting errors early). Differential Revision: D71962694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150103 Approved by: https://github.com/zhxchen17	2025-03-27 22:11:39 +00:00
Keke Zhai	68414512e6	Implement aten.select.int sharding strategy (#149842 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149842 Approved by: https://github.com/XilunWu	2025-03-27 20:49:00 +00:00
Benjamin Glass	d25acac357	cpp_wrapper: Fix even more tests (#147225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225 Approved by: https://github.com/desertfire	2025-03-27 19:21:03 +00:00
Shangdi Yu	0ed0b7fa96	[aoti] Better error message when torchbind object is used as a graph input in AOTI (#149965 ) Summary: Give an explicit error when a torchbind object is used as an input to AOTI Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input ``` Differential Revision: D69490915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965 Approved by: https://github.com/desertfire	2025-03-27 18:48:55 +00:00
PyTorch MergeBot	a9d08ed0ce	Revert "Parallelize sort (#149505 )" This reverts commit 842d51500be144d53f4d046d31169e8f46c063f6. Reverted https://github.com/pytorch/pytorch/pull/149505 on behalf of https://github.com/ZainRizvi due to Reverting since this is breaking inductor builds on trunk. More details [GH job link](https://github.com/pytorch/pytorch/actions/runs/14000726218/job/39207447863) [HUD commit link](`842d51500b`) ([comment](https://github.com/pytorch/pytorch/pull/149505#issuecomment-2759082390))	2025-03-27 18:43:11 +00:00
vasiliy	01cb3519b3	wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792 ) Summary: When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes ```python c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16) ``` call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up. Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4 ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792 Approved by: https://github.com/eqy ghstack dependencies: #148791	2025-03-27 17:32:20 +00:00
vasiliy	e33bc41958	add `torch.float4_e2m1fn_x2` to PyTorch (#148791 ) Summary: Redo of https://github.com/pytorch/pytorch/pull/146578 to get around rebase conflicts. Test Plan: ``` pytest test/quantization/core/experimental/test_floatx.py -s ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/148791 Approved by: https://github.com/drisspg, https://github.com/eqy, https://github.com/jeffdaily	2025-03-27 17:32:20 +00:00
James Wu	ac91f8765b	Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054 ) This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry. Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable. Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle. Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache. The upsides of this are many: - We no longer need to call into a separate process on cache hit - We can guarantee that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic - Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic. Fixes #149449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054 Approved by: https://github.com/oulgen ghstack dependencies: #149657	2025-03-27 17:14:44 +00:00
Danfeng Wang	6eac3a0068	[fbcode]Removing `@NoIntBaseDeprecated` annotation in `caffe2.thrift` file (#149742 ) (#149744 ) Summary: To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` is added to opt-out some enums After the related customer code logic changes, we can now safely remove the annotations that were added earlier. Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums. Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)' ``` Reviewed By: ahilger Differential Revision: D71446522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744 Approved by: https://github.com/izaitsevfb, https://github.com/huydhn	2025-03-27 17:11:26 +00:00
James Wu	14f0cd7630	[StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657 ) Triton does some special handling when requesting more than 48 KB of shared memory: specifically it queries the device for maximum device memory, then sets the maximum amount of dynamic memory to be the difference between static and dynamic memory. See corresponding implementation in triton land here: https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L128-L143 Test plan: - New unit test requesting more than 48 KB of memory Pull Request resolved: https://github.com/pytorch/pytorch/pull/149657 Approved by: https://github.com/jansel	2025-03-27 17:00:18 +00:00
Ankita George	85e4e51a7d	Fix bug in _load_state_dict_from_keys method (#150058 ) Summary: The _load_state_dict_from_keys method specifies that `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.` But this isn't happening right now, because an empty keys arg is passed in as a set() to `_load_state_dict` and keys is expected to be None for it to actually be included in the state_dict https://fburl.com/code/l8yzojyx. So with the set() argument, the state_dict is always going to be empty Test Plan: ensure existing tests pass Differential Revision: D71930712 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058 Approved by: https://github.com/saumishr	2025-03-27 16:36:00 +00:00
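The empty-set versus `None` distinction described in the commit above can be illustrated with a minimal, self-contained sketch. This is not the `torch.distributed.checkpoint` code: `CHECKPOINT`, `load_state_dict`, `load_from_keys_buggy`, and `load_from_keys_fixed` are hypothetical names chosen only to show the pattern, assuming an inner loader that treats `None` as "load everything".

```python
from typing import Optional

# Hypothetical stand-in for a saved checkpoint.
CHECKPOINT = {"model.weight": 1, "model.bias": 2, "optim.lr": 3}


def load_state_dict(keys: Optional[set]) -> dict:
    """Toy inner loader: None means "load everything"; a set filters by membership."""
    if keys is None:
        return dict(CHECKPOINT)
    return {k: v for k, v in CHECKPOINT.items() if k in keys}


def load_from_keys_buggy(keys: Optional[set] = None) -> dict:
    # Bug pattern: a missing argument is normalized to an empty set,
    # which the inner loader interprets as "match nothing".
    return load_state_dict(set(keys) if keys else set())


def load_from_keys_fixed(keys: Optional[set] = None) -> dict:
    # Fix pattern: forward None when no keys were requested.
    return load_state_dict(set(keys) if keys else None)


print(load_from_keys_buggy())   # empty dict: nothing is loaded
print(load_from_keys_fixed())   # the whole toy checkpoint
```

The sketch only mirrors the behavior the commit message describes; the actual fix in the PR may be structured differently.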
Aleksandar Samardžić	d75921d3a6	Fix sparse CUTLASS-based kernels (#150023 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150023 Approved by: https://github.com/jcaip ghstack dependencies: #149978	2025-03-27 16:23:55 +00:00
Boyuan Feng	c830d750e6	[graph partition] support splitting on custom ops (#149782 ) This PR adds support for graph partition on custom ops. Land after #149458. ### API This PR provides a new API to register/unregister custom ops for graph partition. ```python def register_custom_op_support_cudagraph( operator: torch._library.custom_ops.CustomOpDef, is_cudagraphable: bool, ) -> None ``` Example usage: ```python from torch._inductor.utils import register_custom_op_support_cudagraph @torch.library.custom_op("mylib::movement", mutates_args=()) def movement(pic: torch.Tensor) -> torch.Tensor: img = pic.cpu() cropped_img = (img + 1) * 2 return cropped_img.cuda() / 255.0 @movement.register_fake def _(pic): return torch.empty_like(pic) register_custom_op_support_cudagraph(movement, is_cudagraphable=False) ``` ### Example In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops. ![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c) Code to repro: ```python import torch from torch._inductor.utils import register_custom_op_support_cudagraph torch._inductor.config.graph_partition = True @torch.library.custom_op("mylib::movement", mutates_args=()) def movement(pic: torch.Tensor) -> torch.Tensor: img = pic.cpu() cropped_img = (img + 1) * 2 return cropped_img.cuda() / 255.0
@movement.register_fake def _(pic): return torch.empty_like(pic) @torch.library.custom_op("mylib::modify", mutates_args=()) def modify(pic: torch.Tensor) -> torch.Tensor: pic1 = pic + 1 pic1_cpu = (pic1.cpu() + 1) * 2 return pic1_cpu.cuda() + pic @modify.register_fake def _(pic): return torch.empty_like(pic) @torch.library.custom_op("mylib::transform", mutates_args=()) def transform(pic: torch.Tensor) -> torch.Tensor: return (pic + 1) * 2 @transform.register_fake def _(pic): return torch.empty_like(pic) register_custom_op_support_cudagraph(movement, is_cudagraphable=False) register_custom_op_support_cudagraph(modify, is_cudagraphable=False) img = torch.randn(3, 64, 64, device="cuda") def f(img): x = (img + 10) * 2 y = movement(x) z = y + 1 u = transform(z) v = 2*u + 1 out = modify(v) return out + 1 compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True) eager_out = f(img) for _ in range(3): compiled_out = compiled_f(img) assert torch.allclose(eager_out, compiled_out) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782 Approved by: https://github.com/zou3519	2025-03-27 16:23:07 +00:00
PyTorch MergeBot	efc975feb2	Revert "[triton] Warp specialization support in torchinductor (#148503 )" This reverts commit 36183215e8845b54cdb69097e2b688fa9e4d3daf. Reverted https://github.com/pytorch/pytorch/pull/148503 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148503#issuecomment-2758590645))	2025-03-27 16:06:42 +00:00
PyTorch MergeBot	af7719a2fa	Revert "Use source hashing to generate consistent symbolic ids (#149665 )" This reverts commit 1f92348dc6c60e3020a723b37ecb8226cf2480c0. Reverted https://github.com/pytorch/pytorch/pull/149665 on behalf of https://github.com/malfet due to Broke trunk, see `6eb3c2e282/1` ([comment](https://github.com/pytorch/pytorch/pull/149665#issuecomment-2758578187))	2025-03-27 16:02:27 +00:00
zpcore	6eb3c2e282	Update xla pin (#149381 ) Update xla pin to fix the github test failure issue. [failure link](https://hud.pytorch.org/failure?name=pull+%2F+linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&jobName=linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&failureCaptures=%5B%22test_call_jax_pytree%22%2C%22TestJaxInterop%22%5D). The job runs the torch_xla jax test, so it needs to install the jax/jaxlib dependencies as we did in https://github.com/pytorch/xla/pull/8781/files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149381 Approved by: https://github.com/atalman	2025-03-27 13:53:25 +00:00
Mandar Deshpande	36183215e8	[triton] Warp specialization support in torchinductor (#148503 ) Summary: Currently, only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`. To allow warp specialization, the kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`. Test Plan: ## Unit test Added tests for `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`. ## Functional Testing Specific to flexattention. ``` import torch from torch.nn.attention.flex_attention import flex_attention from triton.testing import do_bench make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16) q, k, v = make_tensor(), make_tensor(), make_tensor() flex_compiled = torch.compile(flex_attention, fullgraph=True) print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4}))) ``` triton do_bench results: - default compile: 15.176783561706543 - with warp-spec: 9.452800750732422 ## Extra notes - generated triton kernel using `TORCH_LOGS=output_code`: P1740612877 - TTGIR for fused kernel: P1740614685 Differential Revision: D70212243 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148503 Approved by: https://github.com/eellison	2025-03-27 13:07:50 +00:00
_githubsgi	f0e1a0838c	Enabling xpu in OffsetBasedRNGTracker . (#148360 ) Else torch.distributed breaks on xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360 Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501 Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-03-27 10:55:05 +00:00
Matthew Haddock	e175929b8c	Make codegen dynamic shapes more device agnostic (#146830 ) Currently, as in many places in inductor, devices are assumed to be one of: - CPU with Cpp codegen, or - GPU with triton codegen This is not always the case: a CPU backend may be using the triton CPU backend, or some other codegen entirely. This goes some way to fixing it in the case where a CPU backend can use triton scheduling. A more general solution could be implemented, but this would need to be quite robust, and is probably best done more centrally and by someone who can do more testing with CUDA devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146830 Approved by: https://github.com/eellison, https://github.com/albanD, https://github.com/guangyey Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-03-27 10:40:49 +00:00
Laith Sakka	6cbcdee944	Introduce guard_or_true, guard_or_false (#148430 ) Some context in this document: https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj TL;DR: `guard_or_true` and `guard_or_false` are better than `guard_size_oblivious` because they: - Make it easier to reason about what assumptions we are making while reading the code. - Avoid size_oblivious complexity that is not needed. - Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime. - Produce fewer data dependent errors in some cases: e.g., when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2`, `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`. ### How is it different from statically_known_true? `if(cond)`: (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data dependent error, it will throw an exception (that could be converted to a graph break in some situations). `statically_known_true(cond)`: would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints. `guard_or_true(cond)`/`guard_or_false(cond)`: these would be used in situations where you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false. Some reasons you might be ok with returning true/false instead could be: 1. It's an optimization, and I do not want to fail just because the optimization could not be performed. 2. I am willing to deviate from the normal semantics when I have unbacked symbols for the benefit of not failing (see the doc above for more details). 
`definitely_true(cond)`: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430 Approved by: https://github.com/bobrenjc93	2025-03-27 09:34:05 +00:00
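The fallback behavior described in the commit above can be sketched with a toy model. This is not the PyTorch implementation, which operates on symbolic expressions and a shape environment and may also install guards; `DataDependentError`, `toy_eq`, `toy_evaluate`, `toy_guard_or_false`, and `toy_guard_or_true` are hypothetical names, and an unknown ("unbacked") value is modeled simply as `None`.

```python
class DataDependentError(RuntimeError):
    """Raised by the toy evaluator when a condition depends on an unknown value."""


def toy_eq(a, b):
    # Hypothetical comparison: returns None when either operand is unknown.
    return None if a is None or b is None else a == b


def toy_evaluate(cond):
    # `cond` is a zero-argument callable; a None result means "cannot decide".
    result = cond()
    if result is None:
        raise DataDependentError("condition depends on an unknown value")
    return bool(result)


def toy_guard_or_false(cond):
    # Evaluate when possible; fall back to False on a data dependent error.
    try:
        return toy_evaluate(cond)
    except DataDependentError:
        return False


def toy_guard_or_true(cond):
    # Evaluate when possible; fall back to True on a data dependent error.
    try:
        return toy_evaluate(cond)
    except DataDependentError:
        return True


unknown = None  # stands in for an unbacked size such as u1 - u2
print(toy_guard_or_false(lambda: toy_eq(unknown, 1)))  # False
print(toy_guard_or_true(lambda: toy_eq(unknown, 1)))   # True
print(toy_guard_or_false(lambda: toy_eq(1, 1)))        # True
```

The point of the sketch is only the control flow: a decidable condition is evaluated normally, and the fallback value applies solely when evaluation would otherwise raise.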
pralay	a9ee797e41	added fake tensor support for foreach_copy (#149127 ) Fixes #149111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127 Approved by: https://github.com/jansel, https://github.com/jeromean	2025-03-27 09:26:23 +00:00
Louie Tsai	7aacbab0b3	Update Doc for Intel XPU Profiling (#134515 ) Updated below two pages for Intel XPU https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html https://pytorch.org/docs/stable/profiler.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515 Approved by: https://github.com/dvrogozh, https://github.com/malfet	2025-03-27 09:15:35 +00:00
Mu-Chu Lee	e6afb51805	[AOTInductor] Free folded constants that's managed by AOTInductor (#149825 ) internally. Summary: This diff allows freeing the usage of folded constants that's created by AOTInductor through CUDACachingAllocator instead of the constant blob from cudaMalloc directly. Test Plan: LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825 Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh	2025-03-27 06:05:50 +00:00
PyTorch MergeBot	e080bac533	Revert "Introduce guard_or_true, guard_or_false (#148430 )" This reverts commit d5593ea31ceb2590336cc9815ee2c13a18db6cd7. Reverted https://github.com/pytorch/pytorch/pull/148430 on behalf of https://github.com/laithsakka due to need to fix stuff ([comment](https://github.com/pytorch/pytorch/pull/148430#issuecomment-2756701436))	2025-03-27 05:10:20 +00:00
Simon Fan	748252378d	[ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149987 Approved by: https://github.com/jansel ghstack dependencies: #149647, #149709, #149651, #149897	2025-03-27 05:05:34 +00:00
Simon Fan	dcb378cff2	[ca] support anomaly mode nan checks with different semantics than eager (#149897 ) see note in code Pull Request resolved: https://github.com/pytorch/pytorch/pull/149897 Approved by: https://github.com/jansel ghstack dependencies: #149647, #149709, #149651	2025-03-27 05:05:34 +00:00
Nikita Shulga	488b87cb68	[BE] do not retain/release tensor (#150075 ) `Tensor::as_strided__symint` is inplace op that returns self, no need to retain it Pull Request resolved: https://github.com/pytorch/pytorch/pull/150075 Approved by: https://github.com/angelayi, https://github.com/atalman, https://github.com/cyyever	2025-03-27 03:43:14 +00:00
bobrenjc93	1f92348dc6	Use source hashing to generate consistent symbolic ids (#149665 ) This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how.... Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized. We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665 Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka	2025-03-27 03:39:27 +00:00
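The idea in the commit above, deriving a symbol id from a hash of the source name and resolving collisions with linear probing, can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `stable_symbol_id`, the use of SHA-256, and the `table_size` parameter are all hypothetical choices.

```python
import hashlib


def stable_symbol_id(source_name: str, used: set, table_size: int = 1 << 16) -> int:
    """Map a source name to an id that does not depend on allocation order."""
    # Use a content hash rather than Python's built-in hash(), which is
    # randomized per process for strings.
    digest = hashlib.sha256(source_name.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % table_size
    # Linear probing: step to the next slot until a free one is found.
    for _ in range(table_size):
        if slot not in used:
            used.add(slot)
            return slot
        slot = (slot + 1) % table_size
    raise RuntimeError("symbol table is full")


# The same source name yields the same id regardless of what else was allocated
# earlier, as long as its home slot is free.
first_run = stable_symbol_id("L['x'].size()[0]", set())
second_run = stable_symbol_id("L['x'].size()[0]", set())
print(first_run == second_run)  # True
```

With a counter-based scheme, extra symbols allocated and later specialized shift every subsequent id; with hashing, an id changes only when probing is triggered by a collision, which matches the "somewhat stable" wording in the commit message.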
Daniel Vega-Myhre	ae29f054f5	[Async TP] More robust support for rowwise scales when fusing matmul reduce-scatter (#149247 ) Part of https://github.com/pytorch/torchtitan/issues/866 ## Context - Async TP needs to support the "reshape -> scaled_mm -> reshape" pattern because scaled mm only supports 2D input tensors and 2D scales. - (a,b,c) => (a*b,c) - (a*b,c) @ (c,d) = (a*b,d) - (a*b,d) => (a,b,d) - Currently the implementation does not support scaled mm with rowwise scales for all cases of the reshape -> scaled_mm -> reshape pattern. The minimal example of this pattern is confirmed to work via this [unit test](`00a2c68f67/test/distributed/tensor/parallel/test_micro_pipeline_tp.py (L406)`), but more involved e2e examples in torchtitan fail silently (more context in final bullet point). - Previously, the "A tensor" node referenced in the async TP graph manipulation code was the 3D+ node before the reshape, but the "A_scale" node was the 2D node from after the reshape, so they were incompatible. - I previously implemented a simpler solution to this problem in https://github.com/pytorch/pytorch/pull/148001, with a [unit test](https://github.com/pytorch/pytorch/pull/148001/files#diff-115f1d0852382c9b58f22640d80999d879b33618e5f6c633fc9e4d0ca9781cecR406) confirming the fused node is indeed in the graph for the minimal example of the reshape->mm->reshape pattern. I also confirmed via manual e2e testing w/ torchtitan that the crash I was fixing no longer occurred. However, it turns out that due to this [bug in torchtitan](https://github.com/pytorch/torchtitan/issues/866) async TP was failing silently and falling back to vanilla TP, hiding the fact that this original solution fixed the crash but the fusion would not occur for rowwise scales. Thus, a more robust solution is needed to support all cases. ## Solution TL;DR - Use the 2D 'A' tensor and corresponding 2D scales as input to the fused_matmul_reduce_scatter implementation, instead of the 3D+ tensor/scales. 
- Track the "pre mm reshape" and "post mm reshape" separately, to be referenced in the `fused_scaled_matmul_reduce_scatter` implementation, to update the scatter dim through the pre-mm reshape, and apply the post-mm reshape before applying the reduce scatter and returning the output tensor. - Separate the `fused_matmul_reduce_scatter` and the `fused_scaled_matmul_reduce_scatter` code paths, to simplify them both. - By fixing the bug in torchtitan (PR https://github.com/pytorch/torchtitan/pull/965) and implementing support for rowwise scales in pytorch in this PR, together these changes will solve the problem of how to support rowwise scales with all types of AC. ## Additional details for reviewers To use the 2D A tensor while also supporting the "reshape -> mm -> reshape" pattern, the following other changes were needed: - Track the pre-mm reshape, as it will affect the scatter dim used in the fused_matmul_reduce_scatter implementation. - Track the post-mm reshape, as it will affect the output shape used in the fused_matmul_reduce_scatter implementation. - Based on the pre-mm reshape and the original scatter dim, calculate the new scatter dim for the 2D tensor. This is needed because during the pipelined producer mm implementation, the scatter dim is moved to dim 0 (so it can be sharded along the first dim and then get chunks to do mm ops on by indexing into the first dim), then moved back to its original place before the reduce-scatter. - Use the tracked post-mm reshape to reshape the stacked partial 2D outputs of the mm ops into 3D outputs needed for 1) the reduce-scatter w/ the original scatter dim, and 2) the expected output shape to prevent shape errors with subsequent ops. ## Test plan - All existing unit tests passing. - Expand unit tests for rowwise scales to test more scatter dims - Added unit tests enforcing that async TP fails fast / throws an error if it fails to perform any fusions. 
Previously it just "failed silently" (fell back to vanilla TP without the user knowing) which has led to confusion, so this will improve the UX. - Compared loss curves of bf16 vs float8 w/ rowwise scales to confirm integrity of numerics - Confirmed via manual testing with torchtitan and inspecting the compile graph that the fusion is working as intended for: - bfloat16 - float8 with tensorwise scales - float8 with rowwise scales ## Loss curves Loss curves are virtually identical for bf16 + vanilla TP versus float8 with rowwise scales + async TP: <img width="1017" alt="loss_async_tp" src="https://github.com/user-attachments/assets/4995db78-7012-490f-a370-f4fecc289a22" /> ## Performance #### Per op SAC Performance benchmarks for torchtitan Llama3 8b training runs on 4 H100s with per op SAC, using FSDP degree=2, TP degree=2: - bf16 (vanilla TP): TPS 5161.5, peak memory 50.53 GB - bf16 (async TP): TPS 5229.5, peak memory 50.68 GB - float8 tensorwise (vanilla TP): TPS: 5959.5, peak memory: 50.47 GB - float8 tensorwise (async TP): TPS 5964.5, peak memory 50.47 GB - float8 rowwise (vanilla TP): TPS: 4962.0, peak memory: 50.55 GB - float8 rowwise (async TP): TPS 4966.5, peak memory 50.65 GB #### Full AC Llama3 70b training runs on 128 H100s with full AC, using FSDP=16, TP=8 - bf16 (vanilla TP): 598 TPS, peak memory 71.51 GB - bf16 (async TP): TPS 673, peak memory 71.08 (+12.54% TPS vs vanilla TP) - float8 tensorwise (vanilla TP): 820 TPS, peak memory 55.26 GB - float8 tensorwise (async TP): 950 TPS, peak memory 55.91 GB (+15.85% TPS vs vanilla TP) - float8 rowwise (vanilla TP): TPS: 540 TPS, peak memory 71.46 GB - float8 rowwise (async TP): 560 TPS, peak memory 70.65 GB (+3.7% TPS vs vanilla TP but still unexpectedly lower than bf16) As you can see, float8 rowwise is working but performance needs to be improved further. ## Other changes - Added logging so the user will know why fusion failed if it does. 
- Remove logic which inserted a reshape node targeting "A scale" to get it to be in 3D like the "A tensor" since it's no longer needed. ## Long term plan - Add a `scaled_matmul` op in pytorch, which will natively support a 3D+ "A tensor" and allow us to simplify the async TP implementation by avoiding the reshape -> scaled_mm -> reshape pattern and the special handling for it. ## Visualizing fused nodes in graphs for torchtitan training runs Below are examples of the visualized graph generated by torch compile for torchtitan llama3 8b training runs with per op SAC. These graphs provide additional evidence (beyond the new unit tests added) that the implementation is working correctly. ### bf16 <img width="900" alt="bf16-fusion" src="https://github.com/user-attachments/assets/a3bed917-28eb-4a56-8d6e-2d2bf498385c" /> ### float8 with tensorwise scales <img width="900" alt="tensorwise-node" src="https://github.com/user-attachments/assets/b212ec4a-1899-44de-a4de-18c74e1de68a" /> ### float8 with rowwise scales <img width="900" alt="rowwise" src="https://github.com/user-attachments/assets/ed3354a3-894b-4ec9-86d0-f80364bf3d83" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149247 Approved by: https://github.com/kwen2501	2025-03-27 03:15:30 +00:00
Ahmad Sharif	114d404b07	[cuda] Add new faster gammabeta backward kernel (#148605 ) This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way. To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables: 1. dtype in {half, float} 2. M in `2k, 2k - 1, 2k + 1 for k in range(...)` 3. N in `2k, 2k - 1, 2k + 1 for k in range(...)` 4. Whether we flush the L2 cache before running the backward pass Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster). In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the backward pass being 1.42x faster than the old backward pass. Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. 
For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old: M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.35 Elapsed Cycles cycle 27,526 Memory Throughput % 2.21 DRAM Throughput % 0.54 Duration us 20.42 L1/TEX Cache Throughput % 4.31 L2 Cache Throughput % 2.62 SM Active Cycles cycle 1,475.02 Compute (SM) Throughput % 0.29 ----------------------- ----------- ------------ ``` M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary: ``` ----------------------- ----------- ------------ Metric Name Metric Unit Metric Value ----------------------- ----------- ------------ DRAM Frequency Ghz 1.59 SM Frequency Ghz 1.34 Elapsed Cycles cycle 10,920 Memory Throughput % 5.64 DRAM Throughput % 1.35 Duration us 8.13 L1/TEX Cache Throughput % 1.92 L2 Cache Throughput % 6.89 SM Active Cycles cycle 3,554.41 Compute (SM) Throughput % 0.67 ----------------------- ----------- ------------ ``` Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following: <img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" /> There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). 
Here is a small guide I made: ![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738) For dtype=float32, we get a similar chart: <img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" /> The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension). The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough. I am including the regressions here for completeness' sake: <img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" /> To see this better: 1. Click the image 2. Right click the expanded image and open in a new tab 3. Go to that tab and left click once to zoom in If you want to see the full data, here it is: ![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1) I also measured binary size and compile time since those are important for developers: Binary size comparison ![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51) ``` # Original -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so # This PR -rwxr-xr-x 1 ahmads users 307193112 Mar 6 08:46 ./torch/lib/libtorch_cuda.so ``` The diff in bytes is 302kB which is about a 0.1% increase. 
Compile time difference: ``` # Original real 0m10.931s user 0m9.676s sys 0m1.004s # this PR real 0m16.720s user 0m15.514s sys 0m1.066s # Command I ran time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o ``` So the new PR is 6 seconds longer compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605 Approved by: https://github.com/ngimel	2025-03-27 03:01:53 +00:00
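The ratios quoted in the message above can be recomputed from its own numbers. This is a minimal sketch; the helper name `ceil_div` is ours, and the block widths 16 and 32 are taken from the description of the N=33 case.

```python
# Recompute the figures quoted in the commit message from its own numbers.

def ceil_div(a: int, b: int) -> int:
    """Number of width-b blocks needed to cover a elements."""
    return -(-a // b)

# NCU "Duration" (us) and "Elapsed Cycles" for M=32, N=2048, half, flush_l2=True.
old_us, new_us = 20.42, 8.13
old_cycles, new_cycles = 27_526, 10_920
duration_speedup = old_us / new_us
cycle_speedup = old_cycles / new_cycles

# N=33: a block width of 16 needs 3 blocks, a width of 32 needs 2.
blocks_old = ceil_div(33, 16)
blocks_new = ceil_div(33, 32)

# nvcc wall-clock ("real") times for layer_norm_kernel.cu, in seconds.
compile_delta_s = 16.720 - 10.931

print(f"{duration_speedup:.2f}x by duration, {cycle_speedup:.2f}x by cycles")
print(f"blocks for N=33: old={blocks_old}, new={blocks_new}")
print(f"extra compile time: {compile_delta_s:.2f} s")
```

Both the duration ratio and the cycle ratio come out at roughly 2.5x, which matches the ncu figure quoted in the message rather than the 1.42x heatmap value.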
Yidi Wu	b2b9aaf0ad	Fix non-strict export doesn't turn on dynamo for hop (#149903 ) Somehow `torch._dynamo.is_compiling` was changed to `torch.compiler.is_compiling()`, which also checks whether we're exporting. This was not caught by CI because we don't have an export test for scan. This PR changes the check to `torch.compiler.is_dynamo_compiling` and adds a test. Edit: this PR also piggybacks the re-tracing support; see the related code in `combine_fn_is_normalized`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149903 Approved by: https://github.com/zou3519	2025-03-27 02:38:05 +00:00
vasiliy	dad0854d48	meta registration for torch._scaled_mm with mxfp8 (#148461 ) Summary: Adds the meta registration logic for torch.compile to work with `torch._scaled_mm` with mxfp8. Thanks to @eellison for the pointer to make inductor work with this. Test Plan: ``` pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461 Approved by: https://github.com/drisspg, https://github.com/eellison	2025-03-27 02:32:40 +00:00
Laith Sakka	d5593ea31c	Introduce guard_or_true, guard_or_false (#148430 ) Some context is in this document: https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj TL;DR: `guard_or_true` and `guard_or_false` are better than `guard_size_oblivious` because: - It is easier to reason about what assumptions we are making while reading the code. - They avoid size_oblivious complexity that is not needed. - They avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime. - They produce fewer data-dependent errors in some cases: e.g., when doing `guard_size_oblivious(a==1)` where we know `a` is a tensor size, if it's traced with `a=u1-u2`, `guard_size_oblivious(a==1)` will throw a data-dependent error but `guard_or_false` will just return `False`. ### How is it different from statically_known_true? `if(cond)`: (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data-dependent error, it will throw an exception (that could be converted to a graph break in some situations). `statically_known_true(cond)`: used when you never want to add a guard (restrict your input space), but just want to do a best-effort check to see if you can infer that something is true/false ONLY based on existing constraints. `guard_or_true(cond)`/`guard_or_false(cond)`: used in situations where you prefer to guard and know the result of the expression over not guarding, but if you hit a data-dependent error you are OK with just returning true or false. Some reasons you might be OK with returning true/false instead: 1. It's an optimization, and I do not want to fail just because the optimization could not be performed. 2. I am willing to deviate from the normal semantics when I have unbacked symbols for the benefit of not failing (see the doc above for more details). 
`definitely_true(cond)`: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430 Approved by: https://github.com/bobrenjc93	2025-03-27 02:22:20 +00:00
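To make the fallback behavior concrete, here is a toy model in plain Python. It does not use PyTorch: `DataDependentError`, `evaluate`, and the dict of "hints" are hypothetical stand-ins for symbolic-shape evaluation, and only the "return a default on a data-dependent error" behavior described above is modeled (no guards are recorded).

```python
class DataDependentError(Exception):
    """Toy stand-in for the error raised when a condition depends on unbacked data."""


def evaluate(cond, hints):
    """Evaluate cond(hints); a missing hint models an unbacked symbol."""
    try:
        return cond(hints)
    except KeyError as exc:
        raise DataDependentError(str(exc)) from exc


def guard_or_false(cond, hints):
    """Toy version: return the real answer if known, else fall back to False."""
    try:
        return evaluate(cond, hints)
    except DataDependentError:
        return False


def guard_or_true(cond, hints):
    """Toy version: return the real answer if known, else fall back to True."""
    try:
        return evaluate(cond, hints)
    except DataDependentError:
        return True


def is_one(hints):
    return hints["a"] == 1


backed = {"a": 1}   # the value of `a` is known
unbacked = {}       # `a` has no hint, so evaluation fails

known = guard_or_false(is_one, backed)
fallback_false = guard_or_false(is_one, unbacked)
fallback_true = guard_or_true(is_one, unbacked)
print(known, fallback_false, fallback_true)
```

When the condition can be evaluated, both helpers agree with it; they differ only in which default they return when evaluation is impossible.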
Ahmad Sarvmeily	c2b8fead43	Allow TritonTemplate subclasses to override kernel type (#150018 ) Allows subclasses of `TritonTemplate` to override the kernel type, e.g. ``` class MyTritonTemplate(TritonTemplate): kernel_type = MyTritonTemplateKernel ``` This means that all of the logic in `TritonTemplate` class doesn't need to be duplicated in subclasses if the only required change is the kernel type. Note that there is precedent for doing this - see `SIMDScheduling` in `torch/_inductor/codegen/simd.py`: ``` class SIMDScheduling(BaseScheduling): kernel_type: type[Any] = SIMDKernel # override in subclass ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150018 Approved by: https://github.com/jansel	2025-03-27 02:16:40 +00:00
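The pattern described above can be sketched without Inductor. The classes below (`Kernel`, `CustomKernel`, `Template`, `CustomTemplate`) are hypothetical stand-ins, not the real `TritonTemplate` API; the point is only that the base class builds kernels through a class attribute, so a subclass changes the kernel type in one line.

```python
class Kernel:
    """Hypothetical base kernel."""

    def __init__(self, name: str) -> None:
        self.name = name

    def describe(self) -> str:
        return f"{type(self).__name__}({self.name})"


class CustomKernel(Kernel):
    """Hypothetical specialized kernel."""


class Template:
    # Subclasses override this attribute instead of copying generate().
    kernel_type: type[Kernel] = Kernel

    def generate(self, name: str) -> Kernel:
        # All construction goes through the class attribute.
        return self.kernel_type(name)


class CustomTemplate(Template):
    kernel_type = CustomKernel


base_desc = Template().generate("mm").describe()
custom_desc = CustomTemplate().generate("mm").describe()
print(base_desc, custom_desc)
```

`CustomTemplate` inherits `generate` unchanged, which is the duplication the commit message says it avoids.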
Angela Yi	8d1cfb63b5	[export] Save unflattened gm (#150030 ) Summary: Reland of D71082652 Test Plan: https://www.internalfb.com/intern/testinfra/testrun/8444249558423545 https://www.internalfb.com/intern/testinfra/testrun/7318349652864293 https://www.internalfb.com/intern/testinfra/testrun/13229323980143778 https://www.internalfb.com/intern/testinfra/testrun/11540474119884081 Differential Revision: D71902033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150030 Approved by: https://github.com/pianpwk	2025-03-27 02:01:51 +00:00
Laith Sakka	128b32f363	cache loaded python modules (#149910 ) I am splitting the caching of module loading from the caching of codegen, since it's trivial and much easier. Module loading is 50% of the cost and codegen is the other 50% of maybe_append choice on the full graph model, which is 40% of total compile time. <img width="434" alt="Screenshot 2025-03-24 at 4 35 12 PM" src="https://github.com/user-attachments/assets/aa851c6a-bde9-43f8-b12d-e439504ef62c" /> Running the mm_loop benchmark: before this change: 67947323682; after this change: 25845073249, i.e. 2.6x faster. It seems that the cache was there and then got dropped. I added a benchmark so it won't be dropped again by mistake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910 Approved by: https://github.com/eellison, https://github.com/aorenste ghstack dependencies: #149932	2025-03-27 00:45:09 +00:00
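As an illustration of the idea (not the actual Inductor code), a module-loading cache can be as simple as a dictionary keyed by the generated source text. The names `load_module`, `_module_cache`, and `load_count` are hypothetical, and keying on the raw source string is an assumption made for brevity.

```python
import types

_module_cache: dict[str, types.ModuleType] = {}
load_count = 0  # counts real (uncached) loads, for demonstration only


def load_module(source: str) -> types.ModuleType:
    """Compile and execute `source` once; later calls reuse the module."""
    global load_count
    cached = _module_cache.get(source)
    if cached is not None:
        return cached
    load_count += 1
    module = types.ModuleType(f"generated_{len(_module_cache)}")
    exec(compile(source, module.__name__, "exec"), module.__dict__)
    _module_cache[source] = module
    return module


src = "def double(x):\n    return 2 * x\n"
first = load_module(src)
second = load_module(src)   # served from the cache, no second exec
result = first.double(21)

# Ratio of the two benchmark numbers quoted in the commit message.
ratio = 67947323682 / 25845073249
print(first is second, load_count, result, f"{ratio:.2f}x")
```

The last line also checks the "2.6x" figure against the two benchmark numbers given in the message.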
Rachel Guo	48cff64a54	[pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598 ) Summary: found it helpful when running prod model with combo_kernel feature enabled Test Plan: CI Differential Revision: D71513304 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598 Approved by: https://github.com/yushangdi	2025-03-27 00:26:24 +00:00
Animesh Jain	731b559f54	[easy] Use config patch to toggle capture_scalar_output (#150036 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150036 Approved by: https://github.com/angelayi ghstack dependencies: #149087, #149667	2025-03-27 00:01:39 +00:00
Animesh Jain	999fa15ba8	[invoke_subgraph][fake tensor cache] Add a finalizer for id hashed objects (#149667 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149667 Approved by: https://github.com/zou3519 ghstack dependencies: #149087	2025-03-27 00:01:39 +00:00
Animesh Jain	a7596b4b34	[invoke_subgraph] Fake tensor prop caching (#149087 ) Redoing https://github.com/pytorch/pytorch/pull/137808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149087 Approved by: https://github.com/zou3519	2025-03-27 00:01:39 +00:00
Justin Chu	3efa211e48	[ONNX] Annotate None inputs in symbolic ops (#150038 ) Add `None` to type annotations of `torch.onnx.ops.symbolic*` ops and improve tests to test support for optional inputs. Previously it was omitted mistakenly even though the implementation supports it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150038 Approved by: https://github.com/titaiwangms	2025-03-27 00:01:09 +00:00
Nikita Shulga	6db95ccf4c	Delete linux-focal-cuda12_6-py3_10-gcc11-bazel-test (#150066 ) It's been broken for a while, even back when this job was still called `linux-focal-cuda12.4-py3.10-gcc9-bazel-test`. The last time it ran successfully was on Feb 21st. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150066 Approved by: https://github.com/yangw-dev, https://github.com/seemethere, https://github.com/atalman	2025-03-26 23:55:58 +00:00
Aleksandar Samardžić	43cc954f88	Refactor row-wise scaled MM (#149978 ) 1. Add config selection for SM89. 2. Only build kernels if compiling for given arch. 3. Factor out CMake code to enforce compiling for needed archs for individual files into a function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149978 Approved by: https://github.com/drisspg	2025-03-26 23:49:41 +00:00
Nikita Shulga	6aca002d82	[MPS] Add `chebyshev_polynomial_[uvw]` (#150060 ) For both eager and inductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060 Approved by: https://github.com/dcci, https://github.com/jansel	2025-03-26 23:35:05 +00:00
PyTorch MergeBot	185aaaaf8e	Revert "Improve subproc autotuning implementation (#149700 )" This reverts commit 8cd6a133f21821f0713116f0f9a55e5368de8c1c. Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))	2025-03-26 23:17:01 +00:00
Nikita Shulga	db8f4c1b1b	[MPSInductor] Run chebyshev_polynomial_t tests (#150042 ) Test name should start with `test_` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150042 Approved by: https://github.com/dcci	2025-03-26 22:50:08 +00:00
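The naming rule behind this fix is easy to demonstrate with the standard library: `unittest` only collects methods whose names start with `test`. The class and method names below are made up for the demonstration.

```python
import unittest


class ChebyshevTests(unittest.TestCase):
    def test_collected(self):
        self.assertTrue(True)

    def chebyshev_not_collected(self):
        # Missing the "test" prefix, so the loader never sees this method.
        self.fail("never runs")


names = list(unittest.TestLoader().getTestCaseNames(ChebyshevTests))
print(names)
```

A misnamed test silently never runs, so it can never fail.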
Jon Janzen	9aa0612dd3	[targets2buck] Remove tombstone messages proactively (#147897 ) Summary: X-link: https://github.com/pytorch/executorch/pull/8703 Originally we created a bunch of empty `TARGETS` files to allow us to enable `BUCK` files in fbcode by hiding the existing BUCK file. These files were subsequently merged together using `non_fbcode_target` so these tombstones are no longer necessary. This diff fixes all files that WOULD have had the useless tombstone merged into them. To create this diff, I just ran the merger script that Codemod Service is using and then deleted the "merged from" and tombstone lines with `sed`, `arc f` and reverted any lines that didn't make sense Test Plan: CI Differential Revision: D69994481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147897 Approved by: https://github.com/izaitsevfb	2025-03-26 22:15:17 +00:00
Nichols A. Romero	c0af782f30	[ROCm] Change LoadHIP to use find_file for rocm_version.h (#149983 ) Fixes #149805 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149983 Approved by: https://github.com/jeffdaily	2025-03-26 21:26:41 +00:00
Pat Vignola	625913eefc	[MTIA] [Triton] Set codename of MTIA device in triton heuristics (#149860 ) Summary: Triton-MTIA expects the codename of the device as the arch when querying the module map, not the compute capability. This diff gets rid of the following error: `No libdevice is provided for arch (0, 0)` Test Plan: CI Reviewed By: Myrthan Differential Revision: D70072095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149860 Approved by: https://github.com/jansel	2025-03-26 20:58:12 +00:00
Tristan Rice	87bfd66c3c	gloo: update to latest version (#149985 ) This updates submodule Gloo to the latest version and brings a number of benefits: * connection retries `d2609ab5e8` * better error messages `5ca057d6cc` * multi_get support for larger scale jobs `4ff6edf45f` * metadata exchange optimizations `20dc202dd8` * miscellaneous other fixes Old commit: `5354032ea0` Test plan: This is already being used in production environments at scale. PyTorch CI ``` pytest -v test/distributed/test_c10d_gloo.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985 Approved by: https://github.com/fduwjj, https://github.com/malfet	2025-03-26 19:19:31 +00:00
Boyuan Feng	039ebdc192	[Graph Partition] Support symbol inputs (#149458 ) This PR supports symbol inputs to graph partition functions. Before this PR, we relied on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs. In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully perform codegen, including: - free symbols used in partition nodes. - free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes. ### Note1: MutationLayout In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, or strides of node.layout. Instead, this symint appears in node.layout.target, so it needs extra handling. ```python import torch x = torch.zeros(7, device="cuda") def fn(n, a): a[n] = -1 return a opt_fn = torch.compile(fn, fullgraph=True) for n in range(2, x.shape[0]): opt_fn(n, x) ``` ### Note2: Composability with Padded Tensor Subclass Without graph partition, the Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since the cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad. 
``` def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args args.clear() s0 = arg0_1 s1 = arg1_1 arg2_1_size = arg2_1.size() s2 = arg2_1_size[0] s3 = arg2_1_size[1] assert_size_stride(arg2_1, (s2, s3), (s3, 1)) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32) # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul] triton_poi_fused_add_mul_0_xnumel = s2*s3 stream0 = get_raw_stream(0) triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0) del arg2_1 return (buf0, s0, s1, s1, ) ``` With graph partition, the partition function only includes tensors and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743) ```python def call(self, args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args args.clear() s0 = arg0_1 s1 = arg1_1 arg2_1_size = arg2_1.size() s2 = arg2_1_size[0] s3 = arg2_1_size[1] assert_size_stride(arg2_1, (s2, s3), (s3, 1)) partition0_args = [arg2_1, s2, s3] del arg2_1 (buf0,) = self.partitions[0](partition0_args) del partition0_args return (buf0, s0, s1, s1, ) ``` The number of cudagraphs is validated below: (also added to test) ```python import torch from padded_tensor import PaddedTensor # Turning off graph_partition leads to # torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6 # at the end, which is wrong. # torch._inductor.config.graph_partition = False # Turning on graph_partition leads to # torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4 # at the end, which is correct. 
torch._inductor.config.graph_partition = True def f(x): x1 = x + 1 return x1 * 2 compiled_f = torch.compile(f, mode="reduce-overhead") def run(shape): x = torch.randn(*shape, device="cuda") pad_x = PaddedTensor.from_tensor(x, multipliers={0:4, 1:4}) assert hasattr(pad_x, "multipliers"), breakpoint() eager_out = f(pad_x) for _ in range(3): compiled_out = compiled_f(pad_x) compiled_out = compiled_f(pad_x) assert eager_out.shape == compiled_out.shape assert eager_out.tensor.shape == compiled_out.tensor.shape assert torch.allclose(eager_out.tensor, compiled_out.tensor) # static shape. record a NEW cudagraph. 1 cudagraph in total now. run((2,3)) # outer shape is dynamic, leading to a new dynamo graph # this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now run((3,4)) # outer shape changed but inner shape does not change # so NO new cudagraph is recorded run((2,2)) # inner shape is dynamic now, leading to a new dynamo graph # this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now run((5,6)) # does NOT record a new cudagraph run((7,8)) # record a NEW cudagraph. 4 cudagraphs in total now run((10,11)) assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458 Approved by: https://github.com/eellison	2025-03-26 17:21:30 +00:00
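Why caching on outer shapes over-records can be seen with a small, PyTorch-free sketch. It assumes that `multipliers={0: 4, 1: 4}` means each dimension is padded up to the next multiple of 4; that reading is an assumption, and `pad_to_multiple` and `inner_shape` are hypothetical helpers, not part of `padded_tensor`.

```python
def pad_to_multiple(size: int, multiple: int) -> int:
    """Round size up to the next multiple (assumed padding rule)."""
    return -(-size // multiple) * multiple


def inner_shape(outer, multipliers):
    """Padded (inner) shape for a given outer shape."""
    return tuple(pad_to_multiple(s, multipliers[d]) for d, s in enumerate(outer))


# The shapes passed to run() in the commit message, in order.
shapes = [(2, 3), (3, 4), (2, 2), (5, 6), (7, 8), (10, 11)]
multipliers = {0: 4, 1: 4}

outer_keys = set(shapes)
inner_keys = {inner_shape(s, multipliers) for s in shapes}
print(len(outer_keys), sorted(inner_keys))
```

Keying on outer shapes yields six distinct keys, in line with the id of 6 reported with `graph_partition` off. Keying on inner shapes yields three. The message's final count of 4 is one higher because, per its own comments, an additional cudagraph is recorded when the outer shape first becomes dynamic, which this toy does not model.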
Jithun Nair	4a9466c96a	Newer conda versions require --update-deps to update dependencies such as libgcc-ng (#149599 ) * When we try to install [libstdcxx-ng 12.3.0 from conda-forge](`595293316d/.ci/docker/common/install_conda.sh (L65)`), conda 24.7.1 updates the dependencies of that package, including the libgcc-ng package, to the following: `libgcc-ng-14.2.0 \| h69a702a_2 52 KB conda-forge` * However, conda updated their installer script on Feb 6 2025 to version 25.1.1, which behaves differently from previous versions when installing conda packages. * conda 25.1.1 does not update any dependencies in the above step, and hence the same installation of libgcc-ng from the "defaults" channel is present: `libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1` * Adding the "--update-deps" flag to the conda install command installs a newer libgcc-ng package from the "conda-forge" conda channel: `libgcc-ng-12.3.0 \| h77fa898_13 762 KB conda-forge`, which is compatible with the libstdcxx-ng 12.3.0 package * Compare this [Feb 4 docker build](https://github.com/pytorch/pytorch/actions/runs/13148456164/job/36691412387#step:6:5179) to this [Feb 10 docker build](https://github.com/pytorch/pytorch/actions/runs/13247023578/job/36975931849#step:6:5451), which shows that the latter does not update libgcc-ng. * This creates linking issues when trying to use a library that was built with a newer libgcc_s.so.1 (from the libgcc-ng package) in the PyTorch conda environment. E.g. 
ONNX-RT: ``` 2025-02-13 10:18:38.492434704 [W:onnxruntime:Default, migraphx_execution_provider.cc:167 get_flags_from_env] [MIGraphX EP] MIGraphX ENV Override Variables Set: 2025-02-13 10:18:38.628064251 [E:onnxruntime:Default, provider_bridge_ort.cc:2028 TryGetProviderInfo_ROCM] /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_rocm.so with error: /opt/conda/envs/py_3.10/bin/../lib/libgcc_s.so.1: version `GCC_12.0.0' not found (required by /opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_rocm.so) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149599 Approved by: https://github.com/malfet	2025-03-26 17:04:21 +00:00
Shangdi Yu	b2088f1afe	Add inductor test for torchbind symint (#149980 ) Summary: add test Test Plan: ``` buck run //caffe2/test:test_export -- -r test_compile_custom_obj_unbacked_symint ``` Differential Revision: D71843179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149980 Approved by: https://github.com/BoyuanFeng	2025-03-26 17:02:55 +00:00
Mu-Chu Lee	a0253d2840	[Inductor] Use real input to autotune user defined triton kernels (#149553 ) Summary: User-defined Triton kernels sometimes rely on real inputs to determine the path of execution. We need real inputs to invoke the correct behavior of the user-defined Triton kernels (see the example in the test case, where we have an early return for random inputs). Test Plan: Included in the commit. python test/inductor/test_aot_inductor.py -k triton_autotuning python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553 Approved by: https://github.com/davidberard98, https://github.com/eellison	2025-03-26 16:42:48 +00:00
Nikita Shulga	3a8171efad	[MPS] Preserve in/out dtypes in binary_op name (#150024 ) To be consistent with unary ops and to avoid silent correctness problems if someone tries to invoke the op with an unexpected out dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150024 Approved by: https://github.com/dcci	2025-03-26 16:00:43 +00:00
Jack Taylor	32299e5f9a	Reland "Introduce new template heuristic for triton autotune configs" (#147452 ) This change was reverted in https://github.com/pytorch/pytorch/pull/147388 for regressing an internal workload. I have removed the additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py which could be contributing to the additional compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147452 Approved by: https://github.com/jansel	2025-03-26 15:47:06 +00:00
atalman	7336b76bcc	Refactor cudnn version check in smoke test for Windows (#150015 ) After https://github.com/pytorch/pytorch/pull/149885 I see failures in the Windows smoke test: https://github.com/pytorch/test-infra/actions/runs/14069923716/job/39401550854 This is because PyPI packages such as cudnn and nccl are installed only on Linux, so this change should resolve the issue on Windows. On Windows, cudnn is shipped with PyTorch rather than installed dynamically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150015 Approved by: https://github.com/ZainRizvi	2025-03-26 15:15:46 +00:00
Ankita George	8a40fca9a1	Support huggingface reading and writing for multi rank case (#148189 ) Summary: This diff adds the ability for the HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank. Test Plan: Ensure existing tests pass. I also ran a full end-to-end test on my devserver to read/write from my HF repo. Differential Revision: D70096439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189 Approved by: https://github.com/joecummings, https://github.com/saumishr	2025-03-26 14:47:31 +00:00
Aleksei Nikiforov	0c139fa58e	Switch s390x tests to blocklist (#149507 ) Switch s390x tests to blocklist Pull Request resolved: https://github.com/pytorch/pytorch/pull/149507 Approved by: https://github.com/seemethere	2025-03-26 12:11:41 +00:00
Laith Sakka	7379c66344	add loop mm benchmark (#149932 ) results: compile time instruction count for iteration 4 is 67947323682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149932 Approved by: https://github.com/bobrenjc93, https://github.com/eellison	2025-03-26 11:21:30 +00:00
cyy	79e8a69257	Enable move warnings for torch targets (#149923 ) This PR enables more move warnings for torch targets and fixes some code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923 Approved by: https://github.com/malfet	2025-03-26 08:38:13 +00:00
Nikita Shulga	de68ddc68e	[MPS] Fix metal ops with different dtypes (#149974 ) By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that test `fmax`/`fmin` for mixed dtypes. I had been dreading writing this PR for a while, as it ended up being pretty bulky:
- Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
- Adds `c10::metal::cast_to` to `c10/metal/utils.h`, which can be used to cast any scalar metal dtype to any other one, including complex values
- Implements `val_at_offs<T>(constant void *, long offs, ScalarType dtype)`, which is used to dynamically cast types
- Adds `binary_strided_cast` and `binary_dense_cast`, which are invoked for the output dtype and cast both inputs to that output dtype before performing the op

Benchmark collected on an M2 Pro that runs fmax for 1 mln element tensors (times are in microseconds):

| Op (input dtypes) | dense-dense | transp-transp | dense-transp | transp-dense | dense-scalar | dense-bcast |
|-------------------------|---------------|----------------|----------------|----------------|---------------|---------------|
| fmax (torch.float16, torch.float16) | 160.9 | 159.9 | 270.5 | 270.9 | 236.6 | 293.0 |
| fmax (torch.float32, torch.float32) | 176.9 | 171.0 | 273.7 | 293.5 | 242.6 | 294.2 |
| fmax (torch.float32, torch.float16) | 171.4 | 170.9 | 283.6 | 303.0 | 253.7 | 302.3 |
| add (torch.float16, torch.float16) | 218.0 | 223.6 | 221.0 | 222.0 | 214.9 | 218.3 |
| add (torch.float32, torch.float32) | 227.4 | 233.9 | 228.8 | 231.9 | 218.9 | 221.4 |
| add (torch.float32, torch.float16) | 226.1 | 227.5 | 227.5 | 226.9 | 177.0 | 190.8 |

TODOs:
- Include input and output dtype in non-cast kernel name
- Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
- Extend mixed_dtypes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974 
Approved by: https://github.com/manuelcandales	2025-03-26 07:03:21 +00:00
Aleksei Nikiforov	aa575cab71	Skip cxxabi check for s390x (#149954 ) On s390x, gcc 14 is used because it contains a fix for the interaction between precompiled headers and vectorization builtins. This fix is not available in earlier gcc versions. gcc-14 uses ABI19, but the check still fails, so skip it for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149954 Approved by: https://github.com/cyyever, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-26 06:50:27 +00:00
Justin Chu	6ae8eb881c	[ONNX] Clean up the diagnostics module (#149864 ) Remove the diagnostics/SARIF module from the ONNX exporter because it is obsolete and unused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864 Approved by: https://github.com/titaiwangms	2025-03-26 05:58:32 +00:00
PyTorch MergeBot	d256b2dcb2	Revert "[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555 )" This reverts commit d686d04c2f3bac110044ebad5cc46e3035d7b425. Reverted https://github.com/pytorch/pytorch/pull/148555 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148555#issuecomment-2753283221))	2025-03-26 05:27:52 +00:00
Shangdi Yu	819b23e0b4	Support None return type in torchbind and Add more AOTI torchbind e2e tests (#149749 ) Summary: - Add more tests for torchbind in aoti FallBackKernel - In FallbackKernel.find_device, do not check the device of torchbind obj because they don't have a fixed "device" - If no device found for CallTorchBindObject, use cpu - handle None output in `export_extern_kernel_node` Test Plan: ``` buck run //sigmoid/inference/test:e2e_test_cpu -- -r CustomClassHolderConstantDynamic ``` Differential Revision: D70746626 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149749 Approved by: https://github.com/desertfire	2025-03-26 04:20:14 +00:00
Isuru Fernando	71acb1bb42	[inductor] Fix division by zero error in fractional max (#148729 ) Fixes https://github.com/pytorch/pytorch/issues/148152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148729 Approved by: https://github.com/eellison	2025-03-26 04:18:50 +00:00
eqy	9108d153ce	[CUDA][SymmetricMemory] Interpret empty string as `std::nullopt` in `rendezvous` (#149793 ) This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`, e.g., `9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)`. This currently breaks `test_intra_node_comm_all_reduce`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793 Approved by: https://github.com/kwen2501, https://github.com/cyyever	2025-03-26 03:59:43 +00:00
PyTorch MergeBot	ab9ca6b31f	Revert "[inductor] Fix mm logging for `torch._scaled_.mm` (#149967 )" This reverts commit 661d74bf4483e19e158c41b55d47f02eb9fdcc21. Reverted https://github.com/pytorch/pytorch/pull/149967 on behalf of https://github.com/malfet due to This broke ROCM testing, see `45b11730f1/1` ([comment](https://github.com/pytorch/pytorch/pull/149967#issuecomment-2753149024))	2025-03-26 03:29:59 +00:00
Nichols A. Romero	45b11730f1	[ROCm][TunableOp] TunableOp Context Manager for unit tests (#149930 ) This PR is cleanup only. There are no feature changes or bug fixes. We create a TunableOp context manager for setting up and cleanup. We re-write TunableOp unit tests in terms of this context manager. Ultimately reduces the amount of copy-paste code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930 Approved by: https://github.com/jeffdaily	2025-03-26 02:59:58 +00:00
David Berard	a8d0c5c928	[inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973 ) Fixes #148938 Context: In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels. But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the TMA use kernel This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg). Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973 Approved by: https://github.com/drisspg	2025-03-26 00:12:02 +00:00
PyTorch MergeBot	1b373f6cd4	Revert "cpp_wrapper: Fix even more tests (#147225 )" This reverts commit 62d351a35b1bd961afbd09057beec14ff201c41d. Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))	2025-03-26 00:03:13 +00:00
PyTorch MergeBot	91bf92597c	Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350 )" This reverts commit 0de70fbbe73d2109497cd57ed5402e0cf9450f18. Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))	2025-03-26 00:03:13 +00:00
Vincent Moens	3c85784980	Fix broken LazyLinear init (#149693 ) Fixes #149691 I believe it does not negatively impact the fix in https://github.com/pytorch/pytorch/pull/147599, as the tests still pass, but @FFFrog should confirm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149693 Approved by: https://github.com/mikaylagawarecki, https://github.com/FFFrog, https://github.com/malfet	2025-03-25 23:49:49 +00:00
Rachel Guo	661d74bf44	[inductor] Fix mm logging for `torch._scaled_.mm` (#149967 ) Summary: This pr is just for recreation of the original pr: https://github.com/pytorch/pytorch/pull/149769 Fix for `torch._scaled_mm` op mm logging, which breaks the original brittle underscore parsing assumptions. Test Plan: CI Differential Revision: D71828732 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149967 Approved by: https://github.com/vkuzo	2025-03-25 23:38:35 +00:00
Ethan Wee	c05328e01a	[ROCm] fix uninitialized warning in BFloat16.h (#149868 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149868 Approved by: https://github.com/jeffdaily, https://github.com/cyyever	2025-03-25 23:36:10 +00:00
Ethan Wee	36eb64d60e	[ROCm] missing AT_CUDA_CHECK for cub and SoftMax (#149883 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149883 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007	2025-03-25 23:22:32 +00:00
eqy	de73790fe6	[cuDNN][SDPA] cuDNN SDPA supports `head_dim <= 256` on `sm90` and `sm100` as of `9.5.1+` (#149904 ) gqa check PR will go next... Pull Request resolved: https://github.com/pytorch/pytorch/pull/149904 Approved by: https://github.com/drisspg	2025-03-25 23:10:16 +00:00
Divain	68b327341c	Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808 ) @pytorchbot label "bug" Fixes #149806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808 Approved by: https://github.com/jansel	2025-03-25 23:03:47 +00:00
Ozan Aydin	ce54c430c0	[Submodule] [cpuinfo] cpuinfo update (#149305 ) Updating `cpuinfo` module. Relevant: https://github.com/pytorch/cpuinfo/issues/270 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305 Approved by: https://github.com/malfet	2025-03-25 22:44:50 +00:00
Mu-Chu Lee	feb503c1df	[AOTInductor] Refine error message for dlopen in AOTInductor (#149812 ) Summary: Refine the error message if dlopen failed in AOTInductor. The original error message was ominous, modified to recommend user to rebuild AOTInductor if needed, otherwise it's fine. Test Plan: None. Error message change. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149812 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-25 21:45:10 +00:00
Jeff Daily	0159f8ed54	[ROCm] build magma rocm and upload tarball (#149902 ) This will improve docker image build times by not having to rebuild magma rocm for unrelated changes. This PR is step 1 of 2. The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-25 21:37:13 +00:00
PyTorch MergeBot	d3b7cf7b7d	Revert "[ROCm] build magma rocm and upload tarball (#149902 )" This reverts commit bf8f4efd3158204592643e6cf26889fff5afcee2. Reverted https://github.com/pytorch/pytorch/pull/149902 on behalf of https://github.com/seemethere due to This is currently breaking lint see [GH job link](https://github.com/pytorch/pytorch/actions/runs/14069330750/job/39399569526) [HUD commit link](`bf8f4efd31`) ([comment](https://github.com/pytorch/pytorch/pull/149902#issuecomment-2752594578))	2025-03-25 21:33:00 +00:00
Davide Italiano	e85ce64bde	[MPS/Inductor] Add support for chebyshev_polynomial_t. (#149928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149928 Approved by: https://github.com/malfet	2025-03-25 21:02:13 +00:00
Laith Sakka	6c9d48b32b	refresh results of benchmarks (#149936 ) While the test was disabled, I put in a fix, but another win change landed before the test was restored, so it stayed disabled. <img width="698" alt="Screenshot 2025-03-24 at 6 26 36 PM" src="https://github.com/user-attachments/assets/2713c685-aee2-4dea-9a6c-cad01ef575cd" /> Caused by https://github.com/pytorch/pytorch/pull/149295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149936 Approved by: https://github.com/bobrenjc93	2025-03-25 21:01:08 +00:00
bobrenjc93	90110b069f	Use statically known true in should_decompose_mm (#149950 ) This meta function is causing recompiles for large ads runs due to overguarding: https://www.internalfb.com/ai_infra/job_inspector/guided/pt2_compile?jobName=aps-ig_fm_v4_pt2_on-6e0a734dcc&jobVersion=0&jobAttempt=0 If we look at the reasons, it's because of this function adding guards: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-ig_fm_v4_pt2_on-6e0a734dcc/attempt_0/version_0/rank_0/-_18_8_0/recompile_reasons_1971.json?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 This PR moves to statically_known_true so we don't overly guard for dynamic shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149950 Approved by: https://github.com/mengluy0125	2025-03-25 20:40:00 +00:00
Fuzzkatt	ce3dc9e346	add some extra test oom skips for jetson due to lacking nvml support (#149587 ) Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587 Approved by: https://github.com/eqy	2025-03-25 20:39:10 +00:00
Fuzzkatt	b562d22772	test/test_cuda.py: rework TEST_PYNVML logic to make more sense, add not IS_JETSON condition (#149578 ) PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578 Approved by: https://github.com/janeyx99, https://github.com/eqy	2025-03-25 20:38:15 +00:00
Mu-Chu Lee	12628ba24d	[AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810 ) Summary: We might free the active buffer if we free the buffer twice. Test Plan: ``` LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810 Approved by: https://github.com/chenyang78	2025-03-25 20:26:36 +00:00
Jeff Daily	bf8f4efd31	[ROCm] build magma rocm and upload tarball (#149902 ) This will improve docker image build times by not having to rebuild magma rocm for unrelated changes. This PR is step 1 of 2. The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902 Approved by: https://github.com/malfet	2025-03-25 20:20:36 +00:00
Lucas Kabela	d1ff3ff675	[Bugfix] Add handling for buffer overrides (#149882 ) Fixes #139167 This PR: * uses `named_buffers` to mark static * Checks that `named_buffers` is of expected type (callable, iterator) before trying to iterate over; if not, we skip this pass These changes fix the previous errors in dynamo causing to crash (as shown in issue above) ### Unit Test ``` python test/dynamo/test_buffers_override.py ``` Results in: ``` . ---------------------------------------------------------------------- Ran 2 tests in 5.344s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149882 Approved by: https://github.com/anijain2305	2025-03-25 20:12:43 +00:00
Sam Larsen	8cd6a133f2	Improve subproc autotuning implementation (#149700 ) Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc. One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'd argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running` List of changes: * Use Popen instead of spawn for the autotuning subprocess. * Introduced a new entry point `__autotune_main__.py` * Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill. * Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout. * Deprecated the unused timeout configs in `_inductor/config.py` * Moved `get_ld_library_path` helper to a common utils file. * Added more unit tests for subproc crashes / timeouts / exceptions, etc. Test plan: * New unit tests * Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly. * Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to). Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700 Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison	2025-03-25 20:07:28 +00:00
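The "Popen a subproc on a controlled entry point and communicate over pipes" approach described in 8cd6a133f2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Inductor code: `PipeWorker`, `CHILD_SOURCE`, and the line-delimited JSON protocol are hypothetical names and choices made for the example, and the summation stands in for a real benchmark.

```python
import json
import subprocess
import sys

# Hypothetical child entry point (an assumption for this sketch): it reads one
# JSON command per line from stdin and writes one JSON reply per line to stdout.
CHILD_SOURCE = r"""
import json, sys
for line in iter(sys.stdin.readline, ""):
    cmd = json.loads(line)
    if cmd["op"] == "shutdown":
        break
    reply = {"result": sum(cmd["values"])}  # stand-in for a benchmark
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class PipeWorker:
    """Owns one child process and talks to it over stdin/stdout pipes."""

    def __init__(self):
        # Popen starts the interpreter on a fixed entry point; unlike
        # multiprocessing's spawn, the parent's top-level script is not re-run.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def request(self, values):
        """Send one command and block until the child replies."""
        self.proc.stdin.write(json.dumps({"op": "run", "values": values}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())["result"]

    def shutdown(self):
        """Graceful stop: only works while the child is waiting for a command."""
        self.proc.stdin.write(json.dumps({"op": "shutdown"}) + "\n")
        self.proc.stdin.flush()
        self._close()

    def kill(self):
        """Forced stop: skip the graceful attempt entirely, e.g. after a timeout."""
        self.proc.kill()
        self._close()

    def _close(self):
        # Close our pipe ends and reap the child so no zombie is left behind.
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()
```

A caller would use `request` for each task, `shutdown` when the worker is idle, and `kill` followed by constructing a fresh `PipeWorker` when a task times out, which mirrors the "most aggressive approach" argued for in the commit message.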
PyTorch MergeBot	30e8be599f	Revert "[ONNX] Clean up the diagnostics module (#149864 )" This reverts commit cc6e300fe225ac7f34f37494639b061ef45ceeec. Reverted https://github.com/pytorch/pytorch/pull/149864 on behalf of https://github.com/malfet due to This indeed broke Mac testing see `1c98dc3664/1` ([comment](https://github.com/pytorch/pytorch/pull/149864#issuecomment-2752317873))	2025-03-25 19:31:50 +00:00
Ryan Guo	1c98dc3664	[dynamo] Fix handling of setattr with some tensor attributes (#149791 ) We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because the attribute is a `GetSetDescriptorType` that has special setter logic. See added test and comments for more explanations. This patch makes it so that we graph break in those cases, rather than resulting in silent incorrectness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791 Approved by: https://github.com/mlazos ghstack dependencies: #149481	2025-03-25 18:57:56 +00:00
Benjamin Glass	0de70fbbe7	cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350 ) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes #142005. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350 Approved by: https://github.com/desertfire ghstack dependencies: #146706, #147225	2025-03-25 17:58:40 +00:00
Benjamin Glass	62d351a35b	cpp_wrapper: Fix even more tests (#147225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225 Approved by: https://github.com/desertfire ghstack dependencies: #146706	2025-03-25 17:58:40 +00:00
Benjamin Glass	0f1aaeb62e	cpp_wrapper: persist autotune example tensors until last use (#146706 ) Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`. For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`. Fixes #147799. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146706 Approved by: https://github.com/desertfire	2025-03-25 17:58:40 +00:00
Nikita Shulga	8d1db7f39d	[MPS][BE] Add `c10/metal/common.h` (#149955 ) That could be shared between host and Metal code. So far it contains only one constant, which is the maximum number of tensor dimensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149955 Approved by: https://github.com/Skylion007, https://github.com/manuelcandales	2025-03-25 17:37:24 +00:00
Justin Chu	cc6e300fe2	[ONNX] Clean up the diagnostics module (#149864 ) Remove the diagnostics/SARIF module from ONNX exporter because it is obsolete and unused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864 Approved by: https://github.com/titaiwangms	2025-03-25 16:58:46 +00:00
angelayi	84ae056d82	[invoke_subgraph] Support pending unbacked symint (#149297 ) The "PendingUnbackedSymbolNotFound" error is when an unbacked symbol is created within a piece of code, but this symbol never appears in any of the outputs. I believe the original intention is to help catch incorrectly written meta kernels, where users might've unintentionally created an unbacked symbol but never used it anywhere, but in our case this is intentional. An example is the following test case: ```python def test_pending_unbacked(self): class M(torch.nn.Module): @mark_compile_region def gn(self, x): u = x[0].item() return x * u def forward(self, x): for _ in range(4): x = self.gn(x) return x torch._dynamo.config.capture_scalar_outputs = True torch.compile(M())(torch.randn(8)) ``` This fails with the error: ``` torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {zuf1} not in returned outputs (FakeTensor(..., size=(8,)),) . ``` In this case, creating the unbacked symbol is intentional, so we can bypass this using `fake_mode.shape_env.ignore_fresh_unbacked_symbols()`. Differential Revision: [D71298926](https://our.internmc.facebook.com/intern/diff/D71298926) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149297 Approved by: https://github.com/zou3519 ghstack dependencies: #149296	2025-03-25 16:42:58 +00:00
angelayi	8be1bf1dbb	[export] Add mark_compiled_region support (#149296 ) Differential Revision: [D71298930](https://our.internmc.facebook.com/intern/diff/D71298930) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149296 Approved by: https://github.com/zou3519	2025-03-25 16:42:58 +00:00
Eli Uriegas	5c19952c83	cd: Restore windows release builds for libtorch (#149863 ) These were accidentally deleted in the refactor of DEVTOOLSET + cxx11abi. This happened because the `build_environment` variable wasn't aware of the `build_variant` for libtorch and subsequently overwrote the original file twice, leaving the last written as the actual workflow (which in this case was the debug builds). One thing this has made me curious about is whether we actually need `debug` builds for Windows at all. We don't release them for Linux, and I'd bet that they have low download numbers anyway, so maybe it makes sense to cut them. Adds a build_variant parameter to the dataclass so that we can extend these easily in the future if we want. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149863 Approved by: https://github.com/malfet, https://github.com/atalman	2025-03-25 16:23:59 +00:00
Nikita Shulga	f0ca0d45a6	[CI] Add MacOS-M2-15 as MPS test target on trunk (#149900 ) Now that we have runners allocated by AWS Pull Request resolved: https://github.com/pytorch/pytorch/pull/149900 Approved by: https://github.com/ZainRizvi, https://github.com/seemethere	2025-03-25 16:19:35 +00:00
Wang, Eikan	2cc3f5030a	Add XPU and SYCL Merge Patterns (#149933 ) As the title Pull Request resolved: https://github.com/pytorch/pytorch/pull/149933 Approved by: https://github.com/atalman	2025-03-25 16:03:29 +00:00
Alanna Burke	43ee67e8dc	Removing doc references to PRE_CXX11_ABI. (#149756 ) Fixes #149550 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756 Approved by: https://github.com/svekars, https://github.com/atalman	2025-03-25 16:01:59 +00:00
atalman	5dca832257	Add smoke test to validate pypi env version vs torch compiled and installed versions of nccl and cudnn (#149885 ) Followup after nccl update to validate both cudnn and nccl versions in nightly and release pipelines. Tested on local dev machine, output. Success: ``` Found matching cudnn. Torch: 9.5.1 PyPI 9.5.1.17 Found matching nccl. Torch: 2.25.1 PyPI 2.25.1 ``` Failure: ``` Traceback (most recent call last): File "test1.py", line 29, in <module> compare_pypi_to_torch_versions("nccl", find_pypi_package_version("nvidia-nccl"), torch_nccl_version) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ec2-user/test1.py", line 24, in compare_pypi_to_torch_versions raise RuntimeError( f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}" ) RuntimeError: Wrong nccl version. Torch: 2.25.1 PyPI: 2.26.2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149885 Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/d4l3k	2025-03-25 15:57:53 +00:00
Ivan Grigorev	d90d83c484	[torch] Fix unsafe concurrent access to autocast_enabled (#148281 ) Summary: Making autocast_enabled atomic, as it can be accessed from multiple threads Differential Revision: D70456813 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148281 Approved by: https://github.com/davidberard98	2025-03-25 14:46:12 +00:00
soulitzer	a2bba53f87	Improve error message when view of intermediate is returned from autograd.Function and marked dirty (#149543 ) Fixes https://github.com/pytorch/pytorch/issues/149252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149543 Approved by: https://github.com/zou3519 ghstack dependencies: #149220	2025-03-25 14:44:11 +00:00
PyTorch MergeBot	7b218ca874	Revert "[BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843 )" This reverts commit 86dcdf9c8bb8f69c5d28184b31ee6d7f19127d67. Reverted https://github.com/pytorch/pytorch/pull/149843 on behalf of https://github.com/malfet due to This breaks XPU builds, see `23183fef7e/1` ([comment](https://github.com/pytorch/pytorch/pull/149843#issuecomment-2751482412))	2025-03-25 14:39:10 +00:00
Nikita Shulga	29b3f409c2	[BE][CI] Update actionlint to 1.7.7 (#149919 ) - fix anti-pattern started by https://github.com/pytorch/pytorch/pull/81922 when x86 actionlint binaries were placed in Linux-arm64 folder - Fix renaming lint violations, namely ``` >>> Lint for .github/workflows/_linux-test.yml: Error (ACTIONLINT) [expression] property "workspace" is not defined in object type {arch: string; debug: string; environment: string; name: string; os: string; temp: string; tool_cache: string} 446 \| if: failure() && steps.install-nvidia-driver.outcome && steps.install-nvidia-driver.outcome != 'skipped' 447 \| shell: bash 448 \| env: >>> 449 \| RUNNER_WORKSPACE: ${{ runner.workspace }} 450 \| run: \| 451 \| set +e 452 \| set -x >>> Lint for .github/workflows/create_release.yml: Error (ACTIONLINT) [deprecated-commands] workflow command "set-output" was deprecated. use `echo "{name}={value}" >> $GITHUB_OUTPUT` instead: https://docs.github.com/en/actions/using- workflows/workflow-commands-for-github-actions 80 \| path: ${{ env.PT_RELEASE_FILE }} 81 \| - name: Set output 82 \| id: release_name >>> 83 \| run: echo "::set-output name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz" 84 \| 85 \| upload_source_code_to_s3: 86 \| if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }} >>> Lint for .github/workflows/target-determination-indexer.yml: Error (ACTIONLINT) [shellcheck] shellcheck reported issue in this script: SC2086:info:3:3: Double quote to prevent globbing and word splitting 98 \| DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} 99 \| GITHUB_RUN_ID: ${{ github.run_id }} 100 \| AWS_DEFAULT_REGION: us-east-1 >>> 101 \| run: \| 102 \| # detached container should get cleaned up by teardown_ec2_linux 103 \| container_name=$(docker run \ 104 \| ${GPU_FLAG:-} \ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149919 Approved by: 
https://github.com/jeanschmidt, https://github.com/atalman, https://github.com/Skylion007 ghstack dependencies: #149917, #149918, #149922	2025-03-25 14:37:10 +00:00
Nikita Shulga	6c7f9f7e7d	[CI][BE] Update other actions (#149922 ) Discovered by actionlint-1.7.7: - `actions/checkout@v3`->`actions/checkout@v4` - `actions/setup-python@v4` -> `actions/setup-python@v5` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149922 Approved by: https://github.com/Skylion007 ghstack dependencies: #149917, #149918	2025-03-25 14:37:10 +00:00
Nikita Shulga	535885dc8d	[BE][CI] Update configure-aws-credential to v4 (#149918 ) Prerequisite for update to actionlint-1.7.7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149918 Approved by: https://github.com/Skylion007 ghstack dependencies: #149917	2025-03-25 14:37:02 +00:00
Nikita Shulga	f63b03e9fc	[BE] Add Mac ARM64 actionlint binary (#149917 ) Downloaded from https://github.com/rhysd/actionlint/releases/tag/v1.6.21 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149917 Approved by: https://github.com/Skylion007	2025-03-25 14:36:54 +00:00
Nikita Shulga	23183fef7e	[Test] Add simple MPS op benchmarks (#149914 ) Lots of benchmark tests have been posted in PRs, but they might get lost over time. So let's create a benchmark and populate it with results (preferably from the run on CI machine). Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914 Approved by: https://github.com/dcci, https://github.com/cyyever	2025-03-25 11:31:27 +00:00
Wang, Chuanqi	86dcdf9c8b	[BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843 ) To ensure the build environment is stable Pull Request resolved: https://github.com/pytorch/pytorch/pull/149843 Approved by: https://github.com/EikanWang	2025-03-25 09:11:35 +00:00
Yuanhao Ji	86fbbe44cc	Improve error message for CUDAGuardImpl, MPSGuardImpl, XPUGuardImpl (#149838 ) Fixes #149822 Will get: ``` RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/home/jyh/workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. CUDAGuardImpl initialized with non-CUDA DeviceType: cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149838 Approved by: https://github.com/Skylion007, https://github.com/guangyey	2025-03-25 07:29:53 +00:00
Michael Lazos	a89bdc0565	[Hierarchical Compilation] Handle origin nodes without children (#149685 ) Bug discovered running Hierarchical Compilation on HF. I don't have a smaller repro for this unfortunately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149685 Approved by: https://github.com/williamwen42, https://github.com/anijain2305	2025-03-25 07:27:11 +00:00
Nikita Shulga	5a7588f183	[Build] Remove pre-CXX11 ABI logic from build script (#149888 ) Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888 Approved by: https://github.com/atalman, https://github.com/seemethere ghstack dependencies: #149887	2025-03-25 03:17:16 +00:00
titaiwangms	280e48739a	[ONNX] Set is_in_onnx_export for dynamo=True (#149678 ) Fixes #149141 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149678 Approved by: https://github.com/justinchuby	2025-03-25 03:16:23 +00:00
Tugsbayasgalan Manlaibaatar	27657a00d9	Demote logger of runtime_asserts_frozen to be fired only on debug mode (#149832 ) Differential Revision: [D71702305](https://our.internmc.facebook.com/intern/diff/D71702305) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149832 Approved by: https://github.com/malfet	2025-03-25 02:29:13 +00:00
FEI	59d5cf083b	update torch.nn.ReplicationPad{1,2,3}d deterministic documentation (#148633 ) https://github.com/pytorch/pytorch/issues/115395 This issue mentioned that, when deterministic mode is turned on, a decomp for replication_pad_{1,2,3}d was added to make the backward function deterministic. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/148633 Approved by: https://github.com/isuruf	2025-03-25 02:01:31 +00:00
Saurabh Mishra	d4c578082a	[DCP] Cache save plan metadata to reduce the collective overhead (#149785 ) Summary: Cache save plan metadata to reduce the collective overhead. Global plan dedupe and metadata creation are the main overheads on Rank 0. This change saves all this cost for the subsequent saves if the plans do not change. A quick experiment with the 256 rank job, Global step overhead drops by ~99%, from 90s+ to mere 1.5s. 1.5s was mostly spent on creating the checkpoint module directories and near empty collective. Differential Revision: D71631441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785 Approved by: https://github.com/MeetVadakkanchery	2025-03-25 02:00:15 +00:00
Scott Wolchok	dc39e673e2	Remove aten.elu core ATen decomp because it is now core ATen (#149780 ) Per @larryliu0820. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149780 Approved by: https://github.com/larryliu0820	2025-03-25 01:59:57 +00:00
Zhengxu Chen	84684e9397	[sigmoid] Fix scalar resolution for Scalar_mode aten ops. (#149755 ) Summary: For Scalar variant resolution, we didn't handle a corner case of "Tensor_mode" variant (from aten::div). Adding the missing case to the graph pass. Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_operator_aten_tensor_mode_variant_cpp_runtime Differential Revision: D71638433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149755 Approved by: https://github.com/yushangdi	2025-03-25 01:17:36 +00:00
Tristan Rice	159e97cbcf	ProcessGroupGloo: support reduce_scatter + update support chart (#149869 ) This adds a `reduce_scatter` implementation for ProcessGroupGloo. This is a pretty naive implementation as it does 1 allreduce per rank but may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that has a very similar implementation but requires a fixed tensor size per rank. If users find these functions to be too slow we can address them as issues arise. Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart. Test plan: ``` pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869 Approved by: https://github.com/fduwjj	2025-03-25 01:16:12 +00:00
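The "1 allreduce per rank" idea behind the naive `reduce_scatter` in 159e97cbcf can be illustrated with a single-process simulation. This is a sketch under assumptions, not the ProcessGroupGloo code: ranks are modelled as plain Python lists, and `all_reduce_sum` and `naive_reduce_scatter` are hypothetical helper names introduced only for this example.

```python
def all_reduce_sum(per_rank_values):
    """Simulated allreduce: every rank ends up with the elementwise sum."""
    total = [sum(column) for column in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]


def naive_reduce_scatter(inputs):
    """Simulated reduce_scatter built from one allreduce per destination rank.

    inputs[src][dst] is the chunk that rank `src` contributes toward the
    result owned by rank `dst`. Chunks for different destinations may have
    different lengths, since each destination gets its own allreduce.
    """
    world_size = len(inputs)
    outputs = [None] * world_size
    for dst in range(world_size):
        # Allreduce the chunks destined for `dst` across all source ranks.
        reduced = all_reduce_sum([inputs[src][dst] for src in range(world_size)])
        # Every rank receives the sum, but only `dst` keeps it.
        outputs[dst] = reduced[dst]
    return outputs
```

The cost of this scheme is `world_size` collectives per call, which is why the commit message calls it naive; the benefit is that it needs nothing beyond an existing allreduce.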
Carlo Bertolli	5af9cb12b7	[ROCm] Extend vectorized elementwise kernel to more heterogenous tensor types. (#149738 ) This patch extends the initial support for "vectorized templated" kernels to the following input tensor types: (BFloat16, float) (float, float16) (float16, float) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149738 Approved by: https://github.com/jeffdaily	2025-03-25 01:10:01 +00:00
Stepan Hruda	2a9e737839	[caffe2] Do not use --no-as-needed on macOS (#149421 ) Summary: `--no-as-needed` is not available in ld64.lld Applying this on all macos is potentially too broad? I am not sure if `fbcode//mode/mac` uses a different linker, but arvr mode for sure uses ld64.lld. Test Plan: CI / used for a macOS build on top of the stack. Differential Revision: D71315125 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149421 Approved by: https://github.com/colesbury	2025-03-25 00:41:09 +00:00
bobrenjc93	1cee6c37cc	add bobren and laithsakka as ds owners (#149873 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149873 Approved by: https://github.com/laithsakka	2025-03-25 00:14:04 +00:00
Benjamin Glass	23855391f1	Add regression tests for 3 missing PR-time benchmarks (#149423 ) Uses values from the latest PR-time benchmark run on viable/strict. See https://github.com/pytorch/pytorch/actions/runs/13898520615/job/38900894469 for a job showing why this is needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149423 Approved by: https://github.com/laithsakka	2025-03-24 23:39:36 +00:00
Isalia20	ba46643df1	[MPS] tril op not handling infs correctly (#149866 ) Fixes #149813 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866 Approved by: https://github.com/malfet	2025-03-24 23:38:41 +00:00
Nikita Shulga	51f91e3428	[CD] Check that nightly x86 binaries are built with gcc-11 (#149887 ) Though they should have been with gcc-14, per https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based Pull Request resolved: https://github.com/pytorch/pytorch/pull/149887 Approved by: https://github.com/atalman, https://github.com/seemethere	2025-03-24 23:22:19 +00:00
Jzhyang1	f320c7b766	Rename README.txt to README.md (#149811 ) I am 99% sure this is meant to be a .md file rather than a .txt file Fixes an issue with viewing the README on github, idk what else this accomplishes but it's been bothering me Pull Request resolved: https://github.com/pytorch/pytorch/pull/149811 Approved by: https://github.com/colesbury	2025-03-24 22:33:33 +00:00
Zhengxu Chen	490ce7e67c	[sigmoid] Support _operator.neg/truediv (#149754 ) Summary: adding operator.truediv and operator.neg support to the runtime Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_sym_float_operators_cpp_runtime_nonstrict Differential Revision: D71637267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149754 Approved by: https://github.com/pianpwk	2025-03-24 22:15:25 +00:00
sanchitintel	e77ca19999	[Inductor-CPU] Fix int8 WoQ AMX micro-kernel when `block_n` is 16 or 48 (#149359 ) ### Summary When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`). This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`. This PR also fixes the case of `block_n` being 16. I had introduced [this bug ](`ca9813ea14`) after misreading GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32). ### Future work While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2025-03-24 21:27:46 +00:00
James Wu	49f86a939c	[AOTAutogradCache] Allow Custom Autograd functions behind a flag (#149751 ) This adds a new env var and flag, autograd_cache_allow_custom_autograd_functions, (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`) which allows custom autograd functions into AOTAutogradCache. @hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this should be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default. Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751 Approved by: https://github.com/bdhirsh, https://github.com/zou3519	2025-03-24 21:12:11 +00:00
Ryan Guo	ae6158500a	[dynamo] fix calling torch function on newly constructed tensor subclass (#149481 ) This patch updates existing `test_return_..._subclass` tests in `test/dynamo/test_subclasses.py`, so that they end up invoking the `__torch_function__` method of the newly constructed tensor subclass instances. This exposes a bug in `TensorVariable.method_as_subclass`, where it forgot to grab the `__func__` out of `__torch_function__`, which led to an error down the line. This patch fixes `TensorVariable.method_as_subclass` by centralizing how we extract and wrap torch function, in `build_torch_function_fn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149481 Approved by: https://github.com/jansel	2025-03-24 21:07:41 +00:00
Kirill Goltsman	f12969421e	[DYNAMO] [BUG FIX] correct casting to boolean for TORCH_COMPILE_DISABLE (#149852 ) Fixes #149840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149852 Approved by: https://github.com/jingsh	2025-03-24 20:50:44 +00:00
Tristan Rice	b248edd7cc	ProcessGroupGloo: support ReduceOp::AVG (#149781 ) This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion. This applies to both reduce and allreduce. This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead. Test plan: ``` pytest -v test/distributed/test_c10d_gloo.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781 Approved by: https://github.com/fduwjj	2025-03-24 20:29:30 +00:00
Yuxin Wu	40ec9d2bfa	avoid allocation when tensor_new from storage (#149797 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149797 Approved by: https://github.com/Skylion007	2025-03-24 20:02:45 +00:00
Nikita Shulga	112f983056	[MPS] Replace indexed with strided flavor (#149730 ) Which renders non-contiguous operations much faster for larger tensors, for example `fmax` of 1000x1000 strided tensors takes 270ms with the new algorithm and 430ms with the old one, which needed an additional tensor of 3e6 elements to function. TODO: Add 64-bit indexing logic, as current implementation has the same limitation as `generateKernelDataOffsets` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149730 Approved by: https://github.com/dcci, https://github.com/manuelcandales	2025-03-24 19:37:51 +00:00
Davide Italiano	9179178728	[MPS] Add support for `chebyshev_polynomial_t` in eager. (#149816 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816 Approved by: https://github.com/malfet	2025-03-24 19:19:55 +00:00
Simon Fan	1e5a561c13	[ca] fix accumulate grad polyfill when different strides between param and grad (#149651 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149651 Approved by: https://github.com/jansel ghstack dependencies: #149647, #149709	2025-03-24 19:06:45 +00:00
Simon Fan	754875e237	[ca] API comments and support dynamic shapes via configs (#149709 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149709 Approved by: https://github.com/jansel ghstack dependencies: #149647	2025-03-24 19:06:45 +00:00
Simon Fan	86ee3bf3d5	[ca] use torch.compile ca API for benchmarks (#149647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647 Approved by: https://github.com/jansel	2025-03-24 19:06:45 +00:00
atalman	71145059c8	Allow rebuild of triton on workflow_dispatch (#149865 ) Allows rebuilding triton from main. The latest triton build failed: https://github.com/pytorch/pytorch/actions/runs/13984299781/job/39298288914 The PR that caused the failure was reverted: https://github.com/pytorch/pytorch/pull/148419 We need to rebuild triton now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149865 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-03-24 18:17:47 +00:00
PyTorch MergeBot	bada898f5e	Revert "Extend vec backend with BF16 SVE intrinsics (#143666 )" This reverts commit d072254eaea325a507c1498431e4c8294205fe2d. Reverted https://github.com/pytorch/pytorch/pull/143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](https://github.com/pytorch/pytorch/pull/143666#issuecomment-2749013169))	2025-03-24 18:13:50 +00:00
Jingyi Yang	5beb5b7e47	[torch/c10d] change class variable from private to protected (#149579 ) (#149645 ) Summary: Change class variable from private to protected in ProcessGroupNCCL Test Plan: Existing UT Pass. Reviewed By: kingchc, kwen2501 Differential Revision: D71373067 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149645 Approved by: https://github.com/kwen2501	2025-03-24 17:58:54 +00:00
Ethan Wee	d0c06c4533	[ROCm] Update libamd_comgr.so file in triton wheel build (#149855 ) In ROCm 6.4 and newer, when building Triton in the Triton-ROCm wheel build flow, newer releases of ROCm no longer have libamd_comgr.so.2 as the .so file has been updated to libamd_comgr.so.3 in ROCm 6.4 and newer. We conditionalize on which ROCm the wheel build is for, and choose the .so accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149855 Approved by: https://github.com/Skylion007, https://github.com/jeffdaily	2025-03-24 17:51:14 +00:00
bobrenjc93	60f31f551e	Only print dde partial fx graph for export (#149831 ) Lazos correctly pointed out this doesn't make sense for compile since we graph break in compile. This results in tons of unwanted user log spew. We do want this in export though since it's drastically reduced the support load for DDEs. This PR does the refactor to keep it in export but remove it from compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/149831 Approved by: https://github.com/mlazos	2025-03-24 17:46:18 +00:00
PyTorch MergeBot	42e7bda53e	Revert "[export] Save unflattened gm (#149717 )" This reverts commit 1e159db57c611b98a531341927b2d01f39383f7a. Reverted https://github.com/pytorch/pytorch/pull/149717 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149717#issuecomment-2748924563))	2025-03-24 17:41:01 +00:00
William Wen	6608d4e3e9	[dynamo] keep chained exceptions in user-facing tracebacks (#149676 ) This preserves graph breaks in the case that one graph break directly causes another, e.g. graph breaks in generic context managers. ```python import torch class CtxMgr: def __enter__(self): return self def __exit__(self, exc_type, exc_value, traceback): pass @torch.compile(backend="eager", fullgraph=True) def fn(): with CtxMgr(): with CtxMgr(): pass with CtxMgr(): with CtxMgr(): pass torch._dynamo.graph_break() fn() ``` Output: ``` torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()` Explanation: User-inserted graph break. Message: None Hint: Remove the `torch._dynamo.graph_break()` call. Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}` The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/data/users/williamwen/pytorch/playground.py", line 23, in <module> fn() File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 664, in _fn raise e.with_traceback(None) from e.__cause__ torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking. Hint: Move the offending context manager(s) to outside the compiled region. Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one. Developer debug context: Active generic context managers: [GenericContextWrappingVariable(CtxMgr), GenericContextWrappingVariable(CtxMgr)] from user code: File "/data/users/williamwen/pytorch/playground.py", line 20, in fn torch._dynamo.graph_break() Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). 
For even more developer context, set TORCH_LOGS="+dynamo" ``` Note in particular that both graph breaks (torch._dynamo.graph_break and graph break in context manager) are present in the logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149676 Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305	2025-03-24 17:36:13 +00:00
Angela Yi	1e159db57c	[export] Save unflattened gm (#149717 ) Test Plan: CI Differential Revision: D71082652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149717 Approved by: https://github.com/pianpwk	2025-03-24 17:25:25 +00:00
Yidi Wu	0a0a73a9a9	[cond] don't trace fw and bw graph in autograd key (#148930 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930 Approved by: https://github.com/zou3519	2025-03-24 17:07:29 +00:00
Rachel Guo	9bae904cb4	[inductor] fix combo_kernel logging #2 (#149772 ) Summary: fix another combo kernel logging error: File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init self.create_combo_kernel_nodes(num_ck_nodes=None) File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes) Message: 'ComboKernels: Generating with num_ck_nodes = %d...' Arguments: (None,) Test Plan: Verified in test_combo_kernel.py the logging error went away. Differential Revision: D71655949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149772 Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007	2025-03-24 16:57:45 +00:00
PyTorch MergeBot	453da423d4	Revert "ci: Add sccache to manylinux images (#148419 )" This reverts commit 1099c371505a6a3e3cab69e5afca1e747f2215a4. Reverted https://github.com/pytorch/pytorch/pull/148419 on behalf of https://github.com/atalman due to Breaks triton build ([comment](https://github.com/pytorch/pytorch/pull/148419#issuecomment-2748759515))	2025-03-24 16:43:26 +00:00
Bert Maher	a439524be6	[inductor] Add the largest matmul tile size to default tuning set (#149790 ) While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls. E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops. (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790 Approved by: https://github.com/jansel	2025-03-24 16:32:53 +00:00
Dmitry Nikolayev	db92d0f388	A bunch of typos (#149404 ) Improves readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149404 Approved by: https://github.com/soulitzer	2025-03-24 16:16:04 +00:00
Tristan Rice	ddc0fe903f	ci/docker: use NCCL 2.26.2-1 (#149778 ) Related to #149153 This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip. Test plan: After merging rerun nightly linux jobs and validate that nccl version matches Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778 Approved by: https://github.com/Skylion007, https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2025-03-24 16:14:54 +00:00
Francisco Massa	0a60a0cad4	Let pointwise sharding take arg with largest number of dims in case of ties (#149721 ) Before, we would take the first argument with the largest number of shards, regardless if it had fewer dims than another arg with the same number of shards but more dimensions. This would lead to potentially fewer sharding options Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721 Approved by: https://github.com/tianyu-l	2025-03-24 15:39:39 +00:00
Wang, Chuanqi	2c13a07002	[CI] Fix xpu linux test permission issue and add ci docker image pull (#149053 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149053 Approved by: https://github.com/atalman	2025-03-24 15:19:24 +00:00
Yu, Guangye	db9b031b00	Add default XPU toolkit path to CMake (#149270 ) # Motivation Add default XPU runtime path to CMake to mitigate https://github.com/pytorch/pytorch/issues/149075 This ensures proper linking with `libtorch` when a user does not source the Torch XPU toolkit while working on a C++ library or executable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149270 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/atalman	2025-03-24 14:41:24 +00:00
Isuru Fernando	66b0a0b61a	[inductor] support dilation in max_pool2d lowering (#148209 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148209 Approved by: https://github.com/eellison	2025-03-24 13:00:12 +00:00
PyTorch UpdateBot	dfdc28ea67	Update slow tests (#149844 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149844 Approved by: https://github.com/pytorchbot	2025-03-24 12:12:56 +00:00
Isalia20	248487f455	[MPS] nanmedian with dims (#149680 ) Third most voted op from #77764 Tests were deleted because they are covered by the regular test_output_match tests so those were redundant and were added in the last PR before the nanmedian dim version would be implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-24 03:49:16 +00:00
Yu, Guangye	d5ce5c9509	Reuse format_size utils (#149383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149383 Approved by: https://github.com/malfet	2025-03-24 03:06:27 +00:00
James Wu	de3aca3311	[StaticCudaLauncher] Support any number of kernel arguments (#149442 ) Fixes #149450 This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead. For 0 arguments, triton technically does some undefined behavior by allocating a 0 byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the signature of the cubin has no parameters, so we can just pass nullptr directly. We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't seem to affect runtime performance on benchmarks particularly impressively, and alloca has portability issues, so I'd rather just stick with something simpler for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442 Approved by: https://github.com/jansel	2025-03-23 22:43:47 +00:00
Justin Chu	2dccd70ef0	[ONNX] Clean up legacy dynamo export code (#149745 ) Clean up code that is unused and obsolete. The public `torch.onnx.dynamo_export` is kept for now but the legacy implementation is removed. Remove public option classes and OnnxRegistry that have been deprecated. Users: use torch.onnx.export(…, dynamo=True). Pull Request resolved: https://github.com/pytorch/pytorch/pull/149745 Approved by: https://github.com/titaiwangms, https://github.com/cyyever	2025-03-23 19:35:16 +00:00
Nikita Shulga	8bece88655	[BE] Eliminate TODO for 2022 (#149557 ) Need to think a bit more about what types.h includes Pull Request resolved: https://github.com/pytorch/pytorch/pull/149557 Approved by: https://github.com/albanD	2025-03-23 05:35:54 +00:00
Alfredo Tupone	c201d4dbea	elif is not a cmake keyword (#149655 ) Test for pocketfft_header not in its place is wrong Pull Request resolved: https://github.com/pytorch/pytorch/pull/149655 Approved by: https://github.com/Skylion007	2025-03-23 03:28:53 +00:00
fzyzcjy	85027ef74a	Super tiny fix typo (#149109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149109 Approved by: https://github.com/malfet	2025-03-23 03:02:53 +00:00
James Wu	fe954cdcbf	Use correct boxed_forward_device_index when running `CompiledFxGraph.post_compile` (#148130 ) This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits. We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches to the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`. Testing: ``` python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json" ``` Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130 Approved by: https://github.com/eellison	2025-03-23 02:57:58 +00:00
Mark Saroufim	539db4af4b	load_inline no_implicit_headers mode (#149480 ) In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile 1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s) 2. Just using Tensor.h (down to 15s) 3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda) This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s There are more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel Also in a future PR I'd like to fix issues I've noticed with load_inline 1. It needs a force recompilation mode, I was using this quite a bit myself 2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file 3. 
Instead of relying on pybind, can we use TORCH_LIBRARY instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480 Approved by: https://github.com/malfet	2025-03-22 19:21:29 +00:00
cyy	9367f8f6f1	Remove outdated instructions from CI scripts (#149795 ) Some instructions about Python 3.8 and CUDA 11.3 are removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149795 Approved by: https://github.com/malfet	2025-03-22 18:37:07 +00:00
Davide Italiano	2b848ab192	[MPS/inductor] Add support for modified_scaled_bessel_k{0,1} (#149794 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149794 Approved by: https://github.com/malfet	2025-03-22 15:41:40 +00:00
Animesh Jain	6bbe8dbd63	[dynamo][hooks] config to wrap the top frame in a wrapper (#149758 ) This should be done by default but there are too many issues. This PR is a workaround. https://github.com/pytorch/pytorch/issues/117584 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149758 Approved by: https://github.com/yf225 ghstack dependencies: #149712	2025-03-22 07:17:01 +00:00
bobrenjc93	621c801f78	fix dynamic float when dynamic=True (#149564 ) Fixes https://github.com/pytorch/pytorch/issues/149406#issuecomment-2738111733. Basically previously we would only make floats dynamic via automatic dynamic, now if you set dynamic=True, we will make the floats dynamic on the first compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149564 Approved by: https://github.com/laithsakka	2025-03-22 05:58:59 +00:00
eqy	8f7fbe3d7d	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs, see also #120925 + fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-03-22 05:50:11 +00:00
PyTorch UpdateBot	51fa8fb0ff	[executorch hash update] update the pinned executorch hash (#149585 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149585 Approved by: https://github.com/pytorchbot	2025-03-22 05:14:19 +00:00
Nichols A. Romero	01b1d1f91b	[ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677 ) The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally: - All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match. - We raise an error if submatrices are encountered as this is only supported in online tuning mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677 Approved by: https://github.com/jeffdaily	2025-03-22 02:22:13 +00:00
Davide Italiano	b9a5e1d038	[MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783 ) Another day another op Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783 Approved by: https://github.com/malfet	2025-03-22 02:13:41 +00:00
Tugsbayasgalan Manlaibaatar	021b3e23ec	Fix is_nonzero for more than one elem tensors (#149637 ) Differential Revision: [D71560442](https://our.internmc.facebook.com/intern/diff/D71560442) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149637 Approved by: https://github.com/pianpwk	2025-03-22 02:08:28 +00:00
Xintong Hu	9d02b3993f	[PT2] Port use_triton_lce to PT2 pre_grad passes (#149702 ) Summary: `use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE` code is mostly the same, some minor changes to support aten IR Test Plan: ``` scripts/aetk/aetk -L %run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py ``` will verify the qps after everything done in the stack Reviewed By: frank-wei Differential Revision: D68909857 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702 Approved by: https://github.com/frank-wei	2025-03-22 00:36:58 +00:00
Scott Wolchok	c73a526599	Extract reusable portions of elu_kernel into header (#149673 ) Similar to #140425, we are making the implementation usable via header-only code sharing. Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't. Testing: existing correctness tests should cover. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-03-21 23:54:26 +00:00
PyTorch MergeBot	b238e36fd9	Revert "[BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254 )" This reverts commit b0a5d55c584792a504ec18600180e3d1200dfea6. Reverted https://github.com/pytorch/pytorch/pull/149254 on behalf of https://github.com/izaitsevfb due to seems to be causing multiple test failures ([comment](https://github.com/pytorch/pytorch/pull/149254#issuecomment-2744686862))	2025-03-21 23:44:09 +00:00
Nikita Shulga	27370998b2	[MPS][BE] Move `polar`/`complex` to stubs (#149752 ) No need to have an in-place MPS kernel, as it is just a copy-n-paste of code from TensorFactories.cpp into Binarykernel.mm Pull Request resolved: https://github.com/pytorch/pytorch/pull/149752 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #149727, #149728, #149729	2025-03-21 22:36:05 +00:00
Animesh Jain	d320af0663	[dynamo] Ensure placeholder name is not an intermediate node name (#149712 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1615671879071017/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/149712 Approved by: https://github.com/zou3519	2025-03-21 22:24:45 +00:00
Brian Hirsh	7f836b747f	partitioner: ensure collectives saved by SAC that are actually unused in the bw are properly not saved (#149652 ) This PR fixes one of the issues described here: https://github.com/pytorch/torchtitan/issues/866#issuecomment-2726015248 I spent some time trying to write a unit test and ultimately failed. If folks are interested I can spend more time trying to, but otherwise I have an E2E test with torchtitan. command: ``` CUDA_VISIBLE_DEVICES=1,2,3,4 NGPU=4 CONFIG_FILE="./torchtitan/models/llama/train_configs/llama3_8b.toml" tlp ./run_train.sh --training.steps=30 --training.tensor_parallel_degree=2 --training.compile --experimental.enable_async_tensor_parallel ``` here's the backward graph generated prior to the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/f7d17388-42c2-4d7e-8a55-a00387341ecb/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 and new backward graph with the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/ab8576fc-98c1-4915-af47-699aa8e2557e/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 The main difference is that the input arg `reduce_scatter_tensor_1` is dead code in the bw graph, causing us to unnecessarily save a giant `reduce_scatter` for bw. With the PR, we properly ensure that it is not saved for backward. More comments in the PR, but the main thing going on is that: (1) We have some existing logic that checks for activations that are actually dead code in the backward, and removes them (2) collectives are not properly handled by this code. Why? Collectives are always followed by a `wait_tensor()` call. 
So we need to go one node further and check if the "dead" code has a wait_tensor user that is also dead Pull Request resolved: https://github.com/pytorch/pytorch/pull/149652 Approved by: https://github.com/zou3519 ghstack dependencies: #149514	2025-03-21 22:09:19 +00:00
Brian Hirsh	1c6b517e19	DTensor: more generically support CompositeImplicitAutograd ops under inference mode (#149514 ) Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch. "handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly. It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically. The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode: ``` def fast_function(): return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd) import timeit time_taken = timeit.timeit(fast_function, number=1000) # printed 0.12..., aka 1.2us print(f'func={str(op_call)}, time={str(time_taken)}') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514 Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol	2025-03-21 22:09:19 +00:00
Wei Feng	d46c16fca6	[FSDP2] warning that reshard_after_forward=1 and True are different (#149750 ) People complain about spending time debugging reshard_after_forward=1 when what they actually want is reshard_after_forward=True. 1 and True can generally be used interchangeably in programming, so add a one-time warning to remind users that they are different here: * reshard_after_forward=1 means resharding parameters to world size 1, by keeping unsharded parameters from forward to backward * reshard_after_forward=True means resharding parameters to the FSDP mesh. From the FSDP2 perspective, our docstring is clear about int vs bool: https://pytorch.org/docs/main/distributed.fsdp.fully_shard.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/149750 Approved by: https://github.com/mori360, https://github.com/msaroufim, https://github.com/wconstab	2025-03-21 22:05:20 +00:00
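The int-versus-bool distinction described in this commit can be sketched in plain Python. This is an illustrative sketch only, not the FSDP2 implementation: `describe_reshard_after_forward` is a hypothetical helper, and the behaviour for `False` is an assumption based on the usual meaning of the flag. The point it shows is that `bool` is a subclass of `int`, so the `bool` check has to come first.

```python
import warnings


def describe_reshard_after_forward(value):
    """Hypothetical helper: classify a reshard_after_forward argument."""
    # bool is a subclass of int, so it must be tested before int;
    # otherwise True would be treated as the integer 1.
    if isinstance(value, bool):
        # Assumption: False means parameters are not resharded after forward.
        return "reshard to FSDP mesh" if value else "do not reshard after forward"
    if isinstance(value, int):
        if value == 1:
            # One possible form of the reminder the commit message describes.
            warnings.warn(
                "reshard_after_forward=1 is not the same as True; "
                "1 reshards to world size 1 (parameters stay unsharded).",
                stacklevel=2,
            )
        return f"reshard to world size {value}"
    raise TypeError(f"expected bool or int, got {type(value).__name__}")
```

Although `1 == True` evaluates to `True` in Python, the helper returns different descriptions for the two values because it dispatches on type rather than on equality.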
angelayi	ff020d32b6	[export] Patch dynamo configs when nonstrict tracing (#149295 ) Differential Revision: [D71298929](https://our.internmc.facebook.com/intern/diff/D71298929) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149295 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-03-21 21:44:54 +00:00
Avik Chaudhuri	fb07fe6f36	pretty print graph signature (#149710 ) Fixes #141243 Differential Revision: [D71604218](https://our.internmc.facebook.com/intern/diff/D71604218/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149710 Approved by: https://github.com/angelayi	2025-03-21 21:31:58 +00:00
eellison	5757aa8773	Cudagraph fix + comment cleanup (#149741 ) Cudagraphs is careful to not allow any recorded memory to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. In a hard-to-repro internal use case, this was fixed by an additional `cuda.synchronize()`. I also snuck in the removal of an outdated comment and a duplicate line. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741 Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007	2025-03-21 21:12:36 +00:00
Annop Wongwathanarat	842d51500b	Parallelize sort (#149505 ) PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505 Approved by: https://github.com/fadara01, https://github.com/Skylion007	2025-03-21 20:54:40 +00:00
Xuehai Pan	85f6d61421	[BE] format `test/inductor/s429861_repro.py` (#148554 ) Split from #148186 The diff can be re-generated with the following code in the repo root directory on main branch: ```python import re from pathlib import Path def replace(m: re.Match) -> str: s = m.group() if '\n' not in s: return s indent = m.group("indent") varnames = s.removesuffix("None").replace("=", "").replace("(", "").replace(")", "").split() return "\n".join( [ f"{indent}(", (f"{indent} {varname}," for varname in varnames), f"{indent}) = (None,) {len(varnames)}", ] ) file = Path('test/inductor/s429861_repro.py') content = file.read_text(encoding='utf-8') new_content = re.sub( r"^(?P<indent> )\w+ =(\s($\s\w+\s$\|\w+)\s=\s*)+None$", replace, content, flags=re.MULTILINE, ) file.write_text(new_content, encoding='utf-8') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148554 Approved by: https://github.com/jansel	2025-03-21 20:39:28 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	c5deacc27a	Fix subclass access custom op bug (#149698 ) Summary: When we call torch.inference_mode, we seem to skip the Autograd key, so the custom op that export uses is not decomposed properly before subclass dispatching starts. We fix this by force-desugaring this op at the Python key. Test Plan: test Differential Revision: D71599541 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698 Approved by: https://github.com/bdhirsh	2025-03-21 19:42:56 +00:00
Avik Chaudhuri	09aa63ea2c	preserve custom meta in placeholders (#149661 ) Fixes #147338 Differential Revision: [D71573533](https://our.internmc.facebook.com/intern/diff/D71573533/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149661 Approved by: https://github.com/junpeiz, https://github.com/angelayi	2025-03-21 19:09:38 +00:00
Aaron Orenstein	0eb3ac9349	Make sure to write to caches atomically (#149654 ) This is an attempt to fix #119698 I was unable to reproduce the original described problem on the latest trunk but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix I changed a few of the cache writes to be atomic file swaps (write to temp file, rename file) which should have the same effect without blocking reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654 Approved by: https://github.com/eellison	2025-03-21 18:59:41 +00:00
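The "write to temp file, rename file" pattern mentioned in this commit can be sketched as follows. This is a minimal sketch of the general technique, not the code from the PR; `atomic_write_bytes` is a hypothetical helper name. The temporary file is created in the destination directory so that `os.replace` stays on one filesystem, which is what makes the swap atomic.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Hypothetical helper: write data so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename does not
    # cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace swaps the new file into place in one step; readers
        # see either the old contents or the new contents.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Unlike a lock, this approach does not block concurrent readers, which matches the motivation given in the commit message.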
Shangdi Yu	46dd226702	Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529 ) Summary: We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly. - _fakify_script_objects in `compile_fx` - Allow fake torchbind objects in `torchbind_constants` Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens. Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API. Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`. Test Plan: ``` buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc ``` Differential Revision: D70013257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529 Approved by: https://github.com/angelayi	2025-03-21 18:58:28 +00:00
Alexander Grund	19b763def1	Skip test if torchvision is not available (#149494 ) The test unconditionally imports torchvision and fails if it isn't installed. Skip it in this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149494 Approved by: https://github.com/janeyx99	2025-03-21 18:57:13 +00:00
Aaron Gokaslan	b0a5d55c58	[BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254 ) Have CUDNN have the same version for 12.6 and 12.8 for better performance and consistency. We can't do CU12.1 because it's not supported and CU12.4 isn't updated due to manywheel Linux compatibility reasons and dropping support for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149254 Approved by: https://github.com/jansel, https://github.com/atalman, https://github.com/tinglvv	2025-03-21 18:20:44 +00:00
Pradeep Fernando	1b08aaeafe	Supporting non-tensor-data write_size in planner write items. (#149699 ) Summary: 1. The current write item structure does not contain the amount of data that needs to be written. 2. The planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors. 3. Right now, the only way for the writer layer to get hold of this property (for non-tensor data) is to first look up the actual tensor/bytes and then calculate the nbytes. This change introduces a way to capture non-tensor data size within a write-plan item. Test Plan: Existing UT. Differential Revision: D71599725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699 Approved by: https://github.com/MeetVadakkanchery	2025-03-21 18:09:14 +00:00
Ding, Yi1	f7d1b966c2	[Inductor] Unify the data type propagation between Triton and CPP Backend (#146970 ) Fixes #144246 Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970 Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel	2025-03-21 17:52:51 +00:00
Scott Wolchok	99a4fc5a2f	Add elu as core ATen (#149684 ) Differential Revision: [D71590420](https://our.internmc.facebook.com/intern/diff/D71590420/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149684 Approved by: https://github.com/larryliu0820	2025-03-21 16:56:10 +00:00
LifengWang	fa5f556f88	[CI] enable operator benchmark on CPU (#143733 ) This is to enable operator benchmark for CPU to track op level performance. This PR is motivated by PR: https://github.com/pytorch/pytorch/issues/120982 and investigate feasibility in https://github.com/pytorch/pytorch/pull/127216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143733 Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/huydhn, https://github.com/malfet Co-authored-by: diwei sun <diwei.sun@intel.com> Co-authored-by: chuanqiw <chuanqi.wang@intel.com>	2025-03-21 16:46:03 +00:00
Nikita Shulga	700260f166	[MPS][BE] Get rid of `supports_dense` flag (#149729 ) As all binary ops now support dense Pull Request resolved: https://github.com/pytorch/pytorch/pull/149729 Approved by: https://github.com/dcci ghstack dependencies: #149727, #149728	2025-03-21 16:37:03 +00:00
Nikita Shulga	64d22b9fad	[MPS][BE] Migrate complex_mul to tensor iterator (#149728 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149728 Approved by: https://github.com/dcci ghstack dependencies: #149727	2025-03-21 16:37:03 +00:00
Nikita Shulga	e35ef61066	[MPS][BE] Migrate `torch.complex` to binary_functor (#149727 ) As it's very similar in nature to `torch.polar` Though rename kernel from `complex_kernel` to `make_complex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149727 Approved by: https://github.com/dcci	2025-03-21 16:36:56 +00:00
Davide Italiano	bdc132d0e1	[MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705 Approved by: https://github.com/malfet	2025-03-21 16:14:29 +00:00
Jithun Nair	1eab841185	Add release branch push triggers to inductor-rocm-mi300.yml (#149672 ) In similar vein as https://github.com/pytorch/pytorch/pull/149517 When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672 Approved by: https://github.com/jeffdaily	2025-03-21 16:02:03 +00:00
Davide Italiano	5d4b5ee315	[MPS] Add inline to function definition. (#149704 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149704 Approved by: https://github.com/malfet	2025-03-21 14:53:09 +00:00
Ryo Suzuki	d072254eae	Extend vec backend with BF16 SVE intrinsics (#143666 ) - Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`. - Added bf16 detection in CMake - Added a guard for native NEON code to prevent compilation errors @aditew01 @maajidkhann please have a look Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666 Approved by: https://github.com/swolchok, https://github.com/aditew01 Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>	2025-03-21 10:55:11 +00:00
Nikita Shulga	68dfd44e50	Do not depend on numpy during the import (#149683 ) But a good followup would be to use torch primitives instead of numpy here Fixes https://github.com/pytorch/pytorch/issues/149681 Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683 Approved by: https://github.com/seemethere	2025-03-21 08:14:57 +00:00
Michael Lazos	34743678b9	[Dynamo] Cleanup state management for ctx managers (#149689 ) Removes state indirection for ctx managers. This isn't needed anymore since VTs are mutable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149689 Approved by: https://github.com/StrongerXi	2025-03-21 07:18:33 +00:00
Arash Pakbin	cfc08caea9	[ROCm] NLLLoss (torch.nll_loss) Performance Tuning by Dynamically Selecting # of GPU threads (#149548 ) Instead of fixing the number of GPU threads to 32 regardless of input size, this PR dynamically selects the number of threads based on the formula: clamp(2^round(log2(dim0/16)), min = 32, max = 1024). The experiments below were done on an MI300 machine for data type float32: ![nll_loss_threads_bests](https://github.com/user-attachments/assets/3be3d465-e3db-44ed-991a-fdfcab03baae) ![nll_loss_heauristic](https://github.com/user-attachments/assets/e82b9788-9b4d-4862-a180-8df7ad298182) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149548 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-03-21 07:16:37 +00:00
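The thread-count formula quoted in this commit, clamp(2^round(log2(dim0/16)), min = 32, max = 1024), can be worked through in a few lines. This is a sketch of the arithmetic only, not the ROCm kernel code; `nll_loss_num_threads` is a hypothetical name, the handling of non-positive `dim0` is an assumption, and Python's `round` (which rounds exact halves to the nearest even integer) may differ from the rounding used in the actual implementation.

```python
import math


def nll_loss_num_threads(dim0):
    """Hypothetical helper: evaluate the heuristic from the commit message."""
    if dim0 <= 0:
        # Assumption: fall back to the minimum for empty or invalid sizes.
        return 32
    # Nearest power of two to dim0 / 16 ...
    exponent = round(math.log2(dim0 / 16))
    threads = 2.0 ** exponent
    # ... clamped to the range [32, 1024].
    return int(min(max(threads, 32), 1024))
```

For example, `dim0 = 4096` gives 4096/16 = 256 = 2^8, so 256 threads; anything at or below 512 rows stays at the 32-thread floor, and 16384 rows or more saturates at 1024.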
Davide Italiano	0ed34210b2	[MPS] Add support for `modified_bessel_k1` to eager and inductor. (#149687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687 Approved by: https://github.com/malfet	2025-03-21 04:59:06 +00:00
Yuanhao Ji	0a396a8160	[Docs] Make `torch.Library`'s `kind` have no default value to be consistent with the code (#149390 ) Fixes #149389 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149390 Approved by: https://github.com/janeyx99	2025-03-21 04:42:10 +00:00
Jing Xu	4ea580568a	update aotinductor doc for XPU support (#149299 ) as title. Since the AOTInductor feature starting from 2.7 works on Intel GPU, add the related contents into its doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299 Approved by: https://github.com/guangyey, https://github.com/desertfire	2025-03-21 04:40:31 +00:00
Rachel Guo	ccd5d811e8	[aoti] follow up to use new api in `test_provenance_tracing.py` (#149387 ) Summary: As title. Follow up of D71181284. and some minor refactoring Context : D69609685 (update test runner to use new api) / https://github.com/pytorch/pytorch/pull/147105 Test Plan: ``` buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_cpu ``` Differential Revision: D71375725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149387 Approved by: https://github.com/yushangdi	2025-03-21 04:37:50 +00:00
Nikita Shulga	5327894812	[BE] Introduce `lapack_work_to_int` function (#149682 ) That could be used to safely cast floating values to int by adding an ULP, which is a followup after https://github.com/pytorch/pytorch/pull/146456 Fixes https://github.com/pytorch/pytorch/issues/149591 (Not adding unittest as it's just going to be too slow) Test plan: ``` % python3 -c "import torch; torch.pinverse(torch.rand(50000, 8193))" ``` Before the change errored out with ``` RuntimeError: false INTERNAL ASSERT FAILED at "pytorch/pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp":1605, please report a bug to PyTorch. linalg.svd: Argument 12 has illegal value. Most certainly there is a bug in the implementation calling the backend library. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149682 Approved by: https://github.com/wdvr	2025-03-21 04:08:07 +00:00
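The idea of "safely cast floating values to int by adding an ULP" can be illustrated without PyTorch. The sketch below is an assumption-laden illustration, not the `lapack_work_to_int` implementation: it emulates float32 with `struct`, and `to_float32`, `next_float32_up`, and `work_to_int` are hypothetical names. It shows why a workspace size reported as a float32 can round below the true requirement, and how stepping up by one ULP before converting avoids undersizing.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32_up(x):
    """Next representable float32 above a non-negative finite float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # For non-negative finite values, incrementing the bit pattern
    # yields the next larger float32.
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def work_to_int(val):
    """Hypothetical helper: convert a float32 work size to a safe int."""
    return max(1, math.ceil(next_float32_up(val)))


# A required size that float32 cannot represent exactly.
required = 2**24 + 1
reported = to_float32(required)   # what a float32 work query would hold
naive = int(reported)             # plain truncation
safe = work_to_int(reported)      # one ULP of headroom
```

Here `naive` is smaller than `required`, which is the kind of undersized workspace that can lead to an "illegal value" error, while `safe` is at least `required`.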
Yuanhao Ji	bf6621d08f	[Distributed] Add `repr` methods for `ParallelStyle`s (#149478 ) Fixes #149470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149478 Approved by: https://github.com/wanchaol	2025-03-21 03:59:25 +00:00
xinan.lin	ee6a029165	[XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511 Approved by: https://github.com/EikanWang	2025-03-21 03:56:00 +00:00
zeshengzong	732f9d7435	Optimize `torch.equal` description (#149618 ) Fixes #149222 ## Test Result ![image](https://github.com/user-attachments/assets/559a376f-2dd0-4474-bbd5-9299d9df51e3) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149618 Approved by: https://github.com/zou3519	2025-03-21 03:44:49 +00:00
Xia, Weiwen	64bd889660	[Inductor][CPP] rename shim_mkldnn.h/.cpp to shim_cpu.h/.cpp (#149372 ) Summary Previous discussion is here: https://github.com/pytorch/pytorch/pull/148907#issuecomment-2712795600 Rename these files because - they may hold mkldnn-unrelated code for CPU - filenames are aligned with files for CUDA and XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/149372 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2025-03-21 03:42:12 +00:00
Justin Chu	a39bf846f5	[ONNX] Add draft_export as a strategy (#147529 ) Create draft_export strategy. The strategy is added before jit and after strict=True, as the third fallback. Since it is specializing tensors it should not be less robust than the jit trace strategy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529 Approved by: https://github.com/titaiwangms	2025-03-21 03:05:17 +00:00
Hollow Man	0692301e25	Catch OSError in general when writing files (#149464 ) Redundant exception types in `except (PermissionError, OSError):`. Write `except OSError:`, which catches exactly the same exceptions. https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991 When hipify files, or writing cprofile files, PermissionError is not enough when the file is located in a place that is not writable at all, or other OS errors happened when writing files. This fix makes the code more robust. Example error log: ```log File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__ fused_adam_cuda = FusedAdamBuilder().load() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "deepspeed/ops/op_builder/builder.py", line 540, in load return self.jit_load(verbose) ^^^^^^^^^^^^^^^^^^^^^^ File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load op_module = load(name=self.name, ^^^^^^^^^^^^^^^^^^^^ File "torch/utils/cpp_extension.py", line 1597, in load return _jit_compile( ^^^^^^^^^^^^^ File "torch/utils/cpp_extension.py", line 2031, in _jit_compile hipify_result = hipify_python.hipify( ^^^^^^^^^^^^^^^^^^^^^ File "torch/utils/hipify/hipify_python.py", line 1167, in hipify preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs, File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "torch/utils/hipify/hipify_python.py", line 919, in repl preprocess_file_and_save_result(output_directory, File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result result = preprocessor(output_directory, 
filepath, all_files, header_include_dirs, stats, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "torch/utils/hipify/hipify_python.py", line 123, in open return open(fn, args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464 Approved by: https://github.com/janeyx99	2025-03-21 02:42:50 +00:00
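The claim that `except OSError:` catches exactly the same exceptions as `except (PermissionError, OSError):` follows from the built-in exception hierarchy. A minimal sketch, with `try_write` as a hypothetical helper rather than code from hipify:

```python
def try_write(path, text):
    """Hypothetical helper: write text, reporting any OS-level failure."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError as exc:
        # Covers PermissionError, FileNotFoundError, and errors such as
        # "[Errno 30] Read-only file system", all subclasses or instances
        # of OSError.
        print(f"could not write {path}: {exc}")
        return False


# PermissionError is a subclass of OSError, so listing both is redundant.
redundant = issubclass(PermissionError, OSError)
```

Because `redundant` is `True`, the tuple form adds nothing over `except OSError:`.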
Justin Chu	362b40939d	[ONNX] Improve docstring of onnx symbolic ops (#149668 ) Better examples Pull Request resolved: https://github.com/pytorch/pytorch/pull/149668 Approved by: https://github.com/titaiwangms	2025-03-21 01:57:39 +00:00
Matthias Braun	66dd00fca0	Fix clang-tidy errors (#149581 ) Summary: Cleanup clang-tidy complaints in `EmbeddingBag.cpp`: Avoid shadowed variables and unused parameters. Test Plan: sandcastle Differential Revision: D71512594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149581 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-03-21 01:53:57 +00:00
Simon Fan	e481615bc7	[aot] always lower the backward with a deepcopy (#149229 ) FIXES https://github.com/pytorch/pytorch/issues/149105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149229 Approved by: https://github.com/bdhirsh	2025-03-21 01:47:13 +00:00
Xintong Hu	5ebc283f2c	[PT2] Port use_triton_dot_compress to PT2 pre_grad passes (#148517 ) Summary: add use_triton_dot_compress in pre_grad Test Plan: ``` scripts/aetk/aetk -L %run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py ``` Reviewed By: frank-wei Differential Revision: D68909838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148517 Approved by: https://github.com/frank-wei	2025-03-21 01:42:32 +00:00
James Wu	c2ada9d77b	[easy] Do not logspam if static cuda launcher is disabled (#149669 ) No need to log.info every time someone runs with StaticCudaLauncher disabled. Test plan: Run any benchmark and see that we don't spam the bypass message in logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149669 Approved by: https://github.com/oulgen, https://github.com/jansel ghstack dependencies: #148890	2025-03-21 01:22:26 +00:00
Eli Uriegas	1099c37150	ci: Add sccache to manylinux images (#148419 ) Adds sccache to our manylinux images. These are purposefully built without the sccache-dist binary since we're not expecting to use that. Another caveat of these builds is that they are built with the vendored version of openssl. This is to set the stage for us to be able to build binaries sequentially. Signed-off-by: Eli Uriegas <github@terriblecode.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148419 Approved by: https://github.com/atalman	2025-03-21 01:15:34 +00:00
Han, Xu	2975664fb0	add python root bin to windows load path. (#146573 ) This PR adds the python root bin path to the DLL load list. It makes PyTorch more robust and compatible with more dependency libraries, such as `intel-pti`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146573 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-03-21 00:48:43 +00:00
Sam Larsen	90543e90a0	Fix broken dynamo_timed test due to python_version field (#149659 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149659 Approved by: https://github.com/ppanchalia	2025-03-21 00:27:28 +00:00
Zhengxu Chen	f47aa08130	[export] Support python assertion with symints. (#149444 ) Summary: This diff ports some technique from torch.fx symbolic trace to trace through Python asserts when we run into data dependent symbolic shape assertions, so that we can achieve the same effect as torch dynamo to automatically turn assert into torch.check()s. Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int Differential Revision: D71425360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444 Approved by: https://github.com/tugsbayasgalan	2025-03-20 23:07:45 +00:00
angelayi	bf34e228c5	[export] Beef up guard_added logs (#149465 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149465 Approved by: https://github.com/pianpwk	2025-03-20 23:02:07 +00:00
Michael Lazos	1d3c50fcc5	[Dynamo] Support the torch._C.DisableTorchFunction ctx manager (#149491 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149491 Approved by: https://github.com/StrongerXi ghstack dependencies: #149489, #149490	2025-03-20 22:19:55 +00:00
Michael Lazos	ce5adc5c05	[Dynamo] add support for torch._C._is_torch_function_all_disabled (#149490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149490 Approved by: https://github.com/StrongerXi ghstack dependencies: #149489	2025-03-20 22:19:55 +00:00
Michael Lazos	f64c361860	[Dynamo] Refactor DisableTorchFunction ctx manager (#149489 ) Refactors the DisableTorchFunction ctx manager to properly model the eager code (no args to the context manager). Pull Request resolved: https://github.com/pytorch/pytorch/pull/149489 Approved by: https://github.com/StrongerXi	2025-03-20 22:19:55 +00:00
zhc7	a268c29b9f	[distributed] fix: use group rank instead of global rank when possible (#149488 ) Fixes #149200 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149488 Approved by: https://github.com/wconstab	2025-03-20 21:47:03 +00:00
Isuru Fernando	b07b819912	[inductor] Add a helper for convert index_dtype to torch dtype (#149531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149531 Approved by: https://github.com/eellison	2025-03-20 21:33:29 +00:00
Zhuoran Zhao	a703107f7b	[AOTInductor] Fix skip cpp wrapper unit test (#149606 ) Summary: as title Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test -- --exact 'deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_aoti_ep_called (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)' ``` ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --exact 'caffe2/test/inductor:cudagraph_trees_expandable_segments - test_skip_cpp_wrapper (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)' ``` https://www.internalfb.com/phabricator/paste/view/P1758059197 Reviewed By: henryoier Differential Revision: D71528281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149606 Approved by: https://github.com/desertfire	2025-03-20 20:55:33 +00:00
Guilherme Leobas	406d464d97	Add `is_batchedtensor` to dynamo builder (#149541 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149541 Approved by: https://github.com/zou3519	2025-03-20 20:46:15 +00:00
Kai Londenberg	f17ae3f7b7	[Inductor Cutlass backend] Fix imports and compilation of Cutlass SM100 Kernels (#149515 ) Summary: Fixes the import and compilation of Cutlass SM100 Kernels. Test Plan: Cutlass backend unit tests, running benchmarks/inductor_backends/cutlass.py Differential Revision: D71196747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149515 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2025-03-20 20:35:18 +00:00
PyTorch MergeBot	24176f6e32	Revert "[cond] don't trace fw and bw graph in autograd key (#148930 )" This reverts commit 6e843a51dd5743b864fc28601ef06cdc18488b3e. Reverted https://github.com/pytorch/pytorch/pull/148930 on behalf of https://github.com/ydwu4 due to Test failure is legit ([comment](https://github.com/pytorch/pytorch/pull/148930#issuecomment-2741585315))	2025-03-20 20:28:29 +00:00
Yidi Wu	4a4a71a73c	[inductor]lowering scan to while_loop (#148580 ) This PR add a pass in post_grad that lowers scan to while_loop. See the comment before the pass for how this is implemented. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148580 Approved by: https://github.com/jansel, https://github.com/eellison	2025-03-20 20:21:02 +00:00
Yidi Wu	6e843a51dd	[cond] don't trace fw and bw graph in autograd key (#148930 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930 Approved by: https://github.com/zou3519	2025-03-20 20:18:29 +00:00
Guilherme Leobas	18435945af	Set __context__/__cause__ when generator raise `StopIteration` (#148765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148765 Approved by: https://github.com/zou3519 ghstack dependencies: #146505	2025-03-20 19:59:30 +00:00
Guilherme Leobas	44e6464914	Allow setting attribute to NestedUserFunctionVariable (#146505 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146505 Approved by: https://github.com/zou3519	2025-03-20 19:59:30 +00:00
Dominic Binks	aae4c0729e	Fix broken build within xplat/caffe2 (#149403 ) Summary: Following a pull from open source, the build within xplat is broken due to not finding <autograd/function.h>. Within the python_function.cpp there seems to be a convention of using the torch/csrc prefix. This change includes that prefix to enable the build to proceed. Test Plan: Build a binary using torch. https://www.internalfb.com/buck2/83122485-d3c3-43f4-97b4-81bb90450b3b Unit tests run too https://www.internalfb.com/intern/testinfra/testrun/13229323975828416 Further testing in CI and elsewise expected. Reviewed By: malfet Differential Revision: D70331539 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149403 Approved by: https://github.com/izaitsevfb Co-authored-by: Dominic Binks <dbinks@meta.com>	2025-03-20 19:27:55 +00:00
Yi Wang	ffa085334c	Specify the default PyTorch Distributed backend for MPS (#149538 ) Fixes #149537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149538 Approved by: https://github.com/d4l3k, https://github.com/malfet	2025-03-20 18:54:03 +00:00
Natalia Gimelshein	1d221724fc	fix missing field initializer warning (#149597 ) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/149597 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2025-03-20 18:48:05 +00:00
William Wen	6285a71aba	[dynamo] fix bug where non-recursive disable modifies the original function (#148896 ) Fixes https://github.com/pytorch/pytorch/issues/148787. We fix this by: - Wrapping the original function instead of directly modifying it - When we detect that the previous frame is the non-recursive disable wrapper, then skip tracing this frame (non-recursive disable wrapper will always be skipped, so that frame will be present in the traceback) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148896 Approved by: https://github.com/jansel	2025-03-20 18:33:54 +00:00
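The difference between wrapping a function and modifying it in place can be sketched generically. This is not the dynamo implementation: `disable_by_wrapping` and the `_hypothetical_disabled` marker are made-up names for illustration, and the real mechanism for skipping frames is not shown.

```python
import functools


def disable_by_wrapping(fn):
    """Hypothetical sketch: mark a wrapper, leave the original untouched."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Simply forwards to the original function.
        return fn(*args, **kwargs)

    # The marker lives on the wrapper only; fn itself is not mutated, so
    # other references to fn keep their original behaviour.
    wrapper._hypothetical_disabled = True
    return wrapper


def square(x):
    return x * x


disabled_square = disable_by_wrapping(square)
```

After this, `disabled_square` carries the marker while `square` does not, which is the property the fix relies on: callers holding the original function are unaffected.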
Jane Xu	88a26dbb9d	[BE] simplify test_cpp_extensions_aot and .gitignore (#149231 ) It is shady to clean up an install mid-test. So don't do that anymore and use .gitignore instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149231 Approved by: https://github.com/albanD, https://github.com/msaroufim	2025-03-20 18:17:19 +00:00
Sergey Zimin	b99fc9d29f	[MTIA] Support loading Tensors on mtia:0 for pytorch code (#149327 ) Summary: The diff includes updates to the PyTorch code to enable loading tensors to MTIA. Reviewed By: PatriceVignola Differential Revision: D71176848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149327 Approved by: https://github.com/ezyang	2025-03-20 18:05:15 +00:00
James Wu	7bb9c36784	Hook StaticCudaLauncher up to torch.compile (cold start) (#148890 ) This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default. Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if: - The kernel is a cuda kernel and inductor can find a cubin file associated with it - The kernel takes less than 50 arguments - The kernel doesn't use any special features (launch hooks, large amounts of shared memory) - The kernel is not user defined (to be supported in a later PR) We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version. Some key features of StaticTritonCompileResult: - It is fully serializable - It stores the minimum amount of stuff, so that later it can be cached easily - It does not depend on any triton specific types (though it does have various triton metadata). For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime. Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes. 
Fixes #149448 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890 Approved by: https://github.com/jansel	2025-03-20 17:32:20 +00:00
Dmitry Nikolaev	c99efc08fb	[ROCm] skip test_RNN_dropout_state (#149446 ) Skips test_nn.py::TestNN::test_RNN_dropout_state. ROCm currently doesn't support a dropout value for RNN; the PR to enable RNN dropout on ROCm (pytorch/pytorch#144572) is still in review and blocked. Fixes: https://github.com/pytorch/pytorch/issues/68849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-03-20 17:22:39 +00:00
Eli Uriegas	1d9401befc	ci: Remove mentions and usages of DESIRED_DEVTOOLSET and cxx11 (#149443 ) This is a remnant of our migration to manylinux2_28; we should remove these since all of our binary builds are now built with cxx11_abi. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443 Approved by: https://github.com/izaitsevfb, https://github.com/atalman	2025-03-20 16:49:46 +00:00
Avik Chaudhuri	6237495fcf	torch.Size input (#149414 ) Summary: Support for `torch.Size` inputs was patchy before because `unflatten_fn` for this type returned a tuple. This PR cleans this up. Fixes #149158 Test Plan: added test Differential Revision: D71403635 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149414 Approved by: https://github.com/yushangdi	2025-03-20 16:23:13 +00:00
IvanKobzarev	2c4bc65366	[aotd] Guess tangents stride as output strides (#144579 ) When AOTDispatch prepares the AOT backward graph, it does not know the real tangents that the user will pass when running backward, so AOTD guesses them. Before this PR, we guessed that the memory format of each tangent would match the memory format of the corresponding output. If the tangents specified at runtime do not have the guessed memory format, AOTD coerces (copies) them to the guessed memory_format. But as Horace found, there are popular use cases where the outputs of the compiled region have a specific memory_format, e.g. a 4D tensor with dims 1 and 2 transposed: https://github.com/karpathy/nanoGPT/blob/master/model.py#L57 This PR changes the logic so that AOTD expects tangents to have the same "strideness" as the outputs. As a result, it avoids the coercion in the transposed-dims case. Limitations: We keep guessing memory_format for: 1/ Dynamic shapes (needs more changes) 2/ Tensor subclasses (needs more changes) Other changes: test_torchinductor was always creating contiguous tangents via `torch.randn()`; they are changed to `torch.randn_like()` so that computation is compared with the same strideness. (E.g. for cuda float16 strideness affects numerics for fft ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579 Approved by: https://github.com/bdhirsh	2025-03-20 15:41:36 +00:00
Andrey Talman	9b1127437e	Add triton as dependency to CUDA aarch64 build (#149584 ) Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705 Hence add the proper constraint to the CUDA 12.8 Aarch64 build. Please note we still want to use: ```platform_system == 'Linux' and platform_machine == 'x86_64'``` for all other builds, since these are prototype binaries only used by the cuda 12.8 linux aarch64 build, which we would like to serve from download.pytorch.org Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584 Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-20 15:39:45 +00:00
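The difference between guessing a memory format and guessing strides can be illustrated with plain stride arithmetic. The following is a minimal sketch, not AOTDispatch code: the helper names `contiguous_strides` and `transpose` are hypothetical, and the shape is an arbitrary example. It shows that a 4D output with dims 1 and 2 transposed has strides that differ from those of a contiguous tensor of the same shape, which is the case where a contiguous guess would force a copy of the tangent.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def transpose(shape, strides, dim0, dim1):
    """Swap two dims of a (shape, strides) pair without moving any data."""
    shape, strides = list(shape), list(strides)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return tuple(shape), tuple(strides)


# Hypothetical 4D output: a contiguous tensor with dims 1 and 2 swapped.
base_shape = (2, 3, 4, 5)
out_shape, out_strides = transpose(base_shape, contiguous_strides(base_shape), 1, 2)

# Strides that a "contiguous" guess would assign to a tangent of the same shape.
guessed_contiguous = contiguous_strides(out_shape)

# A tangent laid out like the output does not match the contiguous guess,
# so a guess based on contiguity alone would require a coercing copy.
needs_copy_under_contiguous_guess = out_strides != guessed_contiguous
```

Matching the output's strides directly, as the commit message describes, removes that mismatch for the transposed case.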
Zhengxu Chen	80dfce2cc3	[export] Handle non OpNamespace type during decomposition. (#149431 ) Summary: Turns out we can have non OpNamespace object in torch.ops._dir. We should just throw away those during iteration. Test Plan: eyes Differential Revision: D71417992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149431 Approved by: https://github.com/tugsbayasgalan	2025-03-20 15:36:15 +00:00
ZhiweiYan-96	d67c1a027e	[Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473 ) # Motivation This PR fixes a bug that wrongly decided the zero-point mask setting. Specifically, the zero-point was always deemed non-zero because the scale was used for the check. Fortunately, the bug only affects performance; accuracy is not affected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473 Approved by: https://github.com/EikanWang, https://github.com/guangyey	2025-03-20 14:46:07 +00:00
Sun, Jiayi	496bbf38be	add grad_output shape check for adaptive_avg_pool2d_backward (#145241 ) Fix https://github.com/pytorch/pytorch/issues/145070. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145241 Approved by: https://github.com/malfet, https://github.com/eqy	2025-03-20 14:10:31 +00:00
Shuai Yang	00a2c68f67	Fix a typo "trochrec" to "torchrec" (#149542 ) Summary: As titled, the path is incorrect due to the typo Test Plan: CI Differential Revision: D71490709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149542 Approved by: https://github.com/williamwen42	2025-03-20 10:14:23 +00:00
William Wen	a66a9581da	[dynamo] support Python 3.13t (#149549 ) A few bug fixes to get Dynamo mostly working with 3.13 nogil. Dynamo encounters internal CPython assert errors in older versions of 3.13. The fix has landed on [CPython's 3.13 branch](https://github.com/python/cpython/tree/3.13) and will be included in 3.13.3 (https://peps.python.org/pep-0719/ - April 8). If you wish to try `torch.compile` on the latest 3.13 branch, you can comment out the error checking (i.e. `70b6cd4e11/torch/__init__.py (L2535)` and `70b6cd4e11/torch/_dynamo/eval_frame.py (L899)`). We will work on getting PyTorch CI up for Dynamo/dynamo-wrapped/inductor once 3.13.3 is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149549 Approved by: https://github.com/jansel	2025-03-20 09:49:27 +00:00
Blaine Burton Rister	970ac2d907	[Inductor] Improve memory locality by iterating over y dimension before x (#149339 ) # Feature Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`. As a bonus refactor, block pointers no longer needed the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order. # Perf impact It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).) Training (all neutral or positive): ![image](https://github.com/user-attachments/assets/57f1ef1d-60b4-446f-baf3-aca87a26b81b) Inference (one positive, one very small negative): ![image](https://github.com/user-attachments/assets/679aa057-af23-47f1-8d8e-8520daf1bd92) As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html): > Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.). 
I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache. > The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes. The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example. Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel. # Test plan - Updated expected code on CI tests. - Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339 Approved by: https://github.com/jansel	2025-03-20 08:12:00 +00:00
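The coalescing argument above can be made concrete with a little index arithmetic. This is a generic sketch rather than Inductor's generated code: the function names and the 4x8 row-major layout are assumptions chosen for illustration. It compares the memory offsets touched by consecutive thread IDs when x varies fastest against when y varies fastest.

```python
def offsets_x_fastest(num_threads, nx):
    """Consecutive threads walk along x: thread t maps to (y, x) = (t // nx, t % nx)."""
    return [(t // nx) * nx + (t % nx) for t in range(num_threads)]


def offsets_y_fastest(num_threads, ny, nx):
    """Consecutive threads walk along y: thread t maps to (x, y) = (t // ny, t % ny)."""
    return [(t % ny) * nx + (t // ny) for t in range(num_threads)]


# Hypothetical row-major buffer with ny rows and nx columns.
ny, nx = 4, 8

coalesced = offsets_x_fastest(4, nx)        # neighbouring threads, neighbouring addresses
strided = offsets_y_fastest(4, ny, nx)      # neighbouring threads, addresses nx apart
```

With x fastest, the first four threads touch offsets 0, 1, 2, 3; with y fastest they touch 0, 8, 16, 24, so each access lands in a different row.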
Bin Bao	3647711a89	[AOTI][refactor] Remove dead code (#149287 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149287 Approved by: https://github.com/cyyever, https://github.com/yushangdi	2025-03-20 07:29:27 +00:00
PyTorch MergeBot	90ef7a9561	Revert "Supporting non-tensor-data write_size in planner write items. (#149434 )" This reverts commit 1442230a267f0ce4f0bb540fca775faa71e7cfd5. Reverted https://github.com/pytorch/pytorch/pull/149434 on behalf of https://github.com/izaitsevfb due to breaking docs build ([comment](https://github.com/pytorch/pytorch/pull/149434#issuecomment-2739378287))	2025-03-20 06:52:02 +00:00
Sun, Jiayi	00333c4548	[Inductor] Set prop_kind to forward_inference when grad is not needed for mkldnn_linear_pointwise and mkldnn_convolution_pointwise (#147072 ) Summary: The `prop_kind` of `mkldnn._linear_pointwise`, `mkldnn._linear_pointwise.binary`, `mkldnn._convolution_pointwise.binary` and `mkldnn._convolution_pointwise_.binary` are always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether `grad` is needed. Setting `prop_kind` to `dnnl_forward_inference` for these ops when `grad` is not needed could have better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147072 Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jansel	2025-03-20 06:21:31 +00:00
Rachel Guo	c4d59e6279	[Inductor] Fix combo_kernel logging error (#149575 ) Summary: Fix logging error like: ``` in combinable_nodes log.debug( Message: 'ComboKernels: %d template nodes are filtered' Arguments: (OrderedSet([8]),) --- Logging error --- Traceback (most recent call last): File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit msg = self.format(record) File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format return fmt.format(record) File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format record.message = record.getMessage() File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage msg = msg % self.args TypeError: %d format: a real number is required, not OrderedSet ``` encountered in running a prod model + enable combo kernel feature Test Plan: CI Differential Revision: D71512220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575 Approved by: https://github.com/ColinPeppler	2025-03-20 06:09:44 +00:00
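The traceback above comes from passing a collection to a `%d` placeholder. The sketch below reproduces that class of error with a plain `set` standing in for `OrderedSet`, and shows one plausible fix (logging the count). The logger name and helper functions are hypothetical, and the change actually made in the PR is not shown in this log.

```python
import logging

log = logging.getLogger("combo_kernel_sketch")  # hypothetical logger name

MESSAGE = "ComboKernels: %d template nodes are filtered"


def broken_format(nodes):
    """Mirrors the failure: %d receives the collection itself, not a number."""
    return MESSAGE % (nodes,)


def fixed_format(nodes):
    """One possible fix (an assumption): format the number of nodes instead."""
    return MESSAGE % (len(nodes),)


def log_filtered(nodes):
    """Lazy logging form of the same fix; formatting happens only if DEBUG is on."""
    log.debug(MESSAGE, len(nodes))


template_nodes = {8}  # plain set standing in for OrderedSet([8])

try:
    broken_format(template_nodes)
    raised = False
except TypeError:
    raised = True
```

Switching the placeholder to `%s` would be another way to avoid the error if the intent is to print the nodes themselves rather than their count.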
Davide Italiano	595293316d	[MPS/Inductor] Add support for modified_bessel_k0. (#149593 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149593 Approved by: https://github.com/jansel	2025-03-20 04:51:44 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	9a184b1074	Monkeypatch fake mode so it errors on invalid custom ops (#149410 ) Internal version: [D71294776](https://www.internalfb.com/diff/D71294776) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149410 Approved by: https://github.com/gmagogsfm	2025-03-20 04:50:57 +00:00
Menglu Yu	fe94d7da1a	[Inductor][Optimus] Add move view after cat aten pattern (#149178 ) Summary: Add aten pattern to move the view/reshape out of split cat, further reduce the number of kernels. context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0 Test Plan: ### how to enable Add the following patterns to the post grad ``` post_grad_fusion_options={ "normalization_aten_pass": {}, "move_view_after_cat_aten_pass": {}, }, ``` ### unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_move_view_after_cat_aten ``` Buck UI: https://www.internalfb.com/buck2/3c5451be-c63a-4794-8d6b-103ecac78905 Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449704507267 ### local reproduce ``` buck2 run mode/opt scripts/shuaiyang:test -- --flow_id 691990503 --use_synthetic_data --optimus ``` https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2025-03-13-20-59-34/trace.json.gz&bucket=gpu_traces ### E2E baseline f691990503 proposal Differential Revision: D71177004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149178 Approved by: https://github.com/Yuzhen11	2025-03-20 04:07:25 +00:00
Isalia20	95e71765f2	[MPS] nanmedian implementation (#149407 ) Implements nanmedian on MPS. This implementation only covers `torch.nanmedian(tensor)` without `keepdim` and `dim`. nanmedian with `dim` and `keepdim` will be implemented in a followup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407 Approved by: https://github.com/malfet	2025-03-20 03:50:26 +00:00
bobrenjc93	cca46a0b6f	Fix score_mod.py dynamic max autotune (#148991 ) python benchmarks/transformer/score_mod.py --dynamic --max-autotune previously would crash with ``` "/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of node.get_device().type, ``` but with this change no longer does Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991 Approved by: https://github.com/drisspg	2025-03-20 03:28:51 +00:00
Xu Han	bc1b8730a4	[Windows][inductor] fix blank space break windows file path (#149388 ) Fixes #149310 From origin error message: ```cmd Command: cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib Output: Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86 Copyright (C) Microsoft Corporation. All rights reserved. cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental' cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory ``` Python installed in `C:/Program Files/Python310` path, and the blank space break the file path. 
Solution: Add quotes to declare Windows file paths, after that: ```cmd cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include" /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512 /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp /arch:AVX512 /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib" "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388 Approved by: https://github.com/jansel	2025-03-20 03:10:30 +00:00
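The quoting fix described above can be sketched as follows. This is an illustration only, not the code from the PR: `quote_path`, `include_flags`, and `libpath_flags` are hypothetical helpers that build `cl` arguments with every path wrapped in double quotes, matching the shape of the corrected command line.

```python
def quote_path(path):
    """Wrap a path in double quotes so embedded spaces stay inside one argument."""
    return f'"{path}"'


def include_flags(include_dirs):
    """Build /I flags for the MSVC compiler with each directory quoted."""
    return [f"/I {quote_path(d)}" for d in include_dirs]


def libpath_flags(lib_dirs):
    """Build /LIBPATH: flags for the linker with each directory quoted."""
    return [f"/LIBPATH:{quote_path(d)}" for d in lib_dirs]


# The path from the failing command: the space splits it in two when unquoted.
flags = include_flags(["C:/Program Files/Python310/Include"])
command_fragment = " ".join(["cl"] + flags)
```

Quoting every path, including those without spaces, keeps the command builder simple and mirrors the corrected output shown in the commit message.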
Dmitry Rogozhkin	45a879e55b	xpu: improve error handling and reporting in XPU cmake files (#149353 ) For #149075 * Add a graceful cmake error instead of cryptic one if SYCL runtime is not found: ``` The link interface of target "c10_xpu" contains: torch::xpurt but the target was not found. ``` * Suppress unclear cmake error if SYCL compiler is not available and further version query fails: ``` CMake Error at /home/dvrogozh/pytorch/torch/share/cmake/Caffe2/FindSYCLToolkit.cmake:37 (string): string sub-command REGEX, mode REPLACE needs at least 6 arguments total to command. ``` CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149353 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-03-20 02:00:39 +00:00
Tugsbayasgalan Manlaibaatar	3b7bd6c63d	Fix dynamic shapes repordering bug (#149528 ) When we create constraints, we look at the ordering of kwargs according to the model signature. But when we trace, we use the ordering that is created based on how the user passes in their kwargs. As a result, constraints and dynamic shapes end up having a different order, causing issues when they have different dynamic tensor specs. Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528 Approved by: https://github.com/ydwu4	2025-03-20 01:57:44 +00:00
Sam Larsen	1e30192b19	[logging] Add python version to dynamo_compile table (#149419 ) Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May 4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419 Approved by: https://github.com/c00w	2025-03-20 01:48:34 +00:00
Pradeep Fernando	1442230a26	Supporting non-tensor-data write_size in planner write items. (#149434 ) Summary: 1. The current write item structure does not contain the amount of data that needs to be written. 2. The planner.item already has a size primitive 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors. 3. Right now, the only way the writer layer gets hold of this property (for non-tensor data) is to first do a lookup into the actual tensor/bytes and then calculate the nbytes. This change introduces a way to capture non-tensor data size within a write-plan item. Reviewed By: daulet-askarov Differential Revision: D70497442 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434 Approved by: https://github.com/MeetVadakkanchery	2025-03-20 01:22:05 +00:00
Theodore Ehrenborg	02e21c7854	Fix spelling (#149277 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149277 Approved by: https://github.com/zou3519	2025-03-20 01:02:32 +00:00
PyTorch MergeBot	826e790696	Revert "ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443 )" This reverts commit 95a633c45304755ebdbc08396d9948d34243ddb3. Reverted https://github.com/pytorch/pytorch/pull/149443 on behalf of https://github.com/izaitsevfb due to fails lint ([comment](https://github.com/pytorch/pytorch/pull/149443#issuecomment-2738709561))	2025-03-20 00:59:41 +00:00
Eli Uriegas	95a633c453	ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443 ) This is a remnant of our migration to manylinux2_28; we should remove these since all of our binary builds are now built with cxx11_abi. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443 Approved by: https://github.com/izaitsevfb, https://github.com/atalman	2025-03-20 00:39:02 +00:00
cyy	29c4f2c07a	Remove Ubuntu 18.04 scripts (#149479 ) Ubuntu 18.04 reached end of life on May 31, 2023. This code is no longer used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149479 Approved by: https://github.com/malfet	2025-03-20 00:13:40 +00:00
Ethan Wee	6cbf97ede8	[ROCm] enable HIPMallocAsyncAllocator (#149145 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145 Approved by: https://github.com/izaitsevfb Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-19 23:42:35 +00:00
Aleksei Nikiforov	2be97c7257	Update nightly s390x builds (#149337 ) This change should fix new nightly build failures for s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149337 Approved by: https://github.com/malfet	2025-03-19 23:27:14 +00:00
Andrey Talman	c9de76a1e4	Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540 ) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: https://github.com/pytorch/pytorch/pull/149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup: remove older CUDA versions for aarch64 builds . CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia	2025-03-19 23:20:05 +00:00
Avik Chaudhuri	5005e1bc47	support multinomial for dynamic num_samples (#149463 ) Test Plan: added test Fixes #149048 Differential Revision: D71434914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149463 Approved by: https://github.com/pianpwk	2025-03-19 23:15:29 +00:00
Catherine Lee	cc469aaf3b	[CI][docker] Remove vulkan and swiftshader from docker builds (#149530 ) Probably should have been removed with https://github.com/pytorch/pytorch/pull/139354/files? Should I also remove mentions of them from build.sh and test.sh? Pull Request resolved: https://github.com/pytorch/pytorch/pull/149530 Approved by: https://github.com/malfet	2025-03-19 23:13:27 +00:00
Davide Italiano	88c2fe533f	[MPS] Add `modified_bessel_k0` support to eager. (#149563 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563 Approved by: https://github.com/malfet	2025-03-19 23:10:55 +00:00
Mergen Nachin	bc86b6c55a	Update ExecuTorch pin update (#149539 ) Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50 Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636 Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817 Test Plan: Monitor linux-jammy-py3-clang12-executorch test Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539 Approved by: https://github.com/larryliu0820	2025-03-19 22:29:59 +00:00
Catherine Lee	6974ba84f6	[ci][anaconda] Remove conda from linter docker images (#147789 ) Remove conda usage from the linter docker images Handles part of https://github.com/pytorch/pytorch/issues/148110 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147789 Approved by: https://github.com/atalman	2025-03-19 21:56:44 +00:00
Shivam Raikundalia	a11538aa46	[GPU Snapshot] Add Clear History Flag (#149352 ) Summary: Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots. As a solution we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record memory history. A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth as we would have to change the callbacks in the caching allocator and pass these objects between python and cpp. Test Plan: See diff below Differential Revision: D71159720 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352 Approved by: https://github.com/eqy, https://github.com/aaronenyeshi	2025-03-19 21:44:20 +00:00
PyTorch MergeBot	e1d143cb7b	Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145 )" This reverts commit ee1a2b7810126258ce64d1e22b59fae81a3f7bcb. Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))	2025-03-19 21:12:13 +00:00
Nichols A. Romero	37bb7f79c6	[ROCm][TunableOp] Unit test for TunableOp BLAS logging. (#148982 ) Add unit test for new TunableOp BLAS logging feature. Requires this PR to be merged in first: https://github.com/pytorch/pytorch/pull/148979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148982 Approved by: https://github.com/jeffdaily	2025-03-19 20:57:19 +00:00
Jessica Vandebon	71daeddde2	[MTIA] Ensure correct stream behavior for input_buffer add autograd on MTIA (#149433 ) Test Plan: CI Differential Revision: D71414498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149433 Approved by: https://github.com/albanD	2025-03-19 20:19:18 +00:00
Yanan Cao (PyTorch)	fae79e91a0	Remove torch.export.export_for_inference (#149078 ) Summary: Remove torch.export.export_for_inference, it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions() Test Plan: unit tests Differential Revision: D71069057 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078 Approved by: https://github.com/tugsbayasgalan	2025-03-19 19:57:18 +00:00
Shangdi Yu	05fee772e5	Fix with effect lowering for list return type (#149510 ) Summary: - For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257 buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r test_custom_obj ``` Reviewed By: angelayi Differential Revision: D71346024 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510 Approved by: https://github.com/zou3519	2025-03-19 19:35:08 +00:00
Scott Ramsby	842a072fd3	[codemod] Fix clang-tidy command line doc comments (#149524 ) Summary: Fixes the comments to match the latest updates to the checked-in tools. Search/replace applied in this order: * `# /fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks` * `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks` * `fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks` Test Plan: CI Reviewed By: johnkearney Differential Revision: D71431516 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149524 Approved by: https://github.com/janeyx99	2025-03-19 19:22:11 +00:00
Pian Pawakapan	96828a2155	[export] refactor DimHints for type errors (#149424 ) Differential Revision: D71414367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149424 Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri	2025-03-19 18:51:07 +00:00
Yidi Wu	9ec9f4740c	[export] fix stft decomp and making it consistent with cpp impl. (#149232 ) Summary: We change the fake impl of stft to follow more closely with its cpp implementation [here](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/SpectralOps.cpp#L951-L963) where " n_frames = 1 + (len - n_fft) / hop_length;" is also an integer division. Test Plan: Existing tests and buck2 build --flagfile fbcode//mode/dev fbcode//executorch/examples/models/fb/llama4:speech_transform.pte Differential Revision: D71209142 edit: we kept the original path un-changed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149232 Approved by: https://github.com/jackzhxng	2025-03-19 18:40:35 +00:00
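The frame-count formula quoted above, `n_frames = 1 + (len - n_fft) / hop_length` with integer division, can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name `stft_num_frames` is hypothetical, any padding is assumed to have been applied to `length` already, and the inputs are assumed non-negative so that Python's floor division agrees with C++ integer division.

```python
def stft_num_frames(length, n_fft, hop_length):
    """Number of STFT frames for a signal of `length` samples (after any padding).

    Uses integer division, mirroring the formula quoted in the commit message.
    """
    if hop_length <= 0:
        raise ValueError("hop_length must be positive")
    if length < n_fft:
        raise ValueError("signal is shorter than n_fft")
    return 1 + (length - n_fft) // hop_length


# Hypothetical example: 16000 samples, 400-sample window, 160-sample hop.
frames = stft_num_frames(16000, 400, 160)

# True division would give a fractional count, which is not a valid frame count.
fractional = 1 + (16000 - 400) / 160
```

Here `(16000 - 400) / 160` is 97.5, so the integer form yields 98 frames while true division would yield 98.5.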
Bin Bao	94d761fbf0	[AOTI][reland] Update test runner to use the new APIs (#149412 ) Summary: Reland https://github.com/pytorch/pytorch/pull/147105. Switch to the newer aoti_compile_and_package APIs. Some tests still kept using legacy APIs, and will follow up with internal test refactoring. Differential Revision: [D71470265](https://our.internmc.facebook.com/intern/diff/D71470265) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149412 Approved by: https://github.com/yushangdi	2025-03-19 17:56:44 +00:00
IvanKobzarev	d686d04c2f	[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555 ) (benchmark for 1 call) Before: ``` └─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json ``` After: ``` └─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555 Approved by: https://github.com/zou3519	2025-03-19 17:16:57 +00:00
Jithun Nair	518563d6ef	Add release branch push triggers to rocm-mi300.yml (#149517 ) When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149517 Approved by: https://github.com/atalman	2025-03-19 16:14:09 +00:00
Ze Sheng	e98afa0f89	[Sigmoid] Remove magic method in CapabilityBasedPartitioner (#149400 ) Summary: As title. Test Plan: CI Differential Revision: D70575197 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149400 Approved by: https://github.com/jfix71	2025-03-19 16:02:43 +00:00
Andrey Talman	4df66e0b7f	Pin auditwheel to 6.2.0 (#149471 ) Observing aarch64 failure in nightly: https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228 Similar to: https://github.com/pytorch/vision/pull/8982 ``` 2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel 2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl 2025-03-18T08:45:20.3393288Z Traceback (most recent call last): 2025-03-18T08:45:20.3393732Z File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module> 2025-03-18T08:45:20.3394115Z sys.exit(main()) 2025-03-18T08:45:20.3394559Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main 2025-03-18T08:45:20.3395064Z result: int \| None = args.func(args, p) 2025-03-18T08:45:20.3395626Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute 2025-03-18T08:45:20.3396163Z out_wheel = repair_wheel( 2025-03-18T08:45:20.3396657Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel 2025-03-18T08:45:20.3397184Z raise ValueError(msg) 2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located 2025-03-18T08:45:20.3678843Z Traceback (most recent call last): 2025-03-18T08:45:20.3679267Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module> 2025-03-18T08:45:20.3680988Z pytorch_wheel_name = complete_wheel("/pytorch/") 2025-03-18T08:45:20.3681449Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel 2025-03-18T08:45:20.3681976Z check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder) 2025-03-18T08:45:20.3682860Z File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call 2025-03-18T08:45:20.3683308Z raise CalledProcessError(retcode, cmd) 
2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1. 2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1. 2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-03-18T08:45:20.3862448Z with: ``` Please note aarch64 CUDA failures are related to: https://github.com/pytorch/pytorch/pull/149351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149471 Approved by: https://github.com/malfet	2025-03-19 15:55:05 +00:00
Shangdi Yu	1bf443e2f2	[aoti x with_effect token] Unbacked symint and register lowering (#147656 ) Differential Revision: D70022208 - When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do. - Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token - For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of an list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656 Approved by: https://github.com/angelayi, https://github.com/zou3519	2025-03-19 14:38:30 +00:00
Aaron Orenstein	2fcfae72b4	async fx compile (#146135 ) Adds the ability to run the selected out-of-process fx compile scheme in async mode - where we kick off the compile and then run eagerly until the compile is finished. Added a test which runs a tiny model in a loop making sure that we execute it both eagerly and then compiled. Differential Revision: [D71135546](https://our.internmc.facebook.com/intern/diff/D71135546) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146135 Approved by: https://github.com/jamesjwu, https://github.com/jansel	2025-03-19 14:07:51 +00:00
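The async mode described in the commit above (start the compile, keep running eagerly, switch once the compile finishes) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the PyTorch implementation: `AsyncCompiled`, `fake_compile`, and the `gate` event that stands in for a slow compile are all hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def eager_double(x):
    # Stand-in for the uncompiled (eager) function.
    return x * 2


def fake_compile(fn, gate):
    # Hypothetical "compiler": blocks until the gate opens, then returns
    # a callable that behaves like fn.
    gate.wait()
    return lambda x: fn(x)


class AsyncCompiled:
    """Runs eagerly until the background compile has finished."""

    def __init__(self, fn, gate):
        self._eager = fn
        pool = ThreadPoolExecutor(max_workers=1)
        self._future = pool.submit(fake_compile, fn, gate)
        pool.shutdown(wait=False)  # queued work still runs to completion
        self.calls = []  # records which path served each call

    def wait(self):
        # Block until the background compile is done.
        self._future.result()

    def __call__(self, x):
        if self._future.done():
            self.calls.append("compiled")
            return self._future.result()(x)
        self.calls.append("eager")
        return self._eager(x)


gate = threading.Event()
fn = AsyncCompiled(eager_double, gate)
first = fn(3)   # compile still pending, served eagerly
gate.set()      # let the "compile" finish
fn.wait()
second = fn(3)  # served by the compiled callable
```

The two calls mirror the test described in the commit message, which checks that a tiny model is executed both eagerly and then compiled.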
FFFrog	1dce65a82c	Fix the invalid link for FX (#149289 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149289 Approved by: https://github.com/zou3519	2025-03-19 14:03:18 +00:00
Aleksei Nikiforov	97910b6c00	Update s390x docker image (#148444 ) New releases of ml_dtypes successfully build on s390x, skip building patched old release. Unpin grpcio version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148444 Approved by: https://github.com/seemethere	2025-03-19 12:25:10 +00:00
Aleksei Nikiforov	7ca296f564	Document patched podman build for s390x runners (#147618 ) Podman patches from upstream are needed to resolve a couple of issues hit when using it. Document automated build of podman with applied patches fixing those issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147618 Approved by: https://github.com/seemethere	2025-03-19 12:25:05 +00:00
Aleksei Nikiforov	cfbeaf7b7e	Improve docker build cleanup on s390x runners (#149316 ) Currently it sometimes still leaves a couple of processes running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149316 Approved by: https://github.com/seemethere	2025-03-19 10:10:44 +00:00
FFFrog	466d5295c1	Fixed abnormal behavior of LazyLinear when using LazyLinear and load_state together (#147599 ) Update Points: - Update the logic of ``initialize_parameters`` - Add new testcases Related issue: https://github.com/pytorch/pytorch/issues/147389 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147599 Approved by: https://github.com/mikaylagawarecki	2025-03-19 10:01:12 +00:00
fduwjj	8bf3f3fc43	[c10d] Add a collective time estimator for NCCL comms (#149343 ) We want to upstream the feature from new nccl for users to estimate comm time. Resolves #147753 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343 Approved by: https://github.com/kwen2501	2025-03-19 07:54:02 +00:00
Riham Selim	b963d96bad	[Torchscript] Add a flag to use mangled names instead of demangled (#148906 ) Summary: Optionally keep mangled names when expanding torchscript stacks Test Plan: ``` buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_generate --show-full-output /data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bd9d136228ad8a7/scripts/rihams/LearnPyTorch/__torch_script_generate__/torch_script_generate.par buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_execute --show-full-output ``` - With `--torch_jit_expanded_stacks_mangled` Flag: /data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute fbcode/model.pt --torch_jit_expanded_stacks_mangled --torch_jit_enable_expanded_stacks https://fburl.com/scuba/strobelight_function_tracer/8die4rvm {F1975933247} Without Flag: /data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute ./model.pt --torch_jit_enable_expanded_stacks https://fburl.com/scuba/strobelight_function_tracer/x3nladpf {F1975933268} Reviewed By: bbus Differential Revision: D70905872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148906 Approved by: https://github.com/zdevito	2025-03-19 07:53:02 +00:00
ikalinic	3e78c9e967	[ROCm][Windows] Disable hipSPARSE and CK declarations and remove references for Windows (#149195 ) This PR removes references to `hipSPARSE` and `ck` functions and disables declarations which are not supported on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149195 Approved by: https://github.com/jeffdaily Co-authored-by: Michal Gallus <Michal.Gallus@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-19 07:30:53 +00:00
yifanmao	2cb42f26c1	Remove test_get_model_state_dict_del_memory (#149460 ) test_get_model_state_dict_del_memory sees unexpected memory usage, leading to test failures. Remove the test for now to avoid blocking the others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149460 Approved by: https://github.com/fegin	2025-03-19 07:06:46 +00:00
FFFrog	e8a35eb7da	Add Missing Communication collectives (#147379 ) ---- - reduce_add_coalesced Pull Request resolved: https://github.com/pytorch/pytorch/pull/147379 Approved by: https://github.com/mikaylagawarecki	2025-03-19 06:59:04 +00:00
Menglu Yu	981807cfcb	[Inductor][Optimus] split cat aten pass (#149027 ) Summary: We add the aten pattern to optimize big cat node with arbitrary order of inputs to support APS jobs context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0 Test Plan: ### how to enable Add the following patterns to the post grad ``` post_grad_fusion_options={ "normalization_aten_pass": {}, "split_cat_aten_pass": {"threshold_to_cat": 10}, }, ``` You can tune threshold_to_cat to achieve best performance. If nothing is given, the default value 10 will be used ### unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752 Network: Up: 112KiB Down: 132KiB (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a) Executing actions. Remaining 0/3 1.0s exec time total Command: test. Finished 2 local Time elapsed: 4:57.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ### E2E baseline f691990503 proposal Differential Revision: D71017436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027 Approved by: https://github.com/Yuzhen11	2025-03-19 06:01:05 +00:00
Simon Fan	f123f2c077	[ca] fix dce for side-effects (#149336 ) The AOT backward could have contained side effectful ops, so we can't DCE them. Have CA also call the default fx.Node.is_impure which will cover some of the existing cases Pull Request resolved: https://github.com/pytorch/pytorch/pull/149336 Approved by: https://github.com/jansel	2025-03-19 05:56:47 +00:00
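The point of the DCE commit above, that a node with side effects must survive dead-code elimination even when nothing reads its result, can be shown on a toy graph. This is a sketch under stated assumptions, not compiled autograd's pass: `Node`, its `impure` flag, and `dead_code_eliminate` are hypothetical stand-ins for an fx graph and the `is_impure` check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    impure: bool = False  # stand-in for an is_impure() style check


def dead_code_eliminate(nodes, outputs):
    """Keep nodes that feed an output or that have side effects.

    `nodes` is assumed to be in topological order, so one reverse walk
    is enough to propagate liveness from users to their inputs.
    """
    live = set(outputs)
    kept = []
    for node in reversed(nodes):
        if node.name in live or node.impure:
            kept.append(node)
            live.update(node.inputs)
    kept.reverse()
    return kept


graph = [
    Node("a"),
    Node("unused_mul", inputs=["a"]),              # pure and unused: removable
    Node("copy_", inputs=["a"], impure=True),      # unused but side-effectful
    Node("out", inputs=["a"]),
]
kept_names = [n.name for n in dead_code_eliminate(graph, outputs=["out"])]
```

Without the `impure` check, `copy_` would be dropped along with `unused_mul`, silently discarding the mutation it performs.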
PyTorch UpdateBot	ddb076591d	[executorch hash update] update the pinned executorch hash (#147422 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147422 Approved by: https://github.com/pytorchbot	2025-03-19 05:22:35 +00:00
Pat Vignola	42bd4a09a3	[MTIA] Add _mtia_getCurrentRawStream to MTIA module (#149436 ) Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding this now allows us to not branch out just for MTIA and generate different code. Test Plan: CI Reviewed By: chaos5958 Differential Revision: D70072057 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436 Approved by: https://github.com/chaos5958	2025-03-19 05:17:51 +00:00
PyTorch UpdateBot	ef93cdfb8a	[audio hash update] update the pinned audio hash (#149467 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149467 Approved by: https://github.com/pytorchbot	2025-03-19 04:28:57 +00:00
Ethan Wee	ee1a2b7810	[ROCm] enable HIPMallocAsyncAllocator (#149145 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-19 03:59:55 +00:00
Avik Chaudhuri	20874a1f46	debug ival swap (#149206 ) Summary: Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules. Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state. This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins. Test Plan: added some tests, ensured existing tests pass Differential Revision: D71203469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206 Approved by: https://github.com/tugsbayasgalan	2025-03-19 03:43:30 +00:00
Jun Luo	14dc6e732d	Cache the get_device_module result (#149207 ) Summary: As title. Test Plan: OSS CIs. Reviewed By: chaos5958 Differential Revision: D71084180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149207 Approved by: https://github.com/jansel	2025-03-19 03:20:38 +00:00
angelayi	01a57981aa	[export] Add TracingContext (#149294 ) TracingContext is added to all tracing locations -- in torch.export this is where we call make_fx (for training IR) and aot_export_module (for inference IR), and in run_decompositions where we call aot_export_module Differential Revision: [D71298927](https://our.internmc.facebook.com/intern/diff/D71298927) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149294 Approved by: https://github.com/ydwu4	2025-03-19 03:11:08 +00:00
Animesh Jain	a3c286677b	[compile] Switch off inference mode during compilation (#149321 ) This PR does the following: * Turns `inference_mode` to False and `no_grad` for `convert_frame`, if the inference_mode is on globally. * Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness. * Graph breaks on is_inference and is_inference_mode_enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-03-19 02:45:27 +00:00
Bin Bao	04e251a7dd	[AOTI] Add num_runners to AOTIModelPackageLoader (#149364 ) Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference. Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364 Approved by: https://github.com/angelayi	2025-03-19 02:28:06 +00:00
Richard Barnes	536c0c7a47	[codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: dtolnay Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-03-19 02:05:33 +00:00
Ivan Zaitsev	919d54b7b1	Fix format string in ck_gemm_template.h for int64_t variables (#149438 ) Summary: Change %d to %ld in printf format specifier to correctly handle int64_t variables n, m, k. This fixes compilation errors in HIP builds where the format string didn't match the argument type. forward fix for D71412006 ``` In file included from fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_bfloat16.hip:4: fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:28: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat] 385 \| printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n", \| ~~ \| %ld 386 \| n, m, k,TRANSA, TRANSB); \| ^ fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:31: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat] 385 \| printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n", \| ~~ \| %ld 386 \| n, m, k,TRANSA, TRANSB); \| ^ fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:25: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat] 385 \| printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n", \| ~~ \| %ld 386 \| n, m, k,TRANSA, TRANSB); \| ^ ``` Test Plan: ``` buck2 build --flagfile fbcode//mode/opt-amd-gpu fbcode//torchrec/sparse/tests:test_jagged_tensor_gpu ``` Differential Revision: D71418611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149438 Approved by: https://github.com/ZainRizvi	2025-03-19 01:46:34 +00:00
Stepan Hruda	6bcf9c6ce3	[xnnpack] Expose subgraph symbols (#149397 ) Summary: Main XNNPack target code uses symbols from subgraph so they need to be exported - this gets uncovered on macos where symbols were not visible after linking Test Plan: CI / used for a macOS build on top of the stack. Differential Revision: D71315023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397 Approved by: https://github.com/digantdesai	2025-03-19 01:14:46 +00:00
Nichols A. Romero	11d4438a5f	[ROCm][TunableOp] More TF32 support. (#149088 ) This PR includes additional enhancements to TF32 support in TunableOp. - OpSignature now differentiates between float32 and tf32 data types. - Offline tuning now supports TF32. - Unit tests for online and offline tuning of TF32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-19 00:26:20 +00:00
tvukovic-amd	268de64005	[ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382 ) - Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library) - Set hipcc call for Windows case - Removed CUDA flags (not used in ROCm) on Windows - Updated Windows compiler (added case when using ROCm on Windows) - Fixed path issue in hipify_python Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-18 23:37:05 +00:00
Nikita Shulga	61a64c20c4	[MPSInductor] Move threadfence to the right location (#149437 ) Not sure how it worked in the past, but the fence should be before the first read from the shared memory, not after it. This bug was exposed by https://github.com/pytorch/pytorch/pull/148969 which removed an unnecessary barrier before calling `threadgroup_reduce` functions Test plan: ``` % python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile ``` Before that it produced gibberish, now it works fine Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437 Approved by: https://github.com/manuelcandales, https://github.com/dcci	2025-03-18 23:27:19 +00:00
Angela Yi	ea02aac2ca	[export] Update remove runtime asserts pass (#149198 ) Test Plan: CI -- Removing asserts should be a noop Differential Revision: D69566851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149198 Approved by: https://github.com/pianpwk	2025-03-18 23:07:25 +00:00
Nikita Shulga	5db3a4ac88	[Build] Guard per-op headers in ACLUtils.cpp (#149417 ) To fix internal build failures, where per-op headers are not generated. We really should have lint for something like that. Test Plan: CI Reviewed By: izaitsevfb Differential Revision: D71406882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417 Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb	2025-03-18 22:56:29 +00:00
Zhuoran Zhao	45fec7843d	Fix local compilation and hipification (#149384 ) Summary: As title, we need to fix the issue introduced from https://github.com/pytorch/pytorch/pull/148305 Test Plan: CI and e2e https://docs.google.com/document/d/1Bu-MxJCkN7WaRkKJLVBQvnSp8yV0v3Aeb3Y9R5sjeHw/edit?tab=t.0 Differential Revision: D71373001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149384 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/chenyang78	2025-03-18 22:56:02 +00:00
Shivam Raikundalia	0d804dec0f	[Profiler/Easy] Pass Overload Names To Kineto (#149333 ) Summary: Right now we get Overload names and forward them to the Event List frontend for profiler but we do not forward anything to kineto. This diff checks if there is an overload name for each cpu op and appends it to the name if necessary Test Plan: Added test in CI Differential Revision: D71326670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333 Approved by: https://github.com/aaronenyeshi	2025-03-18 22:15:51 +00:00
angelayi	3b48c72141	[export] Minor refactor to trace.py (#149240 ) Minor refactor to trace.py * Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export` * Matched the APIs of `_strict_export` and `_non_strict_export` * Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240 Approved by: https://github.com/pianpwk	2025-03-18 21:40:30 +00:00
Justin Chu	010963032c	[ONNX] Create onnx_symbolic (#148905 ) In the old exporter we allow users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export. This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create onnx nodes symbolically with pt2 & fx. The custom pytorch ops were designed such that the attributes are encoded to be part of a valid fx op. Users provide shape and dtype for the meta function to produce the correct fake tensor during export. An example is ![image](https://github.com/user-attachments/assets/c62f5f21-e038-456e-a71d-b9a5d0a7cd9d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905 Approved by: https://github.com/titaiwangms	2025-03-18 21:32:06 +00:00
Yuxin Wu	d80a70b58a	Avoid unnecessary clone in torch.cuda.set_rng_state (#149283 ) Clone has performance issue according to `f49c3eb6e6/megatron/core/tensor_parallel/random.py (L77-L80)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149283 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-03-18 20:47:57 +00:00
Thomas Bohnstingl	cd5c13d8f0	[hop] Rework the check of Metadata in the functionalization key (#148789 ) This PR is a more cosmetic rework of the metadata check performed by some HOPs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148789 Approved by: https://github.com/ydwu4	2025-03-18 20:30:59 +00:00
Brian Hirsh	f06e366532	partitioner: treat inputs with static indices as free to save (#148922 ) Fixes https://github.com/pytorch/pytorch/issues/141881 internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1538435030128036/?comment_id=1556782068293332 I tried to make a test case out of the code linked in that github issue. The setup + bad outcome today was as follows: (1) you have a graph where one of its inputs is a model weight (2) in the backward, you do some downstream compute on `weight`, `tmp = f(weight)`, where (a) `tmp` is of a smaller size than `weight`, and (b) the compute is trivially fusible into other kernels (so the partitioner thinks it is "free" to recompute) (3) since `sizeof(tmp) < sizeof(weight)` and the recompute is free, the partitioner decides that it would be strictly better to save `tmp` for backward instead of weight (4) this is bad: `weight` is a static tensor that sits in GPU memory for the duration of your entire training loop, so saving it for backward has no negative impact on peak memory. Since we're saving `tmp` instead, we end up unnecessarily increasing peak memory. In particular - the repro involves an autograd.Function in eager that saves the weight for bw, so we end up hitting higher peak memory in compile. The fix I'm trying out in this PR is to tell the partitioner that graph inputs that we know have static addresses (aka parameters) are "free" to save. Below is the fw/bw graph before my change, where you can see that instead of `primals_2` being saved for backward, we save `t_8` (which involves some low precision downstream compute on `primals_2`, that is only needed in the backward). 
``` ===== Forward graph 0 ===== /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"): # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply( abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1) view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]); abs_1 = None amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]); view = None abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2) view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]); abs_2 = None amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]); view_1 = None _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32); amax = None clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12); _to_copy = None div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0); clamp = None reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div) view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64]) view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]); view_2 = None slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807); reciprocal = None unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1); slice_1 = None slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807); unsqueeze = None unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3); slice_2 = None mul: 
"f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1); view_3 = unsqueeze_1 = None view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]); mul = None view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]); view_4 = None _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn); view_5 = None _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32) clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12); _to_copy_2 = None div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0); clamp_1 = None reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1) view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64]) view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]); view_6 = None slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807); reciprocal_1 = None unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1); slice_3 = None slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807); unsqueeze_2 = None unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3); slice_4 = None mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3); view_7 = unsqueeze_3 = None view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]); mul_1 = None view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]); view_8 = None _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn); view_9 = None t: "f32[1, 64][1, 
1]cuda:0" = torch.ops.aten.t.default(div_1); div_1 = None new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False) new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False) t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3); _to_copy_3 = None t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1); new_ones_1 = None _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16); _to_copy_1 = t_2 = new_ones = t_3 = None view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]); _scaled_mm = None view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]); view_10 = None slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807); div = None unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1); slice_5 = None slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807); unsqueeze_4 = None unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3); slice_6 = None mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5); view_11 = unsqueeze_5 = None view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]); mul_2 = None view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]); view_12 = None view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]); view_13 = None view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]); view_14 = None slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807); t = None 
unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1); slice_7 = None slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807); unsqueeze_6 = None unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3); slice_8 = None mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7); view_15 = unsqueeze_7 = None view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]); mul_3 = None view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]); view_16 = None _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16); view_17 = None add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3); _to_copy_4 = primals_3 = None t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2); primals_2 = None clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format); t_4 = None t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1); amax_1 = None view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]); t_5 = None amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]); view_21 = None unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1); amax_3 = None expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]) clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format); expand = None view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]); clone_1 = None _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32); view_22 = None clamp_3: "f32[64, 1][1, 
1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12); _to_copy_7 = None div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0); clamp_3 = None reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3); div_3 = None view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]); clone = None view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]); view_27 = None slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807); reciprocal_3 = None unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1); slice_11 = None slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807); unsqueeze_11 = None unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3); slice_12 = None mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12); view_28 = unsqueeze_12 = None view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]); mul_5 = None view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]); view_29 = None _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn); view_30 = None t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8); _to_copy_8 = None # No stacktrace found for following nodes view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]); add = None return (view_39, primals_1, unsqueeze_8, t_8) INFO: TRACED GRAPH ===== Backward graph 0 ===== <eval_with_key>.1 class GraphModule(torch.nn.Module): def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", t_8: "f8e4m3fn[64, 64][1, 64]cuda:0", tangents_1: "bf16[64, 
64][64, 1]cuda:0"): # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1]) view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]); tangents_1 = None # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply( abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19) view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]); abs_3 = None amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]); view_20 = None expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]); unsqueeze_8 = None clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format); expand = None view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]); clone_1 = None _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32); amax_2 = None clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12); _to_copy_5 = None div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0); clamp_2 = None reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2) view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64]) view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]); view_23 = None slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807); reciprocal_2 = None unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1); slice_9 = None slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807); unsqueeze_9 = None 
unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3); slice_10 = None mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10); view_24 = unsqueeze_10 = None view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]); mul_4 = None view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]); view_25 = None _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn); view_26 = None _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32); view_22 = None clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12); _to_copy_7 = None div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0); clamp_3 = None t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3); div_3 = None new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False) new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False) t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3); new_ones_3 = None _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16); _to_copy_6 = t_8 = new_ones_2 = t_9 = None view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]); _scaled_mm_1 = None view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]); view_31 = None slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807); div_2 = None unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1); slice_13 = None slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = 
torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807); unsqueeze_13 = None unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3); slice_14 = None mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14); view_32 = unsqueeze_14 = None view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]); mul_6 = None view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]); view_33 = None view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]); view_34 = None view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]); view_35 = None slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807); t_6 = None unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1); slice_15 = None slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807); unsqueeze_15 = None unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3); slice_16 = None mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16); view_36 = unsqueeze_16 = None view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]); mul_7 = None view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]); view_37 = None _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16); view_38 = None t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19) mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1); t_10 = primals_1 = None sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]); view_19 = None return (_to_copy_9, mm, 
sum_1) ``` With the change, we save primals_2 for backward instead ``` ===== Forward graph 0 ===== /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"): # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply( abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1) view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]); abs_1 = None amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]); view = None abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2) view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]); abs_2 = None amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]); view_1 = None _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32); amax = None clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12); _to_copy = None div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0); clamp = None reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div) view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64]) view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]); view_2 = None slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807); reciprocal = None unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1); slice_1 = None slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807); unsqueeze = None unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = 
torch.ops.aten.unsqueeze.default(slice_2, 3); slice_2 = None mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1); view_3 = unsqueeze_1 = None view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]); mul = None view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]); view_4 = None _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn); view_5 = None _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32) clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12); _to_copy_2 = None div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0); clamp_1 = None reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1) view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64]) view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]); view_6 = None slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807); reciprocal_1 = None unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1); slice_3 = None slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807); unsqueeze_2 = None unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3); slice_4 = None mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3); view_7 = unsqueeze_3 = None view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]); mul_1 = None view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]); view_8 = None _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = 
torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn); view_9 = None t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1); div_1 = None new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False) new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False) t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3); _to_copy_3 = None t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1); new_ones_1 = None _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16); _to_copy_1 = t_2 = new_ones = t_3 = None view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]); _scaled_mm = None view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]); view_10 = None slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807); div = None unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1); slice_5 = None slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807); unsqueeze_4 = None unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3); slice_6 = None mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5); view_11 = unsqueeze_5 = None view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]); mul_2 = None view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]); view_12 = None view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]); view_13 = None view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]); view_14 = None slice_7: 
"f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807); t = None unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1); slice_7 = None slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807); unsqueeze_6 = None unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3); slice_8 = None mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7); view_15 = unsqueeze_7 = None view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]); mul_3 = None view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]); view_16 = None _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16); view_17 = None add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3); _to_copy_4 = primals_3 = None t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1); amax_1 = None view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]); t_5 = None amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]); view_21 = None unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1); amax_3 = None # No stacktrace found for following nodes view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]); add = None return (view_39, primals_1, primals_2, unsqueeze_8) INFO: TRACED GRAPH ===== Backward graph 0 ===== <eval_with_key>.1 class GraphModule(torch.nn.Module): def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"): # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, 
input.shape[:-1]) view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]); tangents_1 = None # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply( t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2); primals_2 = None clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format); t_4 = None abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19) view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]); abs_3 = None amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]); view_20 = None expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]); unsqueeze_8 = None clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format); expand = None view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]); clone_1 = None _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32); amax_2 = None clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12); _to_copy_5 = None div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0); clamp_2 = None reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2) view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64]) view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]); view_23 = None slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807); reciprocal_2 = None unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1); slice_9 = None slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = 
torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807); unsqueeze_9 = None unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3); slice_10 = None mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10); view_24 = unsqueeze_10 = None view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]); mul_4 = None view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]); view_25 = None _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn); view_26 = None _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32); view_22 = None clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12); _to_copy_7 = None div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0); clamp_3 = None reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3) view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]); clone = None view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]); view_27 = None slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807); reciprocal_3 = None unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1); slice_11 = None slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807); unsqueeze_11 = None unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3); slice_12 = None mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12); view_28 = unsqueeze_12 = None view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, 
[64, 1, 64]); mul_5 = None view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]); view_29 = None _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn); view_30 = None t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3); div_3 = None new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False) new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False) t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8); _to_copy_8 = None t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3); new_ones_3 = None _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16); _to_copy_6 = t_8 = new_ones_2 = t_9 = None view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]); _scaled_mm_1 = None view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]); view_31 = None slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807); div_2 = None unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1); slice_13 = None slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807); unsqueeze_13 = None unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3); slice_14 = None mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14); view_32 = unsqueeze_14 = None view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]); mul_6 = None view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]); view_33 = None view_35: "f32[1, 64, 64][4096, 
64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]); view_34 = None view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]); view_35 = None slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807); t_6 = None unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1); slice_15 = None slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807); unsqueeze_15 = None unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3); slice_16 = None mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16); view_36 = unsqueeze_16 = None view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]); mul_7 = None view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]); view_37 = None _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16); view_38 = None t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19) mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1); t_10 = primals_1 = None sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]); view_19 = None return (_to_copy_9, mm, sum_1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148922 Approved by: https://github.com/zou3519	2025-03-18 20:08:11 +00:00
Zain Rizvi	b8c0c50bbe	Release.md readability improvements (#149402 ) Improves a number of readability and grammatical issues in release.md. Note: This was a Claude Code experiment, with all changes automatically generated. It turns out that minor edits like this are _not_ a good use of Claude Code, since it asked for approval on every single changed line. It is probably far more efficient to pass the whole file to a simple LLM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149402 Approved by: https://github.com/atalman	2025-03-18 20:04:56 +00:00
jzhou	dfdf58f8cb	[ROCm] enable CK backend for bf16/fp16 on gfx11 (#143971 ) This change enables the CK backend for fp16 on gfx11. @jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/143971 Approved by: https://github.com/jeffdaily	2025-03-18 18:18:22 +00:00
Pian Pawakapan	e0e8639a10	[torchbench] fix dynamic_shapes spec for moco (#148772 ) Fixes https://github.com/pytorch/pytorch/issues/148333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772 Approved by: https://github.com/yushangdi, https://github.com/desertfire	2025-03-18 18:16:54 +00:00
Nichols A. Romero	dbea13ed45	[ROCm][TunableOp] Minor fix to BLAS logging for ScaledGEMM with no bias vector. (#149357 ) Omit the bias type argument for BLAS logging when there is a ScaledGEMM with no bias vector. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149357 Approved by: https://github.com/jeffdaily	2025-03-18 18:14:52 +00:00
Nichols A. Romero	c0566e0dbf	[ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245 ) Fixes https://github.com/ROCm/hip/issues/3764. Fixes and improvements to CUDA->HIP flag conversion for CPP extensions - Log flag conversion for debugging purposes. - Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance. - Fix case where nvcc key may not exist - Fix case where hipify should ignore flag values and only touch the flag itself Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245 Approved by: https://github.com/jeffdaily Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>	2025-03-18 18:01:07 +00:00
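The flag-conversion rules listed in the commit above (leave `-I` flags alone, replace only the first `CUDA`, rewrite the flag name but not its value) can be illustrated with a minimal sketch. `convert_flag` is a hypothetical helper written for illustration only; it is not the code from the PR, and the plain `CUDA` to `HIP` substitution is an assumption.

```python
def convert_flag(flag: str) -> str:
    """Hypothetical CUDA->HIP flag rewrite; not the real PyTorch implementation."""
    # Include paths are passed through untouched.
    if flag.startswith("-I"):
        return flag
    # Split off the value so that only the flag name is rewritten.
    name, sep, value = flag.partition("=")
    # Replace only the first occurrence of "CUDA" in the flag name.
    return name.replace("CUDA", "HIP", 1) + sep + value


include_flag = convert_flag("-I/opt/CUDA/include")
define_flag = convert_flag("-DUSE_CUDA=CUDA_ON")
repeated_flag = convert_flag("-DCUDA_CUDA")
```

Here `str.partition` keeps the value side intact, and the third argument to `str.replace` limits the substitution to a single occurrence.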
eellison	585fd972b8	Iterate over dense dim first in split reduction reindexing (#147229 ) Fix for https://github.com/pytorch/pytorch/issues/144431. Improves perf from 0.29963893827160504 -> 0.0396331632970453. In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor whose dense dimension is not the last dimension, we should iterate over the dense dimension first in our re-indexing. This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative reduction handling of this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229 Approved by: https://github.com/jansel	2025-03-18 17:35:21 +00:00
mori360	ee3a2c6ee2	[State_dict] Remove functools.cache and add unit test (#149354 ) Fixes https://github.com/pytorch/pytorch/issues/149100 @functools.cache would keep 'self' alive, leading to unexpected memory performance. (e.g. in the issue linked, if the model is deleted, the model's memory is still occupied.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149354 Approved by: https://github.com/fegin	2025-03-18 17:30:41 +00:00
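The retention problem described in the commit above can be reproduced with the standard library alone. This is a minimal sketch under stated assumptions: `CachedModel` and `UncachedModel` are hypothetical stand-ins, not the state_dict code touched by the PR.

```python
import functools
import gc
import weakref


class CachedModel:
    """Hypothetical class whose method is wrapped in functools.cache."""

    @functools.cache
    def param_names(self):
        # The cache lives on the function object and uses `self` as a key,
        # so every instance that calls this method stays referenced.
        return ("weight", "bias")


class UncachedModel:
    """Same method without the cache decorator."""

    def param_names(self):
        return ("weight", "bias")


def is_freed_after_del(cls):
    # Create an instance, call the method once, drop the only local
    # reference, and check whether the object was actually collected.
    obj = cls()
    obj.param_names()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


cached_freed = is_freed_after_del(CachedModel)
uncached_freed = is_freed_after_del(UncachedModel)
```

With the decorator, the instance survives `del` because the cache still holds it as a key; without it, the instance is released.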
mori360	5b8cc4709a	[FSDP2] Add set_reshard_after_forward (#149103 ) Fixes https://github.com/pytorch/pytorch/issues/149029 Add `set_reshard_after_forward`, which sets `post_forward_mesh_info` and thereby determines `_reshard_after_forward`. Add a unit test similar to `test_fully_shard_communication_count`: after `.set_reshard_after_forward=True` the FSDPModule behaves as with `._reshard_after_forward=True`, and likewise when set to False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149103 Approved by: https://github.com/awgu	2025-03-18 17:21:54 +00:00
Animesh Jain	a8df5e5af9	[dynamo] Add mem leak test (#149358 ) Test for https://github.com/pytorch/pytorch/pull/148480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149358 Approved by: https://github.com/malfet	2025-03-18 16:38:28 +00:00
Aleksei Nikiforov	d5b1d99f78	Enable more nightly tests on s390x (#148452 ) Also enable some tests which probably were accidentally disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-03-18 16:09:39 +00:00
Saurabh Mishra	381d0cb239	[DCP] Avoid in-place update and deepcopy during dedupe (#149320 ) Summary: Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models having a huge number of FQNs. This was manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra: #### Control job with deepcopy regression: First save ~24.8s Global step latency is ~7-8s Test job with the new fix to avoid deepcopy: First save is ~21s global step latency ~2s Test Plan: ``` buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner ``` https://www.internalfb.com/intern/testinfra/testrun/3940649945104822 Differential Revision: D71245218 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320 Approved by: https://github.com/MeetVadakkanchery	2025-03-18 16:08:40 +00:00
Nikita Shulga	c41196a4d0	[EZ][Docker] Remove `install_db.sh` (#149360 ) It is a vestige of the caffe2 days and has been a no-op since https://github.com/pytorch/pytorch/pull/125092 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360 Approved by: https://github.com/atalman, https://github.com/cyyever, https://github.com/seemethere, https://github.com/Skylion007	2025-03-18 16:07:47 +00:00
Justin Chu	fdacf3c920	[ONNX] Update types in VerificationInfo (#149377 ) torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377 Approved by: https://github.com/titaiwangms	2025-03-18 15:37:39 +00:00
PyTorch MergeBot	405025778d	Revert "[AOTI] Update test runner to use the new APIs (#147105 )" This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60. Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))	2025-03-18 15:25:40 +00:00
PyTorch MergeBot	5ba437fb45	Revert "[AOTI] Forward fix unit test failures (#149401 )" This reverts commit ec9e11145e1a86300aae0fe09a1d8917d21deba1. Reverted https://github.com/pytorch/pytorch/pull/149401 on behalf of https://github.com/desertfire due to reverting the original PR instead ([comment](https://github.com/pytorch/pytorch/pull/149401#issuecomment-2733633516))	2025-03-18 15:18:48 +00:00
Pat Vignola	213eea216a	[MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340 ) Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well. Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device` Reviewed By: chaos5958 Differential Revision: D70072063 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340 Approved by: https://github.com/chaos5958	2025-03-18 15:15:12 +00:00
Bin Bao	ec9e11145e	[AOTI] Forward fix unit test failures (#149401 ) Summary: There is a land conflict between https://github.com/pytorch/pytorch/pull/149161 and https://github.com/pytorch/pytorch/pull/147105. We just need to update the APIs used in two new unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149401 Approved by: https://github.com/ZainRizvi	2025-03-18 15:02:01 +00:00
atalman	6e2b2660b9	Make numpy check optional (#149356 ) We may want to skip numpy smoke tests. Hence making it optional Pull Request resolved: https://github.com/pytorch/pytorch/pull/149356 Approved by: https://github.com/ZainRizvi	2025-03-18 15:00:01 +00:00
Andrey Talman	bc88f6faa1	Use TorchVersion for triton version check (#149136 ) Followup after https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321 To use TorchVersion for triton version parsing Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-18 13:48:46 +00:00
Jithun Nair	b06b5c3e27	[ROCm] Use alternate mirror for drm repo (#149380 ) Fixes issue with building ROCm manywheel and libtorch images eg. https://github.com/pytorch/pytorch/actions/runs/13887711267/job/38854659005#step:4:8328 ``` #53 2.832 Cloning into 'drm'... #53 2.849 fatal: unable to access 'https://gitlab.freedesktop.org/mesa/drm.git/': The requested URL returned error: 503 #53 2.851 ./install_rocm_drm.sh: line 29: pushd: drm: No such file or directory ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149380 Approved by: https://github.com/jeffdaily	2025-03-18 13:33:25 +00:00
Laith Sakka	6055a4f612	refresh benchmarks results. (#149347 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347 Approved by: https://github.com/jamesjwu	2025-03-18 08:53:49 +00:00
Francisco Massa	9b92828d4b	Add batch dim sharding rule to sdpa (#149253 ) This is a trivial rule that for most cases isn't needed, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()` as it is currently assumed), then we need this rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253 Approved by: https://github.com/XilunWu	2025-03-18 07:54:02 +00:00
Davide Italiano	9cd52da45c	[MPS/inductor] Add support for `modified_bessel_i1`. (#149379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379 Approved by: https://github.com/malfet	2025-03-18 06:02:33 +00:00
Fadi Arafeh	6c2db8fab0	Enable qint8 and quint8 add for AArch64 using ACL directly (#148653 ) This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly. Relative performance improvement using OMP_NUM_THREADS=1 is ~15x, using OMP_NUM_THREADS=32 it’s ~5.4x. Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653 Approved by: https://github.com/malfet ghstack dependencies: #148585	2025-03-18 05:38:39 +00:00
Nikita Shulga	2e0c98ff05	[MPS] Add `bicubic2d_aa` (#149378 ) Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287 Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of the template arguments, which closely follows ideas from `eec43cfbc0/src/libImaging/Resample.c` as well as `bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)` Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378 Approved by: https://github.com/dcci	2025-03-18 05:35:41 +00:00
Tristan Rice	dea7157160	nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351 ) Fixes #149153 Yaml generated from: ``` python .github/scripts/generate_ci_workflows.py ``` Test plan: Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7 ``` rm -rf third_party/nccl python setup.py develop ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351 Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet	2025-03-18 05:23:18 +00:00
Rachel Guo	b8f91bcb14	[pt2_provenance_tracking] add support for cpp kernel (#149185 ) Summary: As title. Add inductor cpp kernel to post grad graph node mapping & UT. Context: Raised as a feature request for AOTI CPU case. https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/ Differential Revision: D71181284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185 Approved by: https://github.com/jingsh	2025-03-18 04:43:07 +00:00
Shangdi Yu	7869196482	Fix torchbind schema str generation (#149239 ) Summary: Fix Torchbind HOP schema generation when there's no input Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema ``` Differential Revision: D71231164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239 Approved by: https://github.com/zou3519	2025-03-18 04:29:56 +00:00
Wei-Sheng Chin	bca75fe97a	[MAIA] [Autocast] Enable autocast on MAIA device (#148511 ) Fixes #148510. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511 Approved by: https://github.com/albanD	2025-03-18 03:46:22 +00:00
Davide Italiano	c43e35d6f7	[MPS] Implement support for `modified_bessel_i1` in eager. (#149368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-18 03:29:10 +00:00
Mu-Chu Lee	bb42e4d137	[AOTInductor] Add function to free buffer (#149161 ) Summary: We add a function that allows users to free the unused buffer. Test Plan: Testing correctness: python test/inductor/test_aot_inductor.py -k free_inactive Testing memory consumption: LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /home/$USER/local/pytorch/build/bin/test_aoti_inference Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161 Approved by: https://github.com/chenyang78, https://github.com/desertfire ghstack dependencies: #149249	2025-03-18 02:43:14 +00:00
Jane Xu	cccdf860e2	[BE] Add STABLE_LIBRARY test for multiple returns (#149230 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149230 Approved by: https://github.com/albanD, https://github.com/zou3519 ghstack dependencies: #149052	2025-03-18 02:40:54 +00:00
Jane Xu	988827cdfb	Use schema as source of truth + support ones_like/empty_like (#149052 ) This change does 2 important things: (a) Instead of relying on IValue type as source of truth, we use the schema as the source of truth, which is important as IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat will look like an int + get converted to an int64_t vs a MemoryFormat! (b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like! Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052 Approved by: https://github.com/albanD	2025-03-18 02:40:54 +00:00
Justin Chu	ebabd0efdd	[ONNX] Expose verification utilities (#148603 ) Expose verification utilities to public documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603 Approved by: https://github.com/titaiwangms	2025-03-18 02:10:34 +00:00
Sun, Jiayi	c36ac16da1	[Inductor] optimize welford reduction (#145061 ) Fix https://github.com/pytorch/pytorch/issues/141541. Fix https://github.com/pytorch/pytorch/issues/142839. Fix https://github.com/pytorch/pytorch/issues/143182. Summary: In order to fix the issue that the accuracy of welford reduction is not good enough, we refer to the eager implementation, combine Welford algorithm with cascade sum to improve numerical stability. Specifically: 1. Use Welford algorithm to compute mean and variance. 2. Use cascade summation when computing sum over input for both mean and variance. I tested Inductor benchmark with this PR on CPU, no performance gains or regressions were seen. Example: Take https://github.com/pytorch/pytorch/issues/141541 as an example: ``` import torch import torch.nn as nn torch.manual_seed(0) class Model(nn.Module): def __init__(self): super().__init__() self.gn = nn.GroupNorm(num_groups=32, num_channels=32) def forward(self, x): return self.gn(x) model = Model().eval() c_model = torch.compile(model) x = torch.randn(1, 32, 128, 128, 128) with torch.no_grad(): output = model(x) c_output = c_model(x) print(torch.max(torch.abs(output - c_output))) print(torch.allclose(output, c_output, 1.3e-6, 1e-5)) ``` logs - before ``` tensor(7.0095e-05) False ``` - After ``` tensor(9.5367e-07) True ``` - on CUDA ``` tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>) True ``` Generated code: - before ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); 
Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L)); for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16)); auto tmp1 = out_ptr0[static_cast<int64_t>(x0)]; auto tmp4 = out_ptr1[static_cast<int64_t>(x0)]; auto tmp12 = in_ptr1[static_cast<int64_t>(x0)]; auto tmp15 = in_ptr2[static_cast<int64_t>(x0)]; auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(2097152.0); auto tmp6 = tmp4 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = at::vec::Vectorized<float>(tmp12); auto tmp14 = tmp11 * 
tmp13; auto tmp16 = at::vec::Vectorized<float>(tmp15); auto tmp17 = tmp14 + tmp16; tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0)); } } } } } } ''') ``` - After ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L)); static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L)); for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0); } } } tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma GCC ivdep for(int64_t 
x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16)); auto tmp1 = out_ptr0[static_cast<int64_t>(x0)]; auto tmp4 = out_ptr1[static_cast<int64_t>(x0)]; auto tmp12 = in_ptr1[static_cast<int64_t>(x0)]; auto tmp15 = in_ptr2[static_cast<int64_t>(x0)]; auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(2097152.0); auto tmp6 = tmp4 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = at::vec::Vectorized<float>(tmp12); auto tmp14 = tmp11 * tmp13; auto tmp16 = at::vec::Vectorized<float>(tmp15); auto tmp17 = tmp14 + tmp16; tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0)); } } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2025-03-18 02:05:35 +00:00
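The two ingredients named in the commit message above (Welford's algorithm for mean and variance, and cascade summation for sums) can be illustrated in plain Python. This is only a minimal sketch of the general techniques, not Inductor's generated code or its `WelfordHelper`; the function names `welford`, `welford_combine`, and `cascade_sum` and the `chunk` parameter are hypothetical and chosen for illustration.

```python
def welford(values):
    """One-pass Welford update; returns (mean, m2, count).

    m2 is the running sum of squared deviations from the mean,
    so the population variance is m2 / count.
    """
    mean, m2, count = 0.0, 0.0, 0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2, count


def welford_combine(a, b):
    """Merge two (mean, m2, count) partial results into one."""
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    if n == 0:
        return 0.0, 0.0, 0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return mean, m2, n


def cascade_sum(values, chunk=4):
    """Cascade (pairwise) summation.

    Short runs are summed directly; longer runs are split in half and
    the two partial sums are added, which keeps rounding error growth
    lower than a single long running sum. `chunk` is an illustrative
    cutoff, not a value taken from the PR.
    """
    values = list(values)
    if len(values) <= chunk:
        total = 0.0
        for v in values:
            total += v
        return total
    mid = len(values) // 2
    return cascade_sum(values[:mid], chunk) + cascade_sum(values[mid:], chunk)
```

Merging partial results with `welford_combine` gives the same statistics as a single pass over the whole input, which is what makes the algorithm suitable for vectorized or chunked reductions.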
cyy	1096443467	Use torch_compile_options for c10 libraries (#147821 ) c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-03-18 01:54:23 +00:00
Su, Tong	60523540f1	Force build to conform C++ standard on windows by adding /permissive- flag (#149035 ) Fixes #147366 1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard. 2. Fix the error when trying to assign a string literal to a non-const ptr. The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170 From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks), > By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions. > The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option. Thus, it is reasonable to add this flag to the existing project. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-03-18 01:51:46 +00:00
Xia, Weiwen	c1dd75e4dc	Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031 ) Summary Previous implementation of shim did not align with the design and it was removed by https://github.com/pytorch/pytorch/pull/148907 This PR adds it back in the files of MKLDNN backend and re-enable the CPP wrapper UT. Test plan ``` pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031 Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire	2025-03-18 01:33:13 +00:00
cyy	425c6d8eba	Replace c10::is_pod with std::is_trivial (#149286 ) These remaining c10::is_pod calls can be replaced without compromising the semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286 Approved by: https://github.com/zou3519	2025-03-18 01:33:01 +00:00
Animesh Jain	f9a787224c	[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228 ) Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves the readability of recompilation messages. Previously, the recompile message would only show the `id`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228 Approved by: https://github.com/jansel	2025-03-18 01:25:37 +00:00
Davide Italiano	186cc7327c	[MPS/BE] Remove decorator that skipped test on macOS 12. (#149365 ) macOS 12 is not really supported anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365 Approved by: https://github.com/malfet	2025-03-18 00:58:08 +00:00
Aaron Gokaslan	a0ac63cbd9	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-18 00:46:07 +00:00
Davide Italiano	811f587d86	[MPS/BE] @parametrize generation of pointwise_ops. (#149363 ) Makes this less error-prone and reduces duplication. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149363 Approved by: https://github.com/malfet	2025-03-18 00:37:43 +00:00
Bin Bao	9a78513c3c	[AOTI] Update test runner to use the new APIs (#147105 ) Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still kept using legacy APIs, and will follow up with internal test refactoring. Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105 Approved by: https://github.com/jingsh	2025-03-18 00:27:09 +00:00
PyTorch MergeBot	b52a8bef01	Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228 )" This reverts commit 5905bbe745b0acb4909243c93014c0e6f3512c2d. Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949))	2025-03-18 00:10:50 +00:00
Nikita Shulga	46226a90c8	[EZ][BE] Remove cross-compilation options from mac-build.yml (#149237 ) It has long been gone Pull Request resolved: https://github.com/pytorch/pytorch/pull/149237 Approved by: https://github.com/seemethere, https://github.com/atalman	2025-03-17 23:50:31 +00:00
Eli Uriegas	523bffd388	cd: Add no-cache for test binaries (#149218 ) This is to make it so that we don't experience issues like https://github.com/pytorch/vision/actions/runs/13861462856/job/38795684317#step:13:212 ``` ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them. unknown package: Expected sha256 8e34a6f02ac5a63763251953063a19ba9df855ac2c8a13ef409dfef708e2ba26 Got 341156cc5067488565c1e103be6e95105b0fc0d87d8ac24ff8891f63fd33216f ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149218 Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet	2025-03-17 23:26:20 +00:00
Mayank Mishra	37c914ca0c	fix simple-spec crash (#147723 ) found an issue while running `python torchgen/fuse/gen_patterns.py` exact error: ```shell Traceback (most recent call last): File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module> joint_graph.lazy_init() File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init result = fn() File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init _pad_mm_init() File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init gen_register_replacement( File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement pat = _serialize_pattern( File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern file_template = get_file_template() File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)): File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__ return _abc_subclasscheck(cls, subclass) TypeError: issubclass() arg 1 must be a class ``` This PR fixes this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723 Approved by: https://github.com/aorenste Co-authored-by: Aaron Orenstein <aorenste@meta.com>	2025-03-17 23:25:48 +00:00
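The traceback above shows `issubclass()` raising `TypeError` even though the call is guarded by `isinstance(attr, type)`. One way this can happen, as an assumption rather than a confirmed diagnosis of this PR, is that on some Python versions a parameterized generic such as `list[int]` passes the `isinstance(..., type)` check yet is rejected by `issubclass`. A defensive helper might look like the sketch below; `safe_issubclass` is a hypothetical name and is not the fix that landed in the PR.

```python
def safe_issubclass(obj, bases):
    """Return True only if obj is a class that subclasses `bases`.

    Unlike a bare issubclass() call, this never raises TypeError for
    objects that are not real classes.
    """
    if not isinstance(obj, type):
        return False
    try:
        return issubclass(obj, bases)
    except TypeError:
        # Assumption: some objects (for example list[int] on certain
        # Python versions) pass the isinstance check above but are
        # still rejected by issubclass.
        return False
```

Used in a loop over module attributes, a helper like this skips anything that is not a genuine class instead of aborting the whole scan.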
Tony-Y	78715a181f	Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674 ) Fixes #145461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-17 23:07:05 +00:00
Mu-Chu Lee	1157367c78	[AOTInductor] [BE] Add macro for loading symbols in aoti runner (#149249 ) Summary: Add macro for loading symbols in aoti runner Test Plan: Existing tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249 Approved by: https://github.com/chenyang78	2025-03-17 23:02:01 +00:00
PyTorch MergeBot	24cfeec2c7	Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 )" This reverts commit bfee141666319c80b6c5284394905beef8682515. Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see `8bc7bd94a5/1` ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))	2025-03-17 22:57:00 +00:00
PyTorch MergeBot	afa1eda901	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))	2025-03-17 22:43:15 +00:00
Yanan Cao (PyTorch)	a16ada41b9	Fix outdated docstring of torch.export.export regarding strict flag (#149077 ) Summary: Fix outdated docstring of torch.export.export regarding strict flag Test Plan: None, doc only change Differential Revision: D71068215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077 Approved by: https://github.com/zhxchen17	2025-03-17 22:29:20 +00:00
Sheng Qin	d25617255c	Fix AOTI update_constant_buffer issue. (#149243 ) Summary: In D69553929 we changed the logic of constant & buffer update in AOTI. However this is incompatible with current Sigmoid runtime since we have different logics to pass in buffers, resulted in errors like ``` I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights * Aborted at 1741652964 (Unix time, try 'date -d 1741652964') * * Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: * @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t, void) ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453 @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t, void) ./fbcode/folly/fibers/GuardPageAllocator.cpp:237 @ 000000000004455f (unknown) /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8 -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque> > > const&, bool, bool) ``` Test Plan: 1) Generate lowered merge net ``` CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par --action=generate 
--input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false ``` 2) Load net predictor ``` CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true ``` Reviewed By: hl475 Differential Revision: D71236710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243 Approved by: https://github.com/hl475, https://github.com/jingsh	2025-03-17 22:10:57 +00:00
Isuru Fernando	a3c6e3139a	allow extra args for parameterization of tests in inductor (#149154 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154 Approved by: https://github.com/amjames, https://github.com/eellison	2025-03-17 22:05:06 +00:00
Davide Italiano	e4f6e4ac84	[MPS] Add inductor support for `modified_bessel_i0`. (#149342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342 Approved by: https://github.com/malfet	2025-03-17 21:45:51 +00:00
Carlo Bertolli	8bc7bd94a5	[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527 ) This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527 Approved by: https://github.com/jeffdaily Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>	2025-03-17 20:51:36 +00:00
Benjamin Glass	e8dd58b8cf	cpp_wrapper: Precompile device-specific header files (#146928 ) This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone. Relands #144002, with changes needed by fbcode internals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928 Approved by: https://github.com/desertfire	2025-03-17 20:40:15 +00:00
Sampsa	5e9f792479	[ROCm] Unskip flex attention UTs after triton 3.3 bump (#148327 ) Enable `test_flex_attention.py::TestLearnableBiases` unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327 Approved by: https://github.com/jeffdaily	2025-03-17 20:15:14 +00:00
Shunting Zhang	6c7d8419e3	fix two accuracy regressions (#149172 ) There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check. - error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316 - error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172 Approved by: https://github.com/jansel, https://github.com/eellison	2025-03-17 19:34:00 +00:00
Pat Vignola	769f19bf95	[MTIA] Add _mtia_exchangeDevice to MTIA module (#149322 ) Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well. Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device` Reviewed By: chaos5958 Differential Revision: D70072059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322 Approved by: https://github.com/chaos5958	2025-03-17 19:31:10 +00:00
angelayi	8d7c430e84	Symintify transpose_ (#149057 ) Fixes https://github.com/pytorch/pytorch/issues/148702 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057 Approved by: https://github.com/yushangdi	2025-03-17 19:11:54 +00:00
Fadi Arafeh	08a644a4c4	Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585 ) This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly. Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation. However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature. Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation. This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`. 
- For `qlinear_dynamic` (dynamically quantized matmuls): This PR yields an **average speedup (averaged over context_lengths of 2^3 up to 2^9) of ~ 50% for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below: ``` # SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com> # SPDX-License-Identifier: BSD-3-Clause import torch from transformers import AutoModel, AutoConfig import time import numpy as np from argparse import ArgumentParser class ModelArgumentParser(ArgumentParser): def __init__(self) -> None: super().__init__(description="huggingface model") self.add_argument("--context_length", help="context length - number of input tokens", type=int, default=64 ) self.add_argument("--model", help="model checkpoint - i.e. 'bert-base-uncased'", type=str, default=None) self.add_argument("--iters", help="benchmark iterations", default=500) if __name__ == "__main__": parser = ModelArgumentParser() args = parser.parse_args() model_name = args.model config = AutoConfig.from_pretrained(model_name) batch_size = 1 model = AutoModel.from_pretrained(model_name) model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8) model.eval() inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu") times = [] with torch.no_grad(): # warmup for _ in range(10): model(inputs) # benchmark for _ in range(args.iters): s = time.time_ns() model(inputs) times.append((time.time_ns() - s) / 1e6) print("Model = ", model_name) print("Context Length = ", args.context_length) print("Min (ms) = ", min(times)) print("Mean (ms) = ", np.mean(times)) ``` - For `qlinear` (statically quantized matmuls): This PR yields an average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the 
benchmarking script below. The averages are over for all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`. The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64. ``` # SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com> # SPDX-License-Identifier: BSD-3-Clause import torch import torch.nn as nn from torch.quantization import QConfig from torch.ao.quantization.observer import HistogramObserver, default_weight_observer import torch import torch.nn as nn import numpy as np import random from argparse import ArgumentParser import time class ModelArgumentParser(ArgumentParser): def __init__(self) -> None: super().__init__() self.add_argument("--M", help="M dimension", type=int, default=64 ) self.add_argument("--K", help="K dimension", type=int, default=64 ) self.add_argument("--N", help="N dimension", type=int, default=64 ) self.add_argument("--signed_input", help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8", action="store_true" ) self.add_argument("--seed", help="Random seed", type=int, default=42 ) self.add_argument("--iters", help="benchmark iterations", default=500) def set_seed(seed): random.seed(seed) np.random.seed(seed) torch.manual_seed(seed) class LinearModel(nn.Module): def __init__(self, K, N): super(LinearModel, self).__init__() self.quant = torch.quantization.QuantStub() self.fc = nn.Linear(K, N) self.dequant = torch.quantization.DeQuantStub() def forward(self, x): x = self.quant(x) x = self.fc(x) x = self.dequant(x) return x def quantize_model(model, args): qconfig = QConfig( activation=HistogramObserver.with_args(reduce_range=False, dtype=torch.qint8 if args.signed_input else torch.quint8), weight=default_weight_observer, ) # Prepare the model for static quantization # Specify quantization configurations model.qconfig = qconfig model_prepared = 
torch.quantization.prepare(model_fp32) # Calibrate the model with sample inputs # Example input data for calibration with torch.no_grad(): sample_data = torch.randn(args.M, args.K) model_prepared(sample_data) # Convert the prepared model to a quantized model model_quantized = torch.quantization.convert(model_prepared) return model_quantized if __name__ == "__main__": parser = ModelArgumentParser() args = parser.parse_args() set_seed(args.seed) model_fp32 = LinearModel(args.K, args.N) model_quantized = quantize_model(model_fp32, args) inputs = torch.randn(args.M, args.K) times = [] with torch.no_grad(): # warmup for _ in range(10): model_quantized(inputs) # benchmark for _ in range(args.iters): s = time.time_ns() model_quantized(inputs) times.append((time.time_ns() - s) / 1e6) print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input) print("Min Times (ms) = ", min(times)) print("Mean Times (ms) = ", np.mean(times)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-17 18:21:10 +00:00
Isuru Fernando	c41c2130be	Fix printing INT64_MIN (#149148 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148 Approved by: https://github.com/anijain2305	2025-03-17 17:57:18 +00:00
Yichen Yan	8cdb9adc05	do not run `test_ck_blas_library` on cpu (#148316 ) Fix on non-rocm: ``` root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu E ====================================================================== ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU) ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper method(args, kwargs) File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test raise rte File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test result = test(self, param_kwargs) File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn return fn(slf, args, *kwargs) File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn fn(args, **kwargs) File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library torch.backends.cuda.preferred_blas_library('ck') File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library torch._C._set_blas_preferred_backend(_BlasBackends[backend]) RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm. To execute this test, run the following from the base repo dir: python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.346s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316 Approved by: https://github.com/jeffdaily	2025-03-17 17:45:45 +00:00
Catherine Lee	224cd9f055	[ez] Flush trymerge print statements (#149012 ) Logs of trymerge don't match up with timestamps, ex https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591 Ex: ``` 2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed) ... 2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min 2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed) ... 2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min 2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed) ``` Either buffering prints or github actions logs are being weird? Print with flush to see if it helps Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012 Approved by: https://github.com/malfet	2025-03-17 17:04:48 +00:00
Rachel Guo	aaa4c3d60b	[mm_logs] make aten mm info readable (#148800 ) Summary: As title. Print the aten.mm info as a table, for example (see also the picture in the test plan): | Name | M | N | K | Count | | aten.mm | 16 | 6 | 16 | 1 | ... Test Plan: {F1975907876} <img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" /> Differential Revision: D70825664 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800 Approved by: https://github.com/henrylhtsang	2025-03-17 17:00:58 +00:00
Xinya Zhang	2a011ca904	[ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911 ) Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911 Approved by: https://github.com/jeffdaily	2025-03-17 16:41:15 +00:00
PyTorch MergeBot	9d37b501db	Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145 )" This reverts commit 2e02c07a5d1c432547542f90de2885be9ffd13cf. Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))	2025-03-17 16:17:02 +00:00
Yu, Guangye	c7c3e77324	Refine XPU oneDNN context manager API (#147349 ) # Motivation This PR introduces improvements to the XPU oneDNN context manager API: - `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index. - `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index. Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349 Approved by: https://github.com/EikanWang	2025-03-17 14:45:56 +00:00
PyTorch UpdateBot	790f93db3a	Update slow tests (#149300 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149300 Approved by: https://github.com/pytorchbot	2025-03-17 11:39:29 +00:00
Sun, Jiayi	b2862f1435	optimize the decomposition of aten.native_group_norm (#144733 ) Summary: Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large. The original decomposition: 1. compute `mean` and `rstd`, 2. out = (x - mean) * rstd, compute in the range [N, C, *], 3. out = out * weight + bias, compute in the range [N, C, *]. The new decomposition: 1. compute `mean` and `rstd`, 2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C], 3. out = x * new_weight + new_bias, compute in the range [N, C, *]. I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-17 09:27:01 +00:00
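The reordering described in the group-norm commit above can be checked with a small pure-Python sketch. This is not the decomposition code from the PR; the helper names (`group_stats`, `group_norm_original`, `group_norm_reordered`) and the single-sample list-of-channels layout are assumptions made for illustration. It only shows that folding `mean` and `rstd` into a per-channel scale and shift gives the same output as normalizing first and applying the affine step second.

```python
import math


def group_stats(x, num_groups, eps=1e-5):
    """Per-group (mean, rstd) for one sample x, given as a list of C channels."""
    channels_per_group = len(x) // num_groups
    stats = []
    for g in range(num_groups):
        chans = x[g * channels_per_group:(g + 1) * channels_per_group]
        values = [v for ch in chans for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, 1.0 / math.sqrt(var + eps)))
    return stats, channels_per_group


def group_norm_original(x, weight, bias, num_groups):
    stats, cpg = group_stats(x, num_groups)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // cpg]
        # Two passes over the inner dimension: normalize, then apply the affine step.
        normed = [(v - mean) * rstd for v in ch]
        out.append([n * weight[c] + bias[c] for n in normed])
    return out


def group_norm_reordered(x, weight, bias, num_groups):
    stats, cpg = group_stats(x, num_groups)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // cpg]
        # Fold the statistics into one scale and one shift per channel,
        # so the inner dimension sees a single multiply-add.
        new_weight = rstd * weight[c]
        new_bias = bias[c] - mean * rstd * weight[c]
        out.append([v * new_weight + new_bias for v in ch])
    return out


# Illustrative data: 4 channels, 3 inner elements, 2 groups.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.5, -1.0, 2.5], [3.0, 3.5, -2.0]]
weight = [1.0, 0.5, 2.0, -1.0]
bias = [0.0, 0.1, -0.2, 0.3]
a = group_norm_original(x, weight, bias, num_groups=2)
b = group_norm_reordered(x, weight, bias, num_groups=2)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print(max_diff < 1e-9)
```

The two versions agree up to floating-point rounding; the reordered one does the extra arithmetic once per channel rather than once per inner element, which is where the commit expects the savings when the inner size is large.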
zeshengzong	1cc5f6b623	Optimize `MaxPool1d` param `ceil_mode` description (#148869 ) Fixes #148123 Add output shape formula based on `ceil_mode` value, according to `00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)` ## Test Result ### Before ![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a) ### After ![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869 Approved by: https://github.com/mikaylagawarecki	2025-03-17 08:50:40 +00:00
soulitzer	916e8979d3	Skip some tests not using gradcheck on slowgradcheck (#149220 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149220 Approved by: https://github.com/seemethere	2025-03-17 00:34:52 +00:00
eqy	6048d88afe	[ARM64][CUDA] skip string pattern matching in `test_workspace_allocation_error` (#149236 ) `unwind()` on ARM64 seems to elide the strings of interest Pull Request resolved: https://github.com/pytorch/pytorch/pull/149236 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/BoyuanFeng	2025-03-17 00:30:43 +00:00
Aaron Gokaslan	bfee141666	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-16 23:52:58 +00:00
Tugsbayasgalan Manlaibaatar	6b1b95ad2a	Support subclass constructor capturing in export (#147014 ) Notable TODOs: 1. Need to implement AutogradHOP to get rid of subclasses before serializing 2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014 Approved by: https://github.com/bdhirsh	2025-03-16 18:19:19 +00:00
Animesh Jain	5905bbe745	[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228 ) Doing this removes the need of collecting `id` and therefore facilitates serialization. It also improves readability with recompilations. Earlier, recompile message will just show the `id`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228 Approved by: https://github.com/jansel	2025-03-16 15:56:17 +00:00
Davide Italiano	9f33c6f0a0	[MPS] Add support for modified_bessel_i0 in eager. (#149264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-16 04:45:49 +00:00
Nikita Shulga	f80bee4934	[MPS][BE] Move common binary ops macros to indexing.h (#149263 ) And binary op invocation logic to OperationUtils.mm. This is a no-op change; additional sanity checks/logic improvements will be added as follow-ups. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263 Approved by: https://github.com/dcci ghstack dependencies: #149262	2025-03-16 02:06:40 +00:00
Davide Italiano	21c2edfec8	[MPS/metal] Add missing `inline` to function definitions. (#149265 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149265 Approved by: https://github.com/malfet	2025-03-16 00:33:27 +00:00
Nikita Shulga	3e2c4086ad	[EZ][BE] Reuse `result_of` from `c10/metal/utils.h` (#149262 ) No need for one more implementation Pull Request resolved: https://github.com/pytorch/pytorch/pull/149262 Approved by: https://github.com/dcci	2025-03-16 00:21:28 +00:00
Sam Larsen	acf42b0048	Fix memory leak in subproc_pool future (#149259 ) Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259 Approved by: https://github.com/Skylion007	2025-03-15 20:26:30 +00:00
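The shape of the cycle described in the commit above (a future stores a callback, and the callback captures a future) can be sketched with a toy class. `TinyFuture` is a hypothetical stand-in, not the subproc_pool code, and the weak-reference variant is just one way to break such a cycle, not necessarily the fix the PR applies. In this toy the cyclic collector does reclaim the object once it runs; the point is only that without the cycle the object is freed immediately by reference counting.

```python
import gc
import weakref


class TinyFuture:
    """Hypothetical minimal future: stores callbacks and invokes them on completion."""

    def __init__(self):
        self.callbacks = []
        self.result = None

    def add_done_callback(self, fn):
        self.callbacks.append(fn)

    def set_result(self, value):
        self.result = value
        for fn in self.callbacks:
            fn(self)


def make_cyclic():
    fut = TinyFuture()
    # The closure captures `fut`, and `fut` stores the closure: a reference cycle.
    fut.add_done_callback(lambda _done: fut.result)
    fut.set_result(1)
    return weakref.ref(fut)


def make_acyclic():
    fut = TinyFuture()
    fut_ref = weakref.ref(fut)
    # The closure holds only a weak reference, so no cycle is formed.
    fut.add_done_callback(lambda _done: fut_ref() and fut_ref().result)
    fut.set_result(1)
    return fut_ref


gc.disable()  # keep the cyclic collector from running on its own during the demo
try:
    cyclic_ref = make_cyclic()
    acyclic_ref = make_acyclic()
    cyclic_alive_before_gc = cyclic_ref() is not None
    acyclic_alive_before_gc = acyclic_ref() is not None
    gc.collect()
    cyclic_alive_after_gc = cyclic_ref() is not None
finally:
    gc.enable()

print(cyclic_alive_before_gc, acyclic_alive_before_gc, cyclic_alive_after_gc)
```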
James Wu	a9c55277d7	[Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238 ) This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc. This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly. Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66 Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel. The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all. This diff does not add the launcher to torch, but introduces a basic test suite. A list of TODOs that are not yet complete: - Handle `nvTmaDesc` and `cuTensorMap`, which triton handles - Embed the grid logic instead of passing in gridX,Y,Z - Handle launch_enter and exit hooks? (Not sure if inductor has these) - Benchmarking to see if there's runtime performance loss - Probably lots of features of the triton C++ generated code that I haven't handled yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238 Approved by: https://github.com/oulgen	2025-03-15 15:06:46 +00:00
Sam Larsen	c83c711da8	Remove some memory overhead in parallel compile workers (#149168 ) Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead. Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu: * After importing torch in a subproc: 371M * Without this PR, after compiling 15k kernels: 825M * With this PR, after compiling 15k kernels: 531M Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168 Approved by: https://github.com/jansel	2025-03-15 14:20:40 +00:00
Huamin Li	e7e477c1f9	Not generate custom obj json when it's empty (#149246 ) Summary: as title. See internal Diff summary for more context. Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated Differential Revision: D71241676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246 Approved by: https://github.com/houseroad Co-authored-by: Huamin Li <huaminli@meta.com>	2025-03-15 13:00:48 +00:00
Lirong	4482a65fef	Add side_effect to avoid dce custom op in CA graph (#149181 ) We found that in compiled_autograd, a user-defined custom op gets dead-code-eliminated (DCE) in the backward graph. We added a side-effect condition to the DCE function so that custom ops with side effects are not eliminated in the CA graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181 Approved by: https://github.com/xmfan	2025-03-15 04:15:49 +00:00
Wenjie Yang	115fc98cc0	Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106 ) Summary: Use Sharding Strategy for aten.split.Tensor instead of sharding rule Test Plan: pytest test/distributed/tensor/test_dtensor_ops.py -s -k split Reviewers: xilunwu Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106 Approved by: https://github.com/XilunWu, https://github.com/tianyu-l	2025-03-15 04:03:40 +00:00
Jane Xu	740ce0fa5f	op should NOT be static in aoti_torch_call_dispatcher (#149208 ) aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208 Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet	2025-03-15 01:47:11 +00:00
Simon Fan	578160c875	[ca] don't inline accumulate grad op (#149014 ) we use dummy tensors in our initial trace, so we should never inline. the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014 Approved by: https://github.com/jansel ghstack dependencies: #149064	2025-03-15 01:10:54 +00:00
Simon Fan	f4368d8872	[ca] clean up aot node deduping (#149064 ) rename the AOT nodes as we copy paste them into the CA graph Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064 Approved by: https://github.com/jansel	2025-03-15 01:10:54 +00:00
Nikita Shulga	96795e9533	[BE] Parametrize `TestMPS.test_binops_dtype_precedence` (#149234 ) No-op change; it just splits one longer test into a series of smaller ones. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234 Approved by: https://github.com/atalman, https://github.com/dcci ghstack dependencies: #149216, #149233	2025-03-15 00:37:11 +00:00
Jithun Nair	1c7196f04b	Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394 ) Refiling https://github.com/pytorch/pytorch/pull/148387 from pytorch repo branch to get AWS login via OIDC working Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535 Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460 ![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032) Run with cached docker image: ![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813) ~6 min vs 3 s :) Thanks @saienduri for the help on the MI300 infra side Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394 Approved by: https://github.com/jeffdaily	2025-03-15 00:34:04 +00:00
xinan.lin	9ad6265d04	[AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175 ) When model_container_runner_xpu.cpp is missing from the build, compilation fails when a user builds a C++ inference application on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175 Approved by: https://github.com/jansel	2025-03-15 00:30:04 +00:00
yifanmao	7537b19c73	[FSDP2] Update ignored_params docstring and add unit test (#149074 ) Fixes https://github.com/pytorch/pytorch/issues/148242 ignored_params won't be moved to devices in full_shard(), update docstring. Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during full_shard(), but would be after `model.cuda()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074 Approved by: https://github.com/awgu	2025-03-15 00:23:09 +00:00
maajidkhann	09f7f62cfe	Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070 ) Issue: * The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards. * Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve * This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated. Fix: * Updated the build flags to explicitly use -march=armv8-a+sve, ensuring that GCC and clang promote it correctly, which resolves the compatibility issues with ARMv8-A while SVE still works as before. * This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware. Test plan: - Allocate `a1.4xlarge` on AWS - Run following script using wheel produced by this PR ```python import torch def f(x): return x.sin() + x.cos() print(torch.__version__) f_c = torch.jit.script(f) ``` - Observe no crash ``` $ python3 foo.py 2.7.0.dev20250313+cpu ``` - Observe crash with 2.6.0 ``` $ python3 foo.py 2.6.0+cpu Illegal instruction (core dumped) ``` Fixes #146792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070 Approved by: https://github.com/malfet	2025-03-15 00:02:38 +00:00
Nikita Shulga	08af311fc2	[MPS] Fix type promotion for `torch.floor_divide` (#149233 ) And delete some duplicated glue code by relying on the stub. After this change, `torch.arange(10, device = 'mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints. Checked by `test_div2` inductor testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233 Approved by: https://github.com/atalman ghstack dependencies: #149216	2025-03-15 00:00:42 +00:00
bobrenjc93	eb7bf4202d	Make dynamism code robust to NotImplementedException (#148823 ) In prod many models have `@property` methods that raise NotImplementedError. This PR updates our dynamism code to be more robust to these types of models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823 Approved by: https://github.com/laithsakka	2025-03-14 23:38:19 +00:00
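The failure mode mentioned in the commit above (an `@property` that raises `NotImplementedError` while attributes are being inspected) can be illustrated with a short sketch. The `Model` class and the `collect_attributes` helper are hypothetical and are not taken from the dynamism code; the sketch only shows the general pattern of guarding each attribute read so that one raising property does not abort the whole walk.

```python
class Model:
    """Hypothetical model with one ordinary attribute and one raising property."""

    def __init__(self):
        self.hidden_size = 128

    @property
    def num_layers(self):
        raise NotImplementedError("subclass must define num_layers")


def collect_attributes(obj):
    """Hypothetical helper: read public attributes, skipping properties that raise."""
    values = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            values[name] = getattr(obj, name)
        except NotImplementedError:
            # Treat the attribute as unavailable instead of failing the whole walk.
            continue
    return values


attrs = collect_attributes(Model())
print(sorted(attrs))  # prints ['hidden_size']
```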
Stephen Jia	ff58ccec6c	[ATen-CPU] Add `math.h` for Gelu (#149164 ) Summary: ## Context This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198 In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined. Test Plan: Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164 Approved by: https://github.com/swolchok	2025-03-14 23:37:25 +00:00
PyTorch MergeBot	f9b4856989	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit c95a6b416b4d1b830535f82e2719c055d077cbad. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))	2025-03-14 23:13:34 +00:00
PyTorch MergeBot	643aaea133	Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561 )" This reverts commit 5a843f8973d7fc6a601f089fc969d2a5ac7e5338. Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))	2025-03-14 23:01:26 +00:00
cz2h	05f2cbfe19	Add meta function for out variants of ones,zeros,empty (#149098 ) Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832 For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions. For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098 Approved by: https://github.com/williamwen42	2025-03-14 22:17:30 +00:00
Nikita Shulga	d7d9a71e19	[MPSInductor] Add support for atan2 (#149216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149216 Approved by: https://github.com/dcci	2025-03-14 21:53:03 +00:00
Isalia20	dd6e9df3d0	[MPS] fix attention enable_gqa crash on mps (#149147 ) Fixes #149132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147 Approved by: https://github.com/malfet	2025-03-14 21:25:54 +00:00
Davide Italiano	0bd863a62f	[MPS] Add inductor support for `i1e`. (#149221 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149221 Approved by: https://github.com/malfet	2025-03-14 21:18:38 +00:00
Aditya Tewari	a0893475ba	Enable oneDNN dispatch for gemm bf16bf16->bf16 (#148197 ) Currently, `linear` layers using BF16 are dispatched to OpenBLAS, provided that sbgemm_ is available. However, profiling on AArch64 shows that dispatching to oneDNN results in a significant speedup. This PR updates the dispatch logic to leverage oneDNN for improved performance. Attaching some benchmark results. Instance: Neoverse-V1, on 16 threads. <img width="482" alt="Screenshot 2025-02-28 at 17 18 38" src="https://github.com/user-attachments/assets/b84e7455-af6e-417f-920d-bdd2bec2e8f9" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148197 Approved by: https://github.com/malfet	2025-03-14 20:58:24 +00:00
albanD	1bdbf12672	Update as strided doc (#149146 ) Make it clearer why it is not recommended to use it and when the resulting Tensor will have undefined behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149146 Approved by: https://github.com/gchanan, https://github.com/jbschlosser	2025-03-14 19:49:57 +00:00
Um Changyong	69aeb87eca	update error message in get_backend() more detail_ (#141796 ) Fixes #ISSUE_NUMBER When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message. ``` │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │ │ ed_c10d.py:1215 in get_backend │ │ │ │ 1212 │ if _rank_not_in_group(pg): │ │ 1213 │ │ raise ValueError("Invalid process group specified") │ │ 1214 │ pg_store = _world.pg_map[pg] if pg in _world.pg_map else None │ │ ❱ 1215 │ return Backend(not_none(pg_store)[0]) │ │ 1216 │ │ 1217 │ │ 1218 def _get_process_group_uid(pg: ProcessGroup) -> int: │ │ │ │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │ │ y:13 in not_none │ │ │ │ 10 │ │ 11 def not_none(obj: Optional[T]) -> T: │ │ 12 │ if obj is None: │ │ ❱ 13 │ │ raise TypeError("Invariant encountered: value was None when it should not be") │ │ 14 │ return obj │ │ 15 │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ TypeError: Invariant encountered: value was None when it should not be Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0> ``` Since this message can cause confusion for multiple developers, the purpose of this PR is to suggest additional details to help clarify the situation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796 Approved by: https://github.com/kwen2501	2025-03-14 19:42:42 +00:00
Qiongwen Zhang	5e79b61e8a	add PrivateUse1 backend in fsdp collectives (#147260 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147260 Approved by: https://github.com/weifengpy	2025-03-14 19:41:41 +00:00
henrylhtsang	fe01af2242	[AOTI][debug logger] small fix for intermediate value debugger for jit when arg is not tensor (#149007 ) repro: ``` import torch import torch._inductor.config as config config.aot_inductor.debug_intermediate_value_printer = "2" config.aot_inductor.filtered_kernel_names = "triton_poi_fused__to_copy_add_0" class Model(torch.nn.Module): def forward(self, x): x = x.to(torch.float) return x + 1 model = Model().cuda() x = torch.randn(10).cuda().to(torch.float8_e4m3fn) _ = torch.compile(model, fullgraph=True)(x) print("done") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149007 Approved by: https://github.com/jingsh	2025-03-14 19:40:41 +00:00
Aaron Gokaslan	c96ed7e6f5	[BE]: No include left behind - recursive glob setuptools support (#148258 ) Fixes #148256 TestPlan check the printout from the setup.py build and verify the files are still included. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148258 Approved by: https://github.com/malfet, https://github.com/benjaminglass1	2025-03-14 19:39:21 +00:00
Nikita Shulga	9d7945e382	[EZ] Fix typo in UnaryOps.mm (#149217 ) s/imput/input/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/149217 Approved by: https://github.com/ZainRizvi, https://github.com/dcci	2025-03-14 19:31:20 +00:00
zeshengzong	a7f8de2198	Add `nn.Bilinear` param validation (#149018 ) Fixes #103425 ## Changes - Document that size values `must be > 0` - Add validation for the `in1_features` param Currently, only an invalid `in1_features` causes a runtime error; adding checks for `in2_features` and `out_features` as well might be BC-breaking. ```python import torch from torch import nn class lenet(nn.Module): def __init__(self): super(lenet, self).__init__() self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1) # Error, `in1_features=1, in2_features=0, out_features=0` no error self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0) def forward(self, x): # 1st block x = self.conv(x) x = self.linear(x) return x if __name__ == '__main__': net = lenet() ``` ## Test Result ```bash pytest test/test_nn.py -k test_bilinear -vv ``` ![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43) ![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018 Approved by: https://github.com/mikaylagawarecki	2025-03-14 19:26:12 +00:00
James Wu	5a843f8973	[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561 ) Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc. This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly. Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66 Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel. The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all. This diff does not add the launcher to torch, but introduces a basic test suite. A list of TODOs that are not yet complete, will do in separate diff: - Handle `nvTmaDesc` and `cuTensorMap`, which triton handles - Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation. - Handle launch_enter and exit hooks? (Not sure if inductor has these) - Benchmarking to see if there's runtime performance loss - Hooking it up with a config to inductor - Testing harness to test against torch generated triton kernels Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561 Approved by: https://github.com/aorenste, https://github.com/syed-ahmed	2025-03-14 19:12:13 +00:00
zeshengzong	97272e4b49	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-14 18:53:10 +00:00
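The corner case in the hardswish entry above concerns the gradient exactly at the breakpoints. A minimal pure-Python sketch, assuming the boundary convention of the documented piecewise definition (0 for x <= -3, x for x >= 3, x * (x + 3) / 6 otherwise); the helper names are hypothetical and this is not the ATen kernel:

```python
def hardswish(x: float) -> float:
    """Piecewise hardswish, following the documented definition."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0


def hardswish_grad(x: float) -> float:
    """Derivative using the same branch conditions as hardswish() above.

    Assumption: the breakpoints x = -3 and x = 3 belong to the outer
    branches, so the gradient there is 0 and 1 respectively.
    """
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    # d/dx [x * (x + 3) / 6] = (2x + 3) / 6
    return (2.0 * x + 3.0) / 6.0
```

With a strict comparison at the lower breakpoint (`x < -3.0`), the middle branch would be evaluated at x = -3 and give (2 * -3 + 3) / 6 = -0.5, which is the kind of mismatch a dedicated test at value -3 is meant to catch.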
Ethan Wee	2e02c07a5d	[ROCm] enable HIPMallocAsyncAllocator (#149145 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145 Approved by: https://github.com/jeffdaily	2025-03-14 18:21:27 +00:00
Nikita Shulga	f2221b2fce	[MPS] Add support for `i1e` (#149203 ) Followup after https://github.com/pytorch/pytorch/pull/149174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149203 Approved by: https://github.com/dcci	2025-03-14 17:33:52 +00:00
Davide Italiano	f067eafabb	[MPS] Modify a test to test the correct function. (#149204 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149204 Approved by: https://github.com/malfet	2025-03-14 17:27:47 +00:00
Nikita Shulga	42e468d9b0	[MPSInductor] Adjust check_bounds (#147205 ) To make upper bound inclusive, which fixes `test_vectorized_ops_masked` and results in the following code ```python mps_lib_0 = compile_mps_shader(""" #include <c10/metal/random.h> #include <c10/metal/special_math.h> #include <c10/metal/utils.h> kernel void generated_kernel( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { int x0 = (xindex) % (64); int x1 = (xindex) / (64); auto tmp5 = in_ptr0[x0 + 63*x1]; int x2 = xindex; auto tmp0 = x0; auto tmp1 = static_cast<long>(tmp0); auto tmp2 = 63; auto tmp3 = tmp1 < tmp2; if (x0 > 63) return; auto tmp6 = tmp3 ? tmp5 : 7; out_ptr0[x2] = static_cast<float>(tmp6); } """) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147205 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #147211	2025-03-14 17:26:00 +00:00
cyy	a9aae05a6b	Remove test decorations on MacOS 12 (#148942 ) MacOS 12 may reach EOL, as from https://endoflife.date/macos Pull Request resolved: https://github.com/pytorch/pytorch/pull/148942 Approved by: https://github.com/malfet	2025-03-14 17:22:37 +00:00
Davide Italiano	f2ea77c099	[MPS] Add inductor support for i0e. (#149180 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149180 Approved by: https://github.com/malfet	2025-03-14 16:15:52 +00:00
PyTorch MergeBot	71795f159e	Revert "[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167 )" This reverts commit bea181ff7eeead9fcdd806e286846296c4ab2d67. Reverted https://github.com/pytorch/pytorch/pull/149167 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D71177501 for the failure. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149167#issuecomment-2725001232))	2025-03-14 15:16:21 +00:00
Davide Italiano	706c22549c	[MPS] Add support for `i0e` in eager. (#149174 ) Add `special.i0e` to XFAIL_GRADLIST for now, as its backward op is not yet implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-14 14:43:46 +00:00
Huamin Li	68bbe20db7	Add test coverage (#149182 ) Summary: Follow up from D71160718 Differential Revision: D71177037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149182 Approved by: https://github.com/houseroad	2025-03-14 09:38:29 +00:00
Xuehai Pan	c95a6b416b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple class as namedtuples. Before this PR, only namedtuple classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-14 08:50:30 +00:00
Sheng Fu	05ac99042f	Clean up grid in execution trace (#149159 ) Summary: The diff https://www.internalfb.com/diff/D70471332 removed the "grid" input when calling Triton kernels. The PyTorch execution trace (ET) needs the matching change, both when capturing and when replaying an ET. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay Differential Revision: D71152464 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159 Approved by: https://github.com/sraikund16, https://github.com/jansel	2025-03-14 07:12:16 +00:00
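The difference between "created directly" and "subclass of a namedtuple class" in the pytree entry above can be illustrated with the standard library alone. This is a sketch under stated assumptions: both helper names are hypothetical and neither is the PR's implementation; they only contrast a strict check on the direct base class with a structural check that also accepts subclasses.

```python
from collections import namedtuple


def is_direct_namedtuple_class(cls) -> bool:
    """Strict check (hypothetical): accepts only classes whose single direct base is tuple."""
    return (
        isinstance(cls, type)
        and cls.__bases__ == (tuple,)
        and isinstance(getattr(cls, "_fields", None), tuple)
    )


def looks_like_namedtuple_class(cls) -> bool:
    """Structural check (hypothetical): also accepts subclasses of namedtuple classes."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return (
        isinstance(fields, tuple)
        and all(isinstance(name, str) for name in fields)
        and callable(getattr(cls, "_make", None))
        and callable(getattr(cls, "_asdict", None))
    )


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """Subclass of a namedtuple class; inherits _fields, _make and _asdict."""

    def label(self) -> str:
        return f"({self.x}, {self.y})"
```

Here `Point` passes both checks, while `LabeledPoint` passes only the structural one because its direct base is `Point` rather than `tuple`. Plain `tuple` and namedtuple instances are rejected by both.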
PyTorch MergeBot	be4e6c1c8e	Revert "[MPS] Add support for `i0e` in eager. (#149174 )" This reverts commit b4745db90482ff139ea62d06ec0a18468e1131b7. Reverted https://github.com/pytorch/pytorch/pull/149174 on behalf of https://github.com/malfet due to MPS are red on trunk ([comment](https://github.com/pytorch/pytorch/pull/149174#issuecomment-2723774600))	2025-03-14 06:35:01 +00:00
Nikita Shulga	e162758051	[MPSInductor] Add `bessel_[jy][01]` ops (#149179 ) By simply calling corresponding special functions Followup TODO: tweak bessel_y0 to match CPU implementation for `torch.half` dtype Pull Request resolved: https://github.com/pytorch/pytorch/pull/149179 Approved by: https://github.com/dcci ghstack dependencies: #149123	2025-03-14 06:33:30 +00:00
Huamin Li	d4496346b9	Update logic when producing key name for keep_original_weights (#149171 ) Differential Revision: D71160718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149171 Approved by: https://github.com/houseroad	2025-03-14 05:29:54 +00:00
Nikita Shulga	db6d72213b	[MPS] Add `torch.special.bessel_[jy][01]` implementations (#149123 ) By copy-n-pasting functions from `f59064f2b7/aten/src/ATen/native/cuda/Math.cuh (L1463)` With an ugly workaround for `bessel_y[01]` to avoid internal compiler exception on M1/M2 machines (see FB16863363 / https://gist.github.com/malfet/e7785e4b572e7740887a83a2386ef769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149123 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-03-14 05:13:55 +00:00
PyTorch MergeBot	e6839819c8	Revert "[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527 )" This reverts commit 4f8391db55c8c3a574d61d99d6d6a4a0b6723acb. Reverted https://github.com/pytorch/pytorch/pull/147527 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, would you be able to help them land the fixes internally? The error looks really simple. See D71152448 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/147527#issuecomment-2723531085))	2025-03-14 05:11:01 +00:00
Isuru Fernando	9e6b2ca58d	Fix sympy float printing (#147552 ) Fixes https://github.com/pytorch/pytorch/pull/147261 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147552 Approved by: https://github.com/bobrenjc93, https://github.com/cyyever	2025-03-14 05:07:06 +00:00
Mu-Chu Lee	bea181ff7e	[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167 ) Summary: We add swap_constant_buffer in pybind to add tests. Test Plan: python test/inductor/test_aot_inductor.py -k test_update_inactive_constant_buffer Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149167 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-14 04:12:48 +00:00
Mu-Chu Lee	e567900998	[AOTInductor] Activate CPU test for update_constant_buffer (#149162 ) Summary: Fixed by #145459 Test Plan: Re-activating tests. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/149162 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-14 04:09:57 +00:00
fduwjj	aed0b7a742	[c10d] Add param recording for uniqueID broadcasting and allgather (#149166 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166 Approved by: https://github.com/kwen2501	2025-03-14 03:51:30 +00:00
Davide Italiano	b4745db904	[MPS] Add support for `i0e` in eager. (#149174 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174 Approved by: https://github.com/malfet	2025-03-14 02:51:28 +00:00
Dmitry Rogozhkin	c179971bfc	xpu: update filter out of dg2 AOT target (#148677 ) torch-xpu-ops has updated list of AOT targets to use and used `dg2` instead of `dg2-g10`. This requires an update in cpp_extension.py which currently filters out `dg2-` prefixed AOT targets. CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD	2025-03-14 02:24:06 +00:00
Eli Uriegas	56b2e4b8f0	ci: Update linux.20_04 --> linux.24_04 (#149142 ) Ubuntu 20.04 is getting deprecated soon so we might as well proactively move to the latest LTS which is 24.04 > [!NOTE] > The oldest supported version of python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test we need to have this particular job stick with 20.04 for now until we decide to upgrade it to a newer python version. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142 Approved by: https://github.com/atalman, https://github.com/wdvr	2025-03-14 02:20:10 +00:00
cyy	e66ad221e9	Use std::string_view in get_fully_qualified_type_name (#145197 ) The same as #139164 but open a new PR due to messy history there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145197 Approved by: https://github.com/r-barnes	2025-03-14 01:58:35 +00:00
Pat Vignola	e8d36019d4	[c10d] Make getDefaultBackend more fault tolerant without relying on exceptions (#149152 ) Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing. Test Plan: CI Differential Revision: D71144456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152 Approved by: https://github.com/kwen2501	2025-03-14 01:27:52 +00:00
Yiming Zhou	15cd6921a5	[export] Fix tensor_constant and buffer naming conflicts in TS converter (#148803 ) Summary: In TS converter, tensor constants are traced as BUFFER and later we will convert them back to CONSTANT_TENSOR. So we need to prevent naming conflicts during lift constant pass. Test Plan: CI Differential Revision: D70826426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803 Approved by: https://github.com/angelayi	2025-03-14 00:38:12 +00:00
PyTorch MergeBot	49570cb402	Revert "Split up cub-RadixSortPairs.cu to parallelize compilation (#148936 )" This reverts commit 9a3d26cfcdb1c1be84a04baa3ee554dbe67cb049. Reverted https://github.com/pytorch/pytorch/pull/148936 on behalf of https://github.com/ZainRizvi due to Breaks lint in trunk [GH job link](https://github.com/pytorch/pytorch/actions/runs/13845459825/job/38742803351) [HUD commit link](`9a3d26cfcd`) ([comment](https://github.com/pytorch/pytorch/pull/148936#issuecomment-2722853628))	2025-03-13 22:54:33 +00:00
Gheorghe-Teodor Bercea	4cae8f48cc	[ROCm] Improve softmax performance (#149076 ) This patch improves the performance of softmax for 2D tensors by: - using a softmax calculation whose shared memory usage does not grow with the size of the tensor: tensor data is read through global memory accesses, while shared memory is still used for the actual reduction step (the shared memory used for the reduction is constant and does not increase with tensor size). - replacing the division by the sum in the final computation with a multiplication by 1/sum, where 1/sum is computed as the last step of the warp reduction. - replacing the use of the exp function with the __expf function. The impact on numerical accuracy is within 1e-5 for half precision and 1e-7 for full precision. On MI300X, performance improves by between 22% and 50% over current runtimes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149076 Approved by: https://github.com/jeffdaily	2025-03-13 22:07:28 +00:00
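The "multiply by 1/sum" step in the softmax entry above can be sketched on the host in plain Python. This is only an illustration of the arithmetic, not the ROCm kernel: the function name is hypothetical, and subtracting the row maximum is the usual stability trick, assumed here rather than taken from the commit message.

```python
import math


def softmax_row(row):
    """Softmax of one row, using a single reciprocal instead of per-element division."""
    row_max = max(row)                       # assumed stabilisation step
    exps = [math.exp(v - row_max) for v in row]
    inv_sum = 1.0 / sum(exps)                # computed once, after the reduction
    return [e * inv_sum for e in exps]       # n multiplications, no divisions
```

The reciprocal is computed once per row, so the final pass performs one multiplication per element instead of one division.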
Tovly Deutsch	9a3d26cfcd	Split up cub-RadixSortPairs.cu to parallelize compilation (#148936 ) Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes. Differential Revision: D70539649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936 Approved by: https://github.com/suo, https://github.com/eqy	2025-03-13 22:02:05 +00:00
Shangdi Yu	4098a229a0	Add back fake class registration to test_torchbind (#149137 ) Fixes #149121 Summary: as title, to fix https://github.com/pytorch/pytorch/issues/149121 Test Plan: ``` python test/export/test_torchbind.py ``` Differential Revision: D71129321 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149137 Approved by: https://github.com/yiming0416	2025-03-13 21:26:37 +00:00
Zhenghao Hu	e5fccb2bab	[pytorch] Fix duplicated Malloc/Free insertion when using IRBuilderBase::CreateMalloc/CreateFree in LLVM 18+ (#149058 ) Summary: A PyTorch unit test hangs when jitting the Tensor kernel. The problem exists for LLVM version >= 18 due to this upstream change: `45bb45f2ae` `IRBuilderBase::CreateCall` inserts the instruction into the BasicBlock by default, so we don't need to explicitly insert the instruction when compiling the tensor kernel. Test Plan: ## Test with the release toolchain ``` buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)' ``` ## Test with the Buckified toolchain Apply D71046097 to select the LLVM libraries. ``` # Build tests buck build 'mode/dev-asan' //caffe2/test:jit --show-output ``` ``` # Run test (Change HASH and paths accordingly) HASH="b755f1c435832a1e" ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \ buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant ``` Differential Revision: D71046799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-03-13 20:37:47 +00:00
Andy Lugo	38e81a5332	[ROCm] Use generated CK config.h rather than system (#147993 ) prevents pytorch from potentially using system version of config.h and instead prioritize the CK submodule's version Pull Request resolved: https://github.com/pytorch/pytorch/pull/147993 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-13 20:04:12 +00:00
Carlo Bertolli	4f8391db55	[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527 ) This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527 Approved by: https://github.com/jeffdaily Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>	2025-03-13 19:56:26 +00:00
Eddie Yan	0dcd482e54	[SDPA] Respect `sdpa_kernel`'s `priority_order` setting in `torch.compile` (#147768 ) [https://github.com/pytorch/pytorch/pull/140467](https://github.com/pytorch/pytorch/pull/140467) added the option to specify a priority order for SDPA but the `torch.compile` path silently ignored this setting as I wasn't aware of the separate context manager handling on `torch.compile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768 Approved by: https://github.com/drisspg	2025-03-13 18:52:34 +00:00
Joel Schlosser	5e1b715dda	BC fix for AOTIModelPackageLoader() constructor defaults (#149082 ) The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082 Approved by: https://github.com/desertfire	2025-03-13 18:40:53 +00:00
cyy	970fefcc53	Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940 ) Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940 Approved by: https://github.com/mikaylagawarecki	2025-03-13 18:02:50 +00:00
Eli Uriegas	c73c72b1e1	ci: Update linux_job references to v2 (#149102 ) This is probably a bit overdue but trying to update these so we can finally get rid of all the remnants that rely on non-manylinux2_28 stuff and conda stuff Signed-off-by: Eli Uriegas <github@terriblecode.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149102 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet ghstack dependencies: #149104	2025-03-13 17:31:55 +00:00
Eli Uriegas	77ea66695a	ci: Fix check_binary gcc abi check (#149104 ) All of our binaries should be built with the cxx11-abi now so lets fix this check to reflect reality. I also noticed that this particular script is not used widely since this issue should've been caught in nightlies a long time ago. Maybe worth an investigation to just remove this script if it's not actually being used. Signed-off-by: Eli Uriegas <github@terriblecode.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/149104 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet	2025-03-13 17:31:55 +00:00
Simon Fan	7c87ec1b50	[ca] always do initial trace with dynamic shapes (#148801 ) HUD: https://fburl.com/wzvx6tax no regressions (ignore the pass rate improvements, those come from #149030) <img width="864" alt="image" src="https://github.com/user-attachments/assets/d7598f98-b378-4abb-a0c7-e4311162f681" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148801 Approved by: https://github.com/jansel ghstack dependencies: #148799, #149030	2025-03-13 17:30:29 +00:00
Simon Fan	b263b272fa	[ca] fix lazily compiled aot bwd (#149030 ) FIXES https://github.com/pytorch/pytorch/issues/137372 sometimes, the aot bwd is lowered lazily. so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through post grad passes, specifically the view_to_reshape pass. Running that directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and it is reflected in the gradients we see in CA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030 Approved by: https://github.com/bdhirsh ghstack dependencies: #148799	2025-03-13 17:30:29 +00:00
Simon Fan	e6f560a262	[ca] support for dynamic shapes CopySlices (#148799 ) i'm changing CA initial trace to always trace as dynamic, fixes these errors: ```python This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. 
To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch. To execute this test, run the following from the base repo dir: python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-03-13 17:30:20 +00:00
Shivam Raikundalia	e84cc4c052	Update Kineto Submodule (#149089 ) Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large. Test Plan: CI Differential Revision: D71082829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089 Approved by: https://github.com/Skylion007	2025-03-13 17:18:16 +00:00
Aaron Gokaslan	6856d81c60	[BE]: Update CU128 cudnn to 9.8.0.87 (#148963 ) Also, cu12.6 is on an old CUDNN version; we may want to upgrade it for performance reasons, as I don't see a manywheel linux reason to stay back on the old 9.5 release. I might split that into its own PR. This one just updates CU128 to the latest and greatest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148963 Approved by: https://github.com/jansel, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/atalman	2025-03-13 16:59:12 +00:00
Bin Bao	b9803a5c81	[AOTI] Re-enable AOTI cpp unit test (#149085 ) Summary: test_inductor_aoti was removed by accident previously. Add it back. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085 Approved by: https://github.com/jbschlosser	2025-03-13 16:00:38 +00:00
Boyuan Feng	3e605fe46d	[CUDAGraph] Graph Partition (#147648 ) This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR. ## Example ```python import torch torch._inductor.config.graph_partition = True def f(x, y): x1 = x + 1 y1 = y + 1 y_cpu = y1.cpu() + 1 z = x @ y return x1 + y1 + z + y_cpu.cuda() x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)] x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]] eager_out = f(x, y) f_compiled = torch.compile(f, mode="reduce-overhead") for _ in range(5): compiled_out = f_compiled(x_cloned, y_cloned) assert torch.allclose(eager_out, compiled_out) ``` w/o graph partition, we will skip cudagraph: ``` skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from : File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f y_cpu = y1.cpu() + 1 # 3 ``` w/ graph partition, we can see two cudagraphify under the same torch-compiled region: ![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25) ## Design PR #147038 splits `def call(args)` function into multiple `def partition_id(args)`. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is, `cudagraphify` takes metadata such as static_input_idxs and we need to provide such metadata for each graph partition. However, we previously only have such metadata for the original graph instead of graph partitions. The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is: - compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`. 
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`. - finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`. #### Q: How does it work with codecache? While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later. ## Edge Case 1 PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`. ## Edge Case 2 Cudagraphifying the `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420) ![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40) To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648 Approved by: https://github.com/eellison	2025-03-13 16:00:21 +00:00
atalman	65d19a5699	Remove runtime dependency on packaging (#149092 ) Looks like after https://github.com/pytorch/pytorch/pull/148924 We are seeing this error in nightly test: https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623 ``` File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module> from .lowering import fallback_node_due_to_unsupported_type File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module> from . import kernel File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module> from . import mm, mm_common, mm_plus_mm File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module> from packaging.version import Version ModuleNotFoundError: No module named 'packaging' ``` Hence removing runtime dependency on packaging since it may not be installed by default Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092 Approved by: https://github.com/drisspg, https://github.com/davidberard98	2025-03-13 14:53:13 +00:00
taoyang	f59064f2b7	[FIX] remove the duplicate key in DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (#149043 ) nn.Dropout appeared at line 81 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149043 Approved by: https://github.com/jingsh	2025-03-13 12:42:33 +00:00
Bin Bao	bdf57fb8f7	[AOTI][refactor] Split MiniArrayRef into a separate header (#149073 ) Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI. Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073 Approved by: https://github.com/yushangdi	2025-03-13 11:57:32 +00:00
Andrew Gu	a8b1767ae5	[DTensor] Fix `local_map` with multi-threading (#149070 ) Using `nonlocal device_mesh` is not safe with multi-threading Pull Request resolved: https://github.com/pytorch/pytorch/pull/149070 Approved by: https://github.com/wanchaol	2025-03-13 10:58:59 +00:00
Shangdi Yu	df60500ab8	Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714 ) Summary: 1. Check against the "0" char instead 2. We got the following error when using anything other than O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]` So we use O0 flag in wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`. Test Plan: ``` buck run 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest ``` Differential Revision: D70670957 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714 Approved by: https://github.com/desertfire	2025-03-13 10:22:06 +00:00
George Wigley	96a6a71ac7	skip test_torch_dynamo_codegen_pow if CPU backend is not cpp (#146595 ) The test asserts that `aten.pow` is not present in the generated kernel code. When using a CPU backend other than cpp, the kernel contains comments referencing the aten ops that produced the kernel, in this case `aten.pow`. This PR skips that test case if the CPU backend is not cpp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146595 Approved by: https://github.com/williamwen42	2025-03-13 10:03:29 +00:00
Tom Ritchford	d90f9e9a34	[inductor] Fix issue with set_linter, improve linter framework (#144620 ) ### `set_linter` only * Fix gnarly [bug](`dbed747aae/tools/test/set_linter_testdata/python_code.py.txt.python (L42)`) which would have garbled Python files involving sets contained in sets. * Better handling of new Python3.12 token types ### Both linters. * Recover from and report on unparseable Python files * Remove `ParseError.check()` (it made it harder to read the code) * FileLinter is now generic on `PythonFile` ### Notes As I started working on new docstring features, I found a nasty bug and an edge case bug in set linter, and realized both the linters crash when there is a badly-formed Python file in the repo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144620 Approved by: https://github.com/amjames, https://github.com/jansel	2025-03-13 09:49:40 +00:00
Leo Wang	f4bffb7461	[docs] fix autograd description on convex function case (#148658 ) The sub-gradient of minimum norm is the least steep descent direction. ```python import torch x = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True) torch.relu(x).sum().backward() print(x.grad) # tensor([0., 0., 0., 1., 1.]) y = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True) torch.abs(y).sum().backward() print(y.grad) # tensor([-1., -1., 0., 1., 1.]) ``` (How can I request a reviewer? I don't have the button on the right) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148658 Approved by: https://github.com/lezcano	2025-03-13 09:06:15 +00:00
wdziurdz	75c8b7d972	[Profiler][HPU] Fix incorrect availabilities for HPU (#148663 ) Fixes #148661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663 Approved by: https://github.com/jeromean, https://github.com/albanD	2025-03-13 08:03:52 +00:00
eqy	ec93aa7f84	fix cuDNN SDPA meta registration (#148921 ) Update `cuDNN SDPA` meta registration to match the memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921 Approved by: https://github.com/drisspg, https://github.com/jbschlosser	2025-03-13 07:33:16 +00:00
Shangdi Yu	2a7d583452	Consolidate torchbind fake class registration (#149063 ) Summary: Remove duplicated fake class registration Test Plan: CI Differential Revision: D71052419 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063 Approved by: https://github.com/angelayi	2025-03-13 06:57:13 +00:00
Yuanhao Ji	c208f21791	[Dynamo] Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py` (#148177 ) Part of #147913 Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148177 Approved by: https://github.com/williamwen42	2025-03-13 06:35:51 +00:00
xinan.lin	037d7af778	[Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023 ) The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to make sure cases are only enabled for the devices specified in PYTORCH_TESTING_DEVICE_ONLY_FOR. For example, enable only GPU cases and not CPU cases, even if HAS_CPU is true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023 Approved by: https://github.com/jansel, https://github.com/cyyever	2025-03-13 05:15:28 +00:00
Sam Larsen	7cdbb913e7	[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693 ) Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/e71yn6uc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693 Approved by: https://github.com/eellison	2025-03-13 03:50:58 +00:00
Brian Hirsh	3646d4dbc8	[partitioner] always ban compiler-driven recompute of collectives by default (#147561 ) This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/ The argument here is that: (1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above) (2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks) (3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561 Approved by: https://github.com/zou3519	2025-03-13 03:36:13 +00:00
Bartlomiej Stemborowski	420a9be743	[regression] Fix pin_memory() when it is called before device lazy initialization. (#149033 ) PR #145752 added a check in isPinnedPtr that verifies a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when the tensor is first created and then pinned in a separate call to pin_memory(), lazy device init is not triggered, so is_pinned always returns false. With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures that the device is initialized before we pin a tensor. Fixes #149032 @ngimel @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033 Approved by: https://github.com/ngimel, https://github.com/albanD	2025-03-13 02:56:24 +00:00
henrylhtsang	f2d43d866c	[cutlass backend] switch layout for cutlass backend benchmark (#149009 ) ``` python benchmarks/inductor_backends/cutlass.py ``` logs: ``` Experiment group: mm (1024x1024, 1024x1024) torch.float16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 13.059554621577263 \| 1.580178506206721 \| NA \| \| triton \| 10.245470330119133 \| 0.04118620231747627 \| -21.54808776410064 \| \| triton_persistent_tma \| 10.388538241386414 \| 0.04225084185600281 \| -20.45258400908819 \| \| cutlass_lvl_default \| 12.882896699011326 \| 231.14990583620965 \| -1.3527101626732294 \| \| cutlass_lvl_1111 \| 11.362981051206589 \| 126.41650272067636 \| -12.99105229490415 \| \| cutlass_lvl_2222 \| 11.107578873634338 \| 555.8380545829423 \| -14.946725248331441 \| +-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 14.037585817277431 \| 0.21587548777461052 \| NA \| \| triton \| 10.571777820587158 \| 78.15654796129093 \| -24.68948750735019 \| \| triton_persistent_tma \| 10.761583223938942 \| 1.3195342738181353 \| -23.337364672110443 \| \| cutlass_lvl_default \| 12.872588820755482 \| 237.0100042372942 \| -8.299126443010406 \| \| cutlass_lvl_1111 \| 11.08622644096613 \| 137.55013868492097 \| -21.02469338195443 \| \| cutlass_lvl_2222 \| 11.044904589653015 \| 551.265836935956 \| -21.319059178545007 \| +-----------------------+--------------------+----------------------+---------------------+ 
Experiment group: mm (2048x2048, 2048x2048) torch.float16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 30.483894050121307 \| 0.27990864124149084 \| NA \| \| triton \| 29.567627236247063 \| 99.87172158574685 \| -3.005740711366232 \| \| triton_persistent_tma \| 29.66325916349888 \| 1.3695051120594144 \| -2.692027748401006 \| \| cutlass_lvl_default \| 29.82821688055992 \| 72.61214569816366 \| -2.150897022812533 \| \| cutlass_lvl_1111 \| 29.476772993803024 \| 67.7428645719774 \| -3.303780857728953 \| \| cutlass_lvl_2222 \| 30.113255605101585 \| 233.84051702311262 \| -1.2158500630212203 \| +-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 30.58255836367607 \| 0.058386584743857384 \| NA \| \| triton \| 29.799651354551315 \| 100.18178300186992 \| -2.559978795150901 \| \| triton_persistent_tma \| 29.362043365836143 \| 1.534341821912676 \| -3.990885861562106 \| \| cutlass_lvl_default \| 29.4346883893013 \| 73.68858492700383 \| -3.7533484305817093 \| \| cutlass_lvl_1111 \| 29.164200648665428 \| 75.44329373072833 \| -4.637799421958348 \| \| cutlass_lvl_2222 \| 29.13798950612545 \| 227.33327346481383 \| -4.7235056020244 \| +-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (8192x8192, 8192x8192) torch.float16 +-----------------------+--------------------+----------------------+--------------------+ \| name \| forward_time 
(us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+--------------------+ \| aten \| 1656.6237211227417 \| 0.0549461180344224 \| NA \| \| triton \| 1892.8285837173462 \| 2.3174119112081826 \| 14.258208401997386 \| \| triton_persistent_tma \| 1665.332317352295 \| 2.7922237082384527 \| 0.525683419747917 \| \| cutlass_lvl_default \| 1705.5492401123047 \| 108.31571159465238 \| 2.9533272019312116 \| \| cutlass_lvl_1111 \| 1714.9059772491455 \| 17.64627545280382 \| 3.518134829489478 \| \| cutlass_lvl_2222 \| 1680.4152727127075 \| 306.9972395859659 \| 1.4361469829637354 \| +-----------------------+--------------------+----------------------+--------------------+ Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16 +-----------------------+--------------------+----------------------+--------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+--------------------+ \| aten \| 1621.416687965393 \| 0.06300561130046844 \| NA \| \| triton \| 1782.3902368545532 \| 2.318530729971826 \| 9.927956834535548 \| \| triton_persistent_tma \| 1586.0934257507324 \| 2.7931175641715527 \| -2.178543151605614 \| \| cutlass_lvl_default \| 1657.4617624282837 \| 43.31810224894434 \| 2.2230605328307784 \| \| cutlass_lvl_1111 \| 1641.5367126464844 \| 17.648567833006382 \| 1.2408916739557292 \| \| cutlass_lvl_2222 \| 1645.8417177200317 \| 249.33647010894492 \| 1.5064005407078918 \| +-----------------------+--------------------+----------------------+--------------------+ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-13 01:57:47 +00:00
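The `perf_over_aten (%)` column in the tables above appears to be the relative change in forward time versus the `aten` row. That reading is inferred from the printed numbers, not from the benchmark source, and the helper name `perf_over_aten` below is hypothetical. A minimal sketch:

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Relative forward-time change versus aten, in percent (negative = faster)."""
    return (forward_time_us / aten_time_us - 1.0) * 100.0


# Forward times copied from the mm (1024x1024, 1024x1024) torch.float16 table.
aten_us = 13.059554621577263
triton_us = 10.245470330119133

triton_pct = perf_over_aten(triton_us, aten_us)
print(f"triton vs aten: {triton_pct:.3f}%")
```

Under this reading, a negative value means the backend's forward pass was faster than aten, and a positive value (as in the 8192x8192 triton rows) means it was slower.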
Zhou, Lingzhi	4a12777ffe	[Partitioner] Remove unnecessary upstream nodes in dependency viewer (#146580 ) We iterate over upstream nodes to update the partition map, but this actually does nothing: we iterate over nodes in reversed topological order https://github.com/pytorch/pytorch/pull/136608/files#diff-f2f9dd3903fd99955732eb694941fea0cb7301a58d59554787f3311d417e5615L193 so no upstream nodes exist in the assignment. Remove the loop to reduce for-loop overhead, which can reach O(N * N) complexity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146580 Approved by: https://github.com/Skylion007, https://github.com/jerome-habana	2025-03-13 01:42:10 +00:00
Andrey Talman	1e37e5b836	Update nightly PyTorch version to 2.8.0 (#149038 ) Branch for 2.7: https://github.com/pytorch/pytorch/tree/release/2.7 Same as https://github.com/pytorch/pytorch/pull/135916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149038 Approved by: https://github.com/ZainRizvi	2025-03-12 23:51:04 +00:00
PyTorch MergeBot	e51615cb73	Revert "[Profiler][HPU] Fix incorrect availabilities for HPU (#148663 )" This reverts commit 28b78800b92a4d847a2360ab0e0b87d3e00a6138. Reverted https://github.com/pytorch/pytorch/pull/148663 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, could you please help get this relanded? See D71052806 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148663#issuecomment-2719297055))	2025-03-12 22:52:11 +00:00
PyTorch MergeBot	b1980b2405	Revert "Make dynamism code robust to NotImplementedException (#148823 )" This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc. Reverted https://github.com/pytorch/pytorch/pull/148823 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D71042206 for details. To validate your fixes internally before relanding, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148823#issuecomment-2719287467))	2025-03-12 22:45:39 +00:00
Catherine Lee	38c5cf99b3	[CI] Don't clean workspace when fetching repo (#147994 ) Tested on https://github.com/pytorch/pytorch/pull/148995 Do two checkouts: first one attempts to use an existing checkout if possible. The second one removes the workspace and re pulls everything if the first one fails This is probably not going to be useful if we switch entirely to ephemeral runners but w/e Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994 Approved by: https://github.com/malfet, https://github.com/atalman	2025-03-12 22:29:52 +00:00
Catherine Lee	3f1769f785	Add ninja to requirements-ci for all arch (#148778 ) So I can get ninja_logs for the builds No negative consequences afaik Pull Request resolved: https://github.com/pytorch/pytorch/pull/148778 Approved by: https://github.com/malfet, https://github.com/atalman	2025-03-12 22:07:46 +00:00
Jeff Daily	0c8ec26d3b	[ROCm][TunableOp] hipblaslt tf32 support (#145946 ) TF32 is supported by hipblaslt. Support added by #143549. This PR expands integration to the TunableOp feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145946 Approved by: https://github.com/pruthvistony, https://github.com/echen4096, https://github.com/yoyoyocmu Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2025-03-12 21:17:11 +00:00
Yanan Cao (PyTorch)	ab45aaca97	Set non-strict export as default mode (#148790 ) Summary: - Flip the default value of strict argument in torch.export.export from True to False - Update test infra to cope with the change, some of them made the assumption of strict mode as default - Disabled some tests that fail in non-strict mode Test Plan: Sandcastle Differential Revision: D70228628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790 Approved by: https://github.com/angelayi	2025-03-12 21:10:58 +00:00
Matthew Hoffman	e3ebf61589	Create and send `full_tensor` on `ProcessGroup`-supported device in `_broadcast_tensors` (#148865 ) Fixes #138842 `device` is always the device of the `local_state_dict`, which may or may not be CPU, which is not supported by NCCL backend. Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865 Approved by: https://github.com/mori360	2025-03-12 20:56:31 +00:00
Richard Barnes	b5191b9312	[codemod][lowrisk] Fix deprecated use of 0/NULL in caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/fc-unpack.cc + 1 (#148996 ) Summary: `nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed. This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`. Test Plan: Sandcastle Reviewed By: dtolnay Differential Revision: D70939306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148996 Approved by: https://github.com/Skylion007	2025-03-12 20:06:19 +00:00
eqy	b90698f5ba	[CUDA] try to abate some flakiness in `test_stream_event_nogil` (#148796 ) Threshold twiddling, as roughly one in a few dozen runs tends to fail the current threshold Pull Request resolved: https://github.com/pytorch/pytorch/pull/148796 Approved by: https://github.com/Skylion007	2025-03-12 19:12:50 +00:00
min-jean-cho	215f856142	Add XPU device to nested_layer_norm (#148593 ) Work with https://github.com/intel/torch-xpu-ops/pull/1416 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/148593 Approved by: https://github.com/guangyey, https://github.com/jbschlosser	2025-03-12 19:07:08 +00:00
henrylhtsang	66300d3d55	[cutlass backend] try make cutlass backend benchmark more robust (#149015 ) Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/) I want to make sure the benchmark even if failed on some experiment can still print most of the results. ``` Experiment group: mm (3x3, 3x3) torch.bfloat16 +-----------------------+-------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+-------------------+----------------------+---------------------+ \| aten \| 6.175220478326082 \| 0.5982149520423263 \| NA \| \| triton \| 5.326753947883844 \| 3.2067150759976357 \| -13.739858089605114 \| \| triton_persistent_tma \| 5.340870004147291 \| 3.279932268196717 \| -13.51126615004617 \| \| cutlass_lvl_default \| inf \| inf \| inf \| \| cutlass_lvl_1111 \| inf \| inf \| inf \| \| cutlass_lvl_2222 \| inf \| inf \| inf \| \| cutlass_lvl_3333 \| inf \| inf \| inf \| +-----------------------+-------------------+----------------------+---------------------+ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-12 18:59:49 +00:00
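One way to get the behaviour described above, where a failing experiment shows up as `inf` in the table instead of aborting the whole run, is to wrap each measurement in a try/except. This is a sketch only: `safe_benchmark` and `run_experiment` are hypothetical names and are not taken from `benchmarks/inductor_backends/cutlass.py`.

```python
import math
from typing import Callable


def safe_benchmark(name: str, run_experiment: Callable[[], float]) -> dict:
    """Run one experiment; on any failure record inf instead of raising."""
    try:
        forward_time = run_experiment()
    except Exception as exc:  # keep going so the remaining rows still print
        print(f"{name} failed: {exc!r}")
        forward_time = math.inf
    return {"name": name, "forward_time (us)": forward_time}


def _ok() -> float:
    return 5.3  # stand-in for a successful timing


def _broken() -> float:
    raise RuntimeError("no valid kernel")  # stand-in for a backend failure


results = [
    safe_benchmark("triton", _ok),
    safe_benchmark("cutlass_lvl_default", _broken),
]
for row in results:
    print(row)
```

Catching the failure per experiment keeps the successful rows intact, at the cost of hiding the traceback unless it is logged separately.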
Thomas Bohnstingl	86bc154d61	[scan] Flattened output of HOP scan (#148955 ) This is required because downstream operations expect HOPs to return a flattened list of output elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148955 Approved by: https://github.com/ydwu4	2025-03-12 18:27:27 +00:00
Tugsbayasgalan Manlaibaatar	fb0e9cb0a0	Remove warnings on non-buffer tensor constants (#148483 ) Export already registers tensor constants directly in the graph and this is also true for Torchbind objects. This removes warning that pollutes the output. Differential Revision: [D70577856](https://our.internmc.facebook.com/intern/diff/D70577856) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148483 Approved by: https://github.com/zhxchen17, https://github.com/zou3519 ghstack dependencies: #148364	2025-03-12 18:20:04 +00:00
atalman	29fd875bc1	Automate stable CUDA update and linter using min Python version (#148912 ) 1. Fixes: https://github.com/pytorch/pytorch/issues/145571 . CUDA stable is the same CUDA version that is published to PyPI; it is also used to set the Metadata section in the rest of the whl scripts and to tag the Docker releases with the latest tag. 2. Updates the min Python version used in the linter Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-03-12 18:12:34 +00:00
Shangdi Yu	01e9036bd2	skip torchbind in constant folding (#148993 ) Summary: Do not fold torchbind objects in constant folding. Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding ``` Reviewed By: angelayi Differential Revision: D69946541 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993 Approved by: https://github.com/angelayi	2025-03-12 18:08:08 +00:00
Yidi Wu	923ce10f6c	[while_loop] require stride to be the same as input for body_fn (#148002 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002 Approved by: https://github.com/zou3519	2025-03-12 17:15:10 +00:00
wdziurdz	28b78800b9	[Profiler][HPU] Fix incorrect availabilities for HPU (#148663 ) Fixes #148661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663 Approved by: https://github.com/jeromean, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/albanD	2025-03-12 17:06:57 +00:00
Jason Ansel	b040dc3a53	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential [disconnected] Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-12 15:52:16 +00:00
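The `grid_0` expression in the launcher snippet above is a ceiling division by 1024 written with a shift. A small check of that identity in plain Python; the function names are illustrative and not part of Inductor:

```python
def grid_shift(xnumel: int) -> int:
    # Same form as the launcher snippet: add (block - 1), then shift by log2(block).
    return (xnumel + 1023) >> 10


def grid_ceil_div(xnumel: int, block: int = 1024) -> int:
    # Reference ceiling division using negated floor division.
    return -(-xnumel // block)


for n in (0, 1, 1023, 1024, 1025, 4096, 1_000_000):
    assert grid_shift(n) == grid_ceil_div(n)
```

The shift form only works when the block size is a power of two, which is the case for the 1024 used in the snippet.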
PyTorch MergeBot	626a5e22eb	Revert "[CI] Don't clean workspace when fetching repo (#147994 )" This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0. Reverted https://github.com/pytorch/pytorch/pull/147994 on behalf of https://github.com/clee2000 due to broke checkout on xpu, probably lack of sudo? ([comment](https://github.com/pytorch/pytorch/pull/147994#issuecomment-2718335186))	2025-03-12 15:50:38 +00:00
Catherine Lee	9a0f65d3d3	[TD] test_cpp_extensions_aot_ninja corresponds to things in test/cpp_extensions (#148992 ) Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions since test_cpp_extensions_aot_ninja isn't an actual file you can edit, but a wrapper for files in test/cpp_extensions. Idk if this is a good idea, feels very manual. Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out the tests it needs to run Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992 Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99	2025-03-12 15:40:06 +00:00
Jason Ansel	488c4480f9	[inductor] Fix profiler tests with latest Triton (#149025 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025 Approved by: https://github.com/yanboliang	2025-03-12 15:34:26 +00:00
PyTorch MergeBot	5ada4e6a53	Revert "Reland: [inductor] Simplify grid handling (#148305 )" This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178. Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))	2025-03-12 14:58:43 +00:00
cyy	8fa81a6066	Enable misc-use-internal-linkage check and apply fixes (#148948 ) Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available thanks to the recent update to Clang-Tidy 19. The check marks functions and variables used only in the translation unit as static. As a result, undesired symbols are not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller. The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Some declarations were still wrong and have been fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948 Approved by: https://github.com/Skylion007	2025-03-12 14:22:56 +00:00
leslie-fang-intel	f349304c08	[Inductor][CPP] Fix expr issue in loop split (#148882 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/148058. In this case, there is an `indexing_expr` as an integer which doesn't have the method of `find`. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_148058 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148882 Approved by: https://github.com/jgong5	2025-03-12 11:08:07 +00:00
lingzhi98	81aee3c9c4	[Partitioner] Reduce time consuming of partitions merger (#146582 ) This patch optimizes the maybe_merge_partition function in three ways: Remove an unnecessary copy https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L99. The number of copied nodes is large if all of the graph's nodes can be merged into one partition. Record the users of each partition to avoid repeated iteration over nodes https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L133. The trip count of this loop may be very large. The number of nodes in each partition may be unbalanced https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L145. A common case is that one partition has n nodes while the other has only one. Merging the smaller partition into the larger one helps reduce the time spent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146582 Approved by: https://github.com/jerome-habana, https://github.com/Skylion007	2025-03-12 09:24:38 +00:00
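The third point in the commit above (merge the smaller partition into the larger one) can be sketched with plain Python sets. This is only an illustration, not the code in `partitioner.py`; `merge_partitions` and the dictionary layout are hypothetical names chosen for the example.

```python
def merge_partitions(partitions, a, b):
    """Merge partitions `a` and `b` (keys of `partitions`) and return the
    key that survives. The smaller set is always moved into the larger
    one, so the number of moved nodes is min(len(a), len(b))."""
    if len(partitions[a]) < len(partitions[b]):
        a, b = b, a  # make `a` refer to the larger partition
    partitions[a] |= partitions.pop(b)
    return a


# Hypothetical partitions: one node versus three nodes.
partitions = {0: {"n0"}, 1: {"n1", "n2", "n3"}}
kept = merge_partitions(partitions, 0, 1)
```

With unbalanced partitions, only the single node of partition 0 is moved, regardless of the order in which the two ids are passed.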
Xiaodong Wang	d547a56668	[AMD] Various fixes for mem efficient attention on CK backend (#148986 ) Summary: Decouple aotriton vs. ck for mem efficient attention. Also fixed HW check. Reviewed By: henryhu6 Differential Revision: D70872677 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148986 Approved by: https://github.com/jianyuh, https://github.com/houseroad	2025-03-12 07:36:46 +00:00
Nikita Shulga	924a247fbb	[MPS] Enable angle and atan2 for `torch.long` (#149017 ) This check was added by https://github.com/pytorch/pytorch/pull/85817, that introduced no unit-tests and its content seems to be totally unrelated to title/subject of that PR. Anyway, right now it seems to be working fine on MacOS-13+ Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017 Approved by: https://github.com/dcci	2025-03-12 04:48:52 +00:00
Nikita Shulga	7b78a2c415	[MPSInductor] Fix `argmin`/`argmax` long reductions (#149021 ) By adding an additional indexes array for aggregates and populating it when performing partial reductions. And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021 Approved by: https://github.com/jansel ghstack dependencies: #148969, #148975, #149004, #149020	2025-03-12 04:39:29 +00:00
Nikita Shulga	758522d56a	[MPSInductor][EZ] Fix argmin/max signatures (#149020 ) threadgroup_argmin used to return the input type, which is wrong; it should have returned `int` or `long`. Change the signatures of both threadgroup_argmin and threadgroup_argmax to return int: as the group size is small, there is no need to carry over large integers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020 Approved by: https://github.com/jansel ghstack dependencies: #148969, #148975, #149004	2025-03-12 04:39:29 +00:00
Nikita Shulga	fe22db9cc3	[MPSInductor] Fix `min`/`max` reductions over large dims (#149004 ) Simple followup after sum/prod Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004 Approved by: https://github.com/jansel ghstack dependencies: #148969, #148975	2025-03-12 04:39:19 +00:00
clr	2a7e997b3f	test/dynamo/test_utils: Fix one broken test on different python versions (#148987 ) We correctly handled different Python versions in the explicit ir_nodes test, but didn't handle them in the dynamo_timed test. Explicitly delete the fields there so the dynamo_timed test passes on all Python versions. (I noticed it breaking on 3.13). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987 Approved by: https://github.com/jansel	2025-03-12 02:11:08 +00:00
LifengWang	e40a9e602b	Add the max_autotune tests in the periodic jobs. (#143560 ) To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560 Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire	2025-03-12 01:47:46 +00:00
bobrenjc93	60576419a2	Make dynamism code robust to NotImplementedException (#148823 ) In prod many models have `@property` methods that raise NotImplementedError. This PR updates our dynamism code to be more robust to these types of models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823 Approved by: https://github.com/laithsakka	2025-03-12 01:01:57 +00:00
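The situation described in the commit above, a `@property` that raises `NotImplementedError` while attributes are being inspected, can be reproduced in a few lines. The helper below is a hypothetical sketch of the general pattern (skip attributes whose getter raises); it is not the dynamism code changed by the PR.

```python
def safe_attributes(obj):
    """Collect the public attributes of `obj`, skipping any attribute
    whose getter raises NotImplementedError."""
    result = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            result[name] = getattr(obj, name)
        except NotImplementedError:
            # Typical for abstract properties on base model classes.
            continue
    return result


class Model:
    batch_size = 8

    @property
    def embedding_dim(self):
        raise NotImplementedError("defined by subclasses")


attrs = safe_attributes(Model())
```

Only `NotImplementedError` is swallowed here; any other exception from a property still propagates, which keeps unrelated bugs visible.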
atalman	46f096bba6	Explicitly set use-ephemeral runners for windows nightly cpu test jobs (#149001 ) This PR migrated windows builds to use ephemeral runners: https://github.com/pytorch/pytorch/pull/134463 however missed test jobs. Explicitly set use-ephemeral runners for windows nightly cpu tests. Please note we should be using already ephemeral runners for these after: https://github.com/pytorch/test-infra/pull/6377 (recently migrated) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149001 Approved by: https://github.com/malfet	2025-03-11 23:51:39 +00:00
Boyuan Feng	5b60749e9e	[cudagraph] add log for skip reasons (#148797 ) Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph Test Plan: {F1975906635} Differential Revision: D70820791 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797 Approved by: https://github.com/masnesral	2025-03-11 23:31:48 +00:00
Nikita Shulga	98a2d905bf	[MPSInductor] Fix large prod and sum reductions (#148975 ) After this change, if the reduction dimension is larger than `max_threadgroup_size`, emit a `for` loop from `codegen_iteration_ranges_entry` and wrap it up in `codegen_body()`. I.e. after this change, the following command ``` % TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86- ``` will emit the following shader ```metal #include <c10/metal/random.h> #include <c10/metal/special_math.h> #include <c10/metal/utils.h> #include <c10/metal/reduction_utils.h> kernel void generated_kernel( device float* out_ptr1, constant float* in_ptr0, uint2 thread_pos [[thread_position_in_grid]], uint2 group_pos [[thread_position_in_threadgroup]] ) { auto xindex = thread_pos.x; auto r0_index = thread_pos.y; threadgroup float tmp_acc_0[1024]; tmp_acc_0[r0_index] = 0; for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) { int r0_0 = 2 * r0_index + r0_0_cnt; if (r0_0 >= 2047) break; auto tmp0 = in_ptr0[2*r0_0]; auto tmp2 = in_ptr0[1 + 2*r0_0]; auto tmp1 = metal::precise::sin(tmp0); auto tmp3 = 0.4; auto tmp4 = tmp2 + tmp3; auto tmp5 = metal::precise::cos(tmp4); auto tmp6 = tmp1 + tmp5; tmp_acc_0[r0_index] += tmp6; } auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024); auto tmp8 = 3.14; auto tmp9 = tmp7 - tmp8; out_ptr1[0] = static_cast<float>(tmp9); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #148969	2025-03-11 22:46:41 +00:00
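The two-stage structure of the shader in the commit above (each thread loops over a few elements into its own accumulator slot, then the slots are reduced across the threadgroup) can be imitated on the CPU. The following is a sequential Python sketch of that indexing scheme only; `threadgroup_style_sum` is a hypothetical name, and the final `sum(acc)` stands in for the threadgroup-wide reduction.

```python
def threadgroup_style_sum(values, max_threadgroup_size=1024):
    """Sum `values` the way the generated kernel does: each simulated
    thread accumulates `per_thread` consecutive elements, then the
    per-thread accumulators are reduced."""
    n = len(values)
    per_thread = -(-n // max_threadgroup_size)  # ceil(n / group size)
    acc = [0.0] * max_threadgroup_size
    for tid in range(max_threadgroup_size):
        for cnt in range(per_thread):
            idx = per_thread * tid + cnt
            if idx >= n:  # same bounds check as `if (r0_0 >= 2047) break;`
                break
            acc[tid] += values[idx]
    return sum(acc)  # stand-in for the threadgroup reduction


total = threadgroup_style_sum([float(i) for i in range(2047)])
```

With 2047 elements and 1024 simulated threads, `per_thread` is 2, matching the `r0_0_cnt < 2` loop bound in the shader.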
bobrenjc93	2dcdb4ba78	[ez] include config as part of __all__ in torch.compiler (#148978 ) Right now we are susceptible to a race condition where, if torch.compiler.config is not implicitly imported via dynamo/builder.py, we throw an error when trying to set compiler configs. This fixes it by including config in `__all__`. Previous ``` >>> import torch >>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'torch.compiler' has no attribute 'config' >>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'torch.compiler' has no attribute 'config' ``` Now ``` >>> import torch >>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978 Approved by: https://github.com/bdhirsh, https://github.com/laithsakka	2025-03-11 21:58:38 +00:00
Pian Pawakapan	a6459afb0e	[dynamic shapes] add backed_size_oblivious option (#148696 ) Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size backed symbols, and opting into size-oblivious semantics for them. Helps in a number of cases like - Keeps `[0, inf]` bounds for unbacked symbols, when we make a unbacked -> backed replacement - More sound handling for 0/1 inputs at runtime when we lower from export - Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046) May look towards turning this on globally for export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696 Approved by: https://github.com/bobrenjc93	2025-03-11 21:52:34 +00:00
Natalia Gimelshein	53a1a022a9	[WIP] Initial implementation of Grouped Gemm API (#148531 ) This PR provides initial cutlass implementation of grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiple of 16, that's cutlass limitation. I'll need to add those checks, for dynamic dimensions unfortunately the checks will have to be a device assert. I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself. I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`. Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531 Approved by: https://github.com/drisspg	2025-03-11 21:49:46 +00:00
Howard Huang	b98af95401	Fix DCP link (#148974 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974 Approved by: https://github.com/svekars	2025-03-11 21:26:37 +00:00
Nichols A. Romero	6119ffc711	[ROCm][TunableOp] Fix TunableOp BLAS logging for online tuning case. (#148979 ) In a previous PR https://github.com/pytorch/pytorch/pull/147034, there was a bad merge at the last minute. BLAS logging works for offline tuning, but does not currently work for online tuning. This PR fixes BLAS logging for online tuning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148979 Approved by: https://github.com/jeffdaily	2025-03-11 21:20:04 +00:00
Catherine Lee	e5fef8a08e	[CI] Don't clean workspace when fetching repo (#147994 ) Tested on 874c5dc4c98cc63a06bfc900d03683b02f110d7c' Also tested on https://github.com/pytorch/pytorch/actions/runs/13798178199/job/38594767529?pr=148995#step:4:12 Don't remove the workspace when fetching. The checkout action performs git clean -ffdx to remove untracked files and files in gitignore This is probably not going to be useful if we switch entirely to ephemeral runners but w/e Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994 Approved by: https://github.com/malfet, https://github.com/atalman	2025-03-11 21:10:56 +00:00
atalman	72d9f88ef2	[release] Move triton pin to latest triton release/3.3.x (#148971 ) This branch contains latest AMD cherry-picks: https://github.com/triton-lang/triton/pull/6171 https://github.com/triton-lang/triton/pull/6165 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148971 Approved by: https://github.com/danzimm	2025-03-11 21:10:42 +00:00
Jane Xu	e6ef0620cc	Add shim.h C API to call dispatcher on our own aten ops (#148832 ) This PR still needs testing through some cpp extension Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832 Approved by: https://github.com/albanD, https://github.com/atalman ghstack dependencies: #148124	2025-03-11 21:02:04 +00:00
Shangdi Yu	cf19efd3d9	Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506 ) Summary: Codegen - Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in proxy executor, so we do not need to declare torchbind args in cpp code - Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`. Serialization Add support for torchbind object in serialization - For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, args, kwargs)`. - it.TorchBindObject inputs are serialized to `as_custom_obj` Argument. Packaging Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`. The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive. The torchbind objects are stored in data/constants/ folder in pt2 archive. The format of torchbind object names is `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`, e.g. `custom_obj_0`. CustomClassHolder objects implement their own pickle methods. Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json. The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code. This is required for both internal and OSS torchbind support. For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output. Work Left We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack. 
Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile ``` Differential Revision: D69490718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506 Approved by: https://github.com/angelayi	2025-03-11 20:55:18 +00:00
Bin Bao	f69e58e8e8	[CI] Update crossvit_9_240 as pass (#148989 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989 Approved by: https://github.com/ZainRizvi	2025-03-11 20:54:39 +00:00
PyTorch MergeBot	b54cf1a281	Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693 )" This reverts commit 73c8068cf889829fb811fc75baac03163c9a42ee. Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](`73c8068cf8`) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))	2025-03-11 20:50:23 +00:00
Nikita Shulga	c18858d633	[MPS] Make `torch.mps.compile_shader` public (#148972 ) It was a private method in 2.6, but nothing changed in its API for 2.7 and it will likely remain the same in 2.8, so it is time to remove the underscore from its name. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci	2025-03-11 20:20:58 +00:00
cat-state	abcec55532	gracefully handle `tokenize.TokenError` in funcname parser. Adds support for non-Python source (#148737 ) This change allows defining python functions in non-python source and having them be able to compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-python source code too. Example [basilisp](https://github.com/basilisp-lang/basilisp): ```clojure (import torch) (import [torch.nn.functional :as F]) (torch/rand 10) (defn f {:decorators [torch/compile]} [x] (* (F/relu x) x)) (f (-> (torch/randn 100) (.cuda))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737 Approved by: https://github.com/williamwen42	2025-03-11 19:49:28 +00:00
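The fallback described in the commit above (treat source that cannot be tokenized as Python like an unreadable file and return nothing) can be sketched with the standard `tokenize` module. `function_names` below is a hypothetical helper written for this illustration; it is not the funcname parser in PyTorch, and it also catches `SyntaxError` as a precaution, which the commit message does not mention.

```python
import io
import tokenize


def function_names(source):
    """Return the names that follow `def` tokens in `source`, or an
    empty list when the text cannot be tokenized as Python."""
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, SyntaxError):
        # Non-Python source: behave as if no function names were found.
        return []
    names = []
    for prev, tok in zip(tokens, tokens[1:]):
        if (prev.type == tokenize.NAME and prev.string == "def"
                and tok.type == tokenize.NAME):
            names.append(tok.string)
    return names


python_names = function_names("def f(x):\n    return x\n")
lisp_names = function_names("(defn f [x] (* x x")
```

Returning an empty result instead of raising lets the caller continue without function-name information, which matches the behaviour the commit describes for unreadable files.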
Sam Larsen	73c8068cf8	[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693 ) Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/e71yn6uc * dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv * pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693 Approved by: https://github.com/eellison	2025-03-11 19:38:40 +00:00
henrylhtsang	5b8da17681	[cutlass backend] Add addmm and bmm tests for AOTI (#148929 ) Needs to do: 1. Expand addmm tests to cover all 4 shapes 2. Add dynamic shape support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148929 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-03-11 19:38:24 +00:00
Yanan Cao (PyTorch)	7b2ecb80eb	[Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148928 ) Differential Revision: D70908557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148928 Approved by: https://github.com/angelayi	2025-03-11 19:36:30 +00:00
Xinya Zhang	61f9b50e09	[ROCm] Fix TORCH_CHECK for hdim 512 support added in AOTriton 0.9b (#148967 ) Fixes #148850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148967 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-03-11 19:21:10 +00:00
Jane Xu	971606befa	Add a stable TORCH_LIBRARY to C shim (#148124 ) This PR adds two main parts: - shim.h stable C APIs into torch::Library APIs - a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with. Subplots resolved: - Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t)` into it - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only. - Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better? - Yes, and done - Should I add a stable `def` and `fragment` when those can be done in python? - I think we do want these --- and now they're done - Where should library_stable_impl.cpp live? -- no longer relevant - I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc. - Have since tested all the torch library endpoints. The others can be tested in a followup to separate components that need to be in shim.h vs can be added later Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124 Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman	2025-03-11 19:12:46 +00:00
Andy Lugo	4d10da731b	[ROCm] CK Memory-Efficient Attention (attention bias support) (#147778 ) Implements CK as the backend for memory efficient attention with a couple of caveats: - Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")` - Does NOT support Nested Tensors Using the mem_eff path allows us to use attention bias with a CK sdpa backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778 Approved by: https://github.com/houseroad	2025-03-11 19:02:59 +00:00
Doru Bercea	a1cb67b69e	[ROCm] Improve backwards indexing when stride is not one (#147630 ) Improve backwards indexing when stride is not one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147630 Approved by: https://github.com/jeffdaily	2025-03-11 19:02:48 +00:00
Guilherme Leobas	daff65d671	Correctly propagate exception to parent tx (#146502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502 Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #146504, #146499	2025-03-11 18:55:45 +00:00
Guilherme Leobas	fb53e9e514	Add `__context/cause/suppress_context/traceback__` to Exception (#146499 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499 Approved by: https://github.com/zou3519, https://github.com/anijain2305 ghstack dependencies: #146504	2025-03-11 18:55:45 +00:00
Guilherme Leobas	4e7d264cf8	Introduce `UserDefinedExceptionClassVariable` (#146504 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504 Approved by: https://github.com/anijain2305	2025-03-11 18:55:45 +00:00
Jason Ansel	8d08b49015	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-11 18:51:06 +00:00
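The launcher in the commit above computes `grid_0 = ((xnumel + 1023) >> 10)`. Assuming a block size of 1024 (our reading of the constants, not something the commit states), this is a ceiling division written with a shift. The sketch below compares the two forms; both function names are made up for the example.

```python
def grid_ceil_div(xnumel, block=1024):
    """Number of blocks of size `block` needed to cover `xnumel` elements."""
    return (xnumel + block - 1) // block


def grid_shift(xnumel):
    """The form used in the generated launcher, specialized to 1024."""
    return (xnumel + 1023) >> 10


# The two forms agree for every non-negative element count checked here.
agree = all(grid_ceil_div(n) == grid_shift(n) for n in range(0, 10_000))
```

Specializing the arithmetic per configuration is what lets the launcher inline a constant shift instead of calling a generic grid function at launch time.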
PyTorch MergeBot	c916a8efc5	Revert "Use the device interface for detecting Triton availability (#139171 )" This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee. Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))	2025-03-11 18:49:21 +00:00
drisspg	57ee821a41	fix dynamo ide (#148849 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849 Approved by: https://github.com/bobrenjc93	2025-03-11 18:43:30 +00:00
Nikita Shulga	883fb78c7e	Update jinja2 version in requirements-gha-cache.txt As previous version is vulnerable to CVE-2025-27516 This closes Dependabot report	2025-03-11 11:42:38 -07:00
dependabot[bot]	5ee9dbc0a1	Bump jinja2 from 3.1.5 to 3.1.6 in /.ci/docker (#148812 ) Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6) --- updated-dependencies: - dependency-name: jinja2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-03-11 11:39:55 -07:00
cyy	a5f6b24d87	Remove outdated skipIfRocmVersionLessThan decorations (#148941 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941 Approved by: https://github.com/jeffdaily	2025-03-11 18:37:40 +00:00
Ke Wen	ef6296e7f2	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-11 18:36:12 +00:00
Nikita Shulga	b366f33606	[MPSInductor] Prep for multistage reductions (#148969 ) ---- - Move reduction variable initialization from `loads` to `indexing_code` - Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do barriers explicitly inside the respective reduction functions) - Use `self.compute` instead of `self.body` for all compute operations Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969 Approved by: https://github.com/dcci	2025-03-11 18:35:23 +00:00
Nichols A. Romero	dcc502f376	[ROCm][TunableOp] Add bias data type to params signature. (#146227 ) Add bias vector data type in TunableOp params signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227 Approved by: https://github.com/jeffdaily	2025-03-11 18:31:22 +00:00
Chien-Chin Huang	52acc1f955	[DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/140898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918 Approved by: https://github.com/fduwjj, https://github.com/mori360 ghstack dependencies: #148825	2025-03-11 18:24:12 +00:00
Yukio Siraichi	e0d4c43ad1	Add env for disabling meta reference on functionalization. (#148822 ) Fix: https://github.com/pytorch/xla/issues/8755 This PR introduces the `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE` environment variable. Setting this variable makes it so the functionalization kernels won't run the meta reference, which is used to propagate expected sizes and strides. Currently, PyTorch/XLA doesn't actually propagate the correct strides to its tensors. It was also shown that calling these meta functions may incur significant overhead. Running the provided minimal reproducer (see issue), we see a speedup close to 4.3x: - Baseline: 0.0747s - `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s - `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s In summary, this PR: - Creates the `disable_meta_reference()` function, which checks whether the environment variable is set - Modifies codegen for functionalization kernels, adding the call to `disable_meta_reference()` function to the appropriate conditions - Creates a new bash function for running `lazy/test_ts_opinfo.py` with the environment variable set Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822 Approved by: https://github.com/bdhirsh	2025-03-11 16:13:35 +00:00
Jason Ansel	09029010e5	[inductor] Fix create_specialize_impl error in latest Triton (#148933 ) ```py $ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1 WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated Traceback (most recent call last): File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir specialization = _get_specialization(ordered_args.values()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization specialize_impl = triton.runtime.jit.create_specialize_impl() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933 Approved by: https://github.com/yanboliang, https://github.com/davidberard98	2025-03-11 15:54:47 +00:00
PyTorch MergeBot	16560d4e8f	Revert "Refactor `test/test_torch.py` by moving testcase to `test_indexing.py` (#148875 )" This reverts commit 0fa0a740958ffc474843ceb1d19ee43c4bff4c09. Reverted https://github.com/pytorch/pytorch/pull/148875 on behalf of https://github.com/ZainRizvi due to That torch.version failure you got in CI was a legitimate failure and is now breaking trunk. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13778023702/job/38534207536) [HUD commit link](`0fa0a74095`) ([comment](https://github.com/pytorch/pytorch/pull/148875#issuecomment-2714757288))	2025-03-11 15:27:25 +00:00
atalman	3945954741	Bump triton pin. Add aarch64 triton build (#148705 ) 1. Bumps pin for triton to release/3.3.x branch 2. Bump pin for triton-xpu 3. Remove ROCm xfail tests 4. Add aarch64 triton build: * Depends on: https://github.com/pytorch/pytorch/pull/148768 * Fixes: https://github.com/pytorch/pytorch/issues/130558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148705 Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/EikanWang	2025-03-11 15:12:21 +00:00
PyTorch MergeBot	c983e1124c	Revert "[WIP] Initial implementation of Grouped Gemm API (#148531 )" This reverts commit ff29791ed8f815bdbca1a5606de046380baca69d. Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))	2025-03-11 14:40:58 +00:00
Animesh Jain	f1787ee0f7	[dynamo] Remove L scoping for recompilation messages (#148917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917 Approved by: https://github.com/williamwen42	2025-03-11 14:26:26 +00:00
Animesh Jain	992838e702	[dynamo][guards] Do not ID_MATCH on numpy tensors (#148923 ) Might help with https://github.com/pytorch/pytorch/issues/148535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923 Approved by: https://github.com/jansel	2025-03-11 14:20:26 +00:00
Alexander Grund	ee21ccc816	Skip ao_sparsity TestComposability for missing FBGEMM (#144146 ) Those tests (from test_ao_sparsity) require FBGEMM which may not be available. So add the skip decorator. Fixes #87364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144146 Approved by: https://github.com/jerryzh168, https://github.com/jcaip	2025-03-11 13:02:18 +00:00
Rengan Xu	da4bb72a71	Backout D70075331 (#148824 ) Summary: The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0" So we revert D70075331 as a workaround now. Test Plan: The model could be lowered and published successfully. e.g. 702869739_16 Differential Revision: D70823254 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824 Approved by: https://github.com/eqy	2025-03-11 12:51:17 +00:00
David Berard	9ad64ce795	[triton 3.3] Forward-fix mm template selection logic (#148924 ) Follow-up from https://github.com/pytorch/pytorch/pull/148662. The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second template 'AMD-specific template' only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3". Tested locally to verify that it fixes the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924 Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison	2025-03-11 09:05:44 +00:00
eellison	2bcc3acb90	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2025-03-11 08:02:30 +00:00
Gabriel Ferns	41e4728f74	update types on dynamo configs (#146873 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873 Approved by: https://github.com/williamwen42	2025-03-11 05:33:48 +00:00
Gabriel Ferns	1fcc4bc109	Don't look at TESTING_ONLY in fuzzer (#146870 ) Lots of configs aren't meant to be set because they're testing only Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870 Approved by: https://github.com/masnesral	2025-03-11 05:32:25 +00:00
xinan.lin	bed92a8523	[Window][Inductor UT] Fix for tempfile.NamedTemporaryFile(delete=True) not working on Windows. (#148632 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632 Approved by: https://github.com/jansel	2025-03-11 05:05:15 +00:00
Bin Bao	ecfbfe1603	[AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907 ) Summary: shim.h is only meant for generic tensor util shim functions. We should switch to use the auto fallback generation, but it will need some extra care on the op schema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907 Approved by: https://github.com/janeyx99	2025-03-11 04:41:05 +00:00
George White	940b60db97	Use the device interface for detecting Triton availability (#139171 ) This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present. This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171 Approved by: https://github.com/jansel	2025-03-11 03:56:11 +00:00
Natalia Gimelshein	ff29791ed8	[WIP] Initial implementation of Grouped Gemm API (#148531 ) This PR provides initial cutlass implementation of grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiple of 16, that's cutlass limitation. I'll need to add those checks, for dynamic dimensions unfortunately the checks will have to be a device assert. I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself. I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`. Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531 Approved by: https://github.com/drisspg	2025-03-11 02:41:09 +00:00
Brian Hirsh	621dadd4ca	partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097 ) Fixes https://github.com/pytorch/pytorch/issues/144095 open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because: (1) we use the same guess for every unbacked symint (both symbols, and compound expressions) (2) the user may have established some relationship between some unbacked symints that we are not taking into account. I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal? Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this: ``` def custom_op(x: [u0], y: [u0 + 1]): assert x.shape[0] == y.shape[0] - 1 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097 Approved by: https://github.com/laithsakka	2025-03-11 02:11:57 +00:00
albanD	8c45d44abb	Skip distributed subprocess test internally as they don't work (#148909 ) Follow up from https://github.com/pytorch/pytorch/pull/146098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148909 Approved by: https://github.com/janeyx99	2025-03-11 02:07:45 +00:00
Simon Fan	457ff9b7ae	[reland][ca] side-effect free initial trace: compiled_args (#148376 ) This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc. Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter. Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376 Approved by: https://github.com/jansel	2025-03-11 01:57:36 +00:00
Shuai Yang	9fddbf3417	Update the comment (#148726 ) Differential Revision: D70747931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726 Approved by: https://github.com/yf225	2025-03-11 01:19:14 +00:00
zeshengzong	0fa0a74095	Refactor `test/test_torch.py` by moving testcase to `test_indexing.py` (#148875 ) Fix `FIXME` in `test_torch.py` by moving test-cases to `test_indexing.py` ```python # FIXME: move to test indexing # FIXME: move to indexing test suite ``` - Move tests in `test/test_torch.py` to `test_indexing.py` - Remove `FIXME` comments ## TestResult ```bash pytest test/test_torch.py -k TestTorchDeviceType -vv pytest test/test_indexing.py -k TestIndexing -vv ``` ![image](https://github.com/user-attachments/assets/49a80985-e74a-4da6-a063-476e87e6aa8a) ![image](https://github.com/user-attachments/assets/77afa936-5dba-480c-b293-eb1f7bc74420) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148875 Approved by: https://github.com/soulitzer	2025-03-11 01:01:59 +00:00
bobrenjc93	c297c09a37	Fix invalid nested int guarding in broadcast_shapes() (#145957 ) Fixes #145874 This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially. Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better? Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957 Approved by: https://github.com/bobrenjc93 Co-authored-by: bobrenjc93 <bobren@meta.com>	2025-03-11 00:53:13 +00:00
cyy	295f2ed4d1	Fix "invalid application of 'sizeof' to an incomplete type" (#148854 ) Fixes with C++23 and constexpr std::unique_ptr Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854 Approved by: https://github.com/Skylion007	2025-03-11 00:40:00 +00:00
cyy	a6e71dbc88	Enable ASAN on inductor CUDA tests (#148749 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148749 Approved by: https://github.com/jansel	2025-03-10 23:53:40 +00:00
drisspg	b215841ebb	[MM] Add sm carveout to lowerings (#148793 ) # Summary See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using the following to verify; need to figure out how to do proper guarding. This does the correct thing if we compile with sm carveout already set, but since we don't guard on it just yet, we don't recompile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793 Approved by: https://github.com/lw, https://github.com/eellison	2025-03-10 23:49:26 +00:00
Brian Hirsh	492f3fd5cf	replace usages of upload_graph in inductor with tlparse (v2) (#148720 ) Reland of https://github.com/pytorch/pytorch/pull/148703 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720 Approved by: https://github.com/mengluy0125	2025-03-10 22:47:58 +00:00
Michal Gallus	5bbca7d328	[ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097 ) When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/` such as `/WX`, `/MD`, etc.) to be provided, and clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored as they are interpreted as unrecognized compiler options. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097 Approved by: https://github.com/jeffdaily	2025-03-10 22:47:15 +00:00
PyTorch MergeBot	a95eb0c0a7	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))	2025-03-10 22:38:40 +00:00
Qiaochu Yuan	12a95390ae	[Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784 ) Summary: The changes contained in this diff - allow subclass Minimizer implementations to override the default shape propagation logic with custom logic - copy over the meta attribute on get_attr graph nodes during the graph splitting step - for both changes, behavior for existing classes does not change Test Plan: CI Differential Revision: D70799942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784 Approved by: https://github.com/blaine-rister	2025-03-10 22:22:16 +00:00
Jane Xu	fcb633fafa	Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892 ) Importable https://github.com/pytorch/pytorch/pull/148836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892 Approved by: https://github.com/albanD	2025-03-10 22:22:10 +00:00
Boyuan Feng	98b3f1db9f	[Flex Attention] support num_heads > 1 in block_mask (#148857 ) Previously, flex decoding errored when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`. This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as the flex attention kernel. When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to the flex attention kernel so users don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore. Why fall back? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires supporting different kv num blocks for different query heads in the same thread block, leading to lots of redundant workload. So it is better to use the main flex_attention kernel, where each query head is handled by a separate block. Fixes #148527 Fixes #147267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857 Approved by: https://github.com/drisspg	2025-03-10 22:02:50 +00:00
Mandar Deshpande	6ef15c7f46	[pytorch] Update flexattention bwd config generation (#148600 ) Summary: Currently the `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, which is returned at the end. Test Plan: CI. Existing tests. Differential Revision: D70649316 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600 Approved by: https://github.com/Skylion007	2025-03-10 22:00:56 +00:00
Ozan Aydin	8701b302cc	setuptools pinning (#148879 ) Fixes #148877 --- On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version and it is causing an issue on `pytorch` with the following error: ``` AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'? ``` The last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/). Currently it is affecting the Windows ARM64 nightly build; however, it might soon also affect Windows x64 builds (the conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools)). Locally, both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2) --- This PR pins the `setuptools` version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879 Approved by: https://github.com/seemethere	2025-03-10 21:29:32 +00:00
Ting Lu	c652772af7	[aarch64] install ninja for docker to build triton on arm (#148768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148768 Approved by: https://github.com/atalman, https://github.com/Skylion007 Co-authored-by: Andrey Talman <atalman@fb.com>	2025-03-10 21:28:53 +00:00
Michal Gallus	b706044cca	[ROCm][Windows] Enable hipblaslt for Windows (#148563 ) This PR adds hipblaslt library as one of the Windows' dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563 Approved by: https://github.com/jeffdaily	2025-03-10 21:07:16 +00:00
Ting Lu	2a1eeaeed8	Remove 12.4 x86 builds and 12.6 sbsa builds from nightly (#148895 ) https://github.com/pytorch/pytorch/issues/145570 redo https://github.com/pytorch/pytorch/pull/148625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148895 Approved by: https://github.com/atalman	2025-03-10 20:55:09 +00:00
henrylhtsang	4a2173d9a0	[cutlass backend][ez] Incorporate AOTI dynamic shape test into main test of MM (#148786 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148786 Approved by: https://github.com/jingsh	2025-03-10 20:35:10 +00:00
chunhuanMeng	e9c12e819d	Update torch-xpu-ops commit pin (#148881 ) Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](`026b2c8c7c`), includes: - Enable AOT for LNL Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881 Approved by: https://github.com/EikanWang	2025-03-10 20:31:51 +00:00
Chien-Chin Huang	ed969d1236	[DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825 ) Summary: As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825 Approved by: https://github.com/fduwjj, https://github.com/mori360	2025-03-10 20:04:36 +00:00
Tristan Rice	494abeff8a	CUDACachingAllocator,c10d: fixes for IPC release performance (#148805 ) This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL. 1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock 2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there Test plan: Run with torchft + torchtitan ``` REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096 ... [rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65% ``` Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors ![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d) ![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805 Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito	2025-03-10 19:47:04 +00:00
atalman	2e4874e48d	Update RELEASE.md with latest changes to release process and release 2.7 information (#148888 ) 1. Update for Release 2.7 compatibility matrix 2. Remove mention of builder project, the scripts for release management were migrated to test-infra Pull Request resolved: https://github.com/pytorch/pytorch/pull/148888 Approved by: https://github.com/albanD, https://github.com/ZainRizvi	2025-03-10 19:20:27 +00:00
clr	6b0fd741d1	dynamo: Count number of opcodes processed (#147149 ) This gives us a decent proxy for how big of a graph we functionally had to parse. Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149 Approved by: https://github.com/jansel	2025-03-10 19:20:09 +00:00
Wanchao Liang	3129faf8be	Optimize shard_dim_alltoall to use alltoall_single (#148868 ) As titled. Previously, shard_dim_alltoall used `all_to_all`, which could incur lots of copies if the tensor becomes non-contiguous during splits, and alltoall itself also incurs copies. This PR uses alltoall_single instead, so that we can minimize tensor copies. Tested on all the shard dim change tests and it works properly: ``` pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868 Approved by: https://github.com/tianyu-l	2025-03-10 18:38:12 +00:00
Benjamin Glass	ed7e964f2b	codecache.py: use str.format rather than % formatting (#148691 ) Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691 Approved by: https://github.com/desertfire	2025-03-10 18:33:58 +00:00
Fadi Arafeh	d1f21d8ec3	Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584 ) ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set. Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620. This patch enables such use cases by exposing ACL to ATen Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584 Approved by: https://github.com/malfet	2025-03-10 18:29:51 +00:00
Han, Xu	00cabd4235	[Inductor][Windows] add env_var switch to turn on all Windows inductor UTs. (#148733 ) Because of timeouts, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927 And without the UTs, we can't ensure Windows inductor quality. The Intel team will do some local testing for Windows inductor, but we still need to add a switch to turn on the full Windows inductor UTs. The switch is an environment variable: ```cmd set TORCHINDUCTOR_WINDOWS_TESTS=1 ``` After setting this environment variable, we can turn on all Windows inductor UTs. It will not affect PyTorch CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-03-10 18:25:29 +00:00
eellison	4c13a859e5	Workaround no triton float8_e8m0fnu support in inductor (#148722 ) Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873. The one question i'm not sure of is whether or not we need to explicitly disable triton template fusion since it would fuse in these dtypes as uint8.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722 Approved by: https://github.com/vkuzo ghstack dependencies: #148450	2025-03-10 17:37:39 +00:00
cyy	203dd18c5c	Bump Clang-tidy to 19.1.4 (#148648 ) Because Clang-tidy 19 has more powerful clang-analyzer checks to detect subtle bugs. New checks such as misc-use-internal-linkage can help identify potential static variables or functions, thus reducing binary sizes. Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648 Approved by: https://github.com/Skylion007	2025-03-10 17:32:30 +00:00
PyTorch MergeBot	ebd087e4b5	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))	2025-03-10 17:19:21 +00:00
PyTorch MergeBot	2ec9aceaeb	Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834 )" This reverts commit 3680e666d8ceaa43069555f821d1e8a5de01d5ab. Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))	2025-03-10 16:29:40 +00:00
Nicolas De Carli	9dbc2527dc	Disable some SVE autovec (#148489 ) Summary: autovec miscompiles on patterns of the type: ```cpp for (const auto i : c10::irange()) ``` Same issue as described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 and addressed by https://github.com/pytorch/pytorch/pull/137795 for gcc, but not clang Test Plan: buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface Differential Revision: D70422723 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148489 Approved by: https://github.com/malfet	2025-03-10 16:25:00 +00:00
Jason Ansel	a60b4ed623	[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292 ) Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261, #148288	2025-03-10 16:06:19 +00:00
Jason Ansel	8f858e226b	[fx] Optimizations for node name generation (#148288 ) Before: ![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe) After: ![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261	2025-03-10 16:06:19 +00:00
Jason Ansel	5d4e7d58b4	[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` after: ``` 20003454 function calls (19203257 primitive calls) in 8.936 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260	2025-03-10 16:06:11 +00:00
Jason Ansel	bf752c36da	[fx] Move Node._update_args_kwargs to C++ (#148260 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` after: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260 Approved by: https://github.com/oulgen ghstack dependencies: #148243	2025-03-10 16:06:02 +00:00
Jason Ansel	bec7bdad47	[fx] Move map_aggregate to C++ (#148243 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 30603618 function calls (29403419 primitive calls) in 13.744 seconds ``` after: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243 Approved by: https://github.com/oulgen	2025-03-10 16:05:53 +00:00
cyy	b8b1b364c9	Fix invalid format string in libfmt calls (#148855 ) Wrap shaderSource inside fmt::runtime because the format string is not a string literal and can't pass libfmt's compile time check in C++23 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148855 Approved by: https://github.com/Skylion007	2025-03-10 14:47:52 +00:00
Aman Karmani	a81751d8b7	[CD] Annotate linux/arm64 cuda wheels with consistent nvidia dependencies (#145021 ) This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv` The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml` `uv` expects consistency: https://github.com/astral-sh/uv/issues/10693 >Frankly, it's really not ideal that they change their dependencies from wheel to wheel. >They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷‍♀ https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792 >I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent. To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions. I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs. (The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021 Approved by: https://github.com/seemethere, https://github.com/atalman Co-authored-by: Eli Uriegas <eliuriegas@meta.com> Co-authored-by: Andrey Talman <atalman@fb.com>	2025-03-10 14:39:39 +00:00
Wang, Chuanqi	4fdd076907	[CD] Add triton xpu as dependency of torch xpu windows whl (#148755 ) Depends on PR #147637 land Pull Request resolved: https://github.com/pytorch/pytorch/pull/148755 Approved by: https://github.com/atalman	2025-03-10 14:04:30 +00:00
Kalpit Munot	31625b08b8	Add ccode for FloorDiv (#148727 ) Summary: Add ccode for FloorDiv Test Plan: CIs Differential Revision: D70749021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727 Approved by: https://github.com/bobrenjc93	2025-03-10 14:00:18 +00:00
atalman	2068235c0a	Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788 ) After https://github.com/pytorch/pytorch/pull/148612, this model has become flaky. Tracking this regression in an issue: https://github.com/pytorch/pytorch/issues/148699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788 Approved by: https://github.com/izaitsevfb, https://github.com/malfet	2025-03-10 13:40:41 +00:00
albanD	68c12ecfe2	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline doc for their exact semantics - Use the newly added isBuilt for the accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-10 13:17:58 +00:00
Xuehai Pan	098494e9cb	[dynamo] allow global import `from collections import deque` in user code (#148676 ) See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676 Approved by: https://github.com/jansel	2025-03-10 13:14:05 +00:00
Xinyuan Zhao	59f14d19ae	Implement gradient for the `residuals` of `torch.linalg.lstsq` (#148526 ) Fixes #147543. I have written some tests in python using `gradcheck`. Please advise where I should put these tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526 Approved by: https://github.com/lezcano	2025-03-10 12:35:09 +00:00
Francisco Massa	ea86b8d315	Fix redistribution cost for all-reduce (#148761 ) This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce. Thanks @lw for the helpful discussions on getting this PR out! Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761 Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin	2025-03-10 12:13:11 +00:00
PyTorch UpdateBot	526524b489	Update slow tests (#148873 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148873 Approved by: https://github.com/pytorchbot	2025-03-10 11:46:30 +00:00
Michal Gallus	74da76f67c	[ROCm][Windows] Fix ROCm/HIP version header (#148560 ) On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves this problem by utilising `<hip/hip_version.h>` from HIP SDK. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560 Approved by: https://github.com/jeffdaily	2025-03-10 11:28:13 +00:00
Mwiza Kunda	00199acdb8	[inductor][triton] Block ptr analysis fix assert on matched index expression (#148446 ) If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index which can lead to an assertion failure when the matched index is compared with the original index. For example the below assertion fails, despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR. ``` torch._inductor.exc.InductorError: AssertionError: E Invalid match! E Index: 3ps0((yindex//3)) + (ModularIndexing(yindex, 1, 3)) E Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3)) E ``` This PR fixes the test below when `config.triton.use_block_ptr=True`: ``` python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446 Approved by: https://github.com/jansel	2025-03-10 05:26:55 +00:00
Jane Xu	3680e666d8	Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834 ) I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++. Following the thread there I am thinking there are two possible solutions: (1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut (2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834 Approved by: https://github.com/desertfire	2025-03-10 03:23:48 +00:00
henrylhtsang	7ae0ce6360	[cutlass backend] fix assertion that prevent self multiplication (#148233 ) # Problem: In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one. # Solution Just use whichever. Since they are the same. # Question What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233 Approved by: https://github.com/ColinPeppler	2025-03-10 00:21:36 +00:00
henrylhtsang	b47d81682d	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 \|\| (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-10 00:21:24 +00:00
cyy	aac230a511	[MPS] Fix Wreorder-init-list (#148839 ) Fixes the following warning: ``` warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list] 662 \| return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type}; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839 Approved by: https://github.com/Skylion007	2025-03-09 23:45:46 +00:00
Nikita Shulga	b95889042c	[MPS] Introduce strides unary op (#148468 ) By adding following template ```metal template <typename T, typename F> kernel void unary_strided( device result_of<F, T>* output [[buffer(0)]], constant T* input [[buffer(1)]], constant long* sizes [[buffer(2)]], constant long* input_strides [[buffer(3)]], constant long* output_strides [[buffer(4)]], constant uint& ndim, uint index [[thread_position_in_grid]]) { F f; int pos[max_ndim]; pos_from_thread_index(int(index), pos, sizes, ndim); const auto input_offs = offset_from_coord(pos, input_strides, ndim); const auto output_offs = offset_from_coord(pos, output_strides, ndim); output[output_offs] = f(input[input_offs]); } ``` and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies. No extra testing is needed as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468 Approved by: https://github.com/dcci	2025-03-09 22:30:51 +00:00
PyTorch MergeBot	275a7c5dbb	Revert "Add a stable TORCH_LIBRARY to C shim (#148124 )" This reverts commit 327e07ac1dc3351bb5f0ad436760b83590c400aa. Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))	2025-03-09 20:44:56 +00:00
PyTorch MergeBot	19a39a7a06	Revert "[dynamo] allow global import `from collections import deque` in user code (#148676 )" This reverts commit 685fb377131cc684633dc5471e77038988db53f6. Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see `f1444f006c/1`(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))	2025-03-09 20:42:03 +00:00
Zhenghao Hu	f1444f006c	[caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833 ) Latest LLVM introduced two changes related to the `Triple` usage that causes build failures when building pytorch. ## Failure in llvm_codegen.cpp: Triple is stored in Modules instead of the string: `979c275097` ## Failure in llvm_jit.cpp: Triple argument is removed from LLJITBuilder::... : `b18e5b6a36` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833 Approved by: https://github.com/Skylion007	2025-03-09 18:58:36 +00:00
Jason Ansel	9a1a2e1516	Better log message to update pr_time_benchmarks/expected_results.csv (#148303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303 Approved by: https://github.com/Skylion007	2025-03-09 17:12:47 +00:00
xinan.lin	a8e3d1984a	[Inductor UT][XPU] Skip test case test_cat_max_autotune_triton for known issue. (#148734 ) The mm triton template/configs have not been tuned for XPU; we observe that epilogue fusion does not give a speedup on XPU because of register spills. So XPU fails the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip once #146568 is resolved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734 Approved by: https://github.com/jansel	2025-03-09 15:09:43 +00:00
Aditya Tiwari	bb9c426024	Typo Errors fixed in multiple files (#148262 ) # Fix typo errors across PyTorch codebase This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability. ## Changes Made ### Documentation Fixes - Changed "seperate" to "separate" in multiple files: - `setup.py`: Build system documentation - `torch/_library/triton.py`: AOT compilation comments - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation - `torch/export/_unlift.py`: Pass population comments - `torch/export/exported_program.py`: Decomposition table notes ### Code Comments and Error Messages - Changed "occured" to "occurred" in: - `test/mobile/test_lite_script_module.py`: Exception handling comments - `torch/export/_draft_export.py`: Error message text - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment - `torch/csrc/utils/python_numbers.h`: Overflow handling comment - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation - `torch/_dynamo/symbolic_convert.py`: Error explanation ### API Documentation - Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py` - Changed "accross" to "across" in: - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp` - `torch/distributed/distributed_c10d.py` ## Motivation These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR. ## Test Plan No testing required as these changes only affect comments and documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-09 12:21:40 +00:00
Jane Xu	327e07ac1d	Add a stable TORCH_LIBRARY to C shim (#148124 ) This PR adds two main parts: - shim.h stable C APIs into torch::Library APIs - a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with. Subplots resolved: - Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (fn)(void , int64_t, int64_t)` into it - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only. - Should I use uint64_t as the common denominator instead of void to support 32bit architectures better? - Yes, and done - Should I add a stable `def` and `fragment` when those can be done in python? - I think we do want these --- and now they're done - Where should library_stable_impl.cpp live? -- no longer relevant - I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc. - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-03-09 10:07:25 +00:00
Xuehai Pan	685fb37713	[dynamo] allow global import `from collections import deque` in user code (#148676 ) See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676 Approved by: https://github.com/jansel	2025-03-09 09:35:29 +00:00
William Wen	6566d67bd3	[dynamo] show stack above dynamo in graph break user tracebacks (#148401 ) Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+). Also collapses resume frames together (use config.verbose to see full stack trace - for developers). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-03-09 07:37:38 +00:00
Ke Wen	2149f6c684	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-09 07:32:23 +00:00
Jason Ansel	85fe576ee3	[set_linter] allow x in {...} (#148422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148422 Approved by: https://github.com/Skylion007	2025-03-09 06:43:11 +00:00
PyTorch MergeBot	9cb25f0ea2	Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 )" This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d. Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))	2025-03-09 03:01:55 +00:00
Ke Wen	17dbeb11db	[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590 ) This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related): 1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back. - Resolves #147729 - Resolves #146881 - Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user. 2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling. - Resolves #147168 3. Remove tensor life management when async_op=False; only use it when async_op=True. 4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). 5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels. Joint work with @cenzhaometa who wants to remove the event sync overhead. Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590 Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj	2025-03-08 20:00:12 +00:00
Nino Risteski	5245304f1e	Update decompositions_for_jvp.py (#148821 ) small typo thing that got my eye Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821 Approved by: https://github.com/Skylion007	2025-03-08 19:08:42 +00:00
Daniel Vega-Myhre	148eb735ee	Change nvcc arch flags for sm100 (#148774 ) ### Summary - Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012 ### Test plan - Verified building from source w/ B200s is successful - Verified B200 tensorcores are still being utilized properly via benchmarking script Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774 Approved by: https://github.com/Skylion007	2025-03-08 19:05:53 +00:00
Tristan Rice	7ffadff286	c10d/ProcessGroup: cleanup abort and shutdown (#148798 ) This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs. This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation. Test plan: ``` pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798 Approved by: https://github.com/kwen2501	2025-03-08 18:33:18 +00:00
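The "default noop implementation" idea in this commit can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration, not the real `Backend` or `ProcessGroup` types from `torch.distributed`.

```python
class Backend:
    """Hypothetical base class: abort/shutdown default to doing nothing."""

    def abort(self) -> None:
        # Default no-op, so callers never need to special-case a backend.
        return None

    def shutdown(self) -> None:
        # Default no-op.
        return None


class FakeNCCLBackend(Backend):
    """Hypothetical backend that actually has something to abort."""

    def __init__(self) -> None:
        self.aborted = False

    def abort(self) -> None:
        self.aborted = True


def abort_all(backends) -> None:
    # Works uniformly: no hasattr checks or calls to private _abort methods.
    for backend in backends:
        backend.abort()


if __name__ == "__main__":
    plain, nccl = Backend(), FakeNCCLBackend()
    abort_all([plain, nccl])
    print("nccl aborted:", nccl.aborted)
```

With a no-op default on the base class, calling code can invoke `.abort()` on any backend without knowing which implementation it holds.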
Sanket Purandare	9841f0ddcf	Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566 ) This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation. It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do. For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`. Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566 Approved by: https://github.com/weifengpy	2025-03-08 18:00:49 +00:00
Fangjun Kuang	439782960c	Fix typos in SpectralOps.cpp (#148818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148818 Approved by: https://github.com/Skylion007	2025-03-08 17:34:59 +00:00
eqy	849cc058ee	[CUDA][TF32] Account for tf32 in `test_efficient_conv_bn_eval` (#148802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148802 Approved by: https://github.com/Skylion007	2025-03-08 16:17:04 +00:00
David Berard	c3b05c4a27	[triton 3.3] support both specialize_impl and create_specialize_impl (#148806 ) After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize_impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806 Approved by: https://github.com/drisspg	2025-03-08 09:31:52 +00:00
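A compatibility shim of the kind this commit describes might look roughly like the sketch below. It is an illustration only: `resolve_specialize_impl` is a hypothetical helper, the modules are fake stand-ins built with `types.ModuleType`, and the zero-argument call to `create_specialize_impl()` simply follows the wording of the commit message (the real Triton signature may differ between versions).

```python
import types


def resolve_specialize_impl(jit_module):
    """Return a specialize_impl callable from either API shape."""
    direct = getattr(jit_module, "specialize_impl", None)
    if direct is not None:
        # Older layout: the function is exported directly.
        return direct
    factory = getattr(jit_module, "create_specialize_impl", None)
    if factory is not None:
        # Newer layout: a factory builds the function (assumed zero-arg here).
        return factory()
    raise ImportError("neither specialize_impl nor create_specialize_impl found")


# Fake modules standing in for the two variants of triton.runtime.jit.
old_style = types.ModuleType("fake_jit_old")
old_style.specialize_impl = lambda arg: ("old", arg)

new_style = types.ModuleType("fake_jit_new")
new_style.create_specialize_impl = lambda: (lambda arg: ("new", arg))

if __name__ == "__main__":
    print(resolve_specialize_impl(old_style)(1))
    print(resolve_specialize_impl(new_style)(1))
```

Resolving the callable once, at import time, keeps the version check out of every call site.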
Justin Chu	118c9e501a	[ONNX] Remove inaccurate test comment (#148813 ) Remove the comment that says jit trace strategy doesn't support dynamic shapes as dict because it does support it (which is what the test is testing) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148813 Approved by: https://github.com/cyyever, https://github.com/titaiwangms	2025-03-08 08:55:56 +00:00
Zhuoran Zhao	3745da18f4	[AOTI] Switch to local cpp compile for fbcode (#148592 ) Summary: as title, otherwise we cannot find lamdhip64 Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431 Differential Revision: D70637798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592 Approved by: https://github.com/hl475	2025-03-08 08:38:26 +00:00
Simon Fan	666508eb17	[aot cache][ca] remove restriction on caching ca's aot inference graph (#148491 ) but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491 Approved by: https://github.com/jamesjwu ghstack dependencies: #148381	2025-03-08 06:08:26 +00:00
Simon Fan	c16cd25cf5	[ca] remove compiled_autograd_tracing (#148381 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381 Approved by: https://github.com/jansel	2025-03-08 06:08:26 +00:00
Wang, Chuanqi	5f1c79ba2b	[CD] Enable triton xpu windows build (#147637 ) Depends on #147727, which introduce triton xpu windows support Pull Request resolved: https://github.com/pytorch/pytorch/pull/147637 Approved by: https://github.com/atalman	2025-03-08 05:28:46 +00:00
cyy	f7c0c230b0	Fix compile errors (#148758 ) Fix ``` /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry' 91 \| static_assert(sizeof(_Tp)>0, \| ^~~~~~~~~~~ /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here 399 \| get_deleter()(std::move(__ptr)); \| ^ ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here 200 \| AliasDb::~AliasDb() = default; \| ^ ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here 200 \| AliasDb::~AliasDb() = default; \| ^ ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry' 298 \| struct WriteRegistry; \| ^ 1 error generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758 Approved by: https://github.com/Skylion007	2025-03-08 04:56:42 +00:00
Yanan Cao (PyTorch)	75179fd6e6	[Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148781 ) Differential Revision: D70575053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148781 Approved by: https://github.com/SherlockNoMad	2025-03-08 04:43:32 +00:00
riccardofelluga	8f71d4563e	Fix rms_norm in fp16/bf16 (#147203 ) Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done. Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want though is to avoid the intermediate down-cast, and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203 Approved by: https://github.com/eqy	2025-03-08 04:43:18 +00:00
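The effect of the extra round trip can be reproduced with a small standard-library sketch. It is not the PyTorch kernel: the function names are hypothetical, Python's double-precision floats stand in for the fp32 upcast, and `struct`'s `"e"` format is used to round values to IEEE half precision.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rms_norm_reference(xs, weights, eps=1e-6):
    # Full-precision result, never rounded to fp16.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms * w for x, w in zip(xs, weights)]


def rms_norm_early_downcast(xs, weights, eps=1e-6):
    # fp16 -> high precision -> fp16 -> high precision -> fp16
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    normalized = [to_fp16(x / rms) for x in xs]  # intermediate rounding
    return [to_fp16(n * w) for n, w in zip(normalized, weights)]


def rms_norm_late_downcast(xs, weights, eps=1e-6):
    # fp16 -> high precision -> fp16: round once, after all computation.
    return [to_fp16(v) for v in rms_norm_reference(xs, weights, eps)]


if __name__ == "__main__":
    xs = [to_fp16(0.1 * i + 0.37) for i in range(16)]
    ws = [to_fp16(1.0 + 0.013 * i) for i in range(16)]
    ref = rms_norm_reference(xs, ws)
    early = rms_norm_early_downcast(xs, ws)
    late = rms_norm_late_downcast(xs, ws)
    print("max error, early downcast:", max(abs(a - r) for a, r in zip(early, ref)))
    print("max error, late downcast: ", max(abs(a - r) for a, r in zip(late, ref)))
```

Because the late path rounds the reference value to the nearest fp16 number exactly once, its per-element error cannot exceed that of the path with an intermediate rounding step.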
Joel Schlosser	85467ed063	Fix for AOTI + CUDAGraphs when calling from Python (#148601 ) Background: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead. When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get: ``` Error: operation not permitted when stream is capturing ``` @desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs. Fix: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details: * Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)` * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction. * C++ side introduces single-threaded alternatives to model running and model container running: * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed. * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`. I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed. 
Future work: * Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist. * Compose with cudagraph trees as opposed to manual cuda graph wrapping Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601 Approved by: https://github.com/desertfire	2025-03-08 02:44:14 +00:00
Sampsa	9f170d9d13	[Triton 3.3] Remove ROCm specific mm gemm template (#148662 ) Fixes: https://github.com/pytorch/pytorch/issues/147121, since triton 3.3.x fixes the problem. This needs to be handled in a non-BC-breaking way, so we conditionalise this change on the triton version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662 Approved by: https://github.com/davidberard98 Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>	2025-03-08 01:24:40 +00:00
drisspg	a89e7c2da9	[Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785 Approved by: https://github.com/davidberard98	2025-03-08 01:09:28 +00:00
Lukas Pfahler	179b7a0abc	Do not crash when compiling quantized LORA models (#148435 ) Fixes #148072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435 Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel	2025-03-08 00:02:08 +00:00
Gabriel Ferns	24085db082	Don't clear feedback_saver_fns after cache clear (#148723 ) Summary: Since feedback_saver_fns are used for logging, I don't think it makes sense to clear them, and this resulted in weird behavior in user code where disabling caches caused logging code to break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723 Approved by: https://github.com/henrylhtsang, https://github.com/eellison	2025-03-07 23:43:59 +00:00
Justin Chu	d96c85558a	[ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627 ) Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`. - Prior: The exporter will produce a static graph silently even when dynamic_shapes are provided. - Proposed: When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed ## Why are we still keeping the JIT strategy? It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627 Approved by: https://github.com/titaiwangms	2025-03-07 23:41:50 +00:00
Avanish.Tiwari	26f8d81037	Enable onednn in pytorch for ppc64le architecture (#143743 ) This PR will enable onednn for powerpc Architecture which will help to do quantization of the model via onednn for powerpc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-07 23:35:47 +00:00
Sam Larsen	187d5c0eb1	[logging] Log cudagraphify timings to dynamo_timed (#143220 ) Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note: * Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries * The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id * I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`: * tlparse: https://fburl.com/21yrdn8h * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` * tlparse: https://fburl.com/r9mp7uiv * scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220 Approved by: https://github.com/eellison	2025-03-07 23:07:13 +00:00
iupaikov-amd	f2dfe2d99c	[Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619 ) Fixes issue https://github.com/pytorch/pytorch/issues/133228 Enabled split_scan support for ROCm builds. Must be handled in a non BC breaking way so this functionality is enabled conditionalised on triton version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619 Approved by: https://github.com/davidberard98 Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: David Berard <davidberard98@gmail.com>	2025-03-07 23:06:21 +00:00
PyTorch MergeBot	0f852641c2	Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521 )" This reverts commit d35a4ddae2345e639001bfee58a0932e96597f2d. Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))	2025-03-07 22:42:13 +00:00
David Berard	755965d2e4	[inductor] fix matmul w/ torch.bucketize epilogue (#148769 ) See https://github.com/pytorch/pytorch/issues/148764. Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist. As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape. This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it. Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769 Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi	2025-03-07 22:34:13 +00:00
Xinya Zhang	67742128b7	[ROCm] Bump AOTriton to 0.9.2b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore. * `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead. * Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-07 22:10:07 +00:00
Nikita Shulga	7b79e17275	[BE] Move cuda12.6 builds to gcc11 (#148740 ) I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/` This incidentally fixes undefined symbol reference errors, namely ``` /usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()' ``` This happens because `libmagma.a`, which was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135 ), contains symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`. Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or ``` $ echo "int* foo(int x) { return new int[x];}"\|g++ -std=c++17 -S -O3 -x c++ -o - - .file "" .text .section .text.unlikely,"ax",@progbits .LCOLDB0: .text .LHOTB0: .p2align 4 .globl _Z3fooi .type _Z3fooi, @function _Z3fooi: .LFB0: .cfi_startproc endbr64 movslq %edi, %rdi subq $8, %rsp .cfi_def_cfa_offset 16 movabsq $2305843009213693950, %rax cmpq %rax, %rdi ja .L2 salq $2, %rdi addq $8, %rsp .cfi_def_cfa_offset 8 jmp _Znam@PLT .cfi_endproc .section .text.unlikely .cfi_startproc .type _Z3fooi.cold, @function _Z3fooi.cold: .LFSB0: .L2: .cfi_def_cfa_offset 16 call __cxa_throw_bad_array_new_length@PLT .cfi_endproc ``` Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740 Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi	2025-03-07 21:21:12 +00:00
Nichols A. Romero	08baaa7d63	[Docs][TunableOp] TunableOp documentation update (#148384 ) This PR aligns documentation to what is in the README file: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md and removes the prototype NOTE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384 Approved by: https://github.com/jeffdaily, https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-03-07 21:02:49 +00:00
PyTorch MergeBot	bb94b65da7	Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233 )" This reverts commit 2fb654676f6291f6e27c6bab2761f170516598dd. Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))	2025-03-07 20:58:28 +00:00
Nikita Shulga	d8dc700e25	Delete duplicate entry from `docker-builds.yml` (#148782 ) Regression introduced by merge conflict of https://github.com/pytorch/pytorch/pull/148612 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148782 Approved by: https://github.com/atalman	2025-03-07 20:55:46 +00:00
PyTorch MergeBot	99da439d10	Revert "Remove Cuda 12.4 from nightly Binaries (#148625 )" This reverts commit 1239176fe717839ca5612ac03a4806051225f381. Reverted https://github.com/pytorch/pytorch/pull/148625 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/148625#issuecomment-2707415005))	2025-03-07 20:47:45 +00:00
Nikita Shulga	6602e632cd	Suppress build warnings when gcc-11 is used (#148763 ) By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")` that will suppress following (when building against ancient llvm-9) ``` In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24: /opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type, llvm::Value, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]': /opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete] 1581 \| return Insert(new LoadInst(Ty, Ptr), Name); \| ^~~~~~~~~~~~~~~~~~~~~ /opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void llvm::UnaryInstruction::operator new(size_t)' ``` Probably a reasonable followup will be to disable NNC testing all-together, as project has been in a maintenance mode for a while now Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman ghstack dependencies: #148739	2025-03-07 20:43:35 +00:00
Justin Chu	d36391307f	[ONNX] Handle error in verification interpreter (#148730 ) Use a simple try catch to handle onnx runtime errors in the verification interpreter when that happens. One example is ort will sometimes produce a list of None for some nodes. I am not sure how that happens yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730 Approved by: https://github.com/titaiwangms ghstack dependencies: #148706	2025-03-07 20:24:49 +00:00
Xuehai Pan	aebd2e411f	[pytree][easy] lock global registry containers properly for thread-safety (#148750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750 Approved by: https://github.com/StrongerXi	2025-03-07 20:04:52 +00:00
bobrenjc93	6b44a91a62	use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557 ) We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557 Approved by: https://github.com/eellison	2025-03-07 19:17:25 +00:00
PyTorch MergeBot	b246cd7b82	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))	2025-03-07 18:59:58 +00:00
Ting Lu	1239176fe7	Remove Cuda 12.4 from nightly Binaries (#148625 ) https://github.com/pytorch/pytorch/issues/145570 removes cuda 12.4 nightly builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148625 Approved by: https://github.com/atalman	2025-03-07 18:56:04 +00:00
Irem Yuksel	61c4074df7	Add Windows Arm64 Nightly Builds (#139760 ) This PR creates 3 new workflows for the Windows Arm64 target. The workflows and outputs can be reviewed at the following links: https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760 Approved by: https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2025-03-07 18:53:56 +00:00
cyy	e839e4f5bd	Fix Wc++98-compat-extra-semi (#148757 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148757 Approved by: https://github.com/Skylion007	2025-03-07 18:49:12 +00:00
Michal Gallus	0a7ccee1e0	[ROCm][Windows] Disable Composable Kernels and Triton for Windows builds (#147334 ) Currently, Composable Kernels and Triton aren't available on Windows. This PR ensures that the files relating to these dependencies are not included during the build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147334 Approved by: https://github.com/jeffdaily	2025-03-07 18:40:49 +00:00
eqy	18c6e00c7b	[CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in `ProcessGroupNCCL.cpp` (#148594 ) Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594 Approved by: https://github.com/kwen2501	2025-03-07 18:39:02 +00:00
Ting Lu	9769618d35	[CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612 ) https://github.com/pytorch/pytorch/issues/145570 breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock Seems many inductor yml are added after initial change was prepared. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612 Approved by: https://github.com/nWEIdia, https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-03-07 18:30:14 +00:00
Nikita Shulga	da923afdc7	[MPS][BE] Align bitshift behavior with CPU (#148719 ) By casting the argument to output type Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719 Approved by: https://github.com/Skylion007 ghstack dependencies: #148685, #148686	2025-03-07 18:28:14 +00:00
Nikita Shulga	f84710aef4	[MPS] Fix scalar to tensors bitshifts (#148686 ) By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar` Add regression tests Please note, that for some undefined values MPS and CPU behaviors are different, for example ``` >>> import torch >>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8) tensor([255, 255, 255, 255, 255, 127, 63, 31, 15, 7, 3, 1], device='mps:0', dtype=torch.uint8) >>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8) tensor([255, 127, 63, 31, 15, 7, 3, 1, 0, 0, 0, 0], dtype=torch.uint8) ``` Because on CPU scalar is cast to output dtype before operation is performed, but on MPS this happens after the op is done Fixes https://github.com/pytorch/pytorch/issues/147889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686 Approved by: https://github.com/albanD ghstack dependencies: #148685	2025-03-07 18:28:14 +00:00
cyy	116c1e42c5	Re-enable tests (#148732 ) No UBSAN failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148732 Approved by: https://github.com/Skylion007	2025-03-07 18:11:57 +00:00
Jack Taylor	8059ead823	[ROCm] Incorporate ROCm triton specific tuning parameters (#148437 ) Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above. A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437 Approved by: https://github.com/eellison, https://github.com/jansel	2025-03-07 18:09:47 +00:00
Aaron Orenstein	a3b77d434a	Subprocess compile (attempt 2) (#148635 ) Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test which runs the test_torchinductor tests with subprocess compiling turned on. Fixed the test which caused the previous version (#146134) to be reverted: ``` $ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635 Approved by: https://github.com/jamesjwu	2025-03-07 17:50:14 +00:00
xinan.lin	50c9f6d83b	[Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323 ) In `fresh_inductor_cache`, removing pyd files raises a permission error on Windows because they are still in use by the process. So we clear the references to the loaded pyd library objects and unload them from the process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323 Approved by: https://github.com/jansel ghstack dependencies: #148534, #148538, #147727	2025-03-07 17:19:59 +00:00
xinan.lin	d05694807d	[XPU][Inductor] Update Intel triton for release 2.7. (#147727 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147727 Approved by: https://github.com/EikanWang, https://github.com/Skylion007 ghstack dependencies: #148534, #148538	2025-03-07 17:19:59 +00:00
Saurabh Mishra	136b8165d1	[DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577 ) Summary: Save Plan Caching: Fix the missing all_plans update in the cache. Test Plan: ``` buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test ``` https://www.internalfb.com/intern/testinfra/testrun/17451448626323264 Reviewed By: MeetVadakkanchery Differential Revision: D70229019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577 Approved by: https://github.com/MeetVadakkanchery	2025-03-07 17:00:59 +00:00
PyTorch MergeBot	abcca2fcbb	Revert "Fix `torch.nn.functional.hardswish` gradients corner case (#148049 )" This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3. Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))	2025-03-07 16:05:56 +00:00
albanD	17302b4bc8	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic - Use the newly added isBuilt for accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-07 15:19:34 +00:00
Nikita Shulga	d54b2b7fa7	[BE] Delete split builds (#148739 ) They have been disabled since Oct 2024, so it is perhaps time to remove them from the workflows. See https://github.com/pytorch/pytorch/issues/138750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148739 Approved by: https://github.com/atalman	2025-03-07 15:10:50 +00:00
Anant Gulati	372ad7b181	Enable FSDP2 on HPU device (#148667 ) The motivation of this PR is to enable FSDP2 collectives for HPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667 Approved by: https://github.com/wconstab	2025-03-07 14:33:43 +00:00
ZhiweiYan-96	81847d08cf	[Intel GPU][quant] Refine zero-point memory creation (#148640 ) # Motivation This PR skips zero-point GPU memory creation when zero-point=0, as it would not be used by oneDNN library. This could help save the 1~3 H2D copy overhead per QLinear/QConv kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148640 Approved by: https://github.com/liangan1, https://github.com/EikanWang	2025-03-07 13:49:19 +00:00
Luca Wehrstedt	f80aad62fa	Improve Pareto frontier plot for AutoAC (#148678 ) This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values. However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range and thus missed many threshold values) and in distributed settings. Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points), plus I add the rank to the filename and store it in a user-defined directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678 Approved by: https://github.com/Chillee, https://github.com/fmassa	2025-03-07 13:22:29 +00:00
Aleksei Nikiforov	d4d7d813fa	Update CURL url for manywheel images (#148343 ) It looks like it was moved on the site it was downloaded from. Switch to official site while updating URL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148343 Approved by: https://github.com/dr4gon01, https://github.com/janeyx99, https://github.com/atalman, https://github.com/seemethere	2025-03-07 11:41:12 +00:00
Avik Chaudhuri	6cf360be04	fix lost input mutations with export_tracepoint (#148709 ) Preserving module call signatures in the presence of input mutation causes incorrect results. The root cause turned out to be that export tracepoints would unwrap / wrap functional args, which would lose mutation info on those args. Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709 Approved by: https://github.com/angelayi	2025-03-07 09:36:18 +00:00
Nichols A. Romero	bb84a23c22	[ROCm] [TunableOp] Enable logging of BLAS parameters (#147034 ) This PR supports a logging feature that is being requested. ``` PYTORCH_TUNABLEOP_BLAS_LOG=1 ``` Enables the logging of BLAS parameters with either offline or online (in-situ) tuning. The BLAS parameters are written to the CSV file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034 Approved by: https://github.com/jeffdaily	2025-03-07 09:32:59 +00:00
Ding, Yi1	243b47e2ec	[Intel GPU] Fix SDPA dummy LSE output to match meta function (#148652 ) To fix XPU patched UTs including ```bash pytest -vs third_party/torch-xpu-ops/test/xpu/test_meta_xpu.py::TestMetaXPU::test_dispatch_symbolic_meta_outplace_nn_functional_scaled_dot_product_attention_xpu_bfloat16 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148652 Approved by: https://github.com/EikanWang	2025-03-07 08:36:18 +00:00
FFFrog	416ea1c71c	Code Clean: Remove unnecessary code (#148735 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735 Approved by: https://github.com/jingsh, https://github.com/cyyever	2025-03-07 08:15:37 +00:00
ZhiweiYan-96	4075646bd8	Use oneDNN v3.7.1 for Intel GPU (#148403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403 Approved by: https://github.com/EikanWang Co-authored-by: majing <jing1.ma@intel.com> Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>	2025-03-07 08:03:49 +00:00
cyy	3d854ea9bd	Remove deprecated std::aligned_storage_t (#148660 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148660 Approved by: https://github.com/swolchok	2025-03-07 07:29:42 +00:00
Rachel Guo	3f069e7679	[mm_logs] enhance the printing for overview info (#148716 ) Summary: Previously the dynamo counters did not print the count information automatically. This explicitly adds a log message, printed after lowering, with overview info for inductor aten mms. The name is in `{aten_op_name}_{m}_{n}_{k}` format, and the message looks like: ``` torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx ``` Test Plan: ``` TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda ``` Differential Revision: D70739912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716 Approved by: https://github.com/henrylhtsang	2025-03-07 05:23:49 +00:00
Syed Tousif Ahmed	5f392ae560	Throws error when using torch.cuda.MemPool with expandable segments (#148378 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378 Approved by: https://github.com/ngimel, https://github.com/eqy ghstack dependencies: #148374	2025-03-07 05:22:03 +00:00
Wei Feng	c0f1557285	[FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715 ) We got asked a few times about FSDP2's equivalent of no_sync. Highlight set_requires_gradient_sync as the equivalent in the docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715 Approved by: https://github.com/mori360	2025-03-07 04:34:46 +00:00
Nitin Singh	fe4b88f6aa	[HPU] Add hpu to fused kernels supported devices (#148666 ) This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring compatibility with HPU backend. Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with: RuntimeError: fused=True requires all the params to be floating point Tensors of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone'] but torch.float32 and hpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666 Approved by: https://github.com/albanD	2025-03-07 04:28:33 +00:00
Nichols A. Romero	33f8ab2f58	[ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238 ) This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM. There are few other items included in this PR as well: - Fixes for offline tuning of scaled GEMM - Simplification of existing offline UT - Update existing online UT to also test rowwise versus tensorwise scaled GEMM - New UT for offline scaled GEMM Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238 Approved by: https://github.com/jeffdaily	2025-03-07 04:12:48 +00:00
Andrey Talman	cdb4fd0d29	Update win-vs2022-cuda12.1-py3 -> win-vs2022-cuda12.6-py3 (#148717 ) Should have been migrated long ago Pull Request resolved: https://github.com/pytorch/pytorch/pull/148717 Approved by: https://github.com/ZainRizvi, https://github.com/malfet	2025-03-07 03:21:29 +00:00
xinan.lin	389b496062	[XPU] Add test/kernel.errors.txt to .gitignore. (#148538 ) The Intel GPU user mode driver may generate kernel.errors.txt files in the current working directory in certain scenarios. The file includes diagnostic information but does not necessarily indicate an issue with an application. This is a known issue and will be fixed in a newer version of the driver. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538 Approved by: https://github.com/desertfire, https://github.com/jansel ghstack dependencies: #148534	2025-03-07 03:12:50 +00:00
Wei-Sheng Chin	9c9b05bc4f	Expose functions used in custom backend in torch_python dll (#148213 ) Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL). Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop the inline keyword and move the implementation to a .cc file. Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h. Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213 Approved by: https://github.com/malfet	2025-03-07 02:34:37 +00:00
Zhuoran Zhao	dfb4094b9c	Skip buffer in dense update (#148533 ) Summary: as title. PyTorch Module buffers will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced. In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided delta weights list. Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn Differential Revision: D69553929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533 Approved by: https://github.com/22quinn, https://github.com/jingsh	2025-03-07 01:59:58 +00:00
ZhiweiYan-96	00cd6c07b9	[Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522 ) # Motivation&Details This PR fixes a bug that previously blocked quantized grouped convolution. The bug arose because grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT. # UT `python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu` # Runtime exemplification ``` onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785 ``` The verbose output shows that we successfully run quantized convolution, where the weight is in `abcde` format (group conv). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522 Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel ghstack dependencies: #148423	2025-03-07 01:57:45 +00:00
drisspg	127bd5a02d	Add sparsity (#148513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513 Approved by: https://github.com/danielvegamyhre	2025-03-07 01:47:52 +00:00
ZhiweiYan-96	b4430c3a6d	[Intel GPU][pt2e]: Collapse 3D input to 2D for matmul in qlinear_pointwise_binary fusion (#148423 ) # Motivation During the `qlinear_pointwise_binary` lowering pass, dim collapsing only occurs when post-ops is `add`. It is the responsibility of C++ kernels to handle dimension for post-ops `sum` # Details This PR explicitly reshapes the input from 3D to 2D in op `qlinear_pointwise_binary`. In addition, we refactor the implementation of `qlinear_pointwise_binary.tensor` to call `qlinear_pointwise_binary`, removing duplicated code. # UT testing `python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-03-07 01:47:33 +00:00
Ryan Guo	c8cd8f68bd	[dynamo] Properly account for non-list instances in list comparison (#148470 ) As title; this patch also removes an unused `list_compare` method. Fixes #148179. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470 Approved by: https://github.com/anijain2305	2025-03-07 01:29:30 +00:00
eellison	a7fe685be8	Add cpp wrapper skip to cudagraph logs (#148700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700 Approved by: https://github.com/jbschlosser	2025-03-07 01:02:40 +00:00
Justin Chu	e3087f6d76	[ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706 ) I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users. I added a `compare_intermediates` option to `verify_onnx_program`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706 Approved by: https://github.com/titaiwangms	2025-03-07 00:40:54 +00:00
cyy	50eb4f3990	Enable UBSAN test (#147511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147511 Approved by: https://github.com/colesbury	2025-03-07 00:35:32 +00:00
Richard Barnes	33a285379a	[codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: dtolnay Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501 Approved by: https://github.com/Skylion007	2025-03-07 00:33:39 +00:00
Ting Lu	a0bc6d81bb	[CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602 ) https://github.com/pytorch/pytorch/issues/145570 breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602 Approved by: https://github.com/atalman, https://github.com/malfet Co-authored-by: atalman <atalman@fb.com>	2025-03-07 00:15:04 +00:00
Xilun Wu	e2a0296e80	[dtensor] add CuDNN SDPA op support to DTensor (#148537 ) ### Summary This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops ### Test `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537 Approved by: https://github.com/drisspg, https://github.com/fegin	2025-03-06 23:44:40 +00:00
Syed Tousif Ahmed	3960f97832	Documents torch.cuda.MemPool API (#148374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-03-06 23:18:43 +00:00
Jagadish Krishnamoorthy	ed9c8a5d13	ROCm: Disable torch check for Multiplication of two Float8_e5m2 matrices (#148228 ) ROCm supports Multiplication of two Float8_e5m2 matrices. Hence disabling the torch check for ROCm. Test command (on ROCm h/w supporting fp8) python test/test_matmul_cuda.py TestFP8MatmulCudaCUDA.test_float8_basics_cuda -v Pull Request resolved: https://github.com/pytorch/pytorch/pull/148228 Approved by: https://github.com/jeffdaily, https://github.com/petrex	2025-03-06 22:12:45 +00:00
Aidyn-A	e6800bda7f	[Test][Linalg][CUDA] Increase niter in test_svd_lowrank_cuda_float64 (#145930 ) A recent PR #143049 attempted to increase tolerances to make test passable. However, we are still seeing errors like: ``` Traceback (most recent call last): File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank run_subtest(None, size, (), device, torch.svd_lowrank, density=density) File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest self.assertEqual(A, a, rtol=1e-7, atol=2e-7) File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: Tensor-likes are not close! Mismatched elements: 90 / 1000000 (0.0%) Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed) Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed) ``` Increasing `niter` parameter actually decreases numerical differences. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930 Approved by: https://github.com/ngimel	2025-03-06 22:10:53 +00:00
Blaine Burton Rister	75d29443e7	[Docs] update bucketize documentation (#148400 ) Fixes #144504 Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400 Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD	2025-03-06 22:07:52 +00:00
henrylhtsang	2fb654676f	[cutlass backend] fix assertion that prevent self multiplication (#148233 ) # Problem: In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one. # Solution Just use whichever. Since they are the same. # Question What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233 Approved by: https://github.com/ColinPeppler	2025-03-06 22:02:26 +00:00
henrylhtsang	d35a4ddae2	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-06 22:02:19 +00:00
Ting Lu	5a5ac98918	[aarch64] add libcufile for cu126 and cu128 (#148465 ) seeing ` File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module> from torch._C import * # noqa: F403 ^^^^^^^^^^^^^^^^^^^^^^ ImportError: libcufile.so.0: cannot open shared object file: No such file or directory` with arm cu128 nightly. related to https://github.com/pytorch/pytorch/pull/148137 need to copy the dependency for arm build as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/148465 Approved by: https://github.com/atalman, https://github.com/abhilash1910	2025-03-06 21:39:43 +00:00
lanzongwei.lan	3d62e81a1e	[DCP] fix dcp gather_object/scatter_object_list (#147675 ) gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675 Approved by: https://github.com/MeetVadakkanchery	2025-03-06 21:20:38 +00:00
Ryan Guo	1d7fc0c681	[dynamo] Remove dead code path around `functools.partial` objects (#148683 ) This removes the code paths added in #98120, which has then been superseded by #108846. More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016) easier to reason about, i.e., no need to worry about `dict` types, which was only needed for #98120. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683 Approved by: https://github.com/yanboliang	2025-03-06 21:20:04 +00:00
Shunting Zhang	262411e48b	[inductor] online softmax (#127011 ) Softmax needs to do some preparation work that accesses the input tensor in two passes: - compute amax of each row - compute (x - amax).exp.sum for each row. When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound. Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ). Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54 ## Microbenchmark - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax - eager_ms=6.671296119689941 - opt_ms=8.06931209564209 - `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax - eager_ms=6.634047985076904 - opt_ms=6.230591773986816 Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011 Approved by: https://github.com/jansel	2025-03-06 21:07:18 +00:00
PyTorch MergeBot	cf9efbdf16	Revert "Enable onednn in pytorch for ppc64le architecture (#143743 )" This reverts commit d4cf0e5af406239881acfeb4f9e4f62373faca8b. Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](`d4cf0e5af4`) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))	2025-03-06 20:47:57 +00:00
zeshengzong	1add61c242	Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069 ) Fixes #147913 - replace `unimplemented` in `codegen.py` - remove unused import `unimplemented` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069 Approved by: https://github.com/Skylion007, https://github.com/williamwen42	2025-03-06 20:42:37 +00:00
Aaron Gokaslan	edd640a95a	[BE][Ez]: Use itertools.chain.from_iterable when possible (#148190 ) Often makes the code more readable, more efficient, and adds support for infinite iterables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190 Approved by: https://github.com/jansel, https://github.com/malfet	2025-03-06 20:37:06 +00:00
Nikita Shulga	65dbc3b454	[BE][MPS] Remove redundant `handle_tensor_scalar_binary_op` (#148685 ) After https://github.com/pytorch/pytorch/pull/143934 `mtl_setBuffer` can handle scalar tensors correctly, so no need to have a specialized function here Pull Request resolved: https://github.com/pytorch/pytorch/pull/148685 Approved by: https://github.com/dcci	2025-03-06 19:24:46 +00:00
zeshengzong	29b28e9d9f	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-06 19:04:52 +00:00
Xuehai Pan	f08146b67b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-06 18:59:02 +00:00
PyTorch MergeBot	96176e32a9	Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433 )" This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867. Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))	2025-03-06 18:32:48 +00:00
George Wigley	b85ae06bed	Update CPU tolerance for f16 triplet margin loss (#147742 ) Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64. This PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this, as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-06 18:09:43 +00:00
Bin Bao	d10bacd4ce	[AOTI][dashboard] Skip torchbench models not supported by export (#148359 ) Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359 Approved by: https://github.com/angelayi, https://github.com/ysiraichi	2025-03-06 18:08:17 +00:00
Ke Wen	d91a634edf	[c10d] Make getDefaultBackend more fault tolerant (#148596 ) This is a forward fix for #135338. It hits error like this: ``` "distributed_c10d.py", line 2156, in destroy_process_group if type(pg) == ProcessGroup and pg._has_hooks(): RuntimeError: Could not find the default backend type 0 for Process Group with name undefined. ``` When users call `init_process_group(nothing)`, default backend is not set, or set to `undefined`. Thus the above signature. Triggered by the `_has_hooks()` call. The fix wraps `getDefaultBackend` with a try-catch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596 Approved by: https://github.com/LucasLLC, https://github.com/fduwjj	2025-03-06 18:07:43 +00:00
Tiwari-Avanish	d4cf0e5af4	Enable onednn in pytorch for ppc64le architecture (#143743 ) This PR will enable onednn for powerpc Architecture which will help to do quantization of the model via onednn for powerpc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743 Approved by: https://github.com/malfet, https://github.com/albanD	2025-03-06 18:00:55 +00:00
Xuehai Pan	097b0d372a	[pytree] fix previously failed dynamo tests (#148669 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148669 Approved by: https://github.com/zou3519	2025-03-06 17:59:29 +00:00
PyTorch MergeBot	28b68b46bc	Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233 )" This reverts commit 4aeca28137dcee74b5fcd0c0636d0ee1f113d5fb. Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))	2025-03-06 17:45:49 +00:00
Nikita Shulga	3cde4c3069	[BE] Remove `onlyCPU` decorator from test_local_scalar_dense (#148559 ) Followup from https://github.com/pytorch/pytorch/pull/145717, not sure why author thinks those tests should be limited to one architecture. And fixed similar crashes for CUDA and MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/148559 Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/seemethere	2025-03-06 17:43:02 +00:00
PyTorch MergeBot	841451af9f	Revert "[Inductor] Avoid tensor slice overflow for large step (#147433 )" This reverts commit 1d7397a2d04a4d636559f41511a20f7dadbe5777. Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))	2025-03-06 17:33:08 +00:00
Rachel Guo	679e7d257e	[mm_logs] follow up to add count info based on shape for inductor `aten.mm`s (#148623 ) Summary: as title. when enable `TORCH_LOGS="+inductor"`, you can get logs at the end such as stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), (('aten_addmm', (16, 6, 16)), 1), ('extern_calls', 1), ('async_compile_cache_miss', 1)] graph_break [] Test Plan: follow up to add proper logging test. Differential Revision: D70665104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623 Approved by: https://github.com/henrylhtsang	2025-03-06 16:20:04 +00:00
Benjamin Glass	b160dda743	cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403 ) This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode. Changes: 1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for majority of the memory usage reductions. 2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns. 3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive. The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression. Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403 Approved by: https://github.com/desertfire	2025-03-06 16:08:16 +00:00
David Berard	5fb0f45d3b	[triton 3.3] test_triton_kernel_constants fix (#148626 ) Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505 TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead. This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626 Approved by: https://github.com/FindHao	2025-03-06 14:18:21 +00:00
Mikayla Gawarecki	d5184901c4	Make torch.serialization.skip_data work with torch.load (#148018 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018 Approved by: https://github.com/albanD ghstack dependencies: #147786, #147787, #147788	2025-03-06 12:04:46 +00:00
Mikayla Gawarecki	be0ceee1c3	Make record/storage alignment in torch.save configurable (#147788 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788 Approved by: https://github.com/albanD ghstack dependencies: #147786, #147787	2025-03-06 12:04:46 +00:00
Mikayla Gawarecki	209977e6e5	Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787 Approved by: https://github.com/albanD ghstack dependencies: #147786	2025-03-06 12:04:39 +00:00
Mikayla Gawarecki	bdcc1b579b	Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786 ) This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786 Approved by: https://github.com/albanD	2025-03-06 12:04:32 +00:00
rzou	79aa17489c	[dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570 Approved by: https://github.com/williamwen42 ghstack dependencies: #148454	2025-03-06 07:46:31 +00:00
titaiwangms	e7bc1d1791	[ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617 ) Previous to this PR, if the exporting passes run_decomposition(), the report still shows the exported_program before decomposition, which adds the difficulties to our users when they want to check the exported program that are used to translate to ONNX graph. The following example is what we see before this PR: ``` # PyTorch ONNX Conversion Report ``` ✅ Obtain model graph with `torch.export.export(..., strict=False)` ⚪ Obtain model graph with `torch.export.export(..., strict=True)` ⚪ Obtain model graph with `torch.jit.trace` ✅ Decompose operators for ONNX compatibility ❌ Translate the graph into ONNX ⚪ Run `onnx.checker` on the ONNX model ⚪ Execute the model with ONNX Runtime ⚪ Validate model output accuracy ``` ## Error messages ```pytb Traceback (most recent call last): File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph _handle_call_function_node_with_lowering( File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering raise _errors.DispatchError( torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. 
Failure message: No decompositions registered for the complex-valued input The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export onnx_program = _exported_program_to_onnx_program( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program values = _translate_fx_graph( ^^^^^^^^^^^^^^^^^^^^ File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph raise _errors.ConversionError( torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information. ``` ## Exported program ```python ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x: "f32[3, 4]"): # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64) to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64); x = None # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2] slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807); to = None slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2); slice_1 = None return (slice_2,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)]) Range constraints: {} ``` ## Analysis PyTorch ONNX Conversion Analysis ## Model Information The model has 0 parameters and 0 buffers (non-trainable parameters). 
Number of parameters per dtype: ```python defaultdict(<class 'int'>, {}) ``` Number of buffers per dtype: ```python defaultdict(<class 'int'>, {}) ``` Inputs: - `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})` Outputs: - `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})` The FX graph has 5 nodes in total. Number of FX nodes per op: - `placeholder`: 1 - `call_function`: 3 - `output`: 1 Of the call_function nodes, the counts of operators used are: - `aten.slice.Tensor`: 2 - `aten.to.dtype`: 1 ## ONNX Conversion Information The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher. Errors grouped by operator: - `aten.to.dtype`: No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]` - `aten.slice.Tensor`: No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]` ## Decomposition comparison Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']` Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']` ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617 Approved by: https://github.com/justinchuby	2025-03-06 07:03:45 +00:00
PyTorch MergeBot	ae6bb58483	Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521 )" This reverts commit ad49cfc9f0a8a4d8881b3734edd8c33a087c8b97. Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](`ad49cfc9f0`) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))	2025-03-06 06:59:39 +00:00
PaulZhang12	4dc956a1d8	[Inductor][Triton] Fix test_autotune_inplace_kernel to work with newer Triton version (#148595 ) For new Triton version 3.3, constexpr are included as part of the signature. Update failing test to reflect this change, additional context in https://github.com/pytorch/pytorch/pull/145051. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148595 Approved by: https://github.com/davidberard98	2025-03-06 05:37:08 +00:00
xinan.lin	1fac47702e	[Break XPU][Inductor UT] Generalize device-bias code introduced by #146866 . (#148534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534 Approved by: https://github.com/nandesuka	2025-03-06 04:39:50 +00:00
titaiwangms	f057206fca	[ONNX] Support complex comparison when verify=True (#148619 ) Previously, the comparison of complex numbers was not supported when `verify=True`. NOTE: This PR can be extended to support more complex comparison cases if other places in the ONNX codebase need to be changed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619 Approved by: https://github.com/justinchuby	2025-03-06 04:38:43 +00:00
bobrenjc93	8b65d522e1	refactor delayed compile to use code context (#148530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530 Approved by: https://github.com/williamwen42 ghstack dependencies: #148509	2025-03-06 04:02:30 +00:00
henrylhtsang	ad49cfc9f0	[cutlass backend] Forward fix for less aligned gemm shapes (#148521 ) Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/) 1. Check if config name filtering still works. Tested, it works 2. do we get C++ compile error Yes, potentially we need to filter them out manually. Here we get this. ``` static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0)); ``` We need to move some assertions to gemm_template.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521 Approved by: https://github.com/ColinPeppler	2025-03-06 03:42:55 +00:00
Isalia20	02e1580e39	[MPS] fix crash for mse loss with 0 numel inputs (#148608 ) Fixes #148589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608 Approved by: https://github.com/malfet	2025-03-06 03:32:34 +00:00
James Wu	8728d4b815	Clear triton kernels after parent make_launcher (#148604 ) Before, we were clearing the cache only after inductor compile. But inductor may not always compile, i.e. on AOTAutogradCache hit. So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856 Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604 Approved by: https://github.com/masnesral	2025-03-06 03:28:38 +00:00
cyy	1433bc1455	Remove CAFFE2_USE_EXCEPTION_PTR (#147247 ) The check is for older compilers and is now always true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247 Approved by: https://github.com/janeyx99	2025-03-06 02:56:23 +00:00
maybeLee	43e1284c96	Fix empty matrix handling of addmv in inductor (#143792 ) This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience caused. Below are details of this PR. Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce: ``` import torch import numpy as np @torch.compile def test_optimized(input, mat, vec): return torch.addmv(input, mat, vec) def test(input, mat, vec): return torch.addmv(input, mat, vec) input = torch.tensor([2], dtype=torch.int32) mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32) vec = torch.tensor([]) origin_out = test(input, mat, vec) optimized_out = test_optimized(input, mat, vec) print(origin_out) # tensor([2.]) print(optimized_out) # tensor([]) ``` According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me. Following the CPU implementation of this API (`e97b97af56/aten/src/ATen/native/Blas.cpp (L62)`), I add an additional branch to handle the empty matrix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-06 02:09:27 +00:00
Pat Vignola	38b3375a81	[MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565 ) Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD. Test Plan: CI Reviewed By: mortzur Differential Revision: D70072064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565 Approved by: https://github.com/mortzur	2025-03-06 02:07:18 +00:00
Ruben Rodriguez Buchillon	32715a2311	[inductor][ck] add kBatch_sweep to config.rocm (#148223 ) Summary: # Why enable testing and users to specify a set of kBatches to try rather than relying on our hand written heuristic # What add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime Test Plan: n/a Reviewed By: chenyang78 Differential Revision: D70226055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223 Approved by: https://github.com/ColinPeppler	2025-03-06 01:14:33 +00:00
Shivam Raikundalia	63fbc738dc	[Easy/Profiler] Add last entry to truncated values (#148576 ) Summary: Since the ranks of a PG are usually in a consecutive range it is useful to print the last values when truncating metadata Test Plan: Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces Differential Revision: D70637461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576 Approved by: https://github.com/davidberard98	2025-03-06 01:14:15 +00:00
Thomas Bohnstingl	23441492f6	[scan] Refactoring of input checking and dynamo invocation (#142125 ) This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125 Approved by: https://github.com/ydwu4	2025-03-06 01:06:54 +00:00
Shunting Zhang	6cc3e69103	[inductor] use eager stride for custom op if no tags (#148367 ) Fix https://github.com/pytorch/pytorch/issues/148356 This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags. A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367 Approved by: https://github.com/eellison, https://github.com/zou3519	2025-03-06 00:58:00 +00:00
Prachi Gupta	703176e538	[ROCm] Fix sort for non-standard bool (#147459 ) When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true. In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool. Fixes #139972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-03-06 00:23:02 +00:00
bobrenjc93	690fc2c876	Add aot_eager_then_compile stance (#148509 ) Sometimes `eager_then_compile` stance isn't enough since some models are so close to the memory limit that going to eager will OOM since we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile` which avoids the expensive inductor compile, but still does aot_eager to get the benefits of memory reduction in the first invocation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509 Approved by: https://github.com/williamwen42	2025-03-05 23:23:45 +00:00
Benjamin Glass	d6d670ab4d	[AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587 ) In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits). Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587 Approved by: https://github.com/desertfire	2025-03-05 22:47:46 +00:00
PyTorch MergeBot	897fd9b514	Revert "Subprocess compile (#146134 )" This reverts commit 07f876e9602ec6881df2360ab4817e129b563b7c. Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see `e1dee4ccb3/3` ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))	2025-03-05 22:41:19 +00:00
Justin Chu	e1dee4ccb3	[ONNX] Assert capture strategy in tests (#148348 ) Previously the strategy used for obtaining the exported program is not asserted. This leads to silent errors if torch.export breaks something and a fallback strategy is used. This change adds a _capture_strategy field to ONNXProgram and enables unit tests to assert the strategy used to prevent fallbacks from happening. Fixes #147674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348 Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1	2025-03-05 22:31:54 +00:00
Tugsbayasgalan Manlaibaatar	5ccd659c0e	Fix decomp for linspace (#147997 ) In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization. Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997 Approved by: https://github.com/zou3519	2025-03-05 22:10:08 +00:00
Andy Lugo	9e755a1c03	[ROCm] add gfx12 to nightly wheels (#148562 ) Adds gfx1200 and gfx1201 to PYTORCH_ROCM_ARCH for wheels and libtorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148562 Approved by: https://github.com/jeffdaily	2025-03-05 21:56:22 +00:00
Ankita George	2a639ce1d7	Add new hf storage class to torch.distributed package (#148361 ) Summary: title - Add new hf storage class to torch.distributed package so that it can be imported by customers. The HF storage reader/writer was added as DCP storage components so that DCP load and save can directly interact with hugging face format and storage. Test Plan: ensure signals pass Differential Revision: D70495399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361 Approved by: https://github.com/MeetVadakkanchery	2025-03-05 21:52:06 +00:00
Sam Larsen	10354e146f	Re-enable test_torchinductor:test_buffer_batch_norm (#148573 ) Summary: Per https://github.com/pytorch/pytorch/issues/128198 seems like this is working now Fixes #128198 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148573 Approved by: https://github.com/StrongerXi	2025-03-05 21:51:24 +00:00
fduwjj	87bd3471ff	[c10d] Move record param for init to the right place (#148571 ) The place we do the log of init does not look correct. We move it to the beginning of comm init. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148571 Approved by: https://github.com/kwen2501	2025-03-05 21:43:30 +00:00
Ryan Guo	ad9a10aff0	[dynamo] Make `nonstrict_trace` work with some `pytree.register_constant`-ed instances (#148007 ) As title, this enables `nonstrict_trace`-ed function to take in object whose type has been `pytree.register_constant`-ed, as long as the object existed outside the `torch.compile` region. This also forces Dynamo to emit a `EQUALS_MATCH` guard on the object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007 Approved by: https://github.com/zou3519 ghstack dependencies: #148385	2025-03-05 21:28:26 +00:00
Ryan Guo	a10f577ee0	[dynamo] Account for function id reuse in relevant Dynamo decorators (#148385 ) This fixes a recent series of flaky failures from `nonstrict_trace` unit tests: #148166, #148056, #148055, #148054, #148034, #148033, #148032, #148031. For now we don't need to worry about the other decorators because they are either meant for builtin/numpy functions (which should never deallocate in practice), or used for polyfills, which keep the function object in `get_torch_obj_rule_map()`. Fixes #147777. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148385 Approved by: https://github.com/zou3519	2025-03-05 21:28:26 +00:00
henrylhtsang	4aeca28137	[cutlass backend] fix assertion that prevents self multiplication (#148233 ) # Problem: In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which point to the same node, and we have no way to differentiate which one. # Solution Just use whichever, since they are the same. # Question What if we compile with `A @ A`, and then pass in `A @ B`? Well, inductor guards will raise an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233 Approved by: https://github.com/ColinPeppler	2025-03-05 21:26:22 +00:00
angelayi	ed9624ee60	[export] Fix AttrProxy slicing (#148507 ) Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1159599265750221/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/148507 Approved by: https://github.com/zhxchen17	2025-03-05 21:03:15 +00:00
Nikita Shulga	dd6ec8706e	[BE] Relax sympy dependency to 1.13.3 or newer (#148575 ) Fixes https://github.com/pytorch/pytorch/issues/145225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148575 Approved by: https://github.com/ZainRizvi, https://github.com/atalman	2025-03-05 20:51:16 +00:00
Yanbo Liang	9efa9c73f6	[Dynamo] Replace unimplemented with unimplemented_v2 for variables/distributed (#148500 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148500 Approved by: https://github.com/williamwen42	2025-03-05 20:41:43 +00:00
Thanh Ha	98458e5c81	Add a docstring to build.sh (#144566 ) Add a little blurb to explain what build.sh is doing. Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>	2025-03-05 15:26:37 -05:00
Justin Chu	c6a05df174	[ONNX] Use onnxscript apis for 2.7 (#148453 ) Use onnxscript apis for 2.7. Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453 Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1	2025-03-05 20:10:00 +00:00
PyTorch MergeBot	c9edd37ffb	Revert "[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377 )" This reverts commit 9eef457c0241f87097a2ca7625f9961e31f3adcd. Reverted https://github.com/pytorch/pytorch/pull/148377 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/13683650448/job/38261818684) [HUD commit link](`9eef457c02`) probably landrace ([comment](https://github.com/pytorch/pytorch/pull/148377#issuecomment-2701903810))	2025-03-05 19:45:16 +00:00
IvanKobzarev	c5d92edd5a	[dynamo] WeakRefVar reconstruct (#148083 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148083 Approved by: https://github.com/anijain2305	2025-03-05 19:34:17 +00:00
Justin Chu	50e827b3df	[ONNX] Create VerificationInterpreter (#148396 ) An fx interpreter for comparing ONNX values with pytorch ones. ```py import torch from torch.onnx._internal.exporter._verification import VerificationInterpreter class Model(torch.nn.Module): def forward(self, query, key, value): res = torch.nn.functional.scaled_dot_product_attention( query, key, value ) rest = res.transpose(0, 1) return rest.view(8, 32, 128 * 64) model = Model() query = torch.rand(32, 8, 128, 64, dtype=torch.float16) key = torch.rand(32, 8, 128, 64, dtype=torch.float16) value = torch.rand(32, 8, 128, 64, dtype=torch.float16) onnx_program = torch.onnx.export(model, (query, key, value), dynamo=True) interpreter = VerificationInterpreter(onnx_program) interpreter.run(query, key, value) for info in interpreter.verification_infos: print(info) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148396 Approved by: https://github.com/titaiwangms	2025-03-05 19:18:52 +00:00
Xinya Zhang	8af79b7ec8	[ROCm] Bump AOTriton to 0.9.1b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore. * `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead. * Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-05 19:11:57 +00:00
Xilun Wu	9eef457c02	[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377 ) ### Summary This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with unit test. This should allow Context Parallel and Tensor Parallel to use cudnn SDPA. ### Test `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377 Approved by: https://github.com/drisspg	2025-03-05 19:09:52 +00:00
Ting Lu	9dd46a9233	Deprecate sm70 for cuda 12.8 binary (#147607 ) follow up for https://github.com/pytorch/pytorch/pull/146265/files, dropping sm_70 as well, since "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147607 Approved by: https://github.com/atalman	2025-03-05 18:54:17 +00:00
Wang, Chuanqi	3f4311d589	[CD] Upgrade xpu runtime pypi packages version and enable windows kineto again (#148319 ) Fixes https://github.com/pytorch/pytorch/issues/145155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148319 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-03-05 18:39:55 +00:00
angelayi	9db9593bba	Add some more meta kernels (#147862 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862 Approved by: https://github.com/zou3519	2025-03-05 18:33:00 +00:00
Tugsbayasgalan Manlaibaatar	e555c4d8ae	Fix bug in AOTI lowering (#148364 ) Fixes: https://github.com/pytorch/pytorch/issues/148370 Differential Revision: [D70514480](https://our.internmc.facebook.com/intern/diff/D70514480) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148364 Approved by: https://github.com/desertfire	2025-03-05 18:27:15 +00:00
ZhaoqiongZ	38479e495e	Add note to get start xpu (#148168 ) Installing PyTorch from binaries will automatically install the runtime packages of Intel® Deep Learning Essentials. In this case, if we activate oneAPI in a standalone installation of Intel® Deep Learning Essentials, there will be an environment issue. Therefore, add a note to remind users to avoid this situation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148168 Approved by: https://github.com/janeyx99 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-03-05 18:11:14 +00:00
Marko Radmilac	c65ee728f0	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-03-05 16:13:19 +00:00
Andy Lugo	70c5edb697	[ROCm] fix CK compile for gfx1200 (#148496 ) gfx1200 causes the CK-based GEMM to fail to compile because CK is choosing an incorrect FP8 interpretation. CK assumes FP8 interpretation is static and chosen prior to compilation. This PR is a work-around that makes the selection dynamic during hipclang compilation passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496 Approved by: https://github.com/jeffdaily	2025-03-05 16:11:03 +00:00
Nikita Shulga	864b75dd50	[MPS] Fix unary_kernel_strided logic (#148512 ) Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350 Before this change ``` % python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))" tensor([[ 0.0000, 1.4142, 2.0000, 2.4495], [ 80.0000, 82.0000, 84.0000, 86.0000], [ 96.0000, 98.0000, 100.0000, 102.0000], [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0') ``` After this change ``` % python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))" tensor([[0.0000, 1.4142, 2.0000, 2.4495], [4.0000, 4.2426, 4.4721, 4.6904], [5.6569, 5.8310, 6.0000, 6.1644], [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0') ``` Matching input and output strides alone is not enough to skip the copy: one also needs to make sure that the tensors are dense-in-storage (a transposed tensor would be dense, but, say, selecting every odd or even column wouldn't). Add a regression test to prevent this from happening again. Also, there is no need to check that sizes match; luckily this is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked). Revived needs_copy_logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 is landed Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512 Approved by: https://github.com/janeyx99	2025-03-05 15:57:54 +00:00
Aidyn-A	8274da9312	[c10d][PGNCCL] Fix capturability of isend and irecv (#148462 ) This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode. <details> <summary>The repro code</summary> ```Python import os import torch import torch.distributed as dist USE_ASYNC = True def test_func(x, rank): if rank == 0: x += 1 # Send the tensor to process 1 if USE_ASYNC: a = dist.isend(tensor=x, dst=1) else: dist.send(tensor=x, dst=1) else: # Receive tensor from process 0 if USE_ASYNC: a = dist.irecv(tensor=x, src=0) else: dist.recv(tensor=x, src=0) if USE_ASYNC: a.wait() return x + 2 def run(rank): torch.cuda.set_device(rank) x = torch.ones(1, device='cuda') with torch.cuda.stream(torch.cuda.Stream()): for i in range(11): x.copy_(torch.ones(1, device='cuda')) y = test_func(x, rank) print(f"Rank{rank} has data {y} in warmup") torch.cuda.synchronize() graph = torch.cuda.CUDAGraph() x.copy_(torch.ones(1, device='cuda')) with torch.cuda.graph(graph): y = test_func(x, rank) for i in range(1): x.copy_(torch.ones(1, device='cuda')) graph.replay() print(f"Rank{rank} has data {y} after graph replay") def main(): rank = int(os.environ['RANK']) local_rank = int(os.environ['LOCAL_RANK']) world_size = int(os.environ['WORLD_SIZE']) dist.init_process_group('nccl', rank=rank, world_size=world_size) run(local_rank) if __name__ == "__main__": main() ``` </details> Fails with an error stating that work handle is of a NoneType: ``` [rank1]: Traceback (most recent call last): [rank1]: File "/workspace/repro.py", line 54, in <module> [rank1]: main() [rank1]: File "/workspace/repro.py", line 51, in main [rank1]: run(local_rank) [rank1]: File "/workspace/repro.py", line 38, in run [rank1]: y = test_func(x, rank) [rank1]: ^^^^^^^^^^^^^^^^^^ [rank1]: File "/workspace/repro.py", line 22, in test_func [rank1]: a.wait() [rank1]: ^^^^^^ [rank1]: AttributeError: 'NoneType' object has no attribute 'wait' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462 Approved by: https://github.com/kwen2501	2025-03-05 15:49:53 +00:00
Sun, Jiayi	19a6cf35f6	add input shape check for _local_scalar_dense (#145717 ) Fix https://github.com/pytorch/pytorch/issues/145066. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145717 Approved by: https://github.com/malfet	2025-03-05 15:24:08 +00:00
Aidyn-A	96afa8a2bb	[TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318 ) Due to introduction of CUDA versions, the branching becomes more complicated. This PR is proposed to simplify branching in `test_cusparselt_backend` in order to avoid checking each and every CUDA version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318 Approved by: https://github.com/jcaip	2025-03-05 10:17:00 +00:00
Nichols A. Romero	0ef2e938d0	[ROCm] [TunableOp] Track top solutions during tuning process (#147243 ) For each set of GEMM parameters that are evaluated by Tunableop, keep track of the top 5 solutions. Print the top 5 solutions when `PYTORCH_TUNABLEOP_VERBOSE=2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147243 Approved by: https://github.com/jeffdaily	2025-03-05 09:35:02 +00:00
Jithun Nair	6c3492b491	[ROCm] Enable mi300-specific workflows to be triggered on PRs (#147904 ) This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label. * inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300. * rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](`79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)`) on many PRs. * inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904 Approved by: https://github.com/huydhn	2025-03-05 06:00:37 +00:00
eellison	2295efa1b3	Fix only logging ir_post_fusion with torch_compile_debug enabled (#148499 ) Because we were invoking the logs through `V.debug`, it was not running if TORCH_COMPILE_DEBUG was not set. This is because there is some magic in the debug [getattr](`d789c22712/torch/_inductor/debug.py (L468-L480)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/148499 Approved by: https://github.com/shunting314	2025-03-05 05:35:09 +00:00
zeshengzong	fb1b7ec173	Remove deprecated method and attribute in `LRScheduler` (#147301 ) Following [#99270 suggestion](https://github.com/pytorch/pytorch/issues/99270#issuecomment-1511656408), remove deprecated method `LRScheduler.print_lr` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147301 Approved by: https://github.com/janeyx99	2025-03-05 05:30:19 +00:00
Bin Bao	df7e43e5d4	[AOTI] Fix aot_inductor_package test errors (#148279 ) Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory. Differential Revision: D70454149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279 Approved by: https://github.com/zoranzhao	2025-03-05 05:22:48 +00:00
henrylhtsang	b020d166f2	stage 1 of deprecating silent fallback of tuning gemm (#147798 ) Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/) context: https://github.com/pytorch/pytorch/issues/147479 For the most part, this should not change the behavior. For int_mm, I also removed ``` # TODO: Re-enable eager mode implementation once cuBLAS is fixed if use_cutlass or use_triton_template(layout, enable_int32=True): choices = [] ``` because I think it is unwanted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798 Approved by: https://github.com/eellison	2025-03-05 05:15:59 +00:00
Laith Sakka	913356fb41	Fix recent regression in evaluate_expr that affects cache lookups (#147836 ) PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging. This caused a regression that we thought was due to calling id on symnode. I dug deeper and found that adding that argument, although it does not affect the results of evaluate_expr, messes up the cache lookups. I refactored the code to avoid using expr_sym_node_id in the cache lookup. I also introduced evaluate_sym_node and simplified the calls to evaluate_expr #suppress-bc-linter Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836 Approved by: https://github.com/oulgen	2025-03-05 04:11:41 +00:00
henrylhtsang	ed8ec0cb98	[cutlass backend][BE] Fix two small things in cutlass backend standalone debugger (#148493 ) Differential Revision: [D70583777](https://our.internmc.facebook.com/intern/diff/D70583777/) Two really small things: * The bits in BlockFillRandomUniform would round float to ints * when bias exists, the order of args are C, A, B, D Pull Request resolved: https://github.com/pytorch/pytorch/pull/148493 Approved by: https://github.com/chenyang78	2025-03-05 04:01:36 +00:00
Wang, Chuanqi	e0ea593974	[CD] Upgrade Windows xpu support package to 2025.0.1 for binary compression (#148313 ) The binary compression feature can reduce the size of the Torch XPU Windows wheel packages Pull Request resolved: https://github.com/pytorch/pytorch/pull/148313 Approved by: https://github.com/atalman	2025-03-05 03:00:27 +00:00
Rachel Guo	1673bc7610	[mm_logs][ez] dump tuned mm info at lowering stage (#148363 ) Summary: As title. it would be beneficial for judging e2e perf improvement Easy first step to dump mm info at lowering stage. e.g. ``` fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1]) ``` Next step: Dump overview info at `post_grad_graph` stage such as overall count of `aten.mm` in the graph & visualize to a table structure. Test Plan: by looking very hard in aot inductor bmm and mm UTs. Differential Revision: D70507880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363 Approved by: https://github.com/henrylhtsang	2025-03-05 02:21:27 +00:00
wdziurdz	edc3ca577e	[Profiler] Add profiler activity for HPU devices (#148182 ) Fixes #148181 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148182 Approved by: https://github.com/sraikund16	2025-03-05 01:37:48 +00:00
William Wen	3985ce0b88	[dynamo] rename test_graph_break_messages -> test_error_messages (#148220 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148220 Approved by: https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #148205	2025-03-05 01:16:53 +00:00
William Wen	b28cbe5db3	[dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205 Approved by: https://github.com/zou3519	2025-03-05 01:16:53 +00:00
Mitchell, Frost	2927a64357	[inductor][cpu] Fix error with FlexibleLayout weights in BMM (#148188 ) Fixes #148074 When node A is reshaped (is a `ReinterpretView`) and node B has a `FlexibleLayout`, then the layout of node B may be changed during the `kernel.select(options["W"], 0, self.b_index)` call, which could cause the assertion in `kernel.select` to fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148188 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2025-03-05 01:05:05 +00:00
Animesh Jain	713a504a82	[dynamo][guards] Fix mem leak caused by refcount increment (#148480 ) Should help [internalfb.com/sevmanager/view/491701](https://www.internalfb.com/sevmanager/view/491701) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148480 Approved by: https://github.com/xmfan, https://github.com/StrongerXi, https://github.com/williamwen42, https://github.com/zou3519	2025-03-05 01:04:08 +00:00
Mwiza Kunda	b5873292c6	Add overload names to profiler trace (#143114 ) Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance. For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add: ``` [call] op=[aten::add.Tensor], key=[AutogradCPU] [redispatch] op=[aten::add.Tensor], key=[Undefined] [call] op=[aten::empty.memory_format], key=[BackendSelect] [redispatch] op=[aten::empty.memory_format], key=[CPU] [call] op=[aten::add.out], key=[CPU] ``` In this case, aten::add.out is a child of aten::add.Tensor; however, the current profiler trace provides no way to differentiate aten op calls. See the added unit test for a more detailed example. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114 Approved by: https://github.com/sraikund16	2025-03-05 01:00:29 +00:00
Daniel Vega-Myhre	cf5e3f3cea	Add cutlass kernel for rowwise scaled mm on sm100 (#148421 ) ### Important - Previous PR in stack https://github.com/pytorch/pytorch/pull/148274 - Despite the changes between sm90 vs sm100 being fairly minimal, I created a separate kernel since we'll be making various arch specific perf optimizations to the sm100 kernel next. - This kernel has not been optimized yet. However, initial perf testing shows numbers which indicates the tensorcores are being utilized as expected (not just CUDA cores). ### Summary of changes - This PR adds a new cutlass kernel for rowwise GEMM on sm100. - sm100 kernel is based on sm90 kernel, with the following changes: - Use new arch tag `cutlass::arch::Sm100` - Do not use [large tile](`4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)`) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors) - SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/ ### Next steps - Arch specific performance optimization Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421 Approved by: https://github.com/drisspg	2025-03-05 00:46:01 +00:00
rzou	a907b6abae	[compiled_autograd] workaround windows compilation issue (#148454 ) torch.compile doesn't work on windows so we can ifdef-away the problem. I do not know what the root cause actually is. Most notably, the pytorch windows build is fine, but some third-party projects that use pytorch headers on windows (e.g. torchaudio) have issues. Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454 Approved by: https://github.com/atalman, https://github.com/xmfan	2025-03-05 00:18:20 +00:00
Howard Huang	e02a2ca07a	Fix dist.init_process_group on windows (#148266 ) Fix https://github.com/pytorch/pytorch/issues/139990 We don't build libuv on windows so anything that creates `TCPStore` which includes `init_process_group()` will fail, which is a bad experience. We should just default to `USE_LIBUV=0` for windows. There were a decent amount of hits for this [error on google ](https://www.google.com/search?q=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&sca_esv=921f59ac5f8bd98a&sxsrf=AHTn8zpG3PxdKoomFHkclOc451rBhoc3jw%3A1740854890873&source=hp&ei=albDZ5GHM-uIptQP4NTikQw&iflsig=ACkRmUkAAAAAZ8Nkei9H-aB2IBCk3pUOK3yFl5xBLZUt&ved=0ahUKEwiR5P7qxemLAxVrhIkEHWCqOMIQ4dUDCBg&uact=5&oq=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&gs_lp=Egdnd3Mtd2l6IkN1c2VfbGlidXYgd2FzIHJlcXVlc3RlZCBidXQgUHlUb3JjaCB3YXMgYnVpbGQgd2l0aG91dCBsaWJ1diBzdXBwb3J0SABQAFgAcAB4AJABAJgBAKABAKoBALgBA8gBAPgBAvgBAZgCAKACAJgDAJIHAKAHAA&sclient=gws-wiz) and https://github.com/pytorch/pytorch/issues/139579, so I figured we should add a more helpful message as well. We don't have CI for windows and our support is just best effort, so I just tested these changes on my windows machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148266 Approved by: https://github.com/d4l3k	2025-03-05 00:07:56 +00:00
lzhang2	84b58bd63e	Enable FSDP tests on XPU device (#147518 ) Motivation: Enable FSDP tests on XPU device Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518 Approved by: https://github.com/weifengpy	2025-03-04 23:49:37 +00:00
eellison	c98c3af421	Add a couple config options to compiler bisector (#148450 ) These are commonly source of bugs/divergence (through bad interactions etc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148450 Approved by: https://github.com/shunting314	2025-03-04 23:23:21 +00:00
Isalia20	0c0a4baddd	[MPS] unary kernels - avoid copying tensors if they have same stride (#148350 ) I was a bit concerned when I saw in #148272 that the metal unary kernel ran at 0.02x of the performance we had with MPS Graphs for sqrt (on non-contiguous tensors). This change makes it so that copying is only done if the input and output tensors don't have the same strides. So if the out tensor is not provided, we don't copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script that I listed at the end of the above PR has the same execution time as the non-transposed one. Times for reference (on a transposed tensor, where the matrix is NxN): \| N \| time_old \| time_new \| \|-------\|--------------------\|--------------------\| \| 100 \| 0.0002241021 \| 0.0001548659 \| \| 1000 \| 0.0005934822 \| 0.0002150342 \| \| 10000 \| 0.3242016407 \| 0.0045755033 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350 Approved by: https://github.com/janeyx99	2025-03-04 23:20:26 +00:00
Nikita Shulga	ade4af8c95	[MPS][BE] Fix `c10::metal::sinc` implementation (#148471 ) Restrict scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make complex implementation a template for `is_complex_v` types This makes its eager kernel implementation for both real and complex type a trivial call to the template Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471 Approved by: https://github.com/dcci ghstack dependencies: #148398, #148399, #148448, #148449	2025-03-04 23:14:03 +00:00
Eddie Yan	93e9daed54	[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178 ) Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend. CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178 Approved by: https://github.com/jbschlosser	2025-03-04 23:09:09 +00:00
Catherine Lee	d789c22712	Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469 ) The github provided ubuntu-20.04 gha runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101) so upgrade workflows using them to the latest runner 24.04 They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115 ``` [do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885) This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101 ``` Should we be using ubuntu-latest instead? I attempted to upgrade actionlint to 1.7.7 but on my local in test-infra it seems to add a lot of new checks, and on test-infra's CI, I seem to have uploaded the wrong executable or something so it failed. I'll try again later Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-03-04 22:29:04 +00:00
Nichols A. Romero	5f47b7e268	[ROCm][TunableOp] Unit test for offline tuning of GEMM with bias (#148371 ) One more unit test for the offline version of TunableOp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148371 Approved by: https://github.com/jeffdaily	2025-03-04 22:24:27 +00:00
Nikita Shulga	842ffea445	[MPS][BE] Towards strided unary ops support (#148449 ) Add generic functors kernels and rewrite all existing implementations into functors Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #148398, #148399, #148448	2025-03-04 22:22:39 +00:00
Justin Chu	70d0e1b96a	Bump onnxscript to 0.2.2 in CI (#148388 ) Unblock https://github.com/pytorch/pytorch/pull/148140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388 Approved by: https://github.com/malfet	2025-03-04 22:09:50 +00:00
Pian Pawakapan	c677f3251f	[export] don't use unbacked_renamings in export (#147574 ) Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized). For de/serialization, we don't store unbacked bindings, and just rerun the pass. Involved a refactor of compute_unbacked_bindings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574 Approved by: https://github.com/avikchaudhuri	2025-03-04 21:43:49 +00:00
Eli Uriegas	84961a0c17	ci: Add workflow dispatch for commit hash update (#148486 ) Maybe this should also be split into its own workflow instead of piggy backing off of nightly? Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148486 Approved by: https://github.com/clee2000 ghstack dependencies: #148466, #148472	2025-03-04 21:26:23 +00:00
Eli Uriegas	d290186ed3	ci: Add triton to update hash workflow (#148472 ) Adds triton to our auto-update workflows so that PRs can be automatically made and the triton team can follow up to fix any issues that may arise. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148472 Approved by: https://github.com/Camyll, https://github.com/atalman ghstack dependencies: #148466	2025-03-04 21:26:23 +00:00
Eli Uriegas	9be8f74156	ci: Consolidate commit hash updates into a matrix (#148466 ) Consolidates all of our commit hash update jobs into a single matrix to make it easier to add more jobs later on. Side note: How do I even test if this works? Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148466 Approved by: https://github.com/Camyll, https://github.com/clee2000, https://github.com/atalman	2025-03-04 21:26:13 +00:00
dan_the_3rd	d1abde11ec	[dynamo] Support passing arguments to `DeviceMesh.get_group` (#147741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147741 Approved by: https://github.com/StrongerXi	2025-03-04 21:19:47 +00:00
Zain Rizvi	f30776c37a	[BE] Upgrade to mypy 1.14 (#145966 ) Upgrade mypy version Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966 Approved by: https://github.com/Skylion007	2025-03-04 20:58:26 +00:00
Angela Yi	60205b0eb2	[export] Fix logging so that it doesn't result in max recursion error (#148231 ) Test Plan: buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id=487493491 --test_suite ads_all --mode test_full_model Produces https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp2wsjQH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 Differential Revision: D70416613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148231 Approved by: https://github.com/yiming0416	2025-03-04 20:47:25 +00:00
Thomas Bohnstingl	e4c558be1d	[scan] Corrections for scan (#146110 ) This PR resolves some minor issues with the scan HOP and unifies the handling of the additional_inputs in the same way as for associative_scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146110 Approved by: https://github.com/ydwu4	2025-03-04 20:29:08 +00:00
Isalia20	439395c0ae	[MPS] add slogdet and logdet implementations to mps (#148287 ) Low-hanging fruit: all ops these depend on are already implemented, so just adding them to native functions adds the functionality on MPS. The next op I should add is probably lu solve, seeing how many ops need it for the grad calculation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287 Approved by: https://github.com/malfet	2025-03-04 19:49:23 +00:00
PyTorch MergeBot	92beda54c8	Revert "[fx] Move map_aggregate to C++ (#148243 )" This reverts commit edaff88f69f069d517b72ea23fd5eb04702eb0b5. Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))	2025-03-04 19:40:21 +00:00
PyTorch MergeBot	17d003fe75	Revert "[fx] Move Node._update_args_kwargs to C++ (#148260 )" This reverts commit 0135f57f4aaeaba8d720f551eab6dca6fcede8cd. Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))	2025-03-04 19:40:21 +00:00
PyTorch MergeBot	97b9e68bc6	Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 )" This reverts commit 29c2de9ae16f1673f3f44363243294d403e53d37. Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))	2025-03-04 19:40:21 +00:00
PyTorch MergeBot	6fb18ff685	Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303 )" This reverts commit a3d69e6e1a530ae2b91cd549ea26aac51ffc7566. Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))	2025-03-04 19:40:21 +00:00
PyTorch MergeBot	63778cb8a0	Revert "[Inductor] Record Triton’s Base32 Cache Key in `.best_config` for Debugging (#147019 )" This reverts commit e3e45d90d8578083da8b51a3b1d911e9a4523e5b. Reverted https://github.com/pytorch/pytorch/pull/147019 on behalf of https://github.com/clee2000 due to broke inductor test inductor/test_max_autotune.py::TestMaxAutotune::test_cat_max_autotune_extern [GH job link](https://github.com/pytorch/pytorch/actions/runs/13653495421/job/38171259603) [HUD commit link](`e3e45d90d8`) on inductor workflow and rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/147019#issuecomment-2698677222))	2025-03-04 19:20:15 +00:00
PyTorch MergeBot	9d196edb7d	Revert "Bump onnxscript to 0.2.2 in CI (#148388 )" This reverts commit 7ab6749ec7db32e0b3cdfd19db087f15dd0bebe2. Reverted https://github.com/pytorch/pytorch/pull/148388 on behalf of https://github.com/clee2000 due to broke libtorch debug build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/13646179239/job/38152039312) [HUD commit link](`7ab6749ec7`) ([comment](https://github.com/pytorch/pytorch/pull/148388#issuecomment-2698665495))	2025-03-04 19:16:34 +00:00
Afanti	c219c5ca38	Fix code descriptions in the test package. (#148145 ) Some parameter and function descriptions were wrong; this corrects them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145 Approved by: https://github.com/janeyx99	2025-03-04 19:14:41 +00:00
Nikita Shulga	e8900fbe4f	[MPS] Add some useful utils (#148448 ) Like `is_complex_v`, `is_scalar_integral_v`, `result_of` etc Pull Request resolved: https://github.com/pytorch/pytorch/pull/148448 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #148398, #148399	2025-03-04 19:09:17 +00:00
Wanchao Liang	f859722f70	[dtensor] refactor sharding prop to handle cross mesh computation (#147869 ) as titled, this PR moves the same mesh check from the sharding propagation level to each individual operator level. This is to allow more flexibility for each individual operator to check whether the operator can be run on the same mesh or not. For example, before this PR if a user has two DTensor params that live on different DeviceMeshes, and wants to run a `for_each` operator on them individually, it would error out with a cross mesh error. But for foreach computation there could be DTensors that live on different meshes, as long as the meshes are the same in a "zipped way". This should also fix https://github.com/pytorch/pytorch/issues/134212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869 Approved by: https://github.com/tianyu-l	2025-03-04 18:30:44 +00:00
Eli Uriegas	eea54a55f6	ci: Switch manywheel build.sh to just use dev (#148310 ) To avoid annoying error message like: > fatal: no tag exactly matches 'a6520c85bd85875b09f2c68e51622699d7d07595' These were popping up when GITHUB_REF is not set so let's just assume that if someone is building without directly setting GITHUB_REF then they're probably doing a dev build. Signed-off-by: Eli Uriegas <github@terriblecode.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148310 Approved by: https://github.com/Camyll, https://github.com/atalman	2025-03-04 18:27:44 +00:00
PyTorch MergeBot	611b0e9bc4	Revert "[fx] Optimizations for node name generation (#148288 )" This reverts commit 5eb0337cfd5e7c2cdf4a2d4829609e391467270f. Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](`8531d247ba`). dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))	2025-03-04 17:10:12 +00:00
PyTorch MergeBot	ed9055c303	Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292 )" This reverts commit 8531d247ba411993f9a10686d70514f6945f9960. Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](`8531d247ba`). dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))	2025-03-04 17:10:12 +00:00
Nikita Shulga	67937be673	[BE] Move `sinc` kernels to the same OP family (#148399 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148399 Approved by: https://github.com/dcci ghstack dependencies: #148398	2025-03-04 15:49:20 +00:00
Nikita Shulga	7fcbaff206	[BE] Remove stale arg for complex ops (#148398 ) No need to pass DTYPE0 and DTYPE1 if only one DTYPE is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/148398 Approved by: https://github.com/dcci	2025-03-04 14:35:43 +00:00
Jiang, Yanbing	f2f25a5444	Upgrade submodule oneDNN to v3.7.1 (#148293 ) This PR is to upgrade submodule oneDNN to v3.7.1. ## Improvements - Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids). - Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support. - Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids). - Introduced initial optimizations for Intel GPUs based on Xe3 architecture. - Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA. - Fixed f16 matmul accuracy, the issue of SDPA cannot dispatched to ukernel, bf16/fp16/fp32 conv performance, INT8 Kernel trigger page fault, deconvolution precision issue on complex128 and fp64 and gemm correctness issue in float16 issues. - Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL). - Improved bf16 to fp32 reorder performance. - Improved bf16 reorder performance. - Improved bf16 convolution with ACL. Fixes https://github.com/pytorch/pytorch/issues/136348. ## Validation results on CPU 1. NLP models accuracy/inference/training ![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8) ![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab) 2. Torchbench cpu userbenchmark inference & training ![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd) 3. Inductor quantization ![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675) 4. Dynamo benchmarks ![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd) ![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b) ![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1) ![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd) ![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805) ![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88) ![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431) ![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b) ## Validation results on XPU Accuracy is same as baseline. Performance is shown below. ![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0) ## Validation results on ARM ![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb) ![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293 Approved by: https://github.com/mingfeima, https://github.com/atalman	2025-03-04 13:56:45 +00:00
Mwiza Kunda	f339e41a38	[inductor][triton] Fix average pool nd for int64 dtype (#146061 ) The eager mode implementation of average pool nd returns an integer tensor if the input is also an integer tensor. This should also be preserved in inductor. Fixes pytest -k test_comprehensive_nn_functional_avg_pool2d_cpu_int64 error: Triton compilation failed: triton_poi_fused_avg_pool2d_0 See WIP https://github.com/pytorch/pytorch/pull/145865#issuecomment-26200289890 to potentially enable such tests as they aren't enabled yet. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146061 Approved by: https://github.com/eellison	2025-03-04 13:53:50 +00:00
Meet Vadakkanchery	fdee60769a	[DCP] Introduce process based async checkpointing (#147039 ) Summary: ### Context Background checkpoint upload thread interfering with trainer thread: In [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration. ### Solution: Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime. Test Plan: Added E2E UTs for process based async save. Differential Revision: D69272583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039 Approved by: https://github.com/saumishr	2025-03-04 13:33:28 +00:00
taozhiwei	16d07988fc	add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 ) 1. My company is using privateuseone to connect a new hardware device and requires the use of the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to determine whether the backend supports coalescing. 2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-03-04 12:37:06 +00:00
fulvius31	e3e45d90d8	[Inductor] Record Triton’s Base32 Cache Key in `.best_config` for Debugging (#147019 ) Modified TorchInductor’s autotuning flow so that each `best_config` JSON file also includes the Triton “base32” (or base64) cache key. Motivation Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config. The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging. Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019 Approved by: https://github.com/davidberard98	2025-03-04 12:16:38 +00:00
Alexander Grund	f1cce0951b	Create unique test report files for distributed tests (#148325 ) The distributed tests are executed once for each backend and for each init method. `$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files. The same needs to be done for the init method. Move the setting of the variable into `test_distributed` and incorporate the init method into the name. Useful for e.g. https://github.com/pytorch/pytorch/issues/126523 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325 Approved by: https://github.com/clee2000	2025-03-04 10:45:33 +00:00
zeshengzong	0b0d28accd	Optimize param `prepend` class reference `torch.nn.Module` (#148304 ) Fixes #147696 ## Changes Change `prepend` description `torch.nn.modules.Module` to `torch.nn.Module` ## Test Result ### Before ![image](https://github.com/user-attachments/assets/054f54b7-9487-4505-a926-3e17a84bd2f9) ### After ![image](https://github.com/user-attachments/assets/1d2a5708-62d1-428e-b136-bcaa35e5e6da) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148304 Approved by: https://github.com/Skylion007	2025-03-04 08:46:14 +00:00
bobrenjc93	da2688f624	Introduce delayed compile via `eager_then_compile` stance (#147983 ) Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic). Another benefit of this is that most users no longer need to annotate their inputs with mark_dynamic and mark_unbacked calls since we can derive the dynamism on the very first call. Additionally we get dynamic ints out of the box in this new regime. This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983 Approved by: https://github.com/williamwen42	2025-03-04 07:46:31 +00:00
drisspg	e0f0db0105	updates to benchmarks (#144831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831 Approved by: https://github.com/danielvegamyhre	2025-03-04 06:21:12 +00:00
Daniel Vega-Myhre	ac99fc7e57	Updates to build rowwise scaled mm kernel on SM10.0a (#148274 ) ## Summary Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch. NOTE: performance optimization will be done in separate follow up PRs ## Steps to verify build 1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi` 2. Install CUDA toolkit 12.8 - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local) 3. Verify CUDA toolkit installation - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output 4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a` 5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source)) 6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton` 7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source 8. Run tests shown in test plan below NOTE: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly. ## Test plan - `python test/distributed/tensor/test_matrix_ops.py -k scaled_mm`: OK - `python test/test_matmul_cuda.py -k rowwise`: OK - `python test/test_flop_counter.py -k scaled_mm`: OK - `python test/inductor/test_aot_inductor.py -k fp8`: OK - `python test/inductor/test_fp8.py`: OK Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274 Approved by: https://github.com/drisspg	2025-03-04 05:23:41 +00:00
Justin Chu	7ab6749ec7	Bump onnxscript to 0.2.2 in CI (#148388 ) Unblock https://github.com/pytorch/pytorch/pull/148140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388 Approved by: https://github.com/malfet	2025-03-04 04:21:58 +00:00
Richard Barnes	d54cab78e1	[codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393 ) Summary: The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers. This can be unintended such as: ``` my_struct s1 = {0}; // Initializes only the first field to zero; others to default values my_struct s2 = {}; // Initializes all fields to default values (often zero) ``` or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated. To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: dtolnay Differential Revision: D70472663 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393 Approved by: https://github.com/Skylion007	2025-03-04 04:20:04 +00:00
Dmitry Rogozhkin	70410f93f2	doc/xpu: align description of SyclExtension with CPP/CUDA (#147988 ) This commit just aligns description of `py_limited_api` feature in SyclExtension with CPP/CUDA. We've missed this change on doing SyclExtension due to parallel work on the changes. For CPP/CUDA change was done in 515e55e6927ad5f57ec222d7779712630341acf3. CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988 Approved by: https://github.com/janeyx99, https://github.com/guangyey	2025-03-04 04:17:36 +00:00
cyy	ec2805ada8	Remove outdated CUDA version check (#148142 ) Since Torch requires CUDA>=11, some checks can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142 Approved by: https://github.com/janeyx99, https://github.com/eqy	2025-03-04 03:33:44 +00:00
cyy	98bf2f1170	Use Python 3.9 typing (#148157 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157 Approved by: https://github.com/janeyx99	2025-03-04 03:09:55 +00:00
cyy	b7832f0339	Enable ASAN in CUDA tests (#147812 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147812 Approved by: https://github.com/janeyx99	2025-03-04 02:50:39 +00:00
Jason Ansel	8531d247ba	[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292 ) Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261, #148303, #148288	2025-03-04 02:42:23 +00:00
Jason Ansel	5eb0337cfd	[fx] Optimizations for node name generation (#148288 ) Before: ![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe) After: ![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261, #148303	2025-03-04 02:42:23 +00:00
Jason Ansel	a3d69e6e1a	Better log message to update pr_time_benchmarks/expected_results.csv (#148303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303 Approved by: https://github.com/Skylion007 ghstack dependencies: #148243, #148260, #148261	2025-03-04 02:42:23 +00:00
Henry Tsang	17518007b2	[cutlass backend] Benchmark compared to aten and triton (#148347 ) Benchmark for cutlass backend. ``` python benchmarks/inductor_backends/cutlass.py ``` Test Plan: ``` Experiment group: mm (1024x1024, 1024x1024) torch.float16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 12.759539298713207 \| 2.7271360370796174 \| NA \| \| triton \| 10.573655366897583 \| 1.8661278090439737 \| -17.131370346859384 \| \| triton_persistent_tma \| 10.884030722081661 \| 0.5315794269554317 \| -14.698873781600327 \| \| cutlass_lvl_default \| 13.09632882475853 \| 0.5520401500398293 \| 2.6395116481931873 \| \| cutlass_lvl_1111 \| 11.05172373354435 \| 0.569593315012753 \| -13.384617776451302 \| \| cutlass_lvl_2222 \| 11.371277272701263 \| 133.58984916994814 \| -10.880189272601317 \| +-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 14.472318813204765 \| 1.5445372510002926 \| NA \| \| triton \| 10.568295605480671 \| 16.583424195996486 \| -26.975796056689987 \| \| triton_persistent_tma \| 10.45411266386509 \| 5.830657540936954 \| -27.764770809729562 \| \| cutlass_lvl_default \| 12.742593884468079 \| 28.994930602959357 \| -11.951954286402668 \| \| cutlass_lvl_1111 \| 11.522261425852776 \| 79.85037935699802 \| -20.38413764531163 \| \| cutlass_lvl_2222 \| 10.993581265211105 \| 132.86601971101481 \| -24.037181552548486 \| 
+-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (2048x2048, 2048x2048) torch.float16 +-----------------------+--------------------+----------------------+---------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+---------------------+ \| aten \| 30.700622126460075 \| 2.225986961973831 \| NA \| \| triton \| 29.17378954589367 \| 38.571991189033724 \| -4.97329524553989 \| \| triton_persistent_tma \| 29.642896726727486 \| 7.2848734309664 \| -3.4452897904663744 \| \| cutlass_lvl_default \| 29.514770954847336 \| 29.819900761009194 \| -3.8626291243482167 \| \| cutlass_lvl_1111 \| 29.411429539322853 \| 23.82907024596352 \| -4.19923929172139 \| \| cutlass_lvl_2222 \| 29.57325428724289 \| 134.31008586101234 \| -3.672133530628152 \| +-----------------------+--------------------+----------------------+---------------------+ Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16 +-----------------------+--------------------+----------------------+--------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+--------------------+ \| aten \| 30.858177691698074 \| 1.181898436974734 \| NA \| \| triton \| 28.630023822188377 \| 39.24473957403097 \| -7.220626868414034 \| \| triton_persistent_tma \| 28.641965240240097 \| 5.275042273919098 \| -7.181929126210897 \| \| cutlass_lvl_default \| 29.16003204882145 \| 29.934022572939284 \| -5.503065216107967 \| \| cutlass_lvl_1111 \| 28.79570797085762 \| 23.948012012057006 \| -6.683705504085324 \| \| cutlass_lvl_2222 \| 29.02756631374359 \| 136.25560767308343 \| -5.932337924306467 \| +-----------------------+--------------------+----------------------+--------------------+ Experiment group: mm (8192x8192, 8192x8192) torch.float16 
+-----------------------+--------------------+----------------------+--------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+--------------------+ \| aten \| 1456.143856048584 \| 1.020197194069624 \| NA \| \| triton \| 1708.2737684249878 \| 5.766509635956027 \| 17.31490410985819 \| \| triton_persistent_tma \| 1476.485013961792 \| 7.455113030038774 \| 1.3969195302177155 \| \| cutlass_lvl_default \| 1583.3594799041748 \| 50.408804678940214 \| 8.736473620182366 \| \| cutlass_lvl_1111 \| 1636.4418268203735 \| 82.82403108896688 \| 12.381879030898025 \| \| cutlass_lvl_2222 \| 1507.5665712356567 \| 260.03901409788523 \| 3.531430975962381 \| +-----------------------+--------------------+----------------------+--------------------+ Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16 +-----------------------+--------------------+----------------------+--------------------+ \| name \| forward_time (us) \| compilation_time (s) \| perf_over_aten (%) \| +-----------------------+--------------------+----------------------+--------------------+ \| aten \| 1382.230520248413 \| 1.2586536260787398 \| NA \| \| triton \| 1646.9683647155762 \| 5.442052865982987 \| 19.15294450447995 \| \| triton_persistent_tma \| 1423.9195585250854 \| 6.515797697938979 \| 3.016069871556595 \| \| cutlass_lvl_default \| 1500.9030103683472 \| 51.36402789200656 \| 8.58557877152115 \| \| cutlass_lvl_1111 \| 1446.9740390777588 \| 30.65435610699933 \| 4.683988515729638 \| \| cutlass_lvl_2222 \| 1419.661521911621 \| 205.1948991640238 \| 2.7080144096717635 \| +-----------------------+--------------------+----------------------+--------------------+ ``` Differential Revision: D70147589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347 Approved by: https://github.com/drisspg, https://github.com/chenyang78	2025-03-04 01:45:36 +00:00
Ding, Yi1	c21dc11a17	[Intel GPU] Enable SDPA on XPU (#147614 ) Motivation === This PR is part of the plan of OneDNN Upstreaming, as #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203) stated. The support of SDPA is via the overridable variant on XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for OneDNN graph including those for kernel/compile graph caching. In addition, a selection of testcases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`. Depends on OneDNN version v3.7 upgrade in #147498 Depends on BUILD_GRAPH switch in #147608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614 Approved by: https://github.com/jansel, https://github.com/EikanWang	2025-03-04 01:40:45 +00:00
Shangdi Yu	b17f5223a4	Generate AOTI input check by default (#148005 ) Summary: Generate AOTI size and stride input check by default. But the checks are only run if `AOTI_RUNTIME_CHECK_INPUTS` env variable is set (to avoid slowing down the performance). Example output: ```cpp bool _check_aoti_runtime_check_inputs_env() { const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS"); const static bool result = env_var_value != nullptr && env_var_value[0] != '\0'; return result; } AOTI_NOINLINE static void __check_inputs_outputs( AtenTensorHandle* input_handles, AtenTensorHandle* output_handles) { if (!_check_aoti_runtime_check_inputs_env()){ return; } //rest of the check } ``` Test Plan: CI Differential Revision: D70260490 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005 Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh	2025-03-04 00:55:14 +00:00
atalman	0bd2caac55	Docker release - pin buildkit to v0.19.0 (#148372 ) Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851 Error: ``` #10 73.62 Segmentation fault (core dumped) #10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped #10 73.85 Segmentation fault (core dumped) #10 73.85 dpkg: error processing package libc-bin (--configure): #10 73.85 installed libc-bin package post-installation script subprocess returned error exit status 139 ``` Looks like we are hitting: https://github.com/moby/buildkit/issues/5783 Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0 Please note: CUDA 12.8 error is not related to this failure in nightly cpu arm64. Looks like we are trying to install release torch when running on PR. Cuda 12.8 build is not released yet, hence a failure. Will send followup to make sure we are using nightly torch when running on PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372 Approved by: https://github.com/seemethere	2025-03-03 23:55:30 +00:00
Animesh Jain	d43c6f0033	[invoke_subgraph] Run joint passes on the hop graphs (#139325 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139325 Approved by: https://github.com/bdhirsh, https://github.com/zou3519 ghstack dependencies: #147559	2025-03-03 23:38:14 +00:00
Ethan Wee	216a108aaf	[ROCm] Add rocm-mi300 and inductor-rocm-mi300 to upload-test-stats.yml (#148365 ) We currently run MI300X machines on rocm-mi300 and inductor-rocm-mi300 but we don't have artifacts for the results: e.g. `6e10471966 (rocm-mi300)` ![image](https://github.com/user-attachments/assets/f5588072-b818-4f54-a348-0e6ac7e96829) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148365 Approved by: https://github.com/jeffdaily	2025-03-03 23:22:56 +00:00
Nikita Shulga	586d8df651	Fix condition for `CONVERT_NON_VECTORIZED_INIT` invocation (#148362 ) Yet another regression caused by https://github.com/pytorch/pytorch/pull/146596 that breaks builds if PyTorch is compiled for Android or using NVIDIA GraceHopper systems. Not sure why the author was trying to change the condition to begin with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148362 Approved by: https://github.com/izaitsevfb ghstack dependencies: #148354	2025-03-03 23:13:37 +00:00
Nikita Shulga	5887a2d8de	[BE] Use `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED` (#148354 ) Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"` Also limit the scope to just `Vectorized::map` that has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16 as it results in an unbalanced pop warning, for push that is defined in vec256_16bit_float, which will be included only once ``` In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15: /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas] 232 \| #pragma GCC diagnostic pop \| ^ 1 warning generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354 Approved by: https://github.com/izaitsevfb	2025-03-03 23:00:47 +00:00
henrylhtsang	d0b23e661d	[cutlass backend] Add main tests for mm, addmm and bmm - step 1 (#148229 ) This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}. There are some parts that are less tested. For example: * different layout combo * shapes that are less aligned Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229 Approved by: https://github.com/chenyang78	2025-03-03 22:31:46 +00:00
Jane (Yuan) Xu	a41413829c	Use release notes label for module: distributed_checkpoint (#148352 ) module: distributed_checkpoint is redundant with oncall: distributed checkpointing. @fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352 Approved by: https://github.com/fegin	2025-03-03 21:33:28 +00:00
ankurneog	e45040b1d3	[c10d] Add hccl distributed backend to c10d data structures (#146478 ) # MOTIVATION Intel Gaudi is an out-of-tree PyTorch accelerator having its own device /dispatch key ```hpu``` . With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures. This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name. The Out-of-tree backends are registered calling `fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)` Successful registration adds the backend name to the list : `fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)` We are binding the process group creator constructs at run-time so if there are other distributed backend with the same device name they can safely add the device type to the dictionary `fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)` And add another entry to the dictionary with the same backend name ( but different device name ) `fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)` In addition the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration eg: APIs like ```is_hccl_available``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478 Approved by: https://github.com/H-Huang	2025-03-03 21:32:21 +00:00
nandesuka	52078154f2	Add support for no-op concat with padded output (#146866 ) Add support for no-op concat with padded output Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866 Approved by: https://github.com/shunting314	2025-03-03 21:10:46 +00:00
Aaron Orenstein	07f876e960	Subprocess compile (#146134 ) Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer). Added a test which runs the test_torchinductor tests with subprocess compiling turned on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134 Approved by: https://github.com/jamesjwu	2025-03-03 21:10:12 +00:00
William Wen	8f361c808b	[dynamo] run-only recursively on recompile limit exceeded (#148021 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148021 Approved by: https://github.com/anijain2305	2025-03-03 21:01:08 +00:00
FFFrog	1bbe57336b	Replace unimplemented with unimplemented_v2 for dynamo (#148158 ) torch/_dynamo/variables/constant.py https://github.com/pytorch/pytorch/issues/147913 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148158 Approved by: https://github.com/williamwen42, https://github.com/Skylion007	2025-03-03 21:00:17 +00:00
Anatoly Myachev	b162b1600b	[Inductor] Hot fix after #148011 (#148270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148270 Approved by: https://github.com/davidberard98	2025-03-03 20:18:21 +00:00
Prachi Gupta	d260d4fc55	HSDP custom hook UTs are multi-threaded - can't set device rank (#148099 ) HSDP custom hook UTs are multi-threaded and using single physical GPU. If we set rank in each thread, then we are referencing the same GPU with multiple ranks, which isn't right. Therefore, removing the rank setting from these UTs. Now, they are passing with 1, 2, 4 GPUs. Fixes #147767 and #147769 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099 Approved by: https://github.com/jeffdaily	2025-03-03 19:48:49 +00:00
Alexander Grund	302c660298	Consistently use load_torchbind_test_lib in tests (#148082 ) The same code is repeated multiple times with slightly different implementations. Use the existing function for brevity and consistency. In the function the code from `test_export` is used which does a single `load_library` with cleaner conditions Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082 Approved by: https://github.com/angelayi	2025-03-03 19:37:28 +00:00
Sam Larsen	40c2505f16	[logging] Log individual Triton kernel compilation times to dynamo_compile (#147022 ) Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile: * Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`. * Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now. * Format the list of compile times as a json string before logging. Test Plan: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt` Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022 Approved by: https://github.com/jamesjwu	2025-03-03 19:32:17 +00:00
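The "top N" tracking described in the commit above can be sketched with a bounded min-heap. This is a minimal illustration only, not the code from the PR: the `TopN` class, its methods, and the kernel names are hypothetical, and the real metrics context may be structured differently. The sketch keeps the N most expensive compile times and formats them as a JSON string, as the commit message describes.

```python
import heapq
import json


class TopN:
    """Hypothetical helper: keep only the N largest (seconds, name) pairs."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap, so the cheapest tracked entry is at index 0

    def add(self, name, seconds):
        entry = (seconds, name)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Evict the cheapest tracked compile in favour of the new one.
            heapq.heapreplace(self._heap, entry)

    def to_json(self):
        # Most expensive first, serialized as a JSON string for logging.
        ordered = sorted(self._heap, reverse=True)
        return json.dumps([{"kernel": k, "seconds": s} for s, k in ordered])


tracker = TopN(n=2)  # the commit uses 25; 2 keeps the example short
for name, secs in [("k_a", 0.5), ("k_b", 2.0), ("k_c", 1.25), ("k_d", 0.1)]:
    tracker.add(name, secs)
top_json = tracker.to_json()
```

Bounding the heap keeps memory and log size constant no matter how many kernels are compiled, which matches the stated motivation of not recording thousands of entries.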
Carlos Mocholi	aade4fbd55	Expose the rendezvous keepalive arguments (#145228 ) Enables support for this: ```python from torch.distributed.launcher.api import LaunchConfig config = LaunchConfig( ..., rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5}, ) ``` These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks. Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228 Approved by: https://github.com/wconstab	2025-03-03 19:11:56 +00:00
Pian Pawakapan	a929e11e4f	[dynamic shapes][export] ignore when real-tensor fallback fails (#147779 ) Summary: uninspired solution to https://github.com/pytorch/pytorch/issues/147402 Test Plan: test_draft_export Differential Revision: D70132269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147779 Approved by: https://github.com/bobrenjc93	2025-03-03 19:09:56 +00:00
cyy	09291817b2	Fix extra semicolon warning (#148291 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291 Approved by: https://github.com/Skylion007	2025-03-03 18:51:44 +00:00
sanchitintel	1c544a9ddd	[Inductor-CPP] If all of the activation scale dims are 1, make it a 0D tensor (#147033 ) For int8 dynamically quantized activation & int8 quantized weights, add a workaround for some indexing issue that expected an empty index ( so, was expecting a 0D tensor) in epilogue creator when the activation scale was sized [1, 1] by converting it into a 0D tensor. The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless). The final hidden states tensor that's activation to LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel	2025-03-03 18:32:27 +00:00
Oguz Ulgen	57addfcd58	Significantly speed up save_cache_artifacts (#148227 ) While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning serialization algorithm. Essentially what we want is to be able to call serialize many times without incurring costs from scratch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227 Approved by: https://github.com/jamesjwu ghstack dependencies: #148226	2025-03-03 17:28:41 +00:00
Nikita Shulga	3ca1a2564d	[BE][MPS] Use `copysign` for imaginary part of sqrt (#148286 ) Also it's tempting trying to replace `aa + bb` with `dot(input[index])` but for some reason it results in a slightly different output Pull Request resolved: https://github.com/pytorch/pytorch/pull/148286 Approved by: https://github.com/dcci ghstack dependencies: #148285	2025-03-03 16:03:54 +00:00
Nikita Shulga	84502baaff	[MPS] Fix sqrt and other for `torch.chalf` (#148285 ) Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test ``` % python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())" ``` failing with ``` RuntimeError: Failed to create function state object for: sqrt_complex_half_half ``` As sqrt is not implemented for CPU, add explicit test to `test_sqrt` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285 Approved by: https://github.com/dcci	2025-03-03 16:03:54 +00:00
jianan-gu	d57f617844	[Inductor][CPP] Avoid transpose with cpp micro-gemm for FlexAttention (#147069 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147069 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/drisspg ghstack dependencies: #147068	2025-03-03 15:22:11 +00:00
Wang, Chuanqi	6c089f5da3	ci: move xpu triton build to manylinux 2.28 (#148195 ) Follow PR #148129 to remove manylinux builds for triton xpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/148195 Approved by: https://github.com/seemethere	2025-03-03 12:31:08 +00:00
leslie-fang-intel	165e33531c	[Inductor][CPP] Fix the vec codegen for tanh (#148254 ) Summary Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case after `tanh`, the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254 Approved by: https://github.com/malfet, https://github.com/jgong5	2025-03-03 11:46:57 +00:00
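The amplification described in the commit above can be reproduced with the standard library alone. The sketch below takes the two `tanh` outputs quoted in the message and feeds each to `atan2`. Passing the same value as both arguments is an assumption made here for illustration; the actual test case is not shown in the log, although this choice is consistent with the quoted angles.

```python
import math

# Values quoted in the commit message for position [0,0,11,47].
eager_tanh = -5.820766091346741e-10
compiled_tanh = 1.4319084584712982e-08

# Assumption (not from the source): both atan2 arguments carry the same value.
# Near zero the absolute difference is tiny, but the sign flips, which moves
# the result to the opposite quadrant.
eager_angle = math.atan2(eager_tanh, eager_tanh)           # about -3*pi/4
compiled_angle = math.atan2(compiled_tanh, compiled_tanh)  # about  pi/4

abs_diff_in = abs(eager_tanh - compiled_tanh)
abs_diff_out = abs(eager_angle - compiled_angle)
```

An input discrepancy on the order of 1e-8 becomes an output discrepancy of about pi, which is why a small error in the decomposed `tanh` was enough to fail the comparison.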
CaoE	118a165ac5	[Inductor][CPP] Add transposed B matrix support for CppMicroGemmFP32Vec (#147068 ) * Add transposed B support for CppMicroGemmFP32Vec. * Add support for cases where N is not divisible by `block_n`. Expand CppMicroGemmFP32Vec to generate gemm kernel that supports transposed B and N of arbitrary size. This is the basis for https://github.com/pytorch/pytorch/pull/147069 to get better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147068 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2025-03-03 11:08:23 +00:00
Wang, Eikan	6a3a1f96ce	Enable XPU for Inductor MM Triton Kernel Benchmark (#148237 ) #147620 enabled `force_shape_pad` for triton kernel benchmark. Intel GPU supports this scenario. Hence, we need to enable the case in this PR. Otherwise, there would be a test case regression for Intel GPU as #147620 has been landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237 Approved by: https://github.com/jansel	2025-03-03 10:09:06 +00:00
CaoE	b3bb73e11c	Separate transpose from memory load/store and add load size support for convert_to_int32 (#147067 ) Separate transpose from memory load/store and add load size support for convert_to_int32 to facilitate the expansion for CppMicroGemmFP32Vec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147067 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-03 02:56:16 +00:00
Xia, Weiwen	ab81ca5053	[Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten._weight_int4pack_mm_for_cpu` with minor changes. Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested. Test plan ``` python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-03 00:56:29 +00:00
PyTorch MergeBot	608377d341	Revert "[import][inductor] Simplify grid handling (#147583 )" This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e. Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))	2025-03-03 00:49:32 +00:00
Jason Ansel	29c2de9ae1	[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` after: ``` 20003454 function calls (19203257 primitive calls) in 8.936 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260	2025-03-02 22:42:31 +00:00
Jason Ansel	0135f57f4a	[fx] Move Node._update_args_kwargs to C++ (#148260 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` after: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260 Approved by: https://github.com/oulgen ghstack dependencies: #148243	2025-03-02 22:42:31 +00:00
Jason Ansel	edaff88f69	[fx] Move map_aggregate to C++ (#148243 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 30603618 function calls (29403419 primitive calls) in 13.744 seconds ``` after: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243 Approved by: https://github.com/oulgen	2025-03-02 22:42:31 +00:00
PyTorch MergeBot	94afb165d9	Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478 )" This reverts commit dae3fbfe9720e83e7e81d41430fb5067221bbed7. Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see `dae3fbfe97` ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))	2025-03-02 21:22:04 +00:00
Nikita Shulga	1106eb0212	[BE] Fix extra semicolon warning (#148284 ) Introduced by https://github.com/pytorch/pytorch/pull/146596 I.e. while building locally my log was littered with ``` In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15: /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi] 228 \| LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16); \| ^ 2 warnings generated. [230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7: In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14: /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi] 228 \| LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284 Approved by: https://github.com/Skylion007	2025-03-02 19:06:46 +00:00
Aaron Gokaslan	6d70b42810	[BE][Ez]: Update fmt submodule to 11.1.4 (#148264 ) This minor release is mostly bugfixes, ABI fixes, and compiler support fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264 Approved by: https://github.com/jansel, https://github.com/cyyever	2025-03-02 19:00:00 +00:00
Nikita Shulga	95d81d21a6	[MPS] Speedup interpolation (#148277 ) First of all, perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true (due to the bug in the script that did not call `torch.mps.synchronize` at the end of the benchmark script), but still slightly better than MPS, probably due to the launch overhead. And while measuring performance correctly, I've noticed that a lot of time is spent on 64-bit integral division of thread_index to get spatial coordinates. Simply downcasting the divisor to a 32-bit integer (which is also the thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script ```python import torch import time import subprocess import itertools def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5): # Create example inputs x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype) # define kwargs kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf} # Skip for unimplemented flavors if antialias and mode == "bicubic" and device == "mps": return None, "Skip" elif antialias and dtype != torch.float32: if device == "cpu": return None, "Skip" outputs_match = None else: # Check output y = torch.nn.functional.interpolate(x, **kwargs) z = torch.nn.functional.interpolate(x.cpu(), **kwargs) outputs_match = torch.allclose(y.cpu(), z) if not outputs_match: atol = (y.cpu() - z).abs().max() rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max() print(f"atol={atol} rtol={rtol}") # Measure time manually start_time = time.time() * 1000 for _ in range(1000): y = torch.nn.functional.interpolate(x, **kwargs) torch.mps.synchronize() end_time = time.time() * 1000 manual_delta = (end_time - start_time) average_time = f"{manual_delta:6.1f}" return "True " if outputs_match else "False", average_time brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip() 
for mode,antialias in itertools.product(["bilinear", "bicubic"], [False, True]): outputs_match_list = [] average_time_list = [] for device in ["mps", "cpu"]: for dtype in [torch.float32, torch.float16, torch.bfloat16]: outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias) outputs_match_list.append(str(outputs_match)) average_time_list.append(average_time) print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:") print("-"*40) print("Device : MPS \| CPU") print("Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16") print(f"Outputs Match : ", " \| ".join(outputs_match_list)) print(f"Average Time (us) :", " \|".join(average_time_list)) ``` Before ``` Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation : ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : True \| True \| True \| True \| True \| True Average Time (us) : 292.0 \| 264.7 \| 267.9 \| 289.1 \| 230.9 \| 309.1 atol=1.430511474609375e-06 rtol=0.11363636702299118 Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias: ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : False \| False \| False \| True \| None \| None Average Time (us) : 698.3 \| 684.2 \| 683.8 \| 851.0 \|Skip \|Skip atol=2.086162567138672e-06 rtol=0.019750799983739853 Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation : ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : False \| True \| True \| True \| True \| True Average Time (us) : 314.3 \| 301.0 \| 298.8 \| 681.5 \| 616.7 \| 833.7 ``` After ``` Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation : ---------------------------------------- Device : MPS \| 
CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : True \| True \| True \| True \| True \| True Average Time (us) : 119.9 \| 98.9 \| 98.6 \| 289.8 \| 231.9 \| 308.5 atol=1.430511474609375e-06 rtol=0.05681818351149559 Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias: ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : False \| False \| False \| True \| None \| None Average Time (us) : 541.9 \| 531.1 \| 531.0 \| 846.8 \|Skip \|Skip atol=2.0265579223632812e-06 rtol=0.008604463189840317 Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation : ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : False \| True \| True \| True \| True \| True Average Time (us) : 314.3 \| 301.0 \| 298.8 \| 681.5 \| 616.7 \| 833.7 ``` TODO: - Figure out if this ops make more sense as 3D jobs with n and c channels dispatch as one more dimension Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277 Approved by: https://github.com/Skylion007	2025-03-02 17:13:52 +00:00
cyy	9aa897b992	Remove unnecessary tensor clone (#148159 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159 Approved by: https://github.com/Skylion007	2025-03-02 16:21:39 +00:00
Ding, Yi1	1d7397a2d0	[Inductor] Avoid tensor slice overflow for large step (#147433 ) Fixes #147071 Currently, if step is a value very close to INT64_MAX, the calculation of slice output length will overflow. This PR tries to fix this problem and thus fix #147071. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147433 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-03-02 16:07:15 +00:00
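The overflow described in the commit above can be illustrated without PyTorch. The sketch below is not the fix from the PR, whose diff is not shown in this log; it only contrasts a common ceiling-division idiom that wraps in 64-bit arithmetic with a formulation that avoids the large intermediate sum. Python integers do not overflow, so `wrap_int64` is a hypothetical helper that simulates two's-complement wraparound.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(value):
    """Simulate signed 64-bit two's-complement wraparound (illustrative helper)."""
    return (value + 2**63) % 2**64 - 2**63


def slice_len_naive(start, end, step):
    # ceil((end - start) / step) written as (n + step - 1) // step.
    # With step close to INT64_MAX the addition wraps in int64.
    n = end - start
    return wrap_int64(n + step - 1) // step


def slice_len_safe(start, end, step):
    # Same ceiling division, but no intermediate exceeds max(n, step).
    n = end - start
    if n <= 0:
        return 0
    return n // step + (1 if n % step else 0)


naive = slice_len_naive(0, 10, INT64_MAX)  # wraps and yields a negative length
safe = slice_len_safe(0, 10, INT64_MAX)    # one element, matching range()
```

For ordinary steps both functions agree; they only diverge once `n + step - 1` no longer fits in 64 bits.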
Colin Peppler	9c506aa8a6	[aotinductor] add option to disable runtime assertions (#146462 ) A recent user experience is like this: * User runs AOTI lowering, it's successful. * They take AOTI model and run it with some sample inputs. Everything runs well * Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests. * They see that some of the requests fail. The logs show them this: * AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ... * Error: u45 >= 2 * To the untrained eye, "AOTInductorModel run failed" is all they see. But, the true reason is Error: u45 >= 2 However, the assertion isn't always correct. * In fact, u45 can actually be 0. * So, why did AOTI say u45 ≥ 2? It's a two-piece combo: * With 0/1 Specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value-range of [2, inf] * In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45). * Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion. So, the motivation for this PR is to add an option to disable the runtime assertions if you run into a situation like this. However, another way to avoid this is to call `mark_unbacked()` on all the dynamic dims. @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462 Approved by: https://github.com/desertfire, https://github.com/22quinn	2025-03-02 09:14:58 +00:00
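The range propagation described in the commit above can be modelled with a toy interval type. This is a simplified sketch, not how `ShapeEnv` is implemented: `ValueRange` and its methods are hypothetical names, and the starting range `[0, inf]` for the unbacked symbol is an assumption for illustration. It shows why learning `Eq(s0, u45)` gives `u45` the `[2, inf]` range that 0/1 specialization assigned to `s0`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueRange:
    """Toy closed interval [lower, upper]; a stand-in for a symbol's value range."""

    lower: float
    upper: float

    def intersect(self, other):
        return ValueRange(max(self.lower, other.lower), min(self.upper, other.upper))

    def contains(self, value):
        return self.lower <= value <= self.upper


# 0/1 specialization: a backed symbol such as s0 starts at [2, inf].
s0 = ValueRange(2, math.inf)
# Assumed starting range for an unbacked symbol such as u45.
u45 = ValueRange(0, math.inf)

# Eq(s0, u45): both symbols must now satisfy both ranges.
refined = s0.intersect(u45)

# A runtime check derived from `refined` rejects the legitimate value 0.
rejects_zero = not refined.contains(0)
```

The original range of `u45` admits 0, while the refined range does not, which is the incorrect `u45 >= 2` assertion from the log excerpt.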
Oguz Ulgen	26358fa2d8	Add AppendingByteSerializer class (#148226 ) This PR adds a new util class that enables efficient appending of sequential byte data with custom serialization and deserialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148226 Approved by: https://github.com/aorenste	2025-03-02 08:20:58 +00:00
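One way to get cheap repeated serialization of append-only data is a length-prefixed record buffer, sketched below. This is an illustration of the general idea only and makes no claim about the API of the `AppendingByteSerializer` class added in the PR: `AppendOnlyRecords`, its methods, and the record layout are all hypothetical.

```python
import struct


class AppendOnlyRecords:
    """Hypothetical sketch: length-prefixed records appended to one growing buffer."""

    def __init__(self, encode):
        self._encode = encode      # caller-supplied item -> bytes function
        self._buf = bytearray()

    def append(self, item):
        payload = self._encode(item)
        # 8-byte little-endian length prefix, then the payload itself.
        self._buf += struct.pack("<Q", len(payload))
        self._buf += payload

    def to_bytes(self):
        # Earlier records are never re-encoded; only a buffer copy is made.
        return bytes(self._buf)

    @staticmethod
    def read_all(data, decode):
        items, offset = [], 0
        while offset < len(data):
            (size,) = struct.unpack_from("<Q", data, offset)
            offset += 8
            items.append(decode(data[offset:offset + size]))
            offset += size
        return items


log = AppendOnlyRecords(encode=lambda s: s.encode("utf-8"))
log.append("kernel_a")
first = log.to_bytes()
log.append("kernel_b")
second = log.to_bytes()
restored = AppendOnlyRecords.read_all(second, decode=lambda b: b.decode("utf-8"))
```

Because each snapshot is a prefix of the next, encoding work grows with the number of new items rather than with the total, which is the property the follow-up `save_cache_artifacts` commit relies on when it is called after every batch.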
Jason Ansel	b59776d857	[import][inductor] Simplify grid handling (#147583 ) Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Note the attached diff contains some minor fbcode-only changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-03-02 07:31:07 +00:00
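The expression `(xnumel + 1023) >> 10` in the launcher above is a ceiling division by a block size of 1024. A small sketch makes the equivalence explicit; `grid_size` is a hypothetical helper written for this note, not a function from Inductor, and it assumes the block size is a power of two.

```python
def grid_size(xnumel, block=1024):
    """Number of blocks needed to cover xnumel elements (block must be a power of two)."""
    shift = block.bit_length() - 1
    if 1 << shift != block:
        raise ValueError("block must be a power of two")
    # Adding block - 1 before shifting rounds up instead of down.
    return (xnumel + block - 1) >> shift


# The launcher's expression and a plain ceiling division agree.
samples = [0, 1, 1023, 1024, 1025, 5000]
from_shift = [(n + 1023) >> 10 for n in samples]
from_helper = [grid_size(n) for n in samples]
from_ceil = [-(-n // 1024) for n in samples]
```

Specializing this arithmetic per `triton.Config` lets the block size become a constant in the generated launcher, so no closure or dictionary lookup is needed at call time.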
ankurneog	dae3fbfe97	[c10d] Add hccl distributed backend to c10d data structures (#146478 ) # MOTIVATION Intel Gaudi is an out-of-tree PyTorch accelerator having its own device /dispatch key ```hpu``` . With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures. This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name. The Out-of-tree backends are registered calling `fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)` Successful registration adds the backend name to the list : `fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)` We are binding the process group creator constructs at run-time so if there are other distributed backend with the same device name they can safely add the device type to the dictionary `fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)` And add another entry to the dictionary with the same backend name ( but different device name ) `fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)` In addition the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration eg: APIs like ```is_hccl_available``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478 Approved by: https://github.com/H-Huang, https://github.com/guangyey	2025-03-02 05:13:48 +00:00
Boyuan Feng	6e10471966	[ci] disable cudagraph for tts_angular on dashboard (#148221 ) tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular. [Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221 Approved by: https://github.com/eellison	2025-03-02 03:31:19 +00:00
Daniel Vega-Myhre	de7af81f18	[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001 ) Fixes https://github.com/pytorch/torchtitan/issues/864 ## Summary While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to. My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)`) - specifically when row-wise scales are being used. ## TL;DR of root cause - When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned. - In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op. ## Example - Concrete example: - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](`8706d3f3b0/torchao/float8/float8_linear.py (L70)`). Torchao does a reshape -> scaled mm -> reshape [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](`8706d3f3b0/torchao/float8/float8_ops.py (L152)`). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1). 
- During post grad pass in async TP: - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`)) - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)). ## Solution Note: the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics. - Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape. - reshape is just a view, so there should be no impact on performance ``` Before: reshape (a,b,c) to (a*b,c) -> reciprocal After: reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c) ``` - Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor` ## Test plan - Added unit test which exercises this new path - Manually tested with torchtitan with float8 rowwise + async TP Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001 Approved by: https://github.com/yifuwang	2025-03-02 03:25:28 +00:00
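The shape mismatch described in the commit above can be traced with plain shape arithmetic. This is only a sketch of the bookkeeping, using the shapes quoted in the message; `flatten_leading` is a hypothetical helper, and no torch or torchao code is reproduced here:

```python
from math import prod


def flatten_leading(shape):
    """Shape after reshaping (a, b, ..., c) to (a*b*..., c) (hypothetical helper)."""
    return (prod(shape[:-1]), shape[-1])


a_shape = (1, 8192, 2048)   # A tensor before the reshape
scale_shape = (1, 8192, 1)  # its row-wise scale before the reshape

flat_a = flatten_leading(a_shape)          # what the scaled mm sees
flat_scale = flatten_leading(scale_shape)  # the scale the pass picks up

# The pass pairs the 3D A node with the 2D scale, so the ranks disagree.
mismatch = len(a_shape) != len(flat_scale)

# Short-term fix: reshape the (reciprocal of the) scale back so its
# leading dims match the 3D A node again.
restored_scale = a_shape[:-1] + (flat_scale[-1],)
```

Since the restoring reshape is a view, it changes only the shape metadata, which is why the message expects no performance impact.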
Phillip Liu	ce2f680e00	[fr] Added protection against missing stack frames in fr (#148203 ) Summary: We have seen quite a few failures due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf Test Plan: Reviewed By: fduwjj Differential Revision: D70358287 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203 Approved by: https://github.com/fduwjj	2025-03-02 01:03:49 +00:00
Isalia20	19de523de6	[MPS] metal unary kernel for sqrt (#148272 ) Issue #148219 highlighted the high dispatch times of ops which ran with MPS Graph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue. ## Speedups: Matrix size means NxN matrix here. ![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81) Code to generate the times (requires building torch both before and after this change to get the old and new times): ```python import torch import numpy as np import time import csv matrix_sizes = [1, 100, 1000, 10_000] num_runs = 1000 warmup_runs = 3 def run_sqrt(A): torch.mps.synchronize() start = time.perf_counter() c = torch.sqrt(A) torch.mps.synchronize() end = time.perf_counter() return c, end - start results = { 'N': [], 'mean_time': [], 'std_time': [] } for n in matrix_sizes: print(f"\nBenchmarking N={n}") try: A_mps = torch.rand((n, n), dtype=torch.float32, device="mps") for _ in range(warmup_runs): _, _ = run_sqrt(A_mps) times = [] for _ in range(num_runs): _, t = run_sqrt(A_mps) times.append(t) mean_time = np.mean(times) std_time = np.std(times) results['N'].append(n) results['mean_time'].append(mean_time) results['std_time'].append(std_time) print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s") except RuntimeError as e: print(f"Error for N={n}: {e}") continue with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f: writer = csv.writer(f) writer.writerow(['N', 'mean_time', 'std_time']) for i in range(len(results['N'])): writer.writerow([ results['N'][i], results['mean_time'][i], results['std_time'][i] ]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272 Approved by: https://github.com/malfet	2025-03-02 00:45:45 +00:00
Wei-Sheng Chin	1a6883759d	Fix macro for bit_cast in c10/util/bit_cast.h - one line change (#148265 ) Fixes #148263. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148265 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-03-01 20:55:31 +00:00
PyTorch MergeBot	1919e0de9a	Revert "stage 1 of depreate silent fallback of tuning gemm (#147798 )" This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8. Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))	2025-03-01 20:04:23 +00:00
bobrenjc93	82603fd7d2	introduce dynamism library (#147981 ) This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981 Approved by: https://github.com/pianpwk, https://github.com/wdvr	2025-03-01 19:57:54 +00:00
Richard Barnes	5301710b15	[codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555 ) Summary: LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Differential Revision: D69945678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-03-01 19:46:13 +00:00
Rengan Xu	0ff2e6a85a	Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102 ) Summary: When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior. This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results. Test Plan: Unit Test: ``` buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg ``` Differential Revision: D69998314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102 Approved by: https://github.com/davidberard98, https://github.com/chenyang78	2025-03-01 18:38:33 +00:00
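The index bookkeeping described in the commit above can be sketched in plain Python. This is an illustration under assumptions, not the AOTI code: `drop_none_args` is a hypothetical helper, and the real `triton_meta["configs"]` structure is not reproduced. The point is that the remapped `equal_to_1` indices are returned as a fresh list rather than written back to shared state, so a second call for the same kernel starts from the original indices.

```python
def drop_none_args(args, equal_to_1):
    """Drop None arguments and remap equal_to_1 indices (hypothetical helper).

    Neither input is mutated; the remapped indices are local to this call.
    """
    # Positions of the arguments that survive the filtering.
    kept = [i for i, arg in enumerate(args) if arg is not None]
    # Map each surviving old position to its new position.
    remap = {old: new for new, old in enumerate(kept)}
    new_args = [args[i] for i in kept]
    new_equal_to_1 = [remap[i] for i in equal_to_1 if i in remap]
    return new_args, new_equal_to_1


# Shared metadata for one kernel that is called twice.
call_args = ["in_ptr0", None, 1, "out_ptr0"]
equal_to_1 = [2]

first = drop_none_args(call_args, equal_to_1)
second = drop_none_args(call_args, equal_to_1)
```

If the remapped list were instead stored back into the shared metadata, the second call would remap indices that had already been shifted, which is the failure mode the commit message describes.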
Ryo Suzuki	2b86309da3	separate f16 vectorized class from bf16 (#146596 ) Separating the f16 vectorized class into a different file from the bf16 vectorized class in order to be able to add a new bf16 SVE vectorized class in https://github.com/pytorch/pytorch/pull/143666. This is required as we would need to exclude the current bf16 class in order to use the sve bf16 class but still include the current f16 vectorized class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146596 Approved by: https://github.com/malfet	2025-03-01 18:22:32 +00:00
PyTorch MergeBot	8e004865dd	Revert "introduce dynamism library (#147981 )" This reverts commit 1c1bf410ecdeac8d240e15bf8c33c0f00fab0673. Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/malfet due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2692351017))	2025-03-01 18:16:52 +00:00
PyTorch MergeBot	a983b2b11a	Revert "Initial implementation of host memory stats (#147660 )" This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98. Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))	2025-03-01 18:05:45 +00:00
Sun, Jiayi	d23051f29b	[Inductor] Support parallel reduction for GroupNorm (#144020 ) Summary: Support parallel reduction for GroupNorm by optimizing the parallelization heuristics: When the range of the first inner loop is much larger than the range of all outer loops, change the starting depth of parallelization to the first inner loop. I tested the Inductor benchmark with this PR on CPU. One torchbench model(pytorch_CycleGAN_and_pix2pix) achieved ~45% performance improvement, and two diffusion models(Stable Diffusion and Latent Consistency Model(LCM)) achieved ~2% performance improvement. Example: ``` import torch import torch.nn as nn class GN(nn.Module): def __init__(self, num_groups, num_channels): super(GN, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return self.gn(x) x = torch.randn(2, 64, 168, 168).to(memory_format=torch.channels_last) m = GN(2, 64).eval() compiled_m = torch.compile(m) with torch.no_grad(): out = compiled_m(x) ``` Generated code: - Before: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, float* out_ptr3, float* out_ptr4) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); 
static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(56448L)); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } } } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2); } } } } #pragma omp single { { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L))) { auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)]; auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16)); auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16)); auto tmp1 = static_cast<float>(903168.0); auto tmp2 = tmp0 / tmp1; auto tmp3 = static_cast<float>(1e-05); auto tmp4 = decltype(tmp2)(tmp2 + tmp3); auto tmp5 = 1 / std::sqrt(tmp4); auto tmp7 = at::vec::Vectorized<float>(tmp5); auto tmp8 = tmp7 * tmp6; auto tmp10 = decltype(tmp9)(-tmp9); auto 
tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp11 * tmp8; auto tmp14 = tmp12 + tmp13; tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0)); tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0)); } } } } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16)); auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16)); auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16)); auto tmp2 = tmp0 * tmp1; auto tmp4 = tmp2 + tmp3; tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0)); } } } } } } } } ''') ``` - After: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, float* out_ptr3, float* out_ptr4) { { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); 
Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec_arr[56]; for (int i = 0; i < 56; i++) { tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>(); } Welford<float> tmp_acc0_arr[56]; for (int i = 0; i < 56; i++) { tmp_acc0_arr[i] = Welford<float>(); } Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_arr[56]; for (int i = 0; i < 56; i++) { masked_tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>(); } #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(1008L)); Welford<at::vec::Vectorized<float>> tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>(); Welford<float> tmp_acc0_local = Welford<float>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>(); #pragma omp for for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16)); tmp_acc0_vec_local = welford_combine(tmp_acc0_vec_local, tmp0, &wrecps0); } } } } tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local; tmp_acc0_arr[tid] = tmp_acc0_local; masked_tmp_acc0_vec_arr[tid] = masked_tmp_acc0_vec_local; } for (int tid = 0; tid < 56; tid++) { tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp_acc0_vec_arr[tid]); } for (int tid = 0; tid < 56; tid++) { tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]); } for (int tid = 0; tid < 56; tid++) { masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, masked_tmp_acc0_vec_arr[tid]); } tmp_acc0 = welford_combine(tmp_acc0, 
welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L))) { auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)]; auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16)); auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16)); auto tmp1 = static_cast<float>(903168.0); auto tmp2 = tmp0 / tmp1; auto tmp3 = static_cast<float>(1e-05); auto tmp4 = decltype(tmp2)(tmp2 + tmp3); auto tmp5 = 1 / std::sqrt(tmp4); auto tmp7 = at::vec::Vectorized<float>(tmp5); auto tmp8 = tmp7 * tmp6; auto tmp10 = decltype(tmp9)(-tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp11 * tmp8; auto tmp14 = tmp12 + tmp13; tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0)); tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0)); } } } } } } #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); 
x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16)); auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16)); auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16)); auto tmp2 = tmp0 * tmp1; auto tmp4 = tmp2 + tmp3; tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0)); } } } } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144020 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2025-03-01 17:11:50 +00:00
cyy	8bf3920279	Remove unneeded Clang-tidy suppression (#148246 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148246 Approved by: https://github.com/Skylion007	2025-03-01 16:51:54 +00:00
bobrenjc93	1c1bf410ec	introduce dynamism library (#147981 ) This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981 Approved by: https://github.com/pianpwk	2025-03-01 14:57:06 +00:00
Nikita Shulga	3a0c9f7f9d	[MPS] Fix SDPA crash (#148239 ) If operation is invoked with mask twice it will crash, as mask expansion logic was implemented inside cache creation block, which is executed only once for all shapes Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239 Approved by: https://github.com/dcci	2025-03-01 13:06:51 +00:00
Nikita Shulga	735d7b1af6	[EZ][BE] Increase tolerances for interpolate op (#148224 ) Not sure why tolerances were set like that, this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation But if I'm to make a guess, it's likely due to the inaccuracy of bilinear op, that has since been replaced by shader Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #148154, #148187, #148211	2025-03-01 13:03:59 +00:00
xinan.lin	762724f3d0	[Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178 ) This PR generalizes the device-bias code introduced by #147038 and aligns the behavior between XPU and CUDA on the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178 Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #148155	2025-03-01 10:59:55 +00:00
xinan.lin	ab78bf5c66	[Break XPU][Inductor UT] Avoid custom op registration conflicts in test_auto_functionalize.py. (#148155 ) Fix #148148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148155 Approved by: https://github.com/jansel, https://github.com/EikanWang	2025-03-01 10:59:55 +00:00
wz337	2f1b8e0fe2	[DTensor][Test] Add a test to demonstrate current dtensor view behavior if redistribution happens (#148015 ) This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015 Approved by: https://github.com/XilunWu	2025-03-01 10:24:40 +00:00
PyTorch MergeBot	191c9bd013	Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001 )" This reverts commit b8efebe57d05a87be5b0f304218d2af7bb2bf6c6. Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/davidberard98 due to looks like another lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2692042859))	2025-03-01 07:43:58 +00:00
Sun, Jiayi	fe3b9e3764	[Inductor] optimize the heuristics of outer loop fusion (#147523 ) Summary: Optimize the heuristics of outer loop fusion: When the range of the first inner loop is much larger than the range of all outer loops, do not fuse the outer loops and fallback to standard codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147523 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2025-03-01 06:50:04 +00:00
Daniel Vega-Myhre	b8efebe57d	[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001 ) Fixes https://github.com/pytorch/torchtitan/issues/864 ## Summary While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to. My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)`) - specifically when row-wise scales are being used. ## TL;DR of root cause - When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned. - In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op. ## Example - Concrete example: - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](`8706d3f3b0/torchao/float8/float8_linear.py (L70)`). Torchao does a reshape -> scaled mm -> reshape [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](`8706d3f3b0/torchao/float8/float8_ops.py (L152)`). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1). 
- During post grad pass in async TP: - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`)) - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)). ## Solution Note: the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics. - Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape. - reshape is just a view, so there should be no impact on performance ``` Before: reshape (a,b,c) to (a*b,c) -> reciprocal After: reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c) ``` - Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor` ## Test plan - Added unit test which exercises this new path - Manually tested with torchtitan with float8 rowwise + async TP Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001 Approved by: https://github.com/yifuwang	2025-03-01 06:38:39 +00:00
Animesh Jain	fd16311e7f	[inductor][subgraph] Plumbing to get ShapeAsConstantBuffer from subgraph to main graph output (#147559 ) I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559 Approved by: https://github.com/eellison	2025-03-01 06:17:11 +00:00
David Berard	c87097e74a	[triton 3.3] Fix inductor/test_profiler.py test (#148230 ) test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230 Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO	2025-03-01 04:27:49 +00:00
Anatoly Myachev	9377a32cd1	[Inductor][NFC] Remove unused functions from `compile_tasks.py` (#147564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147564 Approved by: https://github.com/Skylion007, https://github.com/davidberard98	2025-03-01 03:44:43 +00:00
PyTorch MergeBot	baf1c8fcdc	Revert "introduce dynamism library (#147981 )" This reverts commit 6eff6b28e4d09cbf632f79502a8e317bf5b53c34. Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2691906065))	2025-03-01 03:43:01 +00:00
Fuzzkatt	493cd97af5	add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134 ) Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134 Approved by: https://github.com/eqy	2025-03-01 02:59:48 +00:00
bobrenjc93	6eff6b28e4	introduce dynamism library (#147981 ) This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981 Approved by: https://github.com/pianpwk	2025-03-01 02:49:16 +00:00
Isalia20	08434df1f2	[MPS] fix empty place holder error for smooth l1 loss (#148133 ) Fixes #123171 And parametrizes the tests for it Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-01 02:32:45 +00:00
Anatoly Myachev	02c5f21541	[Inductor] fix `AOTInductorTestABICompatibleGpu.test_triton_kernel_weird_param_order` with new Triton (#148011 ) In this case, the parameters have already been filtered [here](`201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)`) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`. For example: ```python (Pdb) triton_meta["signature"] {'in_ptr0': 'fp32', 'in_ptr1': 'fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'} (Pdb) call_args ['arg0_1', 'arg0_1', '256L', 'buf0'] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011 Approved by: https://github.com/davidberard98	2025-03-01 01:21:20 +00:00
Isuru Fernando	338ed67a1e	[inductor] Implement max_pool2d_with_indices as a reduction for large window sizes (#147876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147876 Approved by: https://github.com/eellison	2025-03-01 01:07:01 +00:00
atalman	230a3b0f83	Add cuda 11.8 guard for cufile preload (#148184 ) Follow up after https://github.com/pytorch/pytorch/pull/148137 Make sure we don't try to load cufile on CUDA 11.8 Test: ``` >>> import torch /usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) cpu = _conversion_method_template(device=torch.device("cpu")) >>> torch.__version__ '2.7.0.dev20250227+cu118' >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184 Approved by: https://github.com/mikaylagawarecki	2025-03-01 01:01:04 +00:00
Iris Z	2544afaa1a	[DeviceMesh] Add some documentation for `from_group` API and add a 2D test (#146364 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364 Approved by: https://github.com/fduwjj	2025-03-01 00:57:37 +00:00
Nikita Shulga	5d297f7a34	[MPS][BE] Combine two `upsample_kernel_out_template` into one (#148211 ) - First, stop inverting sizes and strides, i.e. pass them as is, but read them in inverse order in the shader, since the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels and the 3rd and 4th for spatial coordinates - Pass `scales` as float2 even in linear tensor The above allows one to combine the two flavors of `upsample_kernel_out_template` into one Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211 Approved by: https://github.com/dcci ghstack dependencies: #148154, #148187	2025-03-01 00:39:26 +00:00
clr	83fb974b5d	scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906 ) It's not fully clear why these are not being created, but you can definitely reproduce this in code. `__name__` is fun, since there appears to be no way to explicitly set it on the pybind11 layer or c++ layer. I've set this in the python wrapper code (which works correctly). But let me know if people feel strongly and want us to go explicitly cast to python within the cpp functions and set it there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906 Approved by: https://github.com/jansel ghstack dependencies: #147894	2025-02-28 23:25:47 +00:00
Matthew Hoffman	1ae7cc41ca	Define `__all__` for `torch.utils.tensorboard` (#147550 ) Fixes the issue: ```python import torch.utils.tensorboard torch.utils.tensorboard.FileWriter # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard" torch.utils.tensorboard.RecordWriter # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard" torch.utils.tensorboard.SummaryWriter # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard" ``` The [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550 Approved by: https://github.com/albanD	2025-02-28 23:06:11 +00:00
drisspg	3a69dee955	[Submodule][FlashAttention] Bump to 2.7.4 (#148147 ) # Summary This makes me happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147 Approved by: https://github.com/Skylion007	2025-02-28 22:40:02 +00:00
bobrenjc93	83ec7cdcd4	Fix recompile reason logging (#148200 ) for the following test case ``` @torch.compile(dynamic=False, backend=cnts) def fn(x, y, z): return x * y * z[0] fn(1, torch.randn(1), {0: torch.randn(1)}) fn(2, torch.randn(2), {0: torch.randn(2)}) fn(3, torch.randn(3), {0: torch.randn(3)}) fn(4, torch.randn(4), {0: torch.randn(4)}) fn(5, torch.randn(5), {0: torch.randn(5)}) ``` previously we would log ``` 0/0: L['x'] == 1 0/0: L['x'] == 1 0/0: L['x'] == 1 0/0: L['x'] == 1 ``` but after this change we now log ``` 0/0: L['x'] == 1 0/1: L['x'] == 2 0/2: L['x'] == 3 0/3: L['x'] == 4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148200 Approved by: https://github.com/xmfan	2025-02-28 22:33:37 +00:00
William Wen	40b3e4a358	[dynamo] expose code execution strategy to python (#148020 ) @anijain2305 this can be used to mark a code object to be skipped/run-only (recursively) while tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148020 Approved by: https://github.com/jansel	2025-02-28 21:59:12 +00:00
Mwiza Kunda	e74fdbe6d0	[inductor] ignore block ptr advancements for removed buffers (#148087 ) Follow up to https://github.com/pytorch/pytorch/pull/147193. Some buffers are removed only when the kernel context is exited so defer the lines instead. Added `use_block_ptr` as a parameter to test case that fails if run with block ptrs enabled. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/148087 Approved by: https://github.com/jansel, https://github.com/eellison	2025-02-28 21:31:15 +00:00
Nikita Shulga	d174562487	[MPS][BE][EZ] Aggregate macros (#148187 ) Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` to use a common `INSTANTIATE_UPSAMPLE2D`, then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`. I.e. functionally it's a no-op, but achieves the same with fewer lines of code Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187 Approved by: https://github.com/Skylion007 ghstack dependencies: #148154	2025-02-28 21:30:00 +00:00
Sijia Chen	4995e058bf	[user-triton] handle inline_asm_case (#148043 ) Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True, since it indicates the operation doesn't mutate the input value. Test Plan: ../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel ``` test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True) ok ---------------------------------------------------------------------- Ran 2 tests in 0.706s OK ``` Differential Revision: D69878591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043 Approved by: https://github.com/zou3519	2025-02-28 20:52:51 +00:00
Ruben Rodriguez Buchillon	6f91720e1c	[inductor][ck] manual kBatch heuristic (#148118 ) Summary: # Why Leverage kBatch parameter for large splitK examples for CK for better than ATEN performance # What replace default kBatch = 1 with a manual heuristic - if K > 16 * max (M,N) - leverage k_per_block, and K and number of SMs on the chip - upper bound to 128, lower bound to 1 This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN This is of course subject to change and improvement Test Plan: with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288` ``` buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 ``` ``` AUTOTUNE mm(2048x524288, 524288x2048) rocm_ck_gemm_template_49 10.4972 ms 100.0% rocm_ck_gemm_template_8 10.6132 ms 98.9% rocm_ck_gemm_template_9 10.6907 ms 98.2% [...] mm 18.9880 ms 55.3% ``` Reviewed By: ColinPeppler Differential Revision: D70224591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118 Approved by: https://github.com/ColinPeppler	2025-02-28 20:36:16 +00:00
amdfaa	48c55a66ec	[ROCm] Move ROCm unstable MI300 jobs back to stable (#146675 ) Fixes #145790 Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners. This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](https://github.com/pytorch/pytorch/pull/145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146675 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-02-28 20:34:27 +00:00
Bert Maher	6778084531	[inductor][cutlass] Environment variables for allow/denylist (#148161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148161 Approved by: https://github.com/henrylhtsang, https://github.com/eellison	2025-02-28 20:33:10 +00:00
sanchitintel	5a1954eb93	[Inductor-CPU] Fix broken int8 WoQ GEMM AMX implementation in main (#147895 ) #146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch. UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq` The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA. `test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts. @leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2025-02-28 20:20:45 +00:00
clr	e0e516c554	Don't crash when we call __qualname__ on torch._C.ScriptFunction (#147894 ) We've root-caused this to ScriptFunction correctly throwing an attribute error when missing attributes are accessed. This PR will fix crashes that are showing up. I'm going to stack a second PR to fix torch._C.ScriptFunction just being a very badly behaving python object (which should also fix this). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894 Approved by: https://github.com/jansel	2025-02-28 20:15:38 +00:00
henrylhtsang	297c00264e	stage 1 of deprecating silent fallback of tuning gemm (#147798 ) Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/) context: https://github.com/pytorch/pytorch/issues/147479 For the most part, this should not change the behavior. For int_mm, I also removed ``` # TODO: Re-enable eager mode implementation once cuBLAS is fixed if use_cutlass or use_triton_template(layout, enable_int32=True): choices = [] ``` because I think it is unwanted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798 Approved by: https://github.com/eellison	2025-02-28 19:51:55 +00:00
PyTorch MergeBot	ebc3f27bf4	Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001 )" This reverts commit 6e037ac41c095dfdb37fdd4b36bf8ec2ebf84bf1. Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/wdvr due to lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2691421540))	2025-02-28 19:44:54 +00:00
amdfaa	42aeb5d259	Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners (#145504 ) E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120 ``` Beginning upload of artifact content to blob storage Error: An error has occurred while creating the zip file for upload Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log' /home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459 throw new Error('An error has occurred during zip creation for the artifact'); ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145504 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2025-02-28 19:16:28 +00:00
Daniel Vega-Myhre	6e037ac41c	[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001 ) Fixes https://github.com/pytorch/torchtitan/issues/864 ## Summary While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was that the scaling factor dims did not match the dims of the tensor the scales were to be applied to. My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)`) - specifically when row-wise scales are being used. ## TL;DR of root cause - When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned. - In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op. ## Example - Concrete example: - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](`8706d3f3b0/torchao/float8/float8_linear.py (L70)`). Torchao does a reshape -> scaled mm -> reshape [here](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](`8706d3f3b0/torchao/float8/float8_ops.py (L152)`). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
- During post grad pass in async TP: - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](`ed361ff5c7/torchao/float8/float8_linear.py (L122)`)) - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)). ## Solution Note: the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics. - Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape before the reshape. - reshape is just a view, so there should be no impact on performance ``` Before: reshape (a,b,c) to (a*b,c) -> reciprocal After: reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c) ``` - Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor` ## Test plan - Added unit test which exercises this new path - Manually tested with torchtitan with float8 rowwise + async TP Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001 Approved by: https://github.com/yifuwang	2025-02-28 18:51:42 +00:00
Marko Radmilac	945e359fc1	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-02-28 18:36:44 +00:00
Yidi Wu	982d7ba3ef	[while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019 ) Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on cpu and others on gpu, e.g. a loop_idx on cpu will save the gpu to device copies. This PR relaxes the constraint and only checks that the carry and input at the same position have the same device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019 Approved by: https://github.com/eellison, https://github.com/jansel	2025-02-28 18:27:03 +00:00
Yidi Wu	2d2f60bdda	[cond] support mismatched output in inductor (#147567 ) In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for others (e.g. while_loop) as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567 Approved by: https://github.com/jansel	2025-02-28 18:26:48 +00:00
henrylhtsang	d765077004	[cutlass backend] Sort the list of ops for better repro (#148047 ) Differential Revision: [D70298051](https://our.internmc.facebook.com/intern/diff/D70298051/) This only affects anything if `cutlass_max_profiling_configs` is used. I believe cutlass_max_profiling_configs is more of a testing config. The problem is that when we get the configs from cutlass_library, the ops can come in different orders. The motivation is to make reproducing small issues easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148047 Approved by: https://github.com/chenyang78, https://github.com/coconutruben	2025-02-28 18:04:10 +00:00
henrylhtsang	790ec756ee	[cutlass backend] Check if len(timings) == len(choices) before skipping precompile (#148050 ) Differential Revision: [D70298908](https://our.internmc.facebook.com/intern/diff/D70298908/) Mostly from @coconutruben's observation. Right now, we skip precompilation if we find some timings. That sounds like a bug. Most of the time it is fine, since we don't change the number of configs and triton compilation doesn't take too long. But it is devastating for the cutlass backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148050 Approved by: https://github.com/coconutruben	2025-02-28 17:58:58 +00:00
Nikita Shulga	e5e31050d3	[MPS] Implement linear1d as shader (#148154 ) And get rid of MPS call, as for some reason implementation via MPSGraph API call is 100x+ slower than Metal shader, at least according to the following benchmark ```python import torch import time import subprocess def benchmark(device, dtype): # Create example inputs x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype) sf = .5 # Check output y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear") z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear") outputs_match = torch.allclose(y.cpu(), z) if not outputs_match: atol = (y.cpu() - z).abs().max() rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max() print(f"atol={atol} rtol={rtol}") # Measure time manually start_time = time.time() * 1000 for _ in range(1000): y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear") torch.mps.synchronize end_time = time.time() * 1000 manual_delta = (end_time - start_time) average_time = f"{manual_delta:6.1f}" return "True " if outputs_match else "False", average_time outputs_match_list = [] average_time_list = [] for device in ["mps", "cpu"]: for dtype in [torch.float32, torch.float16, torch.bfloat16]: outputs_match, average_time = benchmark(device, dtype) outputs_match_list.append(str(outputs_match)) average_time_list.append(average_time) brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip() print(f"\nBenchmarking Results (collected on {brand_string}):") print("-"*40) print("Device : MPS \| CPU") print("Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 ") print(f"Outputs Match : ", " \| ".join(outputs_match_list)) print(f"Average Time (us) :", " \|".join(average_time_list)) ``` Benchmark results after the change ``` Benchmarking Results (collected on Apple M2 Pro): ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs 
Match : True \| True \| True \| True \| True \| True Average Time (us) : 2.5 \| 2.1 \| 2.2 \| 161.4 \| 115.0 \| 161.1 ``` And before the change ``` Benchmarking Results (collected on Apple M2 Pro): ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : True \| True \| True \| True \| True \| True Average Time (us) : 354.0 \| 336.0 \| 332.4 \| 145.5 \| 114.7 \| 148.3 ``` Fixes https://github.com/pytorch/pytorch/issues/144245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154 Approved by: https://github.com/dcci	2025-02-28 16:47:42 +00:00
Mengwei Liu	b5cd4ac950	[torchgen] Add support for schema with namespace (#148038 ) Fixes https://github.com/pytorch/executorch/issues/8711 In ExecuTorch when we try to parse the following schema: ``` aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor ``` Repro: ```python from torchgen.model import FunctionSchema native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor") ``` It's failing because `BaseOperatorName` categorizes it as an in-place operator. I understand we are not supposed to pass the namespace "aten::" into `FunctionSchema.parse()`, but unfortunately ExecuTorch requires this feature to work. This PR adds a new `namespace` attribute to `BaseOperatorName` and makes sure the rest of the stack works as before if a schema without a namespace is passed in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038 Approved by: https://github.com/bdhirsh	2025-02-28 16:41:50 +00:00
Eli Uriegas	e593288859	ci: Remove manylinux builds for triton, except for XPU (#148129 ) We're dropping regular old manylinux so let's drop it here too Relates to #123649 Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148129 Approved by: https://github.com/Camyll, https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman ghstack dependencies: #148126	2025-02-28 16:23:18 +00:00
bobrenjc93	4708cfdbd9	Support whitelist of dynamic sources (#147979 ) This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic. NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (e.g. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979 Approved by: https://github.com/Mingming-Ding	2025-02-28 15:43:14 +00:00
Yuanhao Ji	0a948f705b	[Dynamo] Fix `AssertionError` when dynamo traces `torch.functional.xxx()` functions (#148075 ) Fixes #147840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075 Approved by: https://github.com/yanboliang	2025-02-28 15:09:11 +00:00
atalman	1db3c58fab	Remove manylinux 2014 artifacts (#148135 ) 1. Switch Magma build to Manylinux 2.28 base 2. Use manylinux 2.28 as default in populate_binary_env.sh 3. Remove manylinux 2014 docker builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148135 Approved by: https://github.com/malfet	2025-02-28 13:43:14 +00:00
Xuehai Pan	1cb4e2df65	[BE][PYFMT] migrate PYFMT for `torch._inductor` to `ruff format` (#144550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550 Approved by: https://github.com/jansel	2025-02-28 13:33:19 +00:00
William Wen	34d726011f	[dynamo] update data-dependent branching graph break messages (#147912 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147912 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #147494, #147872	2025-02-28 12:30:06 +00:00
Xilun Wu	4106aa33eb	[dtensor][fix] fix _scaled_dot_product_flash_attention sharding (#148125 ) ### Summary https://github.com/pytorch/pytorch/pull/146372/ changed the op signature of `_scaled_dot_product_flash_attention` and as a consequence DTensor needs to change its sharding defined at `40ad5e01df/torch/distributed/tensor/_ops/_matrix_ops.py (L232)` ### Test `pytest test/distributed/tensor/test_attention.py` ### Follow-up It's still unclear why the CP unit tests were not run over the original PR which is BC-breaking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148125 Approved by: https://github.com/tianyu-l, https://github.com/fegin	2025-02-28 09:26:43 +00:00
ZhiweiYan-96	af720cd5a7	[Intel GPU] Decouple Intel GPU oneDNN from other backends (#147926 ) # Motivation Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so as to quickly adopt the optimizations or bug fixes provided by the oneDNN team. This PR does not change the behavior of other backends like Intel CPU or ARM. They can keep using the stable version contained in `third_party/ideep`. # Detail At compilation time, we will `git clone` oneDNN via URL `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This feature is supported by the CMake `ExternalProject_Add` command. Following is a build log example: ```bash [11/60] Performing download step (git clone) for 'xpu_mkldnn_proj' Cloning into 'xpu_mkldnn_proj'... HEAD is now at 5e92240360 meta: updated citation file [12/60] Performing update step for 'xpu_mkldnn_proj' -- Already at requested tag: v3.7 [13/60] No patch step for 'xpu_mkldnn_proj' ``` The log demonstrates that we explicitly download the source files and check out a specific tag. 
The source files of oneDNN are located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj` # Runtime verification Running UT for CPU ```bash onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e) onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24 onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost onednn_verbose,v1,info,gpu,runtime:none onednn_verbose,v1,info,graph,backend,0:dnnl_backend onednn_verbose,v1,primitive,info,template:operation,engine ``` Running UT for Intel GPU ```bash onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc) onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24 onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost onednn_verbose,v1,info,gpu,runtime:DPC++ onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2 ``` We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926 Approved by: https://github.com/EikanWang Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>	2025-02-28 07:42:06 +00:00
Ankita George	3a58a04898	Build a storage reader/writer to write checkpoints in HF format (#148089 ) Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py so that HFStorage components don't get imported internally and in turn there is no fsspec import that happens. I did the removal from __init__.py in D70286926 to fix the failing tests but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere Test Plan: signals pass buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage Differential Revision: D70324090 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089 Approved by: https://github.com/saumishr	2025-02-28 07:38:10 +00:00
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Rachel Guo	4e160d5fd9	[triton 3.3] Fix aoti cpp wrapper remaining 5 issues. (following #148051 ) (#148117 ) Summary: Fix the following 5 tests on a100: - test_foreach_cpp_wrapper_cuda_gpu_wrapper - test_enable_dynamic_shapes_cpp_wrapper_cuda_gpu_wrapper - test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_gpu_wrapper - test_enable_dynamic_shapes_cpp_wrapper_cuda_dynamic_shapes_gpu_wrapper - test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_dynamic_shapes_gpu_wrapper Test Plan: oss : ``` TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCH_LOGS="+inductor, output_code" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CPLUS_INCLUDE_PATH=/usr/local/cuda-12.6/include:$CPLUS_INCLUDE_PATH python test/inductor/test_gpu_cpp_wrapper.py -k test_foreach_cpp_wrapper_cuda_gpu_wrapper ``` @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/148117 Approved by: https://github.com/davidberard98, https://github.com/chenyang78	2025-02-28 06:56:30 +00:00
Wouter Devriendt	ea12fc8a9f	Revert D70262395 (#148164 ) Summary: This reverts #147804 due to internal revert. --- This diff reverts D70262395 Reviewed By: RossMcKenzie Differential Revision: D70318024 @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164 Approved by: https://github.com/xmfan	2025-02-28 06:39:48 +00:00
William Wen	baba7beed2	[dynamo] add context manager debug information to graph breaks (#147872 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147872 Approved by: https://github.com/zou3519 ghstack dependencies: #147494	2025-02-28 06:23:28 +00:00
William Wen	4caeede799	[dynamo] more better error messages [3/N] (#147494 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147494 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-02-28 06:23:28 +00:00
eellison	bc362cc15a	Move expanded dim require_exact_stride handling to api from sdpa lowering (#148101 ) See issue: https://github.com/pytorch/pytorch/issues/147156#issue-2852362217. Original tests from https://github.com/pytorch/pytorch/pull/146054 should cover these changes, and I tested that the perf on https://github.com/pytorch/pytorch/issues/145760 remains fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148101 Approved by: https://github.com/zou3519	2025-02-28 06:02:18 +00:00
cyy	b0dfd242fa	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 05:53:19 +00:00
Aaron Gokaslan	3b4b23ab0b	[BE][Ez]: Remove extra copy in dtensor parallel loss (#148096 ) Remove an extra copy of the input to `_log_softmax` when there is a dtype and memory format change. Fuse the copies instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148096 Approved by: https://github.com/jansel, https://github.com/wconstab	2025-02-28 05:42:32 +00:00
Arthur Laureus Wigo	9b7130b8db	Clean temporary directory at exit (#147813 ) Issue: A temporary directory is created in [pytorch/torch/distributed/nn/jit/instantiator.py](https://github.com/arthurlw/pytorch/blob/clean-temp-directory-at-exit/torch/distributed/nn/jit/instantiator.py) but is never cleaned up, leading to a ResourceWarning on program exit. Solution: Registered an `atexit` handler to properly clean up the temporary directory when the program exits. Fixes #147744 Line 23 in [0a49f8f](`0a49f8fd3d`) ```python 23 atexit.register(_TEMP_DIR.cleanup) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147813 Approved by: https://github.com/H-Huang	2025-02-28 04:12:23 +00:00
Davide Italiano	760921a7d8	[MPS] Add inductor support for the `entr()` operator. (#148128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148128 Approved by: https://github.com/jansel, https://github.com/malfet	2025-02-28 03:33:22 +00:00
Animesh Jain	eb9c127341	[dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824 ) Earlier, with the inline flag, we were lifting id-guarded tensors to the inputs of the Fx graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter. This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has a large number of optimizer parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2025-02-28 03:22:11 +00:00
PyTorch MergeBot	926b7b5027	Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705 )" This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788. Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))	2025-02-28 03:04:38 +00:00
Xuehai Pan	3ce352e389	[BE][PYFMT] migrate PYFMT for `torch._dynamo` to `ruff format` (#144549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549 Approved by: https://github.com/jansel	2025-02-28 03:03:53 +00:00
Ding, Yi1	edc5bf91d2	[Intel GPU] Add synchronize() in torch.utils.benchmark (#147835 ) When following https://pytorch.org/tutorials/recipes/recipes/benchmark.html on XPU, I noticed that the device is not synchronized in the benchmark. This PR fixes this and aligns the behavior with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147835 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2025-02-28 02:58:17 +00:00
Xuehai Pan	0edb2da4a4	[dynamo] add sourceless builder for `types.MethodType` (#147880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880 Approved by: https://github.com/jansel	2025-02-28 02:30:04 +00:00
x41lakazam	30375cb326	Fix minor typo in python_nccl (#148088 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088 Approved by: https://github.com/Skylion007	2025-02-28 00:47:09 +00:00
eellison	481a57bc37	Support torch.compile rng selective activation checkpointing with cudagraph (#146878 ) TODO: - [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync - [x] Update rng state initialization to take from correct device - [x] Tests - [x] handling of retain_graph - [x] respect fallback random Fix for https://github.com/pytorch/pytorch/issues/130123. Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states. We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward. ``` ===== Forward graph 1 ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None ... 
===== Backward graph 1 ===== def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None ``` There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls: - fwd0: fwd_rng_state0 -> fwd_rng_state1 - fwd1: fwd_rng_state1 -> fwd_rng_state2 - bwd1 - bwd0 Then naively, when bwd1 is invoked the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary. Other notes: Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order. Questions for reviewers: This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`. Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set. I'm initializing the rng states by cloning the current state.
If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking seed from graph position, or etc etc. Not sure. Let me know thoughts. Edit: updated to be taken from randint() Update: initializing rng states from torch.randint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2025-02-28 00:47:03 +00:00
Brian Hirsh	c6d1038aaa	only print GraphModule during fx.Interpreter errors if valid (#148090 ) Came up in https://www.internalfb.com/diff/D69057074?dst_version_fbid=970771615000938&transaction_fbid=1723357345264461 - we need to make sure the GraphModule is valid before calling `print_readable` on it Pull Request resolved: https://github.com/pytorch/pytorch/pull/148090 Approved by: https://github.com/jamesjwu, https://github.com/zou3519 ghstack dependencies: #147749	2025-02-28 00:44:27 +00:00
Andrey Talman	5a14ff8ace	Add cufile to list of libraries to preload (#148137 ) Fixes: https://github.com/pytorch/pytorch/issues/148120 Test with almalinux/9-base:latest : ``` >>> import torch Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module> from torch._C import * # noqa: F403 ImportError: libcufile.so.0: cannot open shared object file: No such file or directory >>> exit() [root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py [root@18b37257e416 /]# python3 Python 3.9.19 (main, Sep 11 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch /usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) cpu = _conversion_method_template(device=torch.device("cpu")) >>> torch.__version__ '2.7.0.dev20250227+cu126' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137 Approved by: https://github.com/malfet	2025-02-28 00:35:47 +00:00
cyyever	40ad5e01df	Remove NO_MULTIPROCESSING_SPAWN checks (#146705 ) py 3.9 has spawn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705 Approved by: https://github.com/colesbury	2025-02-28 00:15:32 +00:00
Catherine Lee	2978771c9d	[CI] test upload: better check for if job is rerun disabled tests (#148027 ) Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing. This makes the script think that there are fewer retries than there actually are and that the job is not a rerun disabled tests job. Instead, query for the job name to see if it contains rerun disabled tests and fall back to counting the number of retries if querying fails. Alternate options: relax the check for the number of tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027 Approved by: https://github.com/huydhn	2025-02-28 00:04:33 +00:00
Eli Uriegas	fc78192b1d	ci: Only run CI specific things when in CI (#148126 ) This was blocking me from running this locally so don't run it like this Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148126 Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/atalman	2025-02-27 23:27:57 +00:00
Aaron Gokaslan	f4235310e8	[BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978 ) Empty_likes includes a memory_format arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978 Approved by: https://github.com/jansel	2025-02-27 23:16:44 +00:00
Zhengxu Chen	915b9c80ab	[export] Sync aoti schema to schema.py (#148017 ) Summary: Synchronizing internal AOTI schema to OSS schema.py Test Plan: CI Differential Revision: D70271151 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148017 Approved by: https://github.com/yiming0416	2025-02-27 21:46:11 +00:00
Eli Uriegas	871b3909fc	ci: Remove manylinux 2014 remnants (#148028 ) These are the only remaining references I could find to manylinux2014, we should probably look to remove these a bit quicker since it made it difficult to know which Dockerfiles were important in .ci/docker/manywheel/ > [!TIP] > I checked if we were using these by running > `rg 2014 .github/` Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/148028 Approved by: https://github.com/wdvr, https://github.com/malfet, https://github.com/atalman	2025-02-27 21:37:00 +00:00
Zain Rizvi	10ffd94216	Reference the commit explicitly (#148026 ) Reference the commit tested by CI explicitly, and fail the merge if the PR was updated. Tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/148026 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/atalman	2025-02-27 21:06:34 +00:00
Xintong Hu	783d83c5d8	[PT2] Port fuse_split_getitem_squeeze to PT2 pre_grad passes (#148059 ) Summary: put it as an add_pass option Reviewed By: frank-wei Differential Revision: D68909559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148059 Approved by: https://github.com/frank-wei	2025-02-27 21:03:51 +00:00
Xuehai Pan	d48eb58d1d	[BE][CI] bump ruff to 0.9.8 (#145606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145606 Approved by: https://github.com/malfet ghstack dependencies: #144546	2025-02-27 21:01:10 +00:00
PyTorch MergeBot	644d84d594	Revert "optimize the decomposition of aten.native_group_norm (#144733 )" This reverts commit b533bb4b133c36767270bd8a24f11d5c37f8dd5c. Reverted https://github.com/pytorch/pytorch/pull/144733 on behalf of https://github.com/desertfire due to Cause TIMM pass rate regression on H100, see https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2020%20Feb%202025%2020%3A53%3A55%20GMT&stopTime=Thu%2C%2027%20Feb%202025%2020%3A53%3A55%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=main&lCommit=4216478250e08e950fdd090fc23a1b270c520cc4&rBranch=main&rCommit=4986f0f52eb871cdb91b8124ee162cfe622b8688 ([comment](https://github.com/pytorch/pytorch/pull/144733#issuecomment-2689092714))	2025-02-27 20:57:25 +00:00
atalman	1845e7d1f5	Use nightly-wheel-upload env for triton wheel publishing (#148108 ) Required for publishing triton builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148108 Approved by: https://github.com/malfet	2025-02-27 20:47:40 +00:00
Xuehai Pan	c73a92fbf5	[BE][CI] bump `ruff` to 0.9.2: multiline `assert` statements (#144546 ) Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements > Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target: > > ```python > # Input > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > > # Black > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > # Ruff > assert len(policy_types) >= priority + num_duplicates, ( > f"This tests needs at least {priority + num_duplicates} many types." > ) > ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546 Approved by: https://github.com/malfet	2025-02-27 20:46:16 +00:00
Ruben Rodriguez Buchillon	f0d00421cf	[inductor][ck] kBatch filtering with gen_ops (#148004 ) Summary: # Why Not all choices of kBatch are valid; invalid ones lead to a runtime error (when CK checks the validity of the args) `c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)` # What - move kBatch inside the gen_ops to have more control over it, and be able to filter it - expand filtering based on the cpp logic - refactor the padding checks to be more readable Test Plan: ``` buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 ``` with kBatch = 128: some filtering kBatch = 1: no filtering kBatch = 1738: all options filtered out Reviewed By: henrylhtsang Differential Revision: D70211442 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004 Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent	2025-02-27 20:13:58 +00:00
Davide Italiano	ce805a5ba5	[BE/metal] Rename REGISTER_I0_I1 to REGISTER_SPECIAL. (#148036 ) Now that it's used for other ops as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148036 Approved by: https://github.com/malfet, https://github.com/jansel	2025-02-27 17:56:26 +00:00
Mikayla Gawarecki	9a1f720a72	Validate inputs to _nested_view_from_buffer to prevent overflows (#147356 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356 Approved by: https://github.com/albanD, https://github.com/jbschlosser ghstack dependencies: #147352, #147354	2025-02-27 15:48:58 +00:00
Mikayla Gawarecki	536bce5a04	Make Tensor.set_ validate storage_offset when sizes/strides are unchanged (#147354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147354 Approved by: https://github.com/albanD ghstack dependencies: #147352	2025-02-27 15:48:58 +00:00
Mikayla Gawarecki	e64441915f	Fix overflow in checkInBoundsForStorage (#147352 ) Use `computeStorageNbytes` (which checks for overflows) to include the computation re the storage_offset Pull Request resolved: https://github.com/pytorch/pytorch/pull/147352 Approved by: https://github.com/albanD	2025-02-27 15:48:50 +00:00
Anatoly Myachev	6ccbff1450	[Inductor] Fix `inductor/test_kernel_benchmark.py` for new Triton; do not duplicate parameters in `_dump_launch_params` (#147746 ) The problem is that the new Triton uses the following code branch, which does not filter the call parameters, which may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: `cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)` Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499): ```bash Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1. Traceback (most recent call last): File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module> compiled_module_main('None', benchmark_compiled_module) File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000 File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module return print_performance(fn, times=times, repeat=repeat) File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance [timed(model, example_inputs, times, device) for _ in range(repeat)] File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp> [timed(model, example_inputs, times, device) for _ in range(repeat)] File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed result = model(example_inputs) File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda> fn = lambda: 
call([arg0_1, arg1_1]) File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1) File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda> return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run bound_args, specialization, options = binder(*args, **kwargs) TypeError: dynamic_func() got multiple values for argument 'XBLOCK' ``` Reproduce: `python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps` Triton: `c4a79a1960` Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732 @davidberard98 @etaf please take a look Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746 Approved by: https://github.com/jansel	2025-02-27 14:40:22 +00:00
Wang, Eikan	2c35af4def	[Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969 ) XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`. `f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)` So, `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is useless for XPU. `XPU_MKLDNN_INCLUDE` is good enough. Meanwhile, it may mess up the included files if the version of XPU oneDNN differs from other backends. * __->__ #147969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969 Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman	2025-02-27 14:22:17 +00:00
atalman	71ee17baa1	Smoke Test skip cuda.gds on windows (#148060 ) Follow up after: https://github.com/pytorch/pytorch/pull/147120 Cufile was enabled only on Linux: https://pypi.org/project/nvidia-cufile-cu12/#files Fixes validation workflow failures: https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37896578837 ``` File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 105, in __init__ raise RuntimeError("GdsFile is not supported on this platform.") RuntimeError: GdsFile is not supported on this platform. Exception ignored in: <function GdsFile.__del__ at 0x000001772B5003A0> Traceback (most recent call last): File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 113, in __del__ if self.handle is not None: AttributeError: 'GdsFile' object has no attribute 'handle' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148060 Approved by: https://github.com/mikaylagawarecki	2025-02-27 14:00:49 +00:00
IvanKobzarev	7ae0e0b2ea	[aotd] Log torch._functorch.config in tlparse (#147883 ) Adding torch._functorch.config to tlparse for better debuggability. E.g. https://github.com/pytorch/pytorch/pull/147638 happened only with `torch._functorch.config.view_replay_for_aliased_outputs=False`, which is True by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147883 Approved by: https://github.com/bdhirsh, https://github.com/jamesjwu	2025-02-27 11:22:45 +00:00
Raymond Li	c5bf9aaf1c	Log graph breaks (#146537 ) Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537 Approved by: https://github.com/c00w	2025-02-27 11:06:33 +00:00
Lu Fang	0489a349e7	Skip the logging if the pass cannot be pickled (#148053 ) Summary: Skip the logging for vllm at this moment, we can add some pickle logic later. The log is only for debugging purpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148053 Approved by: https://github.com/chenyang78	2025-02-27 10:54:34 +00:00
David Berard	26f19539ad	[triton 3.3] cpp_wrapper: add a global_scratch arg (#148051 ) Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051 Approved by: https://github.com/YUNQIUGUO	2025-02-27 10:13:57 +00:00
Zhang, Jianyi	91e7c7945c	[Intel GPU] Avoid unnecessary copy when the dst of Matmul is non-contiguous (#144759 ) We should not always call contiguous on the dst of matmul. We have already removed copy of matmul input in https://github.com/pytorch/pytorch/pull/143784 I also fixed an accuracy issue by using onednn sum post op instead of binary add in the case of inplace to avoid UT failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144759 Approved by: https://github.com/EikanWang	2025-02-27 08:04:34 +00:00
Ti-Tai Wang	8ee84aa703	[ONNX] Fix missed None type support in dyamic shapes string cases (#148025 ) In `_any_str_or_dim_in_dynamic_shapes`, we strictly guard the `dynamic_shapes` to make sure the flattened shapes are valid. But the code failed to consider that None could be in the shapes. NOTE: Found in benchmarking with Olive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148025 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-02-27 07:57:47 +00:00
Simon Fan	fd43c36aa9	[ca] side-effect free initial trace: RAII PyCompilerInterface (#147891 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147891 Approved by: https://github.com/jansel ghstack dependencies: #147242, #147796, #147804	2025-02-27 07:17:30 +00:00
Mu-Chu Lee	9017becf1d	Add unique kernel name support for user defined triton kernel (#147587 ) Summary: Add unique_user_kernel_names which mimics what unique_kernel_names do, but for user defined Triton kernels. This does rewrite the copied kernel src, and modifies non-Inductor generated code, so we split it out from unique_kernel_names, where we have more control over all namings and generations. Test Plan: Only used for debug purpose Differential Revision: D69966608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587 Approved by: https://github.com/desertfire	2025-02-27 06:00:50 +00:00
PyTorch MergeBot	c622796cde	Revert "Build a storage reader/writer to write checkpoints in HF format (#147622 )" This reverts commit 6a658d983e84f7bcb8e67328b00661ec49db78c5. Reverted https://github.com/pytorch/pytorch/pull/147622 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147622#issuecomment-2686932514))	2025-02-27 05:14:28 +00:00
Yutao Xu	21bd5fe203	Update torch-xpu-ops commit pin (#147968 ) Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](`86aaaf8a9d`), includes: - Bugfix (PT2E/BatchNorm) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968 Approved by: https://github.com/Skylion007	2025-02-27 05:01:12 +00:00
Boyuan Feng	b6fe28ff02	[Inductor] Graph Partition (#147038 ) This PR implements inductor graph partition. Previously, 1 dynamo graph is mapped to 1 inductor graph, and further mapped to 1 call function. In this PR, we allow 1 dynamo graph mapped to multiple inductor graphs and multiple `graph_partition` functions in the generated code. This allows applying different further optimizations to different `graph_partition`. Design Doc: [link](https://docs.google.com/document/d/1qPgOfy25l7SIYnrQrvU-TO1mdHMslCwv_SLmeXID6tM/edit?usp=sharing) Example: [Generated code before and after this diff](https://www.internalfb.com/intern/diffing/?paste_number=1737334601) In the follow-up PR, we will extend the work to cudagraph, which allows applying cudagraph to parts of the generated code (#125864). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147038 Approved by: https://github.com/eellison	2025-02-27 04:50:43 +00:00
Ankita George	e0b93082f1	Remove HuggingFace reader and writer from __init__.py (#148030 ) Summary: This is causing a HFStorageReader/Writer to be imported which imports fsspec but dependencies don't have fsspec, which is causing failing builds Differential Revision: D70286926 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030 Approved by: https://github.com/hl475	2025-02-27 04:50:14 +00:00
Mwiza Kunda	8cb8722979	[inductor][triton] Ignore block ptr advances for removed buffers (#147193 ) Block ptr advancements should also be deferred conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation are no longer needed. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193 Approved by: https://github.com/jansel Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-27 03:37:33 +00:00
PyTorch MergeBot	17358ce778	Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878 )" This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f. Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))	2025-02-27 03:36:16 +00:00
Alex Baden	9d3636283b	[Inductor] Use generic GPU device in test_preserves_strides (#148006 ) #147861 added a new test tagged for the generic GPU but uses the cuda GPU type for creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install, presumably it will work for the current pin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006 Approved by: https://github.com/eellison, https://github.com/etaf	2025-02-27 02:52:51 +00:00
drisspg	07b7b3ed4e	torch._scaled_mm with MXFP8 (#147548 ) # summary Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices. If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS. This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm). - Scales are flipped based on transpose_result - Handles boundary conditions Note that MXFP4 is not added in this PR - we can tackle that in a future PR. This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases. # test plan ``` pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548 Approved by: https://github.com/drisspg Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-02-27 02:44:39 +00:00
henrylhtsang	84c89a4527	[cutlass backend] cache_clear algorithm select cache on fresh inductor cache (#147590 ) Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/) AlgorithmSelectorCache is a cache. The expectation is that when we force disable cache + clear inductor caches, it would be cleared. However that is not the case. The reason why this is a problem can be seen by following this repro: What we will see is ``` SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices ``` The root cause is, while precompiling is skipped, due to it being cached, autotuning isn't skipped since we force disable it. repro: ``` import logging import os os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor" import torch import torch._inductor.config from torch._inductor.utils import clear_inductor_caches torch._inductor.config.max_autotune = True torch._inductor.config.force_disable_caches = True torch._inductor.config.autotune_num_choices_displayed = None torch._inductor.config.max_autotune_gemm_backends = "CUTLASS" torch._inductor.config.autotune_fallback_to_aten = False torch._inductor.config.cuda.cutlass_instantiation_level = "0001" def main(): M, N, K = 2048, 2048, 2048 dtype = torch.bfloat16 A = torch.randn(M, K, device="cuda", dtype=dtype) B = torch.randn(K, N, device="cuda", dtype=dtype) for _ in range(2): torch._dynamo.reset() clear_inductor_caches() compiled_model = torch.compile(torch.mm, fullgraph=True) _ = compiled_model(A, B) print("done") if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590 Approved by: https://github.com/eellison, https://github.com/chenyang78	2025-02-27 02:30:49 +00:00
Mengwei Liu	97ebccaa91	Add _fft_r2c as core ATen (#147998 ) As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147998 Approved by: https://github.com/tugsbayasgalan	2025-02-27 02:29:59 +00:00
eellison	ad0c879e22	Support torch.compile rng selective activation checkpointing with cudagraph (#146878 ) TODO: - [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync - [x] Update rng state initialization to take from correct device - [x] Tests - [x] handling of retain_graph - [x] respect fallback random Fix for https://github.com/pytorch/pytorch/issues/130123. Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states. We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward. ``` ===== Forward graph 1 ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0); fwd_rng_state_0 = None ... 
===== Backward graph 1 ===== def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0): sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0); bwd_rng_state_0 = None ``` There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls: - fwd0: fwd_rng_state0 -> fwd_rng_state1 - fwd1: fwd_rng_state1 -> fwd_rng_state2 - bwd1 - bwd0 Then naively, when bwd1 is invoked the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary. Other notes: Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states, i.e., not applied in the same order. Questions for reviewers: This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer to not have an rng divergence just for cudagraph. I am making it respect `fallback_random`. Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set. I'm initializing the rng states by cloning the current state. 
If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position. Not sure. Let me know thoughts. Edit: updated to be taken from randint() Update: initializing rng states from torch.randint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2025-02-27 02:08:29 +00:00
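The "separate rng state for each operator" note in #146878 can be illustrated with a small pure-Python sketch. This is only an analogy under stated assumptions: `random.Random` stands in for the CUDA generator, and `run_with_rng_state` and `fresh_state` are hypothetical helpers, not PyTorch APIs.

```python
import random


def fresh_state(seed):
    """Hypothetical helper: build an initial RNG state from a seed."""
    return random.Random(seed).getstate()


def run_with_rng_state(state):
    """Run a stand-in 'random op' from an explicit state.

    Returns the sampled value and the advanced state.
    """
    gen = random.Random()
    gen.setstate(state)
    value = gen.random()
    return value, gen.getstate()


# Case 1: one RNG state shared by both ops.
# Forward runs op "a" then op "b"; backward recomputes "b" then "a".
shared = fresh_state(0)
fwd_a, nxt = run_with_rng_state(shared)
fwd_b, _ = run_with_rng_state(nxt)
bwd_b, nxt = run_with_rng_state(shared)  # backward visits "b" first
bwd_a, _ = run_with_rng_state(nxt)
shared_matches = (fwd_a == bwd_a) and (fwd_b == bwd_b)

# Case 2: one RNG state per op; visiting order no longer matters.
states = {"a": fresh_state(1), "b": fresh_state(2)}
fwd = {op: run_with_rng_state(states[op])[0] for op in ("a", "b")}
bwd = {op: run_with_rng_state(states[op])[0] for op in ("b", "a")}
per_op_matches = fwd == bwd

print("shared state reproduces forward values:", shared_matches)
print("per-op states reproduce forward values:", per_op_matches)
```

With a shared state the backward pass hands op "b" the sample that op "a" saw in the forward, which is the ordering problem the commit message describes.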
Andrey Talman	784902983e	Remove +PTX from cuda 12.6 builds (#148000 ) Similar to: https://github.com/pytorch/pytorch/pull/141142 Ahead of the release 2.7 I see following validation failure: https://github.com/pytorch/test-infra/actions/runs/13552433445/job/37879041739?pr=6339 ``` RuntimeError: Binary size of torch-2.7.0.dev20250226+cu126-cp310-cp310-manylinux_2_28_x86_64.whl 1076.45 MB exceeds the threshold 750 MB ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148000 Approved by: https://github.com/clee2000, https://github.com/ngimel, https://github.com/tinglvv	2025-02-27 02:02:11 +00:00
ZhaoqiongZ	20ce67cd06	Update hw requirement for FP64 on "Getting Started on Intel GPU" (#147802 ) Fixes #147731 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147802 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-27 01:54:19 +00:00
cyy	9ca871f32b	Remove binaries/benchmark_args.h (#147920 ) It's not used in OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147920 Approved by: https://github.com/Skylion007	2025-02-27 01:16:28 +00:00
Zaili Wang	ea5d40db73	Address source code building command for Intel GPU support (#143476 ) As the title Pull Request resolved: https://github.com/pytorch/pytorch/pull/143476 Approved by: https://github.com/EikanWang, https://github.com/malfet Co-authored-by: Xu Han <xu.han@outlook.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-27 01:07:40 +00:00
Bin Bao	f104ef1248	[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975 ) Summary: Let CppBuilder handle all the cpp build logic Differential Revision: D70141808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2025-02-27 00:35:12 +00:00
Benjamin Glass	f98cd84b04	cpp_wrapper: use largeTensorTest for test memory checks (#146991 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991 Approved by: https://github.com/desertfire	2025-02-27 00:30:21 +00:00
Benjamin Glass	723f3a9eab	torch.utils._content_store: fix error in hash_storage on XPU (#147785 ) See https://github.com/pytorch/pytorch/actions/runs/13508573465/job/37745227468 for an example error. This is triggering after the merge of #147541, which enabled Dynamo compilation on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147785 Approved by: https://github.com/jansel	2025-02-26 23:57:59 +00:00
PyTorch MergeBot	915eb012e1	Revert "[dynamo] add sourceless builder for `types.MethodType` (#147880 )" This reverts commit 08f4c1a2332921e57c782c80a66b2adc9cdc0575. Reverted https://github.com/pytorch/pytorch/pull/147880 on behalf of https://github.com/wdvr due to failing trunk tests ([comment](https://github.com/pytorch/pytorch/pull/147880#issuecomment-2686436432))	2025-02-26 23:29:58 +00:00
Nichols A. Romero	84e60eece8	[ROCm] [TunableOp] Unit tests for scaled GEMM and GEMM with bias (#147890 ) Two more unit tests for TunableOp: - Scaled GEMM - GEMM with bias Pull Request resolved: https://github.com/pytorch/pytorch/pull/147890 Approved by: https://github.com/jeffdaily	2025-02-26 22:41:24 +00:00
Nichols A. Romero	b13ad1a193	[ROCm][TunableOp] Remove extra transpose characters in hipBLASLt signature. (#147900 ) Cleanup the TunableOp hipBLASLt signature of extra transpose characters. Test manually and no new regressions found. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147900 Approved by: https://github.com/jeffdaily	2025-02-26 22:28:00 +00:00
PyTorch MergeBot	7e7d05bf85	Revert "[do not merge yet] update grammar (#147996 )" This reverts commit 6e129a697f86425d0682ed30ffc9b3f8abe00e9e. Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686291282))	2025-02-26 22:01:12 +00:00
sokkaofthewatertribe	6e129a697f	[do not merge yet] update grammar (#147996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996 Approved by: https://github.com/seemethere	2025-02-26 21:52:58 +00:00
PyTorch MergeBot	dc7556f1bd	Revert "[do not merge yet] update grammar (#147996 )" This reverts commit a1ee2c3a08c3bf3d83c4e9f352ea179c107edb13. Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686266052))	2025-02-26 21:43:06 +00:00
sokkaofthewatertribe	a1ee2c3a08	[do not merge yet] update grammar (#147996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996 Approved by: https://github.com/seemethere	2025-02-26 21:39:08 +00:00
henrylhtsang	201666d77d	[cutlass backend] turn autotuning logs off by default + rename log to autotuning log (#147922 ) things we did: * turn off autotuning logs by default * rename autotuning logs from log to autotuning_log, so people are aware that it is a special artifact log. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147922 Approved by: https://github.com/eellison	2025-02-26 21:02:04 +00:00
Xiao Wang	976ff5cf01	Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418 ) Per title. Sometimes it's hard for cmake to find NVTX3 without the cuda include path hint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418 Approved by: https://github.com/nWEIdia, https://github.com/malfet	2025-02-26 20:52:28 +00:00
Ankita George	6a658d983e	Build a storage reader/writer to write checkpoints in HF format (#147622 ) Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case. Copy of [D68444967](https://www.internalfb.com/diff/D68444967) (https://github.com/pytorch/pytorch/pull/146352). That diff got reverted because of lint errors. The lint error was due to having imports of uninstalled libraries. This was on purpose because we don't want to install safetensors and huggingface, this new diff explicitly ignores this lint so that we don't have the error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147622 Approved by: https://github.com/saumishr	2025-02-26 20:47:54 +00:00
Thomas Bohnstingl	7c71ab1d40	[scan] User-facing reverse flag handling (#147886 ) This PR removes the reverse flag from the backend implementation and resolves it via `torch.flip` in the frontend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147886 Approved by: https://github.com/ydwu4	2025-02-26 20:04:57 +00:00
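The idea in #147886 of resolving `reverse` in the frontend can be sketched in plain Python. This is a simplified stand-in, not the actual `scan` higher-order op: the `scan` function below is hypothetical, it operates on lists, and list slicing plays the role of `torch.flip`.

```python
from itertools import accumulate
import operator


def scan(combine_fn, xs, reverse=False):
    """Hypothetical frontend: the core scan only ever runs left to right.

    A reversed scan is expressed as flip -> forward scan -> flip, so the
    backend never needs to know about the flag.
    """
    if reverse:
        xs = xs[::-1]  # stand-in for torch.flip on the scanned dimension
    out = list(accumulate(xs, combine_fn))  # forward-only "backend"
    if reverse:
        out = out[::-1]  # flip the results back to the original order
    return out


forward_sums = scan(operator.add, [1, 2, 3, 4])
reverse_sums = scan(operator.add, [1, 2, 3, 4], reverse=True)
print(forward_sums, reverse_sums)
```

The reversed call produces suffix sums in the original element order, which matches what a native reverse flag would compute for an associative `combine_fn` such as addition.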
Davide Italiano	683e083e8d	[MPS] Add support for `entr()` in eager. (#147948 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948 Approved by: https://github.com/malfet	2025-02-26 19:55:02 +00:00
Ryan Guo	eb08ada5d3	[dynamo] Support reads to global/captured tensors in `nonstrict_trace`-ed function (#147572 ) As title. Without this patch we get the following error: Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite work for AOTAutograd, which also needs to fake-tensor-propagate the `nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the `nonstrict_trace` processing and put the `flat_apply(...)` node into the graph. So we can't easily enable the `allow_non_fake_inputs` flag temporarily on the current fake mode when AOTAutograd processes a `flat_apply` node from Dynamo's `nonstrict_trace` handling. After discussing with zou3519, I decided to add a global `FakeTensorTLS` that contains an `allow_non_fake_inputs_override` flag, and to patch the `nonstrict_trace`-ed function to temporarily tweak this flag during its execution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572 Approved by: https://github.com/zou3519 ghstack dependencies: #146714, #146367, #146950, #147571	2025-02-26 19:47:39 +00:00
Ryan Guo	73e963459e	[dynamo] Support `nonstrict_trace` on class method (#147571 ) As title, also see 1. new test `test_nonstrict_trace_on_method` for example. 2. newly added comments for why we need special treatment on methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147571 Approved by: https://github.com/zou3519 ghstack dependencies: #146714, #146367, #146950	2025-02-26 19:47:39 +00:00
Ryan Guo	7e0ef2c844	[dynamo] Use the new `get_unique_name_wrt` helper when applicable (#146950 ) This patch removes some duplicated name generation logic in Dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146950 Approved by: https://github.com/zou3519 ghstack dependencies: #146714, #146367	2025-02-26 19:47:39 +00:00
Ryan Guo	f46f0e465c	[dynamo] Initial support for `nonstrict_trace` (#146367 ) ## Context > Note: `mark_traceable` got renamed to `nonstrict_trace` after > offline discussion. The reasons are (1) it aligns with `torch.export`'s > `nonstrict` notion, and (2) it's more definitive in behavior suggestion. 1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0) 2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn) ## Summary This patch adds a `torch._dynamo.nonstrict_trace` decorator, which currently is an enhanced version of `torch._dynamo.allow_in_graph` (see docstring for their differences). Specifically, this patch focuses on the UI and functionality prototyping/plumbing. The main enhancement is supporting more input types, and the implementation challenge lies in reconstructing the input objects from Dynamo `VariableTracker` (while accounting for buffered side-effects and guards). This patch takes a middle-ground (simple implementation with a bit of user labor), by 1. asking the user to provide pytree registration for non-proxy-able input types, 2. letting Dynamo trace through `pytree_flatten` (which accounts for buffered side-effects and guards automatically), 3. and passing in the TreeSpec as a graph attribute constant into `torch._higher_order_ops.flat_apply` (which unflattens the inputs and invokes the underlying function). ## Next Steps In subsequent patches, we will try to support the following: - annotating on class method - reads to global tensors - inputs that contains `pytree.register_constant`-ed instances. 
- function as input - more output types (e.g., any pytree-registered type) - `torch.nn.Module` as inputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367 Approved by: https://github.com/zou3519 ghstack dependencies: #146714	2025-02-26 19:47:39 +00:00
Ryan Guo	bab84f0bd9	[hop] Support more output types for `flat_apply` (#146714 ) This patch enables `flat_apply` to support certain non-Tensor output types like containers and graphable types. This will in turn enable the upcoming `mark_traceable` to support more output types. The patch also exposes a `func_to_graphable` rather than having the users calling the lower level `pytree.flatten(ConstantFunction(...))`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714 Approved by: https://github.com/zou3519	2025-02-26 19:47:39 +00:00
IvanKobzarev	8594856651	[aotd] Alias of intermediate unwrap TensorAlias (#147638 ) Bug was reported by internal user. AOTD classified outputs that are aliases of intermediates of the graph in different categories. ... - output is alias of intermediate which base is already output - output is alias of intermediate which base is not in output If we look at the fn: ``` def fn(x): ix = x + 1 a = ix.transpose(0, 1) return a.detach(), a ``` output 0: detach view of alias a, where a is already output output 1: alias of intermediate ix, then additional output ix will be added internally output 0 base is TensorAlias(a) in this case, but could be Tensor. Adding runtime unwrapping solves this problem. Alternatively we should track base of a.detach() all the way to ix, in that case the base will be always a Tensor, not TensorAlias. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638 Approved by: https://github.com/bdhirsh	2025-02-26 19:42:21 +00:00
Xintong Hu	30db64bf51	[PT2] Support add/remove passes in pre_grad (#146064 ) Summary: support the same functionality with acc_tracer disabled, add a new config for pre_grad add/remove_passes, at the front end it still uses the same interface some minor updates in pre_grad passes to make sure the passes are run in desired order, after added passes, still run pass like remove_noops at the end Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link) Differential Revision: D68909278 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064 Approved by: https://github.com/frank-wei	2025-02-26 18:46:43 +00:00
Nikita Shulga	00732c3f7e	[MPS] Implemented `masked_fill_scalar` as shader (#147369 ) - Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into `c10/metal/indexing.h` header - Initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following ```metal ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim); if (mask[thread_index]) { StridedTensor<T> input(input_data, sizes, input_strides, ndim); input[thread_index] = val; } ``` But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on M2 GPU are really slow - Solved performance issue by implementing 3 flavors of the same shader: `dense`, which is used when both input and mask are dense tensors of the same size, `broadcast`, which is used when `mask` is expandable along the leading dimensions into the input tensor, and `strided`, which is a general-purpose fallback but still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors. 
Performance measured on M2Pro thru different iterations of the same shader

| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------|------| -----| ---- | --- | ---- |
| float32 | 2.8 msec | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec | 26.6 msec | 4.6 msec | 1.9 msec |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |

And benchmark script
```python
import torch
from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer


def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369 Approved by: https://github.com/dcci ghstack dependencies: #147977	2025-02-26 18:39:15 +00:00
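To see why the general strided path in #147369 is the expensive one, here is a rough Python sketch of mapping a flat thread index to a memory offset. It is an illustration only: `offset_from_index` is a hypothetical helper, not the Metal code in `c10/metal/indexing.h`, and the real shader may organize this differently.

```python
def offset_from_index(index, sizes, strides):
    """Hypothetical helper: flat logical index -> element offset.

    Each dimension costs one division and one modulo, which is the work
    a dense kernel can skip entirely (there, offset == index).
    """
    offset = 0
    # Peel off coordinates starting from the innermost dimension.
    for size, stride in zip(reversed(sizes), reversed(strides)):
        index, pos = divmod(index, size)
        offset += pos * stride
    return offset


# Contiguous 2x3 tensor: the offset equals the flat index.
dense = [offset_from_index(i, (2, 3), (3, 1)) for i in range(6)]

# Transposed view (sizes 3x2 over a 2x3 buffer): offsets jump around.
transposed = [offset_from_index(i, (3, 2), (1, 3)) for i in range(6)]

# Mask broadcast along the leading dimension: stride 0 reuses one row.
broadcast = [offset_from_index(i, (2, 3), (0, 1)) for i in range(6)]

print(dense, transposed, broadcast)
```

The dense and broadcast cases have a regular enough pattern that a specialized kernel can avoid the per-dimension division, which is consistent with the three shader flavors the commit describes.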
Isalia20	ebf6b9839c	[MPS] faster integer batched matmul (#147877 ) Followup to #147526 Tiled matmul for bmm as well. ## Speed ups: ![speedups_bmm](https://github.com/user-attachments/assets/02501145-7d64-4bbe-9dcc-994f004b4829) Script to record times: ```python import torch import numpy as np import time import csv batch_sizes = [1, 2, 4, 8] matrix_sizes = [256, 512, 1024, 2048] num_runs = 10 warmup_runs = 3 def run_int_mm(A, B): torch.mps.synchronize() start = time.perf_counter() c = A @ B torch.mps.synchronize() end = time.perf_counter() return c, end - start results = { 'N': [], 'B': [], 'mean_time': [], 'std_time': [] } for b in batch_sizes: for n in matrix_sizes: print(f"\nBenchmarking N={n} and B={b}") try: A_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps") B_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps") for _ in range(warmup_runs): _, _ = run_int_mm(A_mps, B_mps) times = [] for _ in range(num_runs): _, t = run_int_mm(A_mps, B_mps) times.append(t) mean_time = np.mean(times) std_time = np.std(times) results['N'].append(n) results['B'].append(b) results['mean_time'].append(mean_time) results['std_time'].append(std_time) print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s") except RuntimeError as e: print(f"Error for N={n}: {e}") continue with open('int_bmm_benchmark_times_new.csv', 'w', newline='') as f: writer = csv.writer(f) writer.writerow(['N', 'batch', 'mean_time', 'std_time']) for i in range(len(results['N'])): writer.writerow([ results['N'][i], results['B'][i], results['mean_time'][i], results['std_time'][i] ]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147877 Approved by: https://github.com/Skylion007	2025-02-26 18:37:13 +00:00
Henry Tsang	cfb293ee02	[inductor] Add logs for precompile and autotuning (#147923 ) Differential Revision: D70222645 I want to add more logs around precompile, especially around the reason why sometimes it gets fast returned. See https://github.com/pytorch/pytorch/pull/147590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147923 Approved by: https://github.com/Skylion007	2025-02-26 18:26:07 +00:00
Jagadish Krishnamoorthy	0ea5d1067b	ROCm: Remove static specifier for allow_tf32 variable. (#147186 ) Since the env variable HIPBLASLT_ALLOW_TF32 can change, remove static type for allow_tf32 variable so that it captures the current value of env variable HIPBLASLT_ALLOW_TF32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186 Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd	2025-02-26 18:24:02 +00:00
Animesh Jain	4e4191854b	[logs][qol] Print log options alphabetically (#147888 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147888 Approved by: https://github.com/jansel	2025-02-26 18:15:39 +00:00
rzou	fb566c5aea	Fix auto_functionalize x inference_mode (#147925 ) Fixes #147924 We were using the wrong FunctionalTensorMode to construct FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on construction, so that led to the wrong FunctionalTensorMode being modified. This PR threads the FunctionalTensorMode through correctly. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925 Approved by: https://github.com/bdhirsh	2025-02-26 18:05:30 +00:00
drisspg	678435c443	[FlexAttention] Fix IMA bug (#147918 ) # Summary Fixes: https://github.com/pytorch/pytorch/issues/147268 I got this right for the backward and somehow forgot to do the flip in the forward; not sure how this wasn't found earlier. Testing IMAs is tough in pytest, so I didn't add a test, but I verified the fix on the reproducer ```py ❯ sanitize python flex/maurice_ima.py --setting 0 ========= COMPUTE-SANITIZER pool: torch.Size([64, 8, 784, 64]) tensor(1.0078, device='cuda:0') Feat shape torch.Size([64, 8, 784, 64]) Feat strides (401408, 50176, 64, 1) Feat is contig: True attn: torch.Size([64, 8, 784, 64]) tensor(1.7994, device='cuda:0') ========= ERROR SUMMARY: 0 errors ❯ sanitize python flex/maurice_ima.py --setting 1 ========= COMPUTE-SANITIZER pool: torch.Size([64, 8, 784, 64]) tensor(2.8297, device='cuda:0') Feat shape torch.Size([64, 8, 784, 64]) Feat strides (401408, 50176, 64, 1) Feat is contig: True attn: torch.Size([64, 8, 784, 64]) tensor(1.9714, device='cuda:0') ========= ERROR SUMMARY: 0 errors ❯ sanitize python flex/maurice_ima.py --setting 2 ========= COMPUTE-SANITIZER pool: torch.Size([64, 8, 784, 64]) tensor(3.2232, device='cuda:0') Feat shape torch.Size([64, 8, 784, 64]) Feat strides (401408, 50176, 64, 1) Feat is contig: True attn: torch.Size([64, 8, 784, 64]) tensor(2.2095, device='cuda:0') ========= ERROR SUMMARY: 0 errors ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147918 Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007	2025-02-26 17:59:05 +00:00
Catherine Lee	3f7e242c86	[CI] Checkout with more processes (#147652 ) The default action doesn't use more processes, possibly because most github provided runners only have 2 cpus, but we have more than that, so we might as well use them Generally cuts maybe 1 min off of checkout time? Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652 Approved by: https://github.com/ZainRizvi	2025-02-26 17:51:28 +00:00
Xilun Wu	ef61c290e1	[DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025 ) Resolves https://github.com/pytorch/pytorch/issues/146767. May also resolve https://github.com/pytorch/pytorch/issues/147584. ### Summary This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons: 1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG which currently requires CUDA device to be present. 2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution. Besides, `OffsetBasedRNGTracker` only accepts `DeviceMesh` argument to its constructor method. ### Consequence DTensor RNG initialization is delayed till the first DTensor random ops call or `torch.distributed.tensor.random.manual_seed`. ### Test `pytest test/distributed/tensor/test_random_ops.py` `pytest test/distributed/tensor/parallel/test_tp_random_state.py` `pytest test/distributed/tensor/parallel/test_tp_style.py` Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025 Approved by: https://github.com/kwen2501	2025-02-26 17:33:22 +00:00
Nikita Shulga	5ef94ca816	[BE] Do not copy arguments in variadic template (#147977 ) By adding missing `std::forward<Args>(args)...` and declaring template as passing args by reference Noticed while working on creating `mtl_setBytes` specification that takes `MPSScalar` as argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/147977 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-26 17:20:16 +00:00
Boyuan Feng	ba9ed856e0	[FlexAttention] Improve error msg for embedding < 16 (#147765 ) flex_attention uses tl.dot, which [does not support embedding < 16](https://github.com/triton-lang/triton/issues/2266) on input shapes. This PR adds explicit error message for users who are prototyping with small tensors. Fixes #147701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147765 Approved by: https://github.com/drisspg	2025-02-26 17:06:35 +00:00
Alex Baden	ac926f81cc	[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395 ) Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395 Approved by: https://github.com/eellison	2025-02-26 16:56:17 +00:00
Simon Fan	fd1220e386	[ca] side-effect free initial trace: compiled_args (#147804 ) Use const methods to prevent accidental mutation. Changes are mainly in Error nodes and PyNode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804 Approved by: https://github.com/jansel ghstack dependencies: #147242, #147796	2025-02-26 16:37:27 +00:00
Simon Fan	5e3069dde8	[ca] side-effect free initial trace: GraphTask (#147796 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796 Approved by: https://github.com/jansel ghstack dependencies: #147242	2025-02-26 16:37:27 +00:00
Simon Fan	0a2da008f8	[ca] trace saved variable unpacking (#147242 ) ## Before Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op. ## After We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as: ```python # pseudocode class SavedVariable: def unpack(self): if self.hook: return self.hook(self.packed_data) else: return self.packed_data # This approach won't directly work when we add support for Forward AD or double-backward. ``` When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution. All tests pass when running the CA graph directly; the remaining issues are in Dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242 Approved by: https://github.com/jansel	2025-02-26 16:37:17 +00:00
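The difference between unpacking everything up front and unpacking progressively, as described in #147242, can be sketched in plain Python. This expands the commit's pseudocode under assumptions: the constructor, `unpack_hook`, and `run_backward_progressively` are hypothetical names for illustration, and integers stand in for saved activations.

```python
class SavedVariable:
    """Runnable version of the commit's pseudocode (constructor assumed)."""

    def __init__(self, packed_data, hook=None):
        self.packed_data = packed_data
        self.hook = hook

    def unpack(self):
        if self.hook:
            return self.hook(self.packed_data)
        return self.packed_data


hook_calls = []


def unpack_hook(packed):
    """Hypothetical unpack hook, e.g. recompute or load from offload storage."""
    hook_calls.append(packed)
    return packed * 2


def run_backward_progressively(saved_vars):
    """Unpack each saved variable only when its backward node runs."""
    for sv in reversed(saved_vars):
        yield sv.unpack()


saved = [SavedVariable(1, unpack_hook), SavedVariable(2, unpack_hook), SavedVariable(3)]
steps = run_backward_progressively(saved)

calls_before_start = list(hook_calls)      # nothing has been unpacked yet
first = next(steps)                        # last-saved variable, no hook
second = next(steps)                       # triggers exactly one hook call
calls_after_two_steps = list(hook_calls)
print(first, second, calls_after_two_steps)
```

Because the hooks fire one node at a time, only the activation currently needed is materialized, which is the property checkpointing and offloading rely on.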
Xuehai Pan	08f4c1a233	[dynamo] add sourceless builder for `types.MethodType` (#147880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880 Approved by: https://github.com/jansel	2025-02-26 15:43:47 +00:00
Katarzyna Fojcik	edaf9ddeb5	Add basic Gaudi support to benchmarks/dynamo (#145920 ) This PR adds basic Gaudi support to benchmarks/dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920 Approved by: https://github.com/eellison	2025-02-26 14:50:22 +00:00
leslie-fang-intel	be830c8b1c	[Inductor][CPP] fix store mode atomic add (#147961 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/147848 and https://github.com/pytorch/pytorch/issues/146390. While addressing these issues, 2 problems were encountered: - In `CppVecKernel`, when the number of threads is 1 and the mode is `atomic_add`, `store` did not `load/add` before storing. This has been fixed in this PR. - In `CppTile2DKernel`, `store` did not support `atomic_add` mode. Support for this has been added in this PR. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_nn_fold ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147961 Approved by: https://github.com/malfet	2025-02-26 14:04:34 +00:00
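The first problem listed in #147961 comes down to store semantics: in `atomic_add` mode a store must accumulate into the destination even when only one thread is running. A minimal Python sketch of that contract follows; the `store` function and its `mode` argument are hypothetical stand-ins for the generated C++ code, not Inductor APIs.

```python
def store(buf, index, value, mode=None):
    """Hypothetical store with an optional accumulate mode."""
    if mode is None:
        buf[index] = value               # plain overwrite
    elif mode == "atomic_add":
        buf[index] = buf[index] + value  # load, add, then store
    else:
        raise ValueError(f"unsupported store mode: {mode}")


# A fold-like pattern: several inputs land on the same output slot.
writes = [(0, 1.0), (1, 2.0), (1, 3.0), (2, 4.0)]

accumulated = [0.0, 0.0, 0.0]
for idx, val in writes:
    store(accumulated, idx, val, mode="atomic_add")

overwritten = [0.0, 0.0, 0.0]
for idx, val in writes:
    store(overwritten, idx, val)  # the buggy behavior: last write wins

print(accumulated, overwritten)
```

Skipping the load/add step silently drops contributions whenever indices repeat, regardless of thread count.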
Irem Yuksel	f522d899fb	Add MSVC version condition to "Fix for MSVC problem on Windows Arm64 (#136765 )" (#145076 ) This PR adds MSVC version guards around the if block presented on f7e36d8d6f9706ee9b9653538c4c8d2ba375a181. This commit was to provide a workaround for the problem reported here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 . The issue is fixed now and only appears between versions 19.36 and 19.42. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145076 Approved by: https://github.com/malfet, https://github.com/alinpahontu2912 Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-02-26 12:08:24 +00:00
Luca Wehrstedt	60d94ea22b	Add option to limit number of SMs used by matmul kernels (#147966 ) Resubmission of #144974 which was reverted for unrelated reasons. Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software. Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels. While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels. For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted in later. I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966 Approved by: https://github.com/danthe3rd	2025-02-26 12:01:12 +00:00
Zhenbin Lin	7ffae2c028	Split test_transformers.py (#147441 ) Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441 Approved by: https://github.com/drisspg	2025-02-26 11:54:24 +00:00
William Wen	cf6d1e6824	[dynamo] add generic graph break hints (#147429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147429 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #147385	2025-02-26 09:20:28 +00:00
William Wen	3fd68e4e2f	[dynamo] make some more graph break messages readable in English [2/N] (#147385 ) This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385 Approved by: https://github.com/jansel	2025-02-26 09:20:28 +00:00
Ruben Rodriguez Buchillon	7a06bfdd1c	[inductor][ck] kBatch parametrized (#147885 ) Summary: # Why Enable us to set the kBatch parameter, rather than bake it in. Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests) ## Why like this The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code. The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, as now multiple choices of kBatch can be used with the exact same `.so` (as the shared library does not depend on kBatch, but takes it as a parameter) # What - copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent - pipe through everything from template and kernel - hard-code it to be kBatch=1 for now (same as before, just now settable) This is part of a series of Diffs, where next we need to figure out 1. how to filter out ops + kBatch that don't work 2. set this better for splitK scenarios (hand written heuristic) Test Plan: (with minor modifications) ``` # show it working with AOTI buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot ``` ``` # show it working with inductor only buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 ``` Differential Revision: D70200008 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885 Approved by: https://github.com/ColinPeppler	2025-02-26 07:28:19 +00:00
PyTorch MergeBot	a84db75e1b	Revert "torch._scaled_mm with MXFP8 (#147548 )" This reverts commit 12b9674cb603438639298d6c9757ea93e18a7289. Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - similar to previous, see below ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2684134336))	2025-02-26 07:17:24 +00:00
Huy Do	4216478250	Fix the benchmark config name from H100 benchmark (#147947 ) When using the wrong benchmark configs, the benchmark jobs will be skipped. The name should have the `_cuda_h100` suffix as used in the test matrix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147947 Approved by: https://github.com/wdvr	2025-02-26 06:40:07 +00:00
Isuru Fernando	4ec6c1d1ec	Fix test_halide.py report invocation to re-run failed tests (#147640 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147640 Approved by: https://github.com/jansel	2025-02-26 06:32:22 +00:00
PyTorch MergeBot	acca9b9cb0	Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803 )" This reverts commit 0b9da1ae0ad30ef228f132354b875bcaec214ace. Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))	2025-02-26 05:32:17 +00:00
vasiliy	12b9674cb6	torch._scaled_mm with MXFP8 (#147548 ) # summary Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices. If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS. This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm). - Scales are flipped based on transpose_result - Handles boundary conditions Note that MXFP4 is not added in this PR - we can tackle that in a future PR. This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases. # test plan ``` pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548 Approved by: https://github.com/drisspg Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-02-26 05:21:26 +00:00
Nikita Shulga	9ed40af917	[BE][EZ] Delete MacOS-12.3 xfail list (#147905 ) As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacOS-13 checks from the test script. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905 Approved by: https://github.com/dcci ghstack dependencies: #147892	2025-02-26 05:08:09 +00:00
Nikita Shulga	a2399c9b44	[BE] Switch `index_variable` to `torch.testing.make_tensor` (#147892 ) As it was a long-time todo and actually unblocks using this function for MPS devices (that do not support double) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147892 Approved by: https://github.com/dcci	2025-02-26 05:08:09 +00:00
eellison	c839fa4dd2	[Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861 ) Resubmit of https://github.com/pytorch/pytorch/pull/145448. it lost its changes on rebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861 Approved by: https://github.com/zou3519	2025-02-26 05:05:06 +00:00
Jerry Mannil	ba25e26baa	[ROCm] Use IPT=8 for block radix sort (#147657 ) Improve performance for shapes that use block radix sort by decreasing the item_per_thread to 8. This will increase the thread block size leading to higher occupancy. Co-author: @amd-sushetty Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657 Approved by: https://github.com/jeffdaily	2025-02-26 04:22:16 +00:00
Ke Wen	f211818bc0	[c10d] Restrict use condition of NCCL mem pool (#147764 ) Add a check to see if the CUDA driver supports multicast, as is done in Symmetric Memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764 Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang	2025-02-26 03:40:00 +00:00
Henry Tsang	d3fc583ff0	[cutlass backend] force_disable_caches for test_number_mm_precompiles (#147901 ) Summary: Test is flaky right now. Differential Revision: D70209511 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147901 Approved by: https://github.com/ColinPeppler	2025-02-26 03:22:49 +00:00
Davide Italiano	9ad0ad6497	[MPS] Introduce a shader for `entr()`. (#147914 ) To be used in eager/inductor in order to implement the missing operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147914 Approved by: https://github.com/malfet	2025-02-26 02:54:44 +00:00
Menglu Yu	805f7d97f7	[Inductor][Optimus] Fix a corner case in split cat aten pass (#147784 ) Summary: We need to further check the input of the cat to make sure all of them are from the same split node. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/c875cbdd-5374-46cf-811c-45f91cf6ba3e Test UI: https://www.internalfb.com/intern/testinfra/testrun/10977524161964655 Network: Up: 64KiB Down: 27KiB (reSessionID-2e5915cb-4894-48f6-ab1c-3981adb42dab) Executing actions. Remaining 0/3 1.5s exec time total Command: test. Finished 2 local Time elapsed: 2:52.1s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E before aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 after aps-recgpt_ig_emb_pt2_comment_out-c03f74e353 Differential Revision: D70132209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147784 Approved by: https://github.com/Microve	2025-02-26 02:19:48 +00:00
Sun, Jiayi	b533bb4b13	optimize the decomposition of aten.native_group_norm (#144733 ) Summary: Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large. The original decomposition: 1. compute `mean` and `rstd`, 2. out = (x - mean) * rstd, compute in the range [N, C, *], 3. out = out * weight + bias, compute in the range [N, C, *]. The new decomposition: 1. compute `mean` and `rstd`, 2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C], 3. out = x * new_weight + new_bias, compute in the range [N, C, *]. I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-02-26 01:42:46 +00:00
mori360	12112fd198	Fix bug in FSDP wrapped module with zero argument (#147771 ) Fixes https://github.com/pytorch/pytorch/issues/147531 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147771 Approved by: https://github.com/awgu	2025-02-26 01:40:53 +00:00
martin-kokos	8de6fe8c0b	[docs] fix numpy docs reference (#147697 ) Fix a link to numpy documentation that has moved and now 404s. I've checked other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697 Approved by: https://github.com/soulitzer	2025-02-26 01:30:03 +00:00
PyTorch MergeBot	90e3a3d86d	Revert "[ca] trace saved variable unpacking (#147242 )" This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca. Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))	2025-02-26 00:40:16 +00:00
PyTorch MergeBot	4d614baa30	Revert "[ca] side-effect free initial trace: GraphTask (#147796 )" This reverts commit 5758743f3c92f9cd9b61bc435602f13dd19c13d7. Reverted https://github.com/pytorch/pytorch/pull/147796 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147796#issuecomment-2683599896))	2025-02-26 00:36:08 +00:00
PyTorch MergeBot	143f0f0006	Revert "[ca] side-effect free inital trace: compiled_args (#147804 )" This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67. Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740))	2025-02-26 00:31:40 +00:00
drisspg	3ecfe6be25	[Submodule] Turning flash-attention integration into 3rd party submod (#144120 ) (#146372 ) Summary: # Summary ### Sticky points Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC ## Dependencies - Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419 ### Other Points - The BC linter is complaining about losing generate.py and its functions which is not real BC surface cc albanD imported-using-ghimport Test Plan: Imported from OSS Building in dev `buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output ` I and Nming the .so I do see that the flash symbols are correctly named: ``` 0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const ``` Reviewed By: vkuzo Differential Revision: D68502879 Pulled By: drisspg Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/146372 Approved by: https://github.com/jbschlosser	2025-02-26 00:10:59 +00:00
Animesh Jain	276dfe8150	[dynamo][cpp-guards] Disable dict-tag optim if the guard_manager has child accessors (#147694 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147694 Approved by: https://github.com/isuruf	2025-02-26 00:02:08 +00:00
Mikayla Gawarecki	8e7e5ba182	Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759 ) This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls. The reason why I didn't land that was because in `legacy_load`, the constructor would be called before the storages of indices/values are set, so the tensor would not actually be validated. Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process since afaict this was the first PR that added support for sparse tensor serialization https://github.com/pytorch/pytorch/pull/27062 and it already uses `_rebuild_sparse_tensor` (which would add the rebuilt tensor to the list to validate), but torch.sparse.FooTensor is allowlisted. This PR adds tensors constructed as such to the list to validate at the end of torch.load. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759 Approved by: https://github.com/albanD	2025-02-25 23:51:12 +00:00
PyTorch MergeBot	c82c1411c6	Revert "torch._scaled_mm with MXFP8 (#147548 )" This reverts commit e34c15a05b027b9da0962c971d448138fcf94926. Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2683517851))	2025-02-25 23:28:15 +00:00
henrylhtsang	0633f63f0d	[cutlass backend] try fix standlone runner test (#147811 ) Differential Revision: [D70147859](https://our.internmc.facebook.com/intern/diff/D70147859/) Trying to fix this test one last time, especially when mixed mm is getting removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147811 Approved by: https://github.com/chenyang78	2025-02-25 23:27:02 +00:00
PyTorch MergeBot	05bc8fe62e	Revert "follow up to #147548 , fix regression on MI300 (#147878 )" This reverts commit cc444e75d540daff127f0210b7f8965a5c2b8d2a. Reverted https://github.com/pytorch/pytorch/pull/147878 on behalf of https://github.com/wdvr due to temporary reverting to revert an older one in the stack ([comment](https://github.com/pytorch/pytorch/pull/147878#issuecomment-2683515567))	2025-02-25 23:25:59 +00:00
Anatoly Myachev	2df9a8d72d	[Inductor][Tests] Update `get_divisible_by_16` function in `test_torchinductor.py` to work correctly with new Triton (#147865 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147865 Approved by: https://github.com/davidberard98	2025-02-25 23:14:13 +00:00
PyTorch MergeBot	1e894d2635	Revert "Add option to limit number of SMs used by matmul kernels (#144974 )" This reverts commit af2d63637ed025789679a17c241e6bb466508a1d. Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))	2025-02-25 22:46:38 +00:00
Jeff Daily	cc444e75d5	follow up to #147548 , fix regression on MI300 (#147878 ) Removing curly braces seemed superficial but broke MI300 rowwise matmul. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147878 Approved by: https://github.com/drisspg	2025-02-25 22:16:28 +00:00
Tugsbayasgalan Manlaibaatar	a821d69d92	Fix register constant to be usable in exportz (#147533 ) Differential Revision: [D69939737](https://our.internmc.facebook.com/intern/diff/D69939737) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147533 Approved by: https://github.com/zou3519	2025-02-25 21:10:47 +00:00
PyTorch MergeBot	0d31c621a3	Revert "[inductor][triton] Ignore block ptr advances for removed buffers (#147193 )" This reverts commit 17766b7aad0d9931bb6b3485fcf3d4c7532c3557. Reverted https://github.com/pytorch/pytorch/pull/147193 on behalf of https://github.com/wdvr due to failing tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147193#issuecomment-2683286358))	2025-02-25 21:04:04 +00:00
Saurabh Mishra	6eb3d1e762	[DCP] Cache save plans in default planner (#147343 ) Summary: This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high level approach: - Create the local plan and cache it. - In the next iteration, compare the local plan with the cached plan metadata. If there is no change, do not send that local plan in the collective. - The global plan step will only create the global plan with the new delta plans and empty plans for the cached ones. - The finish plan step will check for empty plans. If a plan is empty, it will grab the cached plan. If not, it will use the new plan provided. Test Plan: UTs Differential Revision: D69224491 ## How to enable the caching: DefaultSavePlanner introduces `enable_plan_caching`, which is set to False by default for now. https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77 Set this to True to enable the caching and we should see a significant speedup in subsequent checkpoint save attempts, especially for larger scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343 Approved by: https://github.com/MeetVadakkanchery	2025-02-25 20:59:25 +00:00
Avik Chaudhuri	8d921eb97f	export method (#147573 ) The `export` API takes a `nn.Module` and traces its `forward` method. However, sometimes it is useful to export different methods of a `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance. This PR adds a couple of utils in `torch._export.utils` for this workflow. The `wrap_method` util wraps a method as a `nn.Module` that can then be exported. See included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state. These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde. Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573 Approved by: https://github.com/tugsbayasgalan	2025-02-25 20:58:54 +00:00
Hoa Dinh	687fe64667	Fix crash in -[PTMCoreMLCompiler _compileModel:atPath:] (#147809 ) Summary: We could hit one of those exceptions: https://github.com/apple/coremltools/blob/main/modelpackage/src/ModelPackage.cpp#L205-L225 And it would make this code path crash. Test Plan: build. Differential Revision: D70122378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147809 Approved by: https://github.com/mcr229	2025-02-25 20:56:16 +00:00
Simon Fan	ec768d8dc0	[ca] side-effect free inital trace: compiled_args (#147804 ) const methods to prevent accidental mutation. changes mainly in Error nodes and PyNode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804 Approved by: https://github.com/jansel ghstack dependencies: #147242, #147796	2025-02-25 20:38:51 +00:00
Simon Fan	5758743f3c	[ca] side-effect free initial trace: GraphTask (#147796 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796 Approved by: https://github.com/jansel ghstack dependencies: #147242	2025-02-25 20:38:51 +00:00
Simon Fan	68ddca9449	[ca] trace saved variable unpacking (#147242 ) ## Before Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order wrt other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op. ## After We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as: ```python # pseudocode class SavedVariable: def unpack(self): if self.hook: return self.hook(self.packed_data) else: return self.packed_data # This approach won't directly work when we add support for Forward AD or double-backward. ``` When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution. All tests pass when running the CA graph directly; the remaining issues are in Dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242 Approved by: https://github.com/jansel	2025-02-25 20:38:51 +00:00
Yidi Wu	adf0f4ffd2	[custom op] fix inductor cpp codegen when returning a list of single tensor (#147649 ) For a custom op that returns a list of a single tensor with unbacked symint shape: ```python @torch.library.custom_op( "aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={} ) def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]: s = x.sum().to(torch.int64) return [torch.randn(s.item())] @fn_ret_list_of_single_tensor.register_fake def _(x): ctx = torch._custom_op.impl.get_ctx() i0 = ctx.new_dynamic_size() return [torch.randn(i0)] ``` Before the fix, we have the following error: ``` /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’ 456 \| auto u0 = std::get<0>(buf1).size(0); \| ~~~~~~~~~~~^~~~~~ /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note: expected a type, got ‘0’ In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14, from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5, from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4, from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3, from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31, from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8, from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4, from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6, from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3, from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39, from 
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366: /usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’ 1145 \| constexpr const _Tp&& get(const variant<_Types...>&& __v) \| ^~~ /usr/include/c++/11/variant:1145:27: note: template argument deduction/substitution failed: /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’ 456 \| auto u0 = std::get<0>(buf1).size(0); \| ~~~~~~~~~~~^~~~~~ /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note: expected a type, got ‘0’ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649 Approved by: https://github.com/angelayi ghstack dependencies: #147130	2025-02-25 20:28:41 +00:00
Yidi Wu	824474cb35	[cond] support output sizes mismatch in front end (#147130 ) This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130 Approved by: https://github.com/zou3519	2025-02-25 20:28:41 +00:00
Bo Li	de80b6f0d3	Updated test_cuda.py to rerun tests (#147040 ) Initially test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw were skipped in commit `d4871750d9`. Pulled the ROCm nightly image and verified that these two tests run fine locally. Filed this PR to enable them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-02-25 19:58:42 +00:00
Benjamin Glass	361b6c97cd	cpp_wrapper: Fixup output code indentation (#147215 ) Closes #142165. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147215 Approved by: https://github.com/desertfire ghstack dependencies: #146109, #146424	2025-02-25 19:50:37 +00:00
Benjamin Glass	7c515b2da4	cpp_wrapper: fix test_torchinductor* tests (#146424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146424 Approved by: https://github.com/desertfire ghstack dependencies: #146109	2025-02-25 19:50:37 +00:00
Benjamin Glass	46d1422afd	cpp_wrapper: fix inductor triton tests (#146109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109 Approved by: https://github.com/desertfire	2025-02-25 19:50:37 +00:00
Sam Larsen	9740d69e78	[logging] Add toplevel dynamo_compile / tlparse logging for AOTI (#147760 ) Summary: This adds the proper context managers in `compile_fx_aot` such that we get: 1) A toplevel chromium event (i.e., tlparse) 2) A single `dynamo_compile` log entry Test Plan: Before: * Scuba (we only log the dynamo event): https://fburl.com/scuba/dynamo_compile/sandbox/gaqowzrd * Perfetto trace: https://fburl.com/vol7r6w1 After: * Scuba (we log the dynamo _and_ compile_fx_aot event): https://fburl.com/scuba/dynamo_compile/sandbox/cx2we8w8 * Perfetto trace (click on the toplevel event to see the additional metadata): https://fburl.com/sziy40r9 Differential Revision: D70113859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147760 Approved by: https://github.com/desertfire	2025-02-25 19:41:39 +00:00
Svetlana Karslioglu	14b9f7f7bc	Remove link to search survey (#147751 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147751 Approved by: https://github.com/malfet	2025-02-25 19:26:59 +00:00
Mwiza	17766b7aad	[inductor][triton] Ignore block ptr advances for removed buffers (#147193 ) Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation are no longer needed. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193 Approved by: https://github.com/jansel Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-25 19:14:55 +00:00
Xuehai Pan	ea6938a1f7	Add XuehaiPan to CODEOWNERS for C++ PyTree utilities (#137408 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137408 Approved by: https://github.com/zou3519	2025-02-25 18:48:32 +00:00
Zesheng Zong	580f1183b4	Enable ruff rule S324 (#147665 ) Fixes #147627 - Add `S324` in `pyproject.toml ` - Running check and clean warnings ```bash lintrunner --take RUFF --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-25 18:27:34 +00:00
iupaikov-amd	6061664266	Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620 ) During ROCm runs, the padding path turns out to be slower for our archs, so pad_mm opts out of padding, which makes those tests fail. As I understand it, those tests don't check IF the operation should be padded in the first place, but HOW it is padded and whether the padding is done correctly. Moreover, the tests shouldn't be hardware dependent or conditional on the hardware. Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620 Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314	2025-02-25 18:06:48 +00:00
Ethan Wee	651e6aacf9	[ROCm] Remove benign warning about missing amdgpu.ids (#147791 ) Fixes #144203. We build a custom libdrm when preparing our docker image. We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink. Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250". The libdrm warning is noisy, so we are removing it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791 Approved by: https://github.com/jeffdaily	2025-02-25 17:17:25 +00:00
FFFrog	e5a13410cd	Fix the tiny doc descriptions (#147319 ) As the title stated Pull Request resolved: https://github.com/pytorch/pytorch/pull/147319 Approved by: https://github.com/zou3519	2025-02-25 17:10:16 +00:00
Nikita Shulga	346bbefa63	[BE] Parameterize TestSDPA in test_mps.py (#147856 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856 Approved by: https://github.com/Skylion007	2025-02-25 16:07:24 +00:00
Robert Hardwick	810d2a3dbd	[ARM] Fix bug in _ref_test_helper in test_ops and fix failing test on Aarch64 (#146597 ) We have a failing unit test on Aarch64 ``` Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='') To execute this test, run the following from the base repo dir: PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` After debugging, I found that the `ex` variable is not reset to None on each loop iteration inside _ref_test_helper. Fixing this highlighted another expectedFailure to re-enable: `nn.functional.hinge_embedding_loss`, which was incorrectly being skipped due to the same problem. `4a545eb85d/test/test_ops.py (L546)` The `ex` variable is not reset after this line for the next loop iteration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597 Approved by: https://github.com/digantdesai	2025-02-25 14:15:10 +00:00
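The stale-variable problem described in the commit above can be illustrated with a minimal sketch. This is not the code of `_ref_test_helper`; the function names `collect_failures_buggy` and `collect_failures_fixed` and the `check` callback are hypothetical, chosen only to show how an exception holder that is not reset per iteration leaks into later iterations.

```python
def collect_failures_buggy(samples, check):
    """Return indices reported as failing; `ex` is only initialized once."""
    failures = []
    ex = None  # initialized outside the loop, never reset
    for i, sample in enumerate(samples):
        try:
            check(sample)
        except AssertionError as e:
            ex = e
        # A failure from an earlier sample is still stored in `ex` here,
        # so every later sample is reported as failing too.
        if ex is not None:
            failures.append(i)
    return failures


def collect_failures_fixed(samples, check):
    """Same loop, but `ex` is reset at the start of every iteration."""
    failures = []
    for i, sample in enumerate(samples):
        ex = None  # reset per iteration
        try:
            check(sample)
        except AssertionError as e:
            ex = e
        if ex is not None:
            failures.append(i)
    return failures


def must_be_positive(x):
    # Hypothetical check used only for this illustration.
    assert x > 0, f"{x} is not positive"
```

With samples `[1, -1, 2]`, the buggy version blames both the second and third samples, while the fixed version blames only the second.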
Isalia20	a695aae89b	[MPS] fix attention for >4d tensors (#147545 ) Fixes #147443 and adds tests for >4d tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-25 13:55:28 +00:00
Bin Bao	0b9da1ae0a	[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803 ) Summary: Let CppBuilder handle all the cpp build logic Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803 Approved by: https://github.com/malfet ghstack dependencies: #147805, #147806, #147807	2025-02-25 13:33:12 +00:00
Bin Bao	cc1c9826d4	[AOTI][refactor] Fix a typo (#147807 ) Summary: defination -> definition Differential Revision: [D70146182](https://our.internmc.facebook.com/intern/diff/D70146182) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147807 Approved by: https://github.com/malfet ghstack dependencies: #147805, #147806	2025-02-25 13:33:12 +00:00
Bin Bao	7ed0670e21	[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806 ) Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680 Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806 Approved by: https://github.com/malfet ghstack dependencies: #147805	2025-02-25 13:33:03 +00:00
Bin Bao	2680e835c8	[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805 ) Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland https://github.com/pytorch/pytorch/pull/147679. Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805 Approved by: https://github.com/malfet	2025-02-25 13:32:54 +00:00
Isalia20	7e37fb0a4c	[MPS] faster integer matmul for mps (#147526 ) There is a naive matmul kernel written for MPS matmul which is used when input types are integer(and some other cases for older MacOSes). The old version of matmul is naive with global memory accesses which really tanks the performance especially when matrix is sufficiently large. This PR optimizes it (even though there might be more optimizations with using simdgroup matrices which I'll cover in followup since writing that kernel will take more time) ## Performance comparison on M1 Pro: ![performance_comparison](https://github.com/user-attachments/assets/6ea8de5a-8231-4c5b-8dc9-caa79ea6879a) You can get these numbers by running this script with old kernel compiled and then new kernel compiled(Make sure to change the csv where each output is written): ```python import torch import numpy as np import time import csv matrix_sizes = [32, 128, 512, 1024, 2048, 4096] num_runs = 10 warmup_runs = 3 def run_int_mm(A, B): torch.mps.synchronize() start = time.perf_counter() c = A @ B torch.mps.synchronize() end = time.perf_counter() return c, end - start results = { 'N': [], 'mean_time': [], 'std_time': [] } for n in matrix_sizes: print(f"\nBenchmarking N={n}") try: A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps") B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps") for _ in range(warmup_runs): _, _ = run_int_mm(A_mps, B_mps) times = [] for _ in range(num_runs): _, t = run_int_mm(A_mps, B_mps) times.append(t) mean_time = np.mean(times) std_time = np.std(times) results['N'].append(n) results['mean_time'].append(mean_time) results['std_time'].append(std_time) print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s") except RuntimeError as e: print(f"Error for N={n}: {e}") continue with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f: writer = csv.writer(f) writer.writerow(['N', 'mean_time', 'std_time']) for i in 
range(len(results['N'])): writer.writerow([ results['N'][i], results['mean_time'][i], results['std_time'][i] ]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-25 13:15:18 +00:00
Wang, Eikan	b63c601614	Update merge rules for oneDNN part (#147615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147615 Approved by: https://github.com/atalman	2025-02-25 11:26:59 +00:00
Luca Wehrstedt	af2d63637e	Add option to limit number of SMs used by matmul kernels (#144974 ) Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software. Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels. While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels. For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted in later. I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974 Approved by: https://github.com/eqy, https://github.com/albanD	2025-02-25 10:19:19 +00:00
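The "wave quantization" effect described in the commit above comes down to a ceiling division. The sketch below is pure arithmetic and calls no PyTorch API; the SM count of 132 and the 8 busy SMs are assumed example numbers, and `num_waves` is a hypothetical helper.

```python
import math


def num_waves(num_blocks: int, available_sms: int) -> int:
    """Number of scheduling waves when each SM runs one block at a time."""
    return math.ceil(num_blocks / available_sms)


TOTAL_SMS = 132  # assumed SM count, for illustration only
BUSY_SMS = 8     # assumed number of SMs occupied by a concurrent kernel

# A persistent kernel launches one block per SM, but only some SMs are free:
# the last few blocks spill into a second wave.
waves_without_carveout = num_waves(TOTAL_SMS, TOTAL_SMS - BUSY_SMS)

# If the matmul is told to leave BUSY_SMS SMs alone, it launches fewer blocks
# and everything fits in a single wave.
carveout = BUSY_SMS
waves_with_carveout = num_waves(TOTAL_SMS - carveout, TOTAL_SMS - BUSY_SMS)
```

Under these assumed numbers, the kernel without a carveout needs two waves and the kernel with a matching carveout needs one, which is the latency doubling the commit message refers to.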
David Berard	94969d0a40	[inductor][user triton] Handle scf.yield more accurately (#147762 ) TL;DR: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args. Context: The relevant kernel is something like this one (added as a test in test_triton_kernels.py) ```python @triton.jit def branch_with_multiple_yield_args( in_ptr0, in_ptr1, out_ptr, conditional_ptr, n_elements, BLOCK_SIZE: "tl.constexpr", ): pid = tl.program_id(axis=0) block_start = pid * BLOCK_SIZE offsets = block_start + tl.arange(0, BLOCK_SIZE) mask = offsets < n_elements conditional = tl.load(conditional_ptr) if conditional: in0 = in_ptr0 + 1 in1 = in_ptr1 + 1 out = out_ptr + 1 else: in0 = in_ptr0 in1 = in_ptr1 out = out_ptr x = tl.load(in0 + offsets, mask=mask) y = tl.load(in1 + offsets, mask=mask) tl.store(out + offsets, x + y, mask=mask) ``` The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters. When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments. The if branch gets converted to TTIR like this: ```mlir %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) { ... scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10) } else { scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11) } loc(#loc7) ``` and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated). This PR we duplicate the `scf.yield` to add one `scf.yield` per return value. 
That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are. Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762 Approved by: https://github.com/oulgen, https://github.com/zou3519	2025-02-25 08:41:00 +00:00
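The difference between one bundled `scf.yield` and one `scf.yield` per returned value can be shown with a toy backwards DFS. This is only a sketch of the idea, not Inductor's mutation analysis: the graph encoding (a dict from a value to the values it is derived from), the node names, and `reachable_params` are all hypothetical.

```python
def reachable_params(graph, start, params):
    """DFS backwards from `start`; return the kernel parameters reached."""
    seen, stack, found = set(), [start], set()
    while stack:
        value = stack.pop()
        if value in seen:
            continue
        seen.add(value)
        if value in params:
            found.add(value)
        stack.extend(graph.get(value, []))
    return found


PARAMS = {"arg0", "arg1", "arg2"}

# Bundled: the stored-to pointer `out` traces back to a single yield node
# whose operands are every value yielded by either branch.
bundled = {
    "out": ["yield"],
    "yield": ["in0_plus_1", "in1_plus_1", "out_plus_1", "arg0", "arg1", "arg2"],
    "in0_plus_1": ["arg0"],
    "in1_plus_1": ["arg1"],
    "out_plus_1": ["arg2"],
}

# Split: `out` traces back to its own yield node, which only carries the
# operands for that one result.
split = {
    "out": ["yield_2"],
    "yield_2": ["out_plus_1", "arg2"],
    "out_plus_1": ["arg2"],
}

mutated_bundled = reachable_params(bundled, "out", PARAMS)
mutated_split = reachable_params(split, "out", PARAMS)
```

In the bundled graph all three parameters are reported as mutated, while in the split graph only `arg2` is.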
Yutao Xu	7bd2e3bca1	Update torch-xpu-ops commit pin (#147743 ) Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](`306a0ffb6e`), includes: - Bugfix (LayerNorm/Nonzeros) - Update AOT target Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743 Approved by: https://github.com/EikanWang	2025-02-25 08:06:35 +00:00
Aviral Goel	866dc45d3c	[Inductor][ROCm][CK] Unhardedcoded kernel shapes for ck_conv_template codegen (#147504 ) ## [Inductor][ROCm][CK] Parameterize `ck_conv_template` Codegen ### Description Previously, ROCm CK kernel codegen templates were hardcoded with fixed values for convolution parameters: - `index_t GroupCount` - `index_t NBatch` - `index_t NOutChannels` - `index_t NInChannels` - `vector<index_t> FilterSize` - `vector<index_t> InputSize` - `vector<index_t> ConvolutionStrides` - `vector<index_t> Dilations` - `vector<index_t> LeftPads` - `vector<index_t> RightPads` This PR updates `ck_conv_template` to accept these parameters dynamically from Inductor. By doing so, we reduce the number of generated templates, improving flexibility and maintainability. ### Testing - Verified correctness by running relevant test cases, i.e `test/inductor/test_ck_backend.py` - Ensured generated kernels reflect the updated parameterization, i.e generated templates in `/tmp/torchinductor_root/` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147504 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/tenpercent Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-02-25 07:48:07 +00:00
Chien-Chin Huang	d73b927662	[DSD] Fixes issue when there is a PG without parameters (#147730 ) Fixes https://github.com/pytorch/pytorch/issues/143828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147730 Approved by: https://github.com/mori360	2025-02-25 07:25:38 +00:00
PyTorch MergeBot	fb73b0c7c5	Revert "use copy2d in h2d/d2h copy when possible (#146256 )" This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0. Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))	2025-02-25 07:06:38 +00:00
Oguz Ulgen	bb7e8fbd66	[CacheBench] Add hf_T5 llama moco to cachebench (#147783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783 Approved by: https://github.com/huydhn ghstack dependencies: #147688, #147780, #147781, #147782	2025-02-25 04:34:45 +00:00
Oguz Ulgen	895564d6b6	[CacheBench] Add huggingface (#147782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782 Approved by: https://github.com/huydhn ghstack dependencies: #147688, #147780, #147781	2025-02-25 04:34:45 +00:00
Oguz Ulgen	c4fb6ae55d	[CacheBench] Separate dynamic into its own option (#147781 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781 Approved by: https://github.com/huydhn ghstack dependencies: #147688, #147780	2025-02-25 04:34:34 +00:00
Oguz Ulgen	60d4cbfc06	[CacheBench] Add repeat option so that we can have more accurate cache results (#147780 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780 Approved by: https://github.com/huydhn ghstack dependencies: #147688	2025-02-25 04:34:25 +00:00
Oguz Ulgen	ab3b814af3	[CacheBench] Add ciflow/trunk test (#147688 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688 Approved by: https://github.com/huydhn	2025-02-25 04:34:16 +00:00
eellison	4b7604ec10	Delete Mixed MM Special Casing (#147151 ) Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup: - prologue fusion is often unprofitable, even for int8 mm. because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not. - similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](`5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)`) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path. It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication. The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. it's still possible to generate with prologue fusion but not currently exactly as the current [impl](`bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)`). But the current impl does not compare to a cublas baseline so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). this should be fine to delete. Future optimizations could include: - cutlass prologue path - making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit. 
Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch	2025-02-25 04:29:54 +00:00
PyTorch MergeBot	890213f65f	Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679 )" This reverts commit 0b52d801d2297ad6c38e631eedfd4dead9360e1b. Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))	2025-02-25 04:11:13 +00:00
PyTorch MergeBot	9b06b30468	Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680 )" This reverts commit 22fae0d948ac14c72b510fafc2283072d744dff9. Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))	2025-02-25 04:06:40 +00:00
Xia, Weiwen	9478c90e2b	[Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430 ) Fixes #147208 Summary The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs. This is because the TensorIterator-based implementation does not support multiple elements per byte. Also, `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch. So, we add a check here to throw a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`. Test plan ``` pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430 Approved by: https://github.com/mingfeima, https://github.com/ngimel	2025-02-25 03:47:40 +00:00
Riley Dulin	20295c017e	Fix import of getArtifactLogger for ir_pre_fusion and ir_post_fusion (#147560 ) Fixes #147002 There was an issue with the previous PR https://github.com/pytorch/pytorch/pull/147248 that didn't show up in CI, where a logging import was not complete in torch/_inductor/debug.py before importing it. This only happened if someone directly imported the file without doing any other imports before. Also set to off_by_default by request to reduce log spew. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147560 Approved by: https://github.com/Skylion007	2025-02-25 03:36:08 +00:00
vasiliy	e34c15a05b	torch._scaled_mm with MXFP8 (#147548 ) # summary Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices. If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS. This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm). - Scales are flipped based on transpose_result - Handles boundary conditions Note that MXFP4 is not added in this PR - we can tackle that in a future PR. This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases. # test plan ``` pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548 Approved by: https://github.com/drisspg Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-02-25 03:32:22 +00:00
cyy	8f728e28dd	Enable ASAN in CUDA tests (#147512 ) It should work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147512 Approved by: https://github.com/soulitzer	2025-02-25 02:58:39 +00:00
FFFrog	b0fa92042b	Fix torch.mean out dtype check (#147188 ) For CPU: Type promotion is supported for torch.mean For Meta: Not supported for torch.mean ISSUE related: https://github.com/pytorch/pytorch/issues/138399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147188 Approved by: https://github.com/albanD	2025-02-25 02:50:03 +00:00
Benjamin Glass	33ff96b3f9	cpp_builder: unbreak clang++ detection (#147775 ) Fixes an issue where `_is_gcc` would match on `clang++` due to the string ending with `g++`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147775 Approved by: https://github.com/desertfire	2025-02-25 02:33:01 +00:00
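The kind of mismatch described in the commit above is easy to reproduce in isolation. The sketch below is not the `cpp_builder` implementation; `is_gcc_buggy` and `is_gcc_stricter` are hypothetical helpers showing why a loose search for `g++` also matches `clang++`, and one possible stricter check on the executable's basename.

```python
import os
import re


def is_gcc_buggy(compiler: str) -> bool:
    # Unanchored search: "clang++" contains "g++", so this returns True for it.
    return bool(re.search(r"(gcc|g\+\+)", compiler))


def is_gcc_stricter(compiler: str) -> bool:
    # Match the whole basename, optionally with a version suffix like "-12".
    name = os.path.basename(compiler)
    return bool(re.fullmatch(r"(gcc|g\+\+)(-\d+(\.\d+)*)?", name))
```

The stricter variant accepts `g++`, `gcc`, and `/usr/bin/g++-12` but rejects `clang++`. It is deliberately narrow and would also reject cross-compiler names with a target prefix, so treat it as an illustration rather than a drop-in check.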
Ding, Yi1	dacdc9782b	[Inductor] Add input value checking to randint meta function (#147191 ) Fixes #147070 Adds value checking for the range to the meta function, similar to the check in the CUDA/CPU aten op. Test with ``` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-02-25 02:18:16 +00:00
leslie-fang-intel	c644f4c5fe	[Inductor] Fix the decompositions of torch isin (#147519 ) Summary Fixed two decomposition issues in `torch.isin`: - Issue 1: As reported in [#147329](https://github.com/pytorch/pytorch/issues/147329), the current decomposition does not support cases where test_element is a scalar. This is now implemented by referring to the `ead970c8d0/aten/src/ATen/native/TensorCompare.cpp (L1004-L1008)` - Issue 2: Found while enabling a unit test with `elements = 1` and `test_elements = torch.tensor([1, 2, 3, 4])`, where Inductor produced different results compared to eager mode. This issue is fixed by referring to `ead970c8d0/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L329-L338)` Test Plan ``` python test/inductor/test_torchinductor.py -k test_isin_tensor_scalar ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147519 Approved by: https://github.com/jgong5, https://github.com/FFFrog, https://github.com/peterbell10	2025-02-25 01:49:44 +00:00
Nikita Shulga	2c8cd41c1f	Delete unused conda-aws-upload environment (#147792 ) As this environment only contains keys for Anaconda uploads Pull Request resolved: https://github.com/pytorch/pytorch/pull/147792 Approved by: https://github.com/atalman	2025-02-25 01:42:52 +00:00
Nichols A. Romero	43074680b5	[ROCm] Add support for gfx1102 arch to wheel builds. (#147761 ) [gfx1102 is not officially supported](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) but most ROCm libs have gfx1102 code objects available since ROCm 5.5. Now that we're using `--offload-compress` we can fit another gfx target. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147761 Approved by: https://github.com/jeffdaily	2025-02-25 01:35:52 +00:00
Anatoly Myachev	97557b9833	[Inductor] Update `set_driver_to_gpu` code to avoid backend re-initialization with new Triton (#147621 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147621 Approved by: https://github.com/jansel	2025-02-25 00:04:54 +00:00
Yuanhao Ji	55bf3ff3a5	[Docs] Add `OpDTypes.any_common_cpu_cuda_one` (#147605 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147605 Approved by: https://github.com/soulitzer	2025-02-24 23:23:43 +00:00
PyTorch MergeBot	e72b4c61bf	Revert "Upgrade submodule oneDNN to v3.7 (#147498 )" This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c. Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))	2025-02-24 22:57:39 +00:00
Peter Yeh	81dccd706b	[ROCm] OCP FP8 Support for new GPUs (#146632 ) TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950 refer to https://github.com/pytorch/ao/pull/1677 This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks. ### Improvements to GPU Architecture and ROCm Version Support: * [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. * [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876) ### Updates to Data Type Handling: * [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments. * [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3. 
### Removal of Outdated Checks: * [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182) These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-02-24 22:47:52 +00:00
Zachary DeVito	a71d8b7246	Fix `ReferenceError: weakly-referenced object no longer exists` in cycle detector (#146922 ) Summary: weakref.proxy objects will throw errors when they are dead. We just do not bother visualizing them. They are weak, so they aren't relevant to cycles anyway. Differential Revision: D69270429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146922 Approved by: https://github.com/tianfengfrank, https://github.com/Chillee	2025-02-24 22:27:39 +00:00
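The behavior behind this fix can be reproduced with the standard library alone. The sketch below is not the cycle detector's code; `Node`, `safe_label`, and `demo` are hypothetical names used to show that touching a dead `weakref.proxy` raises `ReferenceError`, and that a visitor can simply skip such objects.

```python
import gc
import weakref


class Node:
    """Hypothetical object with a label, standing in for a visited object."""

    def __init__(self, label):
        self.label = label


def safe_label(obj):
    """Return obj.label, or None if obj is a proxy whose referent is gone."""
    try:
        return obj.label
    except ReferenceError:
        # Dead weak proxy: nothing to visualize, and it cannot keep a cycle alive.
        return None


def demo():
    node = Node("a")
    proxy = weakref.proxy(node)
    before = safe_label(proxy)  # referent is alive
    del node                    # drop the only strong reference
    gc.collect()                # not needed on CPython, harmless elsewhere
    after = safe_label(proxy)   # referent is gone
    return before, after
```

Calling `demo()` returns the label while the referent is alive and `None` once it has been collected.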
Bin Bao	22fae0d948	[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680 ) Consolidate cpp compilation action to CppBuilder Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680 Approved by: https://github.com/yushangdi, https://github.com/angelayi ghstack dependencies: #147679	2025-02-24 21:45:15 +00:00
Bin Bao	0b52d801d2	[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679 ) The option really means to compile a cpp file using its basename instead of its full path. Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679 Approved by: https://github.com/angelayi	2025-02-24 21:44:33 +00:00
Doru Bercea	96acb56626	[ROCm] Optimize the stride one indexing backwards kernel (#146420 ) This patch makes several changes to the stride 1 backwards indexing kernel as follows: - enables the computation across the `sorted_indices` array to happen in parallel by all the lanes in the warp, which means that the accesses to `sorted_indices` are now fully coalesced. - the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`. - enables skipping during duplicate count: this optimization ensures that for a large number of duplicates we can skip 32 values at a time to speed up the count. - for a low number of duplicates, i.e. when we have fewer than `warp-size` duplicates, just performs the tail reduction, which avoids the wasteful parallel reduction across the warp for this case (it would only add zero values). - for a high number of duplicates, i.e. when we have more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` which has a large number of duplicates (i.e. a duplicate spike). For this to happen we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel. Benefits on examples extracted from workloads show a 3.6x to 10x speed-up. co-author: Hashem Hashemi <Hashem.Hashemi@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-02-24 21:19:06 +00:00
Brian Hirsh	89b9c12de8	remove prints from partitioner (#147749 ) See `c57894cd74..22d8f9a657 (r1968015955)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147749 Approved by: https://github.com/Skylion007, https://github.com/laithsakka	2025-02-24 21:03:45 +00:00
Tristan Rice	8eb400ef66	[BE] TCPStore: use typed errors for assertions (#147647 ) This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python. Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647 Approved by: https://github.com/fduwjj	2025-02-24 20:58:10 +00:00
Anatoly Myachev	19fd21fe7e	[Inductor] Hot fix after #146917 (#147639 ) This pull request reverts the changes to `torch/_inductor/ir.py` file that were added in #146917. Where I tested, there were changes only from `torch/_inductor/codegen/cpp_wrapper_gpu.py`, it turns out that changes in `torch/_inductor/ir.py` file are not really needed. So it's my fault, I didn't sync the environments (between several machines) correctly. @davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass? Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639 Approved by: https://github.com/etaf, https://github.com/davidberard98	2025-02-24 20:34:48 +00:00
Xuehai Pan	754fb834db	[BE][CI] bump `ruff` to 0.9.0: string quote styles (#144569 ) Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting - Change the outer quotes to double quotes for nested f-strings ```diff - f'{", ".join(args)}' + f"{', '.join(args)}" ``` - Change the inner quotes to double quotes for triple f-strings ```diff string = """ - {', '.join(args)} + {", ".join(args)} """ ``` - Join implicitly concatenated strings ```diff - string = "short string " "short string " f"{var}" + string = f"short string short string {var}" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569 Approved by: https://github.com/Skylion007 ghstack dependencies: #146509	2025-02-24 19:56:09 +00:00
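The quote-style changes described in the commit above do not alter the resulting strings. A minimal sketch, using made-up variable names (`args`, `var`) purely for illustration:

```python
args = ["a", "b", "c"]

# Style after the bump: double outer quotes, single inner quotes.
preferred = f"{', '.join(args)}"

# Style before the bump: single outer quotes, double inner quotes.
legacy = f'{", ".join(args)}'

var = "tail"
# Implicitly concatenated pieces versus the single joined literal.
concatenated = "short string " "short string " f"{var}"
joined = f"short string short string {var}"

# Both pairs evaluate to identical strings; only the source formatting differs.
same_quotes = preferred == legacy
same_concat = concatenated == joined
```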
Xuehai Pan	52f6d4aa30	[BE][CI][Easy] bump `ruff` to 0.9.0: long statements in docstrings (#146509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146509 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-24 19:56:08 +00:00
Nichols A. Romero	9605c5063b	[ROCm][TunableOp] Speed-up matmul_small_brute_force_tunableop unit test (#147659 ) This PR has a UT speed-up and some refactoring of tests. A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed this matmul_small_brute_force_tunableop for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR reduces the execution time by 20+ minutes. We also move a hipBLASLt version check to a different tunableop UT for simplicity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659 Approved by: https://github.com/jeffdaily	2025-02-24 19:44:38 +00:00
Xiaochang Wu	69c4f6ff13	[Minor] Fix minor mistake in docstring of replace_pattern (#147611 ) Fixes #147610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147611 Approved by: https://github.com/soulitzer	2025-02-24 19:33:44 +00:00
Yan Zhiwei	b9b1fd9b93	[Intel GPU] qlinear.pointwise with mixed dtype support (#136753 ) # Motivation This PR is aimed to add mixed data type(AMP) support for `qlinear_pointwise` op. With current PR, we allow `qlinear` kernels output Tensor that is BF16, rather than FP32/INT8. # UT verification ```bash DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qlinear_int8_mixed_bf16_xpu \ -k test_qlinear_relu_int8_mixed_bf16_xpu \ -k test_qlinear_add_int8_mixed_bf16_xpu ``` # Runtime exemplification ```bash #qlinear+bf16 output onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242 # qlinear_add + bf16 output onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922 # qlinear_add_relu + bf16 output onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277 ``` As shown in the oneDNN verbose, the attribute `dst_bf16::blocked:ab::f0` demonstrate that we could successfully output a bf16 tensor in int8 gemm. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136753 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168 ghstack dependencies: #133307, #135189, #135337, #135465 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-24 19:27:50 +00:00
Yan Zhiwei	075b91bef1	[Intel GPU] qconv.pointwise with mixed dtype XPU support (#135465 ) # Motivation This PR is aimed to add mixed data type(AMP) support for `qconv_pointwise` op. With current PR, we allow `qconv` kernels output Tensor that is BF16, rather than FP32/INT8. # UT verification ```bash DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qconv2d_int8_mixed_bf16_xpu \ -k test_qconv2d_relu_int8_mixed_bf16_xpu \ -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \ -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \ -k test_qconv2d_silu_int8_mixed_bf16_xpu \ -k test_qconv2d_add_int8_mixed_bf16_xpu \ -k test_qconv2d_add_relu_int8_mixed_bf16_xpu ``` # Runtime verification ```bash #qconv + bf16 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551 # qconv_silu + bf16 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379 # qconv_hardswish + bf16 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848 ``` The `dst_bf16::blocked:acdb::f0` attribute in oneDNN verbose demonstrate the output tensor is computed 
as bf16 successfully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135465 Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168 ghstack dependencies: #133307, #135189, #135337 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-24 19:27:50 +00:00
Michal Gallus	ffa19b9024	[ROCm][Windows] Fix unrecognized constexpr std::memcpy for HIP-clang (#147316 ) Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 STL implementation, the HIP clang compiler on Windows cannot evaluate the `std::memcpy` call at compile time. To resolve this, `__builtin_memcpy` is used instead, which doesn't have this limitation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316 Approved by: https://github.com/jeffdaily	2025-02-24 18:28:59 +00:00
PyTorch MergeBot	900a774781	Revert "[ROCm] Update periodic.yml to use 2GPU runners (#146839 )" This reverts commit b6273d7f4ba4fbb126eb96816287641ca1e4efc6. Reverted https://github.com/pytorch/pytorch/pull/146839 on behalf of https://github.com/jithunnair-amd due to This change is not needed anymore since our 4-GPU runners are back online and stable so far ([comment](https://github.com/pytorch/pytorch/pull/146839#issuecomment-2679145448))	2025-02-24 17:17:58 +00:00
Ding, Yi1	cde12207a0	[Intel GPU] Add SDPA implementation on XPU with OneDNN (#147612 ) Add XPU implementation of OneDNN based SDPA operator. Will be integrated and enabled later. Depends on BUILD_GRAPH switch in #147608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147612 Approved by: https://github.com/EikanWang	2025-02-24 16:12:04 +00:00
Jiang, Yanbing	576ed1e400	Upgrade submodule oneDNN to v3.7 (#147498 ) This PR is to upgrade submodule oneDNN to v3.7. ## Improvements - Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids). - Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support. - Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids). - Introduced initial optimizations for Intel GPUs based on Xe3 architecture. - Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA. - Fixed f16 matmul accuracy, the issue of SDPA cannot dispatched to ukernel, bf16/fp16/fp32 conv performance, INT8 Kernel trigger page fault, deconvolution precision issue on complex128 and fp64 and gemm correctness issue in float16 issues. - Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL). - Improved bf16 to fp32 reorder performance. - Improved bf16 reorder performance. - Improved bf16 convolution with ACL. Fixes https://github.com/pytorch/pytorch/issues/136348. ## Validation results on CPU 1. NLP models accuracy/inference/training ![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8) ![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab) 2. Torchbench cpu userbenchmark inference & training ![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd) 3. Inductor quantization ![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675) 4. 
Dynamo benchmarks ![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd) ![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b) ![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1) ![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd) ![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805) ![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88) ![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431) ![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b) ## Validation results on XPU Accuracy is same as baseline. Performance is shown below. ![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0) ## Validation results on ARM ![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb) ![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498 Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman	2025-02-24 14:32:51 +00:00
Tom Ritchford	80d3afc698	[inductor] Improve type annotations in _inductor/pattern_matcher.py (#146626 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146626 Approved by: https://github.com/Skylion007	2025-02-24 14:30:35 +00:00
PyTorch UpdateBot	d0f08dc3eb	Update slow tests (#147728 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147728 Approved by: https://github.com/pytorchbot	2025-02-24 11:48:19 +00:00
Xuehai Pan	cba14212e6	[FX] micro-optimization `map_aggregate(immutable_dict)` (#147691 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147691 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #147699, #144640	2025-02-24 09:14:08 +00:00
Xuehai Pan	a50af71fb6	[FX] Refactor immutable collections implementation (#144640 ) Get rid of dynamic class creation via `type(name, bases, ...)`. Convert it to classic static class definition for better readability and static analysis support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144640 Approved by: https://github.com/jansel ghstack dependencies: #147699	2025-02-24 09:14:08 +00:00
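The refactor described in the commit above trades dynamic class creation for a static definition. The sketch below contrasts the two forms; the class and helper names are hypothetical and this is not the actual `torch.fx` implementation.

```python
def _no_mutation(self, *args, **kwargs):
    # Shared stand-in for every mutating method (hypothetical helper).
    raise TypeError(f"'{type(self).__name__}' object does not support mutation")


# Dynamic creation: the class body is assembled at runtime via type(...),
# so readers and static analyzers cannot see which methods are overridden.
DynamicImmutableList = type(
    "DynamicImmutableList",
    (list,),
    {name: _no_mutation for name in ("append", "extend", "__setitem__")},
)


# Static definition: same behavior, but every override is spelled out.
class StaticImmutableList(list):
    append = _no_mutation
    extend = _no_mutation
    __setitem__ = _no_mutation
```

Both classes behave the same at runtime; the static form is simply easier to read and to type-check.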
xinan.lin	dc9a03d30c	[Window] Fix invalid file path on windows. (#147708 ) This PR aims to fix the invalid path for windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock` Windows does not allow the characters `\ / : * ? " < > |` in a path. This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because `os.replace` allows the target file to exist on Windows, whereas `os.rename` does not.

| Function | `os.rename()` | `os.replace()` |
|--------------------------------|----------------------------|----------------------------|
| Rename a file | ✅ | ✅ |
| Move a file | ✅ | ✅ |
| Overwrite an existing file | ❌ (Error on Windows) | ✅ (Will overwrite) |
| Overwrite an existing directory | ❌ (Error on Windows) | ❌ (Error on Windows) |
| Move across disks | ❌ | ❌ |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708 Approved by: https://github.com/jansel	2025-02-24 08:31:11 +00:00
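The `os.rename` versus `os.replace` distinction in the commit above matters whenever a file is rewritten in place. A minimal sketch of that pattern follows; `atomic_write` is a hypothetical helper and not the code in `torch/_dynamo/pgo.py`.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # os.replace overwrites an existing destination file on every platform;
        # os.rename raises FileExistsError on Windows in that situation.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `atomic_write` twice on the same path succeeds on Windows as well, because the second call overwrites the file created by the first.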
PyTorch MergeBot	5b6ad682bc	Revert "[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254 )" This reverts commit 85ea67983421acc30ccc76f7a159042e75c6ea08. Reverted https://github.com/pytorch/pytorch/pull/147254 on behalf of https://github.com/jeanschmidt due to introduced reds on main ([comment](https://github.com/pytorch/pytorch/pull/147254#issuecomment-2677700862))	2025-02-24 08:20:16 +00:00
xinan.lin	8d618f3da7	[AOTI][XPU] Suppress multi-line comment warning for XPU. (#147710 ) This PR aims to suppress the multi-line comment warning in the sycl header when building Inductor cpp_wrapper. ``` /intel/oneapi/compiler/2025.0/include/sycl/detail/builtins/builtins.hpp:235:1: warning: multi-line comment [-Wcomment] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147710 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-02-24 07:28:59 +00:00
Huamin Li	cee03b7746	[Inductor] Update should_decompose_mm condition for CPU (#147673 ) Summary: Previously, on CPU we decomposed addmm if ``` check_device(mat1, mat2, device="cpu") and mat1.shape[0] == 1 and mat2.shape[0] <= 64 and mat2.shape[1] <= 16 ``` We have a new case where `mat2.shape[1] = 304`, and benchmarks show that decomposing is beneficial there, so update the condition to ``` check_device(mat1, mat2, device="cpu") and mat1.shape[0] == 1 and mat2.shape[0] <= 64 and mat2.shape[1] <= 512 ``` Differential Revision: D70033166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673 Approved by: https://github.com/houseroad	2025-02-24 05:51:50 +00:00
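The shape part of the condition quoted in the commit above can be exercised without PyTorch. This is a sketch under stated assumptions: the function name and the `max_n` parameter are hypothetical, shapes are plain tuples, and the `check_device` test is omitted.

```python
def should_decompose_mm_cpu(mat1_shape, mat2_shape, max_n=512):
    """Shape-only version of the CPU check; max_n is the bound on mat2.shape[1]."""
    return (
        mat1_shape[0] == 1          # single-row left operand
        and mat2_shape[0] <= 64     # small reduction dimension
        and mat2_shape[1] <= max_n  # output width within the bound
    )


# A right operand of width 304 qualifies under the new bound of 512,
# but not under the previous bound of 16.
new_rule = should_decompose_mm_cpu((1, 64), (64, 304))
old_rule = should_decompose_mm_cpu((1, 64), (64, 304), max_n=16)
```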
Davide Italiano	8b65dbad13	[MPS/Inductor] Add support for xlog1py. (#147709 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147709 Approved by: https://github.com/jansel	2025-02-24 05:28:52 +00:00
Dmitry Rogozhkin	baccadb2f1	xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431 ) Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131 Previously `torch.xpu.get_arch_list()` was relaxed to work even if no XPU device is available. However, we overlooked the case where pytorch is not compiled with XPU support. In that case the function throws an exception. This commit adjusts this behavior and makes the function return `[]` even if pytorch is not compiled with XPU support. CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD	2025-02-24 01:35:54 +00:00
lei,zhenyuan	7c52ef2424	Add XPU to is_compile_supported to support roi_align op in torchvision (#147541 ) Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264. To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices. The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541 Approved by: https://github.com/guangyey, https://github.com/jansel Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>	2025-02-24 01:32:36 +00:00
Davide Italiano	4e934ee5a7	[MPS] Add eager support for xlog1py. (#147687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687 Approved by: https://github.com/malfet	2025-02-24 01:23:59 +00:00
eqy	718cf68aee	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925 + fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-02-23 22:01:39 +00:00
Xuehai Pan	b5d7aefa57	[BE] add missing overload annotations for `tree_map_only` (#147699 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147699 Approved by: https://github.com/Skylion007	2025-02-23 20:21:07 +00:00
Catherine Lee	f47573f70d	Add super().setUp() to some test cases (#147651 ) I saw that their disabled issues were getting spammed with comments, meaning that they were still running in CI despite having a disable issue, so I added the super().setUp() call to check if there's a disable issue for them since they were missing it Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651 Approved by: https://github.com/huydhn	2025-02-23 18:21:17 +00:00
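The failure mode described in the commit above can be reproduced with plain `unittest`. In this sketch the base class and its `disabled_tests` set are hypothetical stand-ins for whatever mechanism consults the disable issues; the point is only that a subclass `setUp` must call `super().setUp()` for the check to run.

```python
import unittest


class BaseTestCase(unittest.TestCase):
    # Hypothetical registry of test names that an open issue has disabled.
    disabled_tests = {"test_flaky"}

    def setUp(self):
        super().setUp()
        if self.id().split(".")[-1] in self.disabled_tests:
            self.skipTest("disabled by an open issue")


class MyTests(BaseTestCase):
    def setUp(self):
        super().setUp()  # without this call the skip check never runs
        self.value = 1

    def test_flaky(self):
        self.fail("should have been skipped")

    def test_ok(self):
        self.assertEqual(self.value, 1)


# Run the suite programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MyTests)
result = unittest.TestResult()
suite.run(result)
```

Removing the `super().setUp()` line from `MyTests.setUp` makes `test_flaky` run and fail instead of being skipped.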
Nikita Shulga	f03e7f3801	[MPS] Workaround rng bug for 5D tensors (#147667 ) For some reason MPSGraph returns repeated values if the tensor dimension is larger than 4, which can be clearly seen by running the following ```swift import Metal import MetalPerformanceShadersGraph func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) { let graph = MPSGraph() var dims = Array(repeating: 1, count: ndim) dims[0] = nelem let shape = dims.map { NSNumber(value: $0) } let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil) let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32) guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer]) } func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) { let buf_data = buf.contents().assumingMemoryBound(to: Float.self) print(prefix) for i in 0..<nelem { print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n") } } guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } print("Using device \(device.name)") let nelem = 2 guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4) printBuf("4D uniform", buf: buf, nelem: nelem) randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5) printBuf("5D uniform", buf: buf, nelem: nelem) ``` Work around this by flattening the tensor if it's contiguous. Fixes https://github.com/pytorch/pytorch/issues/147624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667 Approved by: https://github.com/dcci	2025-02-23 16:52:01 +00:00
PyTorch MergeBot	3e2d9d079e	Revert "[ROCm] OCP FP8 Support for new GPUs (#146632 )" This reverts commit f95ab46797e1f3e8cc48ce2f45e4f6985132fb19. Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))	2025-02-23 12:04:50 +00:00
Guilherme Leobas	d0adff761e	Propagate `AttributeError` to user code in user_defined.py (#146497 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146497 Approved by: https://github.com/anijain2305, https://github.com/zou3519 ghstack dependencies: #146496	2025-02-23 01:18:28 +00:00
Guilherme Leobas	8c761ac7e3	Handle `is`/`is not` (#146496 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146496 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2025-02-23 01:18:28 +00:00
Davide Italiano	b084635735	[MPS/inductor] Adjust more tests that depends on non-divisible input sizes (#147681 ) Also adjust a comment while I'm at it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147681 Approved by: https://github.com/jansel	2025-02-23 00:33:26 +00:00
Davide Italiano	6a5e3917a7	[MPS] Add inductor support for spherical_bessel_j0. (#147650 ) Counterpart to my previous patch that added support for the op in eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147650 Approved by: https://github.com/jansel	2025-02-23 00:32:36 +00:00
Davide Italiano	f9c117f859	[mps/inductor] XFAIL adaptive_avg_pool_with_output_size_0. (#147676 ) Non-divisible input sizes are not implemented on MPS device yet Pull Request resolved: https://github.com/pytorch/pytorch/pull/147676 Approved by: https://github.com/malfet	2025-02-22 20:17:33 +00:00
drisspg	db15cb0988	[Submodule] [Cutlass] Update to 3.8.0 tag (#147655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655 Approved by: https://github.com/henrylhtsang, https://github.com/eqy	2025-02-22 20:05:31 +00:00
Huanyu He	85ea679834	[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254 ) [TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254) Summary: # context * more details in the [post](https://fb.workplace.com/groups/1075192433118967/permalink/1587079018596970/) * disable contextlib with PT2 Test Plan: * run command ``` TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+dynamo,+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_ultra_mini training.pipeline_type=pt2 data_loader.dataset.table_ds=[2024-12-02] 2>&1 | tee -a output.log ``` * old tlparse https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYYAS3o/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 * new tlparse https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUJhCGZ/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 Reviewed By: Microve Differential Revision: D68480678	2025-02-22 18:57:55 +01:00
PyTorch MergeBot	fa8e3a28a7	Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178 )" This reverts commit 533b884870acd951e684e0bf551eb76904dec047. Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))	2025-02-22 17:28:12 +00:00
PyTorch MergeBot	bea72180ed	Revert "[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572 )" This reverts commit e7bf490c430ac5a70ccb7ab8e954d3386fd29413. Reverted https://github.com/pytorch/pytorch/pull/144572 on behalf of https://github.com/jeanschmidt due to Broke internal signals, D69994027, I'll find someone to help get this change merged ([comment](https://github.com/pytorch/pytorch/pull/144572#issuecomment-2676314308))	2025-02-22 17:19:38 +00:00
PyTorch MergeBot	3409cbd177	Revert "Delete Mixed MM Special Casing (#147151 )" This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52. Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))	2025-02-22 17:14:32 +00:00
Wang, Chuanqi	72b4f35cb5	[CI] Reduce the AOT target list to reduce build time (#147601 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147601 Approved by: https://github.com/atalman	2025-02-22 14:43:26 +00:00
sanchitintel	3cc3d7e08f	Also support non-contiguous activation for torch._weight_int8pack_mm on CPU (#147588 ) ### Problem Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU. So, with int8 WoQ with B16 activation with torchao, for batch-size 2 & above, an assertion is hit regarding non-contiguous A being unsupported. Such an issue was encountered with LLaMA models. ### Solution Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it's contiguous on the last dimension & remove the assertion that requires contiguous activation. ### Alternative solutions considered Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code. Another aspect to this issue is - is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? I guess we need some data-points to analyze this part, although I think the performance should be good enough with this patch, since the first cache lines of rows of A are being explicitly prefetched in the existing code (and it also avoids copy, which a `contiguous` call would do). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588 Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet	2025-02-22 08:29:07 +00:00
Ke Wen	e1bf892d90	[DDP] Temporarily disable comm mem (#147663 ) For fear that it incurs slightly more memory usage and causes some applications with tight memory margins to OOM. (bc the comm mem pool is a separate pool from the regular pool ?) Differential Revision: [D70026681](https://our.internmc.facebook.com/intern/diff/D70026681) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147663 Approved by: https://github.com/d4l3k	2025-02-22 05:55:43 +00:00
Aaron Orenstein	086d146f6f	Update ruff linter for PEP585 (#147540 ) This turns on PEP585 enforcement in RUFF. - Updates the target python version - Stops ignoring UP006 warnings (PEP585) - Fixes a few issues which crept into the tree in the last day Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-22 04:45:17 +00:00
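The PEP 585 enforcement mentioned in the commit above concerns annotation spelling only. A small sketch, assuming Python 3.9 or newer; the function is a made-up example and not taken from the PyTorch tree:

```python
# Pre-PEP 585 spelling, which the UP006 rule flags:
#     from typing import Dict, List
#     def group_by_length(words: List[str]) -> Dict[int, List[str]]: ...


def group_by_length(words: list[str]) -> dict[int, list[str]]:
    """Group words by length, annotated with builtin generics (PEP 585)."""
    groups: dict[int, list[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups
```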
Laith Sakka	77d2780657	Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549 ) running python test/strobelight/examples/compile_time_profile_example.py ``` strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.* strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.* strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.* strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0 strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996 strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119 strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): 
https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #147547	2025-02-22 03:44:53 +00:00
Laith Sakka	fc095a885c	move _strobelight/example to avoid graph breaks (#147547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147547 Approved by: https://github.com/bobrenjc93	2025-02-22 03:44:53 +00:00
Ethan Wee	fecd3f7ecb	[ROCm] change is_hip_clang() to always return True (#147646 ) hipify is replacing kernel launches <<< >>> with the hipLaunchKernelGGL() macro, and this is a regression caused by /opt/rocm/hip/.hipinfo no longer existing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147646 Approved by: https://github.com/jeffdaily, https://github.com/petrex	2025-02-22 03:26:55 +00:00
xinan.lin	b11d5cd584	[Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481 ) This PR fixes all the Inductor UT failures for the XPU backend on Windows that we found on a local machine. (Due to resource constraints, we have not yet set up a Windows CI pipeline online.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481 Approved by: https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #147347	2025-02-22 02:53:16 +00:00
xinan.lin	2d433cf1ad	[Inductor UT][Windows][XPU] Enable Inductor UT on XPU Windows. (#147347 ) This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows. Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347 Approved by: https://github.com/jansel, https://github.com/EikanWang	2025-02-22 02:53:16 +00:00
Scott Wolchok	84fcf1bb11	constexpr all the things in irange.h (#147633 ) I got complaints while irangeifying some files in ExecuTorch that irange could not be used in a constexpr function. This made the complaints go away. I added a constexpr function in irange_test that used to fail to build with `error: variable of non-literal type 'iterator' (aka 'integer_iterator<int, true>') cannot be defined in a constexpr function before C++23` and now builds fine. Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633 Approved by: https://github.com/albanD	2025-02-22 01:51:51 +00:00
Angela Yi	6e0b09728a	[export] Remove report from draft-export output (#147558 ) Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call. Test Plan: CI Reviewed By: SherlockNoMad Differential Revision: D69689154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558 Approved by: https://github.com/pianpwk	2025-02-22 00:54:29 +00:00
Oguz Ulgen	1c334893dc	[CacheBench] Refactor code to prepare for mode benchmarks (#147641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641 Approved by: https://github.com/huydhn	2025-02-22 00:20:54 +00:00
Howard Huang	5d26b7108f	[PP] Remove extra code and docs BE (#147636 ) current docs: <img width="746" alt="image" src="https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636 Approved by: https://github.com/awgu	2025-02-22 00:10:31 +00:00
Peter Yeh	f95ab46797	[ROCm] OCP FP8 Support for new GPUs (#146632 ) TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950 refer to https://github.com/pytorch/ao/pull/1677 This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks. ### Improvements to GPU Architecture and ROCm Version Support: * [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. * [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876) ### Updates to Data Type Handling: * [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments. * [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3. 
### Removal of Outdated Checks: * [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182) These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-02-21 23:44:08 +00:00
Jason Furmanek	b1a81a4a65	Don't use '-e' when installing Triton (#147228 ) Currently the install_triton.sh script uses "pip install -e ." to install Triton. Using -e is sometimes appropriate for development work but is less appropriate for delivery. To make matters worse, the behavior of -e seems to vary depending on the version of pip involved. This PR removes the -e and installs Triton normally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-02-21 23:00:12 +00:00
Catherine Lee	995b125cdd	[CI] Build sm89 with more procs experiment (#147487 ) Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge. Currently it's set to 2 because it would OOM, but I'm curious as to how often people's builds OOM. I can't test this on my own because of caching, so it has to run on pull requests. This might result in a failing job on many people's PRs and I'm not sure how to get around it. I named it stable to make it automatically get sorted into the stable group for Dr. CI, but it'll still show up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487 Approved by: https://github.com/huydhn	2025-02-21 22:07:00 +00:00
Catherine Lee	7c8c82cd64	[trymerge] Post initial starting merge comment on stacked PRs (#147028 ) Post a small comment stating if a PR is being merged as part of a stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028 Approved by: https://github.com/ZainRizvi	2025-02-21 22:05:00 +00:00
Avik Chaudhuri	698f6f9fae	specify only some dimensions in shapes collection (#147534 ) Differential Revision: [D69936316](https://our.internmc.facebook.com/intern/diff/D69936316/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147534 Approved by: https://github.com/bobrenjc93	2025-02-21 22:02:42 +00:00
Mitchell, Frost	2fb9416e6f	[inductor][cpu] Move VNNI weight packing into AMX GEMM kernel for contiguous BMM weights (#146843 ) Currently, the bfloat16 microkernel that uses AMX vectorization requires that the weights are in an interleaved VNNI format. For GEMM code, this hasn't been an issue because GEMM currently only supports constant weights, so the VNNI weight packing is done during compile-time and saved as a constant tensor to the graph. But for BMM ops where weights are not required to be constant, current code does an expensive reshape/VNNI packing for all BMM weights. This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843 Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-02-21 21:46:00 +00:00
henrylhtsang	d91be786cb	[cutlass backend] clear_on_fresh_inductor_cache when generating cutlass ops (#147586 ) Differential Revision: [D69966732](https://our.internmc.facebook.com/intern/diff/D69966732/) This is needed if we want to generate cutlass ops with different instantiation levels in one session. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147586 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2025-02-21 21:28:41 +00:00
PyTorch MergeBot	ef6b16ea9d	Revert "[trymerge] Post initial starting merge comment on stacked PRs (#147028 )" This reverts commit 0295aabf6071c7da62325e6a29e04ed09a3e34ef. Reverted https://github.com/pytorch/pytorch/pull/147028 on behalf of https://github.com/clee2000 due to I think this broke merge for non ghstack prs ([comment](https://github.com/pytorch/pytorch/pull/147028#issuecomment-2675532017))	2025-02-21 21:02:19 +00:00
PyTorch MergeBot	05e6f15966	Revert "[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395 )" This reverts commit e758d8b4d1632ea765bf8bc8e87b6039ae708b9f. Reverted https://github.com/pytorch/pytorch/pull/147395 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D69890757 - servicelab_benchmark_pyper_local_runner, @eellison please help the author get this change landed ([comment](https://github.com/pytorch/pytorch/pull/147395#issuecomment-2675521966))	2025-02-21 20:56:40 +00:00
Thomas Bohnstingl	6eb795c9e8	[associative_scan] compile backend change to "eager" (#146973 ) This PR fixes some issues with torch export discussed here: https://github.com/pytorch/pytorch/pull/140043#discussion_r1941932960 However, this backend change still does not resolve the failure for specific shapes mentioned here: https://github.com/pytorch/pytorch/issues/137943#issuecomment-2649564994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146973 Approved by: https://github.com/ydwu4	2025-02-21 20:21:41 +00:00
Luca Wehrstedt	5ed1e23e3a	Fix type stubs for SymmetricMemory (#146310 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146310 Approved by: https://github.com/yifuwang	2025-02-21 19:59:43 +00:00
Nichols A. Romero	fd8ae1aa04	[ROCm] gfx940 and gfx941 cleanup (#147394 ) Removing gfx architectures not supported by ROCm. NOTE: For users wanting to build PyTorch for gfx archs that are not supported by the official wheels on download.pytorch.org, you can build PyTorch from source for your desired gfx arch [using the PYTORCH_ROCM_ARCH env var](https://github.com/pytorch/pytorch/blob/main/README.md#amd-rocm-support). Pull Request resolved: https://github.com/pytorch/pytorch/pull/147394 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-02-21 19:42:12 +00:00
zeshengzong	c0ee62573a	[Easy][optim] Add LBFGS params optional desc (#147579 ) The [LBFGS docs](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS) are missing the `optional` description for params compared with other optimizer docs, like [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) ## Test Result ### Before ![image](https://github.com/user-attachments/assets/34877490-16b4-4c68-bf6c-405bae563352) ### After ![image](https://github.com/user-attachments/assets/7fba94c8-7091-47b8-bdf1-ca7d779a027f) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147579 Approved by: https://github.com/janeyx99	2025-02-21 19:38:10 +00:00
Oguz Ulgen	b5c3bb6185	Add continuous run for cachebench (#147546 ) This PR adds a continuous run for cache bench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147546 Approved by: https://github.com/huydhn ghstack dependencies: #147537	2025-02-21 19:02:17 +00:00
henrylhtsang	76ce194b8e	For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148 ) This PR also fixes BMM, which was silently failing for a while. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148 Approved by: https://github.com/eellison	2025-02-21 18:41:52 +00:00
Catherine Lee	0295aabf60	[trymerge] Post initial starting merge comment on stacked PRs (#147028 ) Post a small comment stating if a PR is being merged as part of a stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028 Approved by: https://github.com/ZainRizvi	2025-02-21 18:05:05 +00:00
Hanson-HSChang	2190ca7f47	Use __qualname__ in add_safe_globals and update Unpickling error raised for Unsupported GLOBAL (#146815 ) - Fixes #146814 Change ```python for f in _marked_safe_globals_set: module, name = f.__module__, f.__name__ ``` to ```python for f in _marked_safe_globals_set: module, name = f.__module__, f.__qualname__ ``` for avoiding same key string overwrite. A test is also added. ``` python test/test_serialization.py TestSerialization.test_serialization_nested_class ``` - Fixes #146886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815 Approved by: https://github.com/mikaylagawarecki	2025-02-21 18:04:59 +00:00
Nicolas De Carli	f4e4cfcb91	[caffe2] Ignore compiler option when building using clang (#147556 ) Summary: Skip adding unrecognized option optimize("-fno-tree-loop-vectorize") when building using clang This piece of code began to be compiled after armv9a has been set as default compilation profile Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 \| tee aimp_697790515.log Reviewed By: andrewjcg Differential Revision: D69947027 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556 Approved by: https://github.com/janeyx99	2025-02-21 17:46:04 +00:00
Shivam Raikundalia	a0c7d96028	[Easy] Add Delimiter To Show Where Allocation Addr Begins (#147461 ) Summary: When we print the addr we prepend an "s" or a "b" to it. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up. Test Plan: CI Differential Revision: D69828538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461 Approved by: https://github.com/zdevito	2025-02-21 17:19:53 +00:00
Anatoly Myachev	784f64bb05	[inductor] triton support port-#5512, update cpp wrapper for gpu (#146917 ) In short, this pull request enhances `constexprs` expression filtering. Note: I tested the changes on xpu backend. Part of https://github.com/pytorch/pytorch/issues/144103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146917 Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/davidberard98, https://github.com/YUNQIUGUO	2025-02-21 17:10:53 +00:00
Tugsbayasgalan Manlaibaatar	6a6de0e09d	better error message (#147532 ) Differential Revision: [D69939736](https://our.internmc.facebook.com/intern/diff/D69939736) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147532 Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519	2025-02-21 17:08:47 +00:00
Oguz Ulgen	a8ce4d1846	Add cachebench (#147537 ) This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching. ``` python benchmarks/dynamo/cachebench.py --output="result.json" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537 Approved by: https://github.com/jamesjwu	2025-02-21 17:06:45 +00:00
Ding, Yi1	af1072ffb6	[Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608 ) For preparation of OneDNN based XPU SDPA enabling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-02-21 16:12:30 +00:00
eellison	d6bb1d7f0a	Delete Mixed MM Special Casing (#147151 ) Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup: - prologue fusion is often unprofitable, even for int8 mm. because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not. - similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](`5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)`) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path. It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication. The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. it's still possible to generate with prologue fusion but not currently exactly as the current [impl](`bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)`). But the current impl does not compare to a cublas baseline so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). this should be fine to delete. Future optimizations could include: - cutlass prologue path - making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch	2025-02-21 16:02:40 +00:00
Luca Wehrstedt	36c461af95	Support SymmetricMemory's signaling kernels on sm60 and sm70 (#146308 ) By leveraging libcudacxx's utilities: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/atomic_ref.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/146308 Approved by: https://github.com/yifuwang	2025-02-21 15:29:02 +00:00
Aaron Orenstein	7ce4974e50	Fix PEP585 update (#147536 ) Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable. For now revert that change until it can be dealt with on its own. Test Plan: failures from D69920347 pass locally unit tests pass Reviewed By: oulgen Differential Revision: D69936518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536 Approved by: https://github.com/jeanschmidt	2025-02-21 14:37:03 +00:00
Jean Schmidt	654f2666d9	Increase memory for linux binary builds (#147542 ) Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)). After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available. But given the symptoms, the likelihood of this being an OOM problem is high. So, this moves those build jobs from `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`. This change depends on https://github.com/pytorch/test-infra/pull/6316 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542 Approved by: https://github.com/ZainRizvi, https://github.com/atalman	2025-02-21 14:15:40 +00:00
Marek Michalowski	51748a5d1a	Update OpenBLAS to 0.3.29 (#144857 ) * Improvements for GEMM to GEMV kernels * Improvements for SVE kernels for SGEMV and DGEMV Pull Request resolved: https://github.com/pytorch/pytorch/pull/144857 Approved by: https://github.com/malfet	2025-02-21 10:07:06 +00:00
Sheng Fu	71d2827eeb	Code Refactoring for getting start and stride from global ranks (#147230 ) Summary: Refactor the code for getting start and stride from global ranks so that this function can be used in different collective backends. Differential Revision: D69555405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230 Approved by: https://github.com/kwen2501	2025-02-21 10:02:50 +00:00
Paikov Iurii	e7bf490c43	[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572 ) This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm. Implemented the usage of new RNN descriptor for MIOpen backend that takes into account dropout rate value using dropout descriptor. This fixes associated test_RNN_dropout_state test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-02-21 10:01:27 +00:00
henrylhtsang	cffe7183f1	[cutlass backend] Fix standalone runner test after swizzle became a runtime parameter (#147554 ) Differential Revision: [D69945114](https://our.internmc.facebook.com/intern/diff/D69945114/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147554 Approved by: https://github.com/mlazos	2025-02-21 09:27:44 +00:00
cyy	b61a556427	Turn onnx functions into static (#147598 ) To avoid exposing ONNX symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598 Approved by: https://github.com/justinchuby	2025-02-21 07:40:28 +00:00
PyTorch MergeBot	3395da7f7c	Revert "Build a storage reader/writer to write checkpoints in HF format (#146352 )" This reverts commit c615b8c174c80936d365d19d8b8f4d9ad9a195f3. Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))	2025-02-21 07:30:52 +00:00
PyTorch MergeBot	e5da9df421	Revert "Increase memory for linux binary builds (#147542 )" This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830. Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))	2025-02-21 07:14:57 +00:00
Kevin Fu	4986f0f52e	[PT2]: allow empty dict to pass type check (#147167 ) (#147480 ) Summary: Seeing errors like when testing sigmoid for inline_cvr and perevent_cvr models. ``` terminate called after throwing an instance of 'c10::Error' what(): forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'. ``` Let empty dict pass type check. please, do NOT use any of the following flags, those are result of manual interventions in other parts of the system, misuse of them can be very painful for both detect and recover: Test Plan: ``` MODEL_ENTITY_ID=691508446 SNAPSHOT_ID=0 OTHER_MODEL_ENTITY_ID=649645886 OTHER_SNAPSHOT_ID=0 MODULE=local buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \ --loadMode=BenchmarkAB \ --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \ --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \ --moduleName=${module} \ --submodToDevice "" \ --benchmarkDontRebatchSamples=true \ --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt ``` Reviewed By: yjhao Differential Revision: D69871393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480 Approved by: https://github.com/henryoier, https://github.com/jeanschmidt	2025-02-21 07:00:46 +00:00
Arash Pakbin	c74b59fc1f	[ROCm][TunableOp] resolve the rocBLAS version dynamically (#147363 ) Dynamically gets rocBLAS version instead of relying on some preprocessing-time definitions which may be stale. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147363 Approved by: https://github.com/pruthvistony, https://github.com/naromero77amd, https://github.com/jeffdaily	2025-02-21 06:50:21 +00:00
dilililiwhy	86ae672b6a	Use has_triton_package in _inductor.runtime.hints (#147442 ) Use the existing method for the Triton check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442 Approved by: https://github.com/Skylion007	2025-02-21 05:52:00 +00:00
Eddie Yan	533b884870	[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178 ) Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend. CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178 Approved by: https://github.com/jbschlosser	2025-02-21 05:22:19 +00:00
Jerry Zhang	a2c3a2c5c4	Support serialization for uintx/intx in weights_only (#147500 ) Summary: Fixing the issue reported by huggingface Test Plan: python test/test_serialization.py -k test_serialization_uintx_intx Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500 Approved by: https://github.com/mikaylagawarecki	2025-02-21 04:38:44 +00:00
Ankita George	c615b8c174	Build a storage reader/writer to write checkpoints in HF format (#146352 ) Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case. Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage N6476188 --> able to save and load tensor in hf format Differential Revision: D68444967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352 Approved by: https://github.com/saumishr	2025-02-21 03:31:21 +00:00
Ting Lu	fe100c3c5b	Add libtorch nightly build for CUDA 12.8 (#146265 ) Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note. Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support. https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265 Approved by: https://github.com/atalman	2025-02-21 03:04:06 +00:00
Tristan Rice	ba214ab56c	TCPStore: soft fail bind when agent store active (#147465 ) This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`. This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical. This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests. https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau Test plan: ``` pytest test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465 Approved by: https://github.com/fduwjj	2025-02-21 03:02:26 +00:00
Yan Zhiwei	8a5265cb37	[Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337 ) # Motivation This PR enables the quantized fusion `qlinear+add` on the Intel GPU backend. At the backend level, we register the op via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are already defined in `x86InductorQuantizer`. At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the existing code in x86InductorQuantizer. # UT verification ```bash python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qlinear_add_xpu ``` # Runtime Verification ```bash onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824 ``` The verbose output is collected from the UT. From the attribute ` attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu` we can see that the post add and ReLU are successfully fused into the GEMM computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168 ghstack dependencies: #133307, #135189 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-21 02:09:28 +00:00
CaoE	8b818ab58f	Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353 ) Fixes #147303. Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353 Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch	2025-02-21 01:33:33 +00:00
Simon Fan	ac88a6c00d	[fx] demote node prepend to self log from warning to debug (#147538 ) FIXES https://github.com/pytorch/pytorch/issues/147175 This is harmless, not sure why this is a user warning. Writing reordering graph passes is more concise when we ignore this warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538 Approved by: https://github.com/yanboliang	2025-02-21 01:32:34 +00:00
Nichols A. Romero	4b35139a46	[ROCm][TunableOp] Fix TunableOp warmup environment variable. (#147412 ) This PR corrects the behavior of the TunableOp warmup variables: ``` PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS ``` See the updated comments which describe how the environment variables are intended to work. Previously, if you only set one of the two environment variables the warmup iters would always be zero. Manually tested the four possible combinations to make sure things still behave as intended. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412 Approved by: https://github.com/jeffdaily	2025-02-21 00:29:58 +00:00
Zhengxu Chen	fdb1305ace	reland "[sigmoid] Test OSS model runner with test_export.py" (#147535 ) Summary: There are ~260 tests covering the corner cases of export in test_export.py. Utilize them to test sigmoid in the OSS setting. Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid Differential Revision: D69937387 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535 Approved by: https://github.com/yiming0416	2025-02-20 23:45:13 +00:00
Jean Schmidt	87e6e2924e	Increase memory for linux binary builds (#147542 ) Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)). After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available. But given the symptoms, the likelihood of this being an OOM problem is high. So, this moves those build jobs from `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542 Approved by: https://github.com/ZainRizvi, https://github.com/atalman	2025-02-20 23:02:45 +00:00
Aaron Orenstein	be0df96b50	Fix c++ implementation of strip_function_call (#147436 ) #143063 did not handle a couple of UCS cases and had some bugs in the way it dealt with errors. - Fix all the UCS handling (and make some of the common code more common) - Make sure all the error paths return `nullptr` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436 Approved by: https://github.com/jansel	2025-02-20 20:41:21 +00:00
henrylhtsang	af31640391	[cutlass backend] enable mixed mm test (cutlass2x) for H100 (#147474 ) I am okay with not landing this as well. The motivation is to make developing on H100 smoother. The current test works on A100 but not H100 because of an alignment issue, which was caused by arch-specific filtering logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147474 Approved by: https://github.com/alexsamardzic, https://github.com/ColinPeppler	2025-02-20 20:28:44 +00:00
henrylhtsang	d068141c3b	[cutlass backend] add subproc tests (#147173 ) I want to separate subproc autotuning from the main tests. And I observed that for addmm, it can work without subproc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147173 Approved by: https://github.com/ColinPeppler ghstack dependencies: #147169	2025-02-20 20:07:42 +00:00
henrylhtsang	2565951f8a	[cutlass backend] remove triton from most tests and add an integration test (#147169 ) Removing aten and triton from the list of backends for the tests that have it. Instead, add a small integration test to make sure autotuning works fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147169 Approved by: https://github.com/ColinPeppler	2025-02-20 20:07:42 +00:00
Richard Barnes	fb1f7f6a09	[codemod] Fix unused-value issue in caffe2/aten/src/ATen/native/miopen/Conv_miopen.cpp +1 (#147496 ) Summary: LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Differential Revision: D69755123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496 Approved by: https://github.com/Skylion007	2025-02-20 19:00:38 +00:00
Jessica Vandebon	6971b77510	[CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935 ) Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops. Test Plan: CI Differential Revision: D68833927 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935 Approved by: https://github.com/Skylion007	2025-02-20 18:50:55 +00:00
Catherine Lee	863ac20659	[CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484 ) Do not overwrite the return code of a single file when it fails. This will allow the log to be printed to stdout and the gha logs Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484 Approved by: https://github.com/ZainRizvi	2025-02-20 17:51:58 +00:00
Sampsa	83bb921a5a	[ROCm] Update meta_registration for efficient attention (#146979 ) Fixes a series of failing and skipped unit tests. For nvidia hw, the longsumexp last dimension is required to be a multiple of 32. This is not the case for rocm. A related issue: https://github.com/pytorch/pytorch/issues/146848 The unit tests in question: ```bash inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_13_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_prev_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_11_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_17_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_1_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_1_freezing inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_2_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_3_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_4_cuda inductor.test_fused_attention SDPAPatternRewriterCudaDynamicTests test_sdpa_rewriter_6_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_13_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_14_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_prev_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_11_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_14_cuda inductor.test_fused_attention 
SDPAPatternRewriterCudaTests test_sdpa_rewriter_15_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_17_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_1_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_1_freezing inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_2_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_3_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_4_cuda inductor.test_fused_attention SDPAPatternRewriterCudaTests test_sdpa_rewriter_6_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979 Approved by: https://github.com/shunting314	2025-02-20 15:05:13 +00:00
vasiliy	382fbcc1e4	add the `torch.float8_e8m0fnu` dtype to PyTorch (#147466 ) Summary: Continuing the work from https://github.com/pytorch/pytorch/pull/146427 Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality: ```python import torch # round trip x0 = torch.randn(4, 4, dtype=torch.float32) x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding x2 = x1.to(torch.float32) # 2 ** exponent # creation with empty x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu) # printing print(x0) ``` Done in this PR: * numerical correctness * op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32 * printing a tensor works For future PRs: * performance optimizations for casting * torch._scaled_mm * PT2 * various cleanups (detailed in comments with issue numbers) Test Plan: ``` pytest test/quantization/core/experimental/test_float8.py -s ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466 Approved by: https://github.com/drisspg	2025-02-20 13:55:42 +00:00
James Wu	574371d828	Add current cuda device index to FXGraphCache key (#147464 ) This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405. It does not handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key. Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key. A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts. I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change. Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2025-02-20 12:38:21 +00:00
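The cache-key point in the `574371d828` entry above can be illustrated with a minimal sketch. It uses only the standard library and is not the FXGraphCache implementation; `make_cache_key` and its parameters are hypothetical names chosen for this illustration. It shows why a function with no tensor inputs needs the current device index in its key: nothing else in the key distinguishes the devices.

```python
import hashlib
import json
from typing import Optional, Sequence


def make_cache_key(
    graph_source: str,
    input_devices: Sequence[str],
    current_device_index: Optional[int],
) -> str:
    """Hypothetical cache key: hash the graph text, input devices, and device context."""
    payload = json.dumps(
        {
            "graph": graph_source,
            "input_devices": list(input_devices),
            "current_device_index": current_device_index,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A function with no tensor arguments: its inputs carry no device metadata.
graph = "def f(): return constant_on_current_device()"

key_dev0 = make_cache_key(graph, [], current_device_index=0)
key_dev1 = make_cache_key(graph, [], current_device_index=1)

# Without the device index the two keys would be identical and the cached
# artifact for device 0 could be reused on device 1.
key_without_context = make_cache_key(graph, [], current_device_index=None)
```

With the device index included, `key_dev0` and `key_dev1` differ, so each device gets its own cache entry.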
Nikita Shulga	ead970c8d0	Revert "Add cifllow/riscv64 label" This reverts commit 5116b27792d37c38039459c922a466581e219fc2. (I've pushed to the wrong branch by accident)	2025-02-20 11:55:52 +01:00
Nikita Shulga	5116b27792	Add cifllow/riscv64 label	2025-02-20 11:09:06 +01:00
zeshengzong	6beba8dcce	Optimize `graph.py` typing (#147099 ) Optimize `graph.py` methods type annotation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099 Approved by: https://github.com/cyyever, https://github.com/aorenste	2025-02-20 09:32:30 +00:00
Luca Wehrstedt	f9b8121350	Make Inductor scheduler aware of _scaled_mm (#146992 ) This is used for example to estimate runtime when doing comms overlap Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992 Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314	2025-02-20 09:02:31 +00:00
Shawn Xu	9da250aada	type `fully_shard` so that the return value can be chained with typing enabled (#147489 ) This allows for ``` fsdped = fully_shard(model) fsdped.set_xyz() ``` same applies if `model` is actually a list of modules Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489 Approved by: https://github.com/Skylion007 ghstack dependencies: #147488	2025-02-20 08:43:16 +00:00
zeshengzong	6a72aaadae	Fix `torch.max` optional args `dim`, `keepdim` description (#147177 ) The [`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max) args `dim` and `keepdim` are not described as optional in the documentation, but users can omit them. ```python >>> import torch >>> a = torch.randn(3,1,3) >>> a.max() tensor(1.9145) >>> a.max(dim=1) torch.return_types.max( values=tensor([[ 1.1436, -0.0728, 1.3312], [-0.4049, 0.1792, -1.2247], [ 0.8767, -0.7888, 1.9145]]), indices=tensor([[0, 0, 0], [0, 0, 0], [0, 0, 0]])) ``` ## Changes - Add `optional` description for `dim`, `keepdim` - Add example of using `dim`, `keepdim` ## Test Result ### Before ![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087) ### After ![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177 Approved by: https://github.com/colesbury	2025-02-20 08:18:09 +00:00
drisspg	452315c84f	Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492 ) The exact call is coming from here: `78a94c9114/torch/_inductor/memory.py (L161)` I have no idea why this error is being thrown and what mode/modes might be failing for this Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492 Approved by: https://github.com/eellison	2025-02-20 08:00:26 +00:00
zeshengzong	a000c7e6d2	Add hint message for `pack_padded_sequence` (#146747 ) Fixes #144207 Add truncate hint message in docs [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html) ## Test Result ![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747 Approved by: https://github.com/mikaylagawarecki	2025-02-20 06:27:07 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
Animesh Jain	76ad19a549	[dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425 ) This reduces fixed overhead seen in a few internal models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425 Approved by: https://github.com/jansel, https://github.com/StrongerXi	2025-02-20 05:42:52 +00:00
PyTorch UpdateBot	8f6b9403c1	[audio hash update] update the pinned audio hash (#147423 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147423 Approved by: https://github.com/pytorchbot	2025-02-20 05:39:46 +00:00
Yidi Wu	77aa602871	[torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399 ) Summary: This PR adds a _is_script_object method to differentiate ScriptModule and ScriptObject, where the former inherits from ScriptObject in C++, so they both pass the isinstance(obj, torch.ScriptObject) check. The qualified name of a ScriptObject (i.e. custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base. Test Plan: Add new test. Differential Revision: D69685316 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399 Approved by: https://github.com/yushangdi	2025-02-20 04:57:57 +00:00
Michael Lazos	7185ca8348	[Cutlass] Add test verifying number of precompiles (#147477 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477 Approved by: https://github.com/henrylhtsang	2025-02-20 04:47:57 +00:00
amdfaa	5f5b44f6bf	[ROCm] Update inductor-periodic.yml to use the correct label (#147473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147473 Approved by: https://github.com/jeffdaily	2025-02-20 04:44:18 +00:00
bobrenjc93	0d56b7e665	Support size oblivious max equation (#147344 ) Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints. The basic idea is max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like since their value ranges are [2, inf]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344 Approved by: https://github.com/angelayi	2025-02-20 04:33:19 +00:00
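The simplification described in the `0d56b7e665` entry above can be sanity-checked with a small sketch. It is plain Python interval reasoning, not the symbolic-shapes implementation; `can_drop_max` and `SIZE_LIKE_LOWER_BOUND` are hypothetical names, and the lower bound of 2 is taken from the value range quoted in the commit message.

```python
from itertools import product
from typing import Sequence

# Assumption from the commit message: size-like unbacked symints are
# reasoned about as if their value range were [2, inf].
SIZE_LIKE_LOWER_BOUND = 2


def can_drop_max(constant: int, term_lower_bounds: Sequence[int]) -> bool:
    """max(constant, sum(terms)) equals sum(terms) whenever the smallest
    possible sum is already at least `constant`."""
    return sum(term_lower_bounds) >= constant


# Two size-like terms: the smallest possible sum is 4, so max(1, u0 + u1)
# and max(0, u0 + u1) both reduce to u0 + u1.
two_size_like = [SIZE_LIKE_LOWER_BOUND, SIZE_LIKE_LOWER_BOUND]
drop_for_one = can_drop_max(1, two_size_like)
drop_for_zero = can_drop_max(0, two_size_like)

# Brute-force confirmation over a finite slice of the assumed range.
brute_force_ok = all(
    max(1, u0 + u1) == u0 + u1
    for u0, u1 in product(range(SIZE_LIKE_LOWER_BOUND, 50), repeat=2)
)
```

The same check fails for terms whose lower bound is 0, which is why the rewrite is restricted to size-like symbols.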
Shangdi Yu	0b0da81021	Support static method of torchbind attributes in torch.compile with inductor backend (#146927 ) As title. Many changes adapted from https://github.com/pytorch/pytorch/pull/129537. Also this diff is only for static method of torchbind attributes. Some case that's not supported/tested: - dynamic torchbind objects - torchbind objects as an input to the module. Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph. Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370): ```python async_compile.wait(globals()) del async_compile def call(args): arg1_1, arg2_1, arg3_1 = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) assert_size_stride(arg2_1, (2, 3), (3, 1)) buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, arg2_1, buf2) del arg1_1 del arg2_1 # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add] buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2) buf4 = buf3[0] assert_size_stride(buf4, (2, 3), (3, 1)) buf5 = buf3[1] assert_size_stride(buf5, (2, 3), (3, 1)) buf6 = buf4; del buf4 # reuse cpp_fused_add_1(buf6, buf5) del buf5 # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add] buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6) del buf3 del buf6 buf8 = buf7 assert_size_stride(buf8, (2, 3), (3, 1)) # Topologically Sorted Source Nodes: [c], Original ATen: [] buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2) del arg3_1 del buf7 buf10 = buf9 assert_size_stride(buf10, (2, 3), (3, 1)) del buf9 buf11 = buf2; del buf2 # reuse cpp_fused_add_2(buf11, buf8, buf10) return (buf11, ) def benchmark_compiled_module(times=10, repeat=10): from torch._dynamo.testing import rand_strided from torch._inductor.utils import 
print_performance arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32) arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32) import pickle global arg3_1 arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.') fn = lambda: call([arg1_1, arg2_1, arg3_1]) return print_performance(fn, times=times, repeat=repeat) if __name__ == "__main__": from torch._inductor.wrapper_benchmark import compiled_module_main compiled_module_main('None', benchmark_compiled_module) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927 Approved by: https://github.com/angelayi	2025-02-20 03:33:19 +00:00
Shawn Xu	de1cb0f351	capture the return value in the contract typing (#147488 ) ---- * the existing typing makes the return type `Optional[nn.Module]` * this doesn't seem to be what the decorator actually does as it does not alter the original return type * This PR aims to fix the typing Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488 Approved by: https://github.com/Skylion007	2025-02-20 03:32:34 +00:00
rzou	fea718f062	[BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730 ) Our three main users are OK with this, with two of them (foreach_map, invoke_quant) preferring it like this. I was originally worried about BC issues (this now means you cannot add any positional args) but I think that's not a concern -- one can always add kwonly args. Test Plan - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730 Approved by: https://github.com/ydwu4, https://github.com/mlazos	2025-02-20 02:30:36 +00:00
Yan Zhiwei	f79b352f5a	[Intel GPU] qconv_pointwise.binary XPU support (#135189 ) # Motivation This PR intends to enable the quantized fusions `qconv+add` and `qconv+add+relu` on the Intel GPU backend. At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined in `x86InductorQuantizer`. At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. For pattern matching, we largely reuse the existing x86InductorQuantizer code. # UT verification ```bash python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qconv2d_add_xpu \ -k test_qconv2d_add_relu_xpu 2>&1 ``` # Runtime exemplification Following is the oneDNN verbose collected from UT ```bash onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189 Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168 ghstack dependencies: #133307 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-20 02:02:54 +00:00
Riley Dulin	93316cfe94	Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248 ) Fixes #147002 Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG. Updated tests of these logs as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248 Approved by: https://github.com/eellison	2025-02-20 00:26:17 +00:00
William Wen	16e202a38e	[dynamo] improved graph break messages for some common graph break sites [1/N] (#146525 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525 Approved by: https://github.com/jansel	2025-02-20 00:08:13 +00:00
Pian Pawakapan	1e94c7aaa4	[draft_export] only clear pending unbacked symbols for overwritten kernels (#147427 ) This was wrong, we were doing this in all cases Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427 Approved by: https://github.com/angelayi	2025-02-20 00:07:54 +00:00
Henry Tsang	3986c3e4a6	[reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434 ) Reland of https://github.com/pytorch/pytorch/pull/146877 incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185 Summary: I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates, and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69825865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434 Approved by: https://github.com/ColinPeppler	2025-02-20 00:07:07 +00:00
Yang Wang	a88d7d4268	[util] fetch logical count cpu (#147413 ) To match the vCPU count reported by AWS: after (96), before (48) Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal before: https://hud.pytorch.org/utilization/13377376406/37360984234/1 after: https://hud.pytorch.org/utilization/13401543806/37435031356/1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413 Approved by: https://github.com/clee2000	2025-02-19 23:44:54 +00:00
Michael Lazos	004d65aeb0	Add type hints to cuda kernel (#147471 ) Missed this in a previous PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471 Approved by: https://github.com/eellison	2025-02-19 23:35:10 +00:00
Henry Tsang	48203bec63	[BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409 ) Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but it has been a while since then. See if CI complains. Differential Revision: D69573185 See also https://github.com/pytorch/pytorch/pull/147158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409 Approved by: https://github.com/chenyang78	2025-02-19 23:04:22 +00:00
Gregory Comer	f63db6255f	Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153 ) Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land. Summary: As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table. In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option. Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary. Test Plan: Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522). ``` buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops ``` Differential Revision: D69625112 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153 Approved by: https://github.com/manuelcandales	2025-02-19 23:03:29 +00:00
fduwjj	fb55bac3de	[fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439 ) The original MatchState type was declared as a Python Enum. Although we made it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which a Python enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the fr analysis script. Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439 Approved by: https://github.com/fegin	2025-02-19 22:09:16 +00:00
Justin Chu	41ae15faa3	[ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392 ) Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests. https://github.com/pytorch/pytorch/issues/139301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392 Approved by: https://github.com/titaiwangms ghstack dependencies: #147396	2025-02-19 21:55:12 +00:00
Avik Chaudhuri	24738768a8	more dist ops in non strict (#147417 ) Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`. Test Plan: added unit tests Differential Revision: D69813991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417 Approved by: https://github.com/angelayi	2025-02-19 21:29:16 +00:00
Eli Uriegas	394676759d	ci: Add h100 nightly perf testing (#146868 ) This infrastructure has been up for a while so add a workflow to actually run things on it. > [!IMPORTANT] > We only have 14 linux.aws.h100 runners so it might be beneficial for us to actually pare this list down. > Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868 Approved by: https://github.com/eellison, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-02-19 21:13:17 +00:00
Vincent Moens	8bea08e5bc	[BE] Fix tensor stub (#147384 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384 Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman	2025-02-19 19:47:03 +00:00
Alex Baden	e758d8b4d1	[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395 ) Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395 Approved by: https://github.com/eellison	2025-02-19 19:45:01 +00:00
Justin Chu	279c7f262e	[ONNX] Refactor dispatcher and registry (#147396 ) This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301). The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list. After the migration, hooks for loading ops from onnx script will be removed. 1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts 2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure 3. Update registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396 Approved by: https://github.com/titaiwangms	2025-02-19 19:38:28 +00:00
bobrenjc93	4f3c070b25	[inductor] GraphLowering code movement (#147335 ) moved these methods under __init__ to be more idiomatic Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335 Approved by: https://github.com/eellison ghstack dependencies: #147331	2025-02-19 19:32:30 +00:00
Fadi Arafeh	5a3a50c791	Update Arm Compute Library (ACL) to v25.02 (#147454 ) Among many things, this version of ACL fixes the redundant declaration warning that we're blocked on in (#145942, #146620, #147337) and introduces better scheduling heuristics for GEMMs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147454 Approved by: https://github.com/malfet	2025-02-19 18:51:08 +00:00
Richard Howell	9fee408daa	[caffe2] disable warning for unused arguments (#147411 ) Summary: Disable warnings on unused command line arguments for ukernels_asm. Test Plan: On top of D69602077: ``` $ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac ``` Differential Revision: D69807977 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411 Approved by: https://github.com/kimishpatel	2025-02-19 17:54:31 +00:00
Arash Pakbin	5220d402b5	[ROCm] TopK optimizations for AMD GPUs (#146387 ) TopK on ROCm performs better on the test suite with the default config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146387 Approved by: https://github.com/malfet, https://github.com/ngimel	2025-02-19 17:10:59 +00:00
Ting Lu	e6c86952c6	Add CUDA 12.8 windows nightly build (#147037 ) https://github.com/pytorch/pytorch/issues/145570 windows AMI is deployed to prod today, prepping the windows cuda 12.8 build Pull Request resolved: https://github.com/pytorch/pytorch/pull/147037 Approved by: https://github.com/atalman	2025-02-19 16:59:32 +00:00
xinan.lin	8cbf7d0d6e	[Inductor UT][XPU] Skip fft_c2c case since it's not implemented on XPU. (#147351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147351 Approved by: https://github.com/jansel	2025-02-19 16:03:03 +00:00
Simon Fan	ed83b0b70b	[ddp] decouple python reducer from compilation mode (#147123 ) Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to C++ reducer + no DDPOptimizer. I'm changing this behavior to always use the python reducer if the config is specified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123 Approved by: https://github.com/fegin	2025-02-19 15:51:40 +00:00
drisspg	303ad1916f	[FlexAttention] Fix weird generate stride call in flex decode (#147435 ) # Summary Seems like we had a redundant tuple unpack and that doesn't appear to be supported in new triton Fixes https://github.com/pytorch/pytorch/issues/147373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435 Approved by: https://github.com/BoyuanFeng	2025-02-19 12:12:27 +00:00
Michael Lazos	77dbd28535	[Cutlass] Restore search space for swizzle (#147224 ) This restores the previous search space, since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching this now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224 Approved by: https://github.com/eellison ghstack dependencies: #147222, #147223	2025-02-19 09:22:51 +00:00
Michael Lazos	e9b3ff0570	[Cutlass] Add support for runtime param choices, starting with swizzle (#147223 ) This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](`2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)`) list method and then possible choices can be [looped over similarly to swizzle](`933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)`). For precompile, we now filter choices by hash to only compile each distinct kernel source once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223 Approved by: https://github.com/Chillee, https://github.com/eellison ghstack dependencies: #147222	2025-02-19 09:22:51 +00:00
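The precompile change in #147223 ("filter choices by hash to only compile each distinct kernel source once") can be illustrated with a small standalone sketch. It is not the Inductor implementation: `Choice`, `source_hash`, and `unique_sources` are hypothetical names invented here, and the assumption is only that choices differing in a runtime parameter such as swizzle share identical source text.

```python
import hashlib
from typing import Dict, List, NamedTuple


class Choice(NamedTuple):
    """Hypothetical stand-in for an autotuning choice (not an Inductor class)."""
    name: str
    source: str   # generated kernel source text
    swizzle: int  # runtime parameter; assumed not to affect the source


def source_hash(source: str) -> str:
    # Hash the source text so identical kernels map to the same key.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def unique_sources(choices: List[Choice]) -> Dict[str, Choice]:
    # Keep the first choice seen for each distinct source hash.
    seen: Dict[str, Choice] = {}
    for choice in choices:
        seen.setdefault(source_hash(choice.source), choice)
    return seen


choices = [
    Choice("gemm_a_swizzle1", "kernel A", 1),
    Choice("gemm_a_swizzle2", "kernel A", 2),  # same source, different swizzle
    Choice("gemm_b_swizzle1", "kernel B", 1),
]
to_compile = unique_sources(choices)
print(len(choices), "choices ->", len(to_compile), "kernels to compile")
```

Because swizzle is passed at run time in this sketch, the two "kernel A" choices collapse into a single compile while both remain available as candidates.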
Michael Lazos	81eb2a78ad	[Inductor] Add autotuning artifact logging (#147222 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222 Approved by: https://github.com/henrylhtsang, https://github.com/eellison	2025-02-19 09:22:42 +00:00
bobrenjc93	655b061ef0	[inductor] Freeze runtime asserts after shape prop but before codegen (#147331 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331 Approved by: https://github.com/eellison	2025-02-19 06:29:13 +00:00
Laith Sakka	454fbd5bbe	realize stride symbols in estimate_runtime (#146752 ) Unfortunately, I could not create a local repro or unit test. Fixes https://github.com/pytorch/pytorch/issues/146686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146752 Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh	2025-02-19 06:02:49 +00:00
Angela Yi	2c3680ce38	[apf] Fix input adapter (#147238 ) Summary: Add support for inputs that no longer exist in `input_fields`, but is not actually used by the original program. In this case, we just give it a dummy input based on the node's metadata. Test Plan: Verified for S488841 Differential Revision: D69328093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238 Approved by: https://github.com/pianpwk	2025-02-19 04:49:58 +00:00
Jason Ansel	465930ee81	Revert "[ROCm] ROCm-specific gemm tuning parameters" (#147388 ) Summary: This diff reverts D69573225 / https://github.com/pytorch/pytorch/pull/143286 15% cold compile time regression, see https://fb.workplace.com/groups/1075192433118967/permalink/1608559059782299/ Test Plan: NA Differential Revision: D69790102 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147388 Approved by: https://github.com/yanboliang	2025-02-19 04:47:35 +00:00
atalman	4ece056791	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-19 03:52:26 +00:00
Chen Lai	bd370c138a	fix pt2e block wise quantization unit test (#147406 ) Differential Revision: D69806596 https://github.com/pytorch/pytorch/pull/146946 breaks the unit test, because the quant nodes are folded by default now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147406 Approved by: https://github.com/andrewor14, https://github.com/jerryzh168	2025-02-19 02:40:27 +00:00
henrylhtsang	5006932cbc	[cutlass backend] forward fix of standalone runner for fbcode (#147158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147158 Approved by: https://github.com/chenyang78	2025-02-19 02:02:10 +00:00
Ankita George	f16d30137c	[OSS] Update FileSystem methods to properly handle a string argument (#145751 ) Summary: When testing, I tried to pass in a string argument to the FileSystem class' methods, which is a valid input, but the cast() that casted the string to a path wasn't working as was likely expected and was leading all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used. Test Plan: N6475361 methods don't throw an error with a string arg like they were previously Differential Revision: D68713937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751 Approved by: https://github.com/pradeepfn	2025-02-19 01:50:24 +00:00
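The cast-versus-constructor point in #145751 can be shown with the standard library alone. This is a sketch that assumes the `cast()` in question was `typing.cast`, which the commit message does not state explicitly; the helper names below are hypothetical and are not the `FileSystem` methods themselves.

```python
import os
from pathlib import Path
from typing import Union, cast


def as_path_with_cast(path: Union[str, os.PathLike]) -> Path:
    # typing.cast only informs the type checker; at run time it returns
    # its argument unchanged, so a str stays a str.
    return cast(Path, path)


def as_path_with_constructor(path: Union[str, os.PathLike]) -> Path:
    # Calling the constructor actually builds a Path object.
    return Path(path)


casted = as_path_with_cast("checkpoints/step_10")
built = as_path_with_constructor("checkpoints/step_10")

print(type(casted).__name__)    # still a plain string
print(hasattr(casted, "parent"))  # Path-only attributes are missing
print(built.name)
```

Any method that then calls a `Path` attribute such as `.parent` or `.mkdir()` on the cast value fails with `AttributeError`, which matches the failure mode described in the commit message.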
titaiwangms	953f7834cc	[ONNX] Pick up missing types in dynamic shapes renaming (#147407 ) Found in `_check_dynamic_shapes` that int and None types are valid inputs of dynamic_shapes. This PR adds support for these two types and adds tests to guard the sync between the ONNX flatten logic and the one in export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407 Approved by: https://github.com/justinchuby	2025-02-19 01:49:53 +00:00
atalman	757d7f28d1	[CD] Increase timeout for windows binary builds (#147390 ) Mitigates https://github.com/pytorch/pytorch/issues/147376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147390 Approved by: https://github.com/huydhn, https://github.com/jeanschmidt, https://github.com/malfet	2025-02-19 01:15:04 +00:00
Justin Chu	959d79f85f	[ONNX] Move and improve error reproduction logic in test (#147391 ) https://github.com/pytorch/pytorch/issues/139301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147391 Approved by: https://github.com/titaiwangms	2025-02-19 00:00:11 +00:00
PyTorch MergeBot	babb2dc2af	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit 6f7e67c43c13b5675b4ff60cbaa71e5083a22481. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))	2025-02-18 23:58:31 +00:00
bobrenjc93	525ca80f53	add unbacked strict mode (#147333 ) fixes #145775 This is the first step in introducing a "strict" mode where we don't silent specialize and don't silent graph break. At a high level when we do mark_unbacked(... strict=True), anytime we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333 Approved by: https://github.com/laithsakka	2025-02-18 23:33:55 +00:00
bobrenjc93	5d547d82e6	Add no_data_dependent_graph_break mode (#147342 ) This adds a strict mode `TORCHDYNAMO_UNBACKED_STRICT` to prevent graph breaking when we guard on data dependent. This is a better UX for those who are actively trying to make their model more dynamic, but aren't close enough to full graph to use that flag directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342 Approved by: https://github.com/laithsakka	2025-02-18 23:33:47 +00:00
albanD	bae049b439	Update addr doc (#146482 ) Fixes https://github.com/pytorch/pytorch/issues/146399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146482 Approved by: https://github.com/janeyx99	2025-02-18 23:25:38 +00:00
Tuan Trieu	ca397d82a6	[Sigmoid] Fix issues with constant folding and fba_ops (#146948 ) Summary: There are 2 issues: - `skip_folding_node_fn` isn't considered when propagating constant values. So given a skipped node with constant inputs, it outputs a constant and its users can output constant values and then be included in the constant graph. However, the skipped node is not included in the constant graph when extracting the constant graph. This issue is fixed by checking for skipped node when propagating the constant values and making the skipped node to output unknown value (not constant) so that its users cannot output constant. - `fba_linear` op can be included in the constant graph but it is not implemented for CPU so constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`. - A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops). Reviewed By: StellarrZ Differential Revision: D68716393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948 Approved by: https://github.com/zhxchen17	2025-02-18 23:17:47 +00:00
Tsung-Hsien Lee	c9a15d980f	[FSDP2] Simplify shard_placement_fn in test (#146847 ) Summary: Found this while checking `shard_placement_fn` for Shampoo shard independent implementation. Test Plan: OSS CI & tests Differential Revision: D69412878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146847 Approved by: https://github.com/awgu	2025-02-18 23:01:26 +00:00
Jane Xu	c8433c2c6c	[BE] correct docs for clock_rate to MHz, fixes #147098 (#147393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393 Approved by: https://github.com/andrewor14	2025-02-18 22:59:58 +00:00
mori360	a21a123fd5	Add fqn_modifier at loading_state_dict and unit test (#146557 ) In Fusion model, users might change the state_dict keys by state_dict_hook The load_state_dict APIs here won't call model.state_dict() so that the hooks won't be called to change the keys, causing the mismatch between fqn and state_dict keys. The PR here suggests users to add how they would change the state_dict key prefix (they can name it, here we call "fqn_modifiers") by default During loading state_dict, we have the prefix change during getting fqn so that they can be processed same as through state_dict hook. For example: There's a state_dict_hook: ``` def _state_dict_hook(self, destination, prefix, keep_vars): """Remove "embedding" from the original embedding in the state_dict name. This keeps the orginal state dict name for the embedding from before fusing with the FusionEmbedding. [!Note] This update changes the order of the OrderedDict """ key = prefix + "embedding.weight" new_key = prefix + "weight" destination[new_key] = destination[key] del destination[key] ``` In the dsd after this PR, we would skip "embedding." before "weight" if find the "fqn_modifiers" attribute at that module ``` def fqn_modifiers(self) -> Dict[str, str]: return { "weight": "embedding", } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557 Approved by: https://github.com/fegin	2025-02-18 22:54:41 +00:00
PyTorch MergeBot	7622e29a37	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))	2025-02-18 22:23:35 +00:00
Laith Sakka	3f35664ee8	More precise check for shared storage check in inductor/reinplace pass (#147050 ) Currently, if two tensors share storage, we have some logic to avoid re-inplacing. Before this PR, two tensors were considered to share storage if they used the same underlying storage, even if they did not overlap. This diff enhances the checks to skip cases where we can easily tell that the tensors do not overlap. Mitigates https://github.com/pytorch/pytorch/issues/139628 but does not fix the inductor issue in it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147050 Approved by: https://github.com/zou3519	2025-02-18 21:55:34 +00:00
William Wen	63e8ad49b8	[dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355 ) This PR and the previous: - Moves parts of `eval_frame.c` to C++. - Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear. - Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355 Approved by: https://github.com/jansel ghstack dependencies: #145603	2025-02-18 21:37:12 +00:00
William Wen	75db0fd8a0	[dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-02-18 21:37:12 +00:00
Yang Chen	eb892cd768	[codegen] enable SORT and TUPLE_REDUCTION for AMD Triton (#147340 ) Looks like Triton's AMD backend supports multiple inputs already. Let's enable SORT and TUPLE_REDUCTION for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147340 Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/eellison	2025-02-18 21:15:23 +00:00
Vincent Moens	1b047d5d7a	Add link to non_blocking/pinmem tutorial in `Tensor.to` docstrings (#145651 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145651 Approved by: https://github.com/svekars	2025-02-18 20:38:01 +00:00
clr	166419b9c1	dynamo: Don't crash when encountering an object with no __name__ (#147246 ) This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test, rather than a full repro. Let me know if people feel strongly and want a full reproduction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246 Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007	2025-02-18 20:35:49 +00:00
12v	74682e8595	Fix typo (#147330 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147330 Approved by: https://github.com/srinivasreddy, https://github.com/Skylion007	2025-02-18 20:20:34 +00:00
Ahmad Sharif	d9b3d76b85	Fix linter warnings (#147386 ) https://github.com/pytorch/pytorch/pull/145866 accidentally introduced a warning about const casts and also comparison of unsigned long int with signed long int. This PR fixes both of those warnings. Tested by running: ``` /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o ``` And I got no warnings or errors. Same with `python setup.py develop` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147386 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-02-18 20:03:16 +00:00
PyTorch MergeBot	302f56a1f2	Revert "Fix non-bitwise type annotations for Tensor operators (see #145838 ) (#146845 )" This reverts commit 59b7e52ad8f6146b4364515a7f3e54d6f3edd6da. Reverted https://github.com/pytorch/pytorch/pull/146845 on behalf of https://github.com/jeanschmidt due to Seems to break a few code dependencies in multiple places ([comment](https://github.com/pytorch/pytorch/pull/146845#issuecomment-2666656834))	2025-02-18 19:01:27 +00:00
angelayi	57060bebf3	[symbolic shapes] Add replacement for backed symints (#147240 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147240 Approved by: https://github.com/pianpwk ghstack dependencies: #146939	2025-02-18 18:49:51 +00:00
angelayi	84abeaad5c	[export] Log evaluate_expr (#146939 ) We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which python's `id` function already does, and then for every expression created, we can find the provenance by tracing back through its arguments ids. This logging only happens when dtrace_structured is enabled, which is only when running draft export. An example output is as follows: <img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" /> For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok? I tried doing the following: * Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M * I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M * Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix? #suppress-bc-linter Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939 Approved by: https://github.com/oulgen	2025-02-18 18:49:51 +00:00
zeshengzong	c6b331f7d9	Deprecate `skip_code_recursive_on_cache_limit_hit` config flag (#136970 ) Fixes one of #136862 Make `skip_code_recursive_on_cache_limit_hit` flag deprecated. Affected logic is in here: `6931c1644a/torch/_dynamo/convert_frame.py (L866-L876)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136970 Approved by: https://github.com/williamwen42	2025-02-18 18:48:23 +00:00
Jiang, Yanbing	6f7e67c43c	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-18 18:44:26 +00:00
Huamin Li	dd2a943e14	Fix the AOTI compile failure with ARM CPU for Meta internal (#147204 ) Summary: Fix the AOTI compile failure with ARM CPU for Meta internal Differential Revision: D69642211 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147204 Approved by: https://github.com/houseroad	2025-02-18 17:54:34 +00:00
Andy Lugo	5d675de754	Update ck (#144799 ) Updates the CK version and re-implements kernel generation Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799 Approved by: https://github.com/jianyuh	2025-02-18 17:00:27 +00:00
Aleksei Nikiforov	a00d2b5144	s390x: add cleanup for cancelled docker image builds (#147110 ) When podman image build is cancelled, a couple of processes are left behind, and their existence prevents proper shutdown of runner container. Add cleanup step at the end of workflow using new option recently introduced in podman: https://github.com/containers/podman/pull/25102 Example of job preventing s390x worker cleaning up and restarting properly: https://github.com/pytorch/pytorch/actions/runs/13289159296/job/37105230728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147110 Approved by: https://github.com/huydhn	2025-02-18 16:26:46 +00:00
Yutao Xu	6edc419d69	Update torch-xpu-ops commit pin (#147358 ) Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](`a14d1eaa83`), includes: - SparseCSR XPU support - Refine build system Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358 Approved by: https://github.com/EikanWang	2025-02-18 16:05:25 +00:00
angelayi	0c8028e877	[export] Loosen symint input serialization (#147237 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147237 Approved by: https://github.com/avikchaudhuri	2025-02-18 13:03:47 +00:00
FFFrog	b10ba0a46c	Unify all sympy versions to avoid conflicts within PyTorch (#147197 ) As the title states. There are some tiny differences between 1.13.1 and 1.13.3: 1.13.1: `2e489cf4b1/sympy/core/numbers.py (L1591)` 1.13.3: `b4ce69ad5d/sympy/core/numbers.py (L1591)` Previous PR: https://github.com/pytorch/pytorch/pull/143908 ISSUE Related: https://github.com/pytorch/pytorch/issues/147144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147197 Approved by: https://github.com/malfet	2025-02-18 10:51:43 +00:00
Michal Gallus	d9cf1debf9	[ROCm][Windows] Fix clang-cl error related to -Wmissing prototypes enabled (#146981 ) Some of the Windows files (fused_kernels.cpp or temp_file.h) contain code that fails to compile when this flag is enabled when built with clang-cl. This PR resolves the issue by ensuring that even if we build with clang-cl, it doesn't include those flags on Windows. Alternatively, if needed, I can fix the files mentioned to pass under this flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-02-18 07:41:12 +00:00
PyTorch MergeBot	49e8f9c965	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit 22fae4c5f94eb43f71a2eebc1904880740cb1d60. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))	2025-02-18 05:11:32 +00:00
PyTorch UpdateBot	59a08138c5	[executorch hash update] update the pinned executorch hash (#147345 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147345 Approved by: https://github.com/pytorchbot	2025-02-18 05:08:06 +00:00
Yutao Xu	6a2bb629ec	Update torch-xpu-ops commit pin (#147302 ) Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](`b421032c8f`), includes: - Correct int4 weight pack implementation - Enhance build system: only build one shared library for the user Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302 Approved by: https://github.com/EikanWang	2025-02-18 05:04:15 +00:00
ZhiweiYan-96	59915b8dec	[Intel GPU] qlinear at XPU backend (#133307 ) # Motivation The PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` at Intel GPU. We register the qlinear ops at C++ backend via `TORCH_LIBRARY_IMPL`, the op this PR registers includes `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack conduct transpose on weight for fitting oneDNN requirement on weight to acquire higher performance. Also, we remove the limitation of the corresponding annotation method in the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion. We add the kChar(`torch.int8`) dtype in the `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type at GPU side. We verified the op through UTs and e2e model testing like ResNet18, ResNet50. # UT verification ``` DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qlinear_xpu \ -k test_qlinear_relu_xpu \ -k test_qlinear_gelu_xpu ``` # Runtime exemplification Here is the oneDNN verbose collected through running above UTs ``` //pure int8 gemm onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988 // post-relu fusion onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234 // post-gelu fusion onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user 
attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898 ```` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307 Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-18 04:02:42 +00:00
Yutao Xu	bb8c4ecc6d	Allow XPU device for validating the arguments to sparse compressed tensor factory functions (#147306 ) During Sparse tensor conversion, a validity check is performed. We need to allow XPU to pass this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147306 Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey	2025-02-18 03:55:54 +00:00
Animesh Jain	71484a2106	[pt2-benchmarks] Compiler reset on every run (#147313 ) Internal benchmarks call `run` in a loop. Compiler reset gives a clean env Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313 Approved by: https://github.com/jansel	2025-02-18 02:09:19 +00:00
Chen Lai	708428704e	patch for block-wise quantization + pt2e (#146946 ) Summary: https://github.com/pytorch/pytorch/pull/144492 was reverted due to duplicate kernel registration. This PR will re-introduce the patch Differential Revision: D69488779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146946 Approved by: https://github.com/jerryzh168, https://github.com/andrewor14	2025-02-18 01:15:26 +00:00
Tom Ritchford	59b7e52ad8	Fix non-bitwise type annotations for Tensor operators (see #145838 ) (#146845 ) Fix https://github.com/pytorch/pytorch/issues/145838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845 Approved by: https://github.com/Skylion007	2025-02-17 22:42:16 +00:00
amdfaa	1393f9a76c	[ROCm] Update inductor-perf-test-nightly-rocm.yml to use the correct labels & frequency (#147221 ) This workflow takes around 75-80hrs on ROCm, so scaling down the frequency to once per week until we get more CI capacity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147221 Approved by: https://github.com/pruthvistony, https://github.com/huydhn	2025-02-17 19:29:27 +00:00
Stonepia	6c0e7463af	Fix test_device_memory_allocated (#147311 ) Fixes #147310 `torch.ones` allocates memory that is released immediately because the result is not kept, so the following assertion fails. This PR stores the result in a temporary variable to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147311 Approved by: https://github.com/guangyey, https://github.com/Skylion007	2025-02-17 19:00:53 +00:00
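The reasoning behind the #147311 fix can be reproduced without a GPU. The sketch below is a CPython-only analogy, not the actual test: `Buffer` is a hypothetical class whose live-instance counter stands in for device memory usage, and it relies on CPython's reference counting freeing an unreferenced object as soon as the statement finishes.

```python
class Buffer:
    """Hypothetical stand-in for a tensor; counts live instances."""
    live = 0

    def __init__(self) -> None:
        Buffer.live += 1

    def __del__(self) -> None:
        Buffer.live -= 1


# Result is discarded: CPython drops the last reference right away,
# so nothing is "allocated" by the time we check.
Buffer()
after_discard = Buffer.live

# Result is kept in a variable: the object stays alive during the check.
tmp = Buffer()
after_keep = Buffer.live

print(after_discard, after_keep)
```

The same pattern applies to the test: an allocation must be bound to a name for as long as the memory statistics are being asserted on.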
Stepan Hruda	516133ddb0	Fix arvr macOS buck pytorch builds (#147292 ) Summary: X-link: https://github.com/ctrl-labs/src2/pull/42453 buck arvr macOS builds had a few issues that needed fixing. Test Plan: build with buck Differential Revision: D69722372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147292 Approved by: https://github.com/Skylion007	2025-02-17 18:47:24 +00:00
Jiang, Yanbing	22fae4c5f9	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-17 18:39:10 +00:00
Annop Wongwathanarat	1b29de5c05	Add NEON implementation for 8 bit quantized embedding bag on aarch64 (#147322 ) This improves performance by ~5.5x on NeoverseV1 cores using the following benchmarking script: ``` import torch import torch.nn as nn import numpy as np import torch.autograd.profiler as profiler np.random.seed(0) torch.manual_seed(0) class SimpleEmbeddingBagModel(nn.Module): def __init__(self, num_embeddings, embedding_dim): super(SimpleEmbeddingBagModel, self).__init__() weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)) obs = torch.ao.quantization.PerChannelMinMaxObserver(dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0) obs(weights) qparams = obs.calculate_qparams() qweight = torch.quantize_per_channel(weights, qparams[0], qparams[1], axis=0, dtype=torch.quint8) # Defining the EmbeddingBag layer self.qembedding_bag = torch.ao.nn.quantized.EmbeddingBag(num_embeddings, embedding_dim, _weight=qweight, mode='sum', include_last_offset=True, dtype=torch.quint8) def forward(self, input, offsets): # Forward pass through the EmbeddingBag layer result = self.qembedding_bag(input, offsets, per_sample_weights=None) return result num_embeddings = 40000000 embedding_dim = 128 model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim) model.eval() multi_hot = 100 batch_size = 400 input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long) offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot)) with torch.no_grad(): # warm up _ = model(input_tensor, offsets) with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof: for i in range(100): _ = model(input_tensor, offsets) print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147322 Approved by: https://github.com/malfet	2025-02-17 17:10:47 +00:00
PyTorch UpdateBot	71855a1cad	Update slow tests (#147308 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147308 Approved by: https://github.com/pytorchbot	2025-02-17 12:03:40 +00:00
Nikita Shulga	e8b20f6ef3	[MPS][BE] Turn `exec_unary_kernel` as MetalShaderLibrary method (#147299 ) And delete duplicate implementations from SpecialOps and UnaryKernel. Change input and output arguments order for SpecialOps kernels to match those of UnaryOps Fixes https://github.com/pytorch/pytorch/issues/146770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147299 Approved by: https://github.com/dcci ghstack dependencies: #147296, #147297	2025-02-17 08:31:24 +00:00
ZhiweiYan-96	ae5f7fec82	[Intel GPU] Enable fp64 GEMM (#140677 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire	2025-02-17 08:15:55 +00:00
Nikita Shulga	2b30e94fc0	[BE] Make `exec_unary_kernel` take TensorIterator as argument (#147297 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147297 Approved by: https://github.com/dcci ghstack dependencies: #147296	2025-02-17 07:34:35 +00:00
Nikita Shulga	3d251e6512	[BE] Switch all structured funcs to stubs (#147296 ) No need to have separate foobar_out_mps when registering a dispatch to foobar_stub will do And this makes `exec_unary_kernel` defined in UnaryKernel.mm and SpecialOps.mm look very similar Pull Request resolved: https://github.com/pytorch/pytorch/pull/147296 Approved by: https://github.com/dcci	2025-02-17 07:34:34 +00:00
leslie-fang-intel	424c1b82e0	[Inductor][CPP] Add the legalize low fp support for index expr (#147298 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/147279. The test case produced a low-precision floating-point value using `ops.index_expr`, but the CPP backend did not handle its legalization. This PR adds support for it. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_low_fp_index_expr_issue_147279 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147298 Approved by: https://github.com/jgong5	2025-02-17 07:11:20 +00:00
PyTorch UpdateBot	359165734b	[executorch hash update] update the pinned executorch hash (#147294 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147294 Approved by: https://github.com/pytorchbot	2025-02-17 05:03:05 +00:00
Yan Zhiwei	ae351d4d0e	[Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570 ) # Motivation Add context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type helps improve the performance of deep learning workloads during training/inference. The current PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger the tf32 acceleration in convolution kernels. # Validation * UT to test the context variable `python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set` * Runtime exemplification ``` onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971 ``` The `fpmath:tf32` field in the verbose output shows that the current context-setting utilities successfully trigger tf32 computation in the conv forward/backward_data/backward_weights kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2025-02-17 01:46:43 +00:00
Nikita Shulga	198ffbdf11	[MPS] Implement and test round.decimals (#147266 ) If inductor can do it, why not eager Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266 Approved by: https://github.com/Skylion007 ghstack dependencies: #147286	2025-02-16 23:17:13 +00:00
Aaron Gokaslan	e738f7ba23	[BE]: Enable ruff rule SIM113 (#147290 ) A lint rule that tells the user to avoid keeping track of their own counter and to use the builtin `enumerate` when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290 Approved by: https://github.com/jansel	2025-02-16 22:41:16 +00:00
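The pattern SIM113 targets can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the PyTorch code base; the first function shows the hand-maintained counter the rule flags, the second the `enumerate` form it suggests.

```python
def label_lines_manual(lines):
    # Pattern flagged by SIM113: a counter maintained by hand alongside the loop.
    labeled = []
    index = 0
    for line in lines:
        labeled.append(f"{index}: {line}")
        index += 1
    return labeled


def label_lines_enumerate(lines):
    # Suggested form: let the builtin enumerate supply the counter.
    return [f"{index}: {line}" for index, line in enumerate(lines)]
```

Both functions return the same list; the second simply has no counter that can drift out of sync with the loop.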
Zhou Fang	a8fa4bcfd2	[StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347) %getattr_256.1 : int = prim::dtype(%11175) %to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12) %lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8) ``` After optimization: ``` %11199 : int = prim::dtype(%int_66.1) %11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199) ``` It is similar to https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4. Differential Revision: D69627793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189 Approved by: https://github.com/hanyilou123	2025-02-16 22:16:02 +00:00
Nikita Shulga	5c0c99f658	[MPS][BE] Use stubs for floor/ceil/round/trunc (#147286 ) To avoid duplicating logic that those ops are no-ops for integral dtypes (And in preparation of adding `round_decimals` that calls round_stub if decimals are 0) Tested for the corner cases by manually invoking `round`, `trunc`, `floor` and `ceil` for int dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/147286 Approved by: https://github.com/Skylion007	2025-02-16 17:22:49 +00:00
Dmitry Rogozhkin	d27ecf85db	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension. Note that `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with compiler team to introduce such file extension in the compiler, but they are opposed to this. At the same time discussion around sycl file extension and adding sycl language support into such tools as cmake is ongoing. Eventually cmake also considers to introduce some file extension convention for sycl. I hope we can further influence cmake and compiler communities to broader adopt `.sycl` file extension. By default SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled. At the moment `pvc,xe-lpg`. This behavior can be overridden by setting `TORCH_XPU_ARCH_LIST` environment variables to the comma separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-16 16:50:59 +00:00
PyTorch MergeBot	dd5d0ea6bb	Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945 )" This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab. Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see `b16ae97ad0/1` ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))	2025-02-16 16:03:42 +00:00
lzhang2	b16ae97ad0	Generalize mixed precision in DDP (#146808 ) Motivation: 1. Generalize mixed precision in DDP. 2. Enable `SyncBatchNorm` for XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146808 Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/wconstab	2025-02-16 11:59:40 +00:00
Xuehai Pan	ee38a32c55	[Dynamo] support `isinstance(...)` check for type tuple (#146984 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146984 Approved by: https://github.com/jansel	2025-02-16 10:41:49 +00:00
Dmitry Rogozhkin	607379960b	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension. Note that `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with compiler team to introduce such file extension in the compiler, but they are opposed to this. At the same time discussion around sycl file extension and adding sycl language support into such tools as cmake is ongoing. Eventually cmake also considers to introduce some file extension convention for sycl. I hope we can further influence cmake and compiler communities to broader adopt `.sycl` file extension. By default SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled. At the moment `pvc,xe-lpg`. This behavior can be overridden by setting `TORCH_XPU_ARCH_LIST` environment variables to the comma separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey	2025-02-16 10:16:09 +00:00
Nikita Shulga	ed3b119c40	Skip unsupported types by MPS in `test_torchinductor.py` (#147211 ) - Skip unsupported dtypes in `test_split_cumsum` (and manually skip int64 for MacOS13) - Adapt `test_cat` to use `torch.half` instead of `torch.double` on MPS - Skip `test_adaptive_avg_pool1d_argmax` if avgpool is not implemented for all sizes Pull Request resolved: https://github.com/pytorch/pytorch/pull/147211 Approved by: https://github.com/jansel, https://github.com/Skylion007, https://github.com/dcci	2025-02-16 10:15:53 +00:00
Saurabh Mishra	0fb5b224b7	[DCP] Cache save plans: planner helpers and interface updates (#147116 ) Summary: This PR updates the planner interface and introduces the class variables to cache the local and global plans. Two new helpers are also introduced which will be used to compare if the plans have changed across save attempts and merge the delta plans. Test Plan: UTs Differential Revision: D69224488 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147116 Approved by: https://github.com/MeetVadakkanchery, https://github.com/huydhn	2025-02-16 07:18:26 +00:00
PyTorch UpdateBot	4bacd13c92	[executorch hash update] update the pinned executorch hash (#147273 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147273 Approved by: https://github.com/pytorchbot	2025-02-16 05:11:33 +00:00
cfgfung	8f20026bcb	[Intel GPU] Support SparseCsrXPU codegen (#144722 ) Adding a new dispatch key - `SparseCsrXPU` to enable Intel GPU support for SparseCsr Tensor. Similar PR: https://github.com/pytorch/pytorch/pull/139267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144722 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD Co-authored-by: Kanya-Mo <kanya.mo@intel.com>	2025-02-16 03:16:12 +00:00
Blaine Burton Rister	1677a31019	[Inductor] Fix 3D tiling with permute (#147249 ) This PR adds a test case and tiny fix for 3D tiling. Before this PR, tiling would crash because one of the candidates lacked a `"y"` dimension. Now, when we're calculating 3D tiling candidates, we assume the y size is 1 if it's missing. The test case implements a 3D permute using block pointers. ``` @triton.jit def triton_poi_fused_add_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr): znumel = 51 ynumel = 51 xnumel = 51 zoffset = tl.program_id(2) * ZBLOCK zindex = zoffset + tl.arange(0, ZBLOCK)[None, None, :] zmask = zindex < znumel yoffset = tl.program_id(1) * YBLOCK yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None] ymask = yindex < ynumel xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = xindex < xnumel x2 = xindex y1 = yindex z0 = zindex tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2]) tmp1 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[51, 1, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2]) tmp2 = tmp0 + tmp1 tmp3 = tmp0 + tmp0 tmp4 = tmp2 + tmp3 tl.store(tl.make_block_ptr(out_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), tl.broadcast_to(tmp4, [XBLOCK, YBLOCK, ZBLOCK]).to(tl.float32), boundary_check=[0, 1, 2]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147249 Approved by: https://github.com/jansel	2025-02-15 23:28:36 +00:00
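The fallback described in this commit (treat a missing `"y"` dimension as size 1) can be sketched as follows. This is only an illustration under assumptions: the function name `complete_3d_tiling` and the dict-based candidate layout are hypothetical and do not reflect Inductor's actual data structures.

```python
def complete_3d_tiling(candidate):
    # Hypothetical helper: a tiling candidate maps dimension names to sizes.
    # If the candidate has no "y" entry, assume a y size of 1 so that
    # downstream 3D logic can always read x, y, and z.
    completed = dict(candidate)
    completed.setdefault("y", 1)
    return completed
```

A candidate that already carries a `"y"` size is returned unchanged, so the fallback only affects the case that previously crashed.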
Tom Ritchford	44ee9ca593	[inductor] Add type annotations to _inductor/utils.py (#144108 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108 Approved by: https://github.com/eellison	2025-02-15 23:13:41 +00:00
Avik Chaudhuri	4ab967c44d	all reduce non strict (#147133 ) Summary: Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow. Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed. Test Plan: moved and updated test Differential Revision: D69607140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133 Approved by: https://github.com/tugsbayasgalan	2025-02-15 19:37:08 +00:00
Eli Uriegas	75a4b73816	utils: Update md5 call to be fips compliant (#147252 ) Updates md5 call to be fips compliant according to this issue: * https://github.com/pytorch/pytorch/issues/147236 Not going to add a conditional here because the minimum Python version that we support is already 3.9 Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252 Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet	2025-02-15 15:19:08 +00:00
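A minimal sketch of what a FIPS-friendly md5 call typically looks like on Python 3.9 and later, assuming the change relies on the `usedforsecurity` keyword of `hashlib.md5` (the commit message does not show the exact call, and `content_fingerprint` is a hypothetical name):

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    # usedforsecurity=False (available since Python 3.9) declares that the
    # digest is used only as a non-cryptographic fingerprint, which allows
    # md5 to be used in FIPS-restricted environments.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()
```

Because the keyword exists from Python 3.9 onward, no version check is needed when 3.9 is the minimum supported version.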
PyTorch MergeBot	6ca5c22e31	Revert "Enable fp16 linear layers in PyTorch via ACL (#144992 )" This reverts commit 5b37249259ad50d9b4b32a78a5b5178a1eb3d110. Reverted https://github.com/pytorch/pytorch/pull/144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](https://github.com/pytorch/pytorch/pull/144992#issuecomment-2660902238))	2025-02-15 12:40:59 +00:00
Jing Xu	86be5d4421	remove unnecessary xpu availability check when retrieving aot flags (#146966 ) As title Retrieving xpu aot flags that the pytorch binary was compiled against is not the same as running the binary itself. Thus it doesn't seem to necessarily check if there is an xpu environment available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146966 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/albanD	2025-02-15 09:15:49 +00:00
leslie-fang-intel	9e0b3e9b6c	[Inductor] Fix Inplace Buffer inner name conflict (#147199 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/146975. When creating the `InplacedBuffer` inner name, we only count the number of unique `InplacedBuffer` or `RemovedArg`. The name may conflict, for example as reported in this issue ``` ---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2 dict_values([ InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']), InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']), InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']), InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])]) ---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2 dict_values([ <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']), InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']) <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']), InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']) ]) ``` - The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer` - The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer` They are 2 different `InplacedBuffer`, but with the same name `in_out_ptr2`. In this PR, we fix this regression by counting the number of `RemovedArg`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199 Approved by: https://github.com/jansel	2025-02-15 08:31:06 +00:00
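The naming conflict in this commit can be reproduced with a small stand-alone sketch. Everything below is a simplified assumption for illustration: the `RemovedArg` and `InplacedBuffer` classes are stand-ins, and the two naming functions are hypothetical, not the code in `torch/_inductor/codegen/common.py`.

```python
from dataclasses import dataclass, field


class RemovedArg:
    """Stand-in for a sentinel that marks a removed buffer."""


REMOVED = RemovedArg()  # one shared sentinel instance, as in the log above


@dataclass(eq=False)
class InplacedBuffer:
    inner_name: str
    other_names: list = field(default_factory=list)


def inner_name_unique_only(buffers):
    # Counts distinct objects only, so every removed entry collapses into
    # the single shared sentinel and the index can repeat.
    unique = {id(value) for value in buffers.values()}
    return f"in_out_ptr{len(unique)}"


def inner_name_counting_removed(buffers):
    # Counts distinct live buffers plus every removed entry, so removing
    # buffers no longer lets an already used index come back.
    values = list(buffers.values())
    live = {id(v) for v in values if not isinstance(v, RemovedArg)}
    removed = sum(isinstance(v, RemovedArg) for v in values)
    return f"in_out_ptr{len(live) + removed}"
```

With two removed entries and one live buffer already named `in_out_ptr2`, the first function proposes `in_out_ptr2` again, while the second moves on to an unused index.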
Jason Ansel	a30f145101	[inductor] Don't leak pointers to cpp_wrapper with lru_cache (#147233 ) Putting lru_cache on methods will keep pointers to the `self` objects alive forever and leak memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147233 Approved by: https://github.com/yanboliang	2025-02-15 08:25:41 +00:00
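The leak mentioned here comes from the cache key including `self`: a cache attached to the method at class level holds a strong reference to every instance it has seen. The sketch below demonstrates the effect with hypothetical classes; the per-instance dictionary is one common alternative and is not claimed to be the fix used in this PR.

```python
import functools
import gc
import weakref


class LeakyWrapper:
    @functools.lru_cache(maxsize=None)
    def render(self, name):
        # The class-level cache stores (self, name) as its key,
        # which keeps the instance alive indefinitely.
        return f"wrapped:{name}"


class SafeWrapper:
    def __init__(self):
        # Per-instance cache: it is released together with the instance.
        self._render_cache = {}

    def render(self, name):
        if name not in self._render_cache:
            self._render_cache[name] = f"wrapped:{name}"
        return self._render_cache[name]


def is_collected_after_use(cls):
    # Create an instance, call the cached method once, drop the only
    # explicit reference, and report whether the object was freed.
    obj = cls()
    obj.render("kernel")
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None
```

Calling `is_collected_after_use(LeakyWrapper)` reports that the instance survives, whereas the `SafeWrapper` instance is freed as soon as its last reference goes away.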
Animesh Jain	9dc702875d	[dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217 ) Fixes https://github.com/pytorch/pytorch/issues/147162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217 Approved by: https://github.com/williamwen42	2025-02-15 07:59:33 +00:00
cyy	8daa742e8b	Remove code for Python < 3.9 (#147181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181 Approved by: https://github.com/albanD	2025-02-15 06:43:26 +00:00
PyTorch UpdateBot	9919375cf1	[executorch hash update] update the pinned executorch hash (#147241 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147241 Approved by: https://github.com/pytorchbot	2025-02-15 05:02:22 +00:00
cyy	8f291e8c00	Fix clang-tidy warnings in torch/jit (#146963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963 Approved by: https://github.com/davidberard98	2025-02-15 03:36:59 +00:00
briancoutinho	4233a77960	update kineto submodule to include fix for windows build (#147195 ) Fixes an issue causing windows builds to fail https://github.com/pytorch/kineto/pull/1039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195 Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16	2025-02-15 01:53:16 +00:00
leslie-fang-intel	c1fcba3648	[Inductor] Fix the lowering of squeeze when input is not contiguous (#146746 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/143498. The issue happens when we lower `select = torch.ops.aten.select.int(cat, 1, 0)`. For example, when `cat` is contiguous with size[2, 2] stride[2,1] - for eager, it returns a view of size[2,] stride[2,] - for Inductor lowering, it returns the wrong stride 1 instead of 2 ``` TensorBox( ReinterpretView( StorageBox( ConcatKernel(name='buf10', layout=FixedLayout('cpu', torch.int64, size=[u0, 2], stride=[2, 1]), inputs=[ComputedBuffer(name='buf8', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b856449d0>, ranges=[u0, 1])), ComputedBuffer(name='buf9', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b85644790>, ranges=[u0, 1]))]) ), FixedLayout('cpu', torch.int64, size=[u0], stride=[1]), origins=OrderedSet([select]) ) ) ``` To fix this issue, we give the right stride when lowering `squeeze`. Test Plan ``` python -u -m pytest -s -v test/inductor/test_unbacked_symints.py -k test_issue_143498 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146746 Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/eellison	2025-02-15 01:33:04 +00:00
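The expected eager result quoted in this commit (size `[2]`, stride `[2]`) follows from how a `select` view is built: the selected dimension is dropped and the remaining strides are kept unchanged. A pure-Python sketch of that arithmetic, with `select_view` as a hypothetical helper name rather than an Inductor or ATen function:

```python
def select_view(size, stride, dim, index):
    # Selecting one index along `dim` yields a view that starts
    # index * stride[dim] elements into the storage and simply omits
    # that dimension; the other sizes and strides are carried over.
    offset = index * stride[dim]
    new_size = size[:dim] + size[dim + 1:]
    new_stride = stride[:dim] + stride[dim + 1:]
    return new_size, new_stride, offset
```

For size `[2, 2]` with stride `[2, 1]`, selecting index 0 along dimension 1 keeps the outer stride of 2, which is why a stride of 1 in the lowered layout is wrong.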
Yidi Wu	bf0c89a72f	[dynamo] fix error message when logging graph that contains hops (#147227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147227 Approved by: https://github.com/zou3519	2025-02-15 00:53:44 +00:00
Shawn Xu	933f921b36	[PT][FSDP] support custom all reduce hook across FSDP units (#147114 ) This change adds an API `set_all_reduce_hook` to the `FSDPModule` to support customized all reduce either in native HSDP (2d mesh) setup or custom HSDP (1d FSDP + custom AR across replicas) * For native HSDP, the original AR would still run as is and this hook allows for additional gradient modification post all reduce. * For custom HSDP, the original AR will be skipped and all the logic is instead expected to be executed in the hook. The custom hook is expected to perform operations in place (no return value). Example basic usage: ``` model = ... fully_shard(model, mesh=...) model.set_all_reduce_hook(my_hook) ``` By default, the hook will run in the default all reduce stream post reduce scatter. When native HSDP is NOT enabled, the custom hook can be specified to run in a custom stream. This custom stream will also be synchronized post reduce scatter similarly. See tests for examples. Test Plan: CI Differential Revision: D68255583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147114 Approved by: https://github.com/awgu	2025-02-15 00:38:00 +00:00
Isuru Fernando	a9ae3340ca	Fix triton masked loading for non-block tl.loads (#144782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782 Approved by: https://github.com/eellison	2025-02-15 00:07:33 +00:00
eellison	49727bbc9d	Turn on prologue fusion (#147008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147008 Approved by: https://github.com/masnesral	2025-02-14 23:36:21 +00:00
Animesh Jain	76f57e184a	[dynamo] Make SliceVariable a subclass of VariableTracker (#147046 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147046 Approved by: https://github.com/StrongerXi ghstack dependencies: #146819, #146995	2025-02-14 23:22:27 +00:00
Mu-Chu Lee	a5c0dab900	[AOTInductor] Guard RAII_cpuMalloc with macro (#147150 ) Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function] Test Plan: Existing tests Differential Revision: D69623481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150 Approved by: https://github.com/henrylhtsang	2025-02-14 23:21:35 +00:00
Yidi Wu	1224765286	[cond] make cond call fake kernel in dynamo (#147045 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147045 Approved by: https://github.com/zou3519 ghstack dependencies: #146954	2025-02-14 23:13:15 +00:00
Yidi Wu	85a82c5bc8	[cond] make cond re-dispatch in proxy mode (#146954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954 Approved by: https://github.com/zou3519	2025-02-14 23:13:14 +00:00
atalman	eecee5863e	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 21:23:19 +00:00
Bin Bao	d38db94689	[inductor][refactor] Move _compile_file to cpp_builder (#147202 ) Summary: To further consolidate cpp build logic into cpp_builder Test Plan: CI Differential Revision: D69595327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202 Approved by: https://github.com/yushangdi	2025-02-14 21:02:30 +00:00
henrylhtsang	dd86491b35	[cutlass backend][BE] refactor tests to remove duplicate logic (#146743 ) Doing many things here: * remove duplicate hip checking logic * check for CUDA in setup * remove CUTLASS_DIR setting. That is not needed when building from source and fbcode anymore * fix some typing errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/146743 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2025-02-14 20:50:27 +00:00
Dan Zimmerman	6f035d8462	[torch] Make amdsmi cdll hook private (#147207 ) Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear Test Plan: CI Differential Revision: D69665234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207 Approved by: https://github.com/PaulZhang12	2025-02-14 20:30:48 +00:00
Tom Ritchford	272ead7b5e	Make fx.node.map_arg() and .map_aggregate() generic (#146248 ) ## What's the problem? The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuples`, `list`s, etc, and return a new collection of the same type. Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](`5d55a6585d/torch/fx/node.py (L48-L58)`): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes. As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements. ## What's the solution? Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.) ## Won't it break everything? It doesn't break the type checker - one place needed an extra hint. There have been code breakages, resolved one, at least one new one... we'll see! Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2025-02-14 19:25:32 +00:00
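The idea of making a recursive mapping function generic on its first argument can be sketched without `torch`. The function below, `map_nested`, is a hypothetical stand-in and not the `torch.fx.node.map_aggregate` implementation; it only shows how a `TypeVar` lets the type checker carry the container type from input to output.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def map_nested(value: T, fn: Callable[[Any], Any]) -> T:
    # The return annotation reuses T, so a caller passing a
    # dict[str, list[int]] gets that same static type back instead of a
    # broad union that would need a cast or a "# type: ignore".
    if isinstance(value, tuple):
        return tuple(map_nested(v, fn) for v in value)  # type: ignore[return-value]
    if isinstance(value, list):
        return [map_nested(v, fn) for v in value]  # type: ignore[return-value]
    if isinstance(value, dict):
        return {k: map_nested(v, fn) for k, v in value.items()}  # type: ignore[return-value]
    # Leaves are handed to the visitor function unchanged in signature.
    return fn(value)
```

The visitor `fn` keeps its loose signature, which mirrors the commit's note that only the first argument and the return type become generic.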
Justin Chu	58f654b5ad	[ONNX] Consolidate constants to a single location (#147166 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147166 Approved by: https://github.com/titaiwangms ghstack dependencies: #147164, #147165	2025-02-14 19:08:19 +00:00
Justin Chu	765bc30ab9	[ONNX] Set warning stacklevel so it appears at the torch.onnx call site (#147165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147165 Approved by: https://github.com/Skylion007 ghstack dependencies: #147164	2025-02-14 19:04:43 +00:00
Justin Chu	9a1eac6704	[ONNX] Handle number of outputs in builder (#147164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147164 Approved by: https://github.com/titaiwangms	2025-02-14 19:03:51 +00:00
PyTorch MergeBot	5517eb4398	Revert "[cutlass backend] Do not change dtype of GEMM template (#146877 )" This reverts commit 260b21b8bca6edd3e0b89b800d6efa8243f0d122. Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to let me resubmit ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2660053270))	2025-02-14 18:58:18 +00:00
PyTorch MergeBot	aac5d1a289	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))	2025-02-14 18:31:54 +00:00
Henry Tsang	20a9938069	try print stacktrace for error (#147061 ) Differential Revision: D69573525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147061 Approved by: https://github.com/Skylion007	2025-02-14 18:28:03 +00:00
Nikita Shulga	8b5ee275fb	[MPS] Fix cholesky_ex for empty inputs (#147159 ) By making sure that `info` is actually initialized if input is empty (but no need to do anything about `out`, as it's guaranteed to be an empty tensor). Also move output resizing logic before the `input.numel()` check Fixes https://github.com/pytorch/pytorch/issues/147128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147159 Approved by: https://github.com/albanD	2025-02-14 17:44:08 +00:00
Catherine Lee	0d16188c06	[CI] Use job name to index into test times json (#147154 ) When the test times are generated, it doesn't know what the build environment is because it's an environment variable. But when we index into the test times, we (previously) didn't know what the job name is. These are usually the same, but sometimes they're different, and when they're different it ends up using default, which can have unbalanced sharding. I think job name was added at some point to most of the CI environments but I didn't realize, so we can now update this code to use the job name instead so the generation and the indexing match. Also upload stats workflow for mps. Checked that inductor_amx doesn't use default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147154 Approved by: https://github.com/huydhn	2025-02-14 17:06:56 +00:00
Mikayla Gawarecki	e8fbc86de0	Make torch.cuda.gds APIs public (#147120 ) Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120 Approved by: https://github.com/albanD	2025-02-14 17:06:50 +00:00
Jack Taylor	c3853d924f	Introduce new template heuristic for triton autotune configs (#144985 ) Initial PR to refactor bulkiness of mm_common to allow for better device-specific specialisation e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm specific optimisations in. This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs: - CPUConfigHeuristic() - CUDAConfigHeuristic() - ROCmConfigHeuristic() - XPUConfigHeuristic() These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs. The mm_common, mm_plus_mm and conv configurations are implemented in this class, in the future we plan to bring in flex attention configurations also so all of the tuning config logic for templated triton kernels are handled in this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985 Approved by: https://github.com/jansel	2025-02-14 17:01:06 +00:00
PyTorch MergeBot	e06ee4aa9f	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))	2025-02-14 16:44:46 +00:00
PyTorch MergeBot	059dfe2081	Revert "update kineto submodule (#147015 )" This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927. Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))	2025-02-14 16:11:08 +00:00
atalman	06f4a5c0e5	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 15:29:59 +00:00
Guilherme Leobas	cefd9805de	Add `RAISE_VARARGS 0` (#146493 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146493 Approved by: https://github.com/zou3519 ghstack dependencies: #146498, #146492	2025-02-14 13:37:23 +00:00
Guilherme Leobas	134723ee1c	Add `WITH_EXCEPT_START` opcode (#146492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146492 Approved by: https://github.com/anijain2305, https://github.com/zou3519 ghstack dependencies: #146498	2025-02-14 13:37:23 +00:00
Guilherme Leobas	dbb86b78ad	Add `sys.exc_info` and `sys.exception` (#146498 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146498 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2025-02-14 13:37:14 +00:00
angelayi	ea188ac0c7	[export] Add meta for aten.bincount (#147129 ) Fixes https://github.com/pytorch/pytorch/issues/147094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147129 Approved by: https://github.com/pianpwk	2025-02-14 10:33:54 +00:00
Yutao Xu	de26ddfbdc	Update torch-xpu-ops commit pin (#146671 ) Update the torch-xpu-ops commit to 80c375570e2b6b2989a8610da1871f8a50dfddc7, which includes: - Aten operator coverage improvement - SYCL kernel optimization - Nested Tensor OPs support Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671 Approved by: https://github.com/EikanWang	2025-02-14 09:30:36 +00:00
leslie-fang-intel	bd019c0bb4	[Inductor][CPP] Fix node name for wgt delete (#147056 ) Summary This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056 Approved by: https://github.com/jgong5	2025-02-14 06:27:41 +00:00
Nikita Shulga	10bc8f25b2	[MPS][BE] Migrate polar to use functor (#147184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147184 Approved by: https://github.com/dcci ghstack dependencies: #147182, #147183	2025-02-14 06:25:36 +00:00
Nikita Shulga	278ffd84fc	[MPS][BE] Add copysign integral flavors as functor (#147183 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147183 Approved by: https://github.com/dcci ghstack dependencies: #147182	2025-02-14 06:25:36 +00:00
Nikita Shulga	2ef51cfb9d	[BE][MPS] Infer results of functor (#147182 ) Do not assume that functor will return the same results as its arguments, but rather dynamically infer it using `decltype` and `::metal::declval`. This is a no-op that prepares for migration of `copysign` of integral arguments, that would return a float Pull Request resolved: https://github.com/pytorch/pytorch/pull/147182 Approved by: https://github.com/dcci	2025-02-14 06:25:27 +00:00
Wu, Chunyuan	331d5cf560	[inductor] [cpp] Support vectorization for score and mask in FlexAttention CPU (#143638 ) ## Description We generate vectorized kernels for score and mask in FlexAttention with this PR. ## Modification The main changes include: - For the input and output buffer to the mask and score function, instead of passing scalars, we pass tensors to it. - For the mask function, the original function which works on a scalar only includes the logic of calculating the mask value. The PR adds the logic of applying the mask to the qk_data tensor into the graph and then leverages the CPP backend to generate vectorized kernels. The original mask graph: ```python def mask_fn(b, h, q_idx, kv_idx): mask = q_idx >= kv_idx return mask ``` The converted_mask_graph should be: ```python def converted_mask_fn(qk_data, b, h, q_idx, kv_idx): mask = q_idx >= kv_idx qk_data = torch.where(mask, qk_data, torch.full_like(qk_data, -float("inf"))) return qk_data ``` ## Benchmark For q, k, v of shape: `[1, 32, 1024, 128]`, using 40 CPU cores, we observe over 20x speedup compared with the non vectorized version for both `is_causal` = `False` and `True`. ## Test plan The existing FlexAttention UTs (`test/inductor/test_flex_attention.py`, `test/inductor/test_flex_decoding.py`) can cover the change in this PR. 
## Output code Code before this PR is in scalar version: ```cpp // apply score mod function for (int64_t row = 0; row < cur_qSplitSize; ++row) { for (int64_t col = 0; col < cur_kvSplitSize; col++) { std::vector<int64_t> b_idx = {i}; std::vector<int64_t> h_idx = {j}; std::vector<int64_t> q_idx = {m+row}; int64_t phisical_kv_idx = n+col; if (use_kv_indice) { phisical_kv_idx= *kv_logical_data * kvBlockSize + col; } std::vector<int64_t> kv_idx = {phisical_kv_idx}; accum_t* in_ptr0 = qk_data + row * cur_kvSplitSize + col; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); accum_t* out_ptr0 = in_ptr0; { { { auto tmp0 = in_ptr0[static_cast<int64_t>(0L)]; out_ptr0[static_cast<int64_t>(0L)] = tmp0; } } } } } // Apply block mask, fill unused with -inf for (int64_t row = 0; row < cur_qSplitSize; ++row) { for (int64_t col = 0; col < cur_kvSplitSize; col++) { std::vector<int64_t> b_idx = {i}; std::vector<int64_t> h_idx = {j}; std::vector<int64_t> q_idx = {m+row}; int64_t phisical_kv_idx = n+col; if (use_kv_indice) { phisical_kv_idx= *kv_logical_data * kvBlockSize + col; } std::vector<int64_t> kv_idx = {phisical_kv_idx}; accum_t* qk_block = qk_data + row * cur_kvSplitSize + col; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); std::vector<int64_t> temp = {0}; int64_t* out_ptr1 = temp.data(); { { { auto tmp0 = static_cast<bool>(true); out_ptr1[static_cast<int64_t>(0L)] = tmp0; } } } *qk_block = *out_ptr1 != 0 ? 
*qk_block : -std::numeric_limits<accum_t>::infinity(); } } ``` Code after this PR will be vectorized: ```cpp accum_t* in_ptr0 = qk_data; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); // apply score mod function { accum_t* out_ptr0 = in_ptr0; { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16)); tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0)); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))); tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))); } } } } } } // Apply block mask, fill unused with -inf { accum_t* out_ptr1 = in_ptr0; { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < 
static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16)); auto tmp1 = static_cast<bool>(true); auto tmp2 = -std::numeric_limits<float>::infinity(); auto tmp3 = at::vec::VecMask<float,1>::from(tmp1); auto tmp4 = at::vec::Vectorized<float>(tmp2); auto tmp5 = decltype(tmp0)::blendv(tmp4, tmp0, tmp3.template cast<float,1>()); tmp5.store(out_ptr1 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0)); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize))) { for (int64_t x1_tail = static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))));x1_tail < static_cast<int64_t>(cur_kvSplitSize); x1_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)]; auto tmp1 = static_cast<bool>(true); auto tmp2 = -std::numeric_limits<float>::infinity(); auto tmp3 = tmp1 ? tmp0 : tmp2; out_ptr1[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)] = tmp3; } } } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143638 Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel	2025-02-14 05:26:18 +00:00
PyTorch UpdateBot	ce38bfd299	[executorch hash update] update the pinned executorch hash (#147157 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147157 Approved by: https://github.com/pytorchbot	2025-02-14 05:04:17 +00:00
Nikita Shulga	92f669e39c	[BE] Use `c10::multiply_integers` in cholesky_impl (#147163 ) That replaces explicit for loop Pull Request resolved: https://github.com/pytorch/pytorch/pull/147163 Approved by: https://github.com/huydhn	2025-02-14 03:59:17 +00:00
Animesh Jain	2d089a5697	[dynamo] Remove unintended lru_cache (#147147 ) I forgot to remove it while adding the frozenset __contains__ method in this PR - https://github.com/pytorch/pytorch/pull/146062 This is causing a memory leak Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147147 Approved by: https://github.com/williamwen42	2025-02-14 03:55:39 +00:00
Aaron Gokaslan	6344ca1dd4	[BE][Ez]: Apply FURB188: use str remove(pre\|suf)fix (#146997 ) Since we are on 3.9, we can use this nice str builtin which is more readable and more efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997 Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel	2025-02-14 03:38:07 +00:00
cyy	d473c212fd	Remove code for Python < 3.9 (#147097 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147097 Approved by: https://github.com/albanD	2025-02-14 03:22:49 +00:00
kareem mohiddeen shaik	880e176544	[inductor] Fix for pattern file contains 'getitem' fails during import of the pattern module (#144980 ) For example, any pattern module that has the following pattern generated fails to import because the name getitem is undefined. native_dropout_default = CallFunction(aten.native_dropout.default, div_Tensor_1, KeywordArg('dropout_p'), True, _users=2) getitem = CallFunction(getitem, native_dropout_default, 0) This fix will resolve the error. Fixes #144674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144980 Approved by: https://github.com/eellison	2025-02-14 02:30:24 +00:00
Zhengxu Chen	0b84311842	[export] Generate printers/parsers for serialization enum values. (#147126 ) Summary: Generate two helper functions for enum classes in generated_serialization_types.h printEnum: will convert enum values into strings. parseEnum: will convert strings into enum values. Test Plan: CI Differential Revision: D69604850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126 Approved by: https://github.com/yiming0416	2025-02-14 02:14:35 +00:00
Basil Wong	05001f0459	Add Structured Tracing for Traced Graph Edge Details for AC Debugging (#146634 ) Summary: Updating the structured trace infrastructure so that we are able to output to Zoomer and have an E2E solution. Context Doc: https://docs.google.com/document/d/1T6omIBEWVhbOiwDLSLffgQwjxiT2rQv8QvvQwXkw4fY/edit?usp=sharing Test Plan: ### Testing Structured Log + tlparse locally Command: ``` TORCH_TRACE=/data/users/basilwong/fbsource/fbcode/log_torch_trace buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 ``` Torch Trace Logs (local then sent to paste): P1686419449 ``` cat log_torch_trace/dedicated_log_torch_trace_rank_0_2lg012xo.log \| pastry P1686419449 ``` tlparse output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 tlparse graph edge details output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/9_0_0/joint_graph_information_397.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 Differential Revision: D61557220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146634 Approved by: https://github.com/jansel, https://github.com/Yuzhen11	2025-02-14 02:04:26 +00:00
Colin L Reliability Rice	486fc12d7e	torch: Log a unified waitcounter for torch.compile and triton.autotune (#146723 ) Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up. Test Plan: Waitcounter only change, relying on existing tests. Differential Revision: D69215401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723 Approved by: https://github.com/davidberard98	2025-02-14 02:04:13 +00:00
Jiang, Yanbing	f0bdc27f74	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-14 02:03:53 +00:00
leslie-fang-intel	c5a9e4a6a0	[Inductor][CPP] Fix a CPP GEMM Template output data type issue (#146958 ) Summary Issue found when fixing https://github.com/pytorch/ao/issues/1662. A FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used the slice of global buffer `Y` as the output of micro GEMM instead of a `local buffer`. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix will ensure the use of a local buffer in such cases. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958 Approved by: https://github.com/jgong5	2025-02-14 01:40:08 +00:00
xinan.lin	d3524ecdd6	[Break XPU] Align meta calculation for fft_r2c with _fft_r2c_mkl (#146763 ) Fix #146761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146763 Approved by: https://github.com/jansel ghstack dependencies: #146762, #145248, #146880	2025-02-14 01:39:18 +00:00
xinan.lin	ade5af9430	[XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880 ) Fix #146879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880 Approved by: https://github.com/EikanWang, https://github.com/jansel ghstack dependencies: #146762, #145248	2025-02-14 01:39:18 +00:00
xinan.lin	9befdf565a	[Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_indutor.py (#145248 ) Fix #145247 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #146762	2025-02-14 01:39:11 +00:00
xinan.lin	972e927134	[Break XPU][Inductor UT] Fix XPU Inductor UT failures introduced from community. (#146762 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146762 Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel	2025-02-14 01:38:50 +00:00
Dan Zimmerman	6419076db9	[torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117 ) Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of rocm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi. Test Plan: CI ``` NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024') RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce ``` Differential Revision: D69597647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117 Approved by: https://github.com/malfet	2025-02-14 01:11:59 +00:00
Zhang, Jianyi	20a369aa3a	[Intel GPU] Avoid copy when the input of Matmul is broadcasted (#143784 ) Avoid copy when the input of Matmul is 3D and broadcasted on batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcasted into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 can improve from 42 ms to 13.7 ms on torch.compile and from 57.5 ms to 32 ms in eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143784 Approved by: https://github.com/EikanWang	2025-02-14 00:48:07 +00:00
Simon Fan	057bcd3a45	[ca] eliminate duplicate getitem graph nodes for shape inputs (#146875 ) should reuse existing proxies instead of creating new ones before: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpL7hmHe/0_-_-_0/compiled_autograd_graph_3.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class CompiledAutograd0(torch.nn.Module): def forward(self, inputs, sizes, scalars, hooks): # No stacktrace found for following nodes getitem = inputs[0] getitem_1 = inputs[1] getitem_2 = inputs[2]; inputs = None getitem_3 = sizes[0]; getitem_3 = None getitem_4 = sizes[1]; getitem_4 = None getitem_5 = sizes[2]; getitem_5 = None getitem_6 = sizes[3]; getitem_6 = None getitem_7 = sizes[4]; getitem_7 = None getitem_8 = sizes[5]; getitem_8 = None getitem_9 = sizes[6]; getitem_9 = None getitem_10 = sizes[7]; getitem_10 = None getitem_11 = sizes[8]; getitem_11 = None getitem_12 = sizes[9]; getitem_12 = None getitem_13 = sizes[10]; getitem_13 = None getitem_14 = sizes[11]; getitem_14 = None getitem_15 = sizes[12]; getitem_15 = None getitem_16 = sizes[13]; getitem_16 = None getitem_17 = sizes[14]; getitem_17 = None getitem_18 = sizes[15]; getitem_18 = None getitem_19 = sizes[0] getitem_20 = sizes[1] getitem_21 = sizes[2] getitem_22 = sizes[3] getitem_23 = sizes[4] getitem_24 = sizes[5] getitem_25 = sizes[6] getitem_26 = sizes[7] getitem_27 = sizes[8] getitem_28 = sizes[9] getitem_29 = sizes[10] getitem_30 = sizes[11] getitem_31 = sizes[12] getitem_32 = sizes[13] getitem_33 = sizes[14] getitem_34 = sizes[15]; sizes = None ``` after: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpCo5T6B/0_-_-_0/compiled_autograd_graph_1.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class CompiledAutograd0(torch.nn.Module): def forward(self, inputs, sizes, scalars, hooks): # No stacktrace found for following nodes getitem = inputs[0] getitem_1 = inputs[1] getitem_2 = 
inputs[2]; inputs = None getitem_3 = sizes[0] getitem_4 = sizes[1] getitem_5 = sizes[2] getitem_6 = sizes[3] getitem_7 = sizes[4] getitem_8 = sizes[5] getitem_9 = sizes[6] getitem_10 = sizes[7] getitem_11 = sizes[8] getitem_12 = sizes[9] getitem_13 = sizes[10] getitem_14 = sizes[11] getitem_15 = sizes[12] getitem_16 = sizes[13] getitem_17 = sizes[14] getitem_18 = sizes[15]; sizes = None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146875 Approved by: https://github.com/jansel ghstack dependencies: #146720, #146735	2025-02-13 21:41:33 +00:00
Simon Fan	76dacd5fc7	[ca] log graph before reordering passes (#146735 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146735 Approved by: https://github.com/jansel ghstack dependencies: #146720	2025-02-13 21:41:33 +00:00
Gajanan Choudhary	cdbf677cdd	Remove outdated comment in ATen/mkl/Sparse.h about lack of Windows support (#147125 ) Fixes #147124. * #102604 added support for Intel oneMKL Sparse BLAS APIs so there was an outdated comment left around in the codebase that can now be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147125 Approved by: https://github.com/janeyx99	2025-02-13 21:34:05 +00:00
Aaron Gokaslan	1f41ceb713	[BE][Ez]: Enable ruff rule banning print in assert (#146615 ) Enables a few ruff rules * Ban print statements within asserts (likely bugs) * ~Use string for Decimal literal to prevent loss of precision~ * ~Do not use default args for __post__init__ in dataclasses, they likely were meant to go into the factory method, the __init__, or somewhere else. The default values are useless here.~ Wait until ruff upgrade for the last 2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615 Approved by: https://github.com/jansel	2025-02-13 21:14:00 +00:00
angelayi	5469e5c556	[export] Minor fix to locals (#146955 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146955 Approved by: https://github.com/bobrenjc93	2025-02-13 20:29:15 +00:00
Bin Bao	7b4efb492b	[inductor][refactor] Make _compile_file only used for fbcode (#147106 ) Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder. Test Plan: CI Differential Revision: D69592025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106 Approved by: https://github.com/yushangdi	2025-02-13 20:22:31 +00:00
Chen Lai	2d3db4509a	fix pt2e block wise quantization test (#147035 ) Differential Revision: D69559217 https://github.com/pytorch/pytorch/pull/145941 breaks the unit test added for prepare pt2e + block wise quantization. Fixing Pull Request resolved: https://github.com/pytorch/pytorch/pull/147035 Approved by: https://github.com/andrewor14	2025-02-13 19:44:56 +00:00
Yang Wang	b0553cee6b	[Utilization] post-test-process workflow (#145310 ) # Overview Add reusable workflow to trigger the post-test right after each test job is complete. Cousion with pr to setup the runner permissions: Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files add to lix fleet:https://github.com/pytorch/ci-infra/pull/322/files Currently I turn on the debug flag for testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310 Approved by: https://github.com/huydhn	2025-02-13 18:51:19 +00:00
Henry Tsang	260b21b8bc	[cutlass backend] Do not change dtype of GEMM template (#146877 ) I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69085556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877 Approved by: https://github.com/ColinPeppler	2025-02-13 18:36:16 +00:00
rzou	92d448ff62	Add self to CODEOWNERS for fx/proxy.py; warn against adding new node arg types (#147031 ) Not sure if there's a better way Pull Request resolved: https://github.com/pytorch/pytorch/pull/147031 Approved by: https://github.com/StrongerXi ghstack dependencies: #147016, #147012, #147013	2025-02-13 18:21:21 +00:00
PyTorch MergeBot	9a883007a2	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )" This reverts commit c7515da7b00de40942c83dc5856b6daec727e280. Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))	2025-02-13 18:04:26 +00:00
PyTorch MergeBot	65e8862b9a	Revert "[cond] make cond re-dispatch in proxy mode (#146954 )" This reverts commit 2ce6de2415fb6592dd4447ebea334fd12b8c31ea. Reverted https://github.com/pytorch/pytorch/pull/146954 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert it to cleanly revert 140979 ([comment](https://github.com/pytorch/pytorch/pull/146954#issuecomment-2657357742))	2025-02-13 18:02:33 +00:00
Nikhil Gupta	1f8ff6812d	[Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836 ) Fixes: https://github.com/pytorch/pytorch/issues/146740 Description: 1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases. Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836 Approved by: https://github.com/malfet	2025-02-13 17:49:26 +00:00
Brian Hirsh	447a142de2	support input mutations on tangents in compile (#141131 ) Fixes https://github.com/pytorch/pytorch/issues/141111. We previously supported mutations on saved activations that happened in the backward. This PR extends the support to tangents Pull Request resolved: https://github.com/pytorch/pytorch/pull/141131 Approved by: https://github.com/zou3519	2025-02-13 17:48:56 +00:00
Saurabh Mishra	7077d0ac8c	[DCP] Introduce modules metadata in the storage_meta (#146654 ) Summary: Introduce the list of modules in the storage_meta which is shared between the planner and the storage writer. We will use it to let the storage writer know about the modules in the state dict and create module directories in the checkpoint. Test Plan: UTs Pull Request resolved: https://github.com/pytorch/pytorch/pull/146654 Approved by: https://github.com/MeetVadakkanchery	2025-02-13 17:44:30 +00:00
PyTorch MergeBot	938209fb6f	Revert "Use 2022 as default VC_YEAR for windows builds (#147053 )" This reverts commit 858bc0cea50614d1e190e6991d974ddb0f53fc88. Reverted https://github.com/pytorch/pytorch/pull/147053 on behalf of https://github.com/atalman due to Broke windows tests ([comment](https://github.com/pytorch/pytorch/pull/147053#issuecomment-2657239501))	2025-02-13 17:09:37 +00:00
Will Constable	683178fabc	[cuda] fix printing of num_gpus (#146838 ) Previously on machines with less than 8 gpus, the device==7 case would trigger the assert inside getDeviceProperties, and print `num_gpus=BEL` which is ascii for 7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146838 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-02-13 15:23:35 +00:00
Nikhil Gupta	020232ec9f	[Submodule]: Update KleidiAI submodule to v1.3.0 (#146480 ) Change-Id: I687255982c72ee7daca438a15b718f07298963cc Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480 Approved by: https://github.com/digantdesai, https://github.com/malfet	2025-02-13 15:23:04 +00:00
Ivan Skorokhodov	df776d64f7	chore: fix typos in error messages in FSDP (#146805 ) Fixes two small typos in FSDP error messages Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805 Approved by: https://github.com/awgu, https://github.com/Skylion007	2025-02-13 15:22:13 +00:00
Anatoly Myachev	345f556628	Fix `DispatchStub.cpp` compilation for gcc 14 (#146512 ) Otherwise I get the following error: ```bash .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: error: no matching function for call to ‘find(std::array<c10::DeviceType, 7>::const_iterator, std::array<c10::DeviceType, 7>::const_iterator, const c10::DeviceType&)’ 152 \| if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) { \| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /usr/include/c++/14/bits/locale_facets.h:48, from /usr/include/c++/14/bits/basic_ios.h:37, from /usr/include/c++/14/ios:46, from /usr/include/c++/14/ostream:40, from .../intel-xpu-backend-for-triton/pytorch/c10/core/DeviceType.h:13, from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.h:3, from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:2: /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)’ 435 \| find(istreambuf_iterator<_CharT> __first, \| ^~~~ /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: template argument deduction/substitution failed: .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: note: mismatched types ‘std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >’ and ‘const std::array<c10::DeviceType, 7>::value_type’ {aka ‘const c10::DeviceType’} 152 \| if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) { \| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Pull Request 
resolved: https://github.com/pytorch/pytorch/pull/146512 Approved by: https://github.com/Skylion007	2025-02-13 15:21:59 +00:00
IvanKobzarev	7c3b2a29ec	[subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146897 Approved by: https://github.com/bdhirsh	2025-02-13 15:21:19 +00:00
PyTorch UpdateBot	e2479d7809	Update slow tests (#146822 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146822 Approved by: https://github.com/pytorchbot	2025-02-13 15:20:58 +00:00
Yidi Wu	aeabbffe15	Disable test with dynamo for schema gen (#146865 ) Fixes https://github.com/pytorch/pytorch/issues/141202. 1. We skip the schema gen tests under dynamo. https://github.com/pytorch/pytorch/issues/141202 fails in a weird way: it claims that node is an integer, even though we run isinstance checks [here](https://github.com/pytorch/pytorch/blob/main/torch/_library/utils.py#L234-L241). This is probably dynamo interfering with the stacks, and checking fx.Node isn't really what dynamo is designed for. 2. We move some legitimate cond tests out of schema gen and back into the control flow tests. Also rename _test_export to a longer name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146865 Approved by: https://github.com/zou3519	2025-02-13 15:20:52 +00:00
angelayi	67c4c39b4f	[docs] Minor fixes to export and aoti docs (#144513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144513 Approved by: https://github.com/yushangdi, https://github.com/desertfire	2025-02-13 15:19:35 +00:00
briancoutinho	d1997b610f	update kineto submodule (#147015 ) Fix https://github.com/pytorch/kineto/issues/1032 See https://github.com/pytorch/kineto/pull/1035 for testplan Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015 Approved by: https://github.com/sraikund16, https://github.com/Skylion007	2025-02-13 15:13:18 +00:00
Aaron Gokaslan	8d94eb1e3b	[BE]: Make OrderedSet reversible (#146904 ) It's rather trivial to make OrderedSet reversible, so let's do it and unlock that additional functionality for downstream users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146904 Approved by: https://github.com/eellison	2025-02-13 15:11:48 +00:00
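To illustrate why reversibility is cheap for an ordered set, here is a minimal sketch assuming a dict-backed container. `TinyOrderedSet` is a hypothetical class written for this note and is not PyTorch's `OrderedSet`; it relies only on the fact that Python dicts preserve insertion order and support `reversed()` since Python 3.8:

```python
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class TinyOrderedSet(Generic[T]):
    """Hypothetical dict-backed ordered set, for illustration only."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        # dict.fromkeys keeps first-seen order and drops duplicates.
        self._dict = dict.fromkeys(items)

    def add(self, item: T) -> None:
        # Re-adding an existing item keeps its original position.
        self._dict[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._dict

    def __len__(self) -> int:
        return len(self._dict)

    def __iter__(self) -> Iterator[T]:
        return iter(self._dict)

    def __reversed__(self) -> Iterator[T]:
        # Delegating to the dict is all that is needed to support reversed().
        return reversed(self._dict)


if __name__ == "__main__":
    s = TinyOrderedSet([3, 1, 2])
    s.add(5)
    print(list(reversed(s)))
```

With `__reversed__` defined, `reversed(s)` works directly instead of requiring callers to materialize a list first.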
Andrey Talman	858bc0cea5	Use 2022 as default VC_YEAR for windows builds (#147053 ) New Windows AMI does not have Visual Studio 2019. Hence use 2022 as default. See: https://github.com/pytorch/test-infra/pull/6226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147053 Approved by: https://github.com/huydhn	2025-02-13 14:37:55 +00:00
FFFrog	f95bdf5e6c	Make GetCPUAllocatorMaybePinned to be Device-Agnostic (#146687 ) ---- - Keep cuda first to preserve BC - Remove cuda first if it is possible to have only one accelerator at a time in the future Pull Request resolved: https://github.com/pytorch/pytorch/pull/146687 Approved by: https://github.com/ngimel	2025-02-13 13:09:48 +00:00
Mu-Chu Lee	e21181642f	[AOTInductor] Align behavior between CPU and GPU (#145459 ) Summary: (1) Make sure CPU and GPU don't have different implementations and behavior when called from the same path and API. After this PR, the only difference between CPU and GPU should be the running hardware. (2) This PR fixes the issue of memory access when it==constants_map.end() (3) This PR resolves T179437596 Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu Differential Revision: D68540744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459 Approved by: https://github.com/desertfire, https://github.com/hl475	2025-02-13 09:50:18 +00:00
Xia, Weiwen	ca3aabc8e6	[Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs. Test plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4 python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #145245	2025-02-13 08:40:12 +00:00
Zhang, Jianyi	17d3a69c32	[Intel GPU] fix memory leak in deconv backward (#144385 ) Fixes #143807 We need to manage the oneDNN scratchpad in PyTorch; otherwise oneDNN always allocates scratchpad memory during primitive execution, which causes a memory leak. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144385 Approved by: https://github.com/liangan1, https://github.com/EikanWang	2025-02-13 07:41:34 +00:00
David Berard	43496e9b90	[NJT] fix flop counter for SDPA & test (#147032 ) Fixes 3 issues: 1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed. 2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added) 3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor. Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032 Approved by: https://github.com/jbschlosser	2025-02-13 07:14:58 +00:00
tim	b9a22b3f37	bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623 ) This PR addresses an issue in the MPS backend for `_scaled_dot_product_attention_math_mps`: a 3D input such as (num_heads, seq_len, query_dim) is not automatically treated as (1, num_heads, seq_len, query_dim), although CPU and CUDA infer this. The fix adds a utility function that ensures a 4D shape. The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835: in [transformers qwen2_vl](`1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)`), 3D q/k/v were passed into the sdpa function, which led to an error. For consistency, and since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same behavior across all platforms.
---
reproduce code:
```
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1
q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"), k.to("cpu"), v.to("cpu"), attention_mask.to("cpu"), dropout_p=0.0
)
if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"), k.to("mps"), v.to("mps"), attention_mask.to("mps"), dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```
error outputs:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
hardware and envs:
```
torch 2.6.0
apple m3 max
```
--- Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623 Approved by: https://github.com/malfet Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-13 07:00:51 +00:00
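The "ensure a 4D shape" idea from the commit above can be sketched on plain shape tuples. This is only an illustration under assumptions: the helper names `ensure_4d_shape` and `restore_shape` are hypothetical, the sketch operates on tuples rather than tensors, and it does not mirror the actual utility added to the MPS backend:

```python
from typing import Sequence, Tuple


def ensure_4d_shape(shape: Sequence[int]) -> Tuple[Tuple[int, ...], bool]:
    """Return a 4D shape and a flag saying whether a batch dim was prepended."""
    if len(shape) == 3:
        # (num_heads, seq_len, dim) -> (1, num_heads, seq_len, dim)
        return (1, *shape), True
    if len(shape) == 4:
        return tuple(shape), False
    raise ValueError(f"expected a 3D or 4D shape, got {len(shape)}D")


def restore_shape(shape: Sequence[int], was_3d: bool) -> Tuple[int, ...]:
    """Drop the prepended batch dim again so the output matches the input rank."""
    return tuple(shape[1:]) if was_3d else tuple(shape)


if __name__ == "__main__":
    shape4d, was_3d = ensure_4d_shape((16, 16, 80))
    print(shape4d, was_3d)
    print(restore_shape(shape4d, was_3d))
```

Tracking the flag lets the caller hand back an output with the same rank as the input, so a 3D call behaves the same way on every backend.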
Isalia20	17a808557c	[MPS] cholesky ex version (#146799 ) PR #145701 didn't have experimental version of cholesky. This PR adds that version Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799 Approved by: https://github.com/malfet	2025-02-13 07:00:21 +00:00
Ke Wen	4879f8f919	[TP] Add warning when module is distributed twice (#147006 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006 Approved by: https://github.com/XilunWu	2025-02-13 06:49:17 +00:00
Aaron Gokaslan	3e4172d985	[BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985 ) This submodule update brings miscellaneous fixes for ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985 Approved by: https://github.com/drisspg	2025-02-13 06:47:11 +00:00
Yu, Guangye	aa20b4b6cf	Friendly handle mem_get_info's runtime error message (#146899 ) # Motivation Provide a friendlier runtime error message when the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899 Approved by: https://github.com/EikanWang	2025-02-13 06:26:19 +00:00
Nikita Shulga	66fb10fc53	[BE][OpInfo] Introduce generic `dtypesIf` (#146905 ) Use `__setattr__` and `__getattribute__` to wrap existing `dtypesIfXYZ` into it, which will allow for subsequent incremental elimination of those Also, type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality it's converted to set in post init. Test Plan: - Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script ```python from torch.testing._internal.common_methods_invocations import op_db print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']}) ``` - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905 Approved by: https://github.com/janeyx99	2025-02-13 05:33:17 +00:00
PyTorch UpdateBot	43eb39d7c8	[executorch hash update] update the pinned executorch hash (#145128 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145128 Approved by: https://github.com/pytorchbot	2025-02-13 05:06:44 +00:00
Rachel Guo	88d0bb0fee	[aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020 ) Summary: per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info. can be useful in identifying issues such as "wrong AOTI Lowering precisions" Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm ``` Differential Revision: D69547344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-02-13 04:41:00 +00:00
PyTorch UpdateBot	2ff3fdfdae	[audio hash update] update the pinned audio hash (#146738 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146738 Approved by: https://github.com/pytorchbot	2025-02-13 04:29:46 +00:00
amathewc	936df4571b	Update test_c10d_object_collectives.py with DistributedTestBase class (#145056 ) # MOTIVATION To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](https://github.com/pytorch/pytorch/pull/138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs. # CHANGES Replaced MultiProcessTestCase with the DistributedTestBase class. Extended test functionality to include support for HPUs. Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances. Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145056 Approved by: https://github.com/kwen2501, https://github.com/guangyey	2025-02-13 03:57:59 +00:00
Menglu Yu	a9598337b7	[Optimus] Include more corner cases in the select cat aten pass (#146662 ) Summary: Thanks to Shuai for reporting the bug in the pattern. We found there's a typo in the pass, where we should make sure all the selects will go to the cat node. Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad Buck UI: https://www.internalfb.com/buck2/2cd0888e-d803-43a8-8530-d97e6bc281b3 Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449699305108 Network: Up: 110KiB Down: 35KiB (reSessionID-687be0fa-031a-47a0-8780-5ab4cf4bbd94) Executing actions. Remaining 0/4 6.6s exec time total Command: test. Finished 2 local Time elapsed: 2:12.0s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D69278487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146662 Approved by: https://github.com/Microve	2025-02-13 03:40:26 +00:00
zeshengzong	6ca497a8e5	Replace is_same with is_same_v for concise syntax (#145450 ) Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450 Approved by: https://github.com/huydhn	2025-02-13 03:29:39 +00:00
Tugsbayasgalan Manlaibaatar	c159723c39	Fix meta impl for topk (#147017 ) Topk in this context is always size-like so we should use torch._check_is_size. Fixes some issue in https://github.com/pytorch/pytorch/issues/146990 Differential Revision: [D69545983](https://our.internmc.facebook.com/intern/diff/D69545983) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147017 Approved by: https://github.com/ydwu4	2025-02-13 03:18:47 +00:00
drisspg	821422018a	[FlexAttention] Make zero_length sequence handling better (#147010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147010 Approved by: https://github.com/Chillee	2025-02-13 03:18:24 +00:00
Nikita Shulga	54e28b2a71	[BE] Turn nextafter into functor (#147018 ) This functor is a bit more involved as nextafter is missing for MacOS13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147018 Approved by: https://github.com/dcci ghstack dependencies: #146965, #146993, #147023	2025-02-13 02:10:29 +00:00
Joona Havukainen	aaa46c0625	Add missing autoreleasepool around runUniqueGraph to prevent leaks (#145512 ) References were held onto longer than needed. Added autoreleasepool around the runUniqueGraph to allow the memory to be freed. Fixes #145151 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145512 Approved by: https://github.com/malfet Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-13 01:58:18 +00:00
Nikita Shulga	e0ca041ae3	[BE] Toward Metal Iterator (step 2) (#147023 ) Add a dense flavor of the binary ops: if the iterator is contiguous, do not build indices but run a different flavor that uses the same functor. This results in an almost 100% perf gain for a binary operation on 1mln elements of `torch.fmax`, as shown in the table below, collected on an M4 Pro Mini using the following benchmarking script
```python
import torch
from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer


def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
| Dtype | Time before | Time after |
| ------ | ------------ | ---------- |
| float32 | 0.84 msec | 0.66 msec |
| float16 | 0.49 msec | 0.23 msec |
| bfloat16 | 0.48 msec | 0.22 msec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147023 Approved by: https://github.com/dcci ghstack dependencies: #146965, #146993	2025-02-13 01:50:43 +00:00
zeshengzong	80f146dedf	Update addbmm, addmm, addmv and baddbmm description (#146689 ) Fixes #146611, following #146482 ## Test Result ![image](https://github.com/user-attachments/assets/5c1749be-1f10-4e80-a284-b1929ca340eb) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146689 Approved by: https://github.com/mikaylagawarecki	2025-02-13 01:30:50 +00:00
rzou	5dab0aeef0	[SkipFiles] Some more cleanup (#147013 ) This isn't a no-op but I think it's fine. It changes the case where a function f1 in a module in MOD_SKIPFILES calls a function f2 in one of the deleted modules. Previously f2 would have been skipped, now f2 gets inlined. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013 Approved by: https://github.com/yanboliang ghstack dependencies: #147016, #147012	2025-02-13 01:18:47 +00:00
rzou	fddaa2958b	[SkipFiles] Some more cleanup (#147012 ) I think these are all no-ops. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012 Approved by: https://github.com/yanboliang ghstack dependencies: #147016	2025-02-13 01:18:47 +00:00
rzou	87ebd77b34	Add some more docs to trace_rules.py (#147016 ) After discussing with Yanbo we wanted to record the behavior down so we don't need to rederive them in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016 Approved by: https://github.com/yanboliang	2025-02-13 01:18:39 +00:00
Animesh Jain	b77a6eb184	[dynamo] Fix tensordict regression (#146995 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146995 Approved by: https://github.com/StrongerXi ghstack dependencies: #146819	2025-02-13 00:59:59 +00:00
Yidi Wu	2ce6de2415	[cond] make cond re-dispatch in proxy mode (#146954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954 Approved by: https://github.com/zou3519	2025-02-13 00:50:33 +00:00
angelayi	67cbbb29e0	[export] Dedup expression_created logs (#146859 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146859 Approved by: https://github.com/pianpwk ghstack dependencies: #146532, #146533, #146534, #146858	2025-02-13 00:21:34 +00:00
angelayi	59bc5d0d71	[tlparse] Add stacktrace filter utility (#146858 ) Added a utility function for capturing the user stack and framework stacktrace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146858 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #146532, #146533, #146534	2025-02-13 00:21:34 +00:00
angelayi	43f5566c92	[export] Add additional tlparse logging (#146534 ) Added some additional logging so we can also run tlparse on generic export errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/146534 Approved by: https://github.com/pianpwk ghstack dependencies: #146532, #146533	2025-02-13 00:21:34 +00:00
angelayi	b4bdbce1ac	[export] Use custom stream logger in draft-export (#146533 ) Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:
```python
if key == "missing_fake_kernel":
    return hash((key, data["op"]))  # Same ops get deduped
elif key == "mismatched_fake_kernel":
    return hash((key, data["op"], data["reason"]))  # Same op and reason for errors get deduped
elif key == "propagate_real_tensors":
    return hash((key, json.dumps(data["stack"])))  # Guards appearing on the same stacktrace get deduped
elif key == "create_unbacked_symbol":
    return hash((key, json.dumps(data["stack"])))  # Unbacked symbols appearing on the same stacktrace get deduped
```
Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code that creates a new unbacked symint and runs into a DDE gets called 800 times, causing 800 new symints to be created and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this. The downside is that if multiple DDEs exist on the same stacktrace, we will only show the first one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #146532	2025-02-13 00:21:34 +00:00
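The dedup schema quoted in the commit above is a fragment, so here is a self-contained sketch of how such a key could drive a log buffer. The branch logic follows the commit message; everything else is an assumption made for illustration: the names `dedup_key` and `DedupBuffer` are hypothetical, and returning `None` for unknown log kinds (meaning "never deduplicate") is a guess rather than the behavior of the actual draft-export logger:

```python
import json
from typing import Any, Dict, List, Optional, Set, Tuple


def dedup_key(key: str, data: Dict[str, Any]) -> Optional[int]:
    """Hash that decides whether two log records count as duplicates."""
    if key == "missing_fake_kernel":
        return hash((key, data["op"]))  # same op
    elif key == "mismatched_fake_kernel":
        return hash((key, data["op"], data["reason"]))  # same op and reason
    elif key in ("propagate_real_tensors", "create_unbacked_symbol"):
        return hash((key, json.dumps(data["stack"])))  # same stacktrace
    return None  # assumption: other kinds are always kept


class DedupBuffer:
    """Hypothetical buffer that keeps only the first record per dedup key."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self.records: List[Tuple[str, Dict[str, Any]]] = []

    def add(self, key: str, data: Dict[str, Any]) -> bool:
        """Store the record; return False if it was dropped as a duplicate."""
        h = dedup_key(key, data)
        if h is not None:
            if h in self._seen:
                return False
            self._seen.add(h)
        self.records.append((key, data))
        return True


if __name__ == "__main__":
    buf = DedupBuffer()
    stack = [{"file": "model.py", "line": 10}]
    buf.add("propagate_real_tensors", {"stack": stack, "expr": "u0 > 0"})
    # Same stacktrace, different expression: dropped, as the commit notes.
    buf.add("propagate_real_tensors", {"stack": stack, "expr": "u1 > 0"})
    print(len(buf.records))
```

The second `add` call shows the trade-off the commit message mentions: a different expression on the same stacktrace is hidden because only the stack participates in the key.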
angelayi	be387f57b1	[symbolic shapes] Log SymNode id for provenance (#146532 ) We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse: <img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146532 Approved by: https://github.com/bobrenjc93	2025-02-13 00:21:34 +00:00
Raymond Li	21c2565f35	Document dynamo (#146736 ) Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that. Note: documentation was AI-generated and could be incorrect, please review carefully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736 Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519	2025-02-13 00:02:21 +00:00
Ting Lu	0344bf8a5a	[cuDNN] cuDNN to 9.7.1.26 for CUDA 12.8 (#146957 ) rebasing for https://github.com/pytorch/pytorch/pull/146717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146957 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/atalman Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-12 23:43:34 +00:00
Jing Shan	d5a2e4c754	[oncall] Change error message to be more readable (#146934 ) Summary: During oncall, got a debug, where the error message is a bit ambiguous, due to multiple colons, and full line cutoff ``` AssertionError: Expected order: 1 for the component: remote_request_only to be >= 2, the max order for all its ``` Update the error message to something like ``` AssertionError: Component remote_request_only order must be >= max order of its upstream components, got component order=1 and max=2 ``` Test Plan: CI Differential Revision: D69482789 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146934 Approved by: https://github.com/ColinPeppler	2025-02-12 23:33:09 +00:00
Benjamin Glass	ad4e5bf705	cpp_wrapper: handle mixed-device C-shim fallbacks (#146449 ) Fixes an error from test_torch, where a CUDA cpp_wrapper run called a CUDA native C-shim kernel with two CPU tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146449 Approved by: https://github.com/desertfire	2025-02-12 23:21:04 +00:00
Oguz Ulgen	076215944a	Turn on autograd local caches in fbcode (#146996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146996 Approved by: https://github.com/jamesjwu	2025-02-12 23:04:39 +00:00
Howard Huang	c60f587c04	Fix shape_inference for V-schedules (#147000 ) I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan. `self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules: `bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)` Will clean up / delete the use of next_rank / prev rank in follow up PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000 Approved by: https://github.com/wconstab	2025-02-12 22:56:46 +00:00
Guilherme Leobas	f954aac6be	Add `make_dynamo_test` (#146491 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146491 Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/malfet	2025-02-12 22:54:29 +00:00
Justin Chu	fd21126007	[ONNX] Deprecation message follow up (#147005 ) Follow up on https://github.com/pytorch/pytorch/pull/146923 to address comments. This pull request includes updates to the `torch/onnx` module, focusing on deprecations and documentation improvements. The most important changes involve moving version change notes within the `export` function, updating deprecation messages, and removing example code in the `dynamo_export` function. Documentation and Deprecation Updates: * [`torch/onnx/__init__.py`](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184): Moved version change notes to the correct location within the `export` function's docstring. Updated the deprecation note for the `dynamo_export` function to version 2.7 and removed example code from its docstring. [[1]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184) [[2]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R349-R357) [[3]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L434-R430) [[4]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L445-L475) * [`torch/onnx/utils.py`](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114): Enhanced deprecation messages for several functions (`select_model_mode_for_export`, `disable_apex_o2_state_dict_hook`, `setup_onnx_logging`, `unconvertible_ops`) to provide clearer guidance on their removal and suggest copying logic if needed. 
[[1]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114) [[2]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL148-R151) [[3]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL166-R173) [[4]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1180-R1189) [[5]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1190-R1199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147005 Approved by: https://github.com/titaiwangms	2025-02-12 22:48:56 +00:00
Justin Chu	f655f840b8	[ONNX][dort] Remove reference to onnxscript rewriter (#147003 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147003 Approved by: https://github.com/titaiwangms, https://github.com/gramalingam, https://github.com/shubhambhokare1	2025-02-12 22:02:07 +00:00
Link Li	995f607c74	fix doc string (#146968 ) Fixes a wrong function name in doc string Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968 Approved by: https://github.com/zackycao, https://github.com/H-Huang	2025-02-12 21:43:16 +00:00
Nikita Shulga	06a07f6018	[BE] Towards MetalTensorIterator (#146993 ) Further refactor binary kernels to replace the individual implementations with a binary_indexing_kernel template that takes functors implementing the logic. According to godbolt, such refactoring should have no impact on performance, as the compiler, through dead code elimination, should just replace the functor with a direct call to the underlying function, as one can see for the clang CPU compiler here: https://godbolt.org/z/8dxv5jvz7. To be on the safe side, run the following benchmark
```python
import torch
from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer


def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()


if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
That reports roughly identical before and after times (1 msec for float32 and .5 msec for float16). Another interesting quirk is that functors cannot be in an anonymous namespace, otherwise they will not be visible from the library, as one can see by running the following Swift sample (filed FB16490467 to clarify if this is supported)
```swift
let shader_source = """
struct add_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a + b);
  }
};

namespace {
struct sub_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a - b);
  }
};
} // anonymous namespace

template <typename T, typename F>
kernel void binary_executor(
    constant T* input [[buffer(0)]],
    constant T* other [[buffer(1)]],
    device T* out [[buffer(2)]],
    uint tid [[thread_position_in_grid]]) {
  F f;
  out[tid] = f(input[tid], other[tid]);
}

template
[[host_name("add_float")]] kernel void binary_executor<float, add_functor>(constant float*, constant float*, device float*, uint);

template
[[host_name("sub_float")]] kernel void binary_executor<float, sub_functor>(constant float*, constant float*, device float*, uint);
"""

import Metal
guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:MTLCompileOptions())

// Expect two kernels to be printed, but see only one, with functor in global namespace
for kernel_name in library.functionNames {
  print(kernel_name)
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146993 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #146965	2025-02-12 21:40:40 +00:00
Brian Hirsh	de964b9f8b	dont specialize symints when testing truthiness (#146731 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146731 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #146642, #146729	2025-02-12 20:57:10 +00:00
Brian Hirsh	5cda021cac	support meta_tensor.to(device='cpu') under fake_mode (#146729 ) Fixing this is actually a bit annoying: (1) FakeTensorMode sees a function where all of its inputs are real tensors, so it tries to run the real compute before converting the output to a FakeTensor (2) we don't actually want this, because the "real compute" is supposed to error normally when you do `meta_tensor.to(device='cpu')`. Instead, we want FakeTensor to actually skip constant prop and run the normal FakeTensor implementation, which will not error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146729 Approved by: https://github.com/zou3519, https://github.com/SherlockNoMad, https://github.com/albanD ghstack dependencies: #146642	2025-02-12 20:57:10 +00:00
Brian Hirsh	ec0b318ddb	[poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642 ) context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/ This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box. The annoying bit is that there are no dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode. Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :) ``` from torch._subclasses.fake_tensor import FakeTensorMode import pickle import io import torch from contextlib import nullcontext use_fake_tensor = True with FakeTensorMode() if use_fake_tensor else nullcontext(): obj = [1, 2] f = io.BytesIO() pickle.Pickler(f).dump(obj) byte_storage = torch.ByteStorage._from_buffer(f.getvalue()) # type: ignore[attr-defined] t = torch.ByteTensor(byte_storage) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642 Approved by: https://github.com/zou3519	2025-02-12 20:57:10 +00:00
PyTorch MergeBot	8a975cb247	Revert "[cutlass backend] Do not change dtype of GEMM template (#146877 )" This reverts commit 5f2714d5e7cded0eb553d5915002e03c22e01e34. Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to mistake on logging ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2654648949))	2025-02-12 19:26:45 +00:00
Chien-Chin Huang	0de27ee7e0	Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852 Approved by: https://github.com/d4l3k	2025-02-12 18:43:52 +00:00
Nikita Shulga	352484cc83	[BE] Unify kernel templates instantiation (#146965 ) By defining a `REGISTER_BINARY_OP` template that can be used to register fmin, fmax, etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146965 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-12 18:40:45 +00:00
Justin Chu	7f62616a58	[ONNX][reland2] Create deprecation warning on dynamo_export (#146923 ) Reland two PRs - https://github.com/pytorch/pytorch/pull/146425 - https://github.com/pytorch/pytorch/pull/146639 Fixed by removing the deprecation warning on a base class `ExportOptions`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146923 Approved by: https://github.com/titaiwangms	2025-02-12 18:28:37 +00:00
Henry Tsang	5f2714d5e7	[cutlass backend] Do not change dtype of GEMM template (#146877 ) I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates, and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69085556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877 Approved by: https://github.com/ColinPeppler	2025-02-12 18:16:49 +00:00
Nichols A. Romero	bfcce6984b	[ROCm][TunableOp] Close offline tuning results file when offline tuning is disabled. (#146574 ) This PR is to fix UT breakage that has been reported internally and is considered high priority. When `tunable.record_untuned_enable(False)` is invoked, we flush the results of the untuned gemm file. Offline tuning I/O currently doesn't have a set untuned results filename member function or untuned results write to file member function. When performing back-to-back unit tests, the same ofstream ends up getting reused between UTs. Due to the way the UT are executed, this can lead to unexpected failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146574 Approved by: https://github.com/jeffdaily	2025-02-12 18:03:06 +00:00
Huy Do	04011304e5	Update dynamo expected 20250210 (#146856 ) Update all the ci accuracy expect values to make trunk green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856 Approved by: https://github.com/yanboliang	2025-02-12 18:01:20 +00:00
Animesh Jain	d6513f3246	[dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819 ) This PR adds support for list subclasses. Among other things are 1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it. 2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`). 3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods. If this PR is hard to review, please let me know. I can break it into several small PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819 Approved by: https://github.com/StrongerXi, https://github.com/jansel	2025-02-12 17:46:02 +00:00
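Point 3 of the commit above (reconstruction goes through `list.__getitem__` rather than an overridden `__getitem__`) can be illustrated in plain Python. This is a minimal sketch, not Dynamo code: `LoudList` is a hypothetical subclass invented here to make the difference observable.

```python
class LoudList(list):
    """Hypothetical list subclass whose __getitem__ has visible side effects."""

    def __init__(self, *args):
        super().__init__(*args)
        self.reads = 0  # counts how often the overridden accessor ran

    def __getitem__(self, index):
        self.reads += 1
        # The override also changes the returned value.
        return super().__getitem__(index) * 10


items = LoudList([1, 2, 3])

# Normal indexing dispatches to LoudList.__getitem__ (side effect + scaling).
via_override = items[0]

# Calling the base-class method directly reads the stored element
# without running any user-defined code.
via_base = list.__getitem__(items, 0)

print(via_override, via_base, items.reads)
```

Reading the underlying storage through the base-class method is what keeps user overrides from running a second time when the object's contents are inspected.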
titaiwangms	6c81435f16	[ONNX] Update CI transformers cache (#146926 ) The cached models are outdated because the related tests are all deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146926 Approved by: https://github.com/justinchuby	2025-02-12 17:02:43 +00:00
titaiwangms	b894c2824b	[ONNX] Support custom axis name through dynamic_shapes (#146321 ) Fixes #143443 This PR aims to support custom dynamic axis naming through dynamic_shapes. Currently, _Dim and _DimHint do not support dynamic axis naming (#144273). 1. the original dynamic shapes guarantee The axis renaming is only applied when dynamic shapes include string instead of all _Dim and _DimHint. Thus, there will not be any inconsistent behavior to dynamic_shapes with torch.export.export if the given dynamic shapes follow torch.export.export format. 2. _DimHint.AUTO is applied to the axes that are specified with custom names to avoid exporter crash. (_DimHint.DYNAMIC crashes when the export fails.) 3. There's no need to handle cases where kwargs are out of order with the model signature, as torch.export.export supports dynamism only when kwargs and dynamic_shapes are provided in order. `49082f9dba/torch/export/_trace.py (L2034)` 4. If `torch.onnx.ExportedProgram` finds the axes share the same constraints, they will have the same name (e.g. s0, s1, ...). Therefore, even if the ONNX users specify them with different custom names, they won't be respected. Example model: ```python class NestedModel(torch.nn.Module): def forward( self, x: torch.Tensor, ys: list[torch.Tensor], zs: dict[str, torch.Tensor], c: torch.Tensor, ): y = ys[0] + ys[1] + zs["a"] + zs["b"] w = 5 if x.shape[0] < 3 and c.shape[0] != 4: return x + w, x + y, c else: return x - w, x - y, c input = ( torch.ones(5), [torch.zeros(5), torch.ones(5)], {"a": torch.zeros(5), "b": torch.ones(5)}, torch.ones(6), ) dynamic_shapes = ( {0: torch.export.Dim("dim_x", min=3)}, # _Dim [("custom_name_axis_ys_0",), (torch.export.Dim.AUTO,)], # custom name { "a": {0: torch.export.Dim.AUTO}, "b": ("custom_name_axis_zs_b_0",), }, # _DimHint {0: "custom_name_axis_c_0"}, # custom name ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146321 Approved by: https://github.com/justinchuby	2025-02-12 17:00:03 +00:00
Xuehai Pan	9abaaad6a8	[pytree][Easy] preserve `dict` keys in insertion order in CXX pytree (#130140 ) `optree` and JAX pytree traverse the `dict` in sorted key order (see [Key Ordering for Dictionaries](https://github.com/metaopt/optree#key-ordering-for-dictionaries)), while the PyTorch Python pytree traverses the `dict` in insertion order. See also: - #114392 This aligns the behavior of CXX pytree with Python pytree. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130140 Approved by: https://github.com/zou3519	2025-02-12 16:41:49 +00:00
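The two traversal orders contrasted in the commit above can be sketched in a few lines of plain Python. `flatten` below is a hypothetical helper written for illustration only; it is not the `torch.utils._pytree` or `optree` API.

```python
def flatten(tree, sort_keys=False):
    """Collect the leaves of a nested dict/list/tuple structure.

    sort_keys=False visits dict entries in insertion order;
    sort_keys=True visits them in sorted key order.
    """
    if isinstance(tree, dict):
        keys = sorted(tree) if sort_keys else list(tree)
        leaves = []
        for key in keys:
            leaves.extend(flatten(tree[key], sort_keys))
        return leaves
    if isinstance(tree, (list, tuple)):
        leaves = []
        for child in tree:
            leaves.extend(flatten(child, sort_keys))
        return leaves
    return [tree]  # anything else is treated as a leaf


tree = {"b": 1, "a": [2, 3]}

insertion_leaves = flatten(tree)                # "b" first, then "a"
sorted_leaves = flatten(tree, sort_keys=True)   # "a" first, then "b"

print(insertion_leaves, sorted_leaves)
```

The same input yields different leaf orders under the two conventions, which is why two pytree implementations that disagree on ordering cannot exchange flattened leaves safely.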
Aaron Orenstein	1f8ff94d4f	PEP585: Add noqa to necessary tests (#146391 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146391 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-12 15:29:50 +00:00
Aaron Gokaslan	b61032fcf7	[BE][Ez]: Remove unnecessary type ignores from orderedset (#146902 ) After #145783, we can remove some type ignores from the ordered set class Pull Request resolved: https://github.com/pytorch/pytorch/pull/146902 Approved by: https://github.com/eellison	2025-02-12 15:00:13 +00:00
PyTorch MergeBot	ce80865f13	Revert "Replace is_same with is_same_v for concise syntax (#145450 )" This reverts commit 5205158c1b0bc5c390b2a9d83fe3b2ec5edbe3f2. Reverted https://github.com/pytorch/pytorch/pull/145450 on behalf of https://github.com/jeanschmidt due to testing to see if reverting would fix timeout in inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/145450#issuecomment-2653645466))	2025-02-12 13:01:32 +00:00
Yuanhao Ji	b0042286d4	[Dynamo] Allow dynamo to handle `str.xxx()` (#146587 ) Fixes #146350 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146587 Approved by: https://github.com/zou3519	2025-02-12 08:54:10 +00:00
Xia, Weiwen	98e16012ec	[Quant][CPU] add a wrapper op for _weight_int4pack_mm_for_cpu with tensor args (#145245 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune where scalar arguments are difficult to handle. The new op is not registered to - `aten` because it will require changing `native_functions.yaml`, which is not recommended. - `quantized_decomposed` because it will only have a Python implementation, which cannot be used for cpp wrapper in Inductor. Test plan ``` python test/test_linalg.py -k test__int4_mm ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2025-02-12 08:46:38 +00:00
Tianyu Liu	ac0f206f3c	[dtensor] fix side-effect on dtype for _like ops (#146869 ) fixes https://github.com/pytorch/pytorch/issues/146749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869 Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel	2025-02-12 08:42:14 +00:00
Zhou Fang	d774a6333d	[StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493) %getattr_368.1 : int = prim::dtype(%18267) %to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted) %lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8) ``` After optimization: ``` %18297 : int = prim::dtype(%int_77.1) %18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297) ``` Reviewed By: garroud Differential Revision: D69373835 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931 Approved by: https://github.com/hanyilou123	2025-02-12 08:19:41 +00:00
Ahmad Sharif	ae5cc19ba7	[pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866 ) This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass. Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice. On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR). Important Note: Currently the softmax backward pass consists of an [element-wise multiply operator](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)`), followed by [this function](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)`) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%. 
``` Baseline This PR N size FP32 bandwidth FP16 bandwidth N size FP32 bandwidth FP16 bandwidth fp32 diff fp16 diff 0 256 134.340966 70.042039 0 256 133.70146 70.342753 -0.48% 0.43% 1 512 233.501185 129.945803 1 512 234.057145 132.933066 0.24% 2.30% 2 1024 340.667966 229.280464 2 1024 338.833265 226.441699 -0.54% -1.24% 3 2048 379.643726 337.452058 3 2048 399.559017 338.432284 5.25% 0.29% 4 4096 416.597537 383.625364 4 4096 428.252403 396.137506 2.80% 3.26% 5 6000 431.198241 384.384384 5 6000 457.744577 406.06275 6.16% 5.64% 6 8192 462.811252 427.292573 6 8192 474.791032 428.281563 2.59% 0.23% 7 10000 464.258731 429.050294 7 10000 483.7643 446.849381 4.20% 4.15% 8 10013 465.199701 429.824179 8 10013 464.904407 428.72184 -0.06% -0.26% 9 10240 477.07359 428.853737 9 10240 485.317024 444.902586 1.73% 3.74% 10 11000 473.038785 430.778663 10 11000 488.161438 453.462162 3.20% 5.27% 11 12000 474.342475 432.594814 11 12000 490.532418 458.427653 3.41% 5.97% 12 16384 487.468854 473.611576 12 16384 488.154406 476.264631 0.14% 0.56% 13 20000 482.029793 465.666186 13 20000 482.147092 483.886193 0.02% 3.91% 14 24000 478.368093 474.159464 14 24000 478.364948 491.447921 0.00% 3.65% 15 32000 476.523796 473.18868 15 32000 476.523796 474.398962 0.00% 0.26% 16 32768 476.104723 477.493634 16 32768 476.704463 477.330606 0.13% -0.03% 17 36864 477.900663 475.472787 17 36864 477.973279 475.728454 0.02% 0.05% 18 40960 477.707561 475.559064 18 40960 478.445017 476.088067 0.15% 0.11% 19 45056 479.169812 475.865134 19 45056 479.143266 475.878202 -0.01% 0.00% 20 49152 477.804907 475.382982 20 49152 477.868404 475.976377 0.01% 0.12% 21 65536 481.274125 478.171806 21 65536 481.537733 478.703926 0.05% 0.11% 22 66000 481.64652 480.095457 22 66000 481.856013 480.466388 0.04% 0.08% 23 68608 481.745774 479.034704 23 68608 481.917596 478.856209 0.04% -0.04% 24 80000 483.409361 480.356529 24 80000 483.330481 480.375277 -0.02% 0.00% 25 98304 480.736301 481.396882 25 98304 480.789858 481.320143 0.01% 
-0.02% ``` NCU profiler shows lower DRAM fetches with the new kernel: ![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72) NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%). I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files: ![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc) libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%. I measured the compilation time for incremental development: ``` touch ./aten/src/ATen/native/cuda/SoftMax.cu time python setup.py develop real 0m10.083s user 0m8.197s sys 0m3.149s ``` Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels: ``` # baseline nvcc time for SoftMax.cu real 0m35.341s user 0m33.801s sys 0m1.289s # this PR's nvcc time for SoftMax.cu real 0m36.513s user 0m34.722s sys 0m1.408s ``` So the `nvcc` time increases by about 1 second, or ~3% of the baseline. 
[1] `nvcc` command is here: ``` # This is the nvcc command /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866 Approved by: https://github.com/ngimel	2025-02-12 07:54:41 +00:00
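For readers unfamiliar with why the softmax backward pass touches `gradOutput` twice, the textbook gradient can be written as a reduction followed by an elementwise step. The following is a pure-Python sketch of that math only, under the assumption of a single 1-D row; it is not the `cunn_SoftMaxBackward` kernel, and the function names are made up for illustration.

```python
import math


def softmax(x):
    """Numerically stable softmax of a list of floats."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(grad_output, output):
    """Textbook softmax gradient: dx_j = y_j * (g_j - sum_i g_i * y_i)."""
    # First read of grad_output: reduce sum_i g_i * y_i over the row.
    dot = sum(g * y for g, y in zip(grad_output, output))
    # Second read of grad_output: elementwise combination with the sum.
    return [y * (g - dot) for g, y in zip(grad_output, output)]


def numeric_grad(x, grad_output, eps=1e-6):
    """Central finite differences of L(x) = sum_i g_i * softmax(x)_i."""
    grads = []
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        loss_hi = sum(g * y for g, y in zip(grad_output, softmax(hi)))
        loss_lo = sum(g * y for g, y in zip(grad_output, softmax(lo)))
        grads.append((loss_hi - loss_lo) / (2 * eps))
    return grads


x = [0.5, -1.0, 2.0, 0.0]
g = [1.0, 0.25, -0.5, 2.0]
analytic = softmax_backward(g, softmax(x))
numeric = numeric_grad(x, g)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_error)
```

Because the elementwise step needs the fully reduced sum, the gradient values have to be consumed in two passes; keeping the row in fast on-chip memory between the passes is the kind of saving the commit describes.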
Wang, Chuanqi	8c80c13b34	[CD] Add python 3.13t build for xpu (#146614 ) Fixes #146451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146614 Approved by: https://github.com/atalman	2025-02-12 07:01:36 +00:00
Huy Do	b30bad710d	Update octokit/request-action to 2.4.0 (#146940 ) The current version 2.1.0 has disappeared since yesterday: * https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml * https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml The latest version is 2.4.0 https://github.com/octokit/request-action Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940 Approved by: https://github.com/izaitsevfb	2025-02-12 05:36:27 +00:00
PyTorch MergeBot	6105b6f15f	Revert "Update octokit/request-action to 2.4.0 (#146940 )" This reverts commit 7aa629f1268f6944eee6e49e43071b4342bf1669. Reverted https://github.com/pytorch/pytorch/pull/146940 on behalf of https://github.com/huydhn due to This does not work ([comment](https://github.com/pytorch/pytorch/pull/146940#issuecomment-2652691614))	2025-02-12 05:21:43 +00:00
Aleksandar Samardžić	5a1c7c424d	Fix standalone runner for CUTLASS auto-tuning backend (#146764 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146764 Approved by: https://github.com/henrylhtsang ghstack dependencies: #146755	2025-02-12 04:42:08 +00:00
Aleksandar Samardžić	eb655a2d5f	Fix CUTLASS 2.x kernels for auto-tuning (#146755 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146755 Approved by: https://github.com/henrylhtsang	2025-02-12 04:42:07 +00:00
Zhengxu Chen	683bb1242c	[export][ez] Update tag_ for union setters. (#146912 ) Summary: ez fix to set tag for union type fields. Test Plan: CI Differential Revision: D69467715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912 Approved by: https://github.com/yiming0416	2025-02-12 03:52:36 +00:00
Yichen Yan	06f8f9a017	Update instructions about faster linker (#146750 ) This PR adds instructions to specify linker via cmake env `CMAKE_LINKER_TYPE` and also adds `mold` as a linker alternative. Since 3.29, cmake introduced [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html) that can specify linker without overwriting `ld` file or changing build script. `mold` is already stable and the fastest (afaict) linker out there, and also easier to install compared with `lld`. So I added it here. After switching to `mold`, the time of linking `libtorch_cuda.so` has been reduced from ~7s to ~0.6s locally. Also note `gold` has been marked deprecated recently[1]. [1] https://lwn.net/Articles/1007541/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/146750 Approved by: https://github.com/albanD	2025-02-12 03:14:08 +00:00
James Wu	28a2ab6b84	Clear CompiledTritonKernel cache after each inductor compile (#146925 ) Fix a bug introduced by D69123174: because triton kernels now are returned directly by the worker, each future created by the triton kernel should only be used once per compile. Otherwise, a long-running process that does something like: ``` compiled_1 = torch.compile(mode="max-autotune", fullgraph=True)(fn1) # run compiled_1 out_compiled = compiled_1 compiled_2 = torch.compile(mode="max-autotune", fullgraph=True)(fn2) ``` where fn1 and fn2 are very similar (i.e. would generate the same triton kernel source code) would result in us using the launcher for the first autotuning run, and setting the launcher to None after running, and then using the same future/kernel again without regenerating the launcher. Found this bug testing internal inference models. This does not remove the caching support for @eellison's caching for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408 Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925 Approved by: https://github.com/laithsakka, https://github.com/jansel ghstack dependencies: #146417	2025-02-12 02:38:42 +00:00
Nikita Shulga	0acbf8039a	[BE] Unskip some tensor creation tests on Mac (#146952 ) Followup after https://github.com/pytorch/pytorch/pull/145367 One should never use skip, but rather xfail, otherwise one never knows when a test is finally fixed. `test_float_to_int_conversion_finite` was fixed on macOS a while back (presumably since the time Intel builds were disabled), while `test_float_to_int_conversion_nonfinite` is fixed by https://github.com/pytorch/pytorch/pull/145367 that selects architecture-appropriate reference values for Arm ISA. Note that the result of a floating-point to integral type cast is undefined if the floating-point value is outside the range of the integral type. "Fixes" https://github.com/pytorch/pytorch/issues/38752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146952 Approved by: https://github.com/atalman, https://github.com/seemethere	2025-02-12 01:59:15 +00:00
Camyll Harajli	78ebd3c502	Revert commit that removed windows testing in VS2019-> update (#146920 ) This reverts commit b57b38b52ede2af27d4eb1bf6ba63868a3ee7553. This commit removed windows testing for the VS build and needs to be added back in with the updated VS2022 build Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146920 Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet	2025-02-12 01:12:05 +00:00
Nikita Shulga	df5e232563	[BE] Delete NCCL slimming (#146943 ) It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant as nccl is now a dynamic dependency of libtorch_cuda.so. Besides, it does not work with CXX11 ABI anyway, and creates problems with newer versions of NCCL, when two `collectives.o` are packaged into the library archive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-02-12 00:35:55 +00:00
Eddie Yan	a58f421f4b	[CUDA][CUDNN][SDPA] Pass dropout seed and offset to cuDNN in `int64` (#146734 ) Workaround for limitation in cuDNN that does not accept dropout seed/offset in `int32` for SM 10.0 kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146734 Approved by: https://github.com/Skylion007	2025-02-12 00:24:38 +00:00
Dan Zimmerman	281249ba54	[torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324 ) Summary: amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from only python that's fine, but when you try to interact with `libamd_smi.so` from native code too this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK. This means you'll end up with 2 copies of `libamd_smi.so` in your process, and potentially (Murphy's law says you will, as does our CI) violate ODR. In order to avoid this issue from the PT side of the world we can hook the `dlopen("path/to/bundled/libamd_smi.so")` and try to use the already loaded/SDK version of `libamd_smi.so` first, before proceeding to use the `path/to/bundled/libamd_smi.so`. Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded Differential Revision: D69064038 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324 Approved by: https://github.com/malfet	2025-02-12 00:01:02 +00:00
Huy Do	7aa629f126	Update octokit/request-action to 2.4.0 (#146940 ) The current version 2.1.0 has disappeared since yesterday: * https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml * https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml The latest version is 2.4.0 https://github.com/octokit/request-action Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940 Approved by: https://github.com/izaitsevfb	2025-02-11 23:50:24 +00:00
ankurneog	f50d359ce2	[ c10d ] modify API to get device string from device with torch.device (#146290 ) Modify the ```get_default_backend_for_device()``` API to extract the device string using ```torch.device()``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290 Approved by: https://github.com/guangyey, https://github.com/H-Huang	2025-02-11 23:30:57 +00:00
Thomas Bohnstingl	3a29992ee6	[associative_scan] Lifted arguments (#140043 ) This PR implements lifted arguments for associative_scan Pull Request resolved: https://github.com/pytorch/pytorch/pull/140043 Approved by: https://github.com/ydwu4	2025-02-11 23:25:55 +00:00
Robert Hardwick	f59a56e56f	[ARM] Fix `test_float_to_int_conversion_nonfinite` (#145367 ) We have broken tests on Aarch64 which are not enabled upstream, this PR will fix and enable those tests. ``` AssertionError: Tensor-likes are not equal! Mismatched elements: 2 / 3 (66.7%) Greatest absolute difference: 1 at index (1,) Greatest relative difference: 1.0842021724855044e-19 at index (1,) To execute this test, run the following from the base repo dir: python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_float_to_int_conversion_nonfinite_cpu_int64 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145367 Approved by: https://github.com/malfet	2025-02-11 22:22:10 +00:00
wz337	a20055288f	[DTensor][Test] Create a simple unit test for tensordot (#146514 ) The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514 Approved by: https://github.com/tianyu-l, https://github.com/XilunWu	2025-02-11 21:57:56 +00:00
PyTorch MergeBot	443437648a	Revert "Introduce new template heuristic for triton autotune configs (#144985 )" This reverts commit 69301fb10eb3f7fd49af5c681a2e386af115baba. Reverted https://github.com/pytorch/pytorch/pull/144985 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it needs a small tweak to avoid breaking some internal code ([comment](https://github.com/pytorch/pytorch/pull/144985#issuecomment-2652021045))	2025-02-11 20:42:41 +00:00
Xu Han	b1ff90ae8a	remove Windows XPU build workaround. (#144644 ) From the RFC: https://github.com/pytorch/pytorch/issues/141946 Fixes https://github.com/pytorch/pytorch/issues/134989 After we land these fixing PRs: 1. https://github.com/pytorch/pytorch/pull/142245 2. https://github.com/pytorch/pytorch/pull/141943 We can remove the Windows XPU workaround. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644 Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman	2025-02-11 20:39:51 +00:00
Zhengxu Chen	664550ecbf	[export] Serialize special values of float into strings for json. (#146490 ) Summary: Currently inf is serialized as Infinity in JSON which is not standard compliant. Instead we will tweak all special floating points into strings and handle them at json layer. Test Plan: see D69060784 CI Differential Revision: D69186425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490 Approved by: https://github.com/yiming0416	2025-02-11 20:01:27 +00:00
Shunting Zhang	110638f702	[inductor] skip _test_insignificant_strides on rocm (#146849 ) Check https://github.com/pytorch/pytorch/issues/146848 , the rocm kernel for _scaled_dot_product_attention does not match the meta kernel regarding output shape. cuda kernel is fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146849 Approved by: https://github.com/eellison, https://github.com/atalman, https://github.com/jansel ghstack dependencies: #145904	2025-02-11 19:55:43 +00:00
Ding, Yi	b18e3c01aa	[Inductor] Unifiy Low Precision FP Legalization for to_dtype_bitcast & constant (#144646 ) The upcast in `to_dtype_bitcast()` breaks subsequent operations that only work with the target type (I use `bitwise_and` in the updated UT). This PR fixes this problem. Let's check the CI results to make sure it doesn't bring accuracy problems. - Unified the type promotion of low-precision FP operations in the legalize func, grouping ops into sources (whose results may be promoted) and sinks (whose input may be cast back). (The terms _sink_ and _source_ are from [graph theory](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree).) ## Test ```bash pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float32_to_int32_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144646 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-02-11 19:45:04 +00:00
drisspg	af349047c3	[FlexAttention] Bug fix broken flag (#146872 ) # Summary I somehow broke this... I think claude was trippin Pull Request resolved: https://github.com/pytorch/pytorch/pull/146872 Approved by: https://github.com/BoyuanFeng	2025-02-11 19:42:37 +00:00
Tugsbayasgalan Manlaibaatar	ebd992724f	Implement serializable getattr support for tensor subclasses (#145772 ) builtins.getattr is not serializable, so we replace it with a custom op that has more refined schema. Differential Revision: [D68899421](https://our.internmc.facebook.com/intern/diff/D68899421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145772 Approved by: https://github.com/bdhirsh	2025-02-11 19:05:14 +00:00
Andrey Talman	d5d3bdb55a	Fix var CUDA_PATH_V128 in cuda128.bat file (#146906 ) Followup after: https://github.com/pytorch/pytorch/pull/146653 This should fix upcoming CUDA 12.8 windows builds. Issue found during pytorch-canary Windows AMI test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146906 Approved by: https://github.com/malfet, https://github.com/tinglvv	2025-02-11 18:43:55 +00:00
Daniel Galvez	c7515da7b0	Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 ) This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361 I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534 Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979 Approved by: https://github.com/eqy, https://github.com/eellison	2025-02-11 18:16:15 +00:00
Nikita Shulga	e3839bd603	[BE] Strip `#pragma once` when embedding the headers (#146871 ) This eliminates compiler warning, for example when compiling Metal shader with embedded headers ``` with program_source:6:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:81:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:588:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:719:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:829:29: error: use of undeclared identifier 'r0_2' auto tmp8 = in_ptr2[r0_2 + 768*x0]; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146871 Approved by: https://github.com/dcci	2025-02-11 16:49:00 +00:00
Mikayla Gawarecki	861bf892fb	Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748 Approved by: https://github.com/atalman	2025-02-11 15:49:01 +00:00
rzou	5235a18cd6	[SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876 Approved by: https://github.com/anijain2305 ghstack dependencies: #146854	2025-02-11 15:00:56 +00:00
Zhou Fang	fc5913b6bf	[StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728 ) (#146855 ) Summary: When a Static Runtime graph node has sub-blocks, the memory planner does not consider the sub-blocks' inputs as the node's inputs. As a result, the lifetime of such nodes' inputs is incorrect, and the corresponding tensor memory is released earlier than required, causing errors. Differential Revision: D69195886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855 Approved by: https://github.com/swolchok	2025-02-11 13:59:54 +00:00
cyy	15635b14ce	[4/N] Remove unnecessary once flag usage (#146783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783 Approved by: https://github.com/albanD	2025-02-11 13:55:06 +00:00
Jack Taylor	69301fb10e	Introduce new template heuristic for triton autotune configs (#144985 ) Initial PR to refactor bulkiness of mm_common to allow for better device-specific specialisation e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm specific optimisations in. This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs: - CPUConfigHeuristic() - CUDAConfigHeuristic() - ROCmConfigHeuristic() - XPUConfigHeuristic() These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs. The mm_common, mm_plus_mm and conv configurations are implemented in this class, in the future we plan to bring in flex attention configurations also so all of the tuning config logic for templated triton kernels are handled in this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985 Approved by: https://github.com/jansel	2025-02-11 10:48:09 +00:00
Yanbo Liang	229fb0bc83	[Dynamo][autograd.Function] Relax backward speculation strict mode: support .requires_grad (#146742 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146742 Approved by: https://github.com/zou3519 ghstack dependencies: #146571, #146741	2025-02-11 05:39:07 +00:00
Yanbo Liang	f2da810516	[Dynamo][autograd.Function] Relax backward speculation strict mode: support .data (#146741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146741 Approved by: https://github.com/zou3519 ghstack dependencies: #146571	2025-02-11 05:39:07 +00:00
Yanbo Liang	29523aa113	[Dynamo][autograd.Function] Relax backward speculation strict mode a bit (#146571 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146571 Approved by: https://github.com/zou3519	2025-02-11 05:39:00 +00:00
rzou	a7fe384d0e	Remove torch._higher_order_ops from MOD_SKIPLIST (#146853 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853 Approved by: https://github.com/williamwen42	2025-02-11 04:38:26 +00:00
Hyunho Yeo	001ebbf734	[MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751 ) Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats". Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats ``` https://www.internalfb.com/intern/testinfra/testrun/9007199321947161 Reviewed By: yuhc Differential Revision: D68989900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751 Approved by: https://github.com/nautsimon	2025-02-11 03:51:48 +00:00
James Wu	23524699d5	Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417 ) ### Big idea This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism. ### Implementation Overview In total, the diff does the following: - Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes - Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers - Extend @eellison's future cache to a class, mostly as a refactor - Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start. In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen. Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417 Approved by: https://github.com/jansel	2025-02-11 03:46:16 +00:00
PyTorch MergeBot	fe94ece375	Revert "Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791 )" This reverts commit 3d604b17d91b928c850ded83b2ec25ea066bb3f6. Reverted https://github.com/pytorch/pytorch/pull/141791 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141791#issuecomment-2649717140))	2025-02-11 03:17:59 +00:00
Ke Wen	30cbf13544	[PGNCCL] Associate tensor allocation support with NCCL version (#146842 ) This is a forward fix to #146589. For NCCL version lower than 2.19, previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`. This PR gates the support by checking link-time NCCL version via `ncclGetVersion`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842 Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #146589	2025-02-11 02:52:52 +00:00
rzou	1d81ecfc54	Rename PrimHOPBase to BaseHOP + minor changes (#146727 ) This PR: - renames PrimHOPBase to BaseHOP - changes the backward pass to always return a tuple (to match the forward pass). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146727 Approved by: https://github.com/ydwu4	2025-02-11 02:43:37 +00:00
rzou	275c034b16	[SkipFiles] remove some stuff from MOD_SKIPLIST (#146854 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854 Approved by: https://github.com/yanboliang, https://github.com/anijain2305	2025-02-11 01:34:46 +00:00
zeshengzong	5205158c1b	Replace is_same with is_same_v for concise syntax (#145450 ) Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450 Approved by: https://github.com/Skylion007	2025-02-11 01:34:15 +00:00
PyTorch MergeBot	f38f1dcd82	Revert "move and fix logic to update unbacked bindings (#146115 )" This reverts commit 103c8b44bcb6fbf30b5411c5af19d312427525e7. Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/huydhn due to This change has been reverted internally D69129334 but the OSS revert failed https://github.com/pytorch/pytorch/pull/146437 ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2649610877))	2025-02-11 01:26:36 +00:00
Yuanhao Ji	0c9fdd6cfb	[Docs] Fix description of `input` in `torch.addbmm()` (#146664 ) Fixes #146613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146664 Approved by: https://github.com/mikaylagawarecki	2025-02-11 01:22:09 +00:00
PyTorch MergeBot	2fafcd37c3	Revert "cpp_wrapper: Precompile device-specific header files (#144002 )" This reverts commit de6efa1feb0e8c9073640a77afdec1a53a477aed. Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))	2025-02-11 00:42:22 +00:00
Isalia20	d763093b49	[MPS] fix lu factor for large tensors with bs>1 (#146753 ) Try this: ```python import torch batch_size = 2 A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps") A_cpu = A.cpu() LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu) LU, pivots = torch.linalg.lu_factor(A) torch.testing.assert_close(LU.cpu(), LU_cpu) ``` You'll get a huge difference in the LU tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-11 00:37:07 +00:00
Anant Gulati	937b41e3b5	Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472 ) In this series of PRs we intend to refactor the pipeline parallelism test cases so that they are completely device agnostic. These changes will include the following approaches: - Allowing for multiple device types using instantiate_device_type_test - Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies This should improve usability for all devices. For this PR we have shown support for the following devices: - CPU (wherever applicable) - CUDA - HPU - XPU To add another device, new users can simply append their device to the device list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146472 Approved by: https://github.com/H-Huang	2025-02-11 00:13:23 +00:00
amdfaa	b6273d7f4b	[ROCm] Update periodic.yml to use 2GPU runners (#146839 ) Temporary fix for the rocm workflow. The 4-GPU runners have all been taken offline due to a network timeout issue, so we aren't able to run any periodic jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146839 Approved by: https://github.com/jeffdaily	2025-02-10 23:41:11 +00:00
CK Luk	aa1622c0b6	Support ignoring parameters in FSDP2 (#146631 ) Differential Revision: D69153051 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146631 Approved by: https://github.com/awgu	2025-02-10 23:20:28 +00:00
Jason Ansel	c2bf3be011	[inductor] Remove _get_grid_fn_str (#146800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146800 Approved by: https://github.com/yanboliang	2025-02-10 23:14:30 +00:00
Henry Tsang	0d5fb0941f	[cutlass backend] check against arch >= 100 (#145812 ) Summary: Want to add a guard against silent fallback to SM90. GenerateSM100 was just added 3 days ago. https://github.com/NVIDIA/cutlass/blame/main/python/cutlass_library/generator.py#L8896 It should show up in CUTLASS 3.8 (not pinned yet). Test Plan: ci Differential Revision: D68748705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145812 Approved by: https://github.com/chenyang78, https://github.com/ColinPeppler, https://github.com/Aidyn-A	2025-02-10 22:41:08 +00:00
Gabriel Ferns	bab35eb26a	fix intermediate debug information with cpp_wrapper (#145527 ) Summary: before fix, code like: ```cpp aoti_torch_print_tensor_handle(buf0, "after_launch - triton_poi_fused_randn_0 - buf0"); aoti_torch_print_tensor_handle(buf1, "after_launch - triton_poi_fused_randn_0 - buf1"); printf("[ after_launch - triton_poi_fused_randn_0 - 0: %ld ]", 0); printf(" "); printf("[ after_launch - triton_poi_fused_randn_0 - 1228800L: %ld ]", 1228800L); printf(" "); ``` was generated, which is a syntax error. Test Plan: New unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145527 Approved by: https://github.com/desertfire	2025-02-10 22:24:26 +00:00
Huy Do	681894546b	Fix bazel job after #144489 (#146840 ) This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610 ### Testing Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840 Approved by: https://github.com/atalman	2025-02-10 22:17:36 +00:00
Daniel Vega-Myhre	652880e840	Fix logging and test files which misspell "precision" (#146113 ) Noticed this while working on something, decided to submit a quick fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146113 Approved by: https://github.com/drisspg	2025-02-10 21:54:16 +00:00
Nikhil Gupta	e65b89e4cd	[Feat]: Improve KleidiAI 4 bit kernel performance (#146476 ) Description: 1. New thread blocking accelerates GEMVs 2. We increase throughput of the lhs quant pack + matmul pipeline by decoupling the two operations. 3. The new blocking strategy blocks ```out_feature``` to accelerate GEMVs Perf improvements: 12% speedup in LLM prefill phase and up to 16% speedup in autoregressive phase Perf Benchmarking : https://github.com/pytorch/pytorch/issues/143289#issuecomment-2545773370 Change-Id: Ie574ff8459fdb75701ae366158b4e118c70694e4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146476 Approved by: https://github.com/malfet	2025-02-10 21:30:57 +00:00
Wouter Devriendt	4d626c261b	Fix workarea compute in lapackSyevd (#146456 ) Work-query APIs return floating point values that could lose precision when converted back to int. Solve this by using `nextafter` and `ceil`. Add regression test. Fixes #145801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146456 Approved by: https://github.com/malfet	2025-02-10 21:29:48 +00:00
Yidi Wu	8f073065d5	[while_loop][inductor] support sym expression as cond_fn output (#146222 ) As titled. Previously, we only support tensor output of cond_fn, this PR changes to also allow a shape expr to be returned in cond_fn. aoti generated output code looks like: ``` V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] bool buf7_cond_result; .... (while_loop_cond_graph_0_arg2_1_handle); V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] buf7_cond_result = u0 + u1 < 10L; V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] if (!buf7_cond_result) break; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222 Approved by: https://github.com/desertfire	2025-02-10 21:25:40 +00:00
Yidi Wu	97d4753bd3	[hop][inductor] don't promote arg type for cond and while_loop (#146660 ) Hop subgraph codegen assumes that argument types are not promoted. Otherwise, we might generate a wrong kernel. Differential Revision: [D69279031](https://our.internmc.facebook.com/intern/diff/D69279031) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146660 Approved by: https://github.com/zou3519, https://github.com/eellison	2025-02-10 21:24:52 +00:00
zeshengzong	da216baaa2	Optimize inductor `Self` typing (#146669 ) Replace method return type with `Self` typing Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669 Approved by: https://github.com/jansel	2025-02-10 20:39:56 +00:00
angelayi	86b52f4209	Fix lint (#146846 ) [Fixes #ISSUE_NUMBER ](https://github.com/pytorch/pytorch/actions/runs/13248382636/job/36980294598) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146846 Approved by: https://github.com/huydhn, https://github.com/clee2000	2025-02-10 20:00:29 +00:00
Gregory Comer	3d604b17d9	Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791 ) As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table. In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option. Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and un-decomposite it, but this is no longer necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791 Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai	2025-02-10 19:30:19 +00:00
Yifu Wang	97f6480cf5	Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467 ) Fixes https://github.com/pytorch/pytorch/issues/146416 Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation from manifesting as silent correctness issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467 Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314	2025-02-10 19:15:49 +00:00
angelayi	3822a88d21	[symbolic shapes] Log symnode id (#146583 ) We want to log the symnode id which will help us with provenance tracking between expressions created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146583 Approved by: https://github.com/bobrenjc93	2025-02-10 19:13:06 +00:00
Camyll Harajli	b45e6fa707	Cleanup VS 2019 refs in pytorch (#145863 ) Related to: https://github.com/pytorch/pytorch/issues/128835 Follow up on PR: https://github.com/pytorch/pytorch/pull/145319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145863 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn, https://github.com/atalman	2025-02-10 19:05:35 +00:00
Zhengxu Chen	c02a1ecc1d	[export][ez] Allow math.trunc for serialization. (#146715 ) Summary: as title. Test Plan: CI Differential Revision: D69317084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146715 Approved by: https://github.com/angelayi	2025-02-10 19:05:07 +00:00
angelayi	9b7d050600	Move capture_provenance to make_node_impl (#146625 ) Previously we were only logging `make_user_impl` implementations, which only gets triggered for operations done on python SymInts, not cpp SymInts. Instead `make_node_impl` will get triggered for both python and cpp SymInt operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146625 Approved by: https://github.com/bobrenjc93	2025-02-10 19:00:51 +00:00
Zhengxu Chen	0486a996d2	[sigmoid] Implement a OSS only model runner. (#146440 ) Summary: Implement an OSS version of modelrunner with clean dependencies. The new OSS model runner removes thrift and only uses a JSON header to load the model. Test Plan: Test will be added in the next diff separately. (D69060784) Differential Revision: D68846877 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146440 Approved by: https://github.com/SherlockNoMad	2025-02-10 18:54:05 +00:00
Ting Lu	519f547d05	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-02-10 18:34:59 +00:00
Henry Tsang	ad847da0cf	[cutlass backend] fix bug for accuminator dtype (#146356 ) Will add unit tests for accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356 Approved by: https://github.com/Chillee	2025-02-10 18:20:58 +00:00
Henry Tsang	ddcc97bb8c	Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668 ) I think it's good to have everything in the .cu file, especially the nvcc compile command. Technically, the configuration name can be found in the template already, so let me know if you think it's not needed. Differential Revision: D69281295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668 Approved by: https://github.com/chenyang78	2025-02-10 18:16:44 +00:00
Henry Tsang	6b3f51f870	use None to slice when list has one element only (#146638 ) When autotune_num_choices_displayed is None and the list of choices has length 1, slicing with `[:-1]` means getting all elements except the last one, which resulted in an empty list. Slicing with `[:None]` works. Differential Revision: D69265168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146638 Approved by: https://github.com/drisspg	2025-02-10 18:15:45 +00:00
Rachel Guo	374b762bbf	[ez][BE] get rid of the extra printf('\n') (#146726 ) Summary: as title Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda ``` Differential Revision: D69328701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146726 Approved by: https://github.com/ColinPeppler	2025-02-10 17:45:55 +00:00
blorange-amd	5fd15a04b7	[ROCm] Enable inductor-periodic testing for MI300 (#144594 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594 Approved by: https://github.com/malfet, https://github.com/huydhn Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-02-10 17:42:09 +00:00
PyTorch MergeBot	b8261358ca	Revert "windows Magma build for cu128 (#146653 )" This reverts commit d0e70c4fd33d9accca2c66203c19372733a83ea1. Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/jeanschmidt due to Seems to have broken some windows tests, reverting to see if it gets green ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2648769150))	2025-02-10 17:36:32 +00:00
Animesh Jain	cbbb11d967	[dynamo][user-defined] Unify standard and non-standard __new__ codebase (#146737 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146737 Approved by: https://github.com/jansel ghstack dependencies: #146677	2025-02-10 17:31:13 +00:00
Animesh Jain	ee8a06f1f6	[dynamo][user-defined] User class.__new__ instead of special casing (#146677 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146677 Approved by: https://github.com/jansel	2025-02-10 17:31:13 +00:00
Benjamin Glass	de6efa1feb	cpp_wrapper: Precompile device-specific header files (#144002 ) This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone. Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002 Approved by: https://github.com/desertfire	2025-02-10 17:13:09 +00:00
soulitzer	3cadce7af2	[NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633 Approved by: https://github.com/jbschlosser	2025-02-10 16:59:48 +00:00
Davide Italiano	dfe3b64282	[mps] Implement eager support for spherical_bessel_j0 (#146818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-10 16:58:05 +00:00
Hyunho Yeo	5f621c5879	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710 ) Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`. Test Plan: The test is implemented in the following diff. Reviewed By: yuhc Differential Revision: D68988673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710 Approved by: https://github.com/nautsimon	2025-02-10 16:57:09 +00:00
Brian Hirsh	68c9e22ef7	FSDP: avoid resetting version counter of all_gather_output in inference_mode (#146709 ) Summary: FSDP needs to hide VC bumps on its allgather buffer, but it does not need to do this if the allgather buffer was generated under inference mode. More details here: https://www.internalfb.com/diff/D69115649?dst_version_fbid=1316814572779281&transaction_fbid=849120230625711 Test Plan: CI Differential Revision: D69311496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146709 Approved by: https://github.com/awgu	2025-02-10 16:56:40 +00:00
PyTorch MergeBot	6aa924af68	Revert "[ONNX] Create deprecation warning on dynamo_export (#146425 )" This reverts commit 41e6d189a39a40b237ab9b9ab195cec1194b331b. Reverted https://github.com/pytorch/pytorch/pull/146425 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/146425#issuecomment-2648472579))	2025-02-10 15:54:34 +00:00
PyTorch MergeBot	1557b7bf9a	Revert "[ONNX] Adjust and add deprecation messages (#146639 )" This reverts commit 63c2909ae3e293dee96bca5af88bc51d8ca0ce10. Reverted https://github.com/pytorch/pytorch/pull/146639 on behalf of https://github.com/atalman due to Sorry Need to revert https://github.com/pytorch/pytorch/pull/146425 ([comment](https://github.com/pytorch/pytorch/pull/146639#issuecomment-2648465047))	2025-02-10 15:51:52 +00:00
eellison	a36c22f2ed	further scheduler changes for invoke_quant: prologue low prec, (slightly) more aggressive fusion (#145104 ) Respect invoke_quant low precision options; also, be more aggressive in attempting fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145104 Approved by: https://github.com/shunting314, https://github.com/jansel ghstack dependencies: #139102	2025-02-10 15:50:19 +00:00
Guilherme Leobas	899066eedf	Fix round(...) with constants (#146495 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146495 Approved by: https://github.com/anijain2305	2025-02-10 15:08:09 +00:00
Nikita Shulga	611ca163fd	[MPS] Add bilineard2d_aa implementation (#145526 ) An interesting and poorly documented quirk of the algorithm is that the value of align_corners is ignored in antialias mode; see the arguments of `e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)`. Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), as it's not implemented on any other accelerator devices at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526 Approved by: https://github.com/dcci	2025-02-10 15:03:14 +00:00
Ting Lu	d0e70c4fd3	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-02-10 13:48:55 +00:00
Tom Ritchford	6f15a609d3	Test typing of arithmetic operators on Tensor (see #145838 ) (#146426 ) See #145838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146426 Approved by: https://github.com/Skylion007	2025-02-10 12:19:56 +00:00
Jack Taylor	c24038025d	[ROCm] Unskip std:bad_alloc failures (#146407 ) Flakey MI300 issue related to memory usage should now be resolved after https://github.com/pytorch/pytorch/actions/runs/13007160888?pr=145829. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146407 Approved by: https://github.com/jeffdaily	2025-02-10 11:01:56 +00:00
yousoumar	c88ae00692	fix: replace stderr with stdout for download messages in hub.py (#146475 ) This PR addresses an issue where download logs in `hub.py` are sent to `stderr` instead of `stdout`. Hence, when running models with workers, these messages are incorrectly categorized as errors, leading to confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146475 Approved by: https://github.com/mikaylagawarecki	2025-02-10 10:46:10 +00:00
gasoonjia	6667e5d786	[dim order] solve broken doc (#146641 ) Differential Revision: [D69265340](https://our.internmc.facebook.com/intern/diff/D69265340/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146641 Approved by: https://github.com/svekars, https://github.com/Jack-Khuu	2025-02-10 07:51:26 +00:00
Xilun Wu	c4d835fbab	[DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278 ) Fixes #142058 ## Summary The DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model. The ATen convolution_backward op usually returns 3 Tensors (grad_input, grad_weight, grad_bias), and `grad_input` is actually an Optional[Tensor] which can be `None` in the case mentioned above. However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists. ## Fix Allow `grad_input` to be `None` for the `convolution_backward` op. ## Test `pytest test/distributed/tensor/test_convolution_ops.py` ## Follow-up The current implementation of the DTensor conv op also ignores `output_mask`, and this may need further care. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278 Approved by: https://github.com/bdhirsh	2025-02-10 07:06:40 +00:00
Ke Wen	effc545274	[DDP] Use NCCL allocated memory for gradient bucket (#146589 ) So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications. Less SM usage, less memory contention between NCCL kernel and compute kernels. Added env `DDP_DISABLE_COMM_MEM` as a back-out option: ``` An environment variable to disable comm-optimized memory pool. Default is 0, which means comm-optimized memory pool is enabled. Users can set it to 1 in case of seeing regression or OOM (because this comm MemPool may not share space with regular compute MemPool). ``` Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589 Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj	2025-02-10 05:23:11 +00:00
Simon Fan	387c993c3b	[ca] remove private API: _compiled_autograd_should_lift (#146720 ) Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag which tries to inline make_fx style in CA initial pass. There's no more usage internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720 Approved by: https://github.com/zou3519	2025-02-10 04:29:57 +00:00
zeshengzong	e8304f08fe	Fix torch.take_along_dim param type and default description (#146474 ) ## Changes - Change type description to `LongTensor`, consistent with [`torch.take`](https://pytorch.org/docs/stable/generated/torch.take.html) - Add `dim` param default value description ## Test Result Before ![image](https://github.com/user-attachments/assets/720ce158-2bc1-48b5-a188-56fcc7188d96) After ![image](https://github.com/user-attachments/assets/05fe20bd-9476-4b97-ac2b-9b161d6532a1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146474 Approved by: https://github.com/mikaylagawarecki	2025-02-10 01:19:30 +00:00
Simon Fan	298226f358	[dynamo] check for incompatible configs (#146513 ) internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/ Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time. Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513 Approved by: https://github.com/williamwen42 Co-authored-by: Gabriel Ferns <gabeferns@meta.com>	2025-02-10 00:44:23 +00:00
Davide Italiano	2a55311773	[cuda] Simplify the sinc function a bit. (#146774 ) `else` after `return` can be removed & the indentation can be reduced, for readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146774 Approved by: https://github.com/malfet	2025-02-09 20:09:34 +00:00
drisspg	b133907d0a	Update strided test to float32 (#146748 ) Fixes #146377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146748 Approved by: https://github.com/BoyuanFeng, https://github.com/leijurv	2025-02-09 17:41:35 +00:00
Davide Italiano	91c4bf39d3	[mps] Add a shader for spherical_bessel_j0. (#146771 ) In preparation for adding the operation to inductor/eager. Adapted from the CUDA version of the shader. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771 Approved by: https://github.com/malfet	2025-02-09 05:11:17 +00:00
Nikita Shulga	0e83e7d56e	[EZ] Add logic to build Metal shader with debug info (#146768 ) By appending `-frecord-sources -gline-tables-only` to the compilation command. Helpful when debugging shaders compiled into libtorch. Test plan: Run `python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm` and then run the following to capture the shader and check that it contains debug info ```python import torch import os os.environ["MTL_CAPTURE_ENABLED"]="1" inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32) with torch.mps.profiler.metal_capture("bilinear2d"): out = torch.nn.functional.interpolate(inp, scale_factor=(1.7,0.9), mode="bilinear") ``` <img width="769" alt="image" src="https://github.com/user-attachments/assets/e0316c1c-07a4-4da5-97b9-886c56857c1d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768 Approved by: https://github.com/dcci	2025-02-08 23:40:23 +00:00
Guilherme Leobas	6a9a02acbe	Set `enable_faithful_generator_behavior` flag to True (#142513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142513 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420, #145223	2025-02-08 22:42:12 +00:00
Guilherme Leobas	580a305681	Raise MutationError if there are side effects when returning generator (#145223 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145223 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420	2025-02-08 22:42:12 +00:00
Guilherme Leobas	68cfd36c11	Add `CLEANUP_THROW` bytecode (#144420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144420 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424	2025-02-08 22:42:12 +00:00
Guilherme Leobas	53ab82d8f5	Implement `generator.throw(exception)` (#144424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144424 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423	2025-02-08 22:42:12 +00:00
Guilherme Leobas	8ee095f7c1	Implement `generator.close()` (#144423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144423 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422	2025-02-08 22:42:12 +00:00
Guilherme Leobas	ca9b16e070	Implement `generator.send(..)` (#144422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144422 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421	2025-02-08 22:42:12 +00:00
Guilherme Leobas	d798831167	Implement `generator.__iter__()` (#144421 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144421 Approved by: https://github.com/zou3519 ghstack dependencies: #141055	2025-02-08 22:42:12 +00:00
Guilherme Leobas	8603a1c870	Support generators (#141055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055 Approved by: https://github.com/zou3519	2025-02-08 22:42:12 +00:00
Scott Wolchok	ade8fee512	Use c10 version of half/bfloat16 in executorch (#144111 ) Summary: X-link: https://github.com/pytorch/executorch/pull/7040 Accomplished by importing relevant files from c10 into executorch/runtime/core/portable_type/c10, and then using `using` in the top-level ExecuTorch headers. This approach should keep the ExecuTorch build hermetic for embedded use cases. In the future, we should add a CI job to ensure the c10 files stay identical to the PyTorch ones. ghstack-source-id: 260047850 exported-using-ghexport Test Plan: builds Differential Revision: D66106969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111 Approved by: https://github.com/malfet	2025-02-08 22:40:14 +00:00
eellison	92b7e610ab	[Inductor changes] Invoke Quant (#139102 ) Adds a `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0). The primary motivations are - Unifying scattered reasoning for quant operators throughout the code base - Easy of pattern matching - see this very large pattern match expression [here](`949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)`. Compared to the pattern I have in the tests: ``` @register_graph_pattern( CallFunction( torch.ops.aten.mm, CallFunction( torch.ops.higher_order.invoke_quant, Ignored(), Ignored(), Ignored(), scheme="nf4", ), Arg(), ), pass_dict=test_pass, ) ``` - Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul. Example graph: ``` Python ===== AFTER POST GRAD ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, kwargs, quant_options=self) # type: ignore[call-arg] repeated_subgraph0 = self.repeated_subgraph0 invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4'); repeated_subgraph0 = arg0_1 = arg1_1 = None return (invoke_quant,) class repeated_subgraph0(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, *kwargs, quant_options=self) # type: ignore[call-arg] mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = None add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1); mul = arg1_1 = None return add ``` The schema for `invoke_quant` is 
`torch.ops.higher_order.invoke_quant(subgraph, args, scheme=None)` where the scheme will not always be present. I wasn't sure exactly how the inductor specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta of the node itself. And, following that, I wanted the invocation of the hop to match how it will show up in the graph. So I decided to have it be an object that is then invoked for the tracing. ``` invoke_quant = InvokeQuant(codegen_low_precision=True) invoke_quant(gn, (x, y), scheme="nf4") ``` Todo - do not require packing the args in a tuple; will do following https://github.com/pytorch/pytorch/pull/139162. Feedback welcome. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102 Approved by: https://github.com/Chillee	2025-02-08 19:30:19 +00:00
Blaine Burton Rister	a1bfb39a31	[Inductor] Expand Identity ops prior to block pattern matching (#146000 ) # Feature Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity`'s. This PR adds a few features to achieve this invariance. - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression. - Preprocess the expression with this expansion prior to pattern matching. - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`. # Test plan This PR adds a few new unit tests: - Added a unit test specifically for `expr.expand(identity=True)`. - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops. I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000 Approved by: https://github.com/eellison, https://github.com/jansel	2025-02-08 18:11:53 +00:00
Jason Ansel	eee5622b98	[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297	2025-02-08 18:00:49 +00:00
Jason Ansel	c098385cb3	[inductor] Refactor CaptureIndexing into global scope (#146297 ) And inline SimplifyIndexing into CaptureIndexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282	2025-02-08 18:00:49 +00:00
Jason Ansel	d35f6b2339	[inductor] Minor compile time optimizations in DefaultHandler (#146282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257	2025-02-08 18:00:40 +00:00
Jason Ansel	06604c4ec1	[inductor] Refactor op handlers part 5 (#146257 ) This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255	2025-02-08 18:00:30 +00:00
Jason Ansel	403db2faee	[inductor] Refactor op handlers part 4 (#146255 ) This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2. Some compile time wins from this as well: ``` 2025-02-02T19:46:32.2033010Z 2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2037575Z 2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones 2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50% 2025-02-02T19:46:32.2040131Z 2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2042188Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254	2025-02-08 18:00:17 +00:00
Jason Ansel	0e31e5932b	[inductor] Refactor op handlers part 3 (#146254 ) Fixes type errors that arise from typing `V.ops`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254 Approved by: https://github.com/shunting314 ghstack dependencies: #146252	2025-02-08 18:00:08 +00:00
Jason Ansel	71498aeae3	[inductor] Refactor op handlers part 2 (#146252 ) This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252 Approved by: https://github.com/yanboliang	2025-02-08 18:00:00 +00:00
cyyever	46e83bb637	Fix linter F821 error (#146665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-08 07:19:37 +00:00
Natalia Gimelshein	a3ca5c7f4e	remove incorrect warnings from min/max documentation (#146725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725 Approved by: https://github.com/wdvr, https://github.com/malfet	2025-02-08 05:10:08 +00:00
Justin Chu	63c2909ae3	[ONNX] Adjust and add deprecation messages (#146639 ) Adjust and add deprecation messages to torch.onnx utilities and verification methods because they are only related to torch script and are obsolete. Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639 Approved by: https://github.com/titaiwangms	2025-02-08 05:09:16 +00:00
Nikita Shulga	2328dcccb9	[MPSInductor] Implement Welford reduction (#146703 ) Still a work in progress: the fallback works as expected, but the custom shader does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703 Approved by: https://github.com/jansel, https://github.com/dcci	2025-02-08 05:00:00 +00:00
drisspg	69feef5a94	Fix broken meta function for flex-attention backwards (#146563 ) # Summary Fixes https://github.com/pytorch/pytorch/issues/146377 So what was the original problem: we were codegening a really weird epilogue: ```Python # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2 tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask) x5 = (xindex % ks3) tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last') tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask) ``` This epilogue was writing to and then reading from overlapping regions of memory, causing a race condition. ### Why were we generating this epilogue During the lowering we created a buffer w/ a different size/stride from the expected return strides. I think this added an implicit node (for doing the permutation of this wrongly strided output to the expected one from the meta func). The scheduler for some reason thought it was okay to fuse this into the epilogue; tbh I don't know why. This fixes the broken meta func and the original repro. I will add a test but it is hard to pop, better than nothing Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563 Approved by: https://github.com/Chillee	2025-02-08 04:13:52 +00:00
David Peixotto	9c78fb920d	Fix assertion failure in gemm template lowering (#146353 ) Summary: This commit fixes a crash in the gemm template lowering caused by hitting an [assert](`fd515e4f59/torch/_inductor/codegen/common.py (L1181)`) that a buffer was previously removed. The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set. The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings. Differential Revision: D68814625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353 Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475	2025-02-08 01:52:20 +00:00
cyy	6cb2f737ee	Enable Windows tests (#146666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666 Approved by: https://github.com/albanD	2025-02-08 00:55:20 +00:00
Isalia20	0ab67299c3	[MPS] lu unpack (#146681 ) Implements lu unpack function on MPS. Haven't added new tests because they are covered by removing the lu_unpack from UNIMPLEMENTED_XFAILLIST in test_mps with `test_output_match` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-08 00:16:17 +00:00
Gregory Comer	803661526e	Update ET pin to 41e7ffa (#145831 ) ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework. This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831 Approved by: https://github.com/metascroy	2025-02-07 23:52:20 +00:00
Hyunho Yeo	dcac3c3e06	[MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659 ) Summary: Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated". Nit: The file previously contained two unit tests with the same name (due to wrong revert); I deleted a deprecated one to revamp the correct version. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/12103424065182810 Reviewed By: yuhc Differential Revision: D68988435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659 Approved by: https://github.com/nautsimon	2025-02-07 23:06:35 +00:00
Dingming Wu	fa34128435	revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453 ) Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass. Test Plan: With the change, build of APS model using rcclexp can now pass: `sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0` Reviewed By: c-p-i-o Differential Revision: D69149588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453 Approved by: https://github.com/c-p-i-o	2025-02-07 22:43:52 +00:00
Avik Chaudhuri	103c8b44bc	move and fix logic to update unbacked bindings (#146115 ) Summary: Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here). This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde. Test Plan: added test D68880766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115 Approved by: https://github.com/pianpwk	2025-02-07 22:41:19 +00:00
Lu Fang	45d35f5f5a	Clean up op BC check list (#146577 ) Summary: Remove the expired ones Test Plan: ci Differential Revision: D69226556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577 Approved by: https://github.com/hl475	2025-02-07 22:40:49 +00:00
Henry Hu	908133f682	[TreeSpec] Add custom comparison function (#146442 ) Summary: https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls. However, this made tests flaky when comparing TreeSpec for objects in local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'. Type comparison will yield False when local scopes are different due to lru_cache. Since this comparison is only used for testing purposes, we will only test if str(type) are equal. Test Plan: ``` PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py ``` Differential Revision: D69137706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442 Approved by: https://github.com/angelayi	2025-02-07 22:39:21 +00:00
drisspg	91dfa82981	[FlexAttention] Fix dynamic shapes in max-autotune (#146657 ) # Fixes https://github.com/pytorch/pytorch/issues/146624 ### Updated From offline discussion, going w/ sizehint. However, this does incur guards. I couldn't really think of a fancy way to do this. I was going to do `V.graph.sizevars.size_hint` w/ some default for num blocks, but we ultimately need some information about the input. I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe). For instance, in the repro, we quickly hit the recompile limit. ```Shell torch._dynamo hit config.recompile_limit (8) function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161) last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546 To log all recompilation reasons, use TORCH_LOGS="recompiles". To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657 Approved by: https://github.com/Chillee, https://github.com/yanboliang	2025-02-07 22:34:28 +00:00
Jason Ansel	579b9f2ed9	[inductor] Better exception error messages for cache_on_self (#146652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652 Approved by: https://github.com/yanboliang	2025-02-07 21:22:21 +00:00
Jason Ansel	04ce02182b	[inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-02-07 21:21:21 +00:00
PyTorch MergeBot	80a1696679	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))	2025-02-07 21:04:23 +00:00
Henry Tsang	206ad9f4ad	[cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554 ) This PR does a few things: * Set fallback to aten to False for most tests. Without this, a lot of tests would fail silently since they just use aten. * Disable two broken subprocess-related tests. They would crash in the subprocess. More investigation needed. * Remove/disable the tests on A100. Let me elaborate a bit more. There are two types of A100 tests. * Normal tests that also test A100, e.g., mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore. GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out since they are of GemmKind.Universal but only GemmKind.Universal3x are supported in the 3x template. * Tests for A100 only. The mixed mm and sparse semi structure tests have been failing for a while due to "TypeError: can't multiply sequence by non-int of type 'str'". Disabled them for now. Do let us know if you care about them @alexsamardzic Differential Revision: D69209929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554 Approved by: https://github.com/chenyang78	2025-02-07 19:59:28 +00:00
PyTorch MergeBot	f17109bd96	Revert "windows Magma build for cu128 (#146653 )" This reverts commit 9e27d36e2b2a4f037a7e448c2f87a9ebb0d6e628. Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))	2025-02-07 19:37:16 +00:00
Shunting Zhang	bc0191802f	[inductor] add size-asserts for fallback ops (#145904 ) Fix https://github.com/pytorch/pytorch/issues/144717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904 Approved by: https://github.com/jansel	2025-02-07 18:44:32 +00:00
Gabriel Ferns	b60f630de8	fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650 ) Summary: needed for https://github.com/pytorch/pytorch/pull/146513 Test Plan: the existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650 Approved by: https://github.com/xmfan	2025-02-07 18:25:00 +00:00
Ting Lu	9e27d36e2b	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman	2025-02-07 18:09:30 +00:00
Tristan Rice	23af9dde4d	distributed/serialization: add experimental streaming torch.save/load methods (#146555 ) Summary: This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor. This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility. Security wise this fully supports weights_only and defaults to True. It does use pickle for some metadata but uses weights_only for the metadata. Adapted from: https://github.com/pytorch/torchft/pull/101 https://github.com/pytorch/torchft/pull/54 Test Plan: pytest test/distributed/test_serialization.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555 Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>	2025-02-07 18:08:11 +00:00
Tristan Rice	68631f6e87	PyWork: preserve Python reference counting when used in functional collectives (#146376 ) @fegin found an issue where torchft is not compatible with functional collectives. Found in https://github.com/pytorch/torchtitan/pull/806 The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug. PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way Pybind works is that the Python object owns the C++ object rather than some form of shared ownership. Thus what happens is that the PyWork Python object will be collected when returned to C++ from the PyProcessGroup but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock as the corresponding Python object no longer exists. To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects. To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use it for all collectives in PyProcessGroup. Test plan: ``` cd pytorch pytest test/distributed/test_c10d_functional_native.py ``` ``` cd torchft pytest torchft/process_group_test.py -k functional -v -x -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376 Approved by: https://github.com/yifuwang	2025-02-07 18:07:53 +00:00
James Wu	76c8a2dc48	Fix get_top() to return the base level event of the stack, not the most recently started event (#146649 ) `get_top()` is really confusing when talking about a stack, because it can mean the most recently started event on the stack or the toplevel event in perfetto(which displays the stack upside down). Rename to `get_outermost` and fix the bug associated with it, so that it returns the correct value out of the stack. Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event: ``` tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes \| tee out.log ``` <img width="1281" alt="image" src="https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649 Approved by: https://github.com/masnesral, https://github.com/anijain2305	2025-02-07 18:04:50 +00:00
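The distinction drawn in the `get_top()` commit above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `EventStack`, `get_innermost`, and the event names other than `dynamo` are hypothetical and are not taken from the PyTorch source.

```python
from dataclasses import dataclass, field


@dataclass
class EventStack:
    """Hypothetical stand-in for a stack of nested, currently open events."""

    _events: list = field(default_factory=list)

    def push(self, name: str) -> None:
        # A newly started event nests inside everything already open.
        self._events.append(name)

    def pop(self) -> str:
        return self._events.pop()

    def get_outermost(self):
        # The base-level event: the first one started and the last to finish.
        return self._events[0] if self._events else None

    def get_innermost(self):
        # The most recently started event (what "top of stack" usually means).
        return self._events[-1] if self._events else None


stack = EventStack()
# Only "dynamo" comes from the commit message; the other names are made up.
for name in ("dynamo", "hypothetical_frame_compile", "hypothetical_backend_compile"):
    stack.push(name)

print(stack.get_outermost())  # the event that frame-level metrics should attach to
print(stack.get_innermost())
```

Naming the accessor after its position in the nesting, rather than "top", avoids the ambiguity between the list's last element and the event a trace viewer draws at the top.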
briancoutinho	f138b18d18	[inductor/profiler] add kernel kwargs instrumentation (#145573 ) ## About As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, such as block size. ## Test program Note, install triton before proceeding (pip install triton) triton_test.py>>> ``` import torch from torch.profiler import profile, ProfilerActivity def foo(x, y): a = torch.sin(x) b = torch.cos(y) return a + b def main(): x = torch.randn(10, 10).cuda() y = torch.randn(10, 10).cuda() opt_foo = torch.compile(foo) z = opt_foo(x, y) # Profile the kernel function on the GPU with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True ) as prof: z = opt_foo(x, y) # Export the trace to a file prof.export_chrome_trace("my_kernel_trace.json") if __name__ == "__main__": main() ``` Run it and we should get a trace file my_kernel_trace.json Output has triton event with the kernel_kwargs attribute. ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815, "ts": 2045246693014.959, "dur": 75.662, "args": { ... "kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)", "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py", "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor" } }, ``` ## Unit Test Updated unit test: ``` pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573 Approved by: https://github.com/davidberard98, https://github.com/jansel	2025-02-07 17:44:30 +00:00
Animesh Jain	ee45ea599d	[dynamo] Actionable message on recompilations for fullgraph=True (#146550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550 Approved by: https://github.com/zou3519, https://github.com/StrongerXi ghstack dependencies: #146553	2025-02-07 17:28:43 +00:00
Animesh Jain	fa0956951c	[dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-02-07 17:28:43 +00:00
cyy	25aa7ca62d	Cleanup CallOnce.h (#146700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700 Approved by: https://github.com/albanD	2025-02-07 16:44:45 +00:00
PyTorch MergeBot	076717785c	Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222 )" This reverts commit 5ecdc428b230ab5ba44a90678f1c905e314f6ccb. Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))	2025-02-07 16:19:41 +00:00
eqy	5d7532140f	[CUDA][CUDA Graphs] Fix debug mode warning message (#145996 ) The real method is `enable_debug_mode()`, `_cuda_enable_graphs_debug_mode` does not exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996 Approved by: https://github.com/ptrblck, https://github.com/eellison	2025-02-07 08:04:49 +00:00
eellison	002accfb8d	Check meta strides for expanded dims in effn_attn_bias (#146054 ) With the `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of that dim to 0 to avoid materializing a larger tensor than we need. Previously, we had checked stride of tensor, but if it is not realized, that will not work. so we should check the strides of the meta as well. Note: getting the exact of realizing/slicing/requiring_exact_strides was a little tricky. I commented to @exclamaforte on an example unable-to-fuse message you get if you do it incorrectly. Fix for https://github.com/pytorch/pytorch/issues/145760 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054 Approved by: https://github.com/shunting314	2025-02-07 06:35:57 +00:00
eellison	71e8a2bda4	Expand inductor codegen dtype asserts, fix scan (#146067 ) We were codegening intermediary dtype asserts in some places but not all. expands assertions, fixes newly failing assertion in `TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067 Approved by: https://github.com/shunting314, https://github.com/jansel	2025-02-07 06:35:47 +00:00
cyy	f6bd20e8a2	Enable TemporaryFileName tests on Windows (#146311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311 Approved by: https://github.com/albanD	2025-02-07 06:06:18 +00:00
Pian Pawakapan	1c872803cb	[export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378 ) Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols are logged, e.g. from the draft export CLI report (soon to be added to tlparse): 1. Guard added: ``` 1. Constraint violation error. The specified input dynamic_shapes spec was found to be incorrect during tracing. Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}. This occured at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward: assert a.shape[0] == 3 Locals: a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0) Symbols: s0: L['args'][0][0].size()[0] ... ``` 2. Real tensor propagation: ``` 1. Data dependent error. When exporting, we were unable to evaluate the value of `u2 < 0`. This was encountered 8 times. This occurred at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward: return res[:c_item] Locals: res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0) c_item: u2 ... ``` Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reporting by `TracingContext`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378 Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93	2025-02-07 05:46:05 +00:00
Aaron Gokaslan	bc40ccf6aa	[BE]: Inline special functions for MPS (#146627 ) These header functions should be inlined for consistency and to avoid translation unit / symbol issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627 Approved by: https://github.com/dcci	2025-02-07 05:15:15 +00:00
Zhou32	ecf44d1002	Fixed a typo in dataset.py (#146600 ) Changed word 'Mult' to 'Multi'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600 Approved by: https://github.com/Skylion007	2025-02-07 05:09:51 +00:00
Justin Chu	41e6d189a3	[ONNX] Create deprecation warning on dynamo_export (#146425 ) Reland #146003 Deprecation of `torch.onnx.dynamo_export`: * [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425 Approved by: https://github.com/titaiwangms, https://github.com/atalman	2025-02-07 04:20:46 +00:00
cyy	fa0592b568	Remove some NOLINT (#146610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-07 01:50:06 +00:00
Nikita Shulga	624d94bdb8	[MPS] Extend `torch.special.sinc` to complex (#146648 ) And to integral data types as well. Was too lazy to deduce the formula myself (or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that input must be multiplied by $$\pi$$: ```math \text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x + \cos(x)\sinh(y) y}{x^2 + y^2} ``` ```math \text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x - \sin(x)\cosh(y) y}{x^2 + y^2} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648 Approved by: https://github.com/dcci	2025-02-07 01:12:37 +00:00
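The complex `sinc` decomposition mentioned in the commit above can be checked numerically with the standard library alone. The sketch below assumes the normalized definition sinc(z) = sin(πz)/(πz) and derives the parts by expanding sin(a + ib) = sin(a)cosh(b) + i·cos(a)sinh(b) and multiplying by the conjugate of a + ib; `sinc_parts` is a hypothetical helper name, not the MPS kernel.

```python
import cmath
import math


def sinc_parts(x: float, y: float) -> tuple[float, float]:
    """Real and imaginary parts of normalized sinc(x + iy) = sin(pi*z) / (pi*z)."""
    if x == 0.0 and y == 0.0:
        return 1.0, 0.0  # removable singularity at the origin
    a, b = math.pi * x, math.pi * y  # the input is scaled by pi first
    denom = a * a + b * b
    re = (math.sin(a) * math.cosh(b) * a + math.cos(a) * math.sinh(b) * b) / denom
    im = (math.cos(a) * math.sinh(b) * a - math.sin(a) * math.cosh(b) * b) / denom
    return re, im


# Compare against direct complex arithmetic at an arbitrary point.
z = complex(0.3, -0.7)
expected = cmath.sin(math.pi * z) / (math.pi * z)
re, im = sinc_parts(z.real, z.imag)
print(abs(re - expected.real), abs(im - expected.imag))
```

A quick sanity check on the signs: on the imaginary axis (x = 0) the function reduces to sinh(πy)/(πy), which is real and positive.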
Michal Gallus	9ea1823f96	[ROCm][Windows] Remove external linkage from an anonymous namespace (#146607 ) Fixes a clang-cl compiler error related to attempt to export a symbol that doesn't have any external linkage, since its declared within a local anonymous namespace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607 Approved by: https://github.com/jeffdaily	2025-02-06 23:48:20 +00:00
Michal Gallus	3379c65de6	[ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606 ) Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606 Approved by: https://github.com/jeffdaily	2025-02-06 23:47:18 +00:00
Michal Gallus	0d8fc00e0a	[ROCm][Windows] Fix isnan integer overload errors on MS STL (#146605 ) Microsoft's STL has a problem with integer overloads of std::fpclassify used by std::isnan and std::isinf. These functions need a cast to double to function correctly. Otherwise, the call fails with "ambiguous call to overloaded function" error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146605 Approved by: https://github.com/jeffdaily	2025-02-06 23:44:11 +00:00
Michal Gallus	3f5ed05688	[Windows][ROCm] Fix c10 hip tests (#146599 ) - Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake. - Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors. - Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599 Approved by: https://github.com/jeffdaily	2025-02-06 23:41:25 +00:00
Fuzzkatt	e13a544b54	fix tf32 issue in test_inductor_freezing.py unit tests (#146444 ) Test is hitting numerical mismatches in NVIDIA internal CI. Add tf32_on_and_off decorator, update check to assertEqual Pull Request resolved: https://github.com/pytorch/pytorch/pull/146444 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/eqy	2025-02-06 23:34:28 +00:00
eqy	7bd7f735d4	[CUDA][SDPA] Compute reference in `test_triton_scaled_dot_product_attention_block_size_16_cuda_float32` in `float64` (#146461 ) Seems to currently fail with mismatches in the 1e-4 range presumably due to sdpa calling into the `MATH` backend here which is less fused than a triton kernel. Doing the ref computation in `float64` appears to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461 Approved by: https://github.com/drisspg	2025-02-06 23:28:56 +00:00
Jason Ansel	2834fe5e93	[inductor] Fix test error test_force_cutlass_backend_aoti_cexpr_codegen (#146564 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend -- --exact 'caffe2/test/inductor:cutlass_backend - test_force_cutlass_backend_aoti_cexpr_codegen (caffe2.test.inductor.test_cutlass_backend.TestCutlassBackend)' ``` Differential Revision: D69219873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146564 Approved by: https://github.com/yanboliang	2025-02-06 23:02:41 +00:00
Aaron Gokaslan	0c81b398ab	[BE][Ez]: Enable some additional pylint ruff warnings (#146609 ) Some additional code hardening with some pylint warnings in ruff that usually indicate bugs. All code currently conforms nicely to them, but this will ensure these errors can be detected statically before running / creating tests. The following rules are enabled: * Ban walrus operators where they would have no effect over regular assignment, making the intention clearer. * Statically check for the common error of forgetting to put parens after the `super` call, which will cause an attribute error * Ban bad string literal args to the builtin `open` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146609 Approved by: https://github.com/aorenste	2025-02-06 21:58:08 +00:00
Michael Suo	99dd846672	[torch] fix builds for older pybind (#146630 ) Summary: some versions of pybind we build with don't have `py::set_error`. So just use the underlying python C API. Test Plan: unit tests Differential Revision: D69254629 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630 Approved by: https://github.com/colin2328, https://github.com/ngimel	2025-02-06 21:22:00 +00:00
Huy Do	3008368b12	Honor Dr.CI classification results on auto commit hash update (#146337 ) Disabling `ignore_flaky_failures` was the safer choice, but it seems that this option doesn't work with the current state of the CI. For example, https://github.com/pytorch/pytorch/pull/125806 hasn't been merged since May because there would always be a failure of one type or another. This effectively disables the automated mechanism. My proposal here is to relax this rule and allow the bot to merge auto commit hash updates with `@pytorchbot merge` like a regular PR. Then we will at least have something working. If this causes issues, we can revert it and try the longer route of improving CI reliability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146337 Approved by: https://github.com/clee2000	2025-02-06 20:33:38 +00:00
Nichols A. Romero	44b69b80c2	[ROCm][TunableOp] Future proof TunableOp unit test. (#146548 ) TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146548 Approved by: https://github.com/jeffdaily	2025-02-06 20:26:02 +00:00
Xilun Wu	5cc1b54a91	[2/N][cp][example] flex attention in context parallel (backward pass) (#146397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397 Approved by: https://github.com/fegin ghstack dependencies: #145896	2025-02-06 19:50:02 +00:00
Xilun Wu	6220c64aea	[1/N][cp][example] flex attention in context parallel (forward pass) (#145896 ) Description This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step: 1. backward pass 2. static load balancing for causal masking 3. dynamic load balancing for other general maskings 4. automatic collective insertion solution 5. non-intrusive context parallel APIs Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896 Approved by: https://github.com/fegin, https://github.com/Skylion007	2025-02-06 19:50:02 +00:00
Yidi Wu	5ecdc428b2	[while_loop][inductor] support sym expression as cond_fn output (#146222 ) As titled. Previously, we only support tensor output of cond_fn, this PR changes to also allow a shape expr to be returned in cond_fn. aoti generated output code looks like: ``` V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] bool buf7_cond_result; .... (while_loop_cond_graph_0_arg2_1_handle); V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] buf7_cond_result = u0 + u1 < 10L; V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] if (!buf7_cond_result) break; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222 Approved by: https://github.com/desertfire ghstack dependencies: #146194, #146195	2025-02-06 19:39:55 +00:00
Bin Bao	1b879fd0ea	[Inductor] Add a JIT Inductor unit test following #146293 (#146529 ) Summary: To follow up https://github.com/pytorch/pytorch/pull/146293, add a JIT Inductor unit test. Other Triton template may need similar fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146529 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-02-06 19:21:15 +00:00
Shunting Zhang	992388c100	[inductor] use ftz variant of exp (#146216 ) The Inductor-generated exp op is compiled by Triton to the following PTX snippet. ``` mul.f32 %f74, %f83, 0f3FB8AA3B; ex2.approx.f32 %f73, %f74; ``` But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as ``` mul.ftz.f32 %f2, %f1, 0f3FB8AA3B; ex2.approx.ftz.f32 %f3, %f2; ``` which uses the FTZ variant. Let Inductor generate the FTZ variant if the use_fast_math config is true. I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216 Approved by: https://github.com/jansel, https://github.com/eellison	2025-02-06 19:12:35 +00:00
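The PTX in the commit above computes exp(x) as 2^(x · log2 e): the literal `0f3FB8AA3B` is the float32 bit pattern of log2(e), and `ex2.approx` evaluates the base-2 exponential. The following is a rough plain-Python model of that identity plus flush-to-zero behavior, not what Triton or nvcc actually emit; `exp_via_exp2` and `flush_to_zero_f32` are hypothetical helper names.

```python
import math
import struct

# Decode the PTX float literal 0f3FB8AA3B (big-endian IEEE-754 single precision).
LOG2E_F32 = struct.unpack(">f", bytes.fromhex("3FB8AA3B"))[0]

# Smallest positive normal float32; anything smaller in magnitude is subnormal.
FLT_MIN_NORMAL = 2.0 ** -126


def flush_to_zero_f32(v: float) -> float:
    """Model of FTZ: subnormal float32 magnitudes become a sign-preserving zero."""
    if v != 0.0 and abs(v) < FLT_MIN_NORMAL:
        return math.copysign(0.0, v)
    return v


def exp_via_exp2(x: float) -> float:
    """exp(x) rewritten as 2 ** (x * log2(e)), mirroring the mul + ex2 pair."""
    return 2.0 ** (x * LOG2E_F32)


print(LOG2E_F32, math.log2(math.e))
print(exp_via_exp2(1.0), math.e)
print(flush_to_zero_f32(1e-40), flush_to_zero_f32(1.0))
```

The small difference between `exp_via_exp2(1.0)` and `math.e` comes from rounding log2(e) to single precision; the hardware approximation adds its own error on top of that.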
Eddie Yan	9ee506bd93	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-02-06 19:04:50 +00:00
eqy	07b214402a	[CUDA][B200] Update the number of threads in `avg_pool2d` backward for SM 10.0 (#145669 ) Fixes register count issue when launching on SM 10.0, originally authored by @bilal2vec Pull Request resolved: https://github.com/pytorch/pytorch/pull/145669 Approved by: https://github.com/nWEIdia, https://github.com/ngimel	2025-02-06 18:57:33 +00:00
Animesh Jain	99ddbb4802	[dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527 ) Earlier if there were no ops in the graph, fullgraph=True will also fallback to eager. This hides issues in testing, where we silently fallback to eager, and do not test optimized bytecode. As can be seen in the PR, I had to fix several tests when I forced to use the optimized bytecode in the absence of graph. A few failing tests will be fixed in follow up PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146527 Approved by: https://github.com/zou3519, https://github.com/StrongerXi	2025-02-06 18:56:07 +00:00
rzou	15b1ac3e86	Add torch.func.debug_unwrap (#146528 ) Use it to unwrap any functorch-wrapped tensor. I don't recommend using the output in a program since it breaks the semantics of the transforms, but it seems useful for debugging. I will note that some people have wanted to get intermediate values out of an e.g. grad transform, so this might be a way to do that... Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146528 Approved by: https://github.com/Chillee	2025-02-06 18:48:09 +00:00
Ryo Suzuki	49082f9dba	parallelize sort (#142391 ) - use __gnu_parallel::sort for gcc compilations - add a parallelized version of std::sort and std::stable_sort for non gcc compilations Using __gnu_parallel::sort: provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64 The performance is measured using the following script: ```python import torch import torch.autograd.profiler as profiler torch.manual_seed(0) N = 50000 x = torch.randn(N, dtype=torch.float) with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof: for i in range(1000): _, _ = torch.sort(x) print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391 Approved by: https://github.com/malfet	2025-02-06 18:06:40 +00:00
Isalia20	7725d0ba12	[METAL] inline bfloat min/max (#146588 ) After a recent commit 36c6e09528a7e071edecde083254da70cba26c95 , building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max: ``` FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')! ``` This PR fixes that. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588 Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet	2025-02-06 17:57:31 +00:00
Animesh Jain	e2e265e27b	[dynamo] Use polyfill to implement comparison operators (#144485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485 Approved by: https://github.com/jansel	2025-02-06 17:27:07 +00:00
Davide Italiano	1090e58687	[mps] Remove a stale comment. (#146619 ) The implementation of the function was moved to a shader, but the comment was left there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146619 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-06 17:25:29 +00:00
Davide Italiano	46390e9a37	[mps] Implement support for sinc() operator (inductor and eager). (#146539 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-06 16:37:27 +00:00
Simon Fan	a14c780c4c	[dynamo] fix dynamo_compile logging on RecompileLimitExceeded (#146544 ) Logging branches based on RecompileLimitExceeded or not. If we exceed the limit, we fall back to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context: `72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935)`. dynamo_config and recompile_reason are both known before we raise the RecompileLimitExceeded, so we can add them with the rest of the "common" metrics, which are logged on metric_context decorator exit, which is always called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544 Approved by: https://github.com/masnesral	2025-02-06 16:20:42 +00:00
Taras	6ff3383157	Enable CUPTI on Windows (#141454 ) Fixes: - https://github.com/pytorch/pytorch/issues/93855 The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events. Additionally, the changes can be verified using the following script: ``` import torch from torch.profiler import profile, ProfilerActivity def check_cupti_enabled(): # Check if CUDA is available if not torch.cuda.is_available(): print("CUDA is not available on this system.") return False # Create a simple CUDA tensor x = torch.randn(1000, 1000, device="cuda") y = torch.randn(1000, 1000, device="cuda") try: # Use PyTorch profiler to perform a basic check with profile(activities=[ProfilerActivity.CUDA]) as prof: z = x @ y # Simple CUDA operation # Print profiling results print("CUPTI is enabled and profiling works.") print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) return True except RuntimeError as e: # If profiling fails, CUPTI is likely not set up correctly print("Error: CUPTI might not be enabled or accessible.") print(f"Details: {e}") return False if __name__ == "__main__": if check_cupti_enabled(): print("CUPTI is properly configured in PyTorch.") else: print("CUPTI is not configured correctly. Check your CUDA installation.") ``` Sample output: ``` CUPTI is enabled and profiling works. 
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls --------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ sgemm_128x128x8_NN_vec 0.00% 0.000us 0.00% 0.000us 0.000us 2.086ms 100.00% 2.086ms 2.086ms 1 cudaFree 9.67% 9.816ms 9.67% 9.816ms 9.816ms 0.000us 0.00% 0.000us 0.000us 1 cudaDeviceGetAttribute 0.01% 10.000us 0.01% 10.000us 0.476us 0.000us 0.00% 0.000us 0.000us 21 cudaGetDriverEntryPoint 0.00% 1.700us 0.00% 1.700us 0.850us 0.000us 0.00% 0.000us 0.000us 2 cudaGetSymbolAddress 85.15% 86.438ms 85.15% 86.438ms 86.438ms 0.000us 0.00% 0.000us 0.000us 1 cudaMalloc 0.43% 433.300us 0.43% 433.300us 144.433us 0.000us 0.00% 0.000us 0.000us 3 cudaLaunchKernel 2.61% 2.648ms 2.61% 2.648ms 2.648ms 0.000us 0.00% 0.000us 0.000us 1 cudaDeviceSynchronize 2.13% 2.163ms 2.13% 2.163ms 2.163ms 0.000us 0.00% 0.000us 0.000us 1 --------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 101.511ms Self CUDA time total: 2.086ms CUPTI is properly configured in PyTorch. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454 Approved by: https://github.com/malfet	2025-02-06 15:58:20 +00:00
FEI	8a4dd763b8	[CCA] remove TODO for hardware_destructive_interference_size (#145591 ) @zyan0 @albanD @houseroad Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591 Approved by: https://github.com/albanD	2025-02-06 14:41:25 +00:00
Jack Zhang	ed309b9156	Re-add stft option to align window for center = false (#146379 ) Skips advancing the fc window on https://github.com/pytorch/pytorch/pull/145437, since I just found that there were non-trivial efforts to do so a while ago that eventually was reverted: https://github.com/pytorch/pytorch/pull/73434 Works around the issue by keeping the stft sans center overload Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379 Approved by: https://github.com/justinchuby, https://github.com/iseeyuan	2025-02-06 14:07:13 +00:00
PyTorch MergeBot	1b79d47635	Revert "[dynamo] check for incompatible configs (#146513 )" This reverts commit aab7925418be561a8af6adfcb8cf009a8786c31b. Reverted https://github.com/pytorch/pytorch/pull/146513 on behalf of https://github.com/atalman due to inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/13174131431/job/36772837627) [HUD commit link](`4a545eb85d`) ([comment](https://github.com/pytorch/pytorch/pull/146513#issuecomment-2639860568))	2025-02-06 13:42:25 +00:00
Animesh Jain	340cfe4f28	[dynamo][fbcode] Turn on inline_inbuilt_nn_modules (#145407 ) As title. Some internal testing at https://fb.workplace.com/groups/241460628989036/permalink/411650015303429/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/145407 Approved by: https://github.com/ezyang, https://github.com/jansel	2025-02-06 13:18:35 +00:00
PyTorch MergeBot	bd7d4fb2b5	Revert "[DTensor][Test] Create a simple unit test for tensordot (#146514 )" This reverts commit 1f8baf09ea598c97f30731ddb8328b6aa8d31fe9. Reverted https://github.com/pytorch/pytorch/pull/146514 on behalf of https://github.com/albanD due to The lint failures that you ignored are real right? ([comment](https://github.com/pytorch/pytorch/pull/146514#issuecomment-2639554636))	2025-02-06 11:26:43 +00:00
zeshengzong	4a545eb85d	Fix torch.nn.functional.one_hot param num_classes optional description (#146470 ) The `torch.nn.functional.one_hot` [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) describes param `num_classes` as not optional, but users can call the method without passing it. ![image](https://github.com/user-attachments/assets/4e6d4feb-691f-451f-95b5-4ac11bac7bc2) ```python >>> import torch >>> a = torch.arange(0, 5) % 3 # [0,1,2,0,1] >>> torch.nn.functional.one_hot(a) tensor([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]) ``` `num_classes` has a default value of -1 `93d98aca31/aten/src/ATen/native/native_functions.yaml (L6154-L6157)` ## Test Result ![image](https://github.com/user-attachments/assets/2c7203b7-6226-4ebc-84c8-cbf912fc48e2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146470 Approved by: https://github.com/albanD	2025-02-06 07:48:05 +00:00
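The default described in the entry above can be illustrated without PyTorch. The following is a minimal pure-Python sketch, not the library implementation: the helper name `one_hot` and the list-based representation are assumptions made for illustration. It treats `num_classes=-1` as "infer the class count as one more than the largest index", which matches the output shown in the commit message.

```python
def one_hot(indices, num_classes=-1):
    """Pure-Python illustration of one-hot encoding (hypothetical helper, not torch)."""
    if num_classes == -1:
        # Assumption for this sketch: infer the class count as one more
        # than the largest class index present in the input.
        num_classes = max(indices) + 1
    return [[1 if i == c else 0 for c in range(num_classes)] for i in indices]


a = [i % 3 for i in range(5)]  # [0, 1, 2, 0, 1], mirroring torch.arange(0, 5) % 3
encoded = one_hot(a)           # num_classes omitted, so it is inferred as 3
```

Passing `num_classes` explicitly, for example `one_hot([0, 1], num_classes=4)`, widens each row to the requested number of columns.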
Simon Fan	aab7925418	[dynamo] check for incompatible configs (#146513 ) internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/ Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time. Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513 Approved by: https://github.com/williamwen42	2025-02-06 07:39:52 +00:00
eqy	5f0901e573	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs, see also #120925 + fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-02-06 05:57:33 +00:00
Nikita Shulga	36c6e09528	[MPSInductor] Fix min/max for bfloat16 (#146552 ) By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max. Test by running `test_min_max_reduction` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552 Approved by: https://github.com/dcci	2025-02-06 05:15:00 +00:00
wz337	1f8baf09ea	[DTensor][Test] Create a simple unit test for tensordot (#146514 ) Fixes #ISSUE_NUMBER The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514 Approved by: https://github.com/tianyu-l	2025-02-06 05:09:34 +00:00
Michael Diggin	e01a5e9e1e	Small improvements to NJT matrix multiplies (#146405 ) Fixes #146404 Adds changes to the matmul and matmul_backward operation for nested jagged tensors, to support back propagation when the output is a regular strided tensor. This required adding support for the nested matmul operation to work when the nested tensor wasn't 'self', i.e `A@B` where `A` isn't nested but `B` is. The operation schemas had to be updated to reflect that either input can be a strided tensor instead (and the gradient), so an extra assertion is added in an edge case where neither input is nested. Unit tests are also added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405 Approved by: https://github.com/soulitzer, https://github.com/jbschlosser	2025-02-06 04:51:12 +00:00
bobrenjc93	389c5c0842	print out partial fx graph for all data-dependent errors (#146363 ) The previous implementation didn't catch the following type of errors ``` torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not extract specialized integer from data-dependent expression u2 (unhinted: u2). (Size-like symbols: none) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146363 Approved by: https://github.com/angelayi, https://github.com/bdhirsh ghstack dependencies: #146298, #146296	2025-02-06 04:21:34 +00:00
Michael Suo	425804db2b	[torch] fix exception types in custom class magic setattr/getattr (#146516 ) Summary: `c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`). Some Python functions like `hasattr` rely on the type of the thrown exception being correct. We don't need the full generality of those macros, so just do a targeted error type conversion here. Test Plan: added unit test Differential Revision: D69197217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516 Approved by: https://github.com/zdevito	2025-02-06 02:14:11 +00:00
Pian Pawakapan	3a6a203b98	[dynamic shapes][real tensor tracing] propagate unbacked hint when creating mod replacement (#146381 ) Fixes data-dependent errors for 2 PT2I models in draft export Pull Request resolved: https://github.com/pytorch/pytorch/pull/146381 Approved by: https://github.com/angelayi	2025-02-06 01:48:40 +00:00
Pian Pawakapan	c5062cca98	[export] make stack_trace optional in insert_custom_op_guards (#146438 ) Summary: Fixes 1 PT2I exportability error Test Plan: - Differential Revision: D69132186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146438 Approved by: https://github.com/yiming0416, https://github.com/angelayi	2025-02-06 01:48:26 +00:00
Nikita Shulga	6a985d8b2e	Make `inductor_utils.requires_gpu` accept MPS (#145156 ) Not yet ready to set HAS_GPU to true, but can unskip tests that require GPU (Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped) - Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, `test_mutable_custom_op_fixed_layout2`, otherwise the GPU tests are just running for _cpu suffixes. - Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0` - UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU - Add `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it does not need CPU nor GPU, but rather a triton backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156 Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel	2025-02-06 01:14:36 +00:00
Isalia20	0dc03134d9	[MPS] linalg solve implementation (#146531 ) Fixes #98222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-06 00:57:49 +00:00
Nikita Shulga	495049860b	[BE][Metal] Fix signed unsigned comparison warning (#146549 ) I wish I knew how to extract Metal warnings during JIT compilation but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie as `error:` stays `nil` unless shader compilation fails. But when it does following warnings are thrown ``` program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549 Approved by: https://github.com/dcci	2025-02-06 00:40:17 +00:00
PyTorch MergeBot	e0cf519ade	Revert "[inductor] Refactor op handlers part 2 (#146252 )" This reverts commit 13f0436abdff0386f33c7a8c25caa66e9af16dbd. Reverted https://github.com/pytorch/pytorch/pull/146252 on behalf of https://github.com/atalman due to Sorry need to revert, failing internally ([comment](https://github.com/pytorch/pytorch/pull/146252#issuecomment-2638305417))	2025-02-06 00:04:04 +00:00
Nikita Shulga	c7087d6b14	[BE][EZ][Metal] Do not pass tensor length as arg (#146522 ) As all devices capable of running Metal-2 support nonuniform threadgroup sizes, see https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf for more detail Pull Request resolved: https://github.com/pytorch/pytorch/pull/146522 Approved by: https://github.com/dcci ghstack dependencies: #146521	2025-02-06 00:03:41 +00:00
Nikita Shulga	54ef029532	[BE][EZ][Metal] Mark constant inputs as constant (#146521 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146521 Approved by: https://github.com/dcci	2025-02-06 00:03:41 +00:00
PyTorch MergeBot	2001066c61	Revert "[inductor] Refactor op handlers part 3 (#146254 )" This reverts commit 8e9bda8d895e80da0fe480d02e100bae8332ed57. Reverted https://github.com/pytorch/pytorch/pull/146254 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146254#issuecomment-2638300857))	2025-02-05 23:59:50 +00:00
Simon Fan	72405b0c0f	[ca] refactor compile reasons and log to tlparse (#146386 ) This PR accumulates compile reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change. sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 for compiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]" ] ``` for recompiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]", "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]" ] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386 Approved by: https://github.com/jansel ghstack dependencies: #146229	2025-02-05 23:33:21 +00:00
PyTorch MergeBot	68304dba7a	Revert "[inductor] Refactor op handlers part 4 (#146255 )" This reverts commit 7aced455c542f629ffcd4f79c6af259bb966add8. Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))	2025-02-05 23:24:20 +00:00
PyTorch MergeBot	49effa0deb	Revert "[inductor] Refactor op handlers part 5 (#146257 )" This reverts commit d3dd3eeb7f599a2816ba1a067a8fa5a1bb1c84c3. Reverted https://github.com/pytorch/pytorch/pull/146257 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146257#issuecomment-2638251994))	2025-02-05 23:20:38 +00:00
PyTorch MergeBot	93e1e6e07c	Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282 )" This reverts commit b8a529cca18ae4d21b1681c5ea3a40635aba5a83. Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))	2025-02-05 23:13:08 +00:00
PyTorch MergeBot	7dc5cfe2ad	Revert "[inductor] Refactor CaptureIndexing into global scope (#146297 )" This reverts commit 7288950bcd4c5851e003dded6ce87da643b93e49. Reverted https://github.com/pytorch/pytorch/pull/146297 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146297#issuecomment-2638234829))	2025-02-05 23:10:08 +00:00
PyTorch MergeBot	9555bfce88	Revert "[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 )" This reverts commit 84ba9c6e7844a0b457bc64ca70a9c8cf3655d03d. Reverted https://github.com/pytorch/pytorch/pull/146373 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146373#issuecomment-2638232033))	2025-02-05 23:07:08 +00:00
Yanan Cao (PyTorch)	8af31e30d7	[Codemod][AddExplicitStrictExportArg] caffe2/torch (#146439 ) Differential Revision: D69068432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146439 Approved by: https://github.com/avikchaudhuri	2025-02-05 22:56:54 +00:00
Catherine Lee	97b64f2e5c	Fix workflow for closing nonexistent disable issues (#146447 ) The workflow could not update issues because it didn't have permissions, and it looked green because it didn't check return codes. Tested by running the workflow and seeing that issues did get closed Fixes https://github.com/pytorch/pytorch/issues/145382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146447 Approved by: https://github.com/huydhn	2025-02-05 22:29:05 +00:00
Howard Huang	9b6d680131	Remove stage_index_to_group_rank from schedule (#146217 ) This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank ` and removes the `stage_index_to_group_rank ` argument from the `PipelineScheduleMulti` constructor Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217 Approved by: https://github.com/wconstab ghstack dependencies: #146193	2025-02-05 21:26:45 +00:00
Howard Huang	4ee7d0de86	Add generate_stage_to_rank_mapping utility (#146193 ) We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it. This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely. Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193 Approved by: https://github.com/wconstab	2025-02-05 21:26:45 +00:00
rzou	98b5d455fd	[opcheck] Improve error reporting; allow atol/rtol overrides (#146488 ) This PR improves opcheck to: 1. directly use torch.testing.assert_close (without a msg override). This allows it to print the absolute and relative differences and the number of mismatched elements. 2. take in an atol/rtol tolerance (for if someone just wants to use opcheck in their testing). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488 Approved by: https://github.com/williamwen42	2025-02-05 21:25:06 +00:00
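The tolerance check that `torch.testing.assert_close` applies, and that the entry above exposes through atol/rtol overrides, can be sketched in plain Python. This is an illustration only, not the opcheck code: the helper name `count_mismatches` is hypothetical, and the defaults used here are the documented `assert_close` defaults for `float32` (`rtol=1.3e-6`, `atol=1e-5`). An element passes when `|actual - expected| <= atol + rtol * |expected|`.

```python
def count_mismatches(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Return (mismatched count, greatest absolute difference) for two sequences.

    Hypothetical helper; mirrors the closeness rule
    |actual - expected| <= atol + rtol * |expected|.
    """
    mismatched = 0
    max_abs_diff = 0.0
    for a, e in zip(actual, expected):
        diff = abs(a - e)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol + rtol * abs(e):
            mismatched += 1
    return mismatched, max_abs_diff


strict = count_mismatches([1.0, 2.0], [1.0, 2.1])           # default tolerances
loose = count_mismatches([1.0, 2.0], [1.0, 2.1], atol=0.2)  # overridden atol
```

With the default tolerances the second element counts as a mismatch; raising `atol` to 0.2 lets it pass, which is the kind of override the PR describes.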
Justin Chu	1f6b566d74	[ONNX] Bump onnx and onnxscript versions in CI (#146097 ) Bump onnx onnxscript==0.1 in CI; Skipped onnxruntime 1.19 because it has regression on avgpool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146097 Approved by: https://github.com/malfet	2025-02-05 21:00:25 +00:00
Katarzyna Fojcik	9da376daa6	Add retain-output argument (#145921 ) This PR adds a retain-output argument, which enables appending to the output file if it already exists instead of deleting it and creating a new one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921 Approved by: https://github.com/jansel	2025-02-05 19:45:09 +00:00
Raymond Li	dd349207c5	Add check that envvar configs are boolean (#145454 ) So we don't get unexpected behavior when higher typed values are passed in Pull Request resolved: https://github.com/pytorch/pytorch/pull/145454 Approved by: https://github.com/c00w, https://github.com/jamesjwu	2025-02-05 19:40:10 +00:00
Anant Gulati	9091096d6c	Refactoring Distributed test cases to be device agnostic [1/n] (#145222 ) In this series of PRs we intend to refactor distributed test cases so that they are completely device agnostic. These changes will include the following approaches: - Allowing for multiple device types using instantiate_device_type_test - Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies - Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable - Replacing explicit calls to distributed backend (NCCL,HCCL,etc) with get_default_backend_for_device (#140536). This should result in a significant improvement in usability for all devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/145222 Approved by: https://github.com/kwen2501	2025-02-05 18:47:09 +00:00
eqy	6f7fda3f49	Bump `nn.functional.conv3d` tolerances for `test_comprehensive` (#135719 ) `float16` tolerance was previously set to `1e-5` which seemed very low Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719 Approved by: https://github.com/Chillee, https://github.com/albanD	2025-02-05 18:34:12 +00:00
Tugsbayasgalan Manlaibaatar	d2a2b9f8a7	Fix constants with non-functional operators (#145593 ) Previously, in the non-strict path, we always errored when trying to update a constant tensor in place, because those constant tensors are not actually wrapped by functional tensors. This is correct behaviour in torch.compile, because dynamo makes all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, there is no such step that registers constants as buffers, so AOTDispatcher panics when it sees these dangling constant tensors when functionalizing. Due to a recent change in the IR, this is no longer an issue in the non-strict path because we don't call AOTDispatcher at the training IR level, but now it is a problem for both strict and non-strict when we lower to inference. (Lowering to inference is very similar to non-strict tracing.) As a result, we have at least one external (https://github.com/pytorch/pytorch/issues/141336) and internal issues reported due to this difference. To fix this, there are two ways: 1. Make functionalization aware of constant tensors and map them to functional tensors on the fly. This makes the functionalization invariant uglier and could potentially open the gate to more nasty bugs. 2. Special handle this in export. This seems more aligned with what dynamo does today, so I think we should do it this way. I think the current state could benefit from more refactors to make run_decompositions more similar to strict export (because both of them now handle this constant registering logic), but it is a bit complicated to do now because the strict export version of this logic is also not complete: it doesn't take into account the export graph renaming pass, etc. I will follow up with more refactors after this PR (T213466691) to unblock users faster. For future reference: Why are we not doing "turning constants into non-persistent buffers and never de-register"?
The reason is that some internal models rely on module.to working reliably to move params/buffers to the correct device. As a result, buffers are moved while constants are not. In the composability meeting, we agreed that export won't do device agnostic tracing going forward (it will provide a way to specify FakeTensor in CPU that can be configured to be run on GPU), so after that is done, we can always turn constants into non-persistent buffers, which will simplify export's constant handling. Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593 Approved by: https://github.com/avikchaudhuri	2025-02-05 17:44:19 +00:00
Jeff Daily	44248c44eb	[ROCm] miopen benchmark behavior now better aligns with cudnn (#145294 ) The default benchmark setting is now false. The new miopen behavior means when benchmarking is disabled, for any shape that doesn't have a find hit, then it will do a quick search (same behavior as the prior default), and use that result. Now when benchmark is enabled, it will perform an exhaustive search and update any DBs. miopen immediate mode is still available and is used when deterministic is true and benchmark is false. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294 Approved by: https://github.com/BrianHarrisonAMD, https://github.com/malfet	2025-02-05 17:19:53 +00:00
PyTorch MergeBot	f27220e32a	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))	2025-02-05 16:39:37 +00:00
Jason Ansel	f55c0af37f	[inductor] Support non-power-of-2 cooperative RSPLIT (#145689 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145689 Approved by: https://github.com/eellison	2025-02-05 16:36:53 +00:00
maajidkhann	db22e9d5a2	Implement blend operation for float, double, int in VEC ATen backend for SVE (#146479 ) - Added support for SVE vectorized blend operation for float, double, int8_t, int16_t, int32_t and int64_t data types. - Utilizes SVE ACLE intrinsic (svcntb, svcntw, svcmpne, svsel) to handle different vector lengths (VL) dynamically. - Ensured compatibility with SVE128, SVE256, and SVE512 hardware configurations. - Enabled back blend SVE vec tests Testing: a) Float DType: ./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/0.Blend [Test Passed] on Graviton 3 machine (SVE256) ./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/0.Blend [Test Passed] on Graviton 4 machine (SVE128) b) Double DType: ./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/1.Blend [Test Passed] on Graviton 3 machine (SVE256) ./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/1.Blend [Test Passed] on Graviton 4 machine (SVE128) c)Int DType: python3 test/inductor/test_cpu_repro.py CPUReproTests.test_vec_remainder [Test Passed] on Graviton 3 machine (SVE256) and on Graviton 4 machine (SVE128) <img width="661" alt="grv4_test_case_passed" src="https://github.com/user-attachments/assets/5572fcc0-a861-4bd6-bf9e-356219ffe656" /> Fixes https://github.com/pytorch/pytorch/issues/146309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146479 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-05 16:29:13 +00:00
Zhengxu Chen	cd6c0707a8	[aoti] Assign proxy call args by name, and support default values. (#146263 ) Fixing the following issue when compiling the following program: ``` window = torch.hann_window(N_FFT).to(x.device) stft = torch.stft( x, N_FFT, HOP_LENGTH, window=window, return_complex=True ) magnitudes = stft[..., :-1].abs() ** 2 return magnitudes ``` ``` Traceback (most recent call last): File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor yield File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run self._callTestMethod(testMethod) File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper method(args, *kwargs) File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test return value(self) ^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft self.check_model(model, example_inputs) File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model actual = AOTIRunnerUtil.run( ^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run optimized = AOTIRunnerUtil.load(device, so_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load return torch._export.aot_load(so_path, device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device) # type: ignore[assignment, call-arg] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type 
for argument 1 but got as_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263 Approved by: https://github.com/angelayi	2025-02-05 15:43:05 +00:00
rzou	1bb977a2a4	[auto_functionalized] Support `Tensor(a!)[]?` (#145400 ) Summary: This is just updating some of the checks to allow the Tensor(a!)[]? type through. Fixes #144072 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145400 Approved by: https://github.com/laithsakka	2025-02-05 14:52:39 +00:00
PyTorch MergeBot	282d185ec1	Revert "[inductor] use ftz variant of exp (#146216 )" This reverts commit b0b3fe8bcf00f30513e9bb3e197ea4cbcc2beef0. Reverted https://github.com/pytorch/pytorch/pull/146216 on behalf of https://github.com/atalman due to inductor/test_op_completeness.py::TestOpCompleteness::test_triton_overrides [GH job link](https://github.com/pytorch/pytorch/actions/runs/13152430750/job/36702812599) [HUD commit link](`b0b3fe8bcf`) ([comment](https://github.com/pytorch/pytorch/pull/146216#issuecomment-2636961317))	2025-02-05 14:13:45 +00:00
Davide Italiano	8a2000fd42	[MPS] Implement support for zeta (both eager and inductor). (#146465 ) A test was failing in inductor (`test_pointwise_zeta`) -- and I realized the operation was missing also from eager. Implemented for both, leveraging the kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465 Approved by: https://github.com/malfet	2025-02-05 13:55:50 +00:00
Nichols A. Romero	fd0cd6a08f	[ROCm][TunableOp] Improve identification of fastest solution (#144942 ) This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300. Changes include: - An improved timer, StreamTimerNoSync - More aggressive skipping of slow solutions - Additional statistics that can be used for diagnostics PYTORCH_TUNABLEOP_VERBOSE=3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942 Approved by: https://github.com/jeffdaily	2025-02-05 11:16:49 +00:00
Simon Fan	e20b0c82d1	[ca] no longer require is_traceable annotations for c++ autograd functions (#146229 ) This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-02-05 08:49:17 +00:00
cyy	6293d1446b	[2/N] Remove NOLINT suppressions (#146402 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402 Approved by: https://github.com/soulitzer	2025-02-05 08:38:52 +00:00
bobrenjc93	e5ea7e9cdc	add support for capturing provenance of unary operations (#146413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-05 08:31:38 +00:00
Shunting Zhang	b0b3fe8bcf	[inductor] use ftz variant of exp (#146216 ) The exp op generated by Inductor is compiled by Triton into the following ptx snippet. ``` mul.f32 %f74, %f83, 0f3FB8AA3B; ex2.approx.f32 %f73, %f74; ``` But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as ``` mul.ftz.f32 %f2, %f1, 0f3FB8AA3B; ex2.approx.ftz.f32 %f3, %f2; ``` which uses the FTZ variant. Let Inductor generate the FTZ variant if the use_fast_math config is true. I see a 4% speedup for the two pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216 Approved by: https://github.com/jansel	2025-02-05 07:35:43 +00:00
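Both PTX snippets in the entry above multiply the input by the immediate `0f3FB8AA3B` before calling `ex2`, which computes a base-2 exponential. The following sketch decodes that immediate as an IEEE-754 single and shows that it is log2(e), so the pair of instructions evaluates exp(x) as 2^(x * log2(e)). The helper name `exp_via_ex2` is hypothetical, and the sketch runs in Python double precision: it models neither the `.approx` accuracy nor the `.ftz` (flush subnormals to zero) behavior.

```python
import math
import struct

# Hex immediate taken from the PTX in the commit message (0f3FB8AA3B).
LOG2E_BITS = 0x3FB8AA3B

# Reinterpret the 32 bits as a big-endian IEEE-754 single-precision float.
log2e_f32 = struct.unpack(">f", LOG2E_BITS.to_bytes(4, "big"))[0]


def exp_via_ex2(x):
    """Evaluate exp(x) the way the PTX does: scale by log2(e), then take 2**y."""
    return 2.0 ** (x * log2e_f32)


print(log2e_f32, math.log2(math.e))
print(exp_via_ex2(1.0), math.e)
```

The two printed pairs agree to roughly single-precision accuracy, which is what one would expect from a constant stored as a 32-bit float.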
clr	93d98aca31	inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122 ) If a nn.module getattr call throws, we should make sure that we don't crash with an internal error Note that I couldn't figure out how to test this, so advice would be awesome. I have my best case attempt at https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122 Approved by: https://github.com/jansel	2025-02-05 05:49:32 +00:00
Angela Yi	eb832b7bcc	[export] Fix draft-export logging (#146106 ) Summary: Fix issue where the lazyTraceHandler does not exist Test Plan: CI Differential Revision: D68928070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146106 Approved by: https://github.com/yiming0416	2025-02-05 05:49:22 +00:00
PyTorch MergeBot	f242da41c7	Revert "move and fix logic to update unbacked bindings (#146115 )" This reverts commit 0144613e6ff6e018ca41085d1509dcceb80987f7. Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2635695958))	2025-02-05 04:51:39 +00:00
cyy	c6ea4425e5	Enable some tests on Windows (#146243 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146243 Approved by: https://github.com/albanD	2025-02-05 03:54:28 +00:00
PyTorch MergeBot	f35e60b21c	Revert "[cutlass backend] fix bug for accuminator dtype (#146356 )" This reverts commit 7c8ec84dab7dc10d4ef90afc93a49b97bbd04503. Reverted https://github.com/pytorch/pytorch/pull/146356 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some slow cutlass tests are failing ([comment](https://github.com/pytorch/pytorch/pull/146356#issuecomment-2635594712))	2025-02-05 03:01:50 +00:00
PyTorch MergeBot	3c0d2bc262	Revert "[Testing] Reduce `test_exp` flakiness (#146436 )" This reverts commit 4c5a9a5f949ef3019fc3ef095034ccfc973ff13d. Reverted https://github.com/pytorch/pytorch/pull/146436 on behalf of https://github.com/huydhn due to Some test_exp2 starts failing in trunk I think ([comment](https://github.com/pytorch/pytorch/pull/146436#issuecomment-2635591878))	2025-02-05 02:58:53 +00:00
Nikita Shulga	aafaf4016f	[MPS] Add error checking when dispatching kernel (#146458 ) Check that the thread-group size does not exceed the maximum thread-group size. Add a regression test to validate that. This makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458 Approved by: https://github.com/dcci	2025-02-05 02:56:40 +00:00
Ting Lu	9e45bc82e9	[aarch64] CUDA 12.8 aarch64 builds to nightly binaries (#146378 ) https://github.com/pytorch/pytorch/issues/145570 Adding Cuda 12.8 and keeping 12.6 for the sbsa build, supported CUDA_ARCH: 9.0, 10.0, 12.0 Refactor the binaries matrix for cuda sbsa build. Previously cuda-aarch64 was hardcoded to cuda 12.6. Now reads 12.6 and 12.8, new build naming example [manywheel-py3_9-cuda-aarch64-12_8-build](https://github.com/pytorch/pytorch/actions/runs/13132625006/job/36640885079?pr=146378#logs) TODO: once 12.8 is stable, remove 12.6 in sbsa Pull Request resolved: https://github.com/pytorch/pytorch/pull/146378 Approved by: https://github.com/atalman	2025-02-05 02:55:21 +00:00
Nikita Shulga	001ad5bef5	[MPSInductor] Scope-down test_prod running in MPS (#146460 ) Multi-stage reductions are not yet implemented, but the original `test_prod` just returned 0 for large reductions, so failures were reported as flaky ones. If one runs the same test with `MTL_DEBUG_LAYER=1`, the failure is obvious: ``` 2025-02-04 11:51:30.034 Python[16594:289093] Metal API Validation Enabled test_prod (__main__.MPSBasicTests.test_prod) ... -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1266: failed assertion `(threadsPerThreadgroup.width(1) * threadsPerThreadgroup.height(2050) * threadsPerThreadgroup.depth(1))(2050) must be <= 1024. (device threadgroup size limit)' ``` Fixes https://github.com/pytorch/pytorch/issues/146430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146460 Approved by: https://github.com/dcci	2025-02-05 01:47:01 +00:00
Aaron Gokaslan	52aaadf379	[BE][Ez]: Enable ruff rule E731. use `def` instead of anonymous lambda (#146410 ) Not sure why this isn't enabled, only 1 fix is needed and it supports autofixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146410 Approved by: https://github.com/aorenste, https://github.com/albanD	2025-02-05 01:44:41 +00:00
Bert Maher	0e060342b6	[triton] Update pin to tip of 3.2 release (#145867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867 Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte, https://github.com/jansel	2025-02-05 01:42:33 +00:00
Michael Lazos	616ac94175	[Dynamo] Fix spammy optimizer warning (#146374 ) Fixes https://discuss.pytorch.org/t/torch-compile-optimizer-step-generates-excessive-warning-messages/216067/7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146374 Approved by: https://github.com/anijain2305	2025-02-05 01:03:49 +00:00
Haifeng Jin	8177fc4d33	Make regex error catching compatible with Python 3.12+. (#145945 ) In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object". The old regex would no longer catch the error. This PR makes it compatible with Python 3.12 while remaining backward compatible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945 Approved by: https://github.com/H-Huang	2025-02-05 00:57:36 +00:00
Henry Tsang	9d5bf38dec	[cpp_builder] refactor to reduce libcudart_static logs (#146394 ) Want to reduce logs from `log_msg = f'"libcudart_static.a" not found under {path}'`, which was added in https://github.com/pytorch/pytorch/pull/142175 Differential Revision: D69096354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146394 Approved by: https://github.com/benjaminglass1, https://github.com/chenyang78	2025-02-05 00:41:30 +00:00
PyTorch MergeBot	658e22d495	Revert "add support for capturing provenance of unary operations (#146413 )" This reverts commit bc33d993acdff2637bc6aee5e604fb969b11fc13. Reverted https://github.com/pytorch/pytorch/pull/146413 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but some export tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/146413#issuecomment-2635440261))	2025-02-05 00:32:40 +00:00
Angela Yi	6e03f4f90e	[export] Include metadata in FlatArgsAdapter (#146107 ) Summary: With https://github.com/pytorch/pytorch/pull/145956, which introduces storing a list of namedtuple field names when serializing, we now want to expose this list to the args adapter so that APS can utilize this information and remove extraneous inputs. Test Plan: No-op Differential Revision: D68928416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146107 Approved by: https://github.com/pianpwk	2025-02-05 00:29:58 +00:00
Jason Ansel	84ba9c6e78	[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282, #146297	2025-02-04 23:36:44 +00:00
Jason Ansel	7288950bcd	[inductor] Refactor CaptureIndexing into global scope (#146297 ) And inline SimplifyIndexing into CaptureIndexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282	2025-02-04 23:36:44 +00:00
Jason Ansel	b8a529cca1	[inductor] Minor compile time optimizations in DefaultHandler (#146282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257	2025-02-04 23:36:34 +00:00
Jason Ansel	d3dd3eeb7f	[inductor] Refactor op handlers part 5 (#146257 ) This makes OpHandler just a normal class using inheritance, and removes typing workarounds that were needed because it wasn't. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255	2025-02-04 23:36:25 +00:00
Jason Ansel	7aced455c5	[inductor] Refactor op handlers part 4 (#146255 ) This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2. Some compile time wins from this as well: ``` 2025-02-02T19:46:32.2033010Z 2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2037575Z 2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones 2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50% 2025-02-02T19:46:32.2040131Z 2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2042188Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254	2025-02-04 23:36:17 +00:00
Jason Ansel	8e9bda8d89	[inductor] Refactor op handlers part 3 (#146254 ) Fixes type errors that arise from typing `V.ops`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252	2025-02-04 23:36:09 +00:00
Jason Ansel	13f0436abd	[inductor] Refactor op handlers part 2 (#146252 ) This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252 Approved by: https://github.com/yanboliang ghstack dependencies: #146225, #146226, #146235	2025-02-04 23:36:01 +00:00
Jason Ansel	67be5953fe	[inductor] Refactor op handlers part 1 (#146235 ) This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps. Interestingly this is a small compile time win: ``` ... WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50% WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50% WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226	2025-02-04 23:35:53 +00:00
Jason Ansel	ed03f9ca10	[inductor] Refactor CSEProxy into global scope (#146226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226 Approved by: https://github.com/shunting314 ghstack dependencies: #146225	2025-02-04 23:35:43 +00:00
Jason Ansel	5cac550ddf	[inductor] Finish typing common.py (#146225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225 Approved by: https://github.com/Skylion007	2025-02-04 23:35:33 +00:00
Henry Tsang	7c8ec84dab	[cutlass backend] fix bug for accuminator dtype (#146356 ) Will add unit tests for accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356 Approved by: https://github.com/Chillee	2025-02-04 22:10:17 +00:00
Sam Larsen	13e17aa106	Make the CUTLASS swizzle options configurable and default to 2. (#146088 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146088 Approved by: https://github.com/henrylhtsang, https://github.com/mlazos	2025-02-04 22:07:26 +00:00
Aidyn-A	aac0577796	[TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398 ) We have noticed a discrepancy between the ways `test_sparse_semi_structured.py` is invoked. With some invocations the test fails spuriously because it attempts to run on the wrong backend. This happens because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS` as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart `8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)` When I run tests via pytest, just by sheer luck it calls `test_values_backend_cutlass_cuda` which sets the backend to CUTLASS `bb4bd5f00b/test/test_sparse_semi_structured.py (L475)` before `test_conversions_all_patterns_cuda_`: ``` test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s] [ 72%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s] [ 73%] ``` In this scenario everything is good. But when run as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is different, and the cuSparseLt backend is set just before `test_conversions_all_patterns_cuda_` runs, which causes failures: ``` test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok ... test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... 
FAIL test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398 Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy	2025-02-04 22:07:12 +00:00
Benjamin Glass	317dae95fa	cpp_wrapper: fix CPU cpp_wrapper and max-autotune tests (#145683 ) Both of these tests mostly failed due to incorrect assumptions about the generated code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145683 Approved by: https://github.com/desertfire ghstack dependencies: #145095, #145654, #145655	2025-02-04 22:05:59 +00:00
Benjamin Glass	e2a029054d	cpp_wrapper: enable all CPU repro tests (#145655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655 Approved by: https://github.com/desertfire ghstack dependencies: #145095, #145654	2025-02-04 22:05:59 +00:00
Benjamin Glass	9873319a42	cpp_wrapper: fix set_.source_Tensor lowering (#145654 ) Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654 Approved by: https://github.com/desertfire ghstack dependencies: #145095	2025-02-04 22:05:59 +00:00
Benjamin Glass	7c0fe7a045	cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095 ) Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095 Approved by: https://github.com/desertfire	2025-02-04 22:05:58 +00:00
Davide Italiano	09b0dfdc90	[metal] Add a missing cast to make the call to copysign unambiguous. (#146422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422 Approved by: https://github.com/Skylion007, https://github.com/Samkm0084	2025-02-04 22:04:25 +00:00
clr	4e194bbfd6	dynamo: fsdp throw unimplemented vs attribute error (#146188 ) Rather than throw a full exception for fsdp, instead just return unimplemented, and respect the user options (i.e. fullgraph, vs graph break). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146188 Approved by: https://github.com/jansel	2025-02-04 21:45:55 +00:00
Nikita Shulga	4c5a9a5f94	[Testing] Reduce `test_exp` flakiness (#146436 ) By setting `reference_in_float` to false, as `exp(a + b)` could yield significantly different results than `exp(a.half()+b.half())`, as one can see in the following example (whose inputs happen to be the random values generated by the macOS RNG for this test) ``` >>> import torch >>> x=torch.tensor(2.5599, dtype=torch.half) >>> y=torch.tensor(0.6970, dtype=torch.half) >>> (x + y).exp() tensor(26., dtype=torch.float16) >>> (x.float() + y.float()).exp() tensor(25.9799) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146436 Approved by: https://github.com/dcci	2025-02-04 21:24:08 +00:00
bobrenjc93	bc33d993ac	add support for capturing provenance of unary operations (#146413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-04 21:16:15 +00:00
Yanbo Liang	07b9fe0690	[Trace PyDispatcher] Add CustomFunctionHigherOrderOperatorVariable (#146272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146272 Approved by: https://github.com/zou3519 ghstack dependencies: #146270, #146271	2025-02-04 20:55:51 +00:00
bobrenjc93	d23e4f8109	use DTRACE_ENV_VAR as the trace logs directory if set (#146412 ) ``` (/home/bobren/local/a/pytorch-env) [7:47] devgpu035:/home/bobren/local/a/pytorch TORCH_DTRACE=/tmp/bb python r1.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146412 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-04 20:54:28 +00:00
Aaron Gokaslan	7f65a20884	[BE]: Enable ruff SLOT checks (#146276 ) This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so that it doesn't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276 Approved by: https://github.com/aorenste	2025-02-04 19:18:23 +00:00
Nikita Shulga	3525b834f0	[MPSInductor] Implement `argmax`/`argmin` (#146429 ) TODOs: - Find test with NaN - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429 Approved by: https://github.com/dcci ghstack dependencies: #146423, #146428	2025-02-04 19:16:06 +00:00
bobrenjc93	c591ad0c03	dump partial fx graph to stderr when dynamo tracing fails with guard on data-dependent (#146296 ) As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging. The following code produces a data dependent error ``` import torch from torch import nn # UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)). (Size-like symbols: u0) class Repro(nn.Module): def __init__(self): super().__init__() def forward(self, cache, update, pos): _, _, max_seq_len, _ = cache.shape _, _, seqlen, _ = update.shape pos_item = pos[0].item() # u0 torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 torch._check(pos_item >= 0) before = cache.narrow(2, 0, pos_item) # FAIL # Laith: why can't we make unbacked expressions size-like? after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # PASS end = torch.tensor(max_seq_len - pos_item - seqlen).item() after = cache.narrow(2, (pos_item + seqlen), end) return torch.cat([before, update, after], dim=2) repro = Repro() bsz = 1 n_heads = 4 max_seq_len = 512 head_dim = 64 seqlen = 5 pos_item = 1 cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim) update = torch.ones(bsz, n_heads, seqlen, head_dim) pos = torch.tensor([pos_item]) example_inputs = (cache, update, pos) torch.export.export(repro, example_inputs) ``` This is what it now prints out ``` class GraphModule(torch.nn.Module): def forward(self, L_cache_: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", L_update_: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", L_pos_: "i64[1][1]cpu"): l_cache_ = L_cache_ l_update_ = L_update_ l_pos_ = L_pos_ # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0 getitem: "i64[][]cpu" = l_pos_[0]; l_pos_ = None item: "Sym(u0)" = getitem.item(); getitem = None # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 add: 
"Sym(u0 + 5)" = item + 5 le: "Sym(u0 + 5 <= 512)" = add <= 512; add = None _check = torch._check(le); le = _check = None # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0) ge: "Sym(u0 >= 0)" = item >= 0 _check_1 = torch._check(ge); ge = _check_1 = None # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item) before: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = l_cache_.narrow(2, 0, item); before = None # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) add_1: "Sym(u0 + 5)" = item + 5 sub: "Sym(512 - u0)" = 512 - item; item = None sub_1: "Sym(507 - u0)" = sub - 5; sub = None narrow_1 = l_cache_.narrow(2, add_1, sub_1); l_cache_ = add_1 = sub_1 = narrow_1 = None Traceback (most recent call last): File "/data/users/bobren/a/pytorch/torch/_dynamo/utils.py", line 3075, in run_node return getattr(args[0], node.target)(args[1:], kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl output = self._dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl decomposition_table[func](args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward return self.as_strided(sizes, strides, storage_offset) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File 
"/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl entry = self._make_cache_entry(state, key, func, args, kwargs, output) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry output_info = self._get_output_info_for_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry synth_output = self._output_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry return self._get_output_tensor_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry empty.set_(storage, storage_offset, shape, stride) File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious r = self.shape_env.evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper return retlog(fn(args, **kwargs)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr return self._evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr raise self._make_data_dependent_error( torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)). 
(Size-like symbols: u0) Caused by: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # r1.py:21 in forward (utils/_stats.py:27 in wrapper) For more information, run with TORCH_LOGS="dynamic" For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0" If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146296 Approved by: https://github.com/zou3519 ghstack dependencies: #146298	2025-02-04 19:12:39 +00:00
bobrenjc93	8f861a8dfb	[experimental] filter logs by subgraph (#146047 ) ``` TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[1/0]" python r4.py ``` ``` TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[0/0],[1/0_1]" python r4.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146047 Approved by: https://github.com/laithsakka	2025-02-04 19:11:44 +00:00
Nikita Shulga	7d60235aa6	[Metal] Small speedup for `sum`/`prod` (#146428 ) As they can not really be invoked over empty arrays Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #146423	2025-02-04 19:10:33 +00:00
Nikita Shulga	b1663b31e1	[Metal][BE] Add `#pragma once` to all headers (#146423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-04 19:10:33 +00:00
Aaron Gokaslan	292af3cc89	[BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408 ) Apply ruff rule about implicit string concatenation, this autofixes strings that are all the same type and on the same line. These lines are broken up likely as the result of autoformatters in the past. All fixes are automated using the autofixes in ISC001. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2025-02-04 19:07:04 +00:00
rzou	f38a2ea0d4	[Dynamo] Better unsupported message for Fake Tensor Exception (#146357 ) I cannot repro this. But this line shows up in internal logs, and I want to know what the exception is and the context inside it. All of the exceptions_allowed_to_be_fallback are dataclasses, so they should print nicely. Test Plan: - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/146357 Approved by: https://github.com/williamwen42	2025-02-04 18:52:11 +00:00
Yidi Wu	b0fe975521	[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 ) Before this PR, we were getting an undefined symbol error in the output code when an unbacked symint was only used in the hop, because we didn't correctly record the dependency on the unbacked symbols for hops and the symbol got DCEed accidentally. This PR adds the symbol arguments to `constant_args`, so that the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456 Approved by: https://github.com/desertfire	2025-02-04 18:47:34 +00:00
albanD	157d81c201	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline doc for their exact semantics. - Use the newly added isBuilt for the accelerator check to ensure it does not poison fork. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-02-04 18:23:24 +00:00
Sam Larsen	23fffb54d5	Use OrderedSet in _functorch/partitioners (#146102 ) In an attempt to make partitioning more deterministic, change all sets in partitioners.py to OrderedSets. Note that this change does not fix the non-determinism we're seeing in the internal model. But let's at least eliminate this potential source of non-determinism before investigating any changes to the mincut approach? Pull Request resolved: https://github.com/pytorch/pytorch/pull/146102 Approved by: https://github.com/oulgen	2025-02-04 17:43:07 +00:00
Bin Bao	53759ccca8	[AOTI] Fix an unaligned memory access issue in mm_template (#146293 ) Summary: Fixes a corner case in the Triton MM template, where the dimension M (dynamic size) being smaller than BLOCK_M (and similarly for the N dimension) can trigger an unaligned memory access error. Differential Revision: D69034578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146293 Approved by: https://github.com/chenyang78, https://github.com/jansel	2025-02-04 17:12:04 +00:00
nikitaved	87a63a9886	Add `@nikitaved` to torch.linalg `CODEOWNERS/persons_of_interest` (#141803 ) As per title. I hope there is no objection :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141803 Approved by: https://github.com/albanD	2025-02-04 16:11:31 +00:00
Jason Ansel	e9f6e273e7	[inductor] Add typing to common.CSE (#145993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993 Approved by: https://github.com/yanboliang ghstack dependencies: #145916	2025-02-04 16:05:39 +00:00
Jason Ansel	7a5239afd7	[inductor] Add typing to common.KernelArgs (#145916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916 Approved by: https://github.com/yanboliang	2025-02-04 16:05:39 +00:00
Nikita Shulga	5d81bc3696	[MPSInductor] Implement `prod` reduction (#146396 ) Mostly reusing `sum` reduction logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396 Approved by: https://github.com/dcci ghstack dependencies: #146369, #146370, #146380, #146389	2025-02-04 14:08:04 +00:00
Nikita Shulga	bbe95341d9	[MPSInductor] Implement `min` and `max` reductions (#146389 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370, #146380	2025-02-04 14:04:10 +00:00
PyTorch MergeBot	106acf0eec	Revert "[aoti] Assign proxy call args by name, and support default values. (#146263 )" This reverts commit 11f69808c64a65c68a4452250ba7719dcff27c78. Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))	2025-02-04 12:57:55 +00:00
Nichols A. Romero	e0f22e54e8	[ROCm][TunableOp] Support leading dimensions in TunableOp signature. (#146358 ) This is a feature enhancement that: - May improve performance by distinguishing GEMMs with different leading dimensions. - Fixes correctness issues reported by users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146358 Approved by: https://github.com/jeffdaily	2025-02-04 10:27:43 +00:00
cyy	3f63f2bced	Use std::string_view in tests (#146120 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146120 Approved by: https://github.com/albanD	2025-02-04 09:51:36 +00:00
Angela Yi	8444fe019a	[export] Fix requires_grad deserialization (#146351 ) Test Plan: CI Differential Revision: D69072095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146351 Approved by: https://github.com/zhxchen17	2025-02-04 08:02:38 +00:00
Davide Italiano	bb4bd5f00b	[Metal][BE] Fix the arguments of `polygamma` (#146382 ) In the public API, order comes before input, while here they're reversed. Match for consistency (and make this less error prone). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382 Approved by: https://github.com/jansel, https://github.com/malfet	2025-02-04 06:40:34 +00:00
Nikita Shulga	54ceb7c565	[MPSInductor] Add support for `sum` reduction (#146380 ) - Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses a barrier to compute the reductions TODOs: - Implement efficient reduction using cooperative functions such as `simd_shuffle_down` - Figure out how to merge several sum reductions together - Implement `reduction_store` that will only write results from the first thread Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370	2025-02-04 06:23:44 +00:00
cyy	1c16cf70c3	Apply ruff fixes to tests (#146140 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146140 Approved by: https://github.com/albanD	2025-02-04 05:41:01 +00:00
cyy	71e3575525	Remove unactivated test (#146233 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146233 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-04 05:26:04 +00:00
Brian Hirsh	e68f5087d8	update _unsafe_set_version_counter to accept lists of tensors (#137921 ) See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without paying the overhead of going from Python to C++ separately for every tensor in the list. I left the binding in pybind, and used a `std::vector`. If we really need to reduce overhead even further, we could write a manual CPython binding. I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921 Approved by: https://github.com/awgu, https://github.com/albanD	2025-02-04 04:51:11 +00:00
Sheng Fu	425aca40a4	Fix random crash in PyPer (#146327 ) Summary: PyPer saw random crashes when writing into the ET file. This diff checks that the output file is in good condition before writing into it, and catches the exception if something goes wrong instead of crashing. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA Differential Revision: D69065509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327 Approved by: https://github.com/sraikund16	2025-02-04 04:50:40 +00:00
angelayi	0c37c332da	[export] Additionally save pytree namedtuple field names (#145956 ) If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context="class_fqn", children_spec=[, ])` The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field? This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able to just drop those additional fields and the program will run successfully. This is needed/used in APS where they use unflattener's adapter to adapt the inputs based on the previously saved treespecs. There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956 Approved by: https://github.com/zhxchen17	2025-02-04 04:42:30 +00:00
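The side-table idea in the commit above can be illustrated with a minimal, standalone sketch. This is not the code from the PR: the table layout, the `adapt_namedtuple` helper, and the `my_pkg.Point` name are all hypothetical, and only the standard library is used. It shows how a saved list of field names lets an adapter drop fields that were added to the class after the program was saved.

```python
from collections import namedtuple

# Hypothetical side table written at save time:
# fully qualified namedtuple type name -> field names known at that time.
saved_field_names = {"my_pkg.Point": ["x", "y"]}


def adapt_namedtuple(value, class_fqn, table):
    """Return only the values of the fields recorded at save time, in saved order."""
    fields = table[class_fqn]
    missing = [f for f in fields if not hasattr(value, f)]
    if missing:
        raise ValueError(f"{class_fqn} is missing saved fields: {missing}")
    return [getattr(value, f) for f in fields]


# In the new environment the class has gained an extra field `z`.
Point = namedtuple("Point", ["x", "y", "z"])
flat_inputs = adapt_namedtuple(Point(1, 2, 3), "my_pkg.Point", saved_field_names)
print(flat_inputs)
```

The extra field `z` is ignored because it is absent from the saved table, while a field that was saved but no longer exists raises an error instead of being silently skipped.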
Animesh Jain	487400f47f	[dynamo] Support functools.partial variables through inspect.signature (#146339 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146339 Approved by: https://github.com/jansel ghstack dependencies: #146322, #146116	2025-02-04 04:39:39 +00:00
Justin Chu	9756c7d788	[benchmark] Remove ONNX (#146325 ) The ONNX exporter experiments in benchmark are obsolete and unmaintained. This PR removes them to unblock https://github.com/pytorch/pytorch/pull/146003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325 Approved by: https://github.com/titaiwangms	2025-02-04 04:02:47 +00:00
Doru Bercea	a79d8f8ba4	[ROCm] Tune 3d tensor sums when not using fastest dimension (#146170 ) Tune 3d tensor sums when not using fastest dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146170 Approved by: https://github.com/jeffdaily	2025-02-04 04:02:16 +00:00
David Berard	7997ecf809	[BE] reduce log spew from test_triton_kernels.py (#145895 ) One of the tests in this file was setting `self._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file. This PR puts the log-setting in a context manager so that the old behavior is restored afterwards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895 Approved by: https://github.com/nmacchioni	2025-02-04 03:44:23 +00:00
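The pattern described in the commit above, scoping a logging change so that the previous state is restored afterwards, can be sketched with the standard library alone. This is an illustration under assumptions, not the test's actual code: `temporary_log_level` and the logger name are hypothetical, and the stdlib `logging` module stands in for the PyTorch logging settings the commit refers to.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_log_level(logger, level):
    """Set `level` on `logger` for the duration of the block, then restore the old level."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the block raises, so later tests see the original level.
        logger.setLevel(previous)


log = logging.getLogger("example.output_code")  # hypothetical logger name
log.setLevel(logging.WARNING)

with temporary_log_level(log, logging.DEBUG):
    inside = log.level  # verbose only inside the block

after = log.level  # back to the level set before the block
```

Putting the restore step in `finally` is what keeps one noisy test from changing the log output of every test that runs after it.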
Animesh Jain	5f53889850	[dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116 Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #146322	2025-02-04 03:36:07 +00:00
Ke Wen	762a05b3b3	[DCP] Remove all-gather of state dict keys (#145998 ) The original `_all_gather_keys` call was for a safety check, but could be costly as things scale, and it blocks CPU. Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys, otherwise the API may hang. In addition, we move the check to a utility function: `utils.assert_same_keys`. Users uncertain whether their state dicts share the same keys can optionally call this API to check. Resolves #145965 (as a workaround). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998 Approved by: https://github.com/mhorowitz, https://github.com/fegin	2025-02-04 03:16:13 +00:00
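As a rough illustration of the kind of check the commit above makes optional, the sketch below compares key sets that have already been collected from several ranks. It is a simplification under assumptions: `check_same_keys` is a hypothetical helper, not `utils.assert_same_keys`, and the per-rank keys are passed in as plain lists rather than gathered with a collective.

```python
def check_same_keys(keys_per_rank):
    """Raise AssertionError if any rank's key set differs from rank 0's."""
    reference = set(keys_per_rank[0])
    for rank, keys in enumerate(keys_per_rank[1:], start=1):
        diff = reference.symmetric_difference(keys)
        if diff:
            raise AssertionError(
                f"rank {rank} differs from rank 0 on keys: {sorted(diff)}"
            )


# Simulated keys from three ranks; in a real job these would come from a collective.
matching = [["model.weight", "model.bias"]] * 3
check_same_keys(matching)  # passes silently

mismatched = [["model.weight", "model.bias"], ["model.weight"]]
try:
    check_same_keys(mismatched)
except AssertionError as err:
    message = str(err)
```

Gathering the keys is the costly, blocking step, which is why the commit moves the check out of the default load path and leaves it to callers who want the extra safety.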
PyTorch MergeBot	7f796eb8b7	Revert "[inductor] Add typing to common.KernelArgs (#145916 )" This reverts commit 68cf36d5ab6165372160f65eb84e13d0f8dbc5dc. Reverted https://github.com/pytorch/pytorch/pull/145916 on behalf of https://github.com/atalman due to Failing internally, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/145916#issuecomment-2632715678))	2025-02-04 03:07:12 +00:00
PyTorch MergeBot	d3c7e4bb9c	Revert "[inductor] Add typing to common.CSE (#145993 )" This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7. Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))	2025-02-04 03:04:01 +00:00
PyTorch MergeBot	ecbc725fad	Revert "[inductor] Finish typing common.py (#146225 )" This reverts commit 3a67c0e48d29578aeeaa872275e730020bb5cbc2. Reverted https://github.com/pytorch/pytorch/pull/146225 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146225#issuecomment-2632709707))	2025-02-04 03:01:36 +00:00
PyTorch MergeBot	0061eb5b70	Revert "[inductor] Refactor CSEProxy into global scope (#146226 )" This reverts commit 18380ab877711f2e651c69c78675f0d0b31d2ceb. Reverted https://github.com/pytorch/pytorch/pull/146226 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146226#issuecomment-2632707618))	2025-02-04 02:58:50 +00:00
cyy	f397c72697	Remove NOLINTNEXTLINE (#146238 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238 Approved by: https://github.com/albanD	2025-02-04 02:45:32 +00:00
Nikita Shulga	5451c9b7c9	[MPSInductor] Add support for any reduction (#146370 ) - Add `_new_accvar` function that creates a threadgroup variable - As threadgroup variables can not be initialized in place, add explicit initialization for reduction var Pull Request resolved: https://github.com/pytorch/pytorch/pull/146370 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #146369	2025-02-04 02:45:03 +00:00
Nikita Shulga	71179772cd	[MPSInductor] Prep change for reduction support (#146369 ) Add `group_pos` parameter and set `group_size` when invoking reduction kernels. Separate loads and stores, and insert a threadgroup barrier if a reduction is in place. Should be a no-op right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146369 Approved by: https://github.com/dcci, https://github.com/jansel	2025-02-04 02:38:07 +00:00
Henry Tsang	3dcbd04d1d	[cutlass backend] Add instantiation level for generating configs (#146230 ) Pass the instantiation level through to generate more configs. I do see some C++ compilation errors, but running is fine. Using 2222 generates 1k+ configs. Differential Revision: D68989194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146230 Approved by: https://github.com/Chillee, https://github.com/mlazos	2025-02-04 02:36:04 +00:00
bobrenjc93	0e49f35e3d	Integrate sympy expression provenance logging with structured logs (#145848 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145848 Approved by: https://github.com/angelayi	2025-02-04 01:21:37 +00:00
Aaron Orenstein	4168982dad	PEP585: .github release triggers (#145708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145708 Approved by: https://github.com/malfet	2025-02-04 01:02:46 +00:00
Davide Italiano	cf6c5b8fa8	[mps/inductor] Adjust more tests that expect float64 as input. (#146366 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146366 Approved by: https://github.com/malfet	2025-02-04 00:48:02 +00:00
PyTorch MergeBot	2f40f789da	Revert "[inductor] Refactor op handlers part 1 (#146235 )" This reverts commit 204be4e0a2e4509bd2457bfb295c429dd92c241f. Reverted https://github.com/pytorch/pytorch/pull/146235 on behalf of https://github.com/atalman due to Breaks lint, sorry: Definition of polygamma in base class MetalOverrides is incompatible with definition in base class OpsHandler. Please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/146235#issuecomment-2632444514))	2025-02-04 00:00:08 +00:00
Stas Bekman	3aeccf2a28	DeepSpeed github repo move sync (#146320 ) DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed This PR updates this repo to use the new URL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320 Approved by: https://github.com/awgu	2025-02-03 23:20:49 +00:00
Jason Ansel	204be4e0a2	[inductor] Refactor op handlers part 1 (#146235 ) This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps. Interestingly this is a small compile time win: ``` ... WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50% WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50% WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226	2025-02-03 23:15:13 +00:00
Jason Ansel	18380ab877	[inductor] Refactor CSEProxy into global scope (#146226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226 Approved by: https://github.com/shunting314 ghstack dependencies: #146225	2025-02-03 23:15:13 +00:00
Natalia Gimelshein	0bc036a9e9	use copy2d in h2d/d2h copy when possible (#146256 ) A rewrite of #138964 In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964: 1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d 2) copy2d should record even for the host pinned memory, like the regular copy does 3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking In this PR copy2d behaves in exactly the same way as copy does with respect to those additional syncs, except it calls a different underlying cuda call. Tests are added for multiple cases that go through copy2d and for cases that avoid copy2d because the conditions are not satisfied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 23:07:54 +00:00
Henry Tsang	35af193408	[easy] Add type annotation for autotune_num_choices_displayed (#146323 ) Test Plan: ci Differential Revision: D69064447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146323 Approved by: https://github.com/ColinPeppler	2025-02-03 23:04:21 +00:00
Davide Italiano	0463cb6ca5	[mps/inductor] Add support for digamma(). (#146292 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146292 Approved by: https://github.com/malfet, https://github.com/jansel	2025-02-03 22:48:13 +00:00
titaiwangms	178531c95e	[ONNX] torch.onnx.export(dynamo=True) changes optimization to default (#146187 ) Fixes #145897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146187 Approved by: https://github.com/justinchuby	2025-02-03 22:44:54 +00:00
bobrenjc93	d69c181d77	log out partial fx graph when guard on data dependent during non stirct tracing (#146298 ) As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging. The following code produces a data dependent error ``` import torch from torch import nn # UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)). (Size-like symbols: u0) class Repro(nn.Module): def __init__(self): super().__init__() def forward(self, cache, update, pos): _, _, max_seq_len, _ = cache.shape _, _, seqlen, _ = update.shape pos_item = pos[0].item() # u0 torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 torch._check(pos_item >= 0) before = cache.narrow(2, 0, pos_item) # FAIL # Laith: why can't we make unbacked expressions size-like? after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # PASS end = torch.tensor(max_seq_len - pos_item - seqlen).item() after = cache.narrow(2, (pos_item + seqlen), end) return torch.cat([before, update, after], dim=2) repro = Repro() bsz = 1 n_heads = 4 max_seq_len = 512 head_dim = 64 seqlen = 5 pos_item = 1 cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim) update = torch.ones(bsz, n_heads, seqlen, head_dim) pos = torch.tensor([pos_item]) example_inputs = (cache, update, pos) torch.export.export(repro, example_inputs, strict=False) ``` This is what it now prints out ``` class GraphModule(torch.nn.Module): def forward(self, arg0_1: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", arg1_1: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", arg2_1: "i64[1][1]cpu"): # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0 select: "i64[][]cpu" = torch.ops.aten.select.int(arg2_1, 0, 0); arg2_1 = None item: "Sym(u0)" = torch.ops.aten.item.default(select); select = None # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 add: "Sym(u0 
+ 5)" = item + 5 le: "Sym(u0 + 5 <= 512)" = add <= 512; add = le = None # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0) ge: "Sym(u0 >= 0)" = item >= 0; ge = None # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item) narrow: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = torch.ops.aten.narrow.default(arg0_1, 2, 0, item); narrow = None # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) add_1: "Sym(u0 + 5)" = item + 5 sub: "Sym(512 - u0)" = 512 - item; item = None sub_1: "Sym(507 - u0)" = sub - 5; sub = None narrow_1 = torch.ops.aten.narrow.default(arg0_1, 2, add_1, sub_1); arg0_1 = add_1 = sub_1 = narrow_1 = None Traceback (most recent call last): File "/data/users/bobren/a/pytorch/r1.py", line 45, in <module> torch.export.export(repro, example_inputs, strict=False) File "/data/users/bobren/a/pytorch/torch/export/__init__.py", line 368, in export return _export( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper raise e File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper ep = fn(args, kwargs) File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 2079, in _export return _export_for_training( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper raise e File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper ep = fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper return fn(args, kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1944, in _export_for_training export_artifact = export_func( # type: ignore[operator] File "/data/users/bobren/a/pytorch/torch/export/_trace.py", 
line 1879, in _non_strict_export aten_export_artifact = _to_aten_func( # type: ignore[operator] File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1665, in _export_to_aten_ir_make_fx gm, graph_signature = transform(_make_fx_helper)( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1809, in _aot_export_non_strict gm, sig = aot_export(wrapped_mod, args, kwargs=kwargs, flags) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1585, in _make_fx_helper gm = make_fx( File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2194, in wrapped return make_fx_tracer.trace(f, args) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2132, in trace return self._trace_inner(f, args) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2103, in _trace_inner t = dispatch_trace( File "/data/users/bobren/a/pytorch/torch/_compile.py", line 51, in inner return disable_fn(args, kwargs) File "/data/users/bobren/a/pytorch/torch/_dynamo/eval_frame.py", line 749, in _fn return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1136, in dispatch_trace graph = tracer.trace(root, concrete_args) # type: ignore[arg-type] File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1692, in trace res = super().trace(root, concrete_args) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 834, in trace (self.create_arg(fn(args)),), File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1191, in wrapped out = f(tensors) # type:ignore[call-arg] File "<string>", line 1, in <lambda> File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1488, in wrapped_fn return tuple(flat_fn(args)) File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/utils.py", line 184, in flat_fn tree_out = fn(args, kwargs) File 
"/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 879, in functional_call out = mod(args[params_len:], *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper return self.call_module(mod, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module return Tracer.call_module(self, m, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module ret_val = forward(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward return _orig_module_call(mod, args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl return forward_call(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1793, in forward tree_out = mod(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper return self.call_module(mod, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module return Tracer.call_module(self, m, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module ret_val = forward(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward return _orig_module_call(mod, args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl return forward_call(args, *kwargs) File "/data/users/bobren/a/pytorch/r1.py", line 21, in forward after = 
cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1239, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1286, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_export/non_strict_utils.py", line 654, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_ops.py", line 866, in handler return torch._library.utils.handle_dispatch_mode( File "/data/users/bobren/a/pytorch/torch/_library/utils.py", line 296, in handle_dispatch_mode return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1341, in __torch_dispatch__ return proxy_call(self, func, self.pre_dispatch, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 910, in proxy_call out = func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_ops.py", line 749, in __call__ return self._op(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl output = self._dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl decomposition_table[func](args, *kwargs) File 
"/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward return self.as_strided(sizes, strides, storage_offset) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl entry = self._make_cache_entry(state, key, func, args, kwargs, output) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry output_info = self._get_output_info_for_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry synth_output = self._output_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry return self._get_output_tensor_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry empty.set_(storage, storage_offset, shape, stride) File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious r = self.shape_env.evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper return retlog(fn(args, **kwargs)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr return self._evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr raise self._make_data_dependent_error( torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could 
not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)). (Size-like symbols: u0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146298 Approved by: https://github.com/bdhirsh	2025-02-03 22:16:03 +00:00
Animesh Jain	0da07a6d1d	[dynamo][skip-function] Add missing unimplemented line (#146322 ) This is a missing line from the merged PR in the stack below. Let's try to get this in quickly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146322 Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos	2025-02-03 22:11:55 +00:00
PyTorch MergeBot	00dc5b10f6	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit 2fd1b6b3610eb84cd615360a8fd23756a7f2e743. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))	2025-02-03 22:04:28 +00:00
Yanbo Liang	15e12d5ec3	[Trace PyDispatcher] Support temporarily_pop_interpreter_stack ctx manager (#146271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146271 Approved by: https://github.com/zou3519 ghstack dependencies: #146270	2025-02-03 21:47:54 +00:00
Yanbo Liang	bd8d7b1b74	[Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146270 Approved by: https://github.com/zou3519	2025-02-03 21:47:54 +00:00
Yang Wang	fd73ae2068	[Utilization] Convert timestamp to str for datetime64 (#145985 ) Convert all float timestamps to int timestamps in the data pipeline for the db type datetime64. Floats do not work when inserting into ClickHouse using jsonExtract. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985 Approved by: https://github.com/huydhn	2025-02-03 21:05:18 +00:00
Simon Fan	1d4adf4e1f	[dynamo] log recompile reason to dynamo_compile (#146117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146117 Approved by: https://github.com/bobrenjc93	2025-02-03 21:04:04 +00:00
Zhengxu Chen	11f69808c6	[aoti] Assign proxy call args by name, and support default values. (#146263 ) Fixing the following issue when compiling the following program: ``` window = torch.hann_window(N_FFT).to(x.device) stft = torch.stft( x, N_FFT, HOP_LENGTH, window=window, return_complex=True ) magnitudes = stft[..., :-1].abs() ** 2 return magnitudes ``` ``` Traceback (most recent call last): File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor yield File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run self._callTestMethod(testMethod) File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper method(args, *kwargs) File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test return value(self) ^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft self.check_model(model, example_inputs) File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model actual = AOTIRunnerUtil.run( ^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run optimized = AOTIRunnerUtil.load(device, so_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load return torch._export.aot_load(so_path, device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device) # type: ignore[assignment, call-arg] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type 
for argument 1 but got as_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263 Approved by: https://github.com/angelayi	2025-02-03 20:15:59 +00:00
Henry Tsang	e67ce67498	[cutlass backend] update try_import_cutlass to accommodate pip install (#145891 ) The goal of this PR is to provide 3 ways for people to try out CUTLASS backend: 1. fbcode / internal 2. pip install torch (nightly) and pip install nvidia-cutlass 3. build from source I will go into more detailed combos between building from source and downloading via pip for torch and cutlass. repro: ``` import torch import torch.nn as nn import torch._inductor.config as config config.force_disable_caches = True config.max_autotune = True config.max_autotune_gemm_backends = "CUTLASS" # the following is only needed if you use a custom cutlass library # config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass" class TestModule(nn.Module): def forward(self, A, B): return A @ B model = TestModule().cuda() M, K, N = 2048, 2048, 2048 A = torch.randn(M, K).cuda().half() B = torch.randn(K, N).cuda().half() C = torch.compile(model, fullgraph=True)(A, B) ``` ## pre-requisite Assuming you have the right cuda toolkit. Recommend 12.4. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are good. ## combo 1: pip install torch + pip install nvidia-cutlass Check https://pytorch.org/get-started/locally/ for nightly install command. ``` pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124 pip install nvidia-cutlass ``` Then try running the script above. It should work. ## combo 2: build torch from source + pip install nvidia-cutlass This is going to be pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences. ## combo 3: build torch from source + use pytorch/third_party/cutlass This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own.
## combo 4: any torch version + cutlass library from somewhere else This is probably the only case you need to pass in cutlass_dir. Just set cutlass_dir to the cutlass repo library. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891 Approved by: https://github.com/Chillee, https://github.com/ColinPeppler	2025-02-03 20:05:41 +00:00
Isalia20	f237172768	Fix not inlining functions used in metal files (#146316 ) Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231 ``` FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')! [3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o ninja: build stopped: subcommand failed. ``` AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316 Approved by: https://github.com/malfet, https://github.com/atalman	2025-02-03 19:33:52 +00:00
Yidi Wu	674e0b668a	Add non-strict export while_loop test back (#146195 ) This is fixed by https://github.com/pytorch/pytorch/pull/145762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146195 Approved by: https://github.com/zou3519 ghstack dependencies: #146194	2025-02-03 19:28:22 +00:00
Yidi Wu	1138d0c4f6	[hop] enable while_loop return torch.ones with unbacked symbol expression. (#146194 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146194 Approved by: https://github.com/zou3519	2025-02-03 19:28:22 +00:00
Animesh Jain	57b1fc35f6	[dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146219 Approved by: https://github.com/zou3519 ghstack dependencies: #146075, #146283	2025-02-03 18:02:48 +00:00
PyTorch MergeBot	64fc9ff09c	Revert "[ONNX] Create deprecation warning on dynamo_export (#146003 )" This reverts commit e6c39d37e90242692cf25ea849abd47d11932cd7. Reverted https://github.com/pytorch/pytorch/pull/146003 on behalf of https://github.com/atalman due to Broke internally ([comment](https://github.com/pytorch/pytorch/pull/146003#issuecomment-2631599314))	2025-02-03 17:17:14 +00:00
Tugsbayasgalan Manlaibaatar	041e08f9dc	Add buffers to parameterization rule (#145991 ) Differential Revision: [D68959513](https://our.internmc.facebook.com/intern/diff/D68959513) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145991 Approved by: https://github.com/bdhirsh	2025-02-03 16:49:03 +00:00
PyTorch MergeBot	c0979d72b5	Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 )" This reverts commit 68a363548409a3ff17965770304ee5e12fe718d9. Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))	2025-02-03 16:25:58 +00:00
Harmen Stoppels	01554c7b5a	fix incorrect literal strings / accidental tuples (#146037 ) * `expr,` is short for `(expr,)` * literal strings over multiple lines need to escape the newline `\` or use `(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037 Approved by: https://github.com/Skylion007	2025-02-03 15:08:11 +00:00
PyTorch UpdateBot	550441a87b	Update slow tests (#146301 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146301 Approved by: https://github.com/pytorchbot	2025-02-03 11:37:16 +00:00
Isuru Fernando	08b14936ae	Disable has_relational_guards check for dict_tag optimization for now (#146232 ) has_relational_guards evaluates to true almost always, and leads to a slowdown in guards runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232 Approved by: https://github.com/anijain2305	2025-02-03 07:56:06 +00:00
Isalia20	e3643e1e0e	[MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279 ) Requested in #77764 This PR adds support for linalg.det on MPS and fixes lu factor for non contiguous tensors, current implementation crashed on any kind of non-contiguous tensor with an error: ``` -[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer' zsh: abort python det.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 06:06:43 +00:00
Zhengxu Chen	1580f47bf4	[export][ez] Fix generated header file. (#146208 ) Summary: as title. Test Plan: CI Differential Revision: D68978788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208 Approved by: https://github.com/yiming0416	2025-02-03 06:01:05 +00:00
cyy	7b512095ef	Enable some tests on MacOS (#146268 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146268 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-03 05:04:24 +00:00
Animesh Jain	fa48757180	[dynamo] misc fixes for inspect (#146283 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146283 Approved by: https://github.com/jansel ghstack dependencies: #146075	2025-02-03 04:26:10 +00:00
cyy	6ac8bc0cd2	Remove unused import in tests (#146266 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146266 Approved by: https://github.com/Skylion007	2025-02-03 03:40:18 +00:00
Davide Italiano	d80eef7c6d	[inductor] Guard a member variable with a define. (#146278 ) It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind: /Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field] 412 \| size_t blob_size_; \| ^ 1 warning generated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278 Approved by: https://github.com/Skylion007, https://github.com/jansel	2025-02-03 02:20:08 +00:00
Animesh Jain	c0ec2e0a0d	[dynamo][functions] Improve getattr on functions (#146075 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146075 Approved by: https://github.com/jansel	2025-02-03 02:01:57 +00:00
Davide Italiano	d28fe3ed47	[metal] Move digamma to special_math.h (#146284 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284 Approved by: https://github.com/jansel	2025-02-03 01:29:14 +00:00
Davide Italiano	1f21f699ba	[metal] Refactor digamma in preparation for moving it. (#146281 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146281 Approved by: https://github.com/jansel	2025-02-02 23:54:45 +00:00
Yanbo Liang	511d0dd558	[Dynamo][Trace PyDispatcher] Support calling id function over class (#146269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146269 Approved by: https://github.com/anijain2305	2025-02-02 22:29:30 +00:00
Lancelot Normand	02fd4868d6	Fix unreachable code (#146262 ) Fixes #146261 Removed unreachable code Pull Request resolved: https://github.com/pytorch/pytorch/pull/146262 Approved by: https://github.com/Skylion007	2025-02-02 21:35:26 +00:00
Isalia20	5d55a6585d	[MPS] lu factor ex implementation (#144651 ) Implements `torch.linalg.lu_factor_ex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651 Approved by: https://github.com/malfet	2025-02-02 15:09:49 +00:00
Avik Chaudhuri	0144613e6f	move and fix logic to update unbacked bindings (#146115 ) Summary: Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here). This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde. Test Plan: added test Differential Revision: D68880766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115 Approved by: https://github.com/pianpwk	2025-02-02 10:43:55 +00:00
PyTorch UpdateBot	a44a8a7d3a	[audio hash update] update the pinned audio hash (#145988 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145988 Approved by: https://github.com/pytorchbot	2025-02-02 04:19:29 +00:00
cyy	8543d8395b	[2/N] Enable ruff F841 on distributed tests (#146132 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146132 Approved by: https://github.com/Skylion007, https://github.com/rec	2025-02-02 03:44:48 +00:00
Animesh Jain	cef856faa9	[dynamo][enum] Trace through enum.py for enum construction (#146070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146070 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198, #146258, #146214	2025-02-02 03:12:36 +00:00
Animesh Jain	31fb691782	[dynamo] Graph break on tensor.retain_grad (#146214 ) Fixes https://github.com/pytorch/pytorch/issues/146212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146214 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198, #146258	2025-02-02 03:12:36 +00:00
Animesh Jain	529eb8d558	[dynamo] Add return to python_type (#146258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146258 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198	2025-02-02 03:12:36 +00:00
Davide Italiano	7854299b27	[mps/inductor] Implement support for polygamma(). (#146259 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259 Approved by: https://github.com/jansel	2025-02-02 01:54:23 +00:00
Burak Turk	d89c7ea401	add WaitCounter type interface and get rid of type errors (#146175 ) Summary: as titled Differential Revision: D68960123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146175 Approved by: https://github.com/andriigrynenko, https://github.com/Skylion007	2025-02-01 23:24:52 +00:00
Jason Ansel	3a67c0e48d	[inductor] Finish typing common.py (#146225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225 Approved by: https://github.com/Skylion007	2025-02-01 22:53:35 +00:00
Davide Italiano	dca5cc0255	[mps] Move polygamma to special_math.h. (#146253 ) In preparation to implement it in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 21:45:23 +00:00
Aaron Gokaslan	07dbd539b4	[BE][Ez]: Make c10/special arrays constexpr (#146246 ) No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246 Approved by: https://github.com/dcci, https://github.com/malfet	2025-02-01 21:03:18 +00:00
Davide Italiano	d4ad7b91ad	[mps] Move zeta() to special_math.h. (#146231 ) In preparation for implementing digamma/polygamma Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 19:22:59 +00:00
Sahdev Zala	f97307f463	[Docs] Add clarification for target types in CrossEntropyLoss doc (#145444 ) The CrossEntropyLoss function requires that targets given as class indices have a long dtype and that targets given as class probabilities have a float dtype. It distinguishes the two scenarios (indices and probabilities) by comparing shapes: when the input and target shapes are the same, the target is treated as class probabilities; otherwise it is treated as class indices, as already covered in the doc. The related code is here, https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624 I think the current documentation is great, but it seems it can confuse users about types, as reported in the issues, so this PR adds a bit more clarification. Fixes #137188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145444 Approved by: https://github.com/mikaylagawarecki	2025-02-01 18:55:58 +00:00
Nikita Shulga	5ed5793016	Temp disable MKL in DistributionKernels.cpp (#146174 ) Until https://github.com/pytorch/pytorch/issues/132395 is addressed Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential ) ```python import torch high_bits_for_seed = 16000000000000000000 # to use "good quality" seed _ = torch.manual_seed (high_bits_for_seed + 2024) prob = torch.ones (26) dups_mult = 0 perm_counts_mult = {} for _ in range (1_000_000): p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist()) if p in perm_counts_mult: dups_mult += 1 perm_counts_mult[p] += 1 else: perm_counts_mult[p] = 1 print ('duplicate multinomial perms: ', dups_mult) print ('multiple multinomial perms: ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item()) print ('max of perm_counts_mult: ', torch.tensor (list (perm_counts_mult.values())).max().item()) print ('len (perm_counts_mult): ', len (perm_counts_mult)) ``` This is a reland of https://github.com/pytorch/pytorch/pull/132532 but excluding internal builds that already have some hardcoded values Pull Request resolved: https://github.com/pytorch/pytorch/pull/146174 Approved by: https://github.com/ngimel	2025-02-01 18:53:11 +00:00
Nikita Shulga	e56dcf2772	[CPUInductor] Fix SVE256 detection (#146207 ) This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`. I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256. Fixes https://github.com/pytorch/pytorch/issues/145441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207 Approved by: https://github.com/angelayi	2025-02-01 18:51:34 +00:00
Jason Ansel	8c657ae4be	[inductor] Add typing to common.CSE (#145993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914, #145915, #145916	2025-02-01 16:34:18 +00:00
Jason Ansel	68cf36d5ab	[inductor] Add typing to common.KernelArgs (#145916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914, #145915	2025-02-01 16:34:18 +00:00
Jason Ansel	8e56d713c9	[inductor] Add typing to common.OpDecompositions (#145915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145915 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914	2025-02-01 16:34:11 +00:00
Jason Ansel	79f9f62e3a	[inductor] Combine regexp checks in OpOverrides.paren (#145914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145914 Approved by: https://github.com/Skylion007 ghstack dependencies: #145913	2025-02-01 16:34:03 +00:00
Jason Ansel	4c004caa76	[inductor] Add types to DeviceOpOverrides (#145913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145913 Approved by: https://github.com/Skylion007	2025-02-01 16:33:49 +00:00
rzou	0f768c7866	Barebones flat_apply HOP (#146060 ) This PR: - adds pytree.register_constant for registering a class to be treated as a constant by torch.compile/torch.fx - adds a very barebones flat_apply HOP. This should be sufficient to get mark_traceable working. A lot more work is necessary to get the custom operator case working (when make_fx sees a custom operator with PyTree arg types, it needs to emit a call to the flat_apply HOP). - I expect the flat_apply HOP to change a lot, I want to ship this in the current state to unblock the mark_traceable and custom ops work. Test Plan: - It's kind of difficult to test the barebones flat_apply HOP "works" so I added a really simple test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060 Approved by: https://github.com/StrongerXi, https://github.com/yanboliang ghstack dependencies: #146059	2025-02-01 16:17:48 +00:00
rzou	373606928b	Add torch.utils._pytree.register_dataclass (#146059 ) This is an API that registers a dataclass as a pytree node. It directly calls torch.export.register_dataclass, but we should eventually inline that implementation here. I want to use this API for something in compile and feel weird calling torch.export.register_dataclass. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146059 Approved by: https://github.com/StrongerXi, https://github.com/angelayi, https://github.com/yanboliang	2025-02-01 16:17:48 +00:00
cyy	2fd1b6b361	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-02-01 12:33:41 +00:00
Aleksandar Samardžić	2b00d211f0	Build RowwiseScaledMM.cu for SM89 (#145676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676 Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy	2025-02-01 11:44:58 +00:00
Shangdi Yu	f40e013787	Fix aten.to when input is a tensor constant (#146220 ) Summary: Fix aten.to when input is a tensor constant. In this case, `args_unwrapped` could just be a constant, so not a functional tensor. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r tensor_constant_aten_to ``` Differential Revision: D68984244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146220 Approved by: https://github.com/JacobSzwejbka	2025-02-01 11:07:33 +00:00
bobrenjc93	30f091da44	add speculation log divergence test (#145659 ) Followup from a SEV. Confirmed that this breaks when stacked on top of https://github.com/pytorch/pytorch/pull/145660 (offending PR that caused the SEV) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145659 Approved by: https://github.com/laithsakka	2025-02-01 09:39:22 +00:00
Shangdi Yu	a4e4368157	add node mapping processing (#146103 ) Summary: Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)`, to produce a `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create provenance visualization. `inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty. So it's removed from some of the `test_structured_trace` tests. Test Plan: CI ``` buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing python test/dynamo/test_structured_trace.py ``` Differential Revision: D68190173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103 Approved by: https://github.com/chenyang78	2025-02-01 08:29:29 +00:00
Huy Do	f38d5b4a74	Update TorchBench commit to main (#145455 ) I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use the latest TorchBench commit. The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584 The main thing to call out is that the newer transformers added by https://github.com/pytorch/benchmark/pull/2488 is regressing several models. This needs to be investigated further, and I pin the version to unblock this change. * `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279, not sure why it fails accuracy on A10G, but it works fine on A100 * `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455 Approved by: https://github.com/kit1980	2025-02-01 06:44:26 +00:00
Shangdi Yu	a97a906dd9	Add "//caffe2:libtorch" to minifier TARGET file (#146203 ) Summary: as title. To avoid errors like "undefined symbol: aoti_torch_device_type_cpu" when compiling minifier_launcher.py Test Plan: CI Differential Revision: D68978430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146203 Approved by: https://github.com/desertfire	2025-02-01 05:37:23 +00:00
Mingming Ding	bcd0ba0f69	Adding the best autotuner config (#146121 ) Summary: Adding logs to log the best config for autotune configs Test Plan: Testing in Mast : aps-omnifmv1-5_32_test_with_best_config-c5e9ceccf8 {F1974838864} Reviewed By: oulgen Differential Revision: D68931164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146121 Approved by: https://github.com/oulgen	2025-02-01 03:43:33 +00:00
Yiming Zhou	549e230c33	[draft_export] Clear pending unbacked symbols when overriding mismatched fake kernels (#146089 ) Summary: When encountering a mismatched fake kernel that also creates unbacked symbols, draft export will fail with `PendingUnbackedSymbolNotFound` error. Clearing `shape_env.pending_fresh_unbacked_symbols` fixes this issue. Test Plan: ``` buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_override_mismatched_fake_kernel_with_unbacked_symbols ``` Differential Revision: D68920990 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146089 Approved by: https://github.com/pianpwk	2025-02-01 03:32:50 +00:00
cyy	4d2056efb5	Enable ruff F841 on numpy tests (#146126 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146126 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-01 03:07:28 +00:00
cyy	985a78e9df	Enable ruff F841 on distributed tests (#146131 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146131 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-01 03:06:16 +00:00
Animesh Jain	1de41e6918	[dynamo][exceptions][3.10] Clean symbolic stack on exception handling (#146198 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146198 Approved by: https://github.com/williamwen42 ghstack dependencies: #146062	2025-02-01 02:51:44 +00:00
angelayi	6023684311	[export] Fix symfloat serialization (#146112 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146112 Approved by: https://github.com/pianpwk	2025-02-01 02:28:44 +00:00
David Berard	8326d27093	[inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515 ) This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants). Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between. To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515 Approved by: https://github.com/SamGinzburg	2025-02-01 02:11:48 +00:00
briancoutinho	6e734bab93	execution trace export supports gzip format (#146179 ) As above, allows Chakra Execution Trace observer to support compressing files. Usage is straightforward, just add ".gz" suffix to the output file name ``` et = ExecutionTraceObserver() et.register_callback("my_trace.json.gz") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146179 Approved by: https://github.com/shengfukevin, https://github.com/davidberard98, https://github.com/sraikund16	2025-02-01 01:25:25 +00:00
Brian Hirsh	57c45340e7	include entire GraphModule instead of current node when erroring inside of fx interpreter (#146197 ) This seems like it would make it easier to diagnose PT2 issues where the user cannot easily repro, and we need more info in the backtrace, e.g. in https://github.com/pytorch/pytorch/issues/134182#issuecomment-2628076114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146197 Approved by: https://github.com/jamesjwu	2025-02-01 01:09:27 +00:00
Sam Larsen	73d90d66a4	Cap size of thread pool in select_algorithm to cpu count (#146071 ) Summary: With changes from https://github.com/pytorch/pytorch/pull/144829, we can see more autotune configs and the size of the pool can get outta hand when using the cutlass backend. See internal discussion at: https://fburl.com/workplace/7g4vz0zy Test Plan: `python test/inductor/test_cutlass_backend.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146071 Approved by: https://github.com/Chillee	2025-02-01 00:41:36 +00:00
Avik Chaudhuri	cde5ddfd14	fix internal error with reorder submodules (#146181 ) Test Plan: hard to isolate as small repro Differential Revision: D68963033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146181 Approved by: https://github.com/angelayi	2025-02-01 00:30:42 +00:00
Alexander Kurakin	35f113e2a0	torch/nn/utils/rnn.py: docs: improvements (#138628 ) Fix constants highlighting in generated documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138628 Approved by: https://github.com/mikaylagawarecki	2025-02-01 00:10:30 +00:00
Bin Bao	a78c796f0b	[AOTI] Support composed dynamic shape constraint (#146044 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/145500. When export takes a dynamic shape constraint as an expression containing a symbol, we should be able to solve the symbol at run time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146044 Approved by: https://github.com/angelayi ghstack dependencies: #146043	2025-02-01 00:02:12 +00:00
Laith Sakka	43372e70c2	enhance logging statically known by adding size_oblivious(..) (#145354 ) after the diff ``` [0/0_1] eval size_oblivious(Eq(s1, 1)) == False [statically known] [0/0_1] eval size_oblivious(Eq(u0, 1)) == False [statically known] [0/0_1] eval size_oblivious(Eq(s0, 1)) == False [statically known] [0/0_1] eval size_oblivious(Eq(s0s1u0, 0)) == False [statically known] ``` before ``` [0/0_1] eval (Eq(s1, 1)) == False [statically known] [0/0_1] eval (Eq(u0, 1)) == False [statically known] [0/0_1] eval (Eq(s0, 1)) == False [statically known] [0/0_1] eval (Eq(s0s1u0, 0)) == False [statically known] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145354 Approved by: https://github.com/ezyang	2025-01-31 23:26:37 +00:00
Animesh Jain	f25f1163dc	[dynamo] Support frozenset({..}).__contains__ (#146062 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146062 Approved by: https://github.com/Skylion007, https://github.com/jansel	2025-01-31 23:22:58 +00:00
fduwjj	eb029fba13	[c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789 ) (#144794 ) Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman	2025-01-31 22:39:56 +00:00
Bin Bao	af2a39849d	[AOTI] Refactor codegen_input_symbol_assignment (#146043 ) Summary: Extract the common logic for size and stride symbol generation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146043 Approved by: https://github.com/angelayi	2025-01-31 21:55:18 +00:00
PyTorch MergeBot	c39c679813	Revert "Tensor .cuda() very slow with specific array sizes (#138964 )" This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744. Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))	2025-01-31 21:48:51 +00:00
atalman	a7cc6d3e84	Manylinux 2.28 migration - remove pre-cxx11 abi libtorch builds (#146200 ) Related to: https://github.com/pytorch/pytorch/issues/123649 Removing pre-cxx11 abi builds. As per announcement : https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146200 Approved by: https://github.com/kit1980, https://github.com/huydhn	2025-01-31 21:43:12 +00:00
Andrew Or	8203894eff	Resolve affine quantization namespace collision with torchao (#145941 ) Summary: https://github.com/pytorch/pytorch/pull/141421 duplicated affine quantization custom ops from torchao into the PT2E quantization flow, but these ops are registered under the same namespace with the same name, causing "Duplicate registration" errors for the new ops for use cases that import from both repos. This commit fixes this by moving the PT2E versions of the ops to a new namespace. In the long term, we expect to migrate PT2E into torchao so users can migrate back to the old namespace if they wish to. Test Plan: python test/test_quantization.py -k test_channel_group_quantization Differential Revision: D68838437 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145941 Approved by: https://github.com/cccclai	2025-01-31 21:29:47 +00:00
Animesh Jain	781aceee9c	[dynamo] Revert abc change due to internal failures (#146177 ) xref - https://www.internalfb.com/tasks/?t=191383874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177 Approved by: https://github.com/StrongerXi ghstack dependencies: #146141	2025-01-31 21:28:06 +00:00
Jessica Vandebon	a0d1393b1a	[MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842 ) Differential Revision: D68560256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145842 Approved by: https://github.com/chaos5958, https://github.com/nautsimon	2025-01-31 21:21:00 +00:00
Simon Fan	06850e624a	[ca][hop] test CA on all HOPs (#145429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145429 Approved by: https://github.com/zou3519 ghstack dependencies: #145422	2025-01-31 20:45:22 +00:00
Simon Fan	2e197c8a2d	[dynamo][hop] test torch.compiling all HOPs (#145422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145422 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-01-31 20:45:22 +00:00
William Wen	5b1abdbf5d	[dynamo] remove always-failing eval_frame.c debug check (#145982 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145981	2025-01-31 20:40:59 +00:00
William Wen	49df8de8be	[dynamo] disable eval_frame callback in _TorchDynamoContext __enter__/__exit__ (#145981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145981 Approved by: https://github.com/jansel	2025-01-31 20:40:59 +00:00
Wei Wang	3a4e7a589b	[CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099 ) To match the expected tensor (i.e. 2nd element in the array). Making rank0 receive [False, False] Fixes one of the issues reported in #146094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099 Approved by: https://github.com/eqy	2025-01-31 20:31:13 +00:00
Jane Xu	8b8c596503	Remove trivial dispatch_key_allowlist_check function (#146169 ) Hmmm...this _is_ removing a public function from a public C++ file. But the GH counts for this function total 83, seemingly all copying pytorch: https://github.com/search?q=dispatch_key_allowlist_check&type=code&p=1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146169 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-01-31 19:59:40 +00:00
Irakli Salia	ec2522e200	[MPS] optimize cholesky (#145722 ) Followup to #145701 Optimizes the SYRK and TRSM kernels of Cholesky decomposition on MPS. For the SYRK kernel it does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and for the TRSM kernel it does vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR). Script to collect perf:
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3


def create_spd_matrix(n, batch_size):
    # Build a batch of symmetric positive-definite matrices
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)


def run_cholesky_mps(A):
    # Synchronize before and after so only the decomposition is timed
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start


results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")

        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Observed speedups on M1 Pro ![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145722 Approved by: https://github.com/malfet	2025-01-31 19:52:31 +00:00
Mwiza Kunda	6a0138fcc1	Torch device backend autoload fix (#145611 ) This causes an import failure if an external backend imports a module that uses `torch._as_tensor_fullprec` when it is being loaded. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145611 Approved by: https://github.com/albanD	2025-01-31 19:27:42 +00:00
cyy	18380836eb	Remove outdated test skipif conditions for Python3.9 (#146144 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146144 Approved by: https://github.com/albanD	2025-01-31 19:01:04 +00:00
Yidi Wu	68a3635484	[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 ) Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is only used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally. This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456 Approved by: https://github.com/desertfire	2025-01-31 18:29:27 +00:00
Zhengxu Chen	aad9f44b2e	[export] Sync model container types to schema.py (#145959 ) Summary: Synced from D68840230 Test Plan: No behavior changes to existing API. Will be tested internally. Differential Revision: D68846532 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959 Approved by: https://github.com/yiming0416	2025-01-31 18:17:56 +00:00
PyTorch MergeBot	16f44fee25	Revert "[inductor/profiler] add kernel kwargs instrumentation (#145573 )" This reverts commit 720b8d0d8dac98f89499bc6b251d1f34dbf68dfe. Reverted https://github.com/pytorch/pytorch/pull/145573 on behalf of https://github.com/ZainRizvi due to Sorry, but this is failing internally. It's a bit weird since this PR doesn't really appear related at first glance, but despite retries it fails pretty consistently. Please see D68930742 for details ([comment](https://github.com/pytorch/pytorch/pull/145573#issuecomment-2628013872))	2025-01-31 18:13:23 +00:00
Catherine Lee	67ed47d886	Binary upload checksum (#144887 ) Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887 Approved by: https://github.com/atalman	2025-01-31 17:51:27 +00:00
Aleksei Nikiforov	d0748566b4	s390x ci: ensure CI starts correctly if token pipe is not removed (#145840 ) Mark stop actions as "may fail". The container is expected to stop on its own in the normal case. Remove the "may fail" mark from token generation steps. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145840 Approved by: https://github.com/huydhn	2025-01-31 17:46:09 +00:00
Aleksei Nikiforov	44ecbcbd5a	s390x: disable test_model_exports_to_core_aten.py test (#145835 ) It often gets killed by OOM. Disable it while investigating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145835 Approved by: https://github.com/huydhn	2025-01-31 17:45:10 +00:00
Animesh Jain	667b94d1c2	[hotfix][dynamo] Skip linecache due to a flaky issue (#146141 ) A large number of jit + dynamo wrapped tests fail in linecache tracing. We need further debugging. Skipping for now to stem the bleeding. https://github.com/pytorch/pytorch/issues/146076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141 Approved by: https://github.com/StrongerXi	2025-01-31 17:45:06 +00:00
PyTorch MergeBot	c3f71eb61b	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error. @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))	2025-01-31 17:43:09 +00:00
PyTorch MergeBot	f5a61ba0a3	Revert "inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122 )" This reverts commit d100e9ae744322a74d9fd05d0851caaf36f19c24. Reverted https://github.com/pytorch/pytorch/pull/145122 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. See D68924977 for details ([comment](https://github.com/pytorch/pytorch/pull/145122#issuecomment-2627880860))	2025-01-31 17:39:23 +00:00
Aleksei Nikiforov	eb5a0718c2	S390x nightly builds timeouts (#146041 ) Sometimes the build times out at the end. Increasing the timeout should fix this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146041 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-01-31 17:29:11 +00:00
Mikayla Gawarecki	001e355a56	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )

## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

`6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)`

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.

Note that we always use `miniz` writer in the zip64 mode per [here](`7796e308d0/caffe2/serialize/inline_container.cc (L701)`). A zipfile record written by miniz looks as such

```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c#L3290)
- filename will be `"{archive_name}/{filepath}"`
- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)`). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524)`)
- `m` is determined by [`getPadding`](`7796e308d0/caffe2/serialize/inline_container.cc (L254)`), which accounts for filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined based on [this snippet](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)`): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following

- We keep track of where the "cursor" is in the file using `current_offset`. After each persistent_load call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage (where the storages will be in the order encountered by the unpickler: 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage size is non-zero, we account for the local dir footer based on the logic described above

## Testing strategy

The agreed upon testing strategy was as follows:

- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-31 17:09:20 +00:00
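The record layout described in the commit above (#143880) can be modeled in a few lines. This is a minimal sketch built only from the prose of that message, not the `getPadding` or `get_record_offset_no_read` code itself: the function names are hypothetical, and the rule that the padding is at least 4 bytes (to hold the `F B padding_size` prefix) is an assumption.

```python
HEADER_SIZE = 30     # local dir header size quoted in the commit message
ALIGNMENT = 64       # storages start on a 64-byte boundary per the commit message
PADDING_PREFIX = 4   # assumption: padding always holds the 4-byte "F B padding_size" prefix


def storage_offset(header_start, filename, zip64_extra_len=0):
    """Return (offset where the storage bytes begin, padding length m)."""
    unpadded = header_start + HEADER_SIZE + len(filename.encode()) + zip64_extra_len
    # Smallest m >= PADDING_PREFIX that makes the storage start 64-byte aligned.
    m = PADDING_PREFIX + (-(unpadded + PADDING_PREFIX)) % ALIGNMENT
    return unpadded + m, m


def footer_size(storage_nbytes, has_zip64_extra):
    """Local dir footer: skipped for empty storages, else 24 with zip64 data or 16 without."""
    if storage_nbytes == 0:
        return 0
    return 24 if has_zip64_extra else 16


def next_header_start(header_start, filename, storage_nbytes, zip64_extra_len=0):
    """Advance the cursor past one record to where the next header begins."""
    start, _ = storage_offset(header_start, filename, zip64_extra_len)
    return start + storage_nbytes + footer_size(storage_nbytes, zip64_extra_len > 0)


start, m = storage_offset(0, "archive/data/1")
cursor = next_header_start(0, "archive/data/1", storage_nbytes=100)
```

Walking the cursor this way needs no reads from the file, which is the point of the option: only the first storage offset has to be looked up through the zip reader.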
Donald Tolley	98f87edd23	Tensor .cuda() very slow with specific array sizes (#138964 )

### Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch

#### Summary

This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### Tracing the Issue

To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. Python Call: `tensor.cuda()` - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.
2. `TensorBody.h: cuda()` - The `cuda()` method calls `to()`, specifying the target device as CUDA.
3. `Tensor.cpp: TensorBase::to()` - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.
4. Operator Call: `_ops::to_dtype_layout::call()` - This operator dispatches the request to the backend-specific function responsible for managing the transfer.
5. `Copy.cpp: copy_()` - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.
6. `Copy.cpp: copy_impl()` - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.
7. Dispatch to CUDA: `copy_stub` - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.
8. `Copy.cu: copy_kernel_cuda()` - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### Solution

To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- Efficiency of `cudaMemcpy2DAsync`: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- Reduction of Overhead: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- Asynchronous Execution: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.

#### Performance Results

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
✅ PASS: Tensor shapes match.
✅ PASS: Tensor contiguity matches.
✅ PASS: Tensor contents match.
✅ PASS: Tensor data types match.

✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### Conclusion

This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964 Approved by: https://github.com/jeffdaily Co-authored-by: Natalia Gimelshein <ngimel@gmail.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-31 17:05:02 +00:00
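To make the "pitched" transfer in the commit above (#138964) concrete, here is a small CPU-only model of what a 2D strided copy does. It is a sketch of the semantics only, not the PR's code and not a CUDA call: `memcpy_2d` is a hypothetical helper whose parameter names mirror the `dst, dpitch, src, spitch, width, height` arguments of `cudaMemcpy2DAsync`, with the copy kind and stream omitted.

```python
def memcpy_2d(dst, dpitch, src, spitch, width, height):
    """Copy `height` rows of `width` bytes each.

    Consecutive rows start `spitch` bytes apart in `src` and `dpitch` bytes
    apart in `dst`, so a strided source needs no contiguous staging buffer.
    """
    if width > spitch or width > dpitch:
        raise ValueError("width must not exceed either pitch")
    for row in range(height):
        s = row * spitch
        d = row * dpitch
        dst[d:d + width] = src[s:s + width]


# A 3 x 4 row-major source; copy the first 2 columns of every row
# (a non-contiguous slice) into a packed 3 x 2 destination.
src = bytearray(range(12))
dst = bytearray(6)
memcpy_2d(dst, dpitch=2, src=src, spitch=4, width=2, height=3)
```

The column slice is non-contiguous in the source, yet a single call with the right pitches moves it directly, which is the property the PR relies on to avoid the temporary contiguous buffer.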
Mikayla Gawarecki	2d6f6637d3	Remove lexicographical sorting of storage keys in torch.save (#143879 ) Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted). This makes the order in which storages are written the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879 Approved by: https://github.com/albanD	2025-01-31 17:00:23 +00:00
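The ordering difference described in the commit above (#143879) is easy to reproduce with plain strings. This is an illustrative sketch only; the variable names are made up and it does not touch `torch.save`.

```python
# Storage keys as strings, in the order the pickler would encounter them.
keys = [str(i) for i in range(12)]

# Sorting strings compares character by character, so "10" lands before "2".
lexicographic = sorted(keys)

# A dict (Python >= 3.7) iterates in insertion order, which here is also numeric order.
storages = {key: None for key in keys}
insertion_order = list(storages)

print(lexicographic)
print(insertion_order)
```

Because insertion order already matches the pickling order, dropping the sort keeps records on disk in the same sequence the unpickler will ask for them.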
Ting Lu	9232355bb0	Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix (#145792 ) https://github.com/pytorch/pytorch/issues/145570 Adding cuda 12.8.0 x86 builds first Pull Request resolved: https://github.com/pytorch/pytorch/pull/145792 Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman	2025-01-31 16:12:02 +00:00
Jackson	a7c2d85c18	Add overloads to diagonal docs (#144214 ) Fixes #126827. Refactored doc to demonstrate when none of the optional values are passed in. Added another example so that all overloads of the function are covered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144214 Approved by: https://github.com/albanD	2025-01-31 15:53:59 +00:00
Bin Bao	2af876707b	[AOTI] Fix a memory leak in package boxed_run (#146100 ) Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100 Approved by: https://github.com/cpuhrsch	2025-01-31 13:32:28 +00:00
Pian Pawakapan	7b07415aaa	[export] nested terms in nn_module_stack deserialization (#145901 ) Summary: accounting for terms like "getattr(getattr(a[0], b), c)". Test Plan: test_serialize Differential Revision: D68784736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145901 Approved by: https://github.com/angelayi	2025-01-31 10:00:13 +00:00
Haifeng Jin	1f1a9965d5	fix a small typo in comments (#145323 ) A minor typo fix. The description was confusing with the typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145323 Approved by: https://github.com/Skylion007	2025-01-31 06:45:44 +00:00
Nikita Shulga	c55af2b567	[CMake] Delete Caffe2 inspect_gpu binary (#146105 ) It is unbuildable right now because the headers it depends on are gone. Fixes https://github.com/pytorch/pytorch/issues/146042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146105 Approved by: https://github.com/atalman, https://github.com/seemethere	2025-01-31 06:42:52 +00:00
Aidyn-A	e84bf88dde	[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 ) This is a re-base PR to my previous one #141959. Description from the original PR: This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>
<summary>The benchmark code used </summary>

```Python
import time
import torch
from torch.profiler import profile, ProfilerActivity


def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)


def main():
    ops = [
        torch.relu,
        torch.sigmoid,
        torch.tanh,
        torch.nn.functional.gelu,
        torch.sin,
        torch.exp,
    ]

    dtypes = [
        torch.float16,
        torch.bfloat16,
        torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>
<summary> Results </summary>

| op | dtype | size | time after (s) | time before (s) | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746 Approved by: https://github.com/eqy, https://github.com/ngimel Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-01-31 06:42:08 +00:00
Henry Hu	eeb5e1bf20	[AOTI] Cache treespec_loads calculation (#145815 ) Summary: Treespec can be reused instead of calculated from str every AOTI module call. Using cached result saves 0.2ms for each module call. Test Plan: Before: {F1974751578} After: {F1974751667} Differential Revision: D68749539 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145815 Approved by: https://github.com/henrylhtsang	2025-01-31 06:38:21 +00:00
Aaron Orenstein	57d8278ab9	pickler for GraphModule (#141659 ) Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that. Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659 Approved by: https://github.com/jamesjwu	2025-01-31 05:34:28 +00:00
Manav Avlani	f9227e7c33	Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087 ) Summary: TSIA Test Plan: Tested up the stack but existing unittests Reviewed By: suo Differential Revision: D68917233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087 Approved by: https://github.com/suo	2025-01-31 05:08:11 +00:00
Sherlock Huang	cf2de4e230	Introduce aoti_call_delegate HOP (#145630 ) Summary: Previously, the AOTI compile node was represented as a kernel-less custom op in the exported program. The node was not eager-runnable, even though running eagerly is a common practice for numerical validation during lowering. I introduce a new HOP to address this. The schema is as follows: ``` aoti_call_delegate(lower_moduel: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor]) ``` There are a few problems exposed by the HOP - AOTI expects an FX graph with weights as getattr nodes, aka a stateful graph. The HOP expects graph_module arguments to be stateless. The export serializer also expects a stateless graph. Currently, to make AOTI happy, I am making `original_gm` stateful, and bypassing the serialization for `original_gm`. - As a result, the HOP is not re-traceable, as functionalization on a stateful graph module argument will fail. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test Reviewed By: zhxchen17 Differential Revision: D68359391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630 Approved by: https://github.com/zou3519	2025-01-31 04:57:36 +00:00
titaiwangms	f358d4d004	[ONNX] Migrate test_torch_export_with_onnxruntime.py to test_small_models_e2e.py (#146095 ) With [the deprecation of torch.onnx.dynamo_export](https://github.com/pytorch/pytorch/pull/146003), this PR turns the torch.export related tests toward torch.onnx.export(..., dynamo=True), and places them in test_small_models_e2e.py NOTE: test_exported_program_as_input_from_file and test_onnx_program_supports_retraced_graph are not kept, because they mostly test whether the exported program stays the same after save/load and retrace. However, in torch.onnx.export(..., dynamo=True), we focus more on the export from nn.Module to ONNX proto. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146095 Approved by: https://github.com/justinchuby	2025-01-31 03:40:26 +00:00
angelayi	27e35de6c2	[export] Add distributed test (#146050 ) Reland https://github.com/pytorch/pytorch/pull/145886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146050 Approved by: https://github.com/avikchaudhuri	2025-01-31 02:56:42 +00:00
Pian Pawakapan	ffb424eab6	[dynamo/export] call local_scalar_dense when full() value is scalar tensor (#144999 ) Fixes https://github.com/pytorch/pytorch/issues/144907 ``` class Foo(torch.nn.Module): def forward(self, val): return torch.full((80, 2), val, dtype=torch.float32) export(Foo(), args=(torch.tensor(1),)) ``` When we have a `torch.full` call like above, where the fill value is a scalar Tensor and not a scalar value, the FX graph from `_dynamo.export()` contains a single node: the full op. We run into a `PendingUnbackedSymbolNotFound` error, because the `item()` call is implicit; the UnbackedSymInt is extracted but goes directly into the data of the output tensor value, and we're then unable to locate it when we try to compute unbacked bindings. On the other hand, non-strict export doesn't face this, because an explicit `item()`, or `local_scalar_dense` node is inserted, and the unbacked binding is directly the example value of that node. This adds a dynamo handler to imitate what happens in non-strict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144999 Approved by: https://github.com/angelayi	2025-01-31 02:45:43 +00:00
Menglu Yu	e01c898e51	[Customized Optimus] Add select cat aten pass (#145918 ) Summary: This is a follow up work of D68695717, where we can further reduce the number of cat kernels in the backward by designing new aten pass in the aten level. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/6943087f-91be-4dbd-9693-df0a11a50b73 Test UI: https://www.internalfb.com/intern/testinfra/testrun/11821949087998233 Network: Up: 101KiB Down: 132KiB (reSessionID-60e898af-f366-4247-a9f7-d8d7cd129fe0) Analyzing targets. Remaining 0/78148 Executing actions. Remaining 0/476147 Command: test. Finished 2 local Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to add the config ``` post_grad_fusion_options: { "normalization_aten_pass": {}, "split_cat_aten_pass": {}, "select_cat_aten_pass": {}, } ``` {F1974778773} baseline: aps-recgpt_ranking_1115_pt2_optimus-e52c1f277e proposal aps-recgpt_ranking_1115_pt2_optimus-1b0047ee0e Differential Revision: D68803384 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145918 Approved by: https://github.com/Yuzhen11	2025-01-31 02:35:10 +00:00
Ting Lu	08d88127fe	Use Magma-cuda 12.8 for libtorch (#146019 ) https://github.com/pytorch/pytorch/issues/145570 Build failure for libtorch wheel `CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax collect2: error: ld returned 1 exit status` Unsure if this is related, fixing as a start Pull Request resolved: https://github.com/pytorch/pytorch/pull/146019 Approved by: https://github.com/eqy	2025-01-31 02:19:23 +00:00
Sam Larsen	2811f33d12	Fix code cache + freezing compile-time regression (#145868 ) Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor value+metadata vs. metadata only. Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868 Approved by: https://github.com/eellison	2025-01-31 02:04:15 +00:00
Yu, Guangye	bf9d053fb8	[Break XPU] Fix Inductor cuda bias UT (#145934 ) # Motivation [Break XPU] inductor ut: `inductor/test_inplace_padding.py::InplacePaddingTest::test_pad_non_zero - RuntimeError: Expected to find "empty_strided_cuda((2048, 2048), (2048, 1), torch.float32).as_strided((2048, 2047), (2048, 1))" but did not find it` With this PR, `test_pad_non_zero` will pass on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145934 Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/desertfire	2025-01-31 01:39:39 +00:00
Oguz Ulgen	ccd27e8129	Turn on fx graph cache and automatic dynamic pgo local caches in fbcode (#146065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146065 Approved by: https://github.com/jamesjwu	2025-01-31 01:11:48 +00:00
Scott Wolchok	3fae5c8509	torchgen: support exception boundary for ExecuTorch functions (#144341 ) Needed for ExecuTorch diff D67904052. Differential Revision: [D67906411](https://our.internmc.facebook.com/intern/diff/D67906411/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144341 Approved by: https://github.com/Jack-Khuu	2025-01-31 01:05:21 +00:00
cyy	d94d816d96	Simplify handling of max jobs in CMake builds (#145820 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145820 Approved by: https://github.com/malfet	2025-01-31 00:55:39 +00:00
Yifu Wang	c70362fac8	[AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011 ) [D68734067](https://our.internmc.facebook.com/intern/diff/D68734067) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-31 00:48:51 +00:00
Animesh Jain	1e3d1738a4	[dynamo][polyfills]Support getrecursionlimit (#145989 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145989 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145986, #145987, #145994	2025-01-31 00:47:31 +00:00
Animesh Jain	e7bb608d02	[dynamo][dicts] Support construction of types.MappingProxyType (#145994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145994 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145986, #145987	2025-01-31 00:47:31 +00:00
Animesh Jain	4665bc2cc0	[dynamo][functions] Support `id` on function (#145987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145987 Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #145986	2025-01-31 00:47:23 +00:00
Animesh Jain	56307dc370	[dynamo][dicts] Raise exception on pop (#145986 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145986 Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel	2025-01-31 00:47:13 +00:00
Colin Peppler	e6704a2447	Allow replacing unbacked with very large upperbound by returning no-op for FloorToInt(int) (#146001 ) * Let's say x is an integer beyond 2^53 where Python floats lose precision i.e. can't increment by 1. * Therefore, float(x) will lose precision and won't retain the exact value of x even though it's an integer. * That means `FloorToInt(very_large_number)` will lose precision if we cast it to float ``` >>> int(float(1000000007999999992)) 1000000008000000000 ``` This means when we try to do this in set_replacement(): `32bb6f83d5/torch/fx/experimental/symbolic_shapes.py (L6011-L6019)` We run into this: ``` TORCH_LOGS="+torch.fx.experimental.symbolic_shapes" pytest -s test_export.py -k test_replace_unbacked_with_very_large_upperbound File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6258, in _maybe_guard_rel self._set_replacement(rhs, self._find(lhs), "trivial_rhs") File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6039, in _set_replacement assert tgt_bound.issubset( torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>((FakeTensor(..., size=(2s0,)), FakeTensor(..., size=(u0,))), **{}): tgt_bound=VR[4, 1000000008000000000] not a subset of src_bound=VR[4, 1000000007999999992] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146001 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145898	2025-01-31 00:25:20 +00:00
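The precision loss described in the `FloorToInt` commit (#146001) above can be reproduced with plain Python, without PyTorch. The following is a minimal sketch; the helper name `survives_float_roundtrip` is made up for illustration and is not part of `symbolic_shapes.py`.

```python
# Integers up to 2**53 are exactly representable as float64;
# beyond that, neighbouring floats are more than 1 apart.
LIMIT = 2**53


def survives_float_roundtrip(x: int) -> bool:
    """Hypothetical helper: True if casting x to float and back keeps its value."""
    return int(float(x)) == x


# Upper bound quoted in the commit message.
big = 1000000007999999992
rounded = int(float(big))

print(survives_float_roundtrip(LIMIT))      # True
print(survives_float_roundtrip(LIMIT + 1))  # False
print(rounded)                              # 1000000008000000000

# The rounded bound is larger than the original one, which is why a range
# ending at `rounded` cannot be a subset of a range ending at `big`.
print(rounded > big)                        # True
```

Returning the integer unchanged instead of routing it through a float avoids this rounding, which is the no-op behaviour the commit title describes.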
soulitzer	c72b536420	Add manual override flag for core ATen op detection during bc check (#146052 ) Fixes https://github.com/pytorch/pytorch/issues/146049 Today the bc detection logic ignores allow_list for core ATen ops (a PR landed 4 months ago to enable this). The problem is that if I have a PR that removes an op, the script can no longer check whether that op is a core ATen op (today we just error out). With my fix, the script (1) conservatively assumes a core ATen op in such cases and (2) allows the user to specify in their ALLOW_LIST entry that their op is not a core ATen op. Test plan: - This is tested 2 PRs above `016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052 Approved by: https://github.com/albanD	2025-01-30 23:57:01 +00:00
briancoutinho	720b8d0d8d	[inductor/profiler] add kernel kwargs instrumentation (#145573 ) ## About As above, record the kernel launch kwargs. These tend to be constexpr arguments to triton kernels, such as block size. ## Test program Note, install triton before proceeding (pip install triton) triton_test.py>>> ``` import torch from torch.profiler import profile, ProfilerActivity def foo(x, y): a = torch.sin(x) b = torch.cos(y) return a + b def main(): x = torch.randn(10, 10).cuda() y = torch.randn(10, 10).cuda() opt_foo = torch.compile(foo) z = opt_foo(x, y) # Profile the kernel function on the GPU with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True ) as prof: z = opt_foo(x, y) # Export the trace to a file prof.export_chrome_trace("my_kernel_trace.json") if __name__ == "__main__": main() ``` Run it and we should get a trace file, my_kernel_trace.json. The output has a triton event with the kernel_kwargs attribute. ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815, "ts": 2045246693014.959, "dur": 75.662, "args": { ... "kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)", "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py", "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor" } }, ``` ## Unit Test Updated unit test: ``` pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573 Approved by: https://github.com/davidberard98, https://github.com/jansel	2025-01-30 23:51:44 +00:00
Avik Chaudhuri	8117656162	nonzero_static with symint size (#146006 ) Summary: Previously `nonzero_static` would force specialization on the `size` argument. This PR enables it to be used with a dynamic `size` argument. Test Plan: added test Differential Revision: D68874784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146006 Approved by: https://github.com/angelayi	2025-01-30 23:42:42 +00:00
Ke Wen	9fdc20809a	[PGNCCL] Simplify support macro definition (#145964 ) - Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964 Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang ghstack dependencies: #145893	2025-01-30 23:26:32 +00:00
PyTorch MergeBot	4280232f21	Revert "Advance past fc window for stft center (#145437 )" This reverts commit 3ef1551f5a745c1d37ff421eb4678814ef4483e4. Reverted https://github.com/pytorch/pytorch/pull/145437 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks some slow trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145437#issuecomment-2625840742))	2025-01-30 23:14:16 +00:00
Murray Steele	f85e4c1360	Enable C++ API parity tests on AArch64 (#145370 ) Re-enables C++ API parity tests on AArch64 which now pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145370 Approved by: https://github.com/albanD	2025-01-30 22:42:49 +00:00
Pat Vignola	2f60f12f8b	[Torch] Extract arange_out resizing logic into a helper function that can be used by other devices (#145747 ) Summary: We want to use the resizing implementation for arange_out in other devices (in this case MTIA), to make sure that the computations match and to avoid off-by-one-errors. Test Plan: Existing CI tests pass. Differential Revision: D68694489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145747 Approved by: https://github.com/mortzur	2025-01-30 22:37:00 +00:00
Nikita Shulga	99a0940991	[MPS] Fix regression in con-contig bitwise ops (#146085 ) Caused by https://github.com/pytorch/pytorch/pull/128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous. Fixes https://github.com/pytorch/pytorch/issues/145203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085 Approved by: https://github.com/dcci	2025-01-30 22:36:56 +00:00
Eddie Yan	e2917245fb	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-01-30 22:33:50 +00:00
PyTorch MergeBot	7391cea857	Revert "[triton] Update pin to tip of 3.2 release (#145867 )" This reverts commit 5e5da9bd9afdbb51da3dcc39947347279ccd9130. Reverted https://github.com/pytorch/pytorch/pull/145867 on behalf of https://github.com/ZainRizvi due to Sorry, this PR may have been written correctly, but something is clearly broken with the infra that's making CI very unhappy with this new triton version. Since this has been blocking viable/strict upgrades for a couple days now, I'm reverting this PR. I'll sync with @atalman on how we should fix this. ([comment](https://github.com/pytorch/pytorch/pull/145867#issuecomment-2625720817))	2025-01-30 22:24:09 +00:00
Aaron Orenstein	23695ea002	Fix dynamo use of `list[int]` in graph break (#145554 ) This reintroduces the change backed out by #145393 and fixes the underlying problem. Although using a BuiltinVariable was better than nothing when we saw a GenericAlias it had problems if there was a graph break and we had to reconstruct the original python code which BuiltinVariable did as a simple `list` instead of a `list[int]`. This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct. Original commit changeset: 77b9193acb23 python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554 Approved by: https://github.com/anijain2305 ghstack dependencies: #145551, #145552, #145553	2025-01-30 22:21:40 +00:00
Aaron Orenstein	fbb076cc45	Fix call to create_load_global (#145553 ) There is no version of create_load_global() that takes three parameters - any use of this function will fail. I think this is probably the correct fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145553 Approved by: https://github.com/anijain2305 ghstack dependencies: #145551, #145552	2025-01-30 22:21:40 +00:00
Aaron Orenstein	ccbbc88bbb	Turn on mypy for _dynamo/variables/builtin.py (#145552 ) The fact that mypy errors were ignored was hiding several bugs in builtin.py (for example the previous diff's incorrect override and use of `call_getattr`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145552 Approved by: https://github.com/anijain2305, https://github.com/Skylion007 ghstack dependencies: #145551	2025-01-30 22:21:32 +00:00
Aaron Orenstein	f3120f6d26	Remove incorrect BuiltinVariable.call_hasattr() (#145551 ) BuiltinVariable.call_hasattr() overrides the base class - but actually behaves differently. The base is `obj.call_hasattr(tx, attr)` but BuiltinVariable's version is `<unused>.call_hasattr(tx, obj, attr)`. The BuiltinVariable version is used as a pattern from `call_self_handler()` for `BuiltinVariable(hasattr)`. I think the other version is just used for internal `hasattr(obj, name)` so I renamed that one to `call_obj_hasattr`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145551 Approved by: https://github.com/anijain2305	2025-01-30 22:21:19 +00:00
clr	d100e9ae74	inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122 ) If a nn.module getattr call throws, we should make sure that we don't crash with an internal error Note that I couldn't figure out how to test this, so advice would be awesome. I have my best case attempt at https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122 Approved by: https://github.com/jansel	2025-01-30 21:55:29 +00:00
Natalia Gimelshein	08ff11e9d0	initialize device when pinning memory on this device, short circuit i… (#145752 ) …s_pinned if device is not initialized Do not land RFC potential fix for #144687 Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752 Approved by: https://github.com/albanD Co-authored-by: albanD <albandes@fb.com>	2025-01-30 21:37:29 +00:00
Michael Lazos	1252c1933d	Update to remind users to use torch.compile template (#145960 ) Users have been submitting fuzzer issues without meeting the requirements outlined in the torch.compile issue template. This updates the note to remind users to use the torch.compile template for torch.compile bugs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145960 Approved by: https://github.com/eellison	2025-01-30 21:34:40 +00:00
Michael Lazos	d14046b58d	Update fuzzer guidance to include rng (#145962 ) Add another condition to fuzzer issue guidance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145962 Approved by: https://github.com/eellison	2025-01-30 21:33:57 +00:00
Yidi Wu	7e7341bddd	[hop] fix unbacked_bindings meta for while_loop (#143559 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143559 Approved by: https://github.com/zou3519	2025-01-30 21:33:09 +00:00
Thomas Bohnstingl	9f9904172d	[scan] scan dim handling in user-facing scan() (#145179 ) This PR introduces the capability that the scan dim is handled in the user facing scan() call. Internally, the scan dim is always shifted to dim 0 and then the scan is performed over that dim. This is a follow-up PR from https://github.com/bohnstingl/pytorch/pull/3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145179 Approved by: https://github.com/ydwu4	2025-01-30 21:09:07 +00:00
Ankita George	70f6aaa786	[OSS] Add kwargs to fsspec reader/writer (#145845 ) Summary: Add kwargs to fsspec reader/writer. This will be used when reading/writing from huggingface because it needs a token to access the repositories Test Plan: https://fburl.com/anp/agkrlas1 ability to read write to hf with fsspec Differential Revision: D68738777 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145845 Approved by: https://github.com/mhorowitz	2025-01-30 21:00:58 +00:00
Justin Chu	e6c39d37e9	[ONNX] Create deprecation warning on dynamo_export (#146003 ) Deprecation of `torch.onnx.dynamo_export`: * [`torch/onnx/_internal/_exporter_legacy.py`](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86): Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. [[1]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86) [[2]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR231-R234) [[3]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR442-R445) [[4]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR700-R703) This PR also removed the `**_` kwarg on onnx.export such that users get an error when they supply an unexpected argument. Updated to emit a deprecation warning because it is more appropriate: https://docs.python.org/3/library/exceptions.html#DeprecationWarning Pull Request resolved: https://github.com/pytorch/pytorch/pull/146003 Approved by: https://github.com/titaiwangms	2025-01-30 20:13:32 +00:00
Nikita Shulga	1fdb4d65c0	[MPS] Extend `torch.mm`/`torch.bmm` to integral types (#145809 ) This uses the `naive_mm` kernel, making sure that accumulation is done over int32 for smaller int types (and float for half and bfloat), and adds `naive_bmm`, which follows the same pattern. Remove stale restriction on `torch.dot` (which works fine on MacOS-14/15) This also enables integer op flavors for: - `addmv` - `einsum` - `inner` - `linalg.multi_dot` - `matmul` - `mv` - `tensordot` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809 Approved by: https://github.com/dcci	2025-01-30 19:35:25 +00:00
Jack Zhang	3ef1551f5a	Advance past fc window for stft center (#145437 ) Long overdue follow-up on https://github.com/pytorch/pytorch/pull/73432/files#diff-5f3d4caa0693a716fc46fd7f6339312f1b5f0bf89e3a3ff58e9dc13a9486b17aR719 Onnx stft doesn't support centering, [and all of the existing tests are for center = False](https://github.com/pytorch/pytorch/blob/main/test/onnx/test_pytorch_onnx_onnxruntime.py#L8026). I will open a follow-up issue to address this, this is just a nice-to-have. Pr chain: - -> [Advance past fc window for stft center #145437](https://github.com/pytorch/pytorch/pull/145437) - [Add stft option to align window for center = false #145324](https://github.com/pytorch/pytorch/pull/145324) - [Add istft option to align window for center = false](https://github.com/pytorch/pytorch/pull/145510) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145437 Approved by: https://github.com/justinchuby, https://github.com/iseeyuan	2025-01-30 19:09:18 +00:00
Yidi Wu	a3698ebd5c	[while_loop] specialize when cond_fn return constants (#144515 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515 Approved by: https://github.com/zou3519	2025-01-30 19:02:34 +00:00
Bin Bao	16420a78eb	[AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039 ) Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now. Test Plan: CI Differential Revision: D68871245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039 Approved by: https://github.com/yushangdi, https://github.com/hl475	2025-01-30 19:01:19 +00:00
Yidi Wu	d1143c4b37	[export] fix non-strict pre_dispatch exporting while_loop (#145762 ) fix https://github.com/pytorch/pytorch/issues/145737. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145762 Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519, https://github.com/avikchaudhuri	2025-01-30 18:58:34 +00:00
clr	f746bb6311	config: Don't spam warnings about reference type configs (#145800 ) Summary: https://github.com/pytorch/pytorch/issues/145755 The is_dynamic check for reference types was subtly broken, causing log spam after it was accessed Added an explicit type for is_default for reference types to make sure this behaviour is correct Pull Request resolved: https://github.com/pytorch/pytorch/pull/145800 Approved by: https://github.com/eellison	2025-01-30 18:57:16 +00:00
Gabriel Ferns	5a527fa5ee	Make sure not using cpp wrapper when setting nvtx training annotation (#145538 ) Longer term would be good to add as a feature to cpp_wrapper, but this makes sure it doesn't fail on main. Not sure if this needs a test because it's not meant to compose, but will add one if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538 Approved by: https://github.com/desertfire	2025-01-30 18:34:22 +00:00
Luca Wehrstedt	3ee655e4d4	[async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846 ) There's a sleep that is issued in order to "nudge" CUDA to do the right scheduling decision, but this is issued on iteration number 2. However, when the world size is 2, we never reach that iteration, which led to a suboptimal scheduling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846 Approved by: https://github.com/yifuwang	2025-01-30 18:26:34 +00:00
Ke Wen	51ee9b154e	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-30 18:19:00 +00:00
eellison	7796e308d0	Record inputs at time of tracing, constrain to them for triton fn (#145448 ) Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context. We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448 Approved by: https://github.com/zou3519 ghstack dependencies: #145953	2025-01-30 16:54:08 +00:00
PyTorch MergeBot	967cf85f3a	Revert "Update mi300 labels to account for multiple clusters. (#145923 )" This reverts commit 3e135993bd0fa08cbff565ae76bb15cb08e1d6d0. Reverted https://github.com/pytorch/pytorch/pull/145923 on behalf of https://github.com/atalman due to reverting back to one cluster ([comment](https://github.com/pytorch/pytorch/pull/145923#issuecomment-2625022826))	2025-01-30 16:45:50 +00:00
eellison	1c3df9ca8c	Fix signif_strides_equal for symints, dedupe (#145953 ) Previous impl would take a size hint, which was failing internally with a ``` strides1 = [V.graph.sizevars.size_hint(strides1[i]) for i in non_1_indices] File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/torch/_inductor/sizevars.py", line 554, in size_hint return int(out) File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/sympy/core/expr.py", line 307, in __int__ raise TypeError("Cannot convert symbols to int") ``` There are unbacked tests in test_triton which should exercise this, as well as other tests for these functions when they were added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145953 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-30 16:44:32 +00:00
matthewhagraphcore	aaddfc5a7f	Add TORCHINDUCTOR_VEC_ISA_OK env var for vec_isa_ok (#134667 ) Adds a `TORCHINDUCTOR_VEC_ISA_OK` env var for `vec_isa_ok` for A/B testing purposes. Similar setup to `fx_graph_remote_cache` to allow for default `None`. No tests were present for any other config settings here, nor for `vec_isa_ok` so I didn't add any. Motivation: PyTorch uses filelock with a timeout to determine if the CPU supports particular intrinsics: pytorch/torch/_inductor/cpu_vec_isa.py Therefore, if 2 processes are running and each process encounters the HAS_CPU test, a process that cannot acquire the lock for checking vec_isa_ok has its main thread put to sleep. Hence there is a bias towards non-sleeping processes in acquiring the lock, i.e. newly spawned processes. To avoid this, use an env variable so that each process is aware of this without going through the check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134667 Approved by: https://github.com/eellison	2025-01-30 16:22:48 +00:00
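The "default `None`" setup mentioned in the `TORCHINDUCTOR_VEC_ISA_OK` commit (#134667) above amounts to a tri-state setting: unset means "run the check", while an explicit value skips it. The following is a minimal sketch of that pattern only. The helper name `read_optional_bool_env` and the convention that `"1"` means true are assumptions for illustration, not the actual parsing in `torch/_inductor/config.py`.

```python
import os
from typing import Optional


def read_optional_bool_env(name: str) -> Optional[bool]:
    """Hypothetical helper: map an env var to True, False, or None when unset."""
    value = os.environ.get(name)
    if value is None:
        # Unset: the caller falls back to its normal (lock-protected) check.
        return None
    # Assumed convention: "1" enables, anything else disables.
    return value == "1"


vec_isa_ok = read_optional_bool_env("TORCHINDUCTOR_VEC_ISA_OK")
if vec_isa_ok is None:
    print("not set: run the ISA check as usual")
else:
    print(f"set: skip the ISA check and use {vec_isa_ok}")
```

With this shape, every process started with the variable set gets the same answer without touching the file lock.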
PyTorch MergeBot	5fa28bbe40	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit 18a7a04c4adecda3be17dd364d48d484fd1dcdba. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))	2025-01-30 16:01:52 +00:00
titaiwangms	50086ab537	[ONNX] Delete `rename_dynamic_shapes_with_model_inputs` (#146002 ) Basically, this function brings more cons than pros. It was nice to have an automation help users to convert top-level key of dynamic shapes to arg names. However, this function has a bug when the model input has the same amount as dynamic_shapes in coincidence: ```python input_names # 'input_ids', 'past_key_values.0.key', 'past_key_values.0.value', 'past_key_values.1.key', 'past_key_values.1.value', 'past_key_values.2.key', 'past_key_values.2.value', 'past_key_values.3.key', 'past_key_values.3.value', 'past_key_values.4.key', 'past_key_values.4.value', 'attention_mask', 'position_ids' inspect.sig(model.forward).parameters # mappingproxy(OrderedDict([('input_ids', <Parameter "input_ids: Optional[torch.LongTensor] = None">), ('past_key_values', <Parameter "past_key_values: Union[transformers.cache_utils.Cache, Tuple[Tuple[torch.Tensor]], NoneType] = None">), ('attention_mask', <Parameter "attention_mask: Optional[torch.FloatTensor] = None">), ('token_type_ids', <Parameter "token_type_ids: Optional[torch.LongTensor] = None">), ('position_ids', <Parameter "position_ids: Optional[torch.LongTensor] = None">), ('head_mask', <Parameter "head_mask: Optional[torch.FloatTensor] = None">), ('inputs_embeds', <Parameter "inputs_embeds: Optional[torch.FloatTensor] = None">), ('labels', <Parameter "labels: Optional[torch.LongTensor] = None">), ('use_cache', <Parameter "use_cache: Optional[bool] = None">), ('output_attentions', <Parameter "output_attentions: Optional[bool] = None">), ('output_hidden_states', <Parameter "output_hidden_states: Optional[bool] = None">), ('return_dict', <Parameter "return_dict: Optional[bool] = None">), ('cache_position', <Parameter "cache_position: Optional[torch.LongTensor] = None">)])) ``` In the above case, the given input_names is following onnx graph, while it has the same length as torch model forward call. 
This kind of case makes it difficult to detect, and automate for users. On the other hand, the error message from torch.export.export is quite informative that I believe users will know how to go from there: ```python import torch class Model(torch.nn.Module): def forward(self, x=None, y=None): return x + y dim = torch.export.Dim("x", min=1, max=6) onnx_program = torch.export.export( Model(), (), kwargs={"x": torch.randn(2, 3), "y": torch.randn(2, 3)}, dynamic_shapes={"custom_input_x": {0: dim}, "custom_input_y": {0: dim}}, ) # torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['x', 'y'] of `inputs`, but here they are ['custom_input_x', 'custom_input_y']. Alternatively, you could also ignore arg names entirely and specify `dynamic_shapes` as a list/tuple matching `inputs`. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146002 Approved by: https://github.com/justinchuby	2025-01-30 16:01:38 +00:00
IvanKobzarev	894ef8c1e3	[torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623 ) Issue: https://github.com/pytorch/pytorch/issues/144888 Torchbench of timm lcnet_050 model fails on accuracy in case of `--freezing` `--inference` `--bfloat16` `res_error==0.12` With inductor convolution constant folding turned off - `res_error==0.016` `float16 error ~ 0.00669` `float16 without conv folding ~ 0.0018` Convolution folding increases the error by almost one order of magnitude. I think we should revisit and try to do something to improve the accuracy for conv folding, for example doing conv folding at compilation time with float64? At the moment I am adding counters to identify if convolution folding happened, and in case of bfloat16 and conv_folding - increase multiplier to the max level (10) to pass accuracy test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623 Approved by: https://github.com/eellison	2025-01-30 12:46:35 +00:00
Aidyn-A	ffa628169d	[ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 (#145728 ) The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not re-usable on newer devices without modifications. This PR adds a guard for this matmul to be sm_90 specific. Once the kernel is there, the guard may be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145728 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-01-30 11:19:58 +00:00
shangdiy	6bd19e65b1	add inductor_triton_kernel_mapping_post_grad.json to tlparseadd changes (#145954 ) Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same). Add debug trace artifact to inductor_triton_kernel_mapping_post_grad.json (debug artifact for provenance tracking) to tlparse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954 Approved by: https://github.com/YUNQIUGUO	2025-01-30 06:18:48 +00:00
cyyever	8a6e9a88e9	Let PYTORCH_NO_CUDA_MEMORY_CACHING take effect only when its value is 1 (#145905 ) Fixes #145661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905 Approved by: https://github.com/eqy, https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-30 05:11:10 +00:00
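The behaviour named in the title above can be illustrated with a minimal Python sketch. This is not the PyTorch implementation; the helper name `no_cuda_memory_caching` and its `environ` parameter are hypothetical and exist only to show the "enabled only for the exact value 1" rule.

```python
import os


def no_cuda_memory_caching(environ=None):
    """Return True only when the variable is set to exactly "1".

    Hypothetical helper for illustration: values such as "0", "" or
    "false" leave caching enabled, and so does an unset variable.
    """
    env = os.environ if environ is None else environ
    return env.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
```

Passing a plain dict as `environ` keeps the check easy to exercise without touching the real process environment.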
Boyuan Feng	58cc6693cb	[BE] Type annotate wrapper_benchmark.py and cuda_combined_scheduling.py (#145542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145542 Approved by: https://github.com/eellison	2025-01-30 03:53:52 +00:00
Nikita Shulga	8cc6f17334	[CD] Install OpenMP from homebrew (#145889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145889 Approved by: https://github.com/atalman ghstack dependencies: #145871, #145870	2025-01-30 03:19:51 +00:00
Nikita Shulga	0d5f0a81c5	[CMake] Find HomeBrew OpenMP on MacOS (#145870 ) Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870 Approved by: https://github.com/Skylion007, https://github.com/atalman ghstack dependencies: #145871	2025-01-30 03:19:51 +00:00
cyy	116af809eb	Use std::string_view (#145906 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906 Approved by: https://github.com/albanD	2025-01-30 03:14:27 +00:00
Benjamin Glass	933b6d9830	cpp_wrapper: enable in aarch64 and x86 nightly dashboard performance runs (#145791 ) Adds `cpp_wrapper` mode to the nightly inductor benchmark runs, as well as optionally for manually triggered runs. This is justified by `aot_inductor` already being in those runs. Additionally, re-enables `aot_inductor` in the nightly aarch64 runs. It was disabled 5 months ago to deal with a performance instability, which has likely gone away at this point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145791 Approved by: https://github.com/desertfire	2025-01-30 02:55:45 +00:00
Gabriel Ferns	32bb6f83d5	Make sure that benchmark_harness is set before running (#145532 ) Running torch compile with these options causes an error, because the benchmark code isn't generated but is still called. ``` options={'profile_bandwidth_output': 'foo', 'benchmark_harness': False} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145532 Approved by: https://github.com/eellison	2025-01-30 01:25:53 +00:00
Ke Wen	25ca05eebf	[PGNCCL] Correct some ifdef's (#145893 ) `create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893 Approved by: https://github.com/c-p-i-o	2025-01-30 01:05:21 +00:00
Vasu Agrawal	73dde451b7	[pytorch] Sprinkle in a few `template` keywords (#145877 ) Summary: These seem to be necessary to get compilation working on Windows with CUDA 12.8. I'm not sure whether this means that all of the previous compilers were broken, and the new one is better, or whether this is a regression in NVCC 12.8. Either way, as long as the CI passes for existing versions, this should unblock us from CUDA 12.8 enablement on Windows. See D68663662 for more details on the CUDA 12.8 enablement. Test Plan: CI! Reviewed By: akrieger Differential Revision: D68787925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145877 Approved by: https://github.com/Skylion007	2025-01-30 00:57:40 +00:00
angelayi	72699950b0	Copy model before benchmark warmup runs (#145858 ) Fixes https://github.com/pytorch/pytorch/issues/144772 The eager warmup runs cause the model to change state, so that when we export it later, the model is different from the one we would export directly out of the box. For some reason exporting the model with the changed state causes issues, but exporting the initial model is ok. This is the reason why the accuracy checks pass but the performance check fails when exporting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858 Approved by: https://github.com/desertfire	2025-01-30 00:36:33 +00:00
clr	6b41f310c2	config: Support str env variables (#145980 ) Summary: This allows us to use environment variables to set string values. We've added tests for the specific functionality implemented here. Note that we already accidentally started setting up configs to use this, so we're just adding the feature. Additionally, we're not fully validating the underlying type when we set the value (and in general, it's more difficult than we would like to do this). Let me know if people feel strongly, and we can add a PR to do this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980 Approved by: https://github.com/yushangdi, https://github.com/oulgen	2025-01-30 00:13:02 +00:00
Yang Wang	a9ed7bd78e	[utilization] pipeline to create clean db records (#145327 ) upload_utilization_script to generate db-ready-insert records to s3 - generate two files: metadata and timeseries in ossci-utilization buckets - convert log record to db format ones - add unit test job for tools/stats/ Related Prs: setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310 add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595 add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-01-29 23:48:50 +00:00
Ke Wen	18a7a04c4a	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 23:20:22 +00:00
PyTorch MergeBot	b60120d0df	Revert "[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 )" This reverts commit 81685d81eb86595d169f55a564da26eaafb2ddf5. Reverted https://github.com/pytorch/pytorch/pull/145746 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking in trunk. See functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/13032483748/job/36358184032) [HUD commit link](`81685d81eb`) ([comment](https://github.com/pytorch/pytorch/pull/145746#issuecomment-2623108958))	2025-01-29 23:02:23 +00:00
Colin Peppler	521588519d	re-use FloorDiv for RShift (#145898 ) I encountered this C++ compilation error. ``` 579 \| int64_t var_6 = (static_cast<int64_t>(std::floor((1.0/2.0)u0)) \| static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0))))) \| std::floor((1.0/16.0)(static_cast<int64_t>(std::floor((1.0/2.0)u0)) \| static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0)))))); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| \| \| \| int64_t {aka long int} double ``` Then, I figured out where this std::floor came from with the help of Bob's guard provenance tool. It comes from RShift which is used in `triton.next_power_of_2`. --- Before, we used `std::floor` ``` int64_t var_6 = ( static_cast<int64_t>(std::floor((1.0/2.0)u0)) \| static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0))))) \| std::floor((1.0/16.0)(static_cast<int64_t>(std::floor((1.0/2.0)u0)) # no cast to int here. 
\| static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0)))))); ``` Now, we use `c10::div_floor_integer` instead ``` int64_t var_6 = ( (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) \| (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))) \| (c10::div_floor_integer(static_cast<int64_t>((c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) \| (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))), static_cast<int64_t>(16L))); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145898 Approved by: https://github.com/desertfire, https://github.com/bobrenjc93 ghstack dependencies: #145802	2025-01-29 22:50:22 +00:00
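The change above relies on a right shift by `n` being the same as integer floor division by `2**n`. The following Python sketch checks that identity and shows why routing the computation through floating point is fragile. The function name `rshift_as_floordiv` is hypothetical, and Python integers only stand in for the `int64_t` values in the generated C++.

```python
import math


def rshift_as_floordiv(x: int, shift: int) -> int:
    """Express x >> shift as an integer floor division by 2**shift."""
    return x // (1 << shift)


# The identity holds for negative values too, because both operations
# round towards negative infinity.
for x in range(-64, 65):
    for shift in range(6):
        assert rshift_as_floordiv(x, shift) == x >> shift

# Going through floating point can lose low bits for large inputs:
# 2**60 + 3 is not exactly representable as a double.
big = 2**60 + 3
float_result = math.floor(big * (1.0 / 2.0))
int_result = rshift_as_floordiv(big, 1)
```

Staying in integer arithmetic avoids both the precision loss and the mixed `int64_t`/`double` operands that the compiler rejected.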
eellison	3df961d99b	give emulate_precision_casts an envar (#145948 ) this was requested internally Pull Request resolved: https://github.com/pytorch/pytorch/pull/145948 Approved by: https://github.com/mlazos	2025-01-29 22:43:32 +00:00
rzou	2e5886dcc4	Add fake_impl for unique_consecutive (#145649 ) Summary: It's fairly similar to torch.unique and torch.unique_dim. Test Plan: New test Pull Request resolved: https://github.com/pytorch/pytorch/pull/145649 Approved by: https://github.com/ezyang, https://github.com/eellison	2025-01-29 22:33:16 +00:00
rzou	1e57154af3	Require that all HOPs be imported at `import torch` time (#145939 ) E.g. torch.ops.higher_order.cond does not exist until it is imported, which is bad if it shows up in an FX graph or is used in some code somewhere. This PR also makes some more HOPs get imported at `import torch` time. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939 Approved by: https://github.com/ydwu4 ghstack dependencies: #145938	2025-01-29 22:27:52 +00:00
rzou	2141c1aebe	Better hop_db comment; move test to a non-export test file (#145938 ) Goal is for people to better test their HOPs. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145938 Approved by: https://github.com/ydwu4	2025-01-29 22:27:52 +00:00
Simon Fan	e02c038a23	[dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590 ) FIXES https://github.com/pytorch/pytorch/issues/144775 frfr See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385 We fixed some silent incorrectness, but it results in less nodes DCE'd. The benchmark iteration loop had some dead code which could contain side effect ops that aren't safe to DCE. The regression is expected. This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and aligns with the benchmarking used by performance tests New benchmark results: ```python dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364 # after https://github.com/pytorch/pytorch/pull/144319 cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257 # before https://github.com/pytorch/pytorch/pull/144319 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590 Approved by: https://github.com/jansel ghstack dependencies: #145447	2025-01-29 22:14:47 +00:00
Jason Ansel	793dfc27e0	[inductor] Add some typing to triton.py (#145688 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145688 Approved by: https://github.com/Skylion007, https://github.com/eellison ghstack dependencies: #145671, #145695	2025-01-29 21:56:40 +00:00
Jason Ansel	5db0ad92e3	[inductor] Remove mask_str from IndexingOptions (#145695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145695 Approved by: https://github.com/eellison ghstack dependencies: #145671	2025-01-29 21:56:40 +00:00
Jason Ansel	23ff899164	[inductor] Fix handling of fixed XBLOCK larger than xnumel=1 (#145671 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145671 Approved by: https://github.com/eellison	2025-01-29 21:56:32 +00:00
Aaron Gokaslan	bb2fb554a9	[BE]: Update CUTLASS submodule to 3.7.0 (#145172 ) * This has a couple of new features, but mostly has a lot of bugfixes for the prior releases * This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it. * Most of the remaining diff noise is copyright year updates on the CUTLASS submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172 Approved by: https://github.com/eqy, https://github.com/henrylhtsang	2025-01-29 21:48:01 +00:00
James Wu	d0aa1386b8	Disable AOTAutogradCache for triton version < 3.2 (#145937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937 Approved by: https://github.com/bdhirsh	2025-01-29 21:32:16 +00:00
PyTorch MergeBot	1185b81c51	Revert "[dynamo] Use polyfill to implement comparison operators (#144485 )" This reverts commit d1f82de2bf4ce4d4461791a9c9b2e759202db0bb. Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))	2025-01-29 21:30:42 +00:00
Catherine Lee	953e80936e	[linter] Grep linter batches long command (#145950 ) If the command is too long, the linter fails with ``` Failed due to OSError: [Errno 7] Argument list too long: 'grep' ``` Fix this by batching the command so it is shorter. The limit of 750k was chosen because `getconf ARG_MAX` returns ~1M on my mac. My guess is that most people shouldn't hit this unless they run --all-files and the directory length is long. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950 Approved by: https://github.com/wdvr	2025-01-29 21:23:27 +00:00
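The batching idea can be sketched as follows. This is an illustration under assumptions, not the linter's actual code: the function name `batch_args` is hypothetical, and the 750k limit from the message is treated here as a character count (one extra character per argument for the separator).

```python
def batch_args(args, max_chars=750_000):
    """Split args into consecutive batches whose total length stays
    at or below max_chars, so each batch fits on one command line."""
    batches = []
    current = []
    size = 0
    for arg in args:
        cost = len(arg) + 1  # +1 for the separating space
        # Start a new batch when adding this argument would overflow.
        if current and size + cost > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(arg)
        size += cost
    if current:
        batches.append(current)
    return batches
```

Each batch would then be passed to its own `grep` invocation and the results merged. A single argument longer than the limit still gets a batch of its own rather than being dropped.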
Zain Rizvi	a6e3f294f1	Don't use mypy daemon in CI (#145961 ) This is an attempt to fix flaky mypy errors in CI that look like: ``` dmypy status --verbose connection_name : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock pid : 32233 error : timed out Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill ``` "Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961 Approved by: https://github.com/malfet	2025-01-29 21:15:29 +00:00
bglass@quansight.com	40ccb7a86d	cpp_wrapper: Move #includes to per-device header files (#145932 ) Summary: This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts. Co-authored-by: Benjamin Glass <bglass@quansight.com> Differential Revision: D68656960 Pulled By: benjaminglass1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932 Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1 Co-authored-by: bglass@quansight.com <bglass@quansight.com>	2025-01-29 21:08:45 +00:00
sanchitintel	8bd7bf3269	[Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894 ) ### Summary `RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data. Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels. ### Caveat #### _Before_ No corresponding results in PyTorch profiler profiling data #### _After_ \| Inductor config settings \| What kernel name looks like in profiling data \| Comments\| \|-------------------\|------------------------------------\|--------------------\| \| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code \| `graph_x_cpp_fused_y` \| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| \| `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper \| `graph_x_kernel` \| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| \| Both `inductor_config.cpp.descriptive_names = "inductor_node"` & Inductor CPP Wrapper \| `graph_x_cpp_fused_flex_attention_y`\| Easy to interpret data \| \| Neither of the two configs \| `graph_x_kernel`\| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel	2025-01-29 20:54:46 +00:00
Danial Javady	bb4964013f	Add deterministic kernel for reflection2d (#136241 ) Adds feature for #98925 Tests pass for both existing reflectionpad2d and the new one I inserted. Summary of the work: Simple conditional check for deterministic mode that will dispatch to a different kernel. This kernel does not use any atomic operations and leads to deterministic results because, instead of going from output to input (a 1:1 relationship), I am doing the opposite: going from input -> all outputs, which is 1 to many. These operations are done in the same order every execution, as I simply traverse the data set with a grid stride loop and use simple linearized indexing into the input tensor. So each thread will compute the 4 conditionals, which are then used to see if the input has an output in the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, bottom right. I did not focus on performance for this PR as that would expand the scope heavily. If there are any performance questions, though, I can answer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241 Approved by: https://github.com/eqy, https://github.com/albanD	2025-01-29 20:34:03 +00:00
Ankita George	2b8c28099a	[OSS] Add no dist as an argument to DCP top level apis (#145754 ) Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP Test Plan: existing tests pass Differential Revision: D68714246 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754 Approved by: https://github.com/daulet-askarov	2025-01-29 20:33:37 +00:00
chilli	2d5d022594	Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-29 20:27:39 +00:00
Nikita Shulga	6aed6c042e	[CD] Install ninja and setuptools from PyPI (#145871 ) As well as typing extensions, they are available from PyPI, no reason to install them from Anaconda Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871 Approved by: https://github.com/Skylion007	2025-01-29 19:47:16 +00:00
PyTorch MergeBot	b80482988f	Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870 )" This reverts commit c26bb9ba5bd40d256a25436212279bc7e4b436ae. Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
PyTorch MergeBot	b52e8d521e	Revert "[CD] Install ninja and setuptools from PyPI (#145871 )" This reverts commit eea7d395e5faa9a4be5b60f6668c0bdf5163e3a0. Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
Jack Taylor	082fab0fc7	[64-bit] Int64 casting for UpSampleNearest3D (#144865 ) Fixes #144855 Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865 Approved by: https://github.com/eqy	2025-01-29 19:30:09 +00:00
angelayi	1c9014a135	[export] Add tlparse to draft-export (#145810 ) Dependent on https://github.com/ezyang/tlparse/pull/87/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810 Approved by: https://github.com/pianpwk	2025-01-29 19:26:00 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit 9fd6722fc9068eeaa176754acb315fc7e0f6416c. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
PyTorch MergeBot	e0525dbca9	Revert "inductor.config.descriptive_names = False is not actually supported (#145523 )" This reverts commit edf266e9bbbf6063f7c4a336ffb50234e11a0a82. Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))	2025-01-29 18:27:44 +00:00
PyTorch MergeBot	284f217011	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit 97b3b73f3e96bb8684064715b93c825ba0395475. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))	2025-01-29 18:24:29 +00:00
PyTorch MergeBot	0d6343347f	Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448 )" This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed. Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))	2025-01-29 18:07:12 +00:00
Avik Chaudhuri	1a613c3342	bump counters for unbacked binding names (#145882 ) Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning based on what the symbols in the original shape env before serialization were. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match. Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen). Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882 Approved by: https://github.com/pianpwk	2025-01-29 17:46:21 +00:00
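The counter-bumping idea can be sketched in a few lines of Python. This is an illustration only, not the shape-env code: the helper `make_fresh_namer` is hypothetical, and the `u<N>` naming pattern for unbacked symbols is an assumption made for the example.

```python
import itertools
import re


def make_fresh_namer(existing_names, prefix="u"):
    """Return a function producing names that cannot clash with
    existing_names, even when the existing indices have gaps."""
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)$")
    used = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    # Bump the counter once, up front, past the largest index seen.
    counter = itertools.count(max(used, default=-1) + 1)
    return lambda: f"{prefix}{next(counter)}"
```

With existing symbols `u0` and `u3`, the first fresh name is `u4`. Counting how many symbols were processed instead would hand out `u2` and then `u3`, which collides with an existing name.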
rpsilva	4abff4b271	Introduce cache clearing APIs for the lazy graph executor (#144489 ) This PR introduces two new methods to the LazyGraphExecutor class: - ClearComputationCache(): Allows clearing the entire computation cache. - RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash. The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance: - Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently. - Selectively remove cache entries to analyze the impact on subsequent computations. - Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations. On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash. Simultaneously, another motivation, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489 Approved by: https://github.com/wconstab	2025-01-29 17:38:01 +00:00
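The two operations described above can be sketched against a small LRU cache. This is a Python illustration under assumptions, not the C++ `LazyGraphExecutor` code: the class `ComputationCache` and its method names are hypothetical stand-ins for `ClearComputationCache()` and `RemoveFromComputationCache(hash)`.

```python
from collections import OrderedDict


class ComputationCache:
    """Tiny LRU cache keyed by computation hash (illustrative only)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def add(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def remove(self, key):
        """Drop one entry; report whether it was present."""
        if key in self._entries:
            del self._entries[key]
            return True
        return False

    def clear(self):
        """Drop every entry, e.g. between parameterized test runs."""
        self._entries.clear()
```

A test can call `clear()` before each run to start from a cold cache, or `remove(key)` to force a recompilation of a single computation while leaving the rest warm.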
Animesh Jain	d1f82de2bf	[dynamo] Use polyfill to implement comparison operators (#144485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485 Approved by: https://github.com/jansel	2025-01-29 17:37:40 +00:00
saienduri	3e135993bd	Update mi300 labels to account for multiple clusters. (#145923 ) We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923 Approved by: https://github.com/jeffdaily	2025-01-29 16:56:43 +00:00
Animesh Jain	4499d60d56	[dynamo][builin-skipfiles-cleanup] Remove types (#145909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909 Approved by: https://github.com/zou3519 ghstack dependencies: #145856, #145875, #145878, #145892	2025-01-29 16:47:02 +00:00
Brian Hirsh	ed141d7d1a	dont assign a size to _assert_scalar in partitioner (#143877 ) Fixes https://github.com/pytorch/pytorch/issues/143876 Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877 Approved by: https://github.com/zou3519	2025-01-29 16:21:37 +00:00
Yu, Guangye	3b3aac0cde	Filter out iGPU if dGPU is found on XPU (#144378 ) # Motivation for https://github.com/pytorch/pytorch/issues/143914 On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context. Now I generalize the logic as below: 1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform. 2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform. 3. No GPU is found (neither iGPU nor dGPU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378 Approved by: https://github.com/EikanWang, https://github.com/gujinghui	2025-01-29 15:53:16 +00:00
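The three-step selection rule above can be sketched as follows. This is a Python illustration, not the SYCL enumeration code: the function `select_xpu_devices`, the list-of-platforms input, and the `"kind"` field with values `"dgpu"` and `"igpu"` are all hypothetical.

```python
def select_xpu_devices(platforms):
    """Pick the devices to expose from a list of platforms.

    platforms is a list of platforms, each a list of device dicts with
    a "kind" key. Discrete GPUs win over integrated ones, and only one
    platform is ever used so all devices share a context.
    """
    for kind in ("dgpu", "igpu"):
        for devices in platforms:
            chosen = [d for d in devices if d["kind"] == kind]
            if chosen:
                return chosen
    return []  # no GPU found at all
```

Returning devices from a single platform is what keeps every enumerated device in the same context, which is the stated goal of the change.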
Bert Maher	5e5da9bd9a	[triton] Update pin to tip of 3.2 release (#145867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867 Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte	2025-01-29 15:17:58 +00:00
Aidyn-A	81685d81eb	[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 ) This is a re-base PR to my previous one #141959. Description from the original PR: This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100. <details> <summary>The benchmark code used </summary> ```Python import time import torch from torch.profiler import profile, ProfilerActivity def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False): device = torch.device("cuda") shapes = [] for p in range(24, 30): shape = 1<<p shapes.append(shape) for shape in shapes: for _ in range(6): x = torch.randn(shape, device=device, dtype=dtype) y = function(x) if print_profile: x = torch.randn(shape, device=device, dtype=dtype) with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof: y = function(x) print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) x = torch.randn(shape, device=device, dtype=dtype) torch.cuda.synchronize() t1 = time.perf_counter() for _ in range(6): y = function(x) torch.cuda.synchronize() t2 = time.perf_counter() perf_time = (t2 - t1) / 6 print(f"{function.__name__}, {dtype}, {shape}, {perf_time}") if check_numerics: x_cpu = x.cpu() y_cpu = function(x_cpu).cuda() try: torch.testing.assert_allclose(y_cpu, y) except AssertionError as error: print("An exception occurred:", error) def main(): ops = [ torch.relu, torch.sigmoid, torch.tanh, torch.nn.functional.gelu, torch.sin, torch.exp, ] dtypes = [ torch.float16, torch.bfloat16, torch.float32, ] for op in ops: for dtype in dtypes: benchmark(op, dtype=dtype) torch.cuda.empty_cache() if __name__ == "__main__": main() ``` </details> <details> <summary> Results </summary> \| op \| dtype \| size \| time after \| time before \| % improvement \| \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| relu \| torch.float16 \| 33554432 \| 4.84E-05 \| 5.06E-05 \| 4.66296539127052 \| \| relu \| torch.float16 \| 
67108864 \| 9.22E-05 \| 9.64E-05 \| 4.56491432752297 \| \| relu \| torch.float16 \| 134217728 \| 0.000180343495837102 \| 0.000187981834945579 \| 4.23543919508829 \| \| relu \| torch.float16 \| 268435456 \| 0.000355071155354381 \| 0.000370856161074092 \| 4.44558942107169 \| \| relu \| torch.float16 \| 536870912 \| 0.000704489842367669 \| 0.000736006341564159 \| 4.47366268483987 \| \| relu \| torch.bfloat16 \| 16777216 \| 3.03E-05 \| 3.04E-05 \| 0.166504085842689 \| \| relu \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.45848238875716 \| \| relu \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.65E-05 \| 3.56122651631445 \| \| relu \| torch.bfloat16 \| 134217728 \| 0.000180805509444326 \| 0.000187998676362137 \| 3.97840029317567 \| \| relu \| torch.bfloat16 \| 268435456 \| 0.000356242332297067 \| 0.000371279485989362 \| 4.22104627356745 \| \| relu \| torch.bfloat16 \| 536870912 \| 0.000708114336399982 \| 0.000736773828975856 \| 4.04729732229083 \| \| relu \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.61E-05 \| 0.0442587268354941 \| \| relu \| torch.float32 \| 33554432 \| 9.33E-05 \| 9.30E-05 \| -0.259070913799022 \| \| relu \| torch.float32 \| 67108864 \| 0.000181321326332788 \| 0.000181289506144822 \| -0.0175490597877115 \| \| relu \| torch.float32 \| 134217728 \| 0.000356896334172537 \| 0.000356570177245885 \| -0.0913870206618981 \| \| relu \| torch.float32 \| 268435456 \| 0.000709421835684528 \| 0.000707465515006334 \| -0.275762681635911 \| \| relu \| torch.float32 \| 536870912 \| 0.00141372415237129 \| 0.00141036518228551 \| -0.237597276678471 \| \| sigmoid \| torch.float16 \| 16777216 \| 3.10E-05 \| 3.16E-05 \| 2.10012593866895 \| \| sigmoid \| torch.float16 \| 33554432 \| 4.91E-05 \| 5.23E-05 \| 6.37710600666122 \| \| sigmoid \| torch.float16 \| 67108864 \| 9.30E-05 \| 0.000100057009452333 \| 7.61866144555331 \| \| sigmoid \| torch.float16 \| 134217728 \| 0.000180928347011407 \| 0.000194982004662355 \| 7.76752669390248 \| \| sigmoid \| torch.float16 \| 
268435456 \| 0.000355658994521946 \| 0.00038468533117945 \| 8.16128288742412 \| \| sigmoid \| torch.float16 \| 536870912 \| 0.000705982849467546 \| 0.000764021339515845 \| 8.22094900634937 \| \| sigmoid \| torch.bfloat16 \| 16777216 \| 3.08E-05 \| 3.17E-05 \| 2.90965915673149 \| \| sigmoid \| torch.bfloat16 \| 33554432 \| 4.87E-05 \| 5.24E-05 \| 7.63503884668234 \| \| sigmoid \| torch.bfloat16 \| 67108864 \| 9.33E-05 \| 0.000100019678939134 \| 7.21238137428013 \| \| sigmoid \| torch.bfloat16 \| 134217728 \| 0.000180786165098349 \| 0.000194868014659733 \| 7.78922964250206 \| \| sigmoid \| torch.bfloat16 \| 268435456 \| 0.000355564659306159 \| 0.000384909333661199 \| 8.25297835063321 \| \| sigmoid \| torch.bfloat16 \| 536870912 \| 0.000705831005082776 \| 0.000764102345177283 \| 8.2557070566308 \| \| sigmoid \| torch.float32 \| 16777216 \| 4.93E-05 \| 5.65E-05 \| 14.5314136197766 \| \| sigmoid \| torch.float32 \| 33554432 \| 9.32E-05 \| 9.31E-05 \| -0.120169865610833 \| \| sigmoid \| torch.float32 \| 67108864 \| 0.000181328505277634 \| 0.000180455681402236 \| -0.481349512069855 \| \| sigmoid \| torch.float32 \| 134217728 \| 0.000357362829769651 \| 0.000356093340087682 \| -0.35523831137877 \| \| sigmoid \| torch.float32 \| 268435456 \| 0.000708921831877281 \| 0.000707052337626616 \| -0.263709504574663 \| \| sigmoid \| torch.float32 \| 536870912 \| 0.00141358317341656 \| 0.0014090768333214 \| -0.318788464654745 \| \| tanh \| torch.float16 \| 16777216 \| 3.03E-05 \| 3.03E-05 \| -0.0912564658661808 \| \| tanh \| torch.float16 \| 33554432 \| 4.90E-05 \| 5.07E-05 \| 3.46644442974484 \| \| tanh \| torch.float16 \| 67108864 \| 9.30E-05 \| 9.68E-05 \| 3.99871369815531 \| \| tanh \| torch.float16 \| 134217728 \| 0.00018052199933057 \| 0.000188717152923346 \| 4.53969799978138 \| \| tanh \| torch.float16 \| 268435456 \| 0.000355684508879979 \| 0.000373026006855071 \| 4.8755280430115 \| \| tanh \| torch.float16 \| 536870912 \| 0.000706660988119741 \| 0.000740105014604827 \| 
4.73268328765002 \| \| tanh \| torch.bfloat16 \| 16777216 \| 2.99E-05 \| 3.03E-05 \| 1.21049563135981 \| \| tanh \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.48836101041744 \| \| tanh \| torch.bfloat16 \| 67108864 \| 9.28E-05 \| 9.69E-05 \| 4.39944918036626 \| \| tanh \| torch.bfloat16 \| 134217728 \| 0.000180710999605556 \| 0.000189167990659674 \| 4.67984299382829 \| \| tanh \| torch.bfloat16 \| 268435456 \| 0.000356062994493792 \| 0.000372666652159144 \| 4.66312363882606 \| \| tanh \| torch.bfloat16 \| 536870912 \| 0.000707100164921333 \| 0.000740134331863374 \| 4.67178040408393 \| \| tanh \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.64E-05 \| 0.439595755746353 \| \| tanh \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.31E-05 \| 0.00287633090228212 \| \| tanh \| torch.float32 \| 67108864 \| 0.000181465332085888 \| 0.000180895323865116 \| -0.31411411437098 \| \| tanh \| torch.float32 \| 134217728 \| 0.000356963835656643 \| 0.000356073161431899 \| -0.249513854283251 \| \| tanh \| torch.float32 \| 268435456 \| 0.000709201170442005 \| 0.00070707315656667 \| -0.300057862849997 \| \| tanh \| torch.float32 \| 536870912 \| 0.00141367283261692 \| 0.00141030051357423 \| -0.238550176877922 \| \| gelu \| torch.float16 \| 16777216 \| 2.73E-05 \| 3.17E-05 \| 15.921079070745 \| \| gelu \| torch.float16 \| 33554432 \| 5.06E-05 \| 5.55E-05 \| 9.76345374333098 \| \| gelu \| torch.float16 \| 67108864 \| 9.65E-05 \| 0.000106600326641152 \| 10.4308039074712 \| \| gelu \| torch.float16 \| 134217728 \| 0.000187776672343413 \| 0.000208565829476962 \| 11.0712139447915 \| \| gelu \| torch.float16 \| 268435456 \| 0.000370216167842348 \| 0.000412251994324227 \| 11.3544005187205 \| \| gelu \| torch.float16 \| 536870912 \| 0.000737301345604161 \| 0.000819394170927505 \| 11.1342296895002 \| \| gelu \| torch.bfloat16 \| 16777216 \| 3.02E-05 \| 3.08E-05 \| 1.78405479367653 \| \| gelu \| torch.bfloat16 \| 33554432 \| 5.13E-05 \| 5.69E-05 \| 10.9929393318302 \| \| gelu \| 
torch.bfloat16 \| 67108864 \| 9.76E-05 \| 0.00010968199543034 \| 12.3420807512356 \| \| gelu \| torch.bfloat16 \| 134217728 \| 0.000189661824454864 \| 0.000214487663470209 \| 13.0895287371091 \| \| gelu \| torch.bfloat16 \| 268435456 \| 0.000374197009174774 \| 0.000423670164309442 \| 13.2211519391275 \| \| gelu \| torch.bfloat16 \| 536870912 \| 0.000743675006863972 \| 0.000842577001700799 \| 13.299088166737 \| \| gelu \| torch.float32 \| 16777216 \| 5.06E-05 \| 5.04E-05 \| -0.413385894716413 \| \| gelu \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.32E-05 \| 0.134157041722546 \| \| gelu \| torch.float32 \| 67108864 \| 0.000181480175039421 \| 0.000180836669945469 \| -0.354586992112075 \| \| gelu \| torch.float32 \| 134217728 \| 0.000356874331676712 \| 0.000356305002545317 \| -0.159532104402047 \| \| gelu \| torch.float32 \| 268435456 \| 0.000708909006789327 \| 0.000706991491218408 \| -0.270488250615287 \| \| gelu \| torch.float32 \| 536870912 \| 0.00141321367118508 \| 0.00140937082081412 \| -0.271922813181618 \| \| sin \| torch.float16 \| 16777216 \| 3.04E-05 \| 3.11E-05 \| 2.21834939018859 \| \| sin \| torch.float16 \| 33554432 \| 4.85E-05 \| 5.23E-05 \| 7.72165512511596 \| \| sin \| torch.float16 \| 67108864 \| 9.31E-05 \| 9.98E-05 \| 7.24947099480072 \| \| sin \| torch.float16 \| 134217728 \| 0.000180371008658161 \| 0.000194791161144773 \| 7.99471744039613 \| \| sin \| torch.float16 \| 268435456 \| 0.000355454161763191 \| 0.000384903668115536 \| 8.28503630574026 \| \| sin \| torch.float16 \| 536870912 \| 0.000705183832906187 \| 0.000764360166310022 \| 8.39161799270973 \| \| sin \| torch.bfloat16 \| 16777216 \| 3.11E-05 \| 3.10E-05 \| -0.257677954940036 \| \| sin \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.24E-05 \| 7.34808420323539 \| \| sin \| torch.bfloat16 \| 67108864 \| 9.26E-05 \| 0.000100248667877167 \| 8.22347488801205 \| \| sin \| torch.bfloat16 \| 134217728 \| 0.000180674154156198 \| 0.00019567032965521 \| 8.30012215584937 \| \| sin \| torch.bfloat16 
\| 268435456 \| 0.000355360486234228 \| 0.000386023331278314 \| 8.62865913118873 \| \| sin \| torch.bfloat16 \| 536870912 \| 0.00070483615854755 \| 0.000766805159704139 \| 8.79197248964745 \| \| sin \| torch.float32 \| 16777216 \| 5.67E-05 \| 5.64E-05 \| -0.441348534920039 \| \| sin \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.30E-05 \| -0.496458540364117 \| \| sin \| torch.float32 \| 67108864 \| 0.000181706990891447 \| 0.000180556671693921 \| -0.633062708199702 \| \| sin \| torch.float32 \| 134217728 \| 0.000356894995396336 \| 0.000356046327700218 \| -0.237791985616354 \| \| sin \| torch.float32 \| 268435456 \| 0.000708777321657787 \| 0.000707602652255446 \| -0.165731798471427 \| \| sin \| torch.float32 \| 536870912 \| 0.00141263716310884 \| 0.00140912582476934 \| -0.248566187496451 \| \| exp \| torch.float16 \| 16777216 \| 3.00E-05 \| 3.04E-05 \| 1.40099098901014 \| \| exp \| torch.float16 \| 33554432 \| 4.86E-05 \| 5.03E-05 \| 3.44611943643906 \| \| exp \| torch.float16 \| 67108864 \| 9.37E-05 \| 9.55E-05 \| 1.96412400380129 \| \| exp \| torch.float16 \| 134217728 \| 0.000180913504057874 \| 0.000187193179347863 \| 3.47109262113439 \| \| exp \| torch.float16 \| 268435456 \| 0.00035607748820136 \| 0.000369079003576189 \| 3.65131630210701 \| \| exp \| torch.float16 \| 536870912 \| 0.000707551507124056 \| 0.000732363162872692 \| 3.50669251620789 \| \| exp \| torch.bfloat16 \| 16777216 \| 2.98E-05 \| 3.04E-05 \| 1.74345594341654 \| \| exp \| torch.bfloat16 \| 33554432 \| 4.88E-05 \| 5.04E-05 \| 3.40217856534821 \| \| exp \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.62E-05 \| 3.29219958210226 \| \| exp \| torch.bfloat16 \| 134217728 \| 0.000180999826019009 \| 0.000187239318620414 \| 3.44723679499521 \| \| exp \| torch.bfloat16 \| 268435456 \| 0.000355944503098726 \| 0.000369370992605885 \| 3.77207384585864 \| \| exp \| torch.bfloat16 \| 536870912 \| 0.000707135167128096 \| 0.000733066000975668 \| 3.66702648277075 \| \| exp \| torch.float32 \| 16777216 \| 4.89E-05 
\| 5.63E-05 \| 15.1245314346532 \| \| exp \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.31E-05 \| -0.259945454477446 \| \| exp \| torch.float32 \| 67108864 \| 0.000181152504713585 \| 0.000180474346658836 \| -0.374357536939058 \| \| exp \| torch.float32 \| 134217728 \| 0.000356771342922002 \| 0.000355627329554409 \| -0.3206573034212 \| \| exp \| torch.float32 \| 268435456 \| 0.000708404501589636 \| 0.00070713268360123 \| -0.179532736671163 \| \| exp \| torch.float32 \| 536870912 \| 0.00141283582585553 \| 0.00140944866385932 \| -0.23974208002295 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-01-29 13:32:59 +00:00
Ting Lu	354fe48db9	Add magma cuda build 12.8 (#145765 ) https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765 Approved by: https://github.com/malfet	2025-01-29 08:43:38 +00:00
gasoonjia	501c5972f0	[pytorch] raise exception when calling dim order on sparse tensor (#145888 ) This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor. Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888 Approved by: https://github.com/Jack-Khuu	2025-01-29 06:15:44 +00:00
David Berard	2e8c080ab1	[inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583 ) Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`. This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583 Approved by: https://github.com/jansel	2025-01-29 05:46:05 +00:00
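A minimal sketch of the idea in the constexpr-signature commit above: keep the constexpr marker as a flag on the argument object, so the signature dictionary is keyed by the bare name. The field and method names here (`is_constexpr`, `full_name`, `signature_entries`) are assumptions made for illustration; the commit message only states that an `ArgName` object can optionally be marked as constexpr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgName:
    # Illustrative stand-in; the real class may differ.
    name: str
    is_constexpr: bool = False

    def full_name(self) -> str:
        # Text for the kernel's parameter list, where the marker belongs.
        suffix = " : tl.constexpr" if self.is_constexpr else ""
        return f"{self.name}{suffix}"


def signature_entries(args):
    """Build {name: type} entries keyed by the bare argument name."""
    return {arg.name: type_str for arg, type_str in args}


args = [
    (ArgName("in_ptr0"), "*fp32"),
    (ArgName("XBLOCK", is_constexpr=True), "constexpr"),
]
```

With this split, `signature_entries(args)` yields the key `"XBLOCK"`, while `full_name()` still produces `"XBLOCK : tl.constexpr"` for the generated kernel definition.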
Animesh Jain	3f77002b96	[dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875, #145878	2025-01-29 05:30:06 +00:00
Animesh Jain	236793684d	[dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875	2025-01-29 05:30:06 +00:00
Animesh Jain	a479656cd2	[dynamo][builtin-skipfiles-removal] Remove logging (#145875 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875 Approved by: https://github.com/williamwen42 ghstack dependencies: #145856	2025-01-29 05:29:58 +00:00
Animesh Jain	64ee57847b	[dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856 ) [dynamo][builtin-skipfiles-cleanup] Remove more builtins Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856 Approved by: https://github.com/zou3519	2025-01-29 05:29:47 +00:00
Aaron Orenstein	7178b827d7	PEP585: Missed conversions (#145342 ) Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342 Approved by: https://github.com/bobrenjc93	2025-01-29 05:24:36 +00:00
bobrenjc93	8696e59ae2	add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821 ) Followup from https://github.com/pytorch/pytorch/issues/130290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821 Approved by: https://github.com/eellison, https://github.com/ezyang	2025-01-29 04:36:32 +00:00
Justin Chu	776bdb962c	[ONNX] Support subgraphs with 1+ outputs (#145860 ) Fixed a bug in _handle_output_node where additional output values were not added as graph outputs Fixes #145734 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860 Approved by: https://github.com/titaiwangms	2025-01-29 04:13:23 +00:00
cyy	fd515e4f59	Fix C++20 Wambiguous-reversed-operator warnings (#144126 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126 Approved by: https://github.com/albanD	2025-01-29 03:13:57 +00:00
Simon Mahns	90a6db4a9c	[be][pytorch] Fix backend in autocast (#145859 ) Summary: fixing backend typo (BAKCNEDS -> BACKENDS) Test Plan: ci Differential Revision: D68573324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859 Approved by: https://github.com/jvandebon	2025-01-29 03:13:08 +00:00
Mwiza Kunda	9be2e88d41	Fix lowering to inductor IR for triton CPU (#144389 ) Example failing test: `pytest -s test_torchinductor_opinfo.py -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU. Failure: ```shell triton.compiler.errors.CompilationError: at 10:11: def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 25 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = 1.0 tl.static_assert(tmp1.dtype == tl.float32) tmp2 = ops.polygamma(tmp1, tmp0) ^ NameError('ops is not defined') ``` This occurs because the registered triton fallbacks are not used during the lowering to inductor IR. Marked the problematic code in the excerpt below from `6bc17b0725/torch/_inductor/lowering.py (L572)` ```python def make_pointwise( fn, override_return_dtype=None, override_device=None, override_fn_when_input_bool=None, override_fn_when_gpu_float64=None, allow_alpha=False, triton_fallback=None, ): def inner(inputs: TensorBox, alpha=None): if triton_fallback is not None and any( isinstance(inp, IRNode) and is_triton(inp) for inp in inputs <--- is_triton should return True when using triton CPU ): assert not allow_alpha # not implemented return triton_fallback(inputs) inputs = promote_constants(inputs, override_return_dtype) if allow_alpha: if alpha is not None and alpha != 1: inputs = list(inputs) ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389 Approved by: https://github.com/jansel	2025-01-29 03:10:53 +00:00
Colin Peppler	50f834f134	[export] allow bit shift builtin ops (#145802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802 Approved by: https://github.com/pianpwk	2025-01-29 03:05:48 +00:00
Ting Lu	f4ca98950e	Add CUDA 12.8 libtorch image (#145789 ) https://github.com/pytorch/pytorch/issues/145570 Builds 12.8 libtorch docker/deprecate 12.1 meanwhile Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-01-29 02:59:37 +00:00
Sam Larsen	9330b6d098	Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 ) Summary: Test Plan: Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829 Approved by: https://github.com/Chillee	2025-01-29 02:52:55 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
Menglu Yu	29521256e1	[Customized Optimus][Inductor] Add split cat pattern in aten level (#145721 ) Summary: Thanks Microve for discovering that recGPT has some repeated similar kernels that might be optimized through optimus. After investigation, I designed a pattern in the aten level to remove such excessive kernels. trace: https://fburl.com/perfdoctor/82fauil7 tlparse: https://fburl.com/98q6tadx Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567 Network: Up: 341KiB Down: 359KiB (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d) Executing actions. Remaining 0/3 Command: test. Finished 2 local Time elapsed: 3:04.8s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # local run ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115 ``` https://www.internalfb.com/mlhub/pipeline/1630903954173593 # E2E ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18] ``` ### how to add the config Add the following patterns to the dynamo config ``` post_grad_fusion_options: { "normalization_aten_pass": {}, "split_cat_aten_pass": {}, } ``` {F1974700331} baseline: aps-recgpt_ranking_1115_pt2_5-8cb4905c7d {F1974700216} proposal: Differential Revision: D68695717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721 
Approved by: https://github.com/Yuzhen11	2025-01-29 01:59:06 +00:00
Natalia Gimelshein	331f49057d	Removes threadfence from topk kernel to improve AMD performance (#145536 ) Also marginally improves cuda perf Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536 Approved by: https://github.com/eqy	2025-01-29 01:29:15 +00:00
wz337	6f5c8fb128	[DTensor] Add pointwise ops strategy for `aten.minimum` (#145816 ) Need it for Shampoo optimizer. `9c5700ad5e/matrix_functions.py (L240-L242)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816 Approved by: https://github.com/XilunWu	2025-01-29 01:19:01 +00:00
Pian Pawakapan	15e37e4253	[export] don't always print GM in serdes logging (#145857 ) Summary: Didn't realize print_readable() would also print and not just return string Test Plan: . Differential Revision: D68781525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857 Approved by: https://github.com/angelayi, https://github.com/yiming0416	2025-01-29 01:03:02 +00:00
fan.mo	a24b25942a	Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 ) Fixes #140092 Here's what this PR does: Case 1: no `eps` is passed to the Python frontend: use the `eps` associated with `opmath_t` instead of the `eps` associated with `scalar_t` for intermediate computation. Case 2: `eps` is passed to the Python frontend: avoid downcasting `eps` to `scalar_t` and then implicitly upcasting it again in the `rqrst_input` computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848 Approved by: https://github.com/albanD	2025-01-29 01:01:44 +00:00
Bert Maher	ae0f305bf9	[inductor] Make triton kernel autotune config defaults backward-compatible (#145494 ) If a model was torch.packaged using triton<=3.1, any user-defined autotuned kernels will have reps/warmups burned in with the old defaults (100/25). If this model is loaded with triton>=3.2, inductor's checks for unsupported non-default autotune args will fail, because triton.Autotuner's defaults for these parameters have changed to `None`. Let's explicitly support those values for backward compatibility with these older models. Differential Revision: [D68561014](https://our.internmc.facebook.com/intern/diff/D68561014/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145494 Approved by: https://github.com/aorenste	2025-01-29 00:31:39 +00:00
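The backward-compatibility check described in the autotune commit above might look roughly like the following. This is a hedged sketch, not inductor's code: the function name, the dictionary layout, and the key names `rep` and `warmup` are assumptions; only the old default values (100/25) and the new default (`None`) come from the commit message.

```python
# Old defaults quoted in the commit message (reps/warmups = 100/25).
# The key names are assumed for illustration.
LEGACY_DEFAULTS = {"rep": 100, "warmup": 25}


def unsupported_autotune_args(kwargs):
    """Return the autotune args set to something other than a known default.

    A value counts as a default if it is None (the newer default) or
    equal to the legacy default burned into older packaged models.
    """
    unsupported = []
    for name, legacy_value in LEGACY_DEFAULTS.items():
        value = kwargs.get(name)
        if value is not None and value != legacy_value:
            unsupported.append(name)
    return unsupported
```

Accepting both the legacy values and `None` means a kernel packaged under the old defaults is treated the same as one that never set these arguments.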
Mwiza Kunda	9036a22c83	[Inductor][Triton] Change propagated dtype for fp16/bf16 unwrapped 0d tensors (#145613 ) Fixes TestInductorOpInfoCPU.test_comprehensive_max_binary_cpu_float16 and related tests for Triton CPU. TestInductorOpInfoCPU is currently not run in the CI. See https://github.com/pytorch/pytorch/pull/144389#issuecomment-2608050755 for some additional context. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145613 Approved by: https://github.com/davidberard98, https://github.com/eellison, https://github.com/jansel	2025-01-29 00:23:44 +00:00
Aaron Orenstein	2f24f2eb46	Make sure to evaluate annotation strings in the context of where the prototype was created (#145667 ) This was incorrectly evaluating the annotation in the context of infer_schema - make sure to evaluate annotation strings in the context of where the prototype was created instead. Fixes #145481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145667 Approved by: https://github.com/zou3519	2025-01-29 00:14:45 +00:00
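To illustrate why the evaluation context matters in the annotation commit above, here is a small standalone sketch. It assumes the string annotation is resolved against the globals of the function that carries it; `resolve_annotation` and `user_namespace` are hypothetical names, and this is not the code used by `infer_schema`.

```python
def resolve_annotation(fn, param):
    """Evaluate a string annotation in the namespace where fn was defined."""
    annotation = fn.__annotations__[param]
    if isinstance(annotation, str):
        # fn.__globals__ is the defining module's namespace, so names that
        # exist only there still resolve.
        return eval(annotation, fn.__globals__)
    return annotation


# Simulate a user module whose namespace defines an alias that the
# namespace doing the inference knows nothing about.
user_namespace = {}
exec(
    "MyInt = int\n"
    "def prototype(x: 'MyInt') -> 'MyInt':\n"
    "    return x\n",
    user_namespace,
)
prototype = user_namespace["prototype"]
```

Evaluating `'MyInt'` in the inferring code's own namespace would raise `NameError`, whereas `resolve_annotation(prototype, "x")` finds the alias in the prototype's defining namespace.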
Thomas Bohnstingl	82859f6185	[associative_scan] scan dim handling in user-facing associative_scan() (#139864 ) This PR implements the user-facing dim change, i.e., that the scan dim provided by the user is always moved to dim 0 and then the associative_scan operation always operates on dim 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139864 Approved by: https://github.com/ydwu4	2025-01-28 23:58:10 +00:00
Brian Hirsh	7ca156f0ee	partitioner: avoid inserting duplicates into heap (#145082 ) Fixes https://github.com/pytorch/pytorch/issues/145081 This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There's some code in the partitioner that iteratively adds users of a node to a heap, and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, then this can result in: (1) a node getting added to the heap many times (2) each time we pop that node, we add (duplicates of) each of that node's users to the heap (3) repeat with each user Pull Request resolved: https://github.com/pytorch/pytorch/pull/145082 Approved by: https://github.com/xmfan	2025-01-28 23:44:45 +00:00
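The duplicate-insertion problem described in the partitioner commit above can be sketched with a generic heap traversal. This is an illustration of the general technique (tracking queued nodes in a set), not the partitioner's actual code; `visit_in_order`, `users`, and `order` are hypothetical names.

```python
import heapq


def visit_in_order(start, users, order):
    """Pop nodes earliest-first, pushing each node onto the heap at most once.

    `users` maps a node to the nodes that consume it; `order` maps a node
    to its position in the graph. Returns the visit order and push count.
    """
    heap = [(order[start], start)]
    seen = {start}  # nodes that have already been queued
    visited = []
    pushes = 1
    while heap:
        _, node = heapq.heappop(heap)
        visited.append(node)
        for user in users.get(node, ()):
            if user not in seen:  # skip users that are already queued
                seen.add(user)
                heapq.heappush(heap, (order[user], user))
                pushes += 1
    return visited, pushes


# Two parallel chains that both feed a shared op "c".
users = {"a": ["b1", "b2"], "b1": ["c"], "b2": ["c"], "c": ["d"]}
order = {"a": 0, "b1": 1, "b2": 2, "c": 3, "d": 4}
```

Without the `seen` set, `"c"` would be pushed once per incoming chain and each pop would re-push `"d"`, which is how the number of heap operations can grow much faster than the number of nodes.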
albanD	02dd7a7803	Extend abi-stable nitpick message to all the c stable files (#145862 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145862 Approved by: https://github.com/ezyang	2025-01-28 23:22:23 +00:00
Nikita Shulga	049f042e52	Update build_wheel.sh	2025-01-28 15:14:41 -08:00
Nikita Shulga	eea7d395e5	[CD] Install ninja and setuptools from PyPI (#145871 ) Rather than Conda Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871 Approved by: https://github.com/Skylion007 ghstack dependencies: #145870	2025-01-28 23:09:38 +00:00
Nikita Shulga	c26bb9ba5b	[CMake] Find HomeBrew OpenMP on MacOS (#145870 ) Either via `OMP_PREFIX` envvar or just searching in that folder Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870 Approved by: https://github.com/Skylion007	2025-01-28 23:09:37 +00:00
Aaron Gokaslan	f388ba5986	Update CUDNN frontend submodule to 1.10.0 (#145780 ) Update to CUDNN 1.10. Most of this release is about supporting some new APIs needed for Blackwell integration and new features in the corresponding CUDNN version Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet	2025-01-28 22:54:24 +00:00
Justin Chu	af43b445a5	[ONNX] Set USE_EXPERIMENTAL_LOGIC to True (#137296 ) This sets dynamo_export to use the new export logic. The legacy dynamo export logic will be removed as a follow up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137296 Approved by: https://github.com/titaiwangms	2025-01-28 22:35:11 +00:00
Benjamin Glass	5aa5a5763e	[inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684 ) Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by: 1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format. 2. Using that function to explicitly disable TF32 generation when calling Triton, where needed. To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684 Approved by: https://github.com/eqy	2025-01-28 22:01:08 +00:00
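The two-part workaround in the TF32 commit above can be sketched as a capability check followed by a guard. The helper names below are hypothetical and are not the function the PR adds to `torch.cuda`; the only facts relied on are that the commit gates TF32 on CUDA compute capability 8 or higher and that a capability is a `(major, minor)` pair.

```python
def supports_tf32(capability):
    """Return True if a (major, minor) compute capability can use TF32."""
    major, _minor = capability
    return major >= 8


def triton_allow_tf32(user_allows_tf32, capability):
    """Enable TF32 code generation only when the hardware supports it."""
    return user_allows_tf32 and supports_tf32(capability)
```

For example, a user setting that allows TF32 is overridden on a capability `(7, 5)` device, so Triton is never asked to generate TF32 code there.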
Colin Peppler	1ffed44b42	[aotinductor] update unbacked symint runtime assertion msg (#145569 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145569 Approved by: https://github.com/chenyang78	2025-01-28 21:42:58 +00:00
Dan Zimmerman	a06a18b1bb	[ATen] Implement exception handling for hipsolver APIs (#145839 ) Summary: TSA Test Plan: CI Differential Revision: D68741194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145839 Approved by: https://github.com/Mellonta	2025-01-28 21:37:23 +00:00
Zheng, Zhaoqiong	9003d81144	change the test wheel to release wheel when release wheel available (#145252 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252 Approved by: https://github.com/seemethere, https://github.com/atalman Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-28 21:23:53 +00:00
fduwjj	4f949f282d	[c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855 ) While working on PGNCCL I found that the code triggers some lint warnings so this PR is to address them or add lint suppressor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501	2025-01-28 21:19:49 +00:00
Wei Wang	6bcb545d9c	[CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793 ) Make ci cusparselt installation be consistent with nightly binary Remove cu121 related docker build jobs and inductor runs Update test failures relating to cu121 Retry of https://github.com/pytorch/pytorch/pull/145696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793 Approved by: https://github.com/eqy, https://github.com/tinglvv	2025-01-28 21:01:58 +00:00
Isuru Fernando	ccc2878c97	Fix fractional_max_pool lowering in inductor (#144395 ) Fixes https://github.com/pytorch/pytorch/issues/141538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-28 21:00:18 +00:00
cyyever	ef28df5c9e	[Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593 ) Reland of #137843 , after checking the code again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-01-28 20:51:49 +00:00
PyTorch MergeBot	3481c2aec4	Revert "[dynamo] save/restore system random state more carefully (#145750 )" This reverts commit e3d3f2b22e4b75c64eaa2f940a2dd80c1e43435c. Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))	2025-01-28 20:51:07 +00:00
Paul Saab	28982ceb3b	[aarch64] Rebuild everything with ArmPL (#145742 ) Summary: Rebuild everything that used OpenBLAS with ArmPL Test Plan: CI, prod test Reviewed By: Nicoshev Differential Revision: D68219559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145742 Approved by: https://github.com/malfet	2025-01-28 20:48:42 +00:00
Gabriel Ferns	edf266e9bb	inductor.config.descriptive_names = False is not actually supported (#145523 ) Summary: This config is not supported (it throws an error when set), and doesn't really make sense imo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145523 Approved by: https://github.com/eellison	2025-01-28 20:22:23 +00:00
Jane Xu	515e55e692	Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764 ) This could be BC breaking, because there was a period of time when we use py_limited_api=True but don't enforce the flag, and now that we will start enforcing the flag, people's custom extensions may fail to build. This is strictly still better behavior, as it is sketchy to claim CPython agnosticism without the flag, but calling this out as potential people yelling at us. Ways to mitigate this risk + reasons this may not be too big a deal: - People haven't known about py_limited_api for extensions much due to lack of docs from python so usage is low right now - My current tutorial is in store to make new users of py_limited_api pass this flag, so it'd be a noop for them. Test plan: * Locally i'm confident as I tried rebuilding ao with this change and it reliably failed (cuz importing torch/extension.h is a nono) * Unit test wise, the normal python_agnostic one I added should work Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764 Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD	2025-01-28 20:11:05 +00:00
Nikita Shulga	8d91bfd965	[BE] Include CheckFunctionExists in `FindBLAS.cmake` (#145849 ) It's used in the script, so it must be included Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849 Approved by: https://github.com/Skylion007	2025-01-28 19:47:05 +00:00
Ryan Guo	eaff13275e	[dynamo] Properly branch on an unspecialized NN module (#145786 ) User-defined NN modules might have their own `__len__` or `__bool__` methods which Dynamo needs to trace through, so that side effects and/or reads to buffered writes are properly handled. This patch removes the special `UnspecializedNNModuleVariable` branch in Dynamo's branch handling, and lets these cases fall into the `UserDefinedObjectVariable` branch, which handles the aforementioned cases correctly. Fixes #145284. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145786 Approved by: https://github.com/williamwen42	2025-01-28 19:45:17 +00:00
James Wu	d9ffa5da65	Log info for AOTAutogradCache bypasses instead of warning (#145768 ) Fixes #145767 FxGraphCache also logs to info instead of warning, so let's do that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145768 Approved by: https://github.com/eellison, https://github.com/bdhirsh	2025-01-28 19:25:36 +00:00
Camyll Harajli	6c09954a9e	Windows builds with VS2022 (#145319 ) [Fixes #ISSUE_NUMBER ](https://github.com/pytorch/pytorch/issues/128835) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145319 Approved by: https://github.com/huydhn	2025-01-28 19:07:24 +00:00
Pian Pawakapan	cbc4094298	[draft_export] add LOC for data-dep error logging (#145443 ) Summary: maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like: ``` 2. Data dependent error. When exporting, we were unable to evaluate the value of `u306`. This occurred at the following stacktrace: File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl: `if offsets[-1] > len(input):` As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph. Please add `torch._check(...)` to the original code to assert this data-dependent assumption. Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details. ``` This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing Test Plan: . Differential Revision: D68534017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443 Approved by: https://github.com/angelayi	2025-01-28 18:55:16 +00:00
Xinya Zhang	c32bafeb0b	[ROCm] Bump AOTriton to 0.8.2b (#145508 ) We received reports that AOTriton kernels mishandle the bias pointer, which causes NaNs when fine-tuning the llama3.2-11b vision model. This PR fixes the problem. Note: this AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases, but it is considered experimental and will not be enabled right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508 Approved by: https://github.com/jeffdaily	2025-01-28 18:34:25 +00:00
eellison	621604ce46	Maintain multiple configs (#145103 ) Previously, we would finalize the config of a triton template after its first fusion. This maintains multiple configs, in case we epilogue fuse, then prologue fuse, and prologue fusion has a new, better config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145103 Approved by: https://github.com/jansel, https://github.com/shunting314 ghstack dependencies: #143408	2025-01-28 18:32:14 +00:00
Ryan Guo	eaec97ab1f	[dynamo] Properly prune dead input cell object (#145781 ) This patch models input cell object as "newly created" rather than "pre-existing" python object (see added documentation for why this actually captures the semantics more accurately). This enables the `SideEffects.prune_dead_object_new` algorithm to prune away writes to input cell objects which are no longer relevant; this didn't happen prior to this patch because we modelled them as pre-existing objects, which forces us to codegen their attribute mutations. Fixes #145564. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145781 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-28 18:28:13 +00:00
eellison	8e258e2ecd	Parallelize epilogue/prologue benchmarking (#143408 ) When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked. This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node. Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel). Compilation speedups: <img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" /> This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408 Approved by: https://github.com/jansel, https://github.com/shunting314	2025-01-28 18:18:24 +00:00
Nikita Shulga	3fd4691908	[MPS] Add `op_math_t` (#145808 ) Similar to `at::opmath_t` to be used for reduction (and int mms) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808 Approved by: https://github.com/dcci	2025-01-28 18:03:52 +00:00
atalman	5382ab57d7	Move trunk windows builds to CUDA-12.4 (#145844 ) Same as : https://github.com/pytorch/pytorch/pull/130446 That should catch build regressions that were previously only detectable during the nightly builds for 12.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145844 Approved by: https://github.com/janeyx99, https://github.com/malfet	2025-01-28 18:00:51 +00:00
Huy Do	56915b093a	Fix environment deployment spam (#145823 ) With https://github.com/pytorch-labs/pytorch-gha-infra/pull/598 in place, the environment can now be removed. Fixes https://github.com/pytorch/pytorch/issues/145704 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145823 Approved by: https://github.com/clee2000	2025-01-28 17:46:31 +00:00
PyTorch MergeBot	cfbb27462e	Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373 )" This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3. Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))	2025-01-28 17:46:11 +00:00
PyTorch MergeBot	dbef2a9bc9	Revert "Remove lexicographical sorting of storage keys in torch.save (#143879 )" This reverts commit 7db0afabaaff17dd37cf846cd786610ebf6aedd3. Reverted https://github.com/pytorch/pytorch/pull/143879 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68746524 for details ([comment](https://github.com/pytorch/pytorch/pull/143879#issuecomment-2619661492))	2025-01-28 17:40:16 +00:00
Zain Rizvi	097ccd9c39	Move ROCm MI300 jobs to unstable to make CI green (#145790 ) This is a temporary change to reduce intermittent tests failures. Jobs can be moved back once those machines get better runner isolation. This also sneaks in a small fix to all the rocm job's build step to be run on Linux Foundation runners (the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790 Approved by: https://github.com/yangw-dev	2025-01-28 17:25:15 +00:00
saienduri	7eb51e5464	Ensure GPU isolation for kubernetes pod MI300 runners. (#145829 ) Fixes the reason behind moving the tests to unstable initially. (https://github.com/pytorch/pytorch/pull/145790) We ensure gpu isolation for each pod within kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in pytorch here. Now we stick with the GPUs assigned to the pod in the first place and there is no overlap between the test runners. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829 Approved by: https://github.com/jeffdaily	2025-01-28 17:20:46 +00:00
cyy	c751541e79	Fix cppcoreguidelines-init-variables ignorance (#141795 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795 Approved by: https://github.com/albanD	2025-01-28 17:11:37 +00:00
Mu-Chu Lee	ac87388e61	[AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639 ) Summary: Remove #ifdef USE_CUDA macros through some refactor Test Plan: Refactor code, existing tests. Differential Revision: D68636743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639 Approved by: https://github.com/desertfire	2025-01-28 16:46:00 +00:00
Dmitry Nikolaev	6967ef1b07	[ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227 ) gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that size is required only for MI300. Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300. Now `default_workspace_size=32MB` is used for gfx12 and the test passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2025-01-28 16:19:27 +00:00
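The architecture check described in the commit above can be sketched in plain Python. This is an illustration only, not the test's code: `default_workspace_size_mb` is a hypothetical helper, and the two architecture strings are assumed examples of an MI300 name and a gfx12 name.

```python
def default_workspace_size_mb(gcn_arch: str) -> int:
    """Hypothetical helper: pick the workspace size from the GCN arch name."""
    # Only MI300-class devices (arch names containing "gfx94") get 128 MB.
    return 128 if "gfx94" in gcn_arch else 32


# Assumed example arch strings, used here only for illustration.
mi300_size = default_workspace_size_mb("gfx942:sramecc+:xnack-")
gfx12_size = default_workspace_size_mb("gfx1201")
print(mi300_size, gfx12_size)
```

Matching on the architecture name avoids treating every device that reports a capability of at least (9, 4) as an MI300.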
Animesh Jain	80a0412b76	[dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753, #145826	2025-01-28 16:14:34 +00:00
Animesh Jain	6824a4a75d	[dynamo][builtin-skipfiles-cleanup] Remove re (#145826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753	2025-01-28 16:14:34 +00:00
Animesh Jain	4307e6c008	[dynamo][builtin-skipfile-cleanup] Remove signal (#145753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753 Approved by: https://github.com/zou3519 ghstack dependencies: #145744	2025-01-28 16:14:23 +00:00
eellison	3a56089217	fix unbacked + view incorrectness (#145548 ) fix for https://github.com/pytorch/pytorch/issues/143498 We were incorrectly using contiguous strides for a non-contiguous tensor. There are two separate causes: 1. https://github.com/pytorch/pytorch/pull/110520 made it so we turn Views contiguous with unbacked symints because `dynamic_reshape_indexer below will fail due to the size_hint's inability to process unbacked SymInts`. Seems like we should fix. Regardless - it will make the input contiguous if input is unbacked to workaround this. 2. We weren't actually making it contiguous! I filed an issue for this here: https://github.com/pytorch/pytorch/issues/145561. This is still worth landing as a fix, even though we should fix those issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145548 Approved by: https://github.com/desertfire	2025-01-28 16:03:45 +00:00
cyyever	97b3b73f3e	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-01-28 15:21:12 +00:00
Zhenbin Lin	a08f7f3266	OpenReg: fix issue of pin_memory (#145046 ) Fix issue of `pin_memory` when rewrapping a storage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145046 Approved by: https://github.com/albanD	2025-01-28 09:41:04 +00:00
Chirag Pandya	bdf6dfa17d	[chore][ez] change alloc buffer size from 4000 to 4096 (#145759 ) Summary: Allocations typically happen as a power of 2 anyway. Change the default alloc size to 4096 to eke out a bit more perf. Test: unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756, #145757	2025-01-28 09:14:07 +00:00
Animesh Jain	5c5306e8bc	[dynamo][builtin-skiplist-cleanup] Remove weakref (#145744 ) WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744 Approved by: https://github.com/jansel	2025-01-28 07:55:12 +00:00
Avik Chaudhuri	45f64e770a	relax assertion to warning for unbacked binding names (#145777 ) Summary: Quick fix following up on https://github.com/pytorch/pytorch/pull/144894 to unblock internal tests. Will keep investigating a more principled fix. Test Plan: Failures in T213563826 now pass Differential Revision: D68731710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145777 Approved by: https://github.com/angelayi	2025-01-28 07:52:40 +00:00
Michael Graczyk	0a8a0ef767	[inductor] Fix crash running wrapper_benchmark with no device (#145644 ) Fixes #145434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145644 Approved by: https://github.com/shunting314	2025-01-28 07:31:36 +00:00
eellison	a699034eec	Record inputs at time of tracing, constrain to them for triton fn (#145448 ) Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context. We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448 Approved by: https://github.com/zou3519	2025-01-28 07:07:14 +00:00
Nikita Shulga	0f5a68344a	[BE][Inductor] Simplify `custom_op` tests (#145814 ) Not sure what the motivation was behind repeating the same function over and over again for different backends. Change `test_custom_op_[123]` from accepting separate (but identical) implementations for CPU, CUDA and XPU, to take just `fn` and `fn_meta` args. Test that it is also extendable to MPS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145814 Approved by: https://github.com/jansel	2025-01-28 05:58:51 +00:00
cyyever	23eb0a3201	Improve typing in torch/types.py (#145237 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2025-01-28 05:29:12 +00:00
Aaron Gokaslan	8e46d0f595	[BE]: Update typing of OrderedSet ancestor (#145783 ) Now that we are on python 3.9 minimum version we can properly use Generics in the superclass Pull Request resolved: https://github.com/pytorch/pytorch/pull/145783 Approved by: https://github.com/eellison	2025-01-28 04:43:49 +00:00
cyy	67fcc7cf02	[3/N] Remove unnecessary once flag usage (#145672 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672 Approved by: https://github.com/albanD	2025-01-28 04:28:18 +00:00
Burak Turk	01a4d86b31	add pt2 callbacks for backward pass and prevent duplicate callbacks (#145732 ) Summary: This change adds callbacks for lazy backwards compilation while preventing duplicate callbacks from being fired. Differential Revision: D68577593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145732 Approved by: https://github.com/mlazos	2025-01-28 03:50:02 +00:00
Pian Pawakapan	1a26cdd5cb	[cond] remove warning for unsupported tuple returns (#145766 ) I guess this is supported now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145766 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-01-28 03:13:36 +00:00
PyTorch MergeBot	9010649292	Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )" This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53. Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))	2025-01-28 03:07:17 +00:00
Chirag Pandya	78f02bf07c	[bug] handle case when remote peer closes connection (#145757 ) Summary: In the case where the remote peer closes the connection, nread returns 0. In this case, we still want to free up the allocated buffer. Also, reorder the if so that the likely success case (nread > 0) is at the top of the function with an early return. Test Plan: unit tests Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756	2025-01-28 03:06:38 +00:00
Pian Pawakapan	4be831ba2d	[draft_export] fix dense-in-memory check for inferring fakes (#145653 ) Test Plan: fixes check for dense tensors with size-1 dimensions Differential Revision: D68644028 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145653 Approved by: https://github.com/zou3519	2025-01-28 02:52:14 +00:00
James Wu	7c1fc0a047	Log cache state for AOTAutograd in title of file (#145715 ) Differential Revision: [D68692755](https://our.internmc.facebook.com/intern/diff/D68692755/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145715 Approved by: https://github.com/bobrenjc93	2025-01-28 02:14:18 +00:00
Jason Ansel	78a94c9114	[inductor] Remove type ignores from scheduler.py (#145712 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145712 Approved by: https://github.com/yanboliang, https://github.com/Skylion007 ghstack dependencies: #145692	2025-01-28 01:44:32 +00:00
Jason Ansel	2df2f9d895	[inductor] Change type of get_backend_features to OrderedSet (#145692 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145692 Approved by: https://github.com/yanboliang	2025-01-28 01:44:32 +00:00
Yifu Wang	db33d23aa8	[SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886 Approved by: https://github.com/awgu	2025-01-28 01:43:37 +00:00
William Wen	e3d3f2b22e	[dynamo] save/restore system random state more carefully (#145750 ) Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up. Note: I have verified that the previously failing internal tests now pass internally. Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750 Approved by: https://github.com/StrongerXi	2025-01-28 01:34:13 +00:00
Gabriel Ferns	f16ce3c7e9	Refactor fuzzer and add support for Dynamo (#145565 ) ## Summary: Dynamo now works with config fuzzer. For BE week, we also found and fixed 5 different bugs (in inductor): - https://github.com/pytorch/pytorch/pull/145426 - https://github.com/pytorch/pytorch/pull/145523 - https://github.com/pytorch/pytorch/pull/145527 - https://github.com/pytorch/pytorch/pull/145532 - https://github.com/pytorch/pytorch/pull/145538 ## Test Plan: New Dynamo Unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145565 Approved by: https://github.com/masnesral	2025-01-28 00:44:27 +00:00
Syed Tousif Ahmed	6eb74fbec6	Updates NCCL user buffer registration test for NCCL 2.24.3 (#145285 ) NCCL 2.24.3 changed the content of the debug output for NVLS registration. We use this debug output in our test suite to check if NVLS was successfully registered or not. Hence we need to specialize for the NCCL version in the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145285 Approved by: https://github.com/kwen2501	2025-01-28 00:24:53 +00:00
Ryan Guo	5a4d959cdb	[dynamo] Properly model torch profiler context objects (#145537 ) Prior to this patch, Dynamo conveniently modelled torch profiler context objects (e.g., `torch.profiler.profile`) as `NullContextVariable` because `torch.compile` ignores the effect of these profiler contexts. However, the semantics of these profiler contexts diverge from `contextlib.nullcontext` in the `__enter__` function, where the former returns `self` and the latter returns `None`. This causes a subtle error, as observed in #125021. This patch adds back a `ProfilerContextVariable`, which addresses the aforementioned semantic discrepancy. Fixes #125021. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145537 Approved by: https://github.com/zou3519, https://github.com/williamwen42	2025-01-28 00:03:36 +00:00
Mikayla Gawarecki	db3685a35c	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 ) ## Background This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine this. The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)` ## Testing strategy The agreed upon testing strategy was as follows: - Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False) - This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested. Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-27 23:57:30 +00:00
Mikayla Gawarecki	7db0afabaa	Remove lexicographical sorting of storage keys in torch.save (#143879 ) Currently the order is lexicographical (i.e. 0, 10, 11, ...19, 2, ....) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted). This makes it such that the order storages are written in is the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads. Differential Revision: [D67673025](https://our.internmc.facebook.com/intern/diff/D67673025) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879 Approved by: https://github.com/albanD	2025-01-27 23:57:30 +00:00
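The `__enter__` difference mentioned in the profiler commit above can be shown without torch. This is a minimal sketch: `SelfReturningContext` is a hypothetical stand-in for a profiler-like context manager, not a PyTorch class.

```python
import contextlib


class SelfReturningContext:
    """Hypothetical context manager whose __enter__ returns self."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions


# nullcontext() with no argument binds None to the `as` target.
with contextlib.nullcontext() as entered:
    null_result = entered

# A self-returning context binds the context manager object itself.
ctx = SelfReturningContext()
with ctx as entered:
    self_result = entered

print(null_result, self_result is ctx)
```

Code that calls methods on the `as` target works with the second form but fails on `None` with the first, which is the kind of discrepancy the commit describes.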
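The ordering difference described in the commit above can be reproduced in plain Python. This is a sketch only: the keys are hypothetical stand-ins for storage keys, and the dict just illustrates that insertion order is preserved (guaranteed since Python 3.7).

```python
# Hypothetical storage keys "0".."11", inserted in pickling order.
storages = {str(i): f"storage_{i}" for i in range(12)}

# Sorting string keys interleaves multi-digit keys with single-digit ones.
lexicographic = sorted(storages)

# Iterating the dict directly yields keys in insertion (here numeric) order.
insertion = list(storages)

print(lexicographic)
print(insertion)
```

Writing storages in insertion order means record N follows record N-1 in the file, which is what makes offset calculation possible without extra reads.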
Colin L. Rice	c1161957a4	inductor_config_logging: Don't drop keys (#144700 ) This bit me while I was trying to debug some trace issues. In general this config is already quite large when dumping, so adding more fields doesn't make it significantly worse. Also a number of the items we are type checking for (except the test configs), don't even show up. Primarily this will help us when debugging rocm, halide, and trace configs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700 Approved by: https://github.com/ezyang	2025-01-27 23:47:25 +00:00
Jane (Yuan) Xu	7d01f6e6f2	Add ignorable commits on run_test.py to git blame ignore (#145787 ) Chanced upon it while searching through cpp_extension related code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145787 Approved by: https://github.com/malfet	2025-01-27 23:24:48 +00:00
Chirag Pandya	3ce68dc61e	[c10d] Flush file in file recorder (#145458 ) Summary: Flushing file to hopefully prevent file corruptions as reported in https://github.com/pytorch/pytorch/pull/145125 Test Plan: Couldn't get file corruption to occur in my tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458 Approved by: https://github.com/kwen2501	2025-01-27 23:15:52 +00:00
Chirag Pandya	5534c270db	[chore] fix new linter (#145756 ) Summary: Fix new linter that's complaining when I made changes to this file: class 'LibUVStoreDaemon' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator Test Plan: make lint passes Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756 Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj	2025-01-27 22:48:12 +00:00
PyTorch MergeBot	2de53b3b65	Revert "pickler for GraphModule (#141659 )" This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5. Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))	2025-01-27 22:39:30 +00:00
Huy Do	006397fac3	Remove FBGEMM sccache hack (#145664 ) Testing https://github.com/pytorch/pytorch/actions/runs/12959358756, sccache is working correctly now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145664 Approved by: https://github.com/wdvr	2025-01-27 22:00:06 +00:00
David Berard	69e82d02d3	[inductor][3/N] triton support post-#5512, tt.divisibility format (#145575 ) 1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}`. This was an oversight in the first PR I added. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR. 2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575 Approved by: https://github.com/SamGinzburg	2025-01-27 21:48:58 +00:00
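The change in hint layout described in item 1 of the commit above can be sketched as a small transformation. `split_grouped_hints` is a hypothetical helper written for illustration; it is not the code in hints.py, which builds the hints directly.

```python
def split_grouped_hints(grouped):
    """Hypothetical helper: expand {(key, key, ...): attrs} into {key: attrs}."""
    per_arg = {}
    for keys, attrs in grouped.items():
        for key in keys:
            # Copy the attribute lists so entries do not share mutable state.
            per_arg[key] = [list(attr) for attr in attrs]
    return per_arg


# Old layout: one entry keyed by a tuple of argument-index tuples.
old_format = {((0,), (1,)): [["tt.divisibility", 16]]}

# New layout: one entry per argument-index tuple.
new_format = split_grouped_hints(old_format)
print(new_format)
```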
Animesh Jain	993b229665	[dynamo][dicts] Fix dict.__new__ bug (#145723 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145723 Approved by: https://github.com/jansel, https://github.com/StrongerXi ghstack dependencies: #145519, #145547, #145558	2025-01-27 21:42:43 +00:00
Animesh Jain	7e1c7253e9	[dynamo][builtin-skipfile-cleanup] Support tuple.__new__ (#145558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145558 Approved by: https://github.com/jansel, https://github.com/StrongerXi ghstack dependencies: #145519, #145547	2025-01-27 21:42:43 +00:00
Joel Schlosser	1ba1b7b597	Support remaining _like factory functions for NJT (#144889 ) Fixes #144761 This PR adds NJT impls for those _like functions that were previously missing: * `full_like()` * `rand_like()` * `randint_like()` It also fixes a bug in existing *_like functions when a new device is specified. Fix is to also transfer `offsets` / `lengths` to the new device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889 Approved by: https://github.com/soulitzer	2025-01-27 21:33:51 +00:00
Nikita Shulga	3a23d75b37	[MPS] Fix `c10::metal::log_gamma` correctness on M4 (#145740 ) To work around a bug where the `abs` method call seems to be ignored before calling log, which could be reproduced by running the following code (submitted as FB16415011 )
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)
  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()
  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>
template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);
  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""

let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise
guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740 Approved by: https://github.com/dcci	2025-01-27 21:24:22 +00:00
Aaron Orenstein	60f98262f1	PEP585: .github (#145707 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145707 Approved by: https://github.com/huydhn	2025-01-27 21:21:01 +00:00
Ryan Guo	bfaf76bfc6	[dynamo] clear out traced frames at the start of `test_log_traced_frames` (#145640 ) The test was being flaky in CI, and this patch fixes it. Fixes #137461. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145640 Approved by: https://github.com/williamwen42	2025-01-27 20:49:59 +00:00
Ting Lu	93dd6bc4d8	Add CUDA 12.8 installation and manylinux-cuda12.8 (#145567 ) Breaking https://github.com/pytorch/pytorch/pull/145557 into two parts. Need to have manylinux-cuda12.8 in order to build magma. Issue: https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145567 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-01-27 20:49:07 +00:00
Randolf Scholz	64cd81712d	`torch.distributions`: replace `numbers.Number` with `torch.types.Number`. (#145086 ) Fixes #144788 (partial) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145086 Approved by: https://github.com/malfet	2025-01-27 20:24:55 +00:00
Huy Do	2f8ad8f4b9	Run inductor perf benchmark on ROCm (#145763 ) This requires https://github.com/pytorch/pytorch/pull/144594. The test run on PT2 dashboard is at https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2020%20Jan%202025%2019%3A46%3A14%20GMT&stopTime=Mon%2C%2027%20Jan%202025%2019%3A46%3A14%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=rocm&lBranch=144594&lCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b&rBranch=144594&rCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b Pull Request resolved: https://github.com/pytorch/pytorch/pull/145763 Approved by: https://github.com/jeffdaily	2025-01-27 20:19:03 +00:00
Ryan Guo	66631bc84b	[dynamo] Fix read/write conflicts in a cuda test (#145658 ) Prior to this patch, the `test_cuda_event_created_outside_of_graph` is flaky in CI, and that's because we have read and write to the same `foo` tensor buffer from 2 different streams. This patch eliminates that by adding a synchronization to wait till read finishes before starting the write. Fixes #133837, #133828. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145658 Approved by: https://github.com/yifuwang	2025-01-27 19:55:57 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
leslie-fang-intel	9728e900dc	[Inductor][CPP] fix torch logit decomposition (#145576 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/145379. The current decomposition uses `self = torch.clamp(self, lo, hi)`, which gives a wrong result when `lo` is larger than `hi` compared to the eager implementation: `cd68d54911/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L165)` This PR aligns their behavior. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_torch_logit ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145576 Approved by: https://github.com/jgong5, https://github.com/eellison	2025-01-27 19:37:51 +00:00
Edward Z. Yang	635b98fa08	Add nitpick warning that aoti_torch/c/shim.h is ABI stable (#145745 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145745 Approved by: https://github.com/albanD	2025-01-27 19:25:37 +00:00
Yanbo Liang	bc377c503e	[Custom Ops] Fix f-strings in custom ops error message (#145673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145673 Approved by: https://github.com/zou3519 ghstack dependencies: #145588	2025-01-27 19:22:43 +00:00
Yanbo Liang	ec91b7720f	[Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588 ) Fixes #137033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588 Approved by: https://github.com/zou3519	2025-01-27 19:22:43 +00:00
Simon Mahns	f951d216e0	[autocast][pytorch] Support autocast for MTIA (policy) (#145666 ) Summary: Add autocast support for MTIA (policy) Reviewed By: egienvalue Differential Revision: D68604796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145666 Approved by: https://github.com/chaos5958	2025-01-27 18:26:04 +00:00
Sam Larsen	1835e1eb98	[BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145307 Approved by: https://github.com/zou3519, https://github.com/FindHao	2025-01-27 18:12:39 +00:00
Randolf Scholz	835e770bad	Use `typing.IO[bytes]` instead of `io.BytesIO` in annotations (#144994 ) Fixes #144976 Using approach ① `IO[bytes]`, but could also try with a protocol. ## Notes: - moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike` - Use `FileLike` annotation where it makes sense - made sure those functions also support `os.PathLike` - Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate. - Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`) - needed to make `torch.serialization._opener` generic to avoid LSP violations. - skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-01-27 18:08:07 +00:00
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
Nikita Shulga	30dea8429d	[MPS][BE] Use convenience methods to set args (#145736 ) It's better to call `mtl_setArgs` rather than set arguments one by one with the risk of making a typo. Also, all interactions with MTLCommandBuffer must be serialized, which is commonly done using dispatch queues Pull Request resolved: https://github.com/pytorch/pytorch/pull/145736 Approved by: https://github.com/Skylion007	2025-01-27 17:42:01 +00:00
Mikayla Gawarecki	7db20ffd68	Remove `public_allowlist` from `TestPublicBindings.test_correct_module_names` and ensure private_allowlist-ed things are actually private (#145620 ) This passes locally, also sanity checked importing these modules on [colab](https://colab.research.google.com/drive/1edynWX1mlQNZIBxtb3g81_ZeTpAqWi19?usp=sharing) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145620 Approved by: https://github.com/albanD	2025-01-27 17:30:02 +00:00
Huy Do	5d01a2874f	Increase the number of perf benchmark shards (#145534 ) Per the discussion on https://github.com/pytorch/pytorch/issues/140332#issuecomment-2610805551, this adds 2 more shards for HF, 2 more for TorchBench, and 1 more for TIMM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145534 Approved by: https://github.com/jeanschmidt	2025-01-27 16:20:42 +00:00
Nikita Shulga	639dd54ef7	[BE] Use copy_method to import all tests (#145718 ) Fewer chances for typos when doing the imports Pull Request resolved: https://github.com/pytorch/pytorch/pull/145718 Approved by: https://github.com/dcci	2025-01-27 16:01:12 +00:00
leslie-fang-intel	2e80093306	setitem node shouldn't be deadcode eliminated (#145714 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/145697. The `operator.setitem` has been eliminated as dead code, causing a correctness issue. Mark it as impure in this PR to avoid this side effect. TestPlan ``` python -u -m pytest -s -v test/fx/test_dce_pass.py -k test_keep_setitem ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145714 Approved by: https://github.com/ezyang	2025-01-27 15:08:21 +00:00
Stefan-Alin Pahontu	0674ab7e33	solve apl dependency issue (#145215 ) According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries. When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls. However, the current APL implementation uses the libomp.dll (LLVM) variant. As a result, there are unexpected behaviors at runtime. --- For Example: ```python import torch # Create a sparse tensor # Input (Sparse Tensor): # [[0, 1], # [1, 0]] indices = torch.tensor([[0, 1], [1, 0]]) values = torch.tensor([1, 1], dtype=torch.float32) size = torch.Size([2, 2]) sparse_tensor = torch.sparse_coo_tensor(indices, values, size) # Convert sparse tensor to dense tensor dense_tensor = sparse_tensor.to_dense() # Expected Output (Dense Tensor): # [[0, 1], # [1, 0]] print("\nDense Tensor:") print(dense_tensor) ``` However, it prints unexpected outputs such as: ```python # [[0, 11], # [10, 0]] ``` The issue arises because the following code does not function as expected at runtime: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30 ```c++ // returns 1 , however since OpenMP is enabled it should return total number of threads int64_t num_threads = omp_get_num_threads(); ``` --- In the runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) is causing unexpected behaviours. So, we've changed libraries from `_mp` to non `_mp` versions and we used `vcomp` for OpenMP calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215 Approved by: https://github.com/ozanMSFT, https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-27 13:02:16 +00:00
PyTorch UpdateBot	7b6029dcc2	Update slow tests (#145206 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145206 Approved by: https://github.com/pytorchbot	2025-01-27 11:40:39 +00:00
H. Vetinari	e6c1e6e20e	simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480 ) While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function. Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480 Approved by: https://github.com/ezyang	2025-01-27 07:19:42 +00:00
Jason Ansel	e90cf4abcf	[inductor] Add some typing to common.py (#145691 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145691 Approved by: https://github.com/malfet ghstack dependencies: #145690	2025-01-27 06:27:13 +00:00
Jason Ansel	ddae87f792	[inductor] Add some typing to simd.py (#145690 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145690 Approved by: https://github.com/malfet	2025-01-27 06:27:13 +00:00
Nikita Shulga	71caac2b30	[MPSInductor] Add rand support (#145705 ) Using Philox4 as PRNG Test plan (other than CI) Run ```python import torch from torch._inductor.utils import run_and_get_code from contextlib import nullcontext def foo(x): return x * torch.randn_like(x) foo_c = torch.compile(foo) x = torch.ones(100, 100, device="mps") y = foo_c(x) print(y.mean().item(), y.std().item()) for i in range(25): print(y[i].mean(), y[i].std()) ``` And observe that printed values are close to 0 and 1 TODO: Better `randint` algorithm for large ranges Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705 Approved by: https://github.com/dcci, https://github.com/jansel	2025-01-27 06:07:36 +00:00
rzou	ea141d8134	functional compiled autograd (#144707 ) This PR squashes together the following commits: https://github.com/pytorch/pytorch/pull/144115 https://github.com/pytorch/pytorch/pull/143417 https://github.com/pytorch/pytorch/pull/143405 https://github.com/pytorch/pytorch/pull/143387 https://github.com/pytorch/pytorch/pull/143304 https://github.com/pytorch/pytorch/pull/143296 This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses. For more information, please read the commit messages for each PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707 Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel	2025-01-27 05:20:56 +00:00
Edward Z. Yang	87fdadde1d	Remove FFT from stride incorrect ops (#145080 ) I gotta say, the FFT implementation is completely insane, there's gotta be a better way to do this than repeatedly inplace restriding the output tensor. Anyway, this is a faithful translation of both the MKL and cuFFT paths to Python. Fixes https://github.com/pytorch/pytorch/issues/135087 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145080 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #145530	2025-01-27 04:26:04 +00:00
Isalia20	b75afa2e2e	[MPS] cholesky implementation (#145701 ) Requested in #77764 Closed #144193 due to a lot of conflicts when rebasing Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701 Approved by: https://github.com/malfet	2025-01-27 01:53:03 +00:00
Aaron Orenstein	c6ad08357b	pickler for GraphModule (#141659 ) Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659 Approved by: https://github.com/jamesjwu	2025-01-26 19:29:13 +00:00
Arash Pakbin	f3ddc08ddc	Additional operators in operator benchmark (#145625 ) The list of added operators: add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625 Approved by: https://github.com/jeffdaily	2025-01-26 19:20:02 +00:00
PyTorch MergeBot	6a4fb4b615	Revert "Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 )" This reverts commit cb814c0b961369a7ab154c58856c730cafaa2307. Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see `5cd2b34e82/1` ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822))	2025-01-26 17:49:05 +00:00
Davide Italiano	5cd2b34e82	[inductor] Adjust test_log_fp64 to only run when float64 is supported. (#145686 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145686 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-26 15:58:19 +00:00
Yichen Yan	ed015143ef	Set RUNPATH on CUDA and XPU tests (#144305 ) #136627 has almost fixed the issue that test binaries' runpath has not been set correctly, with few cases left. This PR fixes the rest. The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305 Approved by: https://github.com/malfet	2025-01-26 08:40:22 +00:00
Aaron Orenstein	c4523999a1	Fix incorrect type comparison (#145449 ) Summary: This change was incorrectly made as part of #145166 Differential Revision: D68536221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145449 Approved by: https://github.com/bobrenjc93	2025-01-26 04:40:26 +00:00
PyTorch MergeBot	09ae69a364	Revert "Fix type annotation of `Linear.bias` (#142326 )" This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a. Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see `73622fc5fa/1` ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))	2025-01-26 03:41:00 +00:00
wengshiy	73622fc5fa	Fix Throughputbenchmark issue (#144669 ) Fixes [144461](https://github.com/pytorch/pytorch/issues/144461) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669 Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel	2025-01-26 03:37:20 +00:00
Wu, Chunyuan	cb814c0b96	Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 ) Fixes https://github.com/pytorch/pytorch/issues/142466. Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case. Test plan: ``` python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32 python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859 Approved by: https://github.com/mingfeima, https://github.com/malfet	2025-01-26 01:56:40 +00:00
Edward Z. Yang	90448f0128	Output of nonzero is transposed, fix fake tensor (#144695 ) Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695 Approved by: https://github.com/bobrenjc93, https://github.com/albanD	2025-01-26 01:07:22 +00:00
Miroslaw Oksiucik	76bec878da	Remove unnecessary HPUHooksInterface method (#145272 ) getDefaultHPUGenerator is no longer necessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/145272 Approved by: https://github.com/ezyang	2025-01-26 01:06:34 +00:00
Nikita Shulga	3cf7874ebe	[MPS][BE] Implement bilinear2d as shader (#145581 ) That significantly improves performance and addresses correctness problems (to the extent permitted by reducing precision of scale factor computation to float32). uint8 scaling algorithm mimics CPU/Pillow implementation `569b785371/src/libImaging/Resample.c (L306-L309)` I.e. using fixed precision integral arithmetic and rounding results of horizontal interpolation back to integers before performing vertical one, which results in technically less accurate results. But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when scale factor is `1/3` or `2/3` because of the difference of representation of those values as floats and doubles. Changes in the performance could be measured using the following script ```python import torch import time import subprocess def benchmark(device, dtype): # Create example inputs x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype) sf = .5 # Check output y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear") z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear") outputs_match = torch.allclose(y.cpu(), z) if not outputs_match: atol = (y.cpu() - z).abs().max() rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max() print(f"atol={atol} rtol={rtol}") # Measure time manually start_time = time.time() * 1000 for _ in range(1000): y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear") torch.mps.synchronize end_time = time.time() * 1000 manual_delta = (end_time - start_time) average_time = f"{manual_delta:6.1f}" return "True " if outputs_match else "False", average_time outputs_match_list = [] average_time_list = [] for device in ["mps", "cpu"]: for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]: outputs_match, average_time = benchmark(device, dtype) outputs_match_list.append(str(outputs_match)) average_time_list.append(average_time) brand_string = 
subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip() print(f"\nBenchmarking Results (collected on {brand_string}):") print("-" * 40) print("Device : MPS | CPU") print("Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8") print(f"Outputs Match : ", " | ".join(outputs_match_list)) print(f"Average Time (us) :", " |".join(average_time_list)) ``` Benchmark results before ``` Benchmarking Results (collected on Apple M4 Pro): ---------------------------------------- Device : MPS | CPU Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8 Outputs Match : True | True | True | False | True | True | True | True Average Time (us) : 277.3 | 197.2 | 188.0 | 163.5 | 302.8 | 248.1 | 308.7 | 650.9 ``` After(almost 100x* perf gain): ``` Benchmarking Results (collected on Apple M4 Pro): ---------------------------------------- Device : MPS | CPU Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8 Outputs Match : True | True | True | True | True | True | True | True Average Time (us) : 1.7 | 1.5 | 1.7 | 1.5 | 296.5 | 236.0 | 310.8 | 642.6 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581 Approved by: https://github.com/Skylion007 ghstack dependencies: #145578	2025-01-25 21:09:46 +00:00
Xuehai Pan	0afdee4c39	[dynamo] raise IndexError when inserting into a full `deque` (#139379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139379 Approved by: https://github.com/jansel	2025-01-25 18:04:49 +00:00
Max Podkorytov	513f889a36	[Rocm][Inductor][CK] silence ck package not installed warning when CK backend is not used to autotune bmm (#145626 ) As titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/145626 Approved by: https://github.com/coconutruben	2025-01-25 08:44:35 +00:00
Simon Fan	c5216d2b6c	[ca] add test_reset for 2.6 release validation (#145549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145549 Approved by: https://github.com/atalman	2025-01-25 06:28:58 +00:00
Sheng Fu	bbe7f53218	Save integral tensor data for ET (#144508 ) Summary: et_replay uses random data to run operators, however, the operators using index tensor to access memory won't work with random data. It usually ran into two exceptions: 1. illegal memory access since index is out of range, it has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE to record the min/max value of index tensors. 2. unaligned memory access, FBGEMM ops have special requirements for the memory layout. To fix the second exception, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added to allow users to specify the node names, separated by comma, so ET will save the integral tensor data for these nodes. The saved data will be used in et_replay. Be careful to turn on this option since it will use more space to save the extra data. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda Differential Revision: D67989856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508 Approved by: https://github.com/briancoutinho	2025-01-25 05:38:10 +00:00
Jason Ansel	3d506491b9	[inductor] Fix duplicate detection in _dynamic_scale_rblock (#145577 ) Before this the code was doing nothing because Config doesn't define `__hash__` or `__eq__` (so it was based on object id). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145577 Approved by: https://github.com/shunting314 ghstack dependencies: #142026	2025-01-25 04:58:54 +00:00
Jason Ansel	9007eb5f8e	[inductor] Kernel memory analysis for use in heuristics (#142026 ) This computes statistics about each kernel's memory usage that should allow us to write more precise heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142026 Approved by: https://github.com/eellison	2025-01-25 04:58:54 +00:00
Yuanhao Ji	cc1ecead07	[Dynamo] Allow `format()` to handle int (#144956 ) Fixes #144830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144956 Approved by: https://github.com/jansel	2025-01-25 04:12:45 +00:00
Joel Schlosser	b2a0feac85	Update OSS nested tensor docs to focus on NJT (#145402 ) Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include: * High-level description of NST vs. NJT + a recommendation to use NJT * General NJT construction / usage * torch.compile() integration w/ dynamic shapes * Common errors and how to fix them * Contribution guide * Data layout / shape information (with diagram) * Links to more extensive tutorials involving Transformers / SDPA / FlexAttention Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402 Approved by: https://github.com/soulitzer	2025-01-25 04:08:19 +00:00
Zhenbin Lin	392dc177a9	OpenReg: Refactor impl_registry (#145465 ) Refactor impl_registry to use `driver.exec` as fallback. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145465 Approved by: https://github.com/albanD	2025-01-25 03:31:49 +00:00
Simon Mahns	6939a56e13	[autocast][pytorch] Support autocast for MTIA (#145627 ) Summary: Add autocast support to MTIA Reviewed By: egienvalue Differential Revision: D68572548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627 Approved by: https://github.com/egienvalue	2025-01-25 03:24:59 +00:00
Animesh Jain	ef60de07a0	[dynamo] Log guard latency (#145132 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132 Approved by: https://github.com/ezyang ghstack dependencies: #145509	2025-01-25 03:01:18 +00:00
Avik Chaudhuri	42b8e233d9	serde unbacked bindings (#144894 ) Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata. Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) doesn't do so, and not doing so makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names. This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because of running AOTAutograd as an additional step after Dynamo. So we have to sync those back. Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894 Approved by: https://github.com/pianpwk	2025-01-25 02:34:27 +00:00
soulitzer	5725462cd8	Update NJT linear_backward to return non-aliased tensor bias grad (#145399 ) Fixes https://github.com/pytorch/pytorch/issues/141292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145399 Approved by: https://github.com/jbschlosser ghstack dependencies: #145520, #145531, #145533	2025-01-25 00:58:04 +00:00
soulitzer	3a3e2cf90a	Remove det_singular OpInfo (#145533 ) Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044 From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this. Some older attempts: - https://github.com/pytorch/pytorch/pull/102581 - https://github.com/pytorch/pytorch/pull/109249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533 Approved by: https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #145520, #145531	2025-01-25 00:58:03 +00:00
soulitzer	c7ca1df37e	Disable slow gradcheck for nn.Transformer ModuleInfo (#145531 ) Fixes https://github.com/pytorch/pytorch/issues/117140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145531 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #145520	2025-01-25 00:58:03 +00:00
soulitzer	9e0ee152e5	Fix allow_mutation_on_saved_tensors for inplace foreach (#145520 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145520 Approved by: https://github.com/albanD	2025-01-25 00:58:03 +00:00
clr	b4fe3c159d	inductor: Explicitly test that torch.compile(option=...) does something (#145321 ) This would have prevented https://github.com/pytorch/pytorch/pull/139833 from dropping the triggers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145321 Approved by: https://github.com/jansel	2025-01-25 00:48:26 +00:00
Marc Horowitz	efebec5ef5	[dcp] Add ZStandard transformer (#143360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360 Approved by: https://github.com/saumishr, https://github.com/albanD ghstack dependencies: #145528	2025-01-25 00:14:07 +00:00
Marc Horowitz	f2ad2cdf1c	[utils] add try_import method for importing optional modules (#145528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528 Approved by: https://github.com/albanD	2025-01-25 00:14:07 +00:00
Aaron Gokaslan	f3304571fc	[BE][Ez]: FURB148 - remove useless enumerate calls (#145619 ) Remove useless enumerate calls Pull Request resolved: https://github.com/pytorch/pytorch/pull/145619 Approved by: https://github.com/drisspg	2025-01-24 23:37:15 +00:00
Wei Wang	0741963e01	[CI][CUDA][Blackwell] sm_\d\d no longer matches sm_100. (#145641 ) Therefore making it sm_\d+ Fixes this unit test failure: python test/test_cpp_extensions_jit.py -k TestCppExtensionJIT.test_jit_cuda_archflags Pull Request resolved: https://github.com/pytorch/pytorch/pull/145641 Approved by: https://github.com/eqy, https://github.com/malfet	2025-01-24 23:20:22 +00:00
Shangdi Yu	4cc5e880f9	Add accuracy issue support in AOTI Minifier (#145539 ) Summary: Add three more repro levels for AOTI minifier (level 2 already exists). They are the same as the existing dynamo minifier repro levels. Now AOTI minifier can minify and repro programs that have numerical accuracy issues as well. 1: Dumps the original graph out to repro.py if compilation fails 2: Dumps a minifier_launcher.py if aoti fails. 3: Always dumps a minifier_launcher.py. Good for segfaults. 4: Dumps a minifier_launcher.py if the accuracy fails. Refactor AOTI minifier unit tests to be cleaner and better re-use the existing minifier testing code. We do not need to manually patch {"aot_inductor.dump_aoti_minifier": True} to each test now, this config is generated in the test code. Differential Revision: D68294638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145539 Approved by: https://github.com/desertfire	2025-01-24 23:07:19 +00:00
zeshengzong	5b988ac4fa	[Easy] Replace paper description with link to make a concise description. (#145031 ) The descriptions on the [Transformer](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), and [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain author and paper details that seem redundant for users who just want to know how to use these modules. Replace them with a link to the paper, so users can go there if they want to learn more. Test Result Before ![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490) ![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f) ![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed) After ![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968) ![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744) ![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031 Approved by: https://github.com/mikaylagawarecki	2025-01-24 23:01:02 +00:00
Davide Italiano	57591edca1	[mps/inductor] Add support for `erfinv`. (#145643 ) After several rounds of refactoring, this seems to be done now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-24 22:55:44 +00:00
Joel Schlosser	46e06e1d09	Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588 Approved by: https://github.com/soulitzer ghstack dependencies: #144586, #144587	2025-01-24 22:45:01 +00:00
Fabian Keller	81e370fc6b	Fix type annotation of `Linear.bias` (#142326 ) Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__` which types it as `Tensor | Module`. This has two issues: - It hides the fact that `bias` is optional, and can be `None`, which in turn can hide actual bugs on the user side. - It blurs the type due to having `Module` in the union, which can require unnecessary `isinstance(linear.bias, Tensor)` checks on the user side. This PR types the `bias` attribute explicitly to fix these issues. CC @ezyang @Skylion007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326 Approved by: https://github.com/ezyang	2025-01-24 22:43:52 +00:00
Aidyn-A	70577d335e	[ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602 ) This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines. Special thanks to @Fuzzkatt for co-authoring these changes! Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602 Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>	2025-01-24 22:27:39 +00:00
Tan Hoang	5bf5ce0e15	Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478 ) Summary: Since the `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler at compile time to loading it dynamically), we change the logic for enabling the profiler to use the `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for the NCCL Collective Communication Profiler. For HCCL, we still keep the same logic. Test Plan: See https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler Differential Revision: D68515945 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478 Approved by: https://github.com/sraikund16	2025-01-24 22:21:09 +00:00
Yichen Yan	d4171b724e	Let `tensor_a.new_tensor()` be on `tensor_a.device` by default (#144958 ) Fixes #144957 Closes #73838 cc @albanD @ezyang Currently, `tensor_a.new_tensor()` returns an on-CPU tensor no matter where `tensor_a` is. This differs from the documentation and is a side effect of https://github.com/pytorch/pytorch/pull/41984. See #144957 for how the current logic breaks dynamo. This PR restores the documented behavior and adds tests for `new_tensor`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958 Approved by: https://github.com/ezyang	2025-01-24 22:12:31 +00:00
Wei Wang	2a70de7e92	[CUDA] Change slim-wheel libraries load order (#145638 ) There is no libnvjitlink in CUDA-11.x , so attempts to load it first will abort the execution and prevent the script from preloading nvrtc Fixes issues reported in https://github.com/pytorch/pytorch/pull/145614#issuecomment-2613107072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145638 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-24 22:00:56 +00:00
FEI	615bdd9c81	Improve the caching allocator test for raw alloc (#145269 ) 1 Prevent block allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests 2 Additionally, tested the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114 @jeffdaily @albanD @houseroad @eqy @aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269 Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily	2025-01-24 21:07:17 +00:00
Fernando Pérez-García	d79c6f4946	Improve torchrun documentation (#144354 ) Fixes #142042: - #142042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354 Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang	2025-01-24 20:40:05 +00:00
IvanKobzarev	caf60395f4	[torchbench] Increase tolerance for amp only poolformer_m36 (#145375 ) https://github.com/pytorch/pytorch/issues/144893 ``` python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validation --training --amp --device cuda --backend inductor ``` `--float32` and `--bfloat16` pass the accuracy check, and `--disable-cudagraph` does not change the result. accuracy_fail occurs only for `--amp`, which gives a res_error of `0.048` on a 1-element result Tensor. This fails with the `0.01` tolerance; increasing the tolerance to 0.04 makes it pass. I have not reproduced "eager_two_runs_differ" on H100. I think this is a true distribution of results with `--amp`, so increasing the tolerance to 0.04 for the amp case only makes it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375 Approved by: https://github.com/desertfire	2025-01-24 19:56:21 +00:00
Aishwarya Sivaraman	457facf7e2	[caffe2] Use the manifold cache backend as the default (#144773 ) Test Plan: CI D68155591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144773 Approved by: https://github.com/izaitsevfb	2025-01-24 19:48:34 +00:00
Sam Larsen	c16866a582	[BE] mv test/inductor_skips/* to test/inductor_expected_failures/ (#145572 ) Summary: I think skipping these tests is suboptimal. If we categorize as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As a skip, they quietly continue to skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145572 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-24 19:41:38 +00:00
Edward Z. Yang	cf063d41f8	Spruce up docs for emulate_precision_casts (#145579 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145579 Approved by: https://github.com/gchanan	2025-01-24 19:28:37 +00:00
Shunting Zhang	96149a201a	[Inductor] be able to disable cache for test (#141195 ) Let TORCHINDUCTOR_FX_GRAPH_CACHE=0 be respected in unit tests. This is helpful if I want the compilation to happen for testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that will cause the generated wrapper file to be deleted. But we may want to check those files after running a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141195 Approved by: https://github.com/masnesral, https://github.com/desertfire	2025-01-24 19:15:55 +00:00
IvanKobzarev	2fd2a950e6	[torchbench] Add meta function for _cudnn_rnn_flatten_weight (#145488 ) https://github.com/pytorch/pytorch/issues/144989 This fixes tts_angular model on torchbench for `--export-aot-inductor` I put meta function in cpp, as shape calculation requires cudnn API calls. I've extracted shape calculation to be used in implementation as this logic has some non-trivial actions and comments. ``` └─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead loading model: 0it [00:01, ?it/s] WARNING:common:Model tts_angular does not support bfloat16, running with amp instead cuda eval tts_angular WARNING:common:Model tts_angular does not support bfloat16, running with amp instead pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145488 Approved by: https://github.com/eqy, https://github.com/zou3519	2025-01-24 19:08:14 +00:00
PyTorch MergeBot	ad36f4f42c	Revert "Add generator parameter to rand*_like functions (#136780 )" This reverts commit c7b2f7dd142fc97c8ce4ad7ad591687cf295fcda. Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933))	2025-01-24 19:00:21 +00:00
c8ef	a989a0b13a	[NFC] Fix some minor typos. (#145599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599 Approved by: https://github.com/Skylion007	2025-01-24 18:58:59 +00:00
Davide Italiano	6cda572c98	[mps] Hoist erfinv logic out of the kernel in preparation for moving. (#145568 ) Will be used in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145568 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-01-24 18:51:09 +00:00
Michael Lazos	8eea554332	[Dynamo] Fix names collisions with foreach decomps (#145479 ) Fixes https://github.com/pytorch/pytorch/issues/138698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145479 Approved by: https://github.com/yanboliang	2025-01-24 18:46:58 +00:00
amdfaa	e57cdb8402	[ROCm] trunk.yml only runs pre-merge via ciflow/trunk label (#145629 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145629 Approved by: https://github.com/jeffdaily	2025-01-24 18:31:33 +00:00
Bin Bao	b8087747f5	[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373 ) Differential Revision: D68278174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373 Approved by: https://github.com/Skylion007	2025-01-24 17:59:13 +00:00
Animesh Jain	74cfb4f364	[dynamo][refactor] Move collections.namedtuple out of SkipFunctionVariable (#145547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145547 Approved by: https://github.com/zou3519 ghstack dependencies: #145519	2025-01-24 17:39:33 +00:00
David Peixotto	97c0b7cb0a	Add unique identifier to bmm thread_mm functions (#145303 ) Summary: The bmm template generates code like this ``` template<bool accum> void cpp_fused_bmm_66_micro_gemm(...) { ... } void single_thread_mm() { ... cpp_fused_bmm_66_micro_gemm(...) ... } void threaded_mm() { ... cpp_fused_bmm_66_micro_gemm(...) ... } void cpp_fused_bmm_66(...) { ... single_thread_mm(...); ... threaded_mm(...); ... } ``` The generated `fused_bmm` and `fused_bmm_microgemm` functions both have unique identifiers added to their names, but the `single_thread_mm` and `threaded_mm` functions do not. This diff adds unique identifiers to those generated functions as well. The identifier is based on the kernel name. So for the example above we would generate a bmm template name like `cpp_fused_bmm_66_single_thread_mm()`. Differential Revision: D68364772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303 Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475	2025-01-24 17:35:50 +00:00
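The naming scheme in the entry above can be sketched as plain string prefixing. This is an illustration only: `helper_names` is a hypothetical function, not the template code from the PR. The `cpp_fused_bmm_66_single_thread_mm` name comes from the commit message, while the `threaded_mm` counterpart and the second kernel name `cpp_fused_bmm_67` are assumed by analogy.

```python
def helper_names(kernel_name):
    """Return kernel-specific names for the two helper functions.

    Hypothetical helper: prefixes each helper with the kernel name so that
    two bmm kernels in one translation unit cannot define the same symbol.
    """
    return {
        "single_thread_mm": f"{kernel_name}_single_thread_mm",
        "threaded_mm": f"{kernel_name}_threaded_mm",
    }


names_66 = helper_names("cpp_fused_bmm_66")
names_67 = helper_names("cpp_fused_bmm_67")  # assumed second kernel

print(names_66["single_thread_mm"])
# No helper name is shared between the two kernels.
print(set(names_66.values()).isdisjoint(names_67.values()))
```

Deriving the suffix from the kernel name reuses an identifier that is already unique per kernel, so no separate counter is needed.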
jainapurva	547c18ee9f	Add Torchao docs link to Pytorch libraries (#145412 ) Add Torchao docs link to the libraries section in torch docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412 Approved by: https://github.com/svekars	2025-01-24 17:11:20 +00:00
amdfaa	ce371ab4c6	[ROCm] Create inductor-rocm-mi300 (#145621 ) - Adds an mi300 inductor workflow to main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145621 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-24 17:04:17 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498 ) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused some perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check into the monitoring thread. If unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API is applied at the PG level, compared to the work.get_future_result() API, which is applied at the Work level. Error detection at the PG level is much more convenient for users to handle the PG failure as a whole, e.g., restarting the PG. Error handling at the work level is still useful for users to attach work-specific context and debug the RC of the specific failing work/collective. Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast is currently done through TCPStore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
Animesh Jain	9132f4b7ce	[dynamo][guards] Log guard latency to tlparse (#145509 ) Example ![image](https://github.com/user-attachments/assets/1503ee59-ff35-46d9-9b61-16352a4a30e2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145509 Approved by: https://github.com/ezyang	2025-01-24 16:33:29 +00:00
Aaron Orenstein	1335882b2a	If mypy fails it should report the error back to lintrunner (#145550 ) This happened to me because I had a bad LD_LIBRARY_PATH and mypy was failing to run (.so load error) - but lintrunner was silent about the underlying problem. Differential Revision: [D68593081](https://our.internmc.facebook.com/intern/diff/D68593081) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145550 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2025-01-24 15:40:30 +00:00
min-jean-cho	7c314bfed4	[Intel GPU] Add TORCH_API macro to export symbol NestedTensor_to_mask for libtorch_xpu (#145467 ) Part of https://github.com/intel/torch-xpu-ops/issues/1141. The `TORCH_API` macro is added to export the symbol `NestedTensor_to_mask`, which is needed by libtorch_xpu for `NestedTensor_softmax_dropout_xpu`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145467 Approved by: https://github.com/guangyey, https://github.com/ezyang	2025-01-24 15:38:46 +00:00
atalman	5d24a9a274	Advance docker release latest version to cuda 12.4 (#145566 ) Fixed the latest tag in ghcr.io to be the cuda 12.4 docker image. TODO: need to add it to https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD. Will need to check if we can automate this by introducing a cuda_stable variable or something like this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145566 Approved by: https://github.com/nWEIdia, https://github.com/kit1980, https://github.com/malfet	2025-01-24 15:27:25 +00:00
Hongtao Yu	5c64aaea40	[triton] Update triton pin to include warp specialization support (#145120 ) The warp specialization work has been landed to the triton rc/3.2.x branch as `b2684bf3b0` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120 Approved by: https://github.com/bertmaher	2025-01-24 14:45:12 +00:00
Edward Z. Yang	bc62930765	Work around buggy use_const_ref_for_mutable_tensors (#145530 ) See https://github.com/pytorch/pytorch/issues/145522 for context This doesn't fix the problem with use_const_ref_for_mutable_tensors and the boxed wrapper, instead it just gets all of our out kernels off of this flag so that the mutable matching pattern works correctly. I also add a check in torchgen to prevent people from making this mistake in the future. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145530 Approved by: https://github.com/albanD, https://github.com/bdhirsh	2025-01-24 14:38:49 +00:00
PyTorch MergeBot	9d6927715f	Revert "Fix triton masked loading for non-block tl.loads (#144782 )" This reverts commit 31c2f36989e35ccf023a8e35c4bc21aca077d344. Reverted https://github.com/pytorch/pytorch/pull/144782 on behalf of https://github.com/ezyang due to This regresses compile time for one of our internal models by 20%, internal xref https://fb.workplace.com/groups/1075192433118967/posts/1591490218155850 ([comment](https://github.com/pytorch/pytorch/pull/144782#issuecomment-2612660287))	2025-01-24 14:28:48 +00:00
cyy	6a35d9aaa4	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-24 12:22:13 +00:00
Digant Desai	f08b9bc7e4	[WIP] Move XNNPACKQuantizer from PyTorch to ExecuTorch (#144940 ) Summary: This replicates XNNPACKQuantizer from PyTorch to ExecuTorch. Rationale: The main motivation is to avoid a pytorch pin update in OSS after updating XNNPACKQuantizer, which can be rather frequent. Other impact and considerations: The PT2e flow (which lives in PyTorch) relies heavily on XNNPACKQuantizer as an "example" implementation of a quantizer and, more importantly, for tests. For now, we will keep torch.ao.quantization.xnnpack_quantizer as is but mark it as not BC and deprecated, to discourage future new dependencies on it. Other OSS repositories using XNNPACKQuantizer from PyTorch now have to take an additional dependency on ExecuTorch. Differential Revision: D68191752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144940 Approved by: https://github.com/jerryzh168, https://github.com/mcr229	2025-01-24 10:06:07 +00:00
Oguz Ulgen	d3989ca636	Add multi env variable support to configs (#145288 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288 Approved by: https://github.com/c00w	2025-01-24 10:04:24 +00:00
Yiming Zhou	10bdd0a1cc	[BE][export] Fix hop tests with flaky memory leak (#145391 ) Summary: As title. Added `torch._dynamo.reset()` for each test. This should fix several flaky tests in `test_hop.py` such as https://github.com/pytorch/pytorch/issues/139073 Test Plan: ``` PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/export/test_hop.py TestHOPCUDA.test_serialize_export_scan_simple_cuda_float32 ``` Differential Revision: D68506280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145391 Approved by: https://github.com/ydwu4	2025-01-24 09:53:21 +00:00
drisspg	72da0a8a42	[Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502 ) # Context Prototyped here: https://github.com/pytorch/pytorch/pull/144120, we are going to make flash-attention a 3rd party submodule. We will then use the c++ sources and include into our build of libtorch.so This requires various changes to work including external and internal changes. Since these require internal changes we need to co-dev and in the co-dev environment I haven't found a way to sync submodule changes + internal only changes. This is unused for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502 Approved by: https://github.com/Skylion007	2025-01-24 09:21:41 +00:00
Wei Wang	d62e900d8c	[CI][CUDA][MultiGPU][Regression] Skip a failure due to https://github.com/pytorch/pytorch/issues/139520 (#145318 ) Related: https://github.com/pytorch/pytorch/issues/139520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145318 Approved by: https://github.com/eqy	2025-01-24 06:58:05 +00:00
Wei Wang	0e98b26b28	[CI][CUDA][Dynamic Shape] xfail: DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda (#145204 ) python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda failed to generate triton kernels, causing assert failures on 2x H100 systems (and 2x Grace H100 systems). Failures like below: Finline_call [] stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] FAIL: test_linspace4_dynamic_shapes_cuda (__main__.DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper method(*args, **kwargs) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 12212, in new_test return value(self) ^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/testing.py", line 420, in _fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 2603, in test_linspace4 self.common(fn, (torch.Tensor([]),)) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 424, in common return check_codegen( ^^^^^^^^^^^^^^ File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 82, in check_codegen self.assertTrue("def triton" in code, f"Failed to find triton kernel\n{code}") AssertionError: False is not true : Failed to find triton kernel # AOT ID: ['0_inference'] from ctypes import c_void_p, c_long, c_int import torch import math import random import os import tempfile from math import inf, nan from torch._inductor.hooks import run_intermediate_hooks from torch._inductor.utils import maybe_profile from torch._inductor.codegen.memory_planning import _align as align from torch import device, empty_strided from torch._inductor.async_compile import AsyncCompile from torch._inductor.select_algorithm import extern_kernels from torch._inductor.codegen.multi_kernel import MultiKernelCall aten = torch.ops.aten inductor_ops = torch.ops.inductor _quantized = torch.ops._quantized assert_size_stride = torch._C._dynamo.guards.assert_size_stride empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor alloc_from_pool = torch.ops.inductor._alloc_from_pool async_compile = AsyncCompile() empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p async_compile.wait(globals()) del async_compile def call(args): with torch.cuda._DeviceGuard(1): torch.cuda.set_device(1) buf0 = empty_strided_cuda((0, ), (1, ), torch.float32) return (buf0, ) def benchmark_compiled_module(times=10, repeat=10): from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance fn = lambda: call([]) return print_performance(fn, times=times, repeat=repeat) if __name__ == "__main__": from torch._inductor.wrapper_benchmark import compiled_module_main compiled_module_main('None', benchmark_compiled_module) To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145204 Approved by: https://github.com/eellison	2025-01-24 06:57:35 +00:00
Boyuan Feng	817fd14714	[BE] Type annotation for `_inductor/dependencies.py` (#145311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145311 Approved by: https://github.com/eellison	2025-01-24 06:32:48 +00:00
Xilun Wu	2ce70da96c	[cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421 ) ## Description Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is `False` only when `requires_grad=False`, and since PP's [shape inference](`d95a6babcc/torch/distributed/pipelining/stage.py (L1387)`) happens under a `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation. ## Test - Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan - `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa` Before: <img width="1880" alt="image" src="https://github.com/user-attachments/assets/872ff583-295e-4751-a280-cf7f2d41c61a" /> After: <img width="2988" alt="image" src="https://github.com/user-attachments/assets/4bdcc2e5-22a5-427a-91a5-82206d5bd78f" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421 Approved by: https://github.com/H-Huang, https://github.com/tianyu-l	2025-01-24 06:17:54 +00:00
Animesh Jain	53fc921ce2	[dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2025-01-24 06:02:03 +00:00
atalman	9752c7c1c8	[CD] Fix slim-wheel cuda_nvrtc import problem (#145582 ) Similar fix as: https://github.com/pytorch/pytorch/pull/144816 Fixes: https://github.com/pytorch/pytorch/issues/145580 Found during testing of https://github.com/pytorch/pytorch/issues/138340 Please note both nvrtc and nvjitlink exist for cuda 11.8, 12.4 and 12.6 hence we can safely remove if statement. Preloading can apply to all supporting cuda versions. CUDA 11.8 path: ``` (.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib __init__.py __pycache__ libnvrtc-builtins.so.11.8 libnvrtc-builtins.so.12.4 libnvrtc.so.11.2 libnvrtc.so.12 (.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib __init__.py __pycache__ libnvJitLink.so.12 ``` Test with rc 2.6 and CUDA 11.8: ``` python cudnn_test.py 2.6.0+cu118 ---------------------------------------------SDPA-Flash--------------------------------------------- ALL GOOD ---------------------------------------------SDPA-CuDNN--------------------------------------------- ALL GOOD ``` Thank you @nWEIdia for discovering this issue Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582 Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-24 04:47:57 +00:00
Johnny	732c4998f3	[NVIDIA] Full Family Blackwell Support codegen (#145436 ) More references: https://github.com/NVIDIA/nccl Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436 Approved by: https://github.com/ezyang, https://github.com/drisspg	2025-01-24 04:36:00 +00:00
Nikita Shulga	c184055743	[BE] Use `value_or` in layer_norm.cpp (#145417 ) Now that we have proper optional, no need to do `if (has_value) value else default_value;` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145417 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-24 04:02:23 +00:00
Nikita Shulga	4799ebf326	[MPS][BE] Turn `bicubic2d` into generic metal template (#145578 ) In preparation for more metal shaders to come Pull Request resolved: https://github.com/pytorch/pytorch/pull/145578 Approved by: https://github.com/Skylion007	2025-01-24 04:01:23 +00:00
Avik Chaudhuri	68a1505985	serde and_ operator (#145506 ) Differential Revision: [D68565887](https://our.internmc.facebook.com/intern/diff/D68565887/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145506 Approved by: https://github.com/zhxchen17, https://github.com/Skylion007	2025-01-24 03:48:03 +00:00
albanD	29ddf9a63e	Document dispatch trace build flag (#145517 ) Ok, the build flag seems to have been broken for a while since the function it calls doesn't exist anymore. Repurposed it to enable dispatcher printing (which requires a full (and slow) debug build otherwise). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145517 Approved by: https://github.com/bdhirsh	2025-01-24 03:19:39 +00:00
Sam Larsen	a40ead1fd6	Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513 ) Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions. Fixes #141761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513 Approved by: https://github.com/eellison	2025-01-24 03:17:03 +00:00
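The behavior described in the entry above (tolerate a failed temp-dir cleanup without swallowing every other exception) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `fresh_tmp_dir` and its `rmtree` parameter are hypothetical stand-ins, not the real `fresh_inductor_cache` implementation.

```python
import contextlib
import os
import shutil
import tempfile
import warnings


@contextlib.contextmanager
def fresh_tmp_dir(rmtree=shutil.rmtree):
    """Yield a temporary directory; warn instead of raising if cleanup fails.

    Hypothetical stand-in for illustration. The `rmtree` parameter exists
    only so a failing cleanup can be simulated.
    """
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        # Only the cleanup step is guarded. Exceptions raised inside the
        # with-block body still propagate to the caller.
        try:
            rmtree(path)
        except OSError as exc:
            warnings.warn(f"failed to remove {path}: {exc}")


with fresh_tmp_dir() as path:
    print(os.path.isdir(path))
print(os.path.exists(path))
```

Catching only `OSError` around the removal call keeps the guard narrow, which matches the commit's stated goal of ignoring cleanup failures specifically.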
Henry Tsang	36fcf98db6	[cutlass backend tests] Manually clear cache, test more tests in fbcode and limit configs in some tests (#145545 ) Summary: Manually clear cache: You want to clear the cache in most tests. Otherwise the link command won't work because you have multiple .o files, and you get something like `ld.lld: error: duplicate symbol: cuda_fused_0`. Test more tests in fbcode: A few tests have been skipped in fbcode. Unskip them. Limit configs in some tests: to reduce the time spent on each test. Differential Revision: D68584071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145545 Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler	2025-01-24 03:06:59 +00:00
Robert Hardwick	386650353b	[ARM] Fix bf32 and tf32 precision for tensordot unit test (#141136 ) Fixes unit test failure on aarch64 ( neoverse-v1 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141136 Approved by: https://github.com/malfet	2025-01-24 02:59:45 +00:00
Manuel Candales	d6bea398ac	Only include RMSNorm.h in layer_norm.cpp for MPS (#145524 ) Test Plan: CI Differential Revision: D68578213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145524 Approved by: https://github.com/malfet	2025-01-24 02:08:49 +00:00
Benjamin Glass	d5629889f1	cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910 ) Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910 Approved by: https://github.com/desertfire	2025-01-24 02:06:35 +00:00
Zhenbin Lin	47e65077b1	OpenReg: Remove REGISTER_GENERATOR_PRIVATEUSE1 (#144841 ) Replace REGISTER_GENERATOR_PRIVATEUSE1 with new API in AcceleratorHooksInterface. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144841 Approved by: https://github.com/albanD	2025-01-24 01:52:10 +00:00
Sam Larsen	cd68d54911	Inductor cache: Revamp how we handle frozen params (#143808 ) Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing): 1) When creating a cache entry, store the names of any frozen params, but the values of any other constants. 2) On a cache hit, restore the values of the frozen params by looking up in the current GM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808 Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison	2025-01-24 01:20:07 +00:00
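The two numbered steps in the entry above can be sketched with plain dictionaries. This is an illustration only, not the inductor cache code: `make_cache_entry`, `restore_constants`, and the dictionary layout are hypothetical, and ordinary Python values stand in for tensors and for the current `GraphModule`.

```python
def make_cache_entry(constants, frozen_names):
    """Step 1: store names of frozen params, but values of other constants."""
    return {
        "frozen_param_names": sorted(n for n in constants if n in frozen_names),
        "constants": {n: v for n, v in constants.items() if n not in frozen_names},
    }


def restore_constants(entry, current_module_attrs):
    """Step 2: on a cache hit, look frozen params up in the current module."""
    restored = dict(entry["constants"])
    for name in entry["frozen_param_names"]:
        restored[name] = current_module_attrs[name]
    return restored


# "weight" plays the role of a frozen param, "scale" of an ordinary constant.
constants = {"weight": [1.0, 2.0], "scale": 0.5}
entry = make_cache_entry(constants, frozen_names={"weight"})

# A later cache hit sees a module whose frozen param has a different value.
restored = restore_constants(entry, {"weight": [3.0, 4.0]})
print(entry)
print(restored)
```

Because ordinary constants are stored by value, a constant introduced during lowering does not need to exist on the module at lookup time; only the frozen params are resolved against it.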
zeshengzong	54e2f4b201	Fix lerp weight type promotion (#141117 ) Fixes #140601 Enable `promote_inputs_to_common_dtype` when the tensors passed to `lerp` do not share a dtype. For `lerp_Tensor` - Check whether the tensors have the same `dtype` and enable promotion if not - Remove the type-check assert For `lerp_Scalar` - `promote_inputs_to_common_dtype` already seems to be enabled by default, so just remove the type check, keeping promotion behavior consistent with `lerp_Tensor`. `lerp_Scalar` gets its TensorIteratorConfig from here `c37185c76a/aten/src/ATen/TensorIterator.cpp (L979-L985)` Test Result Test case in issue passed ```python >>> import torch >>> >>> x = torch.ones(2, 2, dtype=torch.float64) >>> w = torch.ones(2, 2, dtype=torch.float64) >>> s = torch.tensor(2.2) >>> x.lerp_(w, s) tensor([[1., 1.], [1., 1.]], dtype=torch.float64) >>> x = torch.ones(2, 2, dtype=torch.float16) >>> w = torch.ones(2, 2, dtype=torch.float16) >>> s = torch.tensor(2.2) >>> x.lerp_(w, s) tensor([[1., 1.], [1., 1.]], dtype=torch.float16) ``` ```bash $ pytest test/test_binary_ufuncs.py -k 'test_lerp_tensor_type_promotion or test_lerp_scalar_type_promotion' ``` ![image](https://github.com/user-attachments/assets/288a5294-a9ee-47f3-bbf7-d4ff986f3ba8) ```bash $ lintrunner ``` ![image](https://github.com/user-attachments/assets/d469836f-5c49-4d89-a2fd-379cad4db3af) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141117 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-24 01:18:20 +00:00
David Berard	b2c89bc115	[inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348 ) Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits. What this PR fixes: * in triton_kernel_wrap.py, AST->TTIR parsing needed to be updated for the new triton API * ir.py - don't remove None args when using newer triton versions * wrapper.py - update signature & constant handling What this doesn't fix: * correct None handling - I want to take a closer look at constant handling (including None, equal_to_1, and other constants). * cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels) test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: `1374074098`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348 Approved by: https://github.com/jansel ghstack dependencies: #145051	2025-01-24 00:34:01 +00:00
David Berard	b963ab5325	[inductor][1/N] triton support post-#5512, main components (#145051 ) Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed. The main changes in 5220 and 5512 that need to be supported: * AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now. * "signature" changes: the signature now contains _all_ args, including constexpr and constant args. * ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants". What this PR supports: * Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025 * (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling. What this PR doesn't support (TODO in follow-up PRs): * Triton versions between Dec 9, 2024 and before Jan 1, 2025 * (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch) * (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation) * (triton jan 1+) AOTI / cpp wrapper thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051 Approved by: https://github.com/jansel	2025-01-24 00:34:01 +00:00
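The raw-dict attribute format quoted in the commit above, `{(TUPLES): [["tt.divisibility", 16]]}`, can be reproduced with a small pure-Python sketch. The helper name `divisibility_attrs` is hypothetical, and the rule used here (flag integer arguments that are multiples of 16) is a simplifying assumption rather than the logic Triton or Inductor actually applies.

```python
def divisibility_attrs(arg_values, divisor=16):
    """Build {(index_tuples): [["tt.divisibility", divisor]]} for integer args."""
    # Each index is itself a 1-tuple, mirroring the nested-input-ready layout
    # described in the commit message (nested tuples are not handled here).
    indices = tuple(
        (i,)
        for i, value in enumerate(arg_values)
        if isinstance(value, int) and value % divisor == 0
    )
    return {indices: [["tt.divisibility", divisor]]}


# Args 0, 1, and 3 are multiples of 16; arg 2 is not.
attrs = divisibility_attrs([32, 64, 7, 128])
```

With these inputs the key is `((0,), (1,), (3,))`, the same example the commit message gives.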
PyTorch MergeBot	714f64329b	Revert "Add multi env variable support to configs (#145288 )" This reverts commit a8b7cb6a2ddbba4924b6b2531f1ecd2f5ed6d512. Reverted https://github.com/pytorch/pytorch/pull/145288 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint from a landrace with some recent PEP585 changes ([comment](https://github.com/pytorch/pytorch/pull/145288#issuecomment-2611278428))	2025-01-24 00:20:00 +00:00
PyTorch MergeBot	6a2b4db0a1	Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 )" This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79. Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))	2025-01-24 00:17:34 +00:00
PyTorch MergeBot	6f60c65a3a	Revert "[dynamo] Log guard latency (#145132 )" This reverts commit 0a310d738819ae000f49b32298305724117634c2. Reverted https://github.com/pytorch/pytorch/pull/145132 on behalf of https://github.com/anijain2305 due to CI failures observed after PR was merged ([comment](https://github.com/pytorch/pytorch/pull/145132#issuecomment-2611268421))	2025-01-24 00:11:50 +00:00
Davide Italiano	f0e9f87a9b	[BE/mps] Mark input args as `constant` to prevent incorrect usage. (#145535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145535 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-24 00:11:44 +00:00
Howard Huang	6aaae9d78f	Make torchelastic etcd rendezvous publicly importable (#145396 ) Make torchelastic publicly importable by raising error on import etcd lazily, [BE task, row 7](https://docs.google.com/spreadsheets/d/1TtATnLJf1rVXaBQd3X3yYqm9xNN9BIWG7QqRgrFiRRI/edit?gid=1748512924#gid=1748512924) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145396 Approved by: https://github.com/albanD ghstack dependencies: #145387	2025-01-23 23:56:45 +00:00
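The idea of deferring the failure from import time to first use, as the etcd rendezvous commit above describes, can be sketched generically. `LazyModule` is a hypothetical helper written for this illustration; it is not the torchelastic implementation.

```python
import importlib


class LazyModule:
    """Defer importing an optional dependency until an attribute is used."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes that normal lookup did not find,
        # so _name and _module above never reach this method.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ModuleNotFoundError as exc:
                raise ModuleNotFoundError(
                    f"{self._name} is required for this feature"
                ) from exc
        return getattr(self._module, attr)


# Creating the wrapper never fails; the import happens on first attribute access.
lazy_json = LazyModule("json")
encoded = lazy_json.dumps([1])
```

A module that wraps a missing dependency this way stays importable, and the error only surfaces when the dependency is actually needed.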
Chirag Pandya	f8a4f16634	[c10d] fix memory leak on shutdown (#145507 ) Summary: Fix memory leak on shutdown when socket is closed. We still need to free the buffer to make valgrind happy. Test Plan: Use `mtiavm`. Repro steps provided by cristianlume. on window 1: ``` vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 ``` on window 2: ``` vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1 ``` without the fix: ``` ==8766==ERROR: LeakSanitizer: detected memory leaks ``` With fix, no leak Differential Revision: D68566104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507 Approved by: https://github.com/XilunWu, https://github.com/d4l3k	2025-01-23 23:36:15 +00:00
PyTorch MergeBot	6dd8283381	Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296 )" This reverts commit 5531fafffefc45cd894040b2b07b0d5227430082. Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	c3fadacf84	Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304 )" This reverts commit 8c7c5f7bfcbc55638a0e4aed6eaa27f6194dbebe. Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	9553301ade	Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387 )" This reverts commit 784bb2127ca9729c646f1650ecc2cf946a583da8. Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	16c4f8c395	Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405 )" This reverts commit ec820fe57c2d6a2847569a107856e7fcff87dc5c. Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	3f6cfd0156	Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417 )" This reverts commit 99dd1bf1b93bc26080e611af54497a73a618e02a. Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:12 +00:00
PyTorch MergeBot	ab082863a1	Revert "[compiled autograd] support Tensor Subclasses in AOTBackward (#144115 )" This reverts commit 082c28c3c655984ce65c13336cff822db95ee470. Reverted https://github.com/pytorch/pytorch/pull/144115 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:12 +00:00
Animesh Jain	0a310d7388	[dynamo] Log guard latency (#145132 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132 Approved by: https://github.com/ezyang ghstack dependencies: #145351, #145420	2025-01-23 23:30:07 +00:00
PyTorch MergeBot	bf62222d81	Revert "[compiled_autograd] Rename interface to pyinterface (#145495 )" This reverts commit e1407f5aeb658c8c959d33158f465e975799a3d0. Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))	2025-01-23 23:07:17 +00:00
Oguz Ulgen	a8b7cb6a2d	Add multi env variable support to configs (#145288 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288 Approved by: https://github.com/c00w	2025-01-23 23:00:23 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
cyy	42f4fda2eb	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-23 22:47:18 +00:00
bobrenjc93	6f07847efe	Bail on checking internal overlap when dealing with unbacked symints (#145385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145385 Approved by: https://github.com/ezyang	2025-01-23 22:31:31 +00:00
Richard Zou	e1407f5aeb	[compiled_autograd] Rename interface to pyinterface (#145495 ) Summary: interface is a reserved word in some MSVC variants. Test Plan: build Differential Revision: D68561379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495 Approved by: https://github.com/xmfan	2025-01-23 21:40:59 +00:00
Shangdi Yu	302b07f166	Implement deepcopy for AOTICompiledModel (#145423 ) Summary: Fix https://github.com/pytorch/pytorch/issues/145411 Support deepcopying AOTICompiledModel. The `loader` is shallow copied. Test Plan: ``` buck2 run fbcode//mode/opt //caffe2/test/inductor:aot_inductor_package -- -r deepcopy ``` Differential Revision: D68524673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145423 Approved by: https://github.com/desertfire	2025-01-23 21:05:30 +00:00
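A deep copy that deliberately shares one member, as the commit above describes for the `loader`, can be written with Python's `__deepcopy__` hook. The classes below are hypothetical stand-ins, not the real `AOTICompiledModel`.

```python
import copy


class Loader:
    """Hypothetical stand-in for an expensive runtime handle that can be shared."""


class CompiledModel:
    def __init__(self, loader, config):
        self.loader = loader
        self.config = config

    def __deepcopy__(self, memo):
        # Share the loader (shallow), but deep-copy everything else.
        clone = CompiledModel(self.loader, copy.deepcopy(self.config, memo))
        memo[id(self)] = clone
        return clone


original = CompiledModel(Loader(), {"device": "cpu"})
clone = copy.deepcopy(original)
```

Here `clone.loader is original.loader` holds, while `clone.config` is an independent dict.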
Davide Italiano	e924ddbef1	[BE] [mps] Refactor UnaryConstants to be its own kernel. (#145230 ) In preparation for using this file for inductor (for erfinv). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145230 Approved by: https://github.com/malfet	2025-01-23 20:58:43 +00:00
Daulet Askarov	881eb86692	Fix staging for CPU tensors in OSS DCP async_save (#145408 ) Summary: As found in https://github.com/pytorch/pytorch/issues/144657 for CPU tensors we accidentally skip copying during staging because we use the offload-to-CPU helper, which is a no-op for CPU tensors. This means that if the trainer changes the original source CPU tensor value after launching async save but before the actual writing/uploading to the destination commences, the writing/uploading logic will accidentally pick up the latest state of the tensor, while it should have dealt with its own dedicated copy saved earlier. Dropping _offload_state_dict_to_cpu in favor of _copy_state_dict fixes this bug. Test Plan: Running the user script from the linked GitHub issue verifies the fix: ``` import os import torch import torch.distributed as dist import torch.distributed.checkpoint as dcp from torch.distributed.checkpoint.state_dict import get_model_state_dict import torch.nn as nn class Net(nn.Module): def __init__(self): super().__init__() self.weight = nn.Parameter(torch.ones(1, 1)) def forward(self, x): return self.layer(x) os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_PORT"] = "12345" os.environ["WORLD_SIZE"] = "1" os.environ["RANK"] = "0" dist.init_process_group() model = Net() state_dict = get_model_state_dict(model) pg = dist.new_group(backend="gloo") try: steps = [10, 20, 30, 40, 50] future = None for step in steps: # simulate a training step, e.g. 
optimizer updating values with torch.no_grad(): model.weight.data.fill_(step) if future is not None: future.result() future = None future = dcp.async_save( state_dict, checkpoint_id=f"outputs/{step}", process_group=pg, ) future.result() for step in steps: dcp.load( state_dict, checkpoint_id=f"outputs/{step}", process_group=pg, ) assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}" finally: dist.destroy_process_group(pg) dist.destroy_process_group() ``` passes all asserts with this fix. If the script is run in trunk, confirmed that it fails the first assert. Differential Revision: D68518689	2025-01-23 12:49:26 -08:00
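The staging bug described in the commit above comes down to aliasing versus snapshotting. The sketch below is a pure-Python analogy that uses lists instead of tensors; `stage_by_reference` and `stage_by_copy` are hypothetical names and do not correspond to the DCP helpers.

```python
import copy


def stage_by_reference(state):
    # Mimics an "offload" helper that is a no-op for data already on the CPU.
    return state


def stage_by_copy(state):
    # Takes a dedicated snapshot at the moment the save is launched.
    return copy.deepcopy(state)


state = {"weight": [10.0]}
aliased = stage_by_reference(state)
snapshot = stage_by_copy(state)

# The trainer mutates the value in place before the background write happens.
state["weight"][0] = 20.0
```

After the mutation, `aliased` sees the new value while `snapshot` still holds the value from launch time, which is the behavior an asynchronous save needs.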
Bin Bao	6a44a61514	[BE] Bump TIMM pin (#145320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145320 Approved by: https://github.com/Skylion007	2025-01-23 20:44:26 +00:00
Pian Pawakapan	99367ecbed	[draft export] count how many times a data-dep error shows up (#145030 ) Summary: maybe this is helpful? Test Plan: draft_export Differential Revision: D68303934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145030 Approved by: https://github.com/angelayi	2025-01-23 20:27:31 +00:00
Aaron Gokaslan	5ebca3015d	[BE]: Simplify set add with set update (#145152 ) Simplifies the set update slightly to be more readable and efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-01-23 20:18:13 +00:00
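The kind of simplification named in the commit above can be shown in a few lines. This is a generic illustration with made-up data, not code taken from the PR.

```python
items = ["a", "b", "a", "c"]

# Element-by-element insertion.
seen_loop = set()
for item in items:
    seen_loop.add(item)

# Equivalent bulk insertion with a single call.
seen_bulk = set()
seen_bulk.update(items)
```

Both sets end up with the same members; `update` just expresses the intent in one call.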
PyTorch MergeBot	d7b6746470	Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347 )" This reverts commit c27dd9cf72265161f85a18c0b19f365097f7a1ac. Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))	2025-01-23 20:06:07 +00:00
Pian Pawakapan	d53f2067fe	[BE][export] add "+export" logging to de/serialization (#145283 ) adds de/serialization debug logging to `TORCH_LOGS="+dynamic"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145283 Approved by: https://github.com/ydwu4, https://github.com/angelayi	2025-01-23 19:47:48 +00:00
PyTorch MergeBot	ce4a097bf7	Revert "Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 )" This reverts commit 55084443cabbaf6c28d8c546d8988cf3ed0f3d1c. Reverted https://github.com/pytorch/pytorch/pull/144829 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/144829#issuecomment-2610855579))	2025-01-23 19:37:54 +00:00
iremyux	527101fa95	Move Windows arm64 scripts from pytorch/builder (#144317 ) This PR moves the Windows Arm64 scripts from the builder repository to the main repository. The corresponding PR to pytorch/builder that removes them is here : https://github.com/pytorch/builder/pull/2058 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144317 Approved by: https://github.com/Skylion007, https://github.com/seemethere Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-23 19:29:29 +00:00
Irem Yuksel	66bf7da446	Enable sleef for Win Arm64 (#144876 ) The Sleef module was disabled for Windows Arm64 in `b021486405`. This PR enables it again since the issue is no longer valid. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876 Approved by: https://github.com/albanD, https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-23 19:22:58 +00:00
Xu Zhao	991a4b5925	[dynamo] Add `--profile-details` and `--export-perfdoctor` option (#144751 ) Summary: Add `--profile-details` option to add shapes and other details to the Kineto profile. Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz Differential Revision: D68134547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751 Approved by: https://github.com/drisspg	2025-01-23 19:09:40 +00:00
Renato Arantes	5b37249259	Enable fp16 linear layers in PyTorch via ACL (#144992 ) This pull request aims to enable the use of linear layers with the fp16 data type through the ACL. On a Graviton3 instance running with 16 threads, `torch.randn(2048, 4096, dtype=torch.half)` will take 50+% less time to complete compared with `torch.randn(2048, 4096, dtype=torch.float32)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144992 Approved by: https://github.com/ng-05, https://github.com/digantdesai, https://github.com/malfet	2025-01-23 19:07:54 +00:00
Yang Wang	6d4f5f7688	[Utilization][Usage Log] Add data model for record (#145114 ) Add data model for consistency and data model change in the future. The data model will be used during the post-test-process pipeline Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114 Approved by: https://github.com/huydhn	2025-01-23 19:04:41 +00:00
Joona Havukainen	2f317bbdbc	Missing autorelease in lstm_mps caused a ton of leaked memory (#145503 ) The dictionary held onto the new MPSGraphTensorData objects and MPSNDArrays. Regression caused by https://github.com/pytorch/pytorch/pull/95137 Fixes #145374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145503 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-23 18:54:30 +00:00
Nikhil Gupta	41b38f755c	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 ) https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue. 1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392 ) and fixes the KleidiAI mirror issue. 2. KleidiAI is now cloned from the github mirror instead of arm gitlab Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2 Fixes https://github.com/pytorch/pytorch/issues/145273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505 Approved by: https://github.com/malfet	2025-01-23 18:50:59 +00:00
Simon Fan	34b8d8b0c0	update compile time benchmarks to dump compile times to stdout and csv (#145447 ) ```python # inductor.csv dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186 ``` ```python loading model: 0it [01:27, ?it/s] cuda eval cait_m36_384 Compilation time (from dynamo_timed): 87.705186276 # <---------------- pass TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519 STATS: call_* op count: 2510 \| FakeTensorMode.__torch_dispatch__:101743 \| FakeTensor.__torch_dispatch__:12959 \| ProxyTorchDispatchMode.__torch_dispatch__:41079 Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447 Approved by: https://github.com/ezyang	2025-01-23 18:49:19 +00:00
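A CSV in the shape shown in the commit above can be read with the standard library. The header and row below are copied from that commit message; the variable names and the idea of keying the latency by model name are assumptions made for illustration.

```python
import csv
import io

# Header and row copied from the commit message above.
INDUCTOR_CSV = (
    "dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,"
    "unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,"
    "compilation_latency\n"
    "cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186\n"
)

rows = list(csv.DictReader(io.StringIO(INDUCTOR_CSV)))
# Map each model name to its compilation latency as a float.
latency_by_model = {row["name"]: float(row["compilation_latency"]) for row in rows}
```

In practice the same `DictReader` call would be pointed at the file the benchmark writes instead of an in-memory string.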
Boyuan Feng	629fb1590c	[BE] Type annotate pad_mm.py (#145409 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145409 Approved by: https://github.com/Skylion007	2025-01-23 18:34:24 +00:00
Animesh Jain	015c6d6fdb	[dynamo][guards] Turn on profiling of guard manager (#145420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420 Approved by: https://github.com/ezyang ghstack dependencies: #145351	2025-01-23 18:17:43 +00:00
Zheng, Zhaoqiong	fef92c9447	Fix IndentationError of code example (#145251 ) I found an IndentationError when trying to copy and paste the torch.compile inference example. This PR fixes the formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251 Approved by: https://github.com/mikaylagawarecki Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-23 18:17:11 +00:00
Boyuan Feng	9a5bc7b6dd	[BE] Type annotate metrics.py (#145418 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145418 Approved by: https://github.com/Skylion007	2025-01-23 18:13:59 +00:00
Yidi Wu	bdc2c2a237	[be] fix flaky test aot_export_ cond caused by free symbol lifting and automatic dynamic shape (#145330 ) Fixes https://github.com/pytorch/pytorch/issues/139998#issuecomment-2605908426. It seems to be an issue caused by the interaction between dynamoed hop X automatic dynamic shape X auto_lift_free symbols. The immediate error is that the assertExpectedInline of the graph can sometimes be different, e.g. see https://hud.pytorch.org/flakytest?name=test_aot_export_with_torch_cond&suite=TestAOTExport&limit=100, where sometimes the shapes are lifted as input to the cond and sometimes they're not. The root cause of the flakiness is that the two invocations of torch.cond trigger two torch.compile calls on the same code object ([code](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/cond.py#L192)), and trigger automatic dynamic shape because in test_aot_export_with_torch_cond, x has shape (3, 4) while the pre_dispatch one has shape (2, 2). Because we auto lift free symbols for dynamic shaped input, this causes cond to sometimes have the shape as arguments and sometimes not. This PR adds a simple fix by adding a _dynamo.reset before each torch.cond test. This fixes the error by not triggering automatic dynamic shape. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145330 Approved by: https://github.com/zou3519	2025-01-23 18:12:58 +00:00
Yidi Wu	3c247ee8c4	[hop][be] add utils for more comprehensive input alias and mutation (#145298 ) This PR implements the idea, from @zou3519, of checking input mutations through tensor versions and checking aliasing via storage. Previously, we relied on whether there is an in-place op that takes a placeholder input, which doesn't take views into account. When writing the PR, I also noticed a bug in the previous input mutation checking logic: we were checking the operators in functionalized_f, where all the mutating ops have already been replaced, so we wouldn't be able to detect anything. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298 Approved by: https://github.com/zou3519	2025-01-23 18:12:28 +00:00
Manuel Candales	b0f3597133	Add fused rms_norm implementation for MPS backend (#145301 ) Adding a fused rms_norm implementation for MPS backend. This eliminates most of the current CPU overhead, making this computation GPU bound and improving latency of rms_norm by 30x-40x on MPS backend The metal shader was adapted from MLX: `e6a7ab9675/mlx/backend/metal/kernels/rms_norm.metal` The numbers below are averages over 1000 runs of RMSNorm, obtained on an M1 Pro. Benchmarking Results (Before): ``` Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : True \| True \| True \| True \| True \| True Average Time (us) : 140.5 \| 171.0 \| 170.4 \| 10.9 \| 13.3 \| 13.5 ``` Benchmarking Results (After): ``` Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16 Outputs Match : True \| True \| True \| True \| True \| True Average Time (us) : 4.0 \| 3.9 \| 3.9 \| 10.0 \| 12.4 \| 13.0 ``` Profiling Results (Before): ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::rms_norm 2.35% 3.284ms 100.00% 140.038ms 140.038us 1000 aten::mul 33.61% 47.068ms 33.61% 47.068ms 23.534us 2000 aten::pow 17.04% 23.868ms 17.43% 24.402ms 24.402us 1000 aten::add_ 16.52% 23.130ms 16.78% 23.497ms 23.497us 1000 aten::mean 15.82% 22.151ms 15.82% 22.151ms 22.151us 1000 aten::rsqrt 13.63% 19.085ms 13.71% 19.198ms 19.198us 1000 aten::item 0.46% 639.370us 0.56% 788.376us 0.394us 2000 aten::type_as 0.21% 295.507us 0.27% 371.291us 0.371us 1000 aten::to 0.13% 177.742us 0.13% 177.742us 0.059us 3000 aten::_local_scalar_dense 0.11% 149.006us 0.11% 149.006us 0.075us 2000 ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 140.038ms ``` Profiling 
Results (After): ``` ----------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::rms_norm 63.21% 832.875us 100.00% 1.318ms 1.318us 1000 aten::empty_like 16.06% 211.631us 36.79% 484.681us 0.485us 1000 aten::empty_strided 20.72% 273.050us 20.72% 273.050us 0.273us 1000 ----------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 1.318ms ``` Benchmarking and profiling script: ```python import torch import torch.nn as nn from torch.profiler import profile import time def benchmark(device, dtype): model = nn.RMSNorm(2048, device=device) # Create example inputs x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype) w = torch.randn(2048, requires_grad=False, device=device, dtype=dtype) eps = 1e-5 # Check output y = torch.ops.aten.rms_norm(x, [2048], w, eps) z = torch.ops.aten.rms_norm(x.cpu(), [2048], w.cpu(), eps) outputs_match = torch.allclose(y.cpu(), z) # Measure time manually start_time = time.time() * 1000 for _ in range(1000): with torch.no_grad(): y = model(x) torch.mps.synchronize end_time = time.time() * 1000 manual_delta = (end_time - start_time) average_time = f"{manual_delta:6.1f}" return outputs_match, average_time outputs_match_list = [] average_time_list = [] for device in ["mps", "cpu"]: for dtype in [torch.float32, torch.float16, torch.bfloat16]: outputs_match, average_time = benchmark(device, dtype) outputs_match_list.append(str(outputs_match)) average_time_list.append(average_time) print("\nBenchmarking Results:") print("---------------------") print("Device : MPS \| CPU") print("Dtype : FP32 \| FP16 \| BF16 \| FP32 \| FP16 \| BF16") print(f"Outputs Match : ", " \| ".join(outputs_match_list)) print(f"Average Time (us) :", " \|".join(average_time_list)) 
device = "mps" dtype = torch.float32 model = nn.RMSNorm(2048, device=device) x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype) # Run and profile the model with profile() as prof: with torch.no_grad(): for _ in range(1000): y = model(x) torch.mps.synchronize # Print profiling results print("\n\nProfiling Results (MPS/FP32):") print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145301 Approved by: https://github.com/malfet	2025-01-23 18:07:10 +00:00
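For reference, the computation that the unfused path in the commit above spreads over `pow`, `mean`, `add_`, `rsqrt`, and `mul` is the standard RMSNorm formula, y = x * rsqrt(mean(x^2) + eps) * weight. The pure-Python function below is a minimal sketch of that formula over a 1-D list; it is not the Metal shader or the ATen kernel.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm over a 1-D list: x * rsqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


# With unit weights and eps=0, the output has a mean square of 1.
y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

A fused kernel computes the same result in one pass instead of dispatching each step as a separate op.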
Ryan Guo	a86fa779ce	[BE] Fix edge case in translation validation bisector (#145414 ) This patch fixes a small bug for the binary-search algorithm in translation validation bisector. Fixes #131303. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145414 Approved by: https://github.com/ysiraichi, https://github.com/zou3519	2025-01-23 17:35:28 +00:00
Sam Larsen	045698653a	[BE] Remove test_ops_gradients from FIXME_inductor_dont_reset_dynamo (#145308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145308 Approved by: https://github.com/zou3519 ghstack dependencies: #145306	2025-01-23 17:25:04 +00:00
Bartlomiej Stemborowski	3a8d3785f7	[ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482 ) PR#141153 exposed the option to collect sizes as dynamic. After this change, the function set_autograd_compiler returns a PyTuple object which is populated using the PyTuple_SET_ITEM function. Yet, that function steals a reference to the object and doesn't INCREF it. So currently we are missing an INCREF on prior_compiler when it is Py_None and an INCREF on prior_dynamic, which is either Py_False or Py_True. This bug may lead to memory corruption. @xmfan @jansel @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482 Approved by: https://github.com/albanD, https://github.com/jansel	2025-01-23 17:13:56 +00:00
drisspg	c6707734de	Enable non power of 2 head_dim for FlexAttention (#133495 ) # Summary - Adds support for non-power of 2 headdim by launching blocks w/ head_dim rounded to the next valid power. - Other option I considered was building up the final dot_products with smaller blocks (this would probably work but for sake of code complexity going with this option for now) ### Corollary We had a bug in our backwards kernel where we were using index_k instead of index_v. This should have shown up for the qk_head_dim != v_head_dim cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495 Approved by: https://github.com/Chillee	2025-01-23 17:05:38 +00:00
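The rounding step mentioned in the FlexAttention commit above can be illustrated with a small helper. This sketch assumes "next valid power" means the next power of two; `next_power_of_2` here is a plain-Python function written for this example, and the head dimensions are made-up inputs.

```python
def next_power_of_2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


# Block sizes a kernel might launch with for several head dimensions.
padded = {d: next_power_of_2(d) for d in (64, 80, 96, 128)}
```

Head dimensions that are already powers of two are unchanged, while 80 and 96 both round up to 128; a kernel using the padded size would also need to mask the extra lanes.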
Howard Huang	bf4f8919df	Fix test_modules_can_be_imported (#145387 ) `test_modules_can_be_imported` test is currently failing due to a few missing private modules and this PR gets it working before I start to clean up the public allow list Pull Request resolved: https://github.com/pytorch/pytorch/pull/145387 Approved by: https://github.com/albanD	2025-01-23 16:03:00 +00:00
PyTorch MergeBot	768ad0886f	Revert "Binary upload checksum (#144887 )" This reverts commit 2efa98d69d362e4ee6f15938ec8ded30bf5c40fd. Reverted https://github.com/pytorch/pytorch/pull/144887 on behalf of https://github.com/atalman due to Broke nightly index ([comment](https://github.com/pytorch/pytorch/pull/144887#issuecomment-2610066277))	2025-01-23 15:10:42 +00:00
Wang, Chuanqi	0802e78315	[CD] Disable Kineto for XPU Windows CD (#145255 ) Due to issue #145155, disable Kineto for XPU Windows CD temporarily. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145255 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-01-23 14:09:52 +00:00
Aaron Orenstein	629840e038	Backout PEP585 use of Iterable (#145438 ) Summary: Importing Iterable from collections.abc here causes an internal product to fail MRO discovery causing a collision between Iterable and Generic. This fixes the failure on D68461304 Differential Revision: D68531443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438 Approved by: https://github.com/izaitsevfb	2025-01-23 11:45:37 +00:00
cyy	29f52e3972	[2/N] Remove unnecessary once flag usage (#145057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057 Approved by: https://github.com/albanD	2025-01-23 09:48:46 +00:00
Shunting Zhang	b6941d4e42	[inductor] fix autotuning memory usage (#145410 ) We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` will need to allocate a temporary tensor on GPU to store a contiguous copy of `gpu_tensor`: `6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)` Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB peak memory usage. Note that, with all the following efforts 1. donated buffer 2. inplace padding 3. this PR We save 3GB peak memory (18.6GB -> 15.5GB) for GPT2 model for torch.compile. The peak memory of GPT2 is like a '...\_M\_...' shape. There are 2 places where we reach the peak. Donated buffer removes the first peak by computing grad_softmax inplace, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding. Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile. With 1 & 2, the peak memory is 1. 17.7GB with a cold cache 2. 15.5GB with a warm cache (since the autotuning overhead is skipped) With 1 & 2 & 3, we save 3GB peak memory (18.6GB -> 15.5GB) whether or not autotuning happens Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410 Approved by: https://github.com/masnesral, https://github.com/jansel ghstack dependencies: #140249, #145325	2025-01-23 09:34:23 +00:00
amathewc	638903aeee	Adapt Dynamo tests to HPUs using instantiate_device_type_tests (#144387 ) MOTIVATION We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices. Other accelerators can also extend the functionality by adding the device in the devices list. ( For eg: xpu ) CHANGES Create a separate class for test functions running on CUDA devices Extend the functionality of these tests to include HPUs Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices Previously we had submitted some changes in https://github.com/pytorch/pytorch/pull/140131 . However, deleted that PR due to merge conflicts and other issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387 Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey	2025-01-23 09:24:42 +00:00
Shunting Zhang	d3f196909d	[inductor] let inplace-padding support cpp-wrapper (#145325 ) Some context: Inplace padding is an optimization to do padding in place. E.g., if a tensor has size [2048, 2047] and stride [2048, 1] and we need to pad one extra element to the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding inplace. This saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor. It only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and cause OOB memory access (although I have never seen this happen, maybe due to overallocation in the CUDACachingAllocator, this should still be fixed). The fix is when we allocate the tensor, instead of doing something like: ``` buf0 = randn_strided([2048, 2047], [2048, 1]) ``` we do some small overallocation ``` buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1]) ``` cpp_wrapper needs special handling since memory allocation goes through a different code path than the Python wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325 Approved by: https://github.com/desertfire, https://github.com/jansel ghstack dependencies: #140249	2025-01-23 09:22:38 +00:00
Justin Chu	f52901a0a7	[ONNX] Remove LegacyDynamoStrategy (#145442 ) It's legacy. So remove. Shouldn't affect anything and will facilitate cleaning up our legacy code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145442 Approved by: https://github.com/titaiwangms	2025-01-23 07:56:04 +00:00
Sam Larsen	28c251dd0b	[BE] Remove test_modules from FIXME_inductor_dont_reset_dynamo (#145306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145306 Approved by: https://github.com/zou3519	2025-01-23 06:37:21 +00:00
Davide Italiano	f56c638849	[c10/metal] Add a vectype variant for `short`/`int`/`long` (#145430 ) Some of the kernels (exp_complex/atan_complex) need the specialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-23 04:52:56 +00:00
Animesh Jain	c58198184b	[dynamo][dicts] Insert LENGTH guard on an if condition on dict (#145432 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145432 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-23 04:40:56 +00:00
Andy Lugo	faa10faa2c	[ROCm] CK SDPA - Move arch check to CK patch (#144777 ) __gfxXXX__ should only be visible by device code. Move the check to the ck kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/144777 Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell, https://github.com/jianyuh	2025-01-23 04:12:25 +00:00
Chirag Pandya	5e6451ea78	[c10] catch c10 error and log message (#145413 ) Summary: Explicitly catch c10 error and log the error message only. The standard exception `e.what()` below ends up logging the stack trace that is confusing users. See S477887 for details. Test Plan: tested locally. ``` buck test caffe2/test/cpp/c10d:TCPStoreTest buck2 daemon constraint mismatch: Version mismatch; killing daemon... Starting new buck2 daemon... Connected to new buck2 daemon. File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp Watchman fresh instance: new mergebase, cleared graph state, cleared dep files Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`. Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`. Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`. Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68 Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918 Network: Up: 1.5GiB Down: 4.7GiB (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161) Loading targets. Remaining 0/3024 69199 dirs read, 687558 targets declared Analyzing targets. Remaining 0/31483 1481904 actions, 1719048 artifacts declared Executing actions. Remaining 0/250391 77:11:29.7s exec time total Command: test. Finished 2031 local, 45445 remote, 51473 cache (52% hit) 20:16:36.9s exec time cached (26%) Time elapsed: 7:32.7s Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D68516080 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413 Approved by: https://github.com/fduwjj	2025-01-23 03:45:47 +00:00
Yu, Guangye	719938c77f	Generalize pin memory logic for accelerator when non blocking copy happened (#143783 ) # Motivation fix https://github.com/pytorch/pytorch/issues/143641 Generalize pin memory logic for accelerator when non-blocking copy happened. Each accelerator has its implementation on `empty_strided`. The accelerator which doesn't have pin memory mechanism could ignore or mimic when pin_out is True. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143783 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #144959	2025-01-23 03:43:05 +00:00
Yu, Guangye	28b6430823	Introduce a new API isAcceleratorExcluded (#144959 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144959 Approved by: https://github.com/albanD	2025-01-23 03:43:05 +00:00
Animesh Jain	5a18f1e1eb	[dynamo] Support fx map_aggregate (#145351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351 Approved by: https://github.com/zou3519	2025-01-23 03:19:30 +00:00
PyTorch MergeBot	d95a6babcc	Revert "Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 )" This reverts commit 0bff37788043626ee5e472389f88cbbbf7add997. Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019))	2025-01-23 01:10:31 +00:00
albanD	0d28188cc8	Move privateuse1 test out of test_utils and make them serial (#145380 ) Fixes https://github.com/pytorch/pytorch/issues/132720 The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-01-23 00:31:39 +00:00
amdfaa	c9e12d6a3b	[ROCm] Update rocm.yml and add rocm-mi300.yml (#145398 ) - Added another workflow to run the mi300 jobs post-merge. - Updated rocm.yml to use mi200s instead of mi300s. - Required to get an idea of how PRs are landing on our mi200s and mi300s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-23 00:07:50 +00:00
Wenqin Yang	1e32842324	Improve softmax's perf in cuda (#144679 ) Fixes #144645 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144679 Approved by: https://github.com/eqy	2025-01-23 00:02:57 +00:00
Yiming Zhou	d0a2e11284	[BE][export] Change custom_op registration style (#145315 ) Summary: `test_unbacked_bindings_for_divisible_u_symint` has been flaky for a while due to ``` Tried to register an operator (mylib::foo(Tensor a, Tensor b) -> Tensor) with the same name and overload name multiple times. ``` This is likely because all variants of this test (non-strict, retrace, serdes) are run simultaneously, so in later tests the operator has already been registered. In this diff, we change the registration style. Test Plan: ``` buck2 test mode/dev-nosan caffe2/test:test_export -- -r test_unbacked_bindings_for_divisible_u_symint ``` Differential Revision: D68465258 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145315 Approved by: https://github.com/zou3519	2025-01-22 23:46:51 +00:00
Hyunho Yeo	4803e20bc7	[S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322 ) Summary: As titled Test Plan: - Passed CI tests buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled ``` https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/ Differential Revision: D68463242 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322 Approved by: https://github.com/yuhc, https://github.com/albanD	2025-01-22 23:39:43 +00:00
Aaron Orenstein	35c8c31f11	Fix for failure in D68425364 (#145304 ) Summary: Back out change from #145166 which causes an internal model to fail. Differential Revision: D68459095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145304 Approved by: https://github.com/izaitsevfb	2025-01-22 23:33:02 +00:00
Li Yu (ads)	e6a84be3d3	[PyTorch] Add backend aot_eager_decomp_partition_with_mode (#143250 ) Summary: ## Why To make it possible to run torch dispatch mode inside compiled modules. This is to enable running MemoryTrackerMode (in next diff) to collect memory usage of compiled modules. ## What Add a backend aot_eager_decomp_partition_with_mode. Add an enable_log to the backend to control the compilation logging (which can be very verbose and slow the run of mode) Test Plan: unittest E2e tested in the next diff which shows the memory read from the mode passed to this backend is very close to the actual job's memory snapshot. Differential Revision: D67227144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143250 Approved by: https://github.com/bdhirsh	2025-01-22 23:20:59 +00:00
PyTorch MergeBot	f0a210bf5d	Revert "Output of nonzero is transposed, fix fake tensor (#144695 )" This reverts commit 693d8c7e945cc494bd31ad694ae4f4b6ff13b82a. Reverted https://github.com/pytorch/pytorch/pull/144695 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68461259 ([comment](https://github.com/pytorch/pytorch/pull/144695#issuecomment-2608443589))	2025-01-22 23:04:50 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
PyTorch MergeBot	6e53588789	Revert "[BE]: Simplify set add with set update (#145152 )" This reverts commit 0cb9b2284a31fa497d684dbc2f56398c1d1e3114. Reverted https://github.com/pytorch/pytorch/pull/145152 on behalf of https://github.com/davidberard98 due to land race with https://github.com/pytorch/pytorch/pull/145165 broke lint ([comment](https://github.com/pytorch/pytorch/pull/145152#issuecomment-2608378172))	2025-01-22 22:14:26 +00:00
PyTorch MergeBot	dddf52b1b9	Revert "Enable grep_linter to use -a (#144589 )" This reverts commit 3c55669b8814237e018a613a494564da5bea9f15. Reverted https://github.com/pytorch/pytorch/pull/144589 on behalf of https://github.com/clee2000 due to the line parameter is kind of important and -a is not as important as I thought it was so I'm going to revert this ([comment](https://github.com/pytorch/pytorch/pull/144589#issuecomment-2608349155))	2025-01-22 21:55:27 +00:00
rzou	082c28c3c6	[compiled autograd] support Tensor Subclasses in AOTBackward (#144115 ) Compiled autograd's initial trace traces through the AOTBackward epilogue. The Tensor Subclass code is not traceable. This PR changes it so that when we see Tensor Subclass constructors, we proxy nodes for their construction into the graph. Test Plan: - New basic test with TwoTensor - Existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/144115 Approved by: https://github.com/jansel, https://github.com/xmfan, https://github.com/bdhirsh ghstack dependencies: #143296, #143304, #143387, #143405, #143417	2025-01-22 21:51:07 +00:00
rzou	99dd1bf1b9	[compiled autograd] stop specializing on metadata during initial trace (#143417 ) The previous PRs built up to this. We change compiled autograd's initial trace to stop baking in metadata. While tracing, we allocate some weirdly shaped tensors that we can put proxies on. The initial trace should not be accessing any metadata of these tensors (it will likely error out if it does because of how weird the shapes are). This involved fixing some various sites where we do specialize on the metadata, like: - we change CopySlices's apply_with_saved to proxy some calls into the graph (this change is fairly hard to split out by itself). - we stop calling InputBuffer::add - we delete the weird metadata from the graph so that no graph passes can make use of it. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304, #143387, #143405	2025-01-22 21:51:07 +00:00
rzou	ec820fe57c	[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405 ) We will always proxy autograd.Function nodes in compiled autograd's initial graph capture (previously there was an option to proxy vs trace into the autograd.Function) We have some requirements for the AOTBackward. Compiled Autograd runs accumulate grad reordering passes on the AOTBackward graph directly after the initial graph capture, so we can't just proxy a single node for it. Instead, we: - proxy the AOTBackward prologue function into the CA graph - copy-paste the AOTBackward graph into the CA graph - trace directly through the epilogue (the traced nodes go into the CA graph). Tracing through the epilogue is safe (assuming no Tensor subclasses) because the only thing the epilogue does is drop some outputs. The Tensor subclass situation was already broken so this doesn't regress anything but this PR sets it up to be fixed (in a followup, where we will proxy "make_subclass" calls into the graph from the epilogue). Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304, #143387	2025-01-22 21:50:56 +00:00
rzou	784bb2127c	[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387 ) We define a functional version of a C++ torch::autograd::Function. The functional version reconstructs the ctx object and then calls backward with it. Some more details: - we define how to pack/unpack ctx.saved_data into an IValue. It's a Dict[str, IValue], so it wasn't difficult. - every call to CppNode::apply_with_saved binds a new function to Python. This is because we're unable to reuse a previously bound function (the schema may change depending on what the user actually puts into their Dict[str, IValue]). Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304	2025-01-22 21:50:47 +00:00
rzou	8c7c5f7bfc	[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304 ) CopyBackwards is a manual C++ torch::autograd::Node; we update its apply_with_saved to proxy a functional version of it into the graph instead of inlining into it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304 Approved by: https://github.com/xmfan, https://github.com/jansel ghstack dependencies: #143296	2025-01-22 21:50:37 +00:00
rzou	5531fafffe	[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296 ) This PR is on the way to getting compiled autograd's initial capture to stop specializing on Tensor metadata. This PR changes compiled autograd's initial capture to proxy an opaque (w.r.t. Dynamo) function into the graph for all built-in codegen'ed autograd nodes and validate_outputs. We changed each codegen'ed apply_with_saved (e.g. MulBackward0::apply_with_saved) to call into Python to proxy a function (compiled_autograd.ops.MulBackward0) into the graph. Then, we use the node's InputMetadata to "guess" at the properties of the output Tensors to create some new FakeTensors. Some details: - MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to call into Python via libtorch_python. There is an indirection (PyCompilerInterface) to do this. - MulBackward0::apply_with_saved passes a C++ function to Python. To make our lives easier, every codegen'ed apply_with_saved passes a C++ function with the same signature `(variable_list, ivalue_list) -> variable_list`. - We define how to pack arbitrary C++ types into IValue via a helper IValuePacker struct and codegen functional variants of each builtin C++ autograd node (e.g. MulBackward0_apply_functional_ivalue). MulBackward0 before this PR: https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de MulBackward0 after this PR: https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0 Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296 Approved by: https://github.com/jansel	2025-01-22 21:50:29 +00:00
Aaron Gokaslan	0cb9b2284a	[BE]: Simplify set add with set update (#145152 ) Simplifies the set update slightly to be more readable and efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-01-22 21:31:13 +00:00
Ryan Guo	9f150786bb	[dynamo] Fix numpy test accuracy error induced by randomness divergence (#145293 ) Previously `TestGradient.test_second_order_accurate` was failing because of a small tolerance error (0.03... which is above the 0.03 tolerance). Upon investigating, `np.random.random` caused some divergence between eager and compiled randomness because in compiled we are not using `np.random`'s random seed, rather we end up using `torch`'s. This in turn caused numerical divergence and aforementioned accuracy issue. This patch fixes the failure by patching the test case with `use_numpy_random_stream=True`, which forces a graph break on `np.random.random()` and thereby falling back to eager to ensure consistency of the numpy randomness. Fixes #116746. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145293 Approved by: https://github.com/lezcano	2025-01-22 20:53:02 +00:00
Catherine Lee	2efa98d69d	Binary upload checksum (#144887 ) Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887 Approved by: https://github.com/atalman	2025-01-22 20:46:04 +00:00
Johnny	a57133e3c7	[NVIDIA] Jetson Thor Blackwell Support codegen (#145395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395 Approved by: https://github.com/eqy, https://github.com/malfet	2025-01-22 20:13:19 +00:00
albanD	0940eb6d44	Reverting the PR adding Kleidiai-based int4 kernels (#145392 ) Mitigation for https://github.com/pytorch/pytorch/issues/145273 Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392 Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai	2025-01-22 20:11:49 +00:00
Nikita Shulga	95ff9f0340	[Doc] Add period at the end of the sentence (#145384 ) Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/145384/generated/torch.compiler.disable.html#torch-compiler-disable Fixes https://github.com/pytorch/pytorch/issues/145365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145384 Approved by: https://github.com/huydhn, https://github.com/svekars, https://github.com/kit1980	2025-01-22 19:56:31 +00:00
PyTorch UpdateBot	3917053f63	[audio hash update] update the pinned audio hash (#145328 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145328 Approved by: https://github.com/pytorchbot	2025-01-22 19:39:03 +00:00
Nikita Shulga	70ccbade83	[MPSInductor] Add `gamma` op (#145341 ) By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #145309	2025-01-22 19:37:45 +00:00
Aaron Orenstein	b81209557b	Fix tests broken by #145176 (#145393 ) #145176 broke test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes test/dynamo/test_repros.py::ReproTests::test_graph_break_on_jit_isinstance this backs out the offending change until it can be fixed properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145393 Approved by: https://github.com/ZainRizvi	2025-01-22 19:33:16 +00:00
Aidyn-A	e8e3c03f96	[Test][Inductor] Fix test_tma_graph_breaks (#145271 ) Per title. Before these changes, below tests: ``` test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ``` fail with the following message: ``` __________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________ Traceback (most recent call last): File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/usr/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper method(args, kwargs) File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test test(self, *param_kwargs) File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks eager_out = f(a, b) ^^^^^^^ File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f t.element_size(), ^ UnboundLocalError: cannot access local variable 't' where it is not associated with a value To execute this test, run the following from the base repo dir: python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271 Approved by: https://github.com/jansel	2025-01-22 19:18:59 +00:00
Zhengxu Chen	ac8ddf1150	[export][be] Clean up local imports from export [1/n] (#145287 ) Summary: as title Test Plan: CI Differential Revision: D68449844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145287 Approved by: https://github.com/pianpwk	2025-01-22 19:09:17 +00:00
rzou	30717d25fe	Move Dynamo test to skip from expected_failures (#145390 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/116105 This test is consistently failing. It shouldn't be marked as a flaky test in the CI using the disabled tests mechanism. I'm skipping the test for now. Test Plan: - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/145390 Approved by: https://github.com/williamwen42	2025-01-22 19:06:39 +00:00
Wu, Chunyuan	0bff377880	Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 ) Fixes https://github.com/pytorch/pytorch/issues/142466. Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case. Test plan: ``` python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32 python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859 Approved by: https://github.com/mingfeima, https://github.com/malfet	2025-01-22 17:52:53 +00:00
Ryan Guo	698106951e	[dynamo] Re-enable `test_fs` family for dynamo (#145302 ) Fixes #91467. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145302 Approved by: https://github.com/zou3519	2025-01-22 17:50:05 +00:00
Hyunho Yeo	057d9aff39	[S481486] [MTIA] Correct mtia.device_count() API (#145338 ) Summary: Prev: Count the number of "general" accelerators Curr: Count the number of MTIA devices by using the MTIA runtime API Test Plan: ``` buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_get_device_count ``` https://www.internalfb.com/intern/testinfra/testrun/8162774572631995 Reviewed By: BoyueZheng Differential Revision: D68472668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338 Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue	2025-01-22 17:45:15 +00:00
Huy Do	c27dd9cf72	Fix deprecated pytorch_sphinx_theme editable installation (#145347 ) Fixes https://github.com/pytorch/pytorch/issues/145221 Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457. The fix here is just to remove it and install `pytorch_sphinx_theme` normally. ### Testing Doc build is working with the change: * PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347 * Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347 Approved by: https://github.com/ZainRizvi	2025-01-22 17:28:16 +00:00
Nikita Shulga	288f21cc11	[MPS][BE] Prepare Gamma funcs to be moved to headers (#145309 ) ---- - Use `float y = 1.0 + metal::frac(x)` instead of complex ```metal float y = x; int n = 0; bool less_than_one = (y < 1.0); // Add or subtract integers as necessary to bring y into (1,2) if (less_than_one) { y += 1.0; } else { n = static_cast<int>(floor(y)) - 1; y -= n; } ``` - Declare them all as templates, to avoid instantiation - Move global arrays to be local to the specific functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/145309 Approved by: https://github.com/dcci	2025-01-22 16:14:06 +00:00
IvanKobzarev	c2b401933f	[torchbench] Fix mobilenetv2 inductor freezing fail_accuracy (#145296 ) Issue: https://github.com/pytorch/pytorch/issues/144891 inductor freezing effectively enables inductor conv-batchnorm fusion. This fusion increases the accuracy error. More context about this: https://github.com/pytorch/pytorch/issues/120545 For Timm models that are run through benchmarks/dynamo/timm_models.py with TimsRunner the tolerance was increased here: https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L367 If to comment out conv-batchnorm fusion as Elias suggested in Context issue, the accuracy is back. => Increasing tolerace for mobilenetv2 to the same value via introducing the special configuration for tolerance for freezing only Pull Request resolved: https://github.com/pytorch/pytorch/pull/145296 Approved by: https://github.com/eellison, https://github.com/zou3519	2025-01-22 15:54:09 +00:00
CaoE	0dbff7e4be	Add MKLDNN support for Half GELU (#145339 ) Add MKLDNN support for Half GELU to align with BFloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145339 Approved by: https://github.com/yanbing-j, https://github.com/leslie-fang-intel, https://github.com/Skylion007	2025-01-22 15:14:51 +00:00
Isuru Fernando	0efa843392	Dynamic shape guards in C++ (#139899 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899 Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel ghstack dependencies: #143385, #143164	2025-01-22 14:58:35 +00:00
Isuru Fernando	fbaef0ac03	Add a language option for symbolic shape guards (#143164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143164 Approved by: https://github.com/ezyang ghstack dependencies: #143385	2025-01-22 14:58:35 +00:00
Isuru Fernando	4b77ff9784	Fix PythonMod printing for C++ (#143385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385 Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305	2025-01-22 14:58:35 +00:00
Boyuan Feng	079a3e0f75	[BE] Add type annotations to cudagraph_utils.py and test_cases.py (#145291 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145291 Approved by: https://github.com/Skylion007	2025-01-22 14:54:45 +00:00
Isuru Fernando	31c2f36989	Fix triton masked loading for non-block tl.loads (#144782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782 Approved by: https://github.com/eellison	2025-01-22 14:30:56 +00:00
Yiming Zhou	3cbc8c54fd	[BE][export] Remove disabled floordiv test in export (#145292 ) Summary: Removing `test_slice_with_floordiv` as it doesn't raise the Runtime Error as expected and it has been disabled since the time it was added https://github.com/pytorch/pytorch/issues/131101 For the case that we expect to fail, it actually returns an empty tensor. This is consistent with the following snippet which prints an empty tensor ``` a = torch.ones(4) print(a[5:]) ``` Test Plan: CI Differential Revision: D68450650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145292 Approved by: https://github.com/pianpwk	2025-01-22 05:17:56 +00:00
Aaron Orenstein	99dbc5b0e2	PEP585 update - test (#145176 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176 Approved by: https://github.com/bobrenjc93	2025-01-22 04:48:28 +00:00
Chris Sidebottom	40e27fbcf2	Refactor CPUReproTests to be more vector-length agnostic (#141245 ) This changes the hardcoded assumptions of a `256-bit` vector length to querying from `cpu_vec_isa` and changes relevant tests to share the logic. Also refactored the `config.cpp.simdlen != 1` into the assertion so we stop duplicating it throughout the test cases. Fixes issues on `128-bit` machines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141245 Approved by: https://github.com/desertfire, https://github.com/malfet	2025-01-22 04:24:45 +00:00
Ryan Guo	dcd9de79e7	[dynamo] Re-enable a AOT-Dispatch test with Dynamo (#145299 ) Fixes #124590. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145299 Approved by: https://github.com/zou3519	2025-01-22 03:47:05 +00:00
Shunting Zhang	3a58512613	[Inductor] inplace padding (#140249 ) https://github.com/pytorch/pytorch/issues/139865 This PR may change the semantic of constant_pad_nd from 'clone' to 'view'. I tried a few tests to do inplace update. Looks like thanks to functionalization, this works fine. Perf for `test_linear_and_cel`: ``` # TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=False ms=83.311 # TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=True ms=79.827 ``` The saving is about 4ms (slightly less since we need fill 0 for the padding area). Similar savings for llm.c. - Without the feature: 182.151ms per batch, 180.9K tokens/s - With the feature: 178.278ms per batch, 183.9K tokens/s. There are 3K tokens/s increase. Perf test shows compilation time regression. . I'm not sure if that's real. Will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7) . UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). 
Our dashboard is not that reliable recently due to AWS migration. Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249 Approved by: https://github.com/jansel, https://github.com/eellison	2025-01-22 03:37:06 +00:00
sanchitintel	46851022ff	[Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187 ) ## Summary Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner . Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token. Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case). Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of such UTs would have to be explicitly enabled with `max-autotune` Inductor config. ## Performance data The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used. For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x. Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5 Co-authored-by: Leslie Fang <leslie.fang@intel.com>	2025-01-22 02:27:53 +00:00
Simon Fan	27598cd154	[fx] move DCE rand check to import time (#145118 ) Mitigates the deterministic benchmark regression (https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844) and maybe the dashboard issue. fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, EXCEPT we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (>O(parameters)) for the subgraphs traced by the pattern matcher. I considered addressing this problem by reducing the number of times DCE is called, but I think we can only trim the ones from the pattern matcher, which will require a refactor or caching solution that I leave out of this PR. torch.Tag.nondeterministic_seeded is provided by native_functions.yml and is implemented as a list. Most of the time, it has <=2 elements, so it's not really worth it to turn it into a set for fast lookup. Using the deterministic instruction count benchmarks ```python # before aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946 aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058 # after aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314 aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-01-22 02:23:02 +00:00
Aaron Orenstein	f2cfe8b59f	PEP585 update - mostly toplevels (#145178 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178 Approved by: https://github.com/bobrenjc93	2025-01-22 02:21:14 +00:00
Aaron Orenstein	1ce533867f	Teach dynamo to handle GenericAlias without a graph break (#145240 ) Dynamo wasn't handling the new PEP585 type annotations: ``` x = list[Foo] ``` Although this worked in py3.9 this was causing an `unimplemented` (Unexpected type in sourceless builder) in py3.12. This fixes it to treat them as a BuiltinVariable. Fixes #145226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145240 Approved by: https://github.com/anijain2305	2025-01-22 01:55:51 +00:00
PyTorch MergeBot	2d1649bc2a	Revert "[triton] Update triton pin to include warp specialization support (#145120 )" This reverts commit e261629dc85c061ee35f539ee8bd35aec9971215. Reverted https://github.com/pytorch/pytorch/pull/145120 on behalf of https://github.com/ZainRizvi due to Reverting since the test failures are about not being able to find a version of triton to install, and this is breaking trunk as well ([comment](https://github.com/pytorch/pytorch/pull/145120#issuecomment-2606107792))	2025-01-22 01:52:36 +00:00
Nikita Shulga	f2d7fe12d8	[BE][MPS] Mark gamma inputs as const (#145295 ) Doubt it will change the perf, but it's good to correctly mark const inputs as const Pull Request resolved: https://github.com/pytorch/pytorch/pull/145295 Approved by: https://github.com/manuelcandales ghstack dependencies: #145289	2025-01-22 01:00:53 +00:00
Nikita Shulga	c106e9b4c6	[BE][MPS] Move Gamma kernels to its own file (#145289 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145289 Approved by: https://github.com/manuelcandales, https://github.com/dcci	2025-01-22 01:00:53 +00:00
Nikita Shulga	1908116ace	[MPS][BE] Move vectypes from Quantized to utils (#145312 ) That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312 Approved by: https://github.com/dcci	2025-01-22 00:37:28 +00:00
Huy Do	266fd35c58	Fix ExecuTorch, XLA, Triton hash updates (#145314 ) Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb * XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding the job names, which are not correct anymore and make the bot wait forever * The Triton commit hash hasn't been updated automatically since 2023, and people have been updating the pin manually with their own testing from time to time, so I doubt that it would be a useful thing to keep. The vision update failures look more complex though, and I would need to take a closer look. So, I will keep it in another PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314 Approved by: https://github.com/izaitsevfb	2025-01-21 23:24:21 +00:00
rzou	1e8d6d6f0e	[SkipFiles] New modules added to torch.* are inlined by default (#145279 ) This PR: - makes it so that new modules added to torch are inlined by default - adds a list of the previously "skipped by default" modules to avoid regressing anything. This is a new MOD_SKIPLIST list that is consulted in trace_rules.check_file. - Follow-up work will go through this list, one-by-one, and try to delete modules. I think we should be able to delete almost everything, except for torch._dynamo. Test Plan - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279 Approved by: https://github.com/yanboliang	2025-01-21 23:24:12 +00:00
Hongtao Yu	e261629dc8	[triton] Update triton pin to include warp specialization support (#145120 ) The warp specialization work has been landed to the triton rc/3.2.x branch as `b2684bf3b0` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120 Approved by: https://github.com/bertmaher	2025-01-21 22:14:13 +00:00
Jane Xu	19c3ba44a2	Use TORCH_CHECK instead of std::runtime_error in stack.h and ivalue.h (#145280 ) TORCH_CHECK will preserve the stacktrace for when TORCH_CPP_SHOW_STACKTRACES=1, whereas std::runtime_error will not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145280 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-21 21:58:59 +00:00
Catherine Lee	7dd9d1f243	Update clickhouse-connect to 0.8.14 (#144915 ) Corresponds to https://github.com/pytorch/test-infra/pull/6177 I only tested the slow test script but I also did testing on the new version with scripts in https://github.com/pytorch/test-infra/pull/6177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144915 Approved by: https://github.com/huydhn	2025-01-21 21:43:18 +00:00
johnnynunez	35f5668f7e	[NVIDIA] RTX50 Blackwell Support codegen (#145270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270 Approved by: https://github.com/ezyang	2025-01-21 21:10:05 +00:00
PyTorch MergeBot	895659cb41	Revert "Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 )" This reverts commit 07e23653cd9ef8cfda01773d94d9f76e5072528d. Reverted https://github.com/pytorch/pytorch/pull/142848 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68355212 ([comment](https://github.com/pytorch/pytorch/pull/142848#issuecomment-2605734067))	2025-01-21 21:04:45 +00:00
Aaron Orenstein	bac62341eb	PEP585 update - torch/_inductor (#145198 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198 Approved by: https://github.com/bobrenjc93	2025-01-21 21:04:33 +00:00
Aaron Orenstein	2f9d378f7b	PEP585 update - torch/utils (#145201 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201 Approved by: https://github.com/bobrenjc93	2025-01-21 21:04:10 +00:00
Edward Z. Yang	693d8c7e94	Output of nonzero is transposed, fix fake tensor (#144695 ) Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695 Approved by: https://github.com/bobrenjc93, https://github.com/albanD	2025-01-21 20:50:09 +00:00
Edward Z. Yang	323fb4dad0	Unconditionally exclude upper bound in all size oblivious tests (#144867 ) I was thinking about https://github.com/pytorch/pytorch/pull/144471 some more and I thought, "Hmm, why not just always exclude the constant upper bound." So here it is. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144867 Approved by: https://github.com/bobrenjc93	2025-01-21 20:44:09 +00:00
Wei Wang	df67ac4c86	[CI][CUDA][Distributed][FSDP] Remove hardcoded world size of 2 (#145195 ) These unit tests would fail if run on a single GPU (i.e. skip_if_lt_x_gpu(2) seems to view world size as 2 even on platforms with 1 GPU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145195 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-01-21 20:25:52 +00:00
Jason Ansel	505ade7471	[inductor] Simplify mode options, only apply CompilerBisector changes once (#145232 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145232 Approved by: https://github.com/yanboliang	2025-01-21 19:25:46 +00:00
RanTao123	85811631d7	[Intel CPU] Fix issue #143489 . (#145062 ) Fix issue in https://github.com/pytorch/pytorch/issues/143489. kernel_height * kernel_weight will cause Floating point exception, so we will divide by them one by one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145062 Approved by: https://github.com/soulitzer	2025-01-21 18:38:33 +00:00
Joel Schlosser	128f3627b1	Implement backward for NJT matmul (#144587 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587 Approved by: https://github.com/soulitzer ghstack dependencies: #144586	2025-01-21 18:27:50 +00:00
Joel Schlosser	af204135d8	Fix NJT fill.Scalar for contiguous inputs (#144586 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586 Approved by: https://github.com/soulitzer	2025-01-21 18:22:08 +00:00
Edward Z. Yang	efa88e04e1	Don't overspecialize float when propagating cache guards to ShapeEnv (#145078 ) Fixes https://github.com/pytorch/pytorch/issues/142507 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078 Approved by: https://github.com/Skylion007	2025-01-21 18:05:43 +00:00
Edward Z. Yang	b3e90c8c33	Add support for torch function on dtype arguments (#145085 ) Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085 Approved by: https://github.com/vmoens, https://github.com/Skylion007	2025-01-21 17:44:47 +00:00
Huy Do	eb553ae3cf	Fix broken gpt_fast micro benchmark after #144315 (#145235 ) The benchmark is failing with the following error ``` File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module> main(output_file=args.output, only_model=args.only) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main lst = func(device) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper return fn(self, *args, **kwargs) TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs' ``` An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555 I also assign `oncall: pt2` as the owner of this job going forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235 Approved by: https://github.com/nmacchioni	2025-01-21 17:42:24 +00:00
atalman	2cffbff7da	Add 3.13t Windows and MacOS binary builds (#141806 ) Related to: https://github.com/pytorch/pytorch/issues/130249 For conda, use the approach described here: https://conda-forge.org/blog/2024/09/26/python-313/ Create a Python 3.13t conda env like so: ``` conda create -n py313 python=3.13 python-freethreading -c conda-forge ``` For the Windows executable installation, we need to pass an additional parameter to enable 3.13t: ``` Include_freethreaded=1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141806 Approved by: https://github.com/albanD	2025-01-21 17:16:19 +00:00
Aaron Orenstein	0afd335174	PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175 Approved by: https://github.com/bobrenjc93	2025-01-21 16:57:27 +00:00
Shunting Zhang	803017f3cb	[inductor] fix MA on poor gpu (#145133 ) Found this bug when debugging a MA issue in CI that cannot be reproduced on devgpu. On GPUs with fewer than 68 SMs (like the NVIDIA L4 used in CI), running torch compile in max-autotune mode may result in the following confusing error https://gist.github.com/shunting314/370f42f547e3367a3773237942725a86 complaining about layout: ``` torch._inductor.exc.InductorError: LoweringException: AssertionError: convert FlexibleLayout to FixedLayout first ``` The reason is that, even if we don't pick a Triton template, Inductor still returns a MultiTemplateBuffer for tuned addmm. MultiTemplateBuffer.get_reads, called from Reduction.num_splits, may index a FlexibleLayout, which results in the aforementioned error. The issue does not appear on devgpu because we freeze the layout of addmm inputs when rendering Triton templates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145133 Approved by: https://github.com/jansel	2025-01-21 09:31:34 +00:00
Aaron Orenstein	b5655d9821	PEP585 update - .ci android aten (#145177 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145177 Approved by: https://github.com/Skylion007	2025-01-21 06:31:26 +00:00
Aaron Orenstein	00ffeca1b1	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-21 04:23:29 +00:00
PyTorch MergeBot	c6986ca2e1	Revert "[dcp] Add ZStandard transformer (#143360 )" This reverts commit 7b56b039afe2b4a4038c09d8b6cb1597823f3d5f. Reverted https://github.com/pytorch/pytorch/pull/143360 on behalf of https://github.com/atalman due to Broke 3.13t builds please test with ciflow/binaries label attached ([comment](https://github.com/pytorch/pytorch/pull/143360#issuecomment-2603433066))	2025-01-21 01:10:16 +00:00
PyTorch MergeBot	5fd881a5b6	Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 )" This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6. Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))	2025-01-21 00:49:55 +00:00
Aaron Orenstein	dea7ad3371	PEP585 update - torch/testing (#145200 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200 Approved by: https://github.com/bobrenjc93	2025-01-20 22:42:42 +00:00
Aaron Orenstein	805c4b597a	PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202 Approved by: https://github.com/bobrenjc93	2025-01-20 22:37:26 +00:00
Aaron Orenstein	54a00af2c6	PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175 Approved by: https://github.com/bobrenjc93	2025-01-20 22:32:59 +00:00
Aaron Orenstein	bd97ce0b45	PEP585 update - torch/ao (#145199 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145199 Approved by: https://github.com/bobrenjc93	2025-01-20 22:32:35 +00:00
Aaron Gokaslan	cf05f6a134	[BE]: Improve typing for torch/fx/_pytree.py and torch/utils/_pytree.py (#145173 ) Improve type inference in _pytree.py utility functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/145173 Approved by: https://github.com/bobrenjc93	2025-01-20 22:18:19 +00:00
Wang, Chuanqi	225a10febe	[CI] Add xpu linux build into pull workflow (#145084 ) To mitigate the XPU build failure risk introduced by non-XPU specific PRs. Refer to #144967 & #143803 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145084 Approved by: https://github.com/huydhn, https://github.com/atalman	2025-01-20 19:31:48 +00:00
Zhengxu Chen	d0100050dd	[aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [2/n] (#145091 ) Summary: Following up D68122536 to remove configurable aot_mode for inner_compile Test Plan: CI Reviewed By: desertfire Differential Revision: D68158512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145091 Approved by: https://github.com/ydwu4	2025-01-20 19:09:10 +00:00
Aaron Orenstein	0b2a3687b9	PEP585 update - torch/fx (#145166 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145166 Approved by: https://github.com/bobrenjc93	2025-01-20 18:11:54 +00:00
PyTorch MergeBot	6374332d33	Revert "PEP585 update - torch/distributed (#145164 )" This reverts commit 6cb186e279bc179a6bb63f0226e24ab42a07b394. Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))	2025-01-20 16:46:46 +00:00
Dmitry Nikolaev	57b2b64acf	Fix always true scaled_mm test (#143912 ) Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with scales. Scales were optional arguments before PR https://github.com/pytorch/pytorch/pull/128683 Then test_float8_scale started comparing two identical results and lost its meaning. Reason for making scales required: https://github.com/pytorch/pytorch/pull/128683#issuecomment-2169146402 This PR uses scale=1.0 to compare the result with scaled matmul Pull Request resolved: https://github.com/pytorch/pytorch/pull/143912 Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony	2025-01-20 16:17:46 +00:00
Aleksei Nikiforov	53e2408015	Improve cleanup of cancelled jobs on s390x for tests too (#144968 ) Follow up to https://github.com/pytorch/pytorch/pull/144149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968 Approved by: https://github.com/huydhn	2025-01-20 12:56:07 +00:00
Sun, Jiayi	92b9da1fc2	fix torch.atan for torch.complex datatypes on CPU (#144749 ) Fix https://github.com/pytorch/pytorch/issues/141487. This issue is caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `atan`. For correctness, I temporarily fall back from the vectorized implementation of `atan` to the scalar implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2025-01-20 08:45:03 +00:00
Sun, Jiayi	ed669a9db7	fix torch.div for torch.complex datatypes on CPU (#140375 ) Fix https://github.com/pytorch/pytorch/issues/135428. Fix https://github.com/pytorch/pytorch/issues/106845. These two issues are caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `div`. For correctness, I temporarily fall back from the vectorized implementation of `div` to the scalar implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140375 Approved by: https://github.com/mingfeima	2025-01-20 08:34:29 +00:00
Sun, Jiayi	c922ccb7c4	fix sigmoid for torch.complex datatypes on CPU (#140391 ) Fix https://github.com/pytorch/pytorch/issues/135777. This issue is caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `reciprocal`. For correctness, I temporarily fall back from the vectorized implementation of `reciprocal` to the scalar implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391 Approved by: https://github.com/mingfeima, https://github.com/Skylion007 ghstack dependencies: #140358	2025-01-20 08:23:58 +00:00
Sun, Jiayi	507bf65c6a	fix torch.exp for torch.complex datatypes on CPU (#140358 ) Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063. These two issues are caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `exp`. For correctness, I temporarily fall back from the vectorized implementation of `exp` to the scalar implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2025-01-20 08:03:17 +00:00
ankurneog	972d4a154d	Add facility to run dynamo UTs for non-cuda devices (#140929 ) This is in line with changes introduced with https://github.com/pytorch/pytorch/pull/130714, additional files are included to support non-cuda devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140929 Approved by: https://github.com/kwen2501, https://github.com/EikanWang, https://github.com/guangyey	2025-01-20 05:56:38 +00:00
Aaron Orenstein	2b809e58ad	PEP585 update - torch/onnx (#145174 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145174 Approved by: https://github.com/justinchuby	2025-01-20 05:48:52 +00:00
Animesh Jain	19584b28fd	[dynamo][dicts] Consolidate dict(..) construction (#144342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342 Approved by: https://github.com/StrongerXi	2025-01-20 04:42:06 +00:00
Nikita Shulga	980c75fe6e	[MPSInductor] Add `TrueDiv` and `Round[Int\|Decimal]` (#145160 ) That fixes `test_builtins_round_float_ndigits_neg` and `test_builtins_round` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145160 Approved by: https://github.com/jansel, https://github.com/dcci	2025-01-20 04:29:42 +00:00
Aaron Orenstein	6cb186e279	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-20 00:19:01 +00:00
Aaron Orenstein	b6c5562c1f	PEP585 update - torch/export (#145165 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145165 Approved by: https://github.com/bobrenjc93	2025-01-19 20:56:55 +00:00
Aaron Orenstein	316808e4e9	PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163 Approved by: https://github.com/Skylion007	2025-01-19 20:55:59 +00:00
Aaron Orenstein	c64e657632	PEP585 update - torch/distributed/fsdp (#145162 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162 Approved by: https://github.com/bobrenjc93	2025-01-19 20:04:05 +00:00
Nikita Shulga	371a361db9	Enable bfloat16 testing on MacOS14+ (#145159 ) As Metal-3.1 supports this dtype Pull Request resolved: https://github.com/pytorch/pytorch/pull/145159 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #145157	2025-01-19 19:35:31 +00:00
Aaron Orenstein	97d4d3c40a	PEP585 update - torch/_export (#145138 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145138 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145154	2025-01-19 18:48:35 +00:00
Aaron Orenstein	cd8d0fa20c	Tweak schema_check to handle annotated builtin types (#145154 ) As of python 3.9 annotated lists can be written as `list[T]` and `List[T]` has been deprecated. However schema_check was converting `list[T]` to simply be `list`. This change teaches it to handle `list[T]` the same as `List[T]`. A couple small drive-by changes I noticed as well: - Path concatenation should use `os.path.join`, not `+` - Spelling in error message Pull Request resolved: https://github.com/pytorch/pytorch/pull/145154 Approved by: https://github.com/bobrenjc93	2025-01-19 18:48:35 +00:00
Aaron Orenstein	9e0437a04a	PEP585 update - torch/ao/quantization (#145140 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145140 Approved by: https://github.com/bobrenjc93	2025-01-19 10:20:00 +00:00
Aaron Orenstein	78bff1e8c1	PEP585 update - torch/_functorch (#145139 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145139 Approved by: https://github.com/bobrenjc93	2025-01-19 07:06:10 +00:00
cassanof	10e4d3aebb	[DCP] Fix fsspec fsync bug on .finish() (#144753 ) Fixes #144752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144753 Approved by: https://github.com/Skylion007, https://github.com/saumishr	2025-01-19 03:21:00 +00:00
Davide Italiano	8cc415774f	[mps/inductor] Introduce a metal approx for erf() and use it. (#145161 ) Probably we can do better, but this is a start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161 Approved by: https://github.com/malfet	2025-01-19 02:29:05 +00:00
Aaron Orenstein	893ca1dfe1	PEP585 update - torch/_inductor/[_-i]* (#145137 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137 Approved by: https://github.com/bobrenjc93	2025-01-19 01:22:47 +00:00
Nikita Shulga	cede43e06b	[MPSInductor][BE] NaN-propagating min/max to header (#145157 ) May later be reused from the eager op as well. Also, didn't know that Metal already has type_traits. And use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a \|\| b != b`, but I suspect it might have a better native implementation for the specific architecture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157 Approved by: https://github.com/dcci	2025-01-18 22:52:44 +00:00
Aaron Orenstein	5b5766665d	PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145105	2025-01-18 20:47:12 +00:00
Aaron Orenstein	a79100ab11	PEP585 update - torch/_dynamo (#145105 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105 Approved by: https://github.com/bobrenjc93	2025-01-18 20:47:11 +00:00
Aaron Orenstein	c95efc37ba	PEP585 update - torch/distributed/tensor (#145141 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141 Approved by: https://github.com/bobrenjc93	2025-01-18 20:01:59 +00:00
Davide Italiano	4f8237dbad	[mps/inductor] Skip "double" tests as 64-bits FP is not supported. (#145123 ) 257 tests failed (before) -> 242 tests failed (after) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145123 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-18 19:13:34 +00:00
PyTorch MergeBot	5802be698e	Revert "parametrized test name handles class arguments (#133546 )" This reverts commit 4e4b8592a32f701b4974679ab1381ba7cccd4844. Reverted https://github.com/pytorch/pytorch/pull/133546 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but trying to disable the new tests does seem to fully cover all the cases and some are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/133546#issuecomment-2599814339))	2025-01-18 18:12:18 +00:00
Joel Schlosser	b63b81410c	Fix NJT frexp() to handle both outputs (#144585 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585 Approved by: https://github.com/soulitzer ghstack dependencies: #144582, #144583, #144584	2025-01-18 15:59:56 +00:00
Joel Schlosser	3ee531f8b9	Support NJT chunk() backward on batch dim (#144584 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584 Approved by: https://github.com/soulitzer ghstack dependencies: #144582, #144583	2025-01-18 15:58:24 +00:00
Nikita Shulga	8a57234033	[MPSInductor] Implement `i0` and `i1` ops (#145092 ) Using shared definitions with eager op Pull Request resolved: https://github.com/pytorch/pytorch/pull/145092 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #145023, #145087	2025-01-18 15:41:02 +00:00
Edward Z. Yang	1d9fc9df38	Downgrade ignored guard to info level (#145075 ) Fixes https://github.com/pytorch/pytorch/issues/101265 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145075 Approved by: https://github.com/Skylion007	2025-01-18 15:30:01 +00:00
chilli	5e4cf3e6ad	Moved .all() checks for distributions to _is_all_true (#145029 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145029 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-18 07:55:48 +00:00
Aaron Orenstein	2bf772d1ba	PEP585 update - torch/_inductor/codegen (#145106 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106 Approved by: https://github.com/bobrenjc93	2025-01-18 06:56:03 +00:00
Shangdi Yu	4bf29f44b7	[aoti] Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass (#145028 ) Summary: Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass because this op is blocking fusion. This should not have any effect on the result, because the op would not show up in the final aoti compiled model anyway (the assertion has no effect). A real example where this improves performance: In the example below, the post grad graph would contain `torch.ops.aten._assert_tensor_metadata.default`, because of PR https://github.com/pytorch/pytorch/pull/142420. This op is added when functionalizing aten.to. We want the `add` node from `linear` to be fused with the rest of the pointwise ops, instead of fused with the `mm` from `linear`. ``` class Model(torch.nn.Module): def __init__(self, input_dim, hidden_dim): super(Model, self).__init__() self.linear = nn.Linear(input_dim, hidden_dim).half() self.rms_norm = nn.RMSNorm(hidden_dim) def forward(self, x): linear_458 = self.linear(x) # Linear layer with weights' # mimic the torchtune rms norm: /torchtune/torchtune/modules/rms_norm.py linear_458 = linear_458.to(torch.float32) rms_norm_34 = self.rms_norm(linear_458) # RMS Normalization sigmoid_168 = torch.sigmoid(rms_norm_34) # Sigmoid activation function mul_168 = sigmoid_168 * rms_norm_34 # Element-wise multiplication return mul_168 def main(): with torch.no_grad(): input_dim = 512 hidden_dim = 256 batch_size = 32 model = Model(input_dim, hidden_dim).to("cuda") example_inputs = ( torch.randn(batch_size, input_dim).to("cuda").to(torch.float16), ) ep = torch.export.export(model, example_inputs) package_path = torch._inductor.aoti_compile_and_package(ep) ``` Test Plan: CI Differential Revision: D68303114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145028 Approved by: https://github.com/angelayi	2025-01-18 06:06:25 +00:00
Nikita Shulga	dc9b77cc55	[MPS] Support includes in metal objects (#145087 ) Useful for code reuse when building Metal shaders for both eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers for the shader to be used in dynamic compilation. Test using: - `TestMetalLibrary.test_metal_include` - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from `SpecialOps.metal` shader, which now looks much more compact: ```metal template <typename T, typename Tout = T> void kernel i0(constant T* input, device Tout* output, uint index [[thread_position_in_grid]]) { output[index] = c10::i0(static_cast<Tout>(input[index])); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087 Approved by: https://github.com/dcci ghstack dependencies: #145023	2025-01-18 05:35:22 +00:00
Will Constable	2859b11bdb	[pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045 ) Summary: PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided by NCCLX and RCCL) assumes that a single contiguous buffer is used. These are fundamentally mismatched. The list of tensors might not be contiguous or even ordered (buffer addresses might not be in increasing order). This patch removes the ncclAllToAllv specialization for PTD all_to_all, and instead lets it directly call ncclSend/ncclRecv. Co-authored by @pavanbalaji Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045 Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang	2025-01-18 05:26:55 +00:00
Aaron Orenstein	07669ed960	PEP585 update - benchmarks tools torchgen (#145101 ) This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows: Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes: ``` --- a/tools/linter/adapters/ruff_linter.py +++ b/tools/linter/adapters/ruff_linter.py @@ -313,6 +313,7 @@ "ruff", "check", "--fix-only", + "--unsafe-fixes", "--exit-zero", *([f"--config={config}"] if config else []), "--stdin-filename", ``` Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent): ``` --- a/pyproject.toml +++ b/pyproject.toml @@ -40,7 +40,7 @@ [tool.ruff] -target-version = "py38" +target-version = "py39" line-length = 88 src = ["caffe2", "torch", "torchgen", "functorch", "test"] @@ -87,7 +87,6 @@ "SIM116", # Disable Use a dictionary instead of consecutive `if` statements "SIM117", "SIM118", - "UP006", # keep-runtime-typing "UP007", # keep-runtime-typing ] select = [ ``` Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101 Approved by: https://github.com/bobrenjc93	2025-01-18 05:05:07 +00:00
Will Constable	2c4281d7da	Make MultiProcContinuousTest timeout configurable (#145099 ) Allows test classes using MPCT to set their own timeout as a class property, which is good enough since the processgroup is shared across test instances and the timeout is set at processgroup init. Also sets a default timeout of 2 minutes, which is probably (?) long enough for reasonable tests, but can be changed if it causes flakiness. It's preferable to have as short a default timeout as possible, since getting a timeout quickly helps when debugging tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099 Approved by: https://github.com/d4l3k, https://github.com/fduwjj ghstack dependencies: #145010, #145011	2025-01-18 04:37:12 +00:00
Will Constable	bdfeda5c9a	composability test cleanup (#145011 ) minor changes to test public PP api instead of internal/private one and also save a few lines of code for microbatch splitting in the process Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011 Approved by: https://github.com/H-Huang, https://github.com/fduwjj ghstack dependencies: #145010	2025-01-18 04:37:12 +00:00
Jason Ansel	4eea2f7496	[inductor] Fix ignored options for torch.compile (#145131 ) #139833 broke `torch.compile(options=...)` so that many (all?) options passed in get completely ignored. @alexreinking pointed this out when `options={"cpu_backend":"halide"}` did nothing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145131 Approved by: https://github.com/exclamaforte	2025-01-18 03:39:49 +00:00
Simon Fan	668fb7dfba	[ca] Use aot_eager on flex attention test (#145097 ) FIXES https://github.com/pytorch/pytorch/issues/144912 The flex attention lowering incompatibilities are covered by https://github.com/pytorch/pytorch/blob/main/test/inductor/test_flex_attention.py. For the CA + flex integration, we don't actually need to test the lowering, only the frontend graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145097 Approved by: https://github.com/drisspg	2025-01-18 02:47:13 +00:00
Sam Larsen	55084443ca	Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 ) Summary: Test Plan: Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829 Approved by: https://github.com/Chillee	2025-01-18 02:39:22 +00:00
Nicolas Macchioni	2f51d06210	basic InductorBenchmarker (#133058 ) This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle, and very closely follows Triton's `do_bench` implementation with slight changes such as flushing the exact L2 cache size (Triton defaults to 256mb), using a buffer zero for warmup (Triton uses the benchmarked kernel itself; I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things; currently Inductor picks median). Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058 Approved by: https://github.com/eellison ghstack dependencies: #144315	2025-01-18 02:35:00 +00:00
Nicolas Macchioni	ee3e89190a	refactor benchmarking to use dynamo_timed (#144315 ) use dynamo_timed for all our wrapped calls, instead of our custom timer Pull Request resolved: https://github.com/pytorch/pytorch/pull/144315 Approved by: https://github.com/eellison	2025-01-18 02:35:00 +00:00
Aaron Orenstein	17c3a10cbd	PEP585 update - torch/_inductor/fx_passes (#145107 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145107 Approved by: https://github.com/oulgen, https://github.com/bobrenjc93	2025-01-18 02:04:29 +00:00
Huy Do	8e4539245e	Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112 ) TSIA, it looks like an upstream change, but I'm not sure from where yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112 Approved by: https://github.com/izaitsevfb, https://github.com/malfet	2025-01-18 01:55:34 +00:00
Bin Bao	0b151f260f	[AOTI] Add an option to skip optimizing generated wrapper code (#144866 ) Summary: In some cases, generated wrapper code faces a long cpp compilation time. As an alleviation, this PR adds an option to skip cpp compiler optimizers for the generated main wrapper function body. D68174038 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144866 Approved by: https://github.com/chenyang78, https://github.com/hl475	2025-01-18 01:44:21 +00:00
Jason Ansel	7c1fb9b1ae	[inductor] Refactor CachingAutotuner so that it can pickle (#144044 ) These are refactors needed for #144288 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044 Approved by: https://github.com/eellison	2025-01-18 01:44:16 +00:00
xinan.lin	02385ed625	[Break XPU][Inductor UT] Fix broken XPU CI introduced by community changes (#145058 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145058 Approved by: https://github.com/jansel	2025-01-18 01:30:24 +00:00
rzou	c434a64f31	Delete torch._library.register_functional_op (#145110 ) Fixes #117816, #117834, #117871 This has been superseded by auto_functionalized_v2. There are no internal usages and this is private API so it is safe to delete. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145110 Approved by: https://github.com/williamwen42 ghstack dependencies: #145109	2025-01-18 00:58:25 +00:00
rzou	712d9882d2	Skip test responsible for causing flakiness (#145109 ) Investigation is a separate issue. For now I want to get the CI back up and running on the other tests. The problem seems to be that IncludeDispatchKeyGuard doesn't actually reset the state, which seems very, very wrong. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145109 Approved by: https://github.com/williamwen42	2025-01-18 00:58:25 +00:00
eellison	c338dda6be	fix test_rng bisector test (#143662 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143662 Approved by: https://github.com/zou3519	2025-01-18 00:15:38 +00:00
Daniel Vega-Myhre	d02c396fbb	add fp8 support to index_cuda (#144747 ) Fixes #133605 Summary This PR adds support for FP8 data types to the `index_cuda` op. It uses `AT_DISPATCH_V2` which is a new macro that can handle arbitrary number of dtypes, as opposed to the old implementations which had a separate macro for each possible number of dtype arguments (e.g. `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3,4,5...}`). Test plan Updated test `index_cuda_with_cpu` in `test/test_fake_tensor.py` to have cases for all dtypes handled by `index_cuda`, including fp8 dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144747 Approved by: https://github.com/vkuzo	2025-01-17 22:53:23 +00:00
Nicolas Macchioni	4e4b8592a3	parametrized test name handles class arguments (#133546 ) Previously, parametrized tests with class arguments, for example ``` @parametrize("this_cls", (Foo, Bar)) ``` would create parametrized tests with names `test_foo_this_cls0` and `test_foo_this_cls1`. With this change, we instead should get `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133546 Approved by: https://github.com/eellison	2025-01-17 22:48:38 +00:00
Will Constable	64e54d5af6	[Pipelining] Relax scale_grads assert (#145010 ) The assert felt morally valid: if no gradients are scaled, then something is definitely wrong with the setup. In one instance, PP + optimizer-in-backward (in torchtitan) resulted in grad=None after running .backward() and before scaling grads. On the other hand, the existing assert is too restrictive. It's possible that a model used with pipelining would have some parameters that do not receive gradients, and we shouldn't hard-error in these cases. (E.g. if the parameter is literally not used, or is frozen). In the extreme case, the whole stage could be frozen. So we do not complain if no grads are scaled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010 Approved by: https://github.com/mori360, https://github.com/tianyu-l	2025-01-17 21:33:28 +00:00
fan.mo	07e23653cd	Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 ) Fixes #140092 Here's what this PR does: Previously, we created a `scalar_t eps_val;` variable, while `eps` is usually a double scalar passed from the Python frontend, like 1e-6. When we do `eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();` or `eps_val = eps.value();`, we downcast this epsilon to match the input tensor dtype (`scalar_t`); in the case of BFloat16, the 1e-6 double would be cast to `1.00136e-05`. However, when we compute `auto rqrst_input = rsqrt(at::pow(upcasted_input, 2).mean(dims_to_reduce_ref, /*keepdim=*/true).add_(eps_val));`, we upcast `eps_val` to match `opmath_t`. The conversion between these two dtypes is unnecessary, so we can just declare `opmath_t eps_val` instead of `scalar_t`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848 Approved by: https://github.com/mikaylagawarecki	2025-01-17 21:30:54 +00:00
Joel Schlosser	a8ef423fed	Fix NJT min / max backward() for non-ragged reductions (#144583 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. `value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583 Approved by: https://github.com/soulitzer ghstack dependencies: #144582	2025-01-17 20:57:11 +00:00
Joel Schlosser	cac10b8190	Fix NJT OpInfo entry for nn.functional.prelu (#144582 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582 Approved by: https://github.com/soulitzer	2025-01-17 20:36:15 +00:00
Tom Ritchford	eaef613688	Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda (#145067 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>	2025-01-17 20:31:25 +00:00
Mikayla Gawarecki	0eda02a94c	Prevent legacy_load when weights_only=True (correctly) (#145020 ) Only prevent `legacy_load` (.tar format removed in https://github.com/pytorch/pytorch/pull/713), not the whole of `_legacy_load` (.tar format + _use_new_zipfile_serialization=False) Differential Revision: [D68301405](https://our.internmc.facebook.com/intern/diff/D68301405) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145020 Approved by: https://github.com/kit1980, https://github.com/albanD	2025-01-17 20:10:22 +00:00
Colin Peppler	2ef7b68666	[inductor] fix TORCH_LOGS="benchmarking" (#144997 ) Saw this error with TORCH_LOGS="benchmarking" ``` File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 37, in wrapper result = fn(args, kwargs) File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper return fn(self, args, **kwargs) torch._inductor.exc.InductorError: TypeError: Benchmarker.benchmark() missing 1 required positional argument: 'fn_kwargs' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144997 Approved by: https://github.com/eellison, https://github.com/nmacchioni	2025-01-17 19:41:18 +00:00
Wouter Devriendt	d996d7ec13	upgrade to sccache 0.9.1 - dealing with nvcc -E correctly (#145012 ) sccache 0.9.1 should be dealing with `nvcc -E` correctly see https://github.com/mozilla/sccache/pull/2300 If this works as expected, we can get rid of this code: https://github.com/pytorch/pytorch/pull/142813/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/145012 Approved by: https://github.com/malfet	2025-01-17 19:26:01 +00:00
Tom Ritchford	46fbd63405	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-17 18:21:22 +00:00
Mingming Ding	18638b91fe	Adding more compile time logging in pad_mm (#144884 ) Summary: As title Test Plan: [midin@6262.od /data/sandcastle/boxes/fbsource/fbcode (99e64d2e4)]$ tlp buck run mode/opt caffe2/test/inductor:pad_mm -- -r test_exclude_padding https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json&local_cache_key {F1974355662} Pull Request resolved: https://github.com/pytorch/pytorch/pull/144884 Approved by: https://github.com/oulgen	2025-01-17 17:35:55 +00:00
Yidi Wu	567552b98b	fix typo in doc and import for torch._library.triton (#144882 ) Previously, the doc's suggested `from torch._library.triton import wrap_triton, triton_op` didn't work because wrap_triton is not imported in torch/_library/__init__.py, but `from torch.library import wrap_triton` works. This PR imports wrap_triton and fixes the doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144882 Approved by: https://github.com/zou3519	2025-01-17 17:32:12 +00:00
Stonepia	18eba9575f	[Accelerator] Use uniform `GetAllocator` for devices in `new_qtensor` function (#144849 ) Fixes #144848 This PR is intended to use a uniform `GetAllocator()` call for all the accelerators for `new_qtensor` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144849 Approved by: https://github.com/guangyey, https://github.com/albanD	2025-01-17 16:37:37 +00:00
atalman	a215e174a1	[BE] Remove conda from scripts and build files Part 2 (#145015 ) Continuation of https://github.com/pytorch/pytorch/pull/144870 Remove conda logic from scripts: 1. Remove conda build from triton build script 2. Remove conda checks from setup.py 3. Remove conda from release scripts 4. Script read_conda_versions.sh is not used (checked via git grep) Related to: https://github.com/pytorch/pytorch/issues/138506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145015 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-01-17 16:26:24 +00:00
Aleksandar Samardžić	b7af210d8d	Add SM89 support for f8f8bf16_rowwise() (#144348 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144348 Approved by: https://github.com/drisspg	2025-01-17 15:12:35 +00:00
PyTorch MergeBot	f522502b97	Revert "patch for block-wise quantization + pt2e (#144492 )" This reverts commit 1d43b8150852cdfcbe754edcf027d6e25f33ac63. Reverted https://github.com/pytorch/pytorch/pull/144492 on behalf of https://github.com/albanD due to Broke a few things in trunk ([comment](https://github.com/pytorch/pytorch/pull/144492#issuecomment-2598485291))	2025-01-17 14:27:53 +00:00
Wang, Eikan	dbed747aae	Add Intel GPU specific CMake files to merge rules (#135110 ) As the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135110 Approved by: https://github.com/atalman	2025-01-17 09:44:13 +00:00
Luca Wehrstedt	a0d2c09115	Add flop formula for _scaled_mm (#144973 ) This will make it work correctly with the partitioner's AutoAC Pull Request resolved: https://github.com/pytorch/pytorch/pull/144973 Approved by: https://github.com/jeffdaily	2025-01-17 09:38:30 +00:00
Laith Sakka	96c0dbbe97	Enhance running pr time benchmarks locally experience. (#144838 ) Summary: title Test Plan: NA Differential Revision: D68195894 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838 Approved by: https://github.com/huydhn	2025-01-17 07:57:40 +00:00
ZhaoqiongZ	465a1cfe2e	update get start xpu (#143183 ) - Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) - Support vision/audio prebuilt wheels on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183 Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-17 06:31:40 +00:00
Davide Italiano	fd8e0e3e10	[mps/inductor] Introduce is_mps_backend/skip_if_mps decorators. (#145035 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145035 Approved by: https://github.com/jansel	2025-01-17 05:36:38 +00:00
PyTorch UpdateBot	cfd9cc19a3	[executorch hash update] update the pinned executorch hash (#145022 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145022 Approved by: https://github.com/pytorchbot	2025-01-17 04:51:56 +00:00
Gabriel Ferns	f13c864eda	Fuzzer Improvements (#144952 ) Added more tests and cleaned up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144952 Approved by: https://github.com/masnesral	2025-01-17 04:46:58 +00:00
Chen Lai	1d43b81508	patch for block-wise quantization + pt2e (#144492 ) Summary: As title, needed for enable qcom block-wise quantization kernel Test Plan: local test Differential Revision: D67985303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144492 Approved by: https://github.com/angelayi, https://github.com/billmguo	2025-01-17 04:10:49 +00:00
Zhenbin Lin	adbbcd87d9	OpenReg: Split Allocator (#144843 ) Split the Allocator into HostAllocator and DeviceAllocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144843 Approved by: https://github.com/albanD	2025-01-17 03:38:15 +00:00
Yanbo Liang	43a00d73b3	[Trace Python Dispatcher] Support FuncTorchInterpreter (#144444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #144439	2025-01-17 02:26:37 +00:00
Yanbo Liang	5d02575aa1	[Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439 Approved by: https://github.com/zou3519	2025-01-17 02:26:36 +00:00
William Wen	3a50aba7d3	[dynamo] add option to not skip on empty graph (#144885 ) Temporary fix to https://github.com/pytorch/pytorch/issues/144360. Turning the config on globally will cause a bunch of tests to fail, which needs to be addressed in followups. I had a previous attempt at https://github.com/pytorch/pytorch/pull/144712, but this is a more complicated change and will likely be absorbed into work to refactor Dynamo's exception handling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144885 Approved by: https://github.com/jansel	2025-01-17 02:12:20 +00:00
Marc Horowitz	7b56b039af	[dcp] Add ZStandard transformer (#143360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360 Approved by: https://github.com/saumishr ghstack dependencies: #143358, #143359	2025-01-17 01:51:37 +00:00
Marc Horowitz	9c909bf3bb	[dcp] Integrate stream extensions into DCP impl (#143359 ) Summary: Updates FileSystemReader/Writer, Planner, DefaultLoad/SavePlanner Pull Request resolved: https://github.com/pytorch/pytorch/pull/143359 Approved by: https://github.com/saumishr ghstack dependencies: #143358	2025-01-17 01:51:37 +00:00
Marc Horowitz	ba3f1c29ee	[dcp] Add extension mechanism (#143358 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358 Approved by: https://github.com/saumishr	2025-01-17 01:51:37 +00:00
Yu, Guangye	176cde6240	Use torch with statement in torch distributed module (#144951 ) # Motivation In https://github.com/pytorch/pytorch/pull/137678, we help use the device-agnostic APIs to generalize distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the with statement of `torch.Stream` once https://github.com/pytorch/pytorch/pull/140138 is landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-01-17 01:49:28 +00:00
Nikita Shulga	a61a65ff82	[MPSInductor] Add `Worker.current_device` method (#145023 ) That just returns 0, as multi-gpu is not currently supported by MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/145023 Approved by: https://github.com/dcci	2025-01-17 01:41:01 +00:00
PyTorch MergeBot	55b0819bee	Revert "Add tests for different dtypes with max autotune (#144721 )" This reverts commit d2a77f48c9dc6df056051de270ce5875d8d2edd0. Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/kit1980 due to breaking internal builds, max autotune tests are failing, see D68297606 ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2597250605))	2025-01-17 01:36:14 +00:00
Andrew Gu	45e6647268	[FSDP2] Make post-backward condition more robust (#144781 ) Fixes https://github.com/pytorch/pytorch/issues/144755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144781 Approved by: https://github.com/fegin	2025-01-17 01:28:56 +00:00
Chien-Chin Huang	6077102415	[DSD][BE] Rewrite some tests to remove `with_comms` (#143241 ) Summary: This saves ~ 1 minute test time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143241 Approved by: https://github.com/mori360, https://github.com/XilunWu ghstack dependencies: #143240	2025-01-17 01:15:55 +00:00
Will Constable	5d54e7b812	[Pipelining] move scale_grads to base class, add docs (#144833 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833 Approved by: https://github.com/H-Huang	2025-01-17 01:07:12 +00:00
Driss Guessous	3afc5170d4	[Submodule] Upgrade to Cutlass 3.6 part deux (#144911 ) # Summary Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269) Main change is that we identified and fixed the FA2 regression. More details can be found here https://github.com/pytorch/pytorch/issues/144729 and have landed that before this here: [D68194635](https://www.internalfb.com/diff/D68194635) Differential Revision: D68194470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-01-17 00:53:42 +00:00
PyTorch MergeBot	6c713ccb5e	Revert "Make functionalization `ViewMeta` serializable with pickle. (#143712 )" This reverts commit b8abdaa286fd161af48af57a675827f4f849914d. Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))	2025-01-17 00:52:50 +00:00
Nikita Shulga	42c64bd35c	[MPSInductor] More is_dtype_supported gating (#144981 ) This makes `GPUTest.test_scalar_cpu_tensor_arg_mps` pass Pull Request resolved: https://github.com/pytorch/pytorch/pull/144981 Approved by: https://github.com/dcci ghstack dependencies: #144971	2025-01-17 00:48:02 +00:00
PyTorch MergeBot	94c0f15302	Revert "cpp_wrapper: Move #includes to per-device header files (#143909 )" This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125. Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch/_inductor/codegen/aoti_runtime/implementation.cpp ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))	2025-01-17 00:36:38 +00:00
PyTorch MergeBot	5e6e6200bf	Revert "[dynamo][dicts] Consolidate dict(..) construction (#144342 )" This reverts commit a54a784b8207617d2b99fbded9bb34c94fb6dd23. Reverted https://github.com/pytorch/pytorch/pull/144342 on behalf of https://github.com/kit1980 due to breaking internal builds, see D68125388 ([comment](https://github.com/pytorch/pytorch/pull/144342#issuecomment-2597184167))	2025-01-17 00:32:09 +00:00
cyy	2ea394ba29	Modernize C++ code (#144603 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603 Approved by: https://github.com/malfet	2025-01-17 00:25:18 +00:00
Laith Sakka	c3fcb3606d	Profile compile_inner instead of _compile_inner (#144930 ) Summary: title Test Plan: NA Reviewed By: jamesjwu Differential Revision: D67990492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144930 Approved by: https://github.com/jamesjwu	2025-01-16 23:59:27 +00:00
Chien-Chin Huang	573fc42f25	[BE][CP] Use run_subtests instead of parametrize (#143240 ) Summary: This provides a 15X increase in test performance speed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143240 Approved by: https://github.com/XilunWu	2025-01-16 23:55:05 +00:00
Yang Wang	fea9d18d5a	[Utilization Log] Concurrently collect aggregate data during the output interval (#143235 ) # overview Add worker to collect metrics in short intervals 1. Worker: Add a worker to collect usage metrics, by default every 500ms; this is configurable 2. Calculate avg and max as data point, by default every 5 seconds. # Other clean up the log format for necessary needs, currently we do not need to track gpu processors etc, or all pids from psutil Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235 Approved by: https://github.com/huydhn	2025-01-16 23:52:43 +00:00
shaoyuyoung	288d67d6c2	[inductor] [bug fix] align `avg_pool` with eager when handling `uint` (#144313 ) Fixes #144310 ~~We just need to add a check in lowering~~ updated: we add the error checking in `meta registration` ### UT ``` pytest -s -v test/inductor/test_torchinductor.py -k test_avg_pool_errors_with_uint ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144313 Approved by: https://github.com/jansel, https://github.com/jgong5	2025-01-16 23:37:51 +00:00
Gabriel Ferns	d2a77f48c9	Add tests for different dtypes with max autotune (#144721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721 Approved by: https://github.com/cpuhrsch, https://github.com/etaf	2025-01-16 23:04:56 +00:00
clr	171fb7f358	easy: Fix missing tab in test/dynamo/test_compile.py (#145013 ) It turns out that if you request a merge on a pytorch PR, and then push a fix for a bad rebase, and the test is relatively new, the merge will go through with the previous commit and not notice the test break. Explicitly running the test now passes vs failing, and this is just the last missing commit from https://github.com/pytorch/pytorch/pull/144817 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145013 Approved by: https://github.com/masnesral, https://github.com/jansel	2025-01-16 22:51:51 +00:00
Nikita Shulga	181d93b4f2	[BE] Move `is_device_supported` to helper function (#144971 ) And extend `test_inf` to check half (explicitly instead of check_lowp) and bfloat16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144971 Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel	2025-01-16 22:44:03 +00:00
PyTorch UpdateBot	a33e02cb26	[executorch hash update] update the pinned executorch hash (#144813 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144813 Approved by: https://github.com/pytorchbot, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-01-16 22:39:00 +00:00
Fuzzkatt	7c7bcb1e33	update IS_JETSON check (#144725 ) update IS_JETSON check to include the latest SM Pull Request resolved: https://github.com/pytorch/pytorch/pull/144725 Approved by: https://github.com/eqy	2025-01-16 22:34:48 +00:00
Colin L. Rice	95c363cc9b	dynamo: Don't crash with internal error if getattr on a tensor fails (#144817 ) This prevents crashes when getattr is called on a tensor for something which doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144817 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-16 22:04:06 +00:00
Mwiza Kunda	0e6d44df3f	Add heuristic to fail block pointer match early (#144681 ) This PR adds a heuristic to potentially fail the block pointer match early. Expressions like below take a long time to match using sympy (e.g. > 100 seconds) ```python # torch._inductor.config.triton.use_block_ptr = True # torch._inductor.config.triton.prefer_nd_tiling = True # Expression from pytest -k test_max_pool2d1_dynamic_shapes_cuda: ((xindex//ps1))((s2 - 3//2))2 + 2((xindex//ps1))((s2 - 3//2)) + ((xindex//ps1)) + ((s2 - 3//2))(ModularIndexing(xindex, ps0, ps0)) + (ModularIndexing(xindex, 1, ps0)) + (ModularIndexing(xindex, ps0, ps0)) ``` Additionally, the heuristic for the number of dimensions based on the indexing expression is refined to only add dimensions for FloorDiv(index, denom) and ModularIndexing(index, denom, modulo) instead of including FloorDiv/ModularIndexing expressions that don't involve the index. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144681 Approved by: https://github.com/jansel	2025-01-16 21:57:30 +00:00
PyTorch MergeBot	46b92c025d	Revert "Cholesky mps implementation (#144193 )" This reverts commit 727ae1331820bb3d83d70e9cd3c9d3cd4c79ff56. Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see `aa4a1ff027/1` ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))	2025-01-16 21:37:32 +00:00
PyTorch MergeBot	aa4a1ff027	Revert "Prevent _legacy_load with weights_only=True (#144914 )" This reverts commit 7c3aa1da1c97812af54d41f3f0eff2ef922c0f32. Reverted https://github.com/pytorch/pytorch/pull/144914 on behalf of https://github.com/izaitsevfb due to breaking inductor on trunk ([comment](https://github.com/pytorch/pytorch/pull/144914#issuecomment-2596922781))	2025-01-16 21:29:50 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
garfield1997	3a5bf0bc36	expose extra torch_python apis (#144746 ) Fixes #144302 After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746 Approved by: https://github.com/albanD	2025-01-16 20:50:31 +00:00
iupaikov-amd	577708e6de	Unskipped multiple inductor tests for ROCm (#143581 ) All of them should be fine to run now after the triton fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-16 20:46:06 +00:00
CaoE	a9bfc5f70c	Fix boundary conditions for hardswish backward (#143899 ) Fixes #136345. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899 Approved by: https://github.com/jgong5, https://github.com/ezyang	2025-01-16 20:26:27 +00:00
Davide Italiano	aad5f600ff	[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 ) Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877 Approved by: https://github.com/jansel, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-16 19:51:45 +00:00
Jane Xu	3908be676c	Fix loading older state_dict into AdamW after refactor (#144972 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144972 Approved by: https://github.com/albanD	2025-01-16 19:50:31 +00:00
Yukio Siraichi	b8abdaa286	Make functionalization `ViewMeta` serializable with pickle. (#143712 ) Fix: #141974 This PR makes the `ViewMeta` sequence, present in functional tensors, serializable with pickle. In order to accomplish that, it makes `ViewMeta` an abstract class with overridable `forward` and `reverse` functions. In this context, each operation that once instantiated `ViewMeta` should now create a new specialized class that inherits from `ViewMeta`. Therefore, this PR also uses codegen for creating these specializations. In summary, these are the changes this PR introduces: - `ViewMeta` is turned into an abstract class (see _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual functions that need to be implemented. `to_out_index` should be implemented by operations that might return more than 1 output. - New `ViewMeta` specializations for `resize_` and `_unsafe_view` are created (see _FunctionalizeFallbackKernel.h_). - New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the declaration and definition of the `ViewMeta` specializations, which are automatically generated in the ATen codegen (see _gen.py_). - New `_functionalization` Python sub-module is created (see _Module.cpp_). It serves as namespace for the `ViewMeta` specializations and `InverseReturnMode` enum. - New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds the automatically generated Python bindings for the `ViewMeta` specialization, which are generated in the torch codegen (see _generate_code.py_). Note that this PR makes use of codegen at 2 different moments: - ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes. - Torch codegen (_generate_code.py_): generates the Python bindings for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712 Approved by: https://github.com/bdhirsh	2025-01-16 19:41:41 +00:00
Mikayla Gawarecki	7c3aa1da1c	Prevent _legacy_load with weights_only=True (#144914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144914 Approved by: https://github.com/malfet, https://github.com/albanD	2025-01-16 19:33:46 +00:00
Huy Do	cf28d613f1	Allow ROCm runner to upload benchmark results if found (#144710 ) https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database. This will unblock AMD when they try to run MI300 benchmarks on CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144710 Approved by: https://github.com/kit1980	2025-01-16 19:31:45 +00:00
Natalia Gimelshein	31a73eb712	fix acquire pattern in topk (#144945 ) Similar to #128455, topk needs another threadfence to complete acquire pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144945 Approved by: https://github.com/Skylion007	2025-01-16 19:20:43 +00:00
Yanbo Liang	3004b657f0	[Inductor][FlexAttention] Supports dynamic shapes with custom kernel options (#144938 ) Fixes #144815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144938 Approved by: https://github.com/drisspg	2025-01-16 19:02:35 +00:00
Jane Xu	e32d2bf853	Document decoupled_weight_decay for Adam for consistency with N/RAdam (#144984 ) Followup from #144972 and #143710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144984 Approved by: https://github.com/albanD	2025-01-16 18:58:29 +00:00
Nikita Shulga	ad15436db6	Fix `pt2-bug-report.yml` formatting (#144987 ) This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574 Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"` Before it printed ``` % python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])" {'type': 'markdown', 'attributes': {'value': ''}} ``` After ``` % python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])" {'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}} ``` Fixes https://github.com/pytorch/pytorch/issues/144970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-16 18:58:07 +00:00
PyTorch MergeBot	829c4570ca	Revert "[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 )" This reverts commit 1b34665767fcc35ae4a8f211945a24701c79df79. Reverted https://github.com/pytorch/pytorch/pull/144877 on behalf of https://github.com/malfet due to Actually no, lint is red ([comment](https://github.com/pytorch/pytorch/pull/144877#issuecomment-2596385712))	2025-01-16 18:10:37 +00:00
Tom Ritchford	13d35ea67a	[BE] Add missing throw of `std::runtime_error` in scrc/cuda/utils.cpp (#144962 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962 Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet	2025-01-16 17:35:39 +00:00
Zhengxu Chen	53256edff9	[export] Support module inputs for non strict mode. (#143925 ) Summary: Add experimental support for torch.nn.Module as input types. Before this change, we don't support module inputs but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where we directly pass in a module input for different variants of the same models. Since we don't really care about non-param or non-buffer states in non strict mode, we don't care about those either and pretend they are like plain constants during tracing. We treat any module input like a nested container of tensor, and each time we will automatically register a pytree handler for these module types to flatten its state dict into a group of tensors. We will just inline any module method call during tracing like we did for `self` module in export_for_training. This will make input modules' behavior very similar to the training module in typical case, except that we don't record the inputs as parameter or buffers but rather just plain user inputs. Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input Differential Revision: D67680827 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925 Approved by: https://github.com/tugsbayasgalan	2025-01-16 17:30:36 +00:00
atalman	519269a415	[BE] - Remove conda test and upload scripts and env variables from Workflows Part 1 (#144870 ) Remove conda test and upload scripts and env variables from Workflows Related to: https://github.com/pytorch/pytorch/issues/138506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144870 Approved by: https://github.com/malfet	2025-01-16 17:20:14 +00:00
Isalia20	727ae13318	Cholesky mps implementation (#144193 ) Requested in #77764 PR is still in draft because it needs some cleanups and optimizations to get to cpu performance the least. Tasks: - [x] Make `upper=True` work, only `upper=False` works now - [x] Code cleanup - [x] Optimizations(Though might need some help on this)(tried my best, maybe there is still some more to squeeze out) - [x] Checks for positive definite input - [x] Support for (*, N, N) input, currently only supports (B, N, N) input - [x] Support other dtypes(float16, bfloat16) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-16 16:26:46 +00:00
Davide Italiano	1b34665767	[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 ) Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877 Approved by: https://github.com/jansel, https://github.com/malfet	2025-01-16 16:22:06 +00:00
Zhengxu Chen	3d29de3ac8	[aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [1/n] (#144709 ) Summary: According to angelayi, these two flags indicated different things when we had two-pass codegen, but since we now basically keep the two flags the same, we should merge them. This can prevent some bugs (e.g. we change the value of aot_mode, which will not cover branches like if V.aot_compilation is True) from happening when we're trying to add different code paths to tweak the value of aot_mode in the future. Test Plan: CI Differential Revision: D68122536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144709 Approved by: https://github.com/angelayi, https://github.com/desertfire	2025-01-16 16:02:18 +00:00
Scott Wolchok	241a8a101b	Fix erroneous at_vreinterpretq_u16_bf16 call (#144883 ) Here, `mask` is definitely a `uint16x8_t`, not an `at_bfloat16x8_t`, so we shouldn't be reinterpreting it. Candidate fix for #144818 . Differential Revision: [D68224128](https://our.internmc.facebook.com/intern/diff/D68224128/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144883 Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/malfet	2025-01-16 15:16:28 +00:00
PyTorch MergeBot	6559374494	Revert "Add flop formula for _scaled_mm (#144872 )" This reverts commit f31452268bf9f7e395f263cd8a9d693633ea75ce. Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))	2025-01-16 15:16:18 +00:00
Yutao Xu	6470b0ea6f	Update torch-xpu-ops commit pin (#144739 ) Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](`22cc419e4e`), includes: - Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279 - Aten operator coverage improvement Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739 Approved by: https://github.com/EikanWang, https://github.com/malfet	2025-01-16 15:12:37 +00:00
Luca Wehrstedt	f31452268b	Add flop formula for _scaled_mm (#144872 ) This will make it work correctly with the partitioner's AutoAC Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872 Approved by: https://github.com/vkuzo	2025-01-16 13:57:54 +00:00
PyTorch MergeBot	1c290912e4	Revert "Add tests for different dtypes with max autotune (#144721 )" This reverts commit 9e568cbaa22df89b77e112f1a373d82acb2e6219. Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2595210355))	2025-01-16 10:59:30 +00:00
Shunting Zhang	0c0583254e	[inductor] fix index.Tensor fallback (#144736 ) The original issue is that we see an accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/). The debugging is hard but the root cause is relatively simple. The root cause is that the model has mixed-device inputs for index.Tensor, which causes Inductor to fall back. And the meta kernel for index.Tensor returns a tensor with inconsistent strides to the eager kernel. The following code snippet ``` import torch from torch._subclasses import FakeTensorMode device = "cuda" x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last) x = x.view(2, 12, 16, 32, 32) i1 = torch.arange(2).unsqueeze(-1) i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3] print(f"Eager stride: {x[i1, i2].stride()}") mode = FakeTensorMode() with mode: f_x = mode.from_tensor(x) f_i1 = mode.from_tensor(i1) f_i2 = mode.from_tensor(i2) f_out = f_x[f_i1, f_i2] print(f"Meta stride: {f_out.stride()}") ``` would output: ``` Eager stride: (49152, 16384, 1, 512, 16) Meta stride: (49152, 16384, 1024, 32, 1) ``` In this PR, I fix the problem by running the eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change the meta/eager kernel implementation so that their output layouts match. But I'm not sure how to properly do that. In the index.Tensor meta kernel, we always produce dense output: `6d56277682/torch/_meta_registrations.py (L3184)` . While the eager kernel seems to leverage TensorIteratorBase to decide some dimension permutation: `6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308)` . We can duplicate this logic in the meta kernel implementation if we really want meta to match eager. I can follow up on this if people have a strong opinion about doing this. And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. With that, the issue debugged here would be much easier to root cause. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736 Approved by: https://github.com/jansel	2025-01-16 09:38:29 +00:00
Huy Do	57d5659c3b	XFAIL test_save_load_checkpoint (#144927 ) Fixes https://github.com/pytorch/pytorch/issues/137771 The issue keeps showing up and rerun disable tests couldn't reproduce the issue. So, XFAIL it while waiting for distributed team to investigate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144927 Approved by: https://github.com/kit1980, https://github.com/malfet	2025-01-16 07:31:56 +00:00
Will Constable	7d8c087e24	[Pipelining] Improve shape inference debug logging (#144929 ) Remove log that just said "running forward" since that is not so useful in itself, replace with somewhat equivalent log that reports both input and output shapes after running forward. Note: enabled by `TORCH_LOGS=+pp` Example: ``` [rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929 Approved by: https://github.com/H-Huang	2025-01-16 07:30:11 +00:00
Natalia Gimelshein	0b17c09893	restore rng generation for fbcode (#144819 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819 Approved by: https://github.com/malfet, https://github.com/kit1980	2025-01-16 06:46:26 +00:00
Mario Vasilev	49bdc418be	Add strict kwarg to `nn.Module.set_submodule` and fix bug for non dot delineated strings (#143455 ) Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix it will not create a new attribute if target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`. Fixes https://github.com/pytorch/pytorch/issues/143441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455 Approved by: https://github.com/mikaylagawarecki	2025-01-16 05:06:33 +00:00
fduwjj	e3c4d1b7d6	[c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916 ) When we introduced partial match, we accidentally introduced the mark of mismatch for the full match case. This is wrong and this PR fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916 Approved by: https://github.com/c-p-i-o	2025-01-16 04:36:30 +00:00
Gabriel Ferns	9e568cbaa2	Add tests for different dtypes with max autotune (#144721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721 Approved by: https://github.com/cpuhrsch, https://github.com/etaf	2025-01-16 04:29:44 +00:00
Zhenbin Lin	52a620845b	OpenReg: Use device agnostic API (#144840 ) Use `torch.accelerator.device_count()` to get the number of devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144840 Approved by: https://github.com/albanD	2025-01-16 03:31:52 +00:00
Xia, Weiwen	1230de4c1b	[Quant][Inductor][X86] Separate binary post op fusion and lowering for qconv (#144318 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused the op to cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves binary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise` 2. Fuse `onednn.qconv2d_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144318 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #144224, #144312	2025-01-16 03:30:36 +00:00
cyy	843627b7b1	Remove unnecessary once flag usage (#143255 ) Static variables in C++11 are guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration) ``` If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once). Usual implementations of this feature use variants of the double-checked locking pattern, which reduces runtime overhead for already-initialized local statics to a single non-atomic boolean comparison. ``` Given that static c10::once_flag was used before, why not just use the associated function to initialise the related static variables? That is the motivation behind this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255 Approved by: https://github.com/albanD	2025-01-16 02:36:11 +00:00
Nikita Shulga	41ec2e8d3e	[MPSInductor] Fix codegen regression (#144924 ) Caused by https://github.com/pytorch/pytorch/pull/144649 Do not try to insert anything into the header if wrapper is not ready yet Fixes `test_sort_mps` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144924 Approved by: https://github.com/dcci ghstack dependencies: #144827, #144917	2025-01-16 02:12:42 +00:00
Nikita Shulga	05505771a0	[MPSInductor] Properly convert index (#144917 ) By calling `self.index_to_str` from `load`,`store` and `check_bounds` in order to properly handle sizevars variables renames Pull Request resolved: https://github.com/pytorch/pytorch/pull/144917 Approved by: https://github.com/dcci ghstack dependencies: #144827	2025-01-16 02:12:41 +00:00
PyTorch MergeBot	d595b96059	Revert "restore rng generation for fbcode (#144819 )" This reverts commit 2bc18a905544f4e25cfbd354351418b36a0f5fc1. Reverted https://github.com/pytorch/pytorch/pull/144819 on behalf of https://github.com/ngimel due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/144819#issuecomment-2594298941))	2025-01-16 01:52:29 +00:00
Colin L. Rice	6492851125	symbolic_convert: Don't fail when we hit a undefined name (#144784 ) We're using a python builtin NameError here, instead of throwing a Unsupported exception. This causes the NameError to get wrapped in a InternalTorchDynamoError instead of just causing a graph break, and letting the user code fail directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144784 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-16 01:47:48 +00:00
Driss Guessous	c8bcb22e5f	Default Copies are not vectorized in v3.6.0 of cutlass (#144837 ) Summary: FlashAttentionV2 perf was tanked in v3.6.0, See: https://github.com/pytorch/pytorch/issues/144729 for more details. This PR makes it possible to land v3.6.0 update and fixes perf regression. See: https://github.com/pytorch/pytorch/issues/144729#issuecomment-2591644076 for analysis; we also have various internal tests to verify Differential Revision: D68194635 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144837 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-01-16 01:12:46 +00:00
Colin L. Rice	926f9056a9	speculation_log: Raise a unique error for divergence issues (#144785 ) This is primarily sent for discussion and to see what tests fail due to this. The idea is that rather than capturing this as a regex on the fail_reason, just give it a unique failure type Pull Request resolved: https://github.com/pytorch/pytorch/pull/144785 Approved by: https://github.com/ezyang	2025-01-16 00:49:43 +00:00
David Berard	b90231a189	[inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807 ) motivation: Ed's advice to avoid `except ImportError` (i.e. based on the fact that your target module/class might in fact exist, but you might run into some different ImportError whose stacktrace you now ignore). additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern: ``` try: ... except ImportError: try: ... except ImportError: try: ... ``` suggestions on better ways to do this would be appreciated! test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in june, when AttrsDescriptor was still in triton.compiler.compiler) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807 Approved by: https://github.com/ezyang	2025-01-16 00:32:29 +00:00
cyy	ee97d80be2	Apply Ruff fixes and pyupgrade to torch/jit (#144208 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144208 Approved by: https://github.com/davidberard98	2025-01-16 00:28:50 +00:00
Pian Pawakapan	774f21a370	[export] handle buffer/input mutations for joint-graph (#144806 ) Summary: previous construction of GraphSignature output specs didn't consider buffer/user input mutations Test Plan: test_experimental Differential Revision: D68177409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144806 Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri	2025-01-16 00:22:16 +00:00
Brian Hirsh	d7f45fc575	dynamic shape support for interpolate(antialias=True) backward (#141198 ) Fixes https://github.com/pytorch/pytorch/issues/141187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198 Approved by: https://github.com/ezyang, https://github.com/Chillee ghstack dependencies: #141161	2025-01-16 00:08:25 +00:00
Brian Hirsh	4831f89790	support numbers as tensors for aten.copy(Tensor, Tensor) (#141161 ) Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161 Approved by: https://github.com/ezyang	2025-01-16 00:08:25 +00:00
Xu Han	2645fc45b1	export AOTI_TORCH_EXPORT on Windows. (#140030 ) Fixes #139954 reproduce UT: ```cmd pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu ``` Issue: <img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe"> After fixing: ![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a) Reland: 1. Declare export on Windows explicitly. 2. Support cpu, cuda and xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030 Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-15 23:43:41 +00:00
Justin Chu	fb4b5a9299	[ONNX] Use python_dispatcher in type promotion (#144801 ) Fix #143118 Use python_dispatcher in the type promotion pass to preserve symbolic shapes according to @angelayi 's suggestions. (Thanks!) Tested locally. I wasn't able to create a minimal repro except for using the full model Pull Request resolved: https://github.com/pytorch/pytorch/pull/144801 Approved by: https://github.com/titaiwangms	2025-01-15 23:25:19 +00:00
Annop Wongwathanarat	7265dc0622	Enable s8s8s8 for qlinear with mkl-dnn (#139887 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887 Approved by: https://github.com/huydhn	2025-01-15 23:20:10 +00:00
Natalia Gimelshein	4e1834f5f3	use cooperative schedule in scaled_mm for fast_accum=false (#144809 ) This improves perf for large matrices by more than 2x, more detailed benchmark coming. On master ![image](https://github.com/user-attachments/assets/fc6a0987-5b82-475d-a2ff-b46641bb17dc) On this branch <img width="601" alt="image" src="https://github.com/user-attachments/assets/7f55152b-1110-45e4-b2ea-6f274d543869" /> A plot similar to https://github.com/pytorch/ao/pull/1325#discussion_r1868193786 <details> <summary>Benchmarking code:</summary> ```python import torch from triton.testing import do_bench import itertools def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False): return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16) def fn_aten(a, b, scale, use_fast_accum=False): return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16) for i,j,k in itertools.product(range(9, 15), range(9, 15), range(9, 15)): m = 2**i n = 2**j k = 2**k a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn) b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn) scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32) scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32) scale_0 = torch.randn((), device="cuda", dtype=torch.float32) ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50) ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50) ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50) ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50) print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow / ms_tensor_fast}, ratio_rw={ms_rowwise_slow / ms_rowwise_fast}") ``` </details> Higher N/K values still have about 40% penalty, perhaps some additional heuristics tweaks would be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144809 Approved by: https://github.com/drisspg	2025-01-15 23:04:14 +00:00
PyTorch MergeBot	0f051eaf66	Revert "Fix global namespace pollution in ATen/Dispatch.h (#138626 )" This reverts commit 326c7cae28783f29c577b5a5d3ac38a3b61188bd. Reverted https://github.com/pytorch/pytorch/pull/138626 on behalf of https://github.com/malfet due to This broke inductor tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](https://github.com/pytorch/pytorch/pull/138626#issuecomment-2594021436))	2025-01-15 21:59:04 +00:00
Sam	c7b2f7dd14	Add generator parameter to rand*_like functions (#136780 ) Fixes #128786 Fixes #101974 Fixes #27072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780 Approved by: https://github.com/Chillee, https://github.com/ezyang	2025-01-15 21:16:52 +00:00
Benjamin Glass	d62b3979da	cpp_wrapper: Move #includes to per-device header files (#143909 ) This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909 Approved by: https://github.com/desertfire	2025-01-15 21:14:02 +00:00
Huy Do	05095a45f2	Fix the wrong artifact in remaining workflows (#144812 ) I missed them in https://github.com/pytorch/pytorch/pull/144694 as they weren't run often. But they are still failing nonetheless, i.e. https://github.com/pytorch/pytorch/actions/runs/12762640334/job/35578870178 The issue was from https://github.com/pytorch/pytorch/pull/125401 where it added `use-gha: ${{ inputs.use-gha }}` to linux_test workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144812 Approved by: https://github.com/clee2000	2025-01-15 20:36:40 +00:00
Colin L. Rice	b88dcb4835	dynamo: Don't crash when tracing a missing attr on a constant. (#144593 ) Previously this threw an InternalTorchDynamoError: AttributeError: 'NoneType' object has no attribute 'max' instead of just skipping the bad call when tracing and throwing a normal AttributeError. There are two questions that I would love reviewer comment on. 1) Is throwing unimplemented the right thing here, or should I throw something like ObservedAttributeError? 2) Do we need to worry about performance with this code? In particular, should we just catch the exception? Or maybe cache the lookup result? Pull Request resolved: https://github.com/pytorch/pytorch/pull/144593 Approved by: https://github.com/jansel	2025-01-15 20:23:43 +00:00
Avik Chaudhuri	d812fdd490	fix as_bool serde (#144791 ) Differential Revision: [D68167701](https://our.internmc.facebook.com/intern/diff/D68167701/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144791 Approved by: https://github.com/pianpwk	2025-01-15 20:22:26 +00:00
Nikita Shulga	904641769e	[MPSInductor] Implement `pow()` (#144827 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144827 Approved by: https://github.com/dcci, https://github.com/jansel	2025-01-15 20:11:34 +00:00
Runming Lu	b410378d93	Register nonzero for meta device for FBLSim (#144727 ) Summary: Fix `nonzero is not registered to meta` issue: ``` "NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered". ``` Reviewed By: ezyang Differential Revision: D66525640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727 Approved by: https://github.com/ezyang	2025-01-15 19:40:42 +00:00
Zhengxu Chen	834086c023	[export] Load side info about pos/kw argument kind for serialization. (#144686 ) Summary: Fixing issue of nodes like ``` torch.ops.aten.linear.default(x, w, b) ``` being deserialized as ``` torch.ops.aten.linear.default(x, w, bias=b) ``` which breaks roundtripping. Test Plan: buck test mode/opt caffe2/test:test_export -- -r TestDeserialize buck test mode/opt caffe2/test:test_export -- -r TestSerialize Differential Revision: D67991410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686 Approved by: https://github.com/angelayi	2025-01-15 19:08:38 +00:00
Simon Fan	898a90c6bb	[dynamo][hop] Introduce FlexAttentionBackwardHighOrderVariable (#144533 ) FIXES https://github.com/pytorch/pytorch/issues/143180 This PR adds a new variable mapping to SourcelessBuilder to represent the flex attention intermediates. The variable proxies a call to the HOP and carries over the graph state (subgraphs represented as UnspecializedNNModuleVariable) to the dynamo output graph. This is safe to do because the nn modules used in flex attention have either been speculated on before, or are outputs of make_fx of the forward. tlparse of `TestCompiledAutograd.test_flex_attention`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpiWendk/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class GraphModule(torch.nn.Module): def forward(self, L_inputs_ : list): ... # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1) ... 
fw_graph0_0 = self.fw_graph0_0 joint_graph0_0 = self.joint_graph0_0 mask_graph0_0 = self.mask_graph0_0 flex_attention_backward = torch.ops.higher_order.flex_attention_backward(aot0_primals_1, aot0_primals_1, aot0_primals_1, aot0_detach_3, aot0_detach_5, aot0_expand_5, aot0_zeros_1, fw_graph0_0, joint_graph0_0, (1, 1, aot0_ones, aot0_zeros, None, None, aot0__to_copy_1, aot0__to_copy_2, None, None, 1073741824, 1073741824, mask_graph0_0), 0.125, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True}, (), ()); aot0_primals_1 = aot0_detach_3 = aot0_detach_5 = aot0_expand_5 = aot0_zeros_1 = fw_graph0_0 = joint_graph0_0 = aot0_ones = aot0_zeros = aot0__to_copy_1 = aot0__to_copy_2 = mask_graph0_0 = None aot0_getitem_4: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[0] aot0_getitem_5: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[1] aot0_getitem_6: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[2]; flex_attention_backward = None ... 
class fw_graph0_0(torch.nn.Module): def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0"): return arg0_1 class joint_graph0_0(torch.nn.Module): def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0", arg5_1: "bf16[][]cuda:0"): return [arg5_1, None, None, None, None] class mask_graph0_0(torch.nn.Module): def forward(self, arg0_1: "i32[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0"): # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1) new_ones: "b8[][]cuda:0" = torch.ops.aten.new_ones.default(arg0_1, [], dtype = torch.bool, device = device(type='cuda', index=0), pin_memory = False); arg0_1 = None return new_ones ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144533 Approved by: https://github.com/zou3519	2025-01-15 18:40:57 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
Jeff Daily	6ac0616504	[ROCm] hipblaslt rowwise f8 gemm (#144432 ) hipblaslt added rowwise f8 gemm support. Integrate with scaled_mm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432 Approved by: https://github.com/drisspg	2025-01-15 18:23:44 +00:00
Boyuan Feng	069419569d	[PagedAttention] Support different input position for each batch index (#144693 ) In LLM inference, each request usually has different prefill length, leading to different input position for each batch index. This PR adds such support for paged attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144693 Approved by: https://github.com/drisspg	2025-01-15 18:03:52 +00:00
Boyuan Feng	7e80758efc	[CUDAGraph][Docs] add `cuda` to `torch.randn` (#144793 ) Previous doc example created `torch.randn` tensor on cpu so CUDAGraph was skipped. Fixes #144386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793 Approved by: https://github.com/eellison	2025-01-15 18:02:10 +00:00
Edward Z. Yang	ee8f833d13	Undo leading underscore on ctx for breakpoint (#144864 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144864 Approved by: https://github.com/Skylion007	2025-01-15 18:00:58 +00:00
PyTorch MergeBot	443de667b1	Revert "Enable s8s8s8 for qlinear with mkl-dnn (#139887 )" This reverts commit dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d. Reverted https://github.com/pytorch/pytorch/pull/139887 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to have broken trunk. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/12788709683/job/35651699934) [HUD commit link](`dc8692b0eb`) ([comment](https://github.com/pytorch/pytorch/pull/139887#issuecomment-2593597977))	2025-01-15 17:58:33 +00:00
Sahan Paliskara	d065e8a9de	[ez] add lint commits to .git-blame-ignore-revs (#144576 ) Test Plan: Ran git blame on .lintrunner.toml and github's linter (+ manual testing) shows all commits exist Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576 Approved by: https://github.com/janeyx99	2025-01-15 17:39:29 +00:00
wizzniu	c07dc64017	Update pin memory related APIs to not pass 'device' argument (#131858 ) Based on https://github.com/pytorch/pytorch/pull/126376, this PR tries to update all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument. As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument, but passing `device` is discouraged. If it is not given, the default `device` is still 'cuda' for BC. Additionally, based on device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective. If it is not given, the default `device` will be the current accelerator. Fixes #124908 Relates https://github.com/pytorch/pytorch/pull/126376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-01-15 17:23:35 +00:00
Catherine Lee	0dca756832	Revert "Upload METADATA file with whl binaries (#143677 )" (#144706 ) This reverts commit 3eb3f4ed5580010a7961d996ccc6ee19c7ccbb5e. Also reverts https://github.com/pytorch/pytorch/pull/144164 Manual revert because the above causes merge conflicts Reverting in favor of https://github.com/pytorch/test-infra/pull/6159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144706 Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet	2025-01-15 17:20:21 +00:00
Aaron Orenstein	d782e46a36	[BE] typing for decorators - library (#138969 ) Test Plan: unit tests Differential Revision: D62302678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138969 Approved by: https://github.com/zou3519	2025-01-15 17:08:55 +00:00
Kaustubh Vartak	c7a9599100	Handle meta tensors in FX quantization (#144726 ) Summary: D66895899 got reverted in D67565250 because of pytorch OSS linter failure. Adding back with the format the linter suggested https://github.com/pytorch/pytorch/actions/runs/12443655335/job/34743090791 Test Plan: buck run fbcode//mode/dev-nosan fbcode//torchrec/fb/quant/tests:test_embedding_modules Reviewed By: emlin Differential Revision: D68132568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144726 Approved by: https://github.com/iamzainhuda, https://github.com/janeyx99	2025-01-15 16:49:43 +00:00
Natalia Gimelshein	2bc18a9055	restore rng generation for fbcode (#144819 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819 Approved by: https://github.com/malfet, https://github.com/kit1980	2025-01-15 16:34:25 +00:00
PyTorch MergeBot	154185dcd0	Revert "Removed unused _RequiredParameter (#144771 )" This reverts commit 6a5f895e549665a6895c84881a35736677071048. Reverted https://github.com/pytorch/pytorch/pull/144771 on behalf of https://github.com/malfet due to It broke number of cpuinductor tests ([comment](https://github.com/pytorch/pytorch/pull/144771#issuecomment-2593293542))	2025-01-15 15:51:33 +00:00
dilililiwhy	7c52c97a65	Expose several APIs to public (torch python APIs) (#144525 ) Fixes #144302 Try to expose several APIs to public for privateuse1 scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-01-15 14:34:45 +00:00
Annop Wongwathanarat	dc8692b0eb	Enable s8s8s8 for qlinear with mkl-dnn (#139887 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/ng-05, https://github.com/digantdesai	2025-01-15 12:51:21 +00:00
Sujoy Saraswati	7e1c1e65eb	Graph freezing preparation for non-Inductor backends (#139902 ) Enable preparing module named parameters and buffers in tracing context for non-Inductor backends to implement graph freezing. Fixes #139272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139902 Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/gujinghui	2025-01-15 11:25:04 +00:00
Laith Sakka	62ce3e6e84	refresh benchmarks results after recent recent regressions (#143075 ) refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075 Approved by: https://github.com/bobrenjc93, https://github.com/huydhn	2025-01-15 09:11:57 +00:00
Edward Z. Yang	e263f0af23	[BE] Make a SymbolInfo NamedTuple (#144745 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144745 Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007	2025-01-15 08:59:27 +00:00
Taher	d9d7cca009	make eval_frame safe (#141357 ) Fixes #108942 this PR converts eval_frame.c's static extension types to heap types, making it thread and sub-interpreter safe. the current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR todo / suggestions: 1. uplift `eval_frame_callback_key` to module state 2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init` 3. define configurations for module allowing sub-interpreters or not ```c static int module_exec(PyObject *m) {} static PyModuleDef_Slot module_slots[] = { {Py_mod_exec, module_exec}, {0, NULL} }; static struct PyModuleDef module = { PyModuleDef_HEAD_INIT, .... .m_slots = module_slots }; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357 Approved by: https://github.com/jansel Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2025-01-15 07:37:50 +00:00
Xiaodong Wang	6ba53a5f1c	[AMD] De-noise tf32 warnings (#144797 ) Summary: This is way too noisy especially during unit tests. So just log once. Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ). Differential Revision: D68167633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144797 Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu	2025-01-15 07:10:38 +00:00
Scott Wolchok	69b883d7ac	Remove C10_EMBEDDED (#144808 ) I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are multiple implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808 Approved by: https://github.com/janeyx99, https://github.com/malfet	2025-01-15 06:08:53 +00:00
Sam Larsen	b801210035	Restore support for other types of async_compile pools (spawn, fork) (#144491 ) Summary: https://github.com/pytorch/pytorch/pull/142001 removed support for process pools other than "subprocess", but some OSS users still find it useful; put it back. Test Plan: New unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/144491 Approved by: https://github.com/jansel, https://github.com/haifeng-jin	2025-01-15 06:04:49 +00:00
Arnie Yuan	326c7cae28	Fix global namespace pollution in ATen/Dispatch.h (#138626 ) Summary: Was it a typo? Since we already have `at::detail::record_kernel_function_dtype()` in `ATen/Dispatch.h` Test Plan: just build Differential Revision: D64642080 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138626 Approved by: https://github.com/malfet	2025-01-15 05:43:54 +00:00
James Wu	7d71ddbe5d	Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802 ) Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/) This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return). A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as: - It's functional (i.e. does not access global state); - or its value is constant folded away (and guarded against by dynamo) The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache. That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache. The list of safe handlers there either: - Don't access global state; - Guard on global state; or - Always returns a constant that never changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802 Approved by: https://github.com/bdhirsh	2025-01-15 05:41:36 +00:00
Howard Huang	79312ddb73	[PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702 ) There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation it makes sense to check that the number of stages should be >= number of microbatches otherwise there will be an even larger bubble. This can be removed when we have the single stage schedules to use an IR and updated to run with schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702 Approved by: https://github.com/kwen2501	2025-01-15 05:35:29 +00:00
fduwjj	ae7df51232	[c10d] Fix CudaEventCache for dangling references (#144496 ) As reported in https://github.com/pytorch/pytorch/issues/143470, we have dangling references in `CudaEventCache`, so we want to fix them. 1. We add a unit test to repro the problem described in the issue. 2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent gets deleted. (thanks for the suggestion from @kwen2501 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496 Approved by: https://github.com/kwen2501	2025-01-15 05:11:48 +00:00
Simon Fan	9cd6f46130	[ca] raise error message on AOT Autograd caching (#144595 ) FIXES https://github.com/pytorch/pytorch/issues/144175, bandaid Pull Request resolved: https://github.com/pytorch/pytorch/pull/144595 Approved by: https://github.com/bdhirsh	2025-01-15 05:09:42 +00:00
fduwjj	e0bbff6019	[c10d][ez] Add comments to the end of Macro for better readability (#144789 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789 Approved by: https://github.com/c-p-i-o	2025-01-15 05:06:41 +00:00
Nikita Shulga	d2ca8163c0	[MPSInductor] Support `abs` in MetalPrintExpr (#144826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144826 Approved by: https://github.com/dcci ghstack dependencies: #144509, #144798, #144795, #144796	2025-01-15 05:01:25 +00:00
Nikita Shulga	9610a22e94	Fix FakeTensor device creation for MPS (#144796 ) By promoting torch.device("mps") to `torch.device("mps:0")`, but skipping `is_initialized` check, as MPS does not really support multi-GPU right now This fixes `GPUTests.test_remove_no_ops_mps` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144796 Approved by: https://github.com/ezyang ghstack dependencies: #144509, #144798, #144795	2025-01-15 05:01:25 +00:00
Nikita Shulga	18786c65e5	[BE] Extend `test_remove_no_ops` (#144795 ) ---- - Use `is_dtype_supported` to skip dtype promotions portion of the test on unsupported device - Extend it to use `torch.float16` so promotions could be checked there - Implement `CpuInterface.is_bfloat16_supported` that returns true (which looks like the case, even if it's supported via emulation) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795 Approved by: https://github.com/Skylion007 ghstack dependencies: #144509, #144798	2025-01-15 05:00:26 +00:00
Riley Dulin	48f7e7c378	[torch][ao][EASY] Change print to log in numeric debugger to avoid large output (#144790 ) Summary: This print statement was spewing a bunch of data in logs by default, but it should be silenceable. Use `log.debug` instead. Differential Revision: D68166823 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144790 Approved by: https://github.com/tarun292	2025-01-15 04:58:56 +00:00
Piergiacomo De Marchi	6a5f895e54	Removed unused _RequiredParameter (#144771 ) As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used. The `required` object was initially introduced in this [PR](`4db6667923`) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`. Everything not included in this PR is Not a Contribution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771 Approved by: https://github.com/janeyx99	2025-01-15 04:11:17 +00:00
cyy	d87aad6877	[5/N] Apply Ruff fixes and pyupgrade to Python 3.9 (#144205 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144205 Approved by: https://github.com/albanD	2025-01-15 04:00:47 +00:00
Driss Guessous	db787181b5	Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738 ) Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729 Test Plan: sand castle Differential Revision: D68137326 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738 Approved by: https://github.com/huydhn	2025-01-15 02:57:14 +00:00
Nikita Shulga	e2251fffbb	[MPSInductor] Add `min`/`max` to MetalExprPrinter (#144798 ) After that `GPUTests::test_avg_pool2d8_mps` and `GPUTests::test_avg_pool2d5_mps` passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/144798 Approved by: https://github.com/dcci ghstack dependencies: #144509	2025-01-15 01:43:42 +00:00
Xia, Weiwen	9199c79a9c	[Quant][Inductor][X86] Separate unary post op fusion and lowering for qconv (#144312 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves unary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise` 2. Fuse `onednn.qconv2d_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144312 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #144224	2025-01-15 00:50:54 +00:00
Tugsbayasgalan Manlaibaatar	825fe15024	EZ fix to make sure local pytest run succeeds in export (#144764 ) Previously run_tests() was protected under IS_FBCODE flag so that following works: ``` python test/export/test_export_legacy.py ``` But it fails on: ``` pytest test/export/test_export_legacy.py ``` This is because pytest doesn't seem to get triggered through run_tests(). Differential Revision: [D68152737](https://our.internmc.facebook.com/intern/diff/D68152737) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144764 Approved by: https://github.com/avikchaudhuri	2025-01-15 00:43:40 +00:00
Henry Tsang	8c2aa0c533	[cutlass backend] cexpr the arg before writing to cpp file (#144714 ) Summary: The problem is for certain shapes, see unit test, one of the dimensions is like `s0 // 2`. If we use cutlass backend, this means writing that to C++ file, which would lead to C++ compilation error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144714 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78, https://github.com/desertfire	2025-01-14 23:09:44 +00:00
Aaron Orenstein	8ad37ed710	Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483 Approved by: https://github.com/Skylion007	2025-01-14 22:32:51 +00:00
Jerry Mannil	ea3395e4f2	[ROCm] Improvements for vectorized elementwise kernels (#143269 ) * Make the io_size calculation the minimum of the input size and output size, rather than the sum of all sizes * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4 * But elems_per_thread = 8 works better on half dtypes for AMD GPUs * Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16 respectively Co-author: @akadutta Pull Request resolved: https://github.com/pytorch/pytorch/pull/143269 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>	2025-01-14 22:09:21 +00:00
soulitzer	c000214826	Allow GradientEdge as torch.autograd.backward outputs (#144744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144744 Approved by: https://github.com/albanD	2025-01-14 21:31:44 +00:00
fan.mo	64829b356a	[PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325 ) PyTorch now supports many PrivateUse1-derived backend names such as `AutogradPrivateUse1` or `QuantizedPrivateUse1`, in addition to the original `PrivateUse1` backend. However, users who implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`; in that case, none of the `PrivateUse1` backend strings can be found when other related functions are called. For example, when using `torch.library` to register custom functions for the new backend with "my_backend" as the backend name instead of "PrivateUse1", the following error is thrown: ``` could not parse dispatch key 'my_backend' ``` So this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)` to double-check whether `PrivateUse1` has been renamed and, if so, to adapt `k` to the new backend name and look it up again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325 Approved by: https://github.com/albanD	2025-01-14 21:21:29 +00:00
Will Constable	130452dad6	[Pipelining] fix test_schedule.py (missing destroy_process_group (#144734 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144734 Approved by: https://github.com/H-Huang ghstack dependencies: #144352, #144596	2025-01-14 21:16:09 +00:00
Will Constable	aa57f0c663	[Pipelining] Refactor common utils from test_pp_dp (#144596 ) Split test_pp_dp into pp_ddp and pp_fsdp so it's a bit more concise and easier to add CP to the FSDP one. Realized that the 'use_new_runtime' parametrization was not even being used; removing it saves a bunch of test time. We should migrate schedules to the new runtime and have them be covered that way. (And test_schedule*.py are testing new runtime too). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596 Approved by: https://github.com/H-Huang ghstack dependencies: #144352	2025-01-14 20:13:17 +00:00
Will Constable	6f5dce3035	[Pipelining] Fix PP grad scaling (#144352 ) Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches. Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352 Approved by: https://github.com/H-Huang	2025-01-14 20:13:17 +00:00
Nikita Shulga	9157a748a6	[MPSInductor] Add dummy properties (#144509 ) For compute capability (which is an empty string, same as CPU), and for multicore count return 8, as this is the smallest number of GPU cores on Apple silicon Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509 Approved by: https://github.com/jansel	2025-01-14 20:12:38 +00:00
PyTorch MergeBot	bdd942efd7	Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138 )" This reverts commit 6cfc08167595e27ee9a5701c6426a7a8a7e387ef. Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))	2025-01-14 19:04:12 +00:00
Wang, Chuanqi	b4b4e57469	[CD] Enable profiling for XPU Windows nightly wheels (#144316 ) PR https://github.com/pytorch/pytorch/pull/144034 added profiling support for torch XPU Windows binary, enable it in PyTorch XPU Windows CD Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144316 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-01-14 19:01:27 +00:00
Bin Bao	2683691237	[AOTI] Add a boxed_run API (#142213 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213 Approved by: https://github.com/ezyang	2025-01-14 18:47:42 +00:00
Richard Barnes	e2891d43a8	[codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +1 (#144783 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/144783 Approved by: https://github.com/albanD, https://github.com/malfet	2025-01-14 18:34:54 +00:00
Mwiza Kunda	ec1c3ab3b2	[inductor][triton] skip test_data_type_propagation if triton (#142054 ) Non-cpp inductor backends don't have a `DataTypePropagation` pass on the scheduler nodes, so skip the test. CUDA only passes because the device is currently not changed to "cuda" in the test body. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142054 Approved by: https://github.com/eellison	2025-01-14 18:03:00 +00:00
Nikhil Gupta	e666807653	[Fix]: Enable support for Arm Neon & SVE support for FP32 Gemm Wrapper (#144327 ) Performance Improvements: Linear Layer [ 1x512 * 512x512 ] -> 2x - 4x Linear Layer [ 3x512 * 512x512 ] -> 2x - 4x Pull Request resolved: https://github.com/pytorch/pytorch/pull/144327 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/cfRod, https://github.com/malfet Co-authored-by: Crefeda Rodrigues <crefeda.Rodrigues@arm.com>	2025-01-14 17:52:00 +00:00
soulitzer	eee7a47e94	Support FunctionalTensor subclass in is_fake and maybe_get_fake_mode (#144719 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144719 Approved by: https://github.com/bdhirsh	2025-01-14 17:49:11 +00:00
PyTorch MergeBot	d21738f24a	Revert "Fix torch.normal ignores default_device (#144070 )" This reverts commit 184549b2d7e59acfc6e47d121e9ebb50648945b3. Reverted https://github.com/pytorch/pytorch/pull/144070 on behalf of https://github.com/ezyang due to broken a specific use case ([comment](https://github.com/pytorch/pytorch/pull/144070#issuecomment-2590681953))	2025-01-14 17:41:58 +00:00
PyTorch UpdateBot	7977a3638e	[executorch hash update] update the pinned executorch hash (#140769 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140769 Approved by: https://github.com/pytorchbot	2025-01-14 17:38:07 +00:00
Nikita Shulga	f2975717f3	[CD] Fix slim-wheel nvjit-link import problem (#141063 ) When other toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to there, import torch will fail with ``` ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12 ``` It could not be worked around by tweaking rpath, as it also depends on the library load order, which are not guaranteed by any linker. Instead solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following ```python if version.cuda in ["12.4", "12.6"]: with open("/proc/self/maps") as f: _maps = f.read() # libtorch_global_deps.so always depends in cudart, check if its installed via wheel if "nvidia/cuda_runtime/lib/libcudart.so" in _maps: # If all abovementioned conditions are met, preload nvjitlink _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]") ``` Fixes https://github.com/pytorch/pytorch/issues/140797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063 Approved by: https://github.com/kit1980 Co-authored-by: Sergii Dymchenko <sdym@meta.com>	2025-01-14 17:33:07 +00:00
Shangdi Yu	5c727d5679	[minifier] Fix config generator for callables (#144518 ) Summary: When config contains callables, the current configs generated cannot be run: ``` torch._dynamo.config.reorderable_logging_functions = {<built-in function print>, <function warning at 0x7f774c595630>, <function log at 0x7f774c595870>, <function error at 0x7f774c595510>, <function info at 0x7f774c595750>, <built-in function warn>, <function exception at 0x7f774c5955a0>, <function debug at 0x7f774c5957e0>, <function critical at 0x7f774c5953f0>} ``` We fix the config to generate the right string, so the config is runnable, like below ``` import logging import warnings torch._dynamo.config.reorderable_logging_functions = { warnings.warn, logging.warn, print } ``` Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config ``` Differential Revision: D67998703 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144518 Approved by: https://github.com/desertfire	2025-01-14 17:18:13 +00:00
Zhenbin Lin	cbb1ed2966	[1/N] OpenReg: Replace `open_registration_extension.cpp` with openreg (#141815 ) As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg. The current `open_registration_extension.cpp` contains two parts: 1. Implementations to support the `PrivateUse1` backend. 2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`. For the first part, we'll replace it with openreg. For the second part, we'll migrate them to UT files step by step. @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815 Approved by: https://github.com/albanD	2025-01-14 15:59:00 +00:00
Nikita Shulga	347a74b8f5	Mark CUDA-12.6 as experimental for 2.6 release (#144769 ) Because that's the first time we are trying to release it, and it also is the first release to use manylinux2_28 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144769 Approved by: https://github.com/atalman	2025-01-14 15:30:00 +00:00
Edward Z. Yang	60d2e32fa4	[BE] Remove lambda from str (#144743 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144743 Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007 ghstack dependencies: #144471	2025-01-14 15:10:57 +00:00
Edward Z. Yang	ffb3f32693	Add max kwarg to torch._check with alternate size oblivious semantics (#144471 ) Fixes https://github.com/pytorch/pytorch/issues/120288 for the static bound case I had been tying myself in knots in the original issue about the fact that we can't really do symbolic bounds like u0 < s0. But then I realized, "Wait, but the static bounds are easy!" So this makes it so you can also exclude a specific upper bound when doing size oblivious tests, which is enough to solve https://github.com/pytorch/pytorch/issues/123592#issuecomment-2574556708 It's written very dirtily, maybe there's some cleanup. Bikeshed on the public API name also welcome. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144471 Approved by: https://github.com/avikchaudhuri	2025-01-14 15:10:57 +00:00
RAHUL SINGH	95b41d2aa4	Tests Generelization for multiple accelerator devices (#139749 ) Motivation: Generalize unit tests so that they can be executed for CUDA and non-CUDA devices. Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non-CUDA devices too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749 Approved by: https://github.com/kwen2501	2025-01-14 08:52:46 +00:00
lzhang2	1800f5f461	Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735 ) Motivation: - Enable coalescing path on XPU for `batch_isend_irecv`. - If XCCL backend is specified, then construct a XPU tensor to ensure `barrier` dispatch to XCCL backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735 Approved by: https://github.com/kwen2501	2025-01-14 08:37:48 +00:00
Daulet Askarov	21cbee5d9b	Drop unused num_elements variable (#144723 ) Summary: With the recent enforcement of unused variable as an error in D67329035, certain tests like https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0 can't build citing: ``` Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic)) Remote command returned non-zero exit code 1 Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145` Stdout: <empty> Stderr: fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable] 3757 \| size_t num_elements = output.numel(); \| ^~~~~~~~~~~~ 1 error generated. ``` This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock. Test Plan: ``` buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test ``` https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33 Reviewed By: c-p-i-o Differential Revision: D68126236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723 Approved by: https://github.com/fduwjj, https://github.com/Skylion007 Co-authored-by: Daulet Askarov <dauleta@meta.com>	2025-01-14 08:29:01 +00:00
Isalia20	80eff6e720	[MPS] fix triangular for >3D tensors (#144545 ) The old implementation produced incorrect output because it did not handle batch shapes other than 3D tensors (B, M, N) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144545 Approved by: https://github.com/malfet	2025-01-14 08:25:01 +00:00
Xia, Weiwen	8436a5c2cb	[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves binary post op fusion of qlinear out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise` 2. Fuse `onednn.qlinear_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2025-01-14 06:46:38 +00:00
Yu, Guangye	c031defe0b	[RELAND] Generalize at::manual_seed for all accelerators (#144370 ) # Additional Context This is a reland PR originated from eeb57394f93d720bca498c3fa9d167fc7b9cca46 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2025-01-14 06:09:36 +00:00
leslie-fang-intel	9d98b66e7b	[Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897 ) Summary In this PR, we enable the epilogues fusion and code generation for Grouped GEMM. Here is a high-level description of how we implement it. Fusion - The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM. - During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design. - In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles. Code Gen We maintain a list of epilogues and codegen them one by one. - If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the epilogue list. - If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list. TestPlan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897 Approved by: https://github.com/jansel, https://github.com/jgong5 ghstack dependencies: #143796	2025-01-14 06:07:50 +00:00
leslie-fang-intel	25de671ea8	[Inductor][CPP] Enable Grouped GEMM Template (#143796 ) Summary Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012 - Support flexible number of GEMMs - Share activation across GEMMs - The Grouped GEMM Template supports independent activations - However, the pattern matcher requires an anchor node, which is as the shared activation across GEMMs - Each GEMM can have a unique weight but same sizes - Each GEMM can have a unique bias or None - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR - Each GEMM have its own epilogues - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear ``` Example Here is the example and generated code ``` batch_size = 4 in_features = 512 out_features = 1024 dtype = torch.bfloat16 class M(torch.nn.Module): def __init__(self, bias): super().__init__() self.linear0 = torch.nn.Linear(in_features, out_features, bias=False) self.linear1 = torch.nn.Linear(in_features, out_features, bias=False) def forward(self, x): return self.linear0(x), self.linear1(x) if __name__ == "__main__": with torch.no_grad(): input = torch.randn(batch_size, in_features, dtype=dtype) m = M(bias=bias).to(dtype=dtype).eval() cm = torch.compile(m) act_res = cm(input) ``` Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py Next Step - Support Epilogue fusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796 Approved by: https://github.com/jgong5, https://github.com/jansel	2025-01-14 05:59:07 +00:00
Davide Italiano	35b46a75f1	[mps/inductor] Add support for `round()` (#144731 ) With this change, inductor/test_view_on_aliased passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144731 Approved by: https://github.com/malfet	2025-01-14 05:56:13 +00:00
Jagadish Krishnamoorthy	17e05cde0c	ROCm: Skip tests in elastic/utils/distributed_test (#144692 ) The tests are failing on ROCm machines due to the below error. The client socket has timed out after 1000ms while trying to connect to (gpu4f67.jax.cs.cpe.ice.amd.com, 0) Disabling the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144692 Approved by: https://github.com/jeffdaily	2025-01-14 03:49:06 +00:00
James Wu	e58c823ab8	Implement increment and add_to_set for CompileEventLogger (#143427 ) This diff implements `increment` and `add_to_set`, which are features of MetricsContext, but not ChromiumEventLogger. This allows us to add a bunch of other metricscontext callsites to use CompileEventLogger instead. Differential Revision: [D67354867](https://our.internmc.facebook.com/intern/diff/D67354867/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143427 Approved by: https://github.com/masnesral	2025-01-14 02:42:49 +00:00
Nikita Shulga	6053242890	[CD] Enable python3.13t builds for aarch64 (#144698 ) But make sure that right numpy version is picked (2.0.2 does not support 3.13) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698 Approved by: https://github.com/atalman ghstack dependencies: #144696, #144697, #144716	2025-01-14 02:29:01 +00:00
Huy Do	b221f88fc1	Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704 ) This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214. After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time. So, let's try @malfet's suggestion of removing the prefix altogether to keep it simple. After this lands, I will circle back to see if there are any improvements. Otherwise, it's still a simple BE change I guess. Here is the query I'm using to gather build time data for reference: ``` with jobs as ( select id, name, DATE_DIFF('minute', created_at, completed_at) as duration, DATE_TRUNC('week', created_at) as bucket from workflow_job where name like '%/ build' and html_url like concat('%', {repo: String }, '%') and conclusion = 'success' and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS) ), aggregated_jobs_in_bucket as ( select --groupArray(duration) as durations, --quantiles(0.9)(duration), avg(duration), bucket from jobs group by bucket ) select * from aggregated_jobs_in_bucket order by bucket desc ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704 Approved by: https://github.com/clee2000	2025-01-14 02:19:38 +00:00
Yiming Zhou	6d56277682	[export] Fix torchbind constant folding (#144684 ) Summary: `CallTorchBind` should not be folded during constant folding Test Plan: ``` buck2 run mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding_torchbind ``` Reviewed By: henryoier Differential Revision: D67721272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144684 Approved by: https://github.com/zhxchen17	2025-01-14 01:58:44 +00:00
Nikita Shulga	eaa8a97b39	[RelEng] Add `--ami` option to build_aarch64 (#144685 ) Which should be mutually exclusive with OS. For example, one can use the following to allocate a one-off instance ``` ./build_aarch64_wheel.py --alloc-instance --instance-type g5.4xlarge --key-name nshulga-key --ami ami-0f51103893c02957c --ebs-size 200 ``` TODO: - Figure out EBS volume name depending on the AMI (for `ami-05576a079321f21f8` (al2023) it's `/dev/xvda`, but for `ami-0f51103893c02957c` (deep learning container) it's `/dev/sda1`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144685 Approved by: https://github.com/atalman	2025-01-14 01:48:27 +00:00
Davide Italiano	de9d6a25d7	[mps/inductor] Add support for `ceil` (#144715 ) inductor/test_index_dynamic_shapes passes after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144715 Approved by: https://github.com/malfet	2025-01-14 01:16:47 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
PyTorch MergeBot	dfe06e555d	Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 )" This reverts commit dcc04e9237292de10e9cedd8213253e253b1e91c. Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/kit1980 due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/144441 ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2588515018))	2025-01-14 00:46:48 +00:00
Nikita Shulga	58302c4eaa	[BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716 ) As it's incompatible with 3.13t and only used to fetch the branch name, which could be done by running ``` git rev-parse --abbrev-ref HEAD ``` Also, remove yet another reference to long gone `master` branch. Test plan: Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside docker container and check it's version ``` # pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl ... Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2 root@434f2540345e:/# python Python 3.11.9 (main, Aug 1 2024, 23:33:10) [GCC 12.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> torch.__version__ '2.7.0.dev20250113+cpu' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716 Approved by: https://github.com/atalman ghstack dependencies: #144696, #144697	2025-01-14 00:43:46 +00:00
Aaron Orenstein	dcc04e9237	Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483 Approved by: https://github.com/Skylion007	2025-01-13 23:19:44 +00:00
atalman	c15d6508bd	Binary builds Docker images - remove cuda 12.1 (#144575 ) Remove cuda 12.1 from manylinux, libtorch and almalinux builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/144575 Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet, https://github.com/Skylion007	2025-01-13 22:44:59 +00:00
PyTorch MergeBot	4f74864c94	Revert "[AOTI] Add a boxed_run API (#142213 )" This reverts commit 868984c3e324dedeac04cf10e2bbfbf912dac3b1. Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))	2025-01-13 22:43:47 +00:00
Animesh Jain	a54a784b82	[dynamo][dicts] Consolidate dict(..) construction (#144342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342 Approved by: https://github.com/StrongerXi	2025-01-13 22:24:56 +00:00
bobrenjc93	0373cd9950	remove allow-untyped-defs from torch/distributed/checkpoint/api.py (#144653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144653 Approved by: https://github.com/Skylion007	2025-01-13 21:57:19 +00:00
Richard Barnes	1dab79470d	c10::string_view -> std::string_view in pytorch (#143591 ) Test Plan: Sandcastle Differential Revision: D67312322 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591 Approved by: https://github.com/malfet	2025-01-13 21:44:05 +00:00
Huy Do	5129d6ef51	Fix inductor periodic smoke test wrong artifact (#144694 ) I'm not entirely sure why this failure starts to show up in periodic since Friday https://github.com/pytorch/pytorch/actions/runs/12716967189/job/35463656803. The artifact was uploaded to S3, but `use-gha: anything-non-empty-to-use-gh` was set and it was working. Maybe this is related to https://github.com/pytorch/pytorch/issues/144479 I also clean up the GCP/AWS A100 selection logic as the GCP cluster doesn't exist anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144694 Approved by: https://github.com/clee2000	2025-01-13 21:42:39 +00:00
Shangdi Yu	e15f91337b	[inductor] Add unbacked symints binding in ShapeProp (#144605 ) Summary: ShapeProp doesn't know how to propagate unbacked. Patch it up to propagate unbacked symints like PropagateUnbackedSymInts. Test Plan: ``` buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_shape_prop_unbacked_sym ``` Differential Revision: D68050073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144605 Approved by: https://github.com/guowentian, https://github.com/pianpwk	2025-01-13 21:30:20 +00:00
Catherine Lee	3c55669b88	Enable grep_linter to use -a (#144589 ) Lintrunner can only apply changes (-a) if only one suggestion is made per file. The grep_linter makes a suggestion for every line it finds incorrect, so it creates multiple suggestions per file if there are multiple lines that it wants to change. This sets the `line` parameter of the LintMessage to None for all of grep_linter, but I'm not sure if that entry did anything. I'm not sure if enabling -a is the best idea, since it's currently used for tabs and tab width might differ each time? I had one instance where running with -a caused the spacing to change. On the other hand, -a would have already worked if only one line was bad. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144589 Approved by: https://github.com/huydhn	2025-01-13 21:18:24 +00:00
Aaron Gokaslan	91dbd7b75c	[BE]: Improve typing inference with TypeIs (#144682 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144682 Approved by: https://github.com/albanD Co-authored-by: Aaron Orenstein <aorenste@meta.com>	2025-01-13 21:14:31 +00:00
Ryan Guo	4ceca4d60f	[dynamo] Avoid graph break on updates to `obj.__dict__` (#144419 ) `obj.__dict__` is handled specially in Dynamo, and prior to this patch we only support read and membership check on that dictionary object. This patch adds support for writes and some documentation. Fixes #143756. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-01-13 21:04:10 +00:00
Bin Bao	684d015c2f	[AOTI] Support _int_mm (#144571 ) Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015 Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571 Approved by: https://github.com/yushangdi	2025-01-13 20:32:29 +00:00
Nikhil Gupta	b7f95df65b	[Feat]: Add Multithreading support for kleidiai groupwise GEMM kernels (#144074 ) KleidiAI Groupwise GEMM Kernel was not 2D Blocked. This change adds support for 2D blocking of the GEMM kernel to efficiently split the workload and speed up the GEMM kernel over multiple threads. Performance improvements: 7B model Pre-fill speedup from 145 t/s to 175 t/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144074 Approved by: https://github.com/digantdesai	2025-01-13 20:32:23 +00:00
Mwiza Kunda	5a2e8fce9d	Fix block pointer test module for triton CPU and add to CI (#144474 ) - Fix for BlockPointerTestBase._discontiguous_tensor. It defaults to constructing CUDA tensors, causing a failure if CUDA is not available. - Add test module to CI to prevent errors like the above from occurring. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144474 Approved by: https://github.com/jansel	2025-01-13 20:25:05 +00:00
bobrenjc93	80c286cbec	remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655 Approved by: https://github.com/StrongerXi	2025-01-13 20:03:25 +00:00
bobrenjc93	18deff0262	remove allow-untyped-defs from torch/ao/nn/intrinsic/__init__.py (#144652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144652 Approved by: https://github.com/Skylion007	2025-01-13 19:36:08 +00:00
Nikita Shulga	d44c3906b8	[EZ] [CD] Add 3.13 to FULL_PYTHON_VERSIONS (#144697 ) Separation was necessary for Conda codegen, but now it's gone Pull Request resolved: https://github.com/pytorch/pytorch/pull/144697 Approved by: https://github.com/atalman, https://github.com/izaitsevfb ghstack dependencies: #144696	2025-01-13 19:12:12 +00:00
Nikita Shulga	d2f905760d	[EZ] [CD] Eliminate stale TODO (#144696 ) As 3.13 has been enabled across the board, which one can verify by running `./github/regenerate.sh` and observing that none of the configs have changed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144696 Approved by: https://github.com/izaitsevfb, https://github.com/atalman	2025-01-13 19:12:12 +00:00
bobrenjc93	cd477cdd1d	remove allow-untyped-defs from torch/ao/nn/quantized/reference/modules/linear.py (#144656 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144656 Approved by: https://github.com/Skylion007	2025-01-13 19:03:05 +00:00
bobrenjc93	f93d786f73	remove allow-untyped-defs from torch/nn/parameter.pyi (#144654 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144654 Approved by: https://github.com/Skylion007	2025-01-13 19:02:31 +00:00
Randolf Scholz	983bf604e5	ReshapeTransform: added missing argument in docstring (#144401 ) See https://github.com/pytorch/pytorch/pull/144197#discussion_r1907336339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144401 Approved by: https://github.com/janeyx99, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-13 17:59:59 +00:00
George Wigley	fe8c5c7a2d	Update the Triton DeviceInterface in test/inductor/extension_backends/triton/device_interface.py (#144399 ) Following the changes to how `DeviceInterface` is used in this [PR](https://github.com/pytorch/pytorch/pull/142033), the `DeviceInterface` in `extension_backend/triton/device_interface.py` should be updated to return the `DeviceProperties` instead of raising a NotImplementedError. This PR mirrors the [changes](https://github.com/pytorch/pytorch/pull/142033/files#diff-06553e25e48e1d60f3030458bc46d52067d3d0c3eef2d5fcea29f7e8126bd7c9L112-R114) made in Dynamo when the PR landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144399 Approved by: https://github.com/jansel	2025-01-13 17:19:58 +00:00
Xuehai Pan	bee84e88f8	[BE][Easy] improve submodule discovery for `torch.ao` type annotations (#144680 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144680 Approved by: https://github.com/Skylion007	2025-01-13 17:16:19 +00:00
Nikita Shulga	c40d917182	[MPSInductor] Fix maximum/minimum for int types (#144665 ) `metal::isnan` is only defined for floats, so provide a generic wrapper that is false for integral types. TODO: Figure out why type propagation is not working (or should it?) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665 Approved by: https://github.com/dcci	2025-01-13 15:14:01 +00:00
Isuru Fernando	8633845090	Support nanj in inductor (#144064 ) Fixes https://github.com/pytorch/pytorch/issues/144029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-13 14:29:38 +00:00
Davide Italiano	417354d953	[mps/inductor] Add support for truncdiv(). (#144666 ) Two other inductor tests pass after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144666 Approved by: https://github.com/malfet	2025-01-13 13:39:38 +00:00
Nikita Shulga	7e2239f1f0	[MPSInductor] Better error when kernel fails to compile (#144649 ) Now error message looks as follows: ``` % python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call [] stats [('calls_captured', 6)] inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)] ERROR ====================================================================== ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper method(args, kwargs) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test return value(self) File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d self.common( File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu check_model( File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model actual = run(example_inputs, *kwargs) File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner raise InductorError(e, currentframe()).with_traceback( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner mb_compiled_graph = 
fx_codegen_and_compile( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile compiled_fn = graph.compile_to_module().call File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module return self._compile_to_module() File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module mod = PyCodeCache.load_by_key_path( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path mod = _reload_python_module(key, path) File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module> File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader torch._inductor.exc.InductorError: SyntaxError: failed to compile kernel void generated_kernel( device float out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { long x1 = (xindex) / (3); auto tmp0 = x1; auto tmp1 = static_cast<long>(tmp0); auto tmp2 = 0; auto tmp3 = tmp1 >= tmp2; auto tmp4 = 2; auto tmp5 = tmp1 < tmp4; long x0 = (xindex) % (3); auto tmp6 = in_ptr0[x0 + 3*(x1)]; auto tmp7 = tmp5 ? tmp6 : 0.0; auto tmp8 = tmp1 >= tmp4; auto tmp9 = 2 + ks0; auto tmp10 = static_cast<long>(tmp9); auto tmp11 = tmp1 < tmp10; auto tmp12 = 1.0; auto tmp13 = tmp8 ? tmp12 : 0.0; auto tmp14 = tmp5 ? 
tmp7 : tmp13; long x2 = xindex; out_ptr0[x2] = static_cast<float>(tmp14); } with program_source:18:25: error: use of undeclared identifier 'ks0' auto tmp9 = 2 + ks0; ^ Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.472s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649 Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci ghstack dependencies: #144647, #144648	2025-01-13 13:38:03 +00:00
PyTorch UpdateBot	a85d1ee106	Update slow tests (#144670 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144670 Approved by: https://github.com/pytorchbot	2025-01-13 12:06:22 +00:00

5579 changed files with 181863 additions and 338373 deletions

									
.ci/aarch64_linux/aarch64_ci_build.sh (9 lines changed)
												
				@@ -3,8 +3,11 @@ set -eux -o pipefail

				GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

				# cuda arm build for Grace Hopper solely

				export TORCH_CUDA_ARCH_LIST="9.0"

				if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="9.0"

				elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"

				fi

				SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

				source $SCRIPTPATH/aarch64_ci_setup.sh

				@@ -17,7 +20,7 @@ cd /

				# on the mounted pytorch repo

				git config --global --add safe.directory /pytorch

				pip install -r /pytorch/requirements.txt

				pip install auditwheel

				pip install auditwheel==6.2.0

				if [ "$DESIRED_CUDA" = "cpu" ]; then

				    echo "BASE_CUDA_VERSION is not set. Building cpu wheel."

				    #USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
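
The first hunk of `aarch64_ci_build.sh` above appears to replace the unconditional `TORCH_CUDA_ARCH_LIST="9.0"` export with a choice keyed on a substring of `GPU_ARCH_VERSION`. The same selection can be sketched in Python for illustration. The function name `select_cuda_arch_list` is hypothetical, and returning `None` when neither version matches is an assumption based only on the lines visible in the hunk; the CI script itself remains bash.

```python
from typing import Optional


def select_cuda_arch_list(gpu_arch_version: str) -> Optional[str]:
    """Mirror the bash `[[ "$GPU_ARCH_VERSION" == *"12.6"* ]]` substring checks."""
    if "12.6" in gpu_arch_version:
        return "9.0"
    elif "12.8" in gpu_arch_version:
        # CUDA 12.8 builds add the newer architectures listed in the hunk.
        return "9.0;10.0;12.0"
    # Assumption: no value is exported when neither branch matches.
    return None


for version in ("12.6", "12.8", ""):
    print(repr(version), "->", select_cuda_arch_list(version))
```

Because the checks are substring matches, any `GPU_ARCH_VERSION` value that merely contains `12.6` or `12.8` selects the corresponding list.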

									
.ci/aarch64_linux/aarch64_ci_setup.sh (6 lines changed)
												
				@@ -5,16 +5,14 @@ set -eux -o pipefail

				# By creating symlinks from desired /opt/python to /usr/local/bin/

				NUMPY_VERSION=2.0.2

				PYGIT2_VERSION=1.15.1

				if [[ "$DESIRED_PYTHON"  == "3.13" ]]; then

				if [[ "$DESIRED_PYTHON"  == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then

				    NUMPY_VERSION=2.1.2

				    PYGIT2_VERSION=1.16.0

				fi

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

				source $SCRIPTPATH/../manywheel/set_desired_python.sh

				pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2 pygit2==${PYGIT2_VERSION}

				pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2

				for tool in python python3 pip pip3 ninja scons patchelf; do

				    ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;
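
The hunk above extends the NumPy pin override to the free-threaded `3.13t` interpreter and, judging by the matching removal of the `pygit2` import in `aarch64_wheel_ci_build.py`, drops the `pygit2` pin from the `pip install` line. A minimal Python sketch of the resulting pin selection follows; the helper name `pip_pins` is hypothetical, and the version numbers are copied from the hunk.

```python
def pip_pins(desired_python: str) -> list[str]:
    """Return the pinned requirements for a given DESIRED_PYTHON value."""
    numpy_version = "2.0.2"
    # Both the regular and the free-threaded 3.13 builds use the newer NumPy.
    if desired_python in ("3.13", "3.13t"):
        numpy_version = "2.1.2"
    return [
        f"numpy=={numpy_version}",
        "pyyaml==6.0.2",
        "scons==4.7.0",
        "ninja==1.11.1",
        "patchelf==0.17.2",
    ]


print(" ".join(pip_pins("3.13t")))
```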

									
.ci/aarch64_linux/aarch64_wheel_ci_build.py (37 lines changed)
												
				@@ -4,12 +4,9 @@

				import os

				import shutil

				from subprocess import check_call, check_output

				from typing import List

				from pygit2 import Repository

				def list_dir(path: str) -> List[str]:

				def list_dir(path: str) -> list[str]:

				    """'

				    Helper for getting paths for Python

				    """

				@@ -42,7 +39,7 @@ def build_ArmComputeLibrary() -> None:

				            "clone",

				            "https://github.com/ARM-software/ComputeLibrary.git",

				            "-b",

				            "v24.09",

				            "v25.02",

				            "--depth",

				            "1",

				            "--shallow-submodules",

				@@ -58,7 +55,7 @@ def build_ArmComputeLibrary() -> None:

				        shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")

				def update_wheel(wheel_path) -> None:

				def update_wheel(wheel_path, desired_cuda) -> None:

				    """

				    Update the cuda wheel libraries

				    """

				@@ -80,7 +77,6 @@ def update_wheel(wheel_path) -> None:

				        "/usr/local/cuda/lib64/libnvToolsExt.so.1",

				        "/usr/local/cuda/lib64/libnvJitLink.so.12",

				        "/usr/local/cuda/lib64/libnvrtc.so.12",

				        "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",

				        "/usr/local/cuda/lib64/libcudnn_adv.so.9",

				        "/usr/local/cuda/lib64/libcudnn_cnn.so.9",

				        "/usr/local/cuda/lib64/libcudnn_graph.so.9",

				@@ -100,6 +96,18 @@ def update_wheel(wheel_path) -> None:

				            "/usr/local/lib/libnvpl_lapack_core.so.0",

				            "/usr/local/lib/libnvpl_blas_core.so.0",

				        ]

				        if "126" in desired_cuda:

				            libs_to_copy += [

				                "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",

				                "/usr/local/cuda/lib64/libcufile.so.0",

				                "/usr/local/cuda/lib64/libcufile_rdma.so.1",

				            ]

				        elif "128" in desired_cuda:

				            libs_to_copy += [

				                "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",

				                "/usr/local/cuda/lib64/libcufile.so.0",

				                "/usr/local/cuda/lib64/libcufile_rdma.so.1",

				            ]

				    else:

				        libs_to_copy += [

				            "/opt/OpenBLAS/lib/libopenblas.so.0",

				@@ -171,22 +179,22 @@ if __name__ == "__main__":

				    args = parse_arguments()

				    enable_mkldnn = args.enable_mkldnn

				    enable_cuda = args.enable_cuda

				    repo = Repository("/pytorch")

				    branch = repo.head.name

				    if branch == "HEAD":

				        branch = "master"

				    branch = check_output(

				        ["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"

				    ).decode()

				    print("Building PyTorch wheel")

				    build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "

				    os.system("cd /pytorch; python setup.py clean")

				    override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")

				    desired_cuda = os.getenv("DESIRED_CUDA")

				    if override_package_version is not None:

				        version = override_package_version

				        build_vars += (

				            f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "

				        )

				    elif branch in ["nightly", "master"]:

				    elif branch in ["nightly", "main"]:

				        build_date = (

				            check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")

				            .decode()

				@@ -196,12 +204,11 @@ if __name__ == "__main__":

				            check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]

				        )

				        if enable_cuda:

				            desired_cuda = os.getenv("DESIRED_CUDA")

				            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "

				        else:

				            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "

				    elif branch.startswith(("v1.", "v2.")):

				        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "

				        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "

				    if enable_mkldnn:

				        build_ArmComputeLibrary()

				@@ -225,6 +232,6 @@ if __name__ == "__main__":

				        print("Updating Cuda Dependency")

				        filename = os.listdir("/pytorch/dist/")

				        wheel_path = f"/pytorch/dist/{filename[0]}"

				        update_wheel(wheel_path)

				        update_wheel(wheel_path, desired_cuda)

				    pytorch_wheel_name = complete_wheel("/pytorch/")

				    print(f"Build Complete. Created {pytorch_wheel_name}..")
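
The `__main__` hunks above swap the `pygit2` branch lookup for `git rev-parse --abbrev-ref HEAD` and then derive `PYTORCH_BUILD_VERSION` from the branch name. The sketch below restates that decision logic as pure functions so it can be exercised without a checkout. The names `normalize_branch` and `build_version` are hypothetical, and `base_version` and `build_date` are passed in rather than read from `version.txt` and `git log`. One assumption is worth flagging: `git rev-parse` output ends with a newline, so the sketch strips it before comparing, whereas the hunk as shown calls only `.decode()`.

```python
from typing import Optional


def normalize_branch(raw: bytes) -> str:
    """Decode `git rev-parse --abbrev-ref HEAD` output and drop the trailing newline."""
    return raw.decode().strip()


def build_version(
    branch: str,
    base_version: str,
    build_date: str,
    enable_cuda: bool = False,
    desired_cuda: Optional[str] = None,
    override: Optional[str] = None,
) -> Optional[str]:
    """Return the PYTORCH_BUILD_VERSION value, or None if no rule applies."""
    if override is not None:
        # OVERRIDE_PACKAGE_VERSION takes priority over any branch rule.
        return override
    if branch in ["nightly", "main"]:
        if enable_cuda:
            return f"{base_version}.dev{build_date}+{desired_cuda}"
        return f"{base_version}.dev{build_date}"
    if branch.startswith(("v1.", "v2.")):
        # Drop the leading "v" and everything from the first "-" onward.
        # If the name has no "-", find() returns -1 and the slice drops
        # the last character instead.
        return branch[1 : branch.find("-")]
    return None


branch = normalize_branch(b"main\n")
print(build_version(branch, "2.8.0", "20250404", enable_cuda=True, desired_cuda="cu128"))
print(build_version("v2.7.0-rc1", "2.7.0", "20250404"))
```

Without the `strip()`, the decoded value would be `"main\n"`, which is not a member of `["nightly", "main"]`, so the nightly rule would not fire.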

									
.ci/aarch64_linux/build_aarch64_wheel.py (62 lines changed)
												
				@@ -12,22 +12,22 @@ import os

				import subprocess

				import sys

				import time

				from typing import Dict, List, Optional, Tuple, Union

				from typing import Optional, Union

				import boto3

				# AMI images for us-east-1, change the following based on your ~/.aws/config

				os_amis = {

				    "ubuntu18_04": "ami-078eece1d8119409f",  # login_name: ubuntu

				    "ubuntu20_04": "ami-052eac90edaa9d08f",  # login_name: ubuntu

				    "ubuntu22_04": "ami-0c6c29c5125214c77",  # login_name: ubuntu

				    "redhat8": "ami-0698b90665a2ddcf1",  # login_name: ec2-user

				}

				ubuntu18_04_ami = os_amis["ubuntu18_04"]

				ubuntu20_04_ami = os_amis["ubuntu20_04"]

				def compute_keyfile_path(key_name: Optional[str] = None) -> Tuple[str, str]:

				def compute_keyfile_path(key_name: Optional[str] = None) -> tuple[str, str]:

				    if key_name is None:

				        key_name = os.getenv("AWS_KEY_NAME")

				        if key_name is None:

				@@ -57,7 +57,7 @@ def ec2_instances_by_id(instance_id):

				def start_instance(

				    key_name, ami=ubuntu18_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50

				    key_name, ami=ubuntu20_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50

				):

				    inst = ec2.create_instances(

				        ImageId=ami,

				@@ -96,7 +96,7 @@ class RemoteHost:

				        self.keyfile_path = keyfile_path

				        self.login_name = login_name

				    def _gen_ssh_prefix(self) -> List[str]:

				    def _gen_ssh_prefix(self) -> list[str]:

				        return [

				            "ssh",

				            "-o",

				@@ -108,13 +108,13 @@ class RemoteHost:

				        ]

				    @staticmethod

				    def _split_cmd(args: Union[str, List[str]]) -> List[str]:

				    def _split_cmd(args: Union[str, list[str]]) -> list[str]:

				        return args.split() if isinstance(args, str) else args

				    def run_ssh_cmd(self, args: Union[str, List[str]]) -> None:

				    def run_ssh_cmd(self, args: Union[str, list[str]]) -> None:

				        subprocess.check_call(self._gen_ssh_prefix() + self._split_cmd(args))

				    def check_ssh_output(self, args: Union[str, List[str]]) -> str:

				    def check_ssh_output(self, args: Union[str, list[str]]) -> str:

				        return subprocess.check_output(

				            self._gen_ssh_prefix() + self._split_cmd(args)

				        ).decode("utf-8")

				@@ -157,7 +157,7 @@ class RemoteHost:

				    def using_docker(self) -> bool:

				        return self.container_id is not None

				    def run_cmd(self, args: Union[str, List[str]]) -> None:

				    def run_cmd(self, args: Union[str, list[str]]) -> None:

				        if not self.using_docker():

				            return self.run_ssh_cmd(args)

				        assert self.container_id is not None

				@@ -178,7 +178,7 @@ class RemoteHost:

				        if rc != 0:

				            raise subprocess.CalledProcessError(rc, docker_cmd)

				    def check_output(self, args: Union[str, List[str]]) -> str:

				    def check_output(self, args: Union[str, list[str]]) -> str:

				        if not self.using_docker():

				            return self.check_ssh_output(args)

				        assert self.container_id is not None

				@@ -230,7 +230,7 @@ class RemoteHost:

				            )

				        self.download_file(remote_file, local_file)

				    def list_dir(self, path: str) -> List[str]:

				    def list_dir(self, path: str) -> list[str]:

				        return self.check_output(["ls", "-1", path]).split("\n")

				@@ -327,7 +327,7 @@ def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None

				        ]

				    )

				    host.run_cmd(

				        f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v24.09 {git_clone_flags}"

				        f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"

				    )

				    host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")

				@@ -358,7 +358,7 @@ def checkout_repo(

				    branch: str = "main",

				    url: str,

				    git_clone_flags: str,

				    mapping: Dict[str, Tuple[str, str]],

				    mapping: dict[str, tuple[str, str]],

				) -> Optional[str]:

				    for prefix in mapping:

				        if not branch.startswith(prefix):

				@@ -657,18 +657,6 @@ def configure_system(

				            "sudo apt-get install -y python3-dev python3-yaml python3-setuptools python3-wheel python3-pip"

				        )

				    host.run_cmd("pip3 install dataclasses typing-extensions")

				    # Install and switch to gcc-8 on Ubuntu-18.04

				    if not host.using_docker() and host.ami == ubuntu18_04_ami and compiler == "gcc-8":

				        host.run_cmd("sudo apt-get install -y g++-8 gfortran-8")

				        host.run_cmd(

				            "sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 100"

				        )

				        host.run_cmd(

				            "sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 100"

				        )

				        host.run_cmd(

				            "sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-8 100"

				        )

				    if not use_conda:

				        print("Installing Cython + numpy from PyPy")

				        host.run_cmd("sudo pip3 install Cython")

				@@ -681,7 +669,7 @@ def build_domains(

				    branch: str = "main",

				    use_conda: bool = True,

				    git_clone_flags: str = "",

				) -> Tuple[str, str, str, str]:

				) -> tuple[str, str, str, str]:

				    vision_wheel_name = build_torchvision(

				        host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags

				    )

				@@ -708,7 +696,7 @@ def start_build(

				    pytorch_build_number: Optional[str] = None,

				    shallow_clone: bool = True,

				    enable_mkldnn: bool = False,

				) -> Tuple[str, str, str, str, str]:

				) -> tuple[str, str, str, str, str]:

				    git_clone_flags = " --depth 1 --shallow-submodules" if shallow_clone else ""

				    if host.using_docker() and not use_conda:

				        print("Auto-selecting conda option for docker images")

				@@ -759,7 +747,7 @@ def start_build(

				        version = host.check_output("cat pytorch/version.txt").strip()[:-2]

				        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1"

				    if branch.startswith(("v1.", "v2.")):

				        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1"

				        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				    if enable_mkldnn:

				@@ -932,9 +920,9 @@ def parse_arguments():

				    parser.add_argument("--debug", action="store_true")

				    parser.add_argument("--build-only", action="store_true")

				    parser.add_argument("--test-only", type=str)

				    parser.add_argument(

				        "--os", type=str, choices=list(os_amis.keys()), default="ubuntu20_04"

				    )

				    group = parser.add_mutually_exclusive_group()

				    group.add_argument("--os", type=str, choices=list(os_amis.keys()))

				    group.add_argument("--ami", type=str)

				    parser.add_argument(

				        "--python-version",

				        type=str,

				@@ -964,7 +952,13 @@ def parse_arguments():

				if __name__ == "__main__":

				    args = parse_arguments()

				    ami = os_amis[args.os]

				    ami = (

				        args.ami

				        if args.ami is not None

				        else os_amis[args.os]

				        if args.os is not None

				        else ubuntu20_04_ami

				    )

				    keyfile_path, key_name = compute_keyfile_path(args.key_name)

				    if args.list_instances:

				@@ -1018,7 +1012,7 @@ if __name__ == "__main__":

				        install_condaforge_python(host, args.python_version)

				        sys.exit(0)

				    python_version = args.python_version if args.python_version is not None else "3.8"

				    python_version = args.python_version if args.python_version is not None else "3.9"

				    if args.use_torch_from_pypi:

				        configure_system(host, compiler=args.compiler, python_version=python_version)
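
The `parse_arguments` and `__main__` hunks above make `--os` and `--ami` mutually exclusive and resolve the AMI with a chained conditional: an explicit `--ami` wins, then `--os`, then the Ubuntu 20.04 default. The sketch below isolates that precedence. The helper names `build_parser` and `resolve_ami` are hypothetical, the AMI IDs are the ones shown in the first hunk of this file, and `"ami-custom"` is a placeholder value used only for the demonstration.

```python
import argparse

# AMI table as it appears in the diff (us-east-1 images).
os_amis = {
    "ubuntu20_04": "ami-052eac90edaa9d08f",
    "ubuntu22_04": "ami-0c6c29c5125214c77",
    "redhat8": "ami-0698b90665a2ddcf1",
}
ubuntu20_04_ami = os_amis["ubuntu20_04"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # argparse rejects command lines that pass both --os and --ami.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--os", type=str, choices=list(os_amis.keys()))
    group.add_argument("--ami", type=str)
    return parser


def resolve_ami(args: argparse.Namespace) -> str:
    """Same precedence as the chained conditional in the hunk, written as early returns."""
    if args.ami is not None:
        return args.ami
    if args.os is not None:
        return os_amis[args.os]
    return ubuntu20_04_ami


parser = build_parser()
for argv in ([], ["--os", "redhat8"], ["--ami", "ami-custom"]):
    print(argv, "->", resolve_ami(parser.parse_args(argv)))
```

Dropping the `default="ubuntu20_04"` from `--os` is what lets the code tell "no `--os` given" apart from an explicit choice, so the fallback has to move into the resolution step.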

									
.ci/docker/almalinux/Dockerfile (2 lines changed)
												
				@@ -44,6 +44,8 @@ FROM base as cuda

				ARG CUDA_VERSION=12.4

				RUN rm -rf /usr/local/cuda-*

				ADD ./common/install_cuda.sh install_cuda.sh

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}

				# Preserve CUDA_VERSION for the builds

				ENV CUDA_VERSION=${CUDA_VERSION}

									
.ci/docker/build.sh (205 lines changed)
												
				@@ -1,4 +1,8 @@

				#!/bin/bash

				# The purpose of this script is to:

				# 1. Extract the set of parameters to be used for a docker build based on the provided image name.

				# 2. Run docker build with the parameters found in step 1.

				# 3. Run the built image and print out the expected and actual versions of packages installed.

				set -ex

				@@ -86,32 +90,21 @@ CMAKE_VERSION=3.18.5

				_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb

				_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				if [[ "$image" == *rocm* ]]; then

				  _UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6

				  _UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d

				fi

				# It's annoying to rename jobs every time you want to rewrite a

				# configuration, so we hardcode everything here rather than do it

				# from scratch

				case "$image" in

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.1

				  pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				@@ -125,37 +118,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				@@ -170,7 +132,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				@@ -185,7 +146,61 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6.3

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    PROTOBUF=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				@@ -200,21 +215,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				@@ -226,7 +226,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    ONNX=yes

				@@ -235,10 +234,7 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    VULKAN_SDK_VERSION=1.2.162.1

				    SWIFTSHADER=yes

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				@@ -246,10 +242,7 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.11

				    CLANG_VERSION=10

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    VULKAN_SDK_VERSION=1.2.162.1

				    SWIFTSHADER=yes

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				@@ -257,38 +250,42 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-1-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.2.4

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.3

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-xpu-2024.0-py3)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    XPU_VERSION=0.5

				    NINJA_VERSION=1.9.0

				@@ -299,7 +296,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    XPU_VERSION=2025.0

				    NINJA_VERSION=1.9.0

				@@ -310,7 +306,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    CONDA_CMAKE=yes

				@@ -324,7 +319,6 @@ case "$image" in

				    CUDNN_VERSION=9

				    CLANG_VERSION=12

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    TRITON=yes

				    ;;

				@@ -332,7 +326,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=12

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    TRITON=yes

				@@ -353,7 +346,6 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    CONDA_CMAKE=yes

				@@ -368,7 +360,7 @@ case "$image" in

				    EXECUTORCH=yes

				    ;;

				  pytorch-linux-jammy-py3.12-halide)

				    CUDA_VERSION=12.4

				    CUDA_VERSION=12.6

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				@@ -376,7 +368,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.12-triton-cpu)

				    CUDA_VERSION=12.4

				    CUDA_VERSION=12.6

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				@@ -386,20 +378,19 @@ case "$image" in

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

				    # We will need to update mypy version eventually, but that's for another day. The task

				    # would be to upgrade mypy to 1.0.0 with Python 3.11

				    ANACONDA_PYTHON_VERSION=3.9

				    CONDA_CMAKE=yes

				    PYTHON_VERSION=3.9

				    PIP_CMAKE=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)

				    ANACONDA_PYTHON_VERSION=3.9

				    PYTHON_VERSION=3.9

				    CUDA_VERSION=11.8

				    CONDA_CMAKE=yes

				    PIP_CMAKE=yes

				    ;;

				  pytorch-linux-jammy-aarch64-py3.10-gcc11)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    ACL=yes

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    # snadampal: skipping llvm src build install because the current version

				@@ -411,7 +402,6 @@ case "$image" in

				    GCC_VERSION=11

				    ACL=yes

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    # snadampal: skipping llvm src build install because the current version

				@@ -422,7 +412,6 @@ case "$image" in

				  *)

				    # Catch-all for builds that are not hardcoded.

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    echo "image '$image' did not match an existing build configuration"

				    if [[ "$image" == *py* ]]; then

				@@ -471,14 +460,21 @@ if [[ "$image" == *cuda*  && ${OS} == "ubuntu" ]]; then

				  fi

				fi

				no_cache_flag=""

				progress_flag=""

				# Do not use cache and progress=plain when in CI

				if [[ -n "${CI:-}" ]]; then

				  no_cache_flag="--no-cache"

				  progress_flag="--progress=plain"

				fi

				# Build image

				docker build \

				       --no-cache \

				       --progress=plain \

				       ${no_cache_flag} \

				       ${progress_flag} \

				       --build-arg "BUILD_ENVIRONMENT=${image}" \

				       --build-arg "PROTOBUF=${PROTOBUF:-}" \

				       --build-arg "LLVMDEV=${LLVMDEV:-}" \

				       --build-arg "DB=${DB:-}" \

				       --build-arg "VISION=${VISION:-}" \

				       --build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \

				       --build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \

				@@ -486,13 +482,12 @@ docker build \

				       --build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \

				       --build-arg "CLANG_VERSION=${CLANG_VERSION}" \

				       --build-arg "ANACONDA_PYTHON_VERSION=${ANACONDA_PYTHON_VERSION}" \

				       --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \

				       --build-arg "GCC_VERSION=${GCC_VERSION}" \

				       --build-arg "CUDA_VERSION=${CUDA_VERSION}" \

				       --build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \

				       --build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \

				       --build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \

				       --build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \

				       --build-arg "SWIFTSHADER=${SWIFTSHADER}" \

				       --build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \

				       --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \

				       --build-arg "KATEX=${KATEX:-}" \

				@@ -502,6 +497,7 @@ docker build \

				       --build-arg "UCX_COMMIT=${UCX_COMMIT}" \

				       --build-arg "UCC_COMMIT=${UCC_COMMIT}" \

				       --build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \

				       --build-arg "PIP_CMAKE=${PIP_CMAKE}" \

				       --build-arg "TRITON=${TRITON}" \

				       --build-arg "TRITON_CPU=${TRITON_CPU}" \

				       --build-arg "ONNX=${ONNX}" \

				@@ -527,7 +523,7 @@ docker build \

				UBUNTU_VERSION=$(echo ${UBUNTU_VERSION} | sed 's/-rc$//')

				function drun() {

				  docker run --rm "$tmp_tag" $*

				  docker run --rm "$tmp_tag" "$@"

				}

				if [[ "$OS" == "ubuntu" ]]; then

				@@ -575,3 +571,14 @@ if [ -n "$KATEX" ]; then

				    exit 1

				  fi

				fi

				HAS_TRITON=$(drun python -c "import triton" > /dev/null 2>&1 && echo "yes" || echo "no")

				if [[ -n "$TRITON" || -n "$TRITON_CPU" ]]; then

				  if [ "$HAS_TRITON" = "no" ]; then

				    echo "expecting triton to be installed, but it is not"

				    exit 1

				  fi

				elif [ "$HAS_TRITON" = "yes" ]; then

				  echo "expecting triton to not be installed, but it is"

				  exit 1

				fi
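The Triton check that closes the build script above enforces agreement in both directions: an image that requested Triton must contain it, and an image that did not request it must not. A minimal sketch of that decision as a standalone function follows. `check_triton` and its two "yes"/"no" arguments are hypothetical names introduced for illustration; the real script derives them from `$TRITON`, `$TRITON_CPU`, and a `drun python -c "import triton"` probe.

```shell
# Hypothetical helper, not part of the build script.
# $1: "yes" if Triton was requested (TRITON or TRITON_CPU non-empty), else "no"
# $2: "yes" if `import triton` succeeded inside the image, else "no"
check_triton() {
  local wanted="$1" installed="$2"
  if [ "$wanted" = "yes" ] && [ "$installed" = "no" ]; then
    echo "expecting triton to be installed, but it is not"
    return 1
  fi
  if [ "$wanted" = "no" ] && [ "$installed" = "yes" ]; then
    echo "expecting triton to not be installed, but it is"
    return 1
  fi
  # Requested and present, or not requested and absent.
  return 0
}
```

Keeping the decision in one function makes all four combinations easy to exercise without building an image.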

									
.ci/docker/centos-rocm/Dockerfile
												
				@@ -55,13 +55,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi

				RUN rm install_protobuf.sh

				ENV INSTALLED_PROTOBUF ${PROTOBUF}

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				COPY ./common/install_db.sh install_db.sh

				RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				@@ -75,7 +68,7 @@ COPY ./common/install_rocm.sh install_rocm.sh

				RUN bash ./install_rocm.sh

				RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}

				RUN rm install_rocm_magma.sh

				COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

.ci/docker/ci_commit_pins/executorch.txt

 @@ -1 +1 @@
 a29b208a06ab378bb29ab1aa68932e412f8e09f1
 e487c24e1c20c3f4606c2d8aca2778873b00b4c

.ci/docker/ci_commit_pins/nccl-cu11.txt Normal file

 @@ -0,0 +1 @@
 v2.21.5-1

.ci/docker/ci_commit_pins/nccl-cu12.txt Normal file

 @@ -0,0 +1 @@
 v2.26.2-1
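The two new pin files move the NCCL tag out of the CUDA install scripts and key it by CUDA major version. `install_nccl.sh` itself is not shown in this excerpt, so the following is only a sketch of how such a lookup could work; `nccl_pinned_tag` and its directory argument are hypothetical.

```shell
# Hypothetical helper: print the NCCL tag pinned for a given CUDA version.
# $1: directory holding nccl-cu<major>.txt files
# $2: CUDA version such as 11.8 or 12.6.3
nccl_pinned_tag() {
  local pins_dir="$1" cuda_version="$2"
  local major="${cuda_version%%.*}"   # "12.6.3" -> "12"
  local pin="${pins_dir}/nccl-cu${major}.txt"
  if [ ! -f "$pin" ]; then
    echo "no NCCL pin for CUDA ${cuda_version}" >&2
    return 1
  fi
  # Strip the trailing newline and any stray whitespace from the pin file.
  tr -d '[:space:]' < "$pin"
}
```

The printed tag could then be handed to something like `git clone -b "$tag" --depth 1 https://github.com/NVIDIA/nccl.git`, which is the clone command the removed inline build steps used.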

.ci/docker/ci_commit_pins/timm.txt

 @@ -1 +1 @@
 ac3470188b914c5d7a5058a7e28b9eb685a62427
 d535d7a2d4b435b1b5c1177fd8f04a12b942b9a

.ci/docker/ci_commit_pins/triton-xpu.txt

 @@ -1 +1 @@
 e98b6fcb8df5b44eb0d0addb6767c573d37ba024
 bcc8265e677e5321606a3311bf71470f14456a8

.ci/docker/ci_commit_pins/triton.txt

 @@ -1 +1 @@
 d4682f073ded4d1a8260dd4208a43d735ae3a2b
 ce50fade7e209553aba4898cd9b82aab83b

									
.ci/docker/common/install_acl.sh
												
				@@ -1,6 +1,6 @@

				set -euo pipefail

				readonly version=v24.04

				readonly version=v25.02

				readonly src_host=https://github.com/ARM-software

				readonly src_repo=ComputeLibrary

									
.ci/docker/common/install_base.sh
												
				@@ -32,8 +32,12 @@ install_ubuntu() {

				  # HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes

				  # See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729

				  # TODO: Eliminate this hack, we should not rely on apt-get installation

				  # See https://github.com/pytorch/pytorch/issues/144768

				  if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then

				    maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"

				  elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then

				    maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"

				  else

				    maybe_libnccl_dev=""

				  fi
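The branch above pins `libnccl2` and `libnccl-dev` to a specific build for two Ubuntu 20.04 and CUDA combinations and otherwise installs nothing extra. The same selection can be sketched as a single `case` over both values. `libnccl_apt_args` is a hypothetical name; the version strings are copied from the hunk.

```shell
# Hypothetical helper: print the extra apt-get arguments for libnccl,
# or nothing when no pin applies.
# $1: Ubuntu version (e.g. 20.04), $2: CUDA version (e.g. 12.4.1)
libnccl_apt_args() {
  local ubuntu="$1" cuda="$2" pin=""
  case "${ubuntu}:${cuda}" in
    20.04*:11.8*) pin="2.15.5-1+cuda11.8" ;;
    20.04*:12.4*) pin="2.26.2-1+cuda12.4" ;;
  esac
  if [ -n "$pin" ]; then
    echo "libnccl2=${pin} libnccl-dev=${pin} --allow-downgrades --allow-change-held-packages"
  fi
}
```

Adding another combination then means adding one `case` arm rather than another `elif` with two repeated version strings.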

									
.ci/docker/common/install_cache.sh
												
				@@ -9,7 +9,7 @@ install_ubuntu() {

				  # Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh`

				  apt-get install -y cargo

				  echo "Checking out sccache repo"

				  git clone https://github.com/mozilla/sccache -b v0.9.0

				  git clone https://github.com/mozilla/sccache -b v0.9.1

				  cd sccache

				  echo "Building sccache"

				  cargo build --release

									
.ci/docker/common/install_clang.sh
												
				@@ -4,16 +4,10 @@ set -ex

				if [ -n "$CLANG_VERSION" ]; then

				  if [[ $CLANG_VERSION == 9 && $UBUNTU_VERSION == 18.04 ]]; then

				    sudo apt-get update

				    # gpg-agent is not available by default on 18.04

				    sudo apt-get install  -y --no-install-recommends gpg-agent

				    wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add  -

				    apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-${CLANG_VERSION} main"

				  elif [[ $UBUNTU_VERSION == 22.04 ]]; then

				  if [[ $UBUNTU_VERSION == 22.04 ]]; then

				    # work around ubuntu apt-get conflicts

				    sudo apt-get -y -f install

				    wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add  -

				    wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -

				    if [[ $CLANG_VERSION == 18 ]]; then

				      apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"

				    fi

				@@ -41,7 +35,7 @@ if [ -n "$CLANG_VERSION" ]; then

				  # clang's packaging is a little messed up (the runtime libs aren't

				  # added into the linker path), so give it a little help

				  clang_lib=("/usr/lib/llvm-$CLANG_VERSION/lib/clang/"*"/lib/linux")

				  echo "$clang_lib" > /etc/ld.so.conf.d/clang.conf

				  echo "$clang_lib" >/etc/ld.so.conf.d/clang.conf

				  ldconfig

				  # Cleanup package manager
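The `clang_lib=(...)` line above stores a glob expansion in a bash array, and `echo "$clang_lib"` then writes only the first element, since an unindexed array name expands to element 0. That is presumably fine when exactly one directory matches. The sketch below, which needs bash and uses a throwaway directory tree instead of `/usr/lib/llvm-*`, shows the behaviour and one way to write every match.

```shell
#!/usr/bin/env bash
# Throwaway layout with two matching directories (paths are made up).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/clang/12/lib/linux" "$demo_root/clang/12.0.1/lib/linux"

# The glob expands to one array element per matching directory.
matches=("$demo_root/clang/"*"/lib/linux")

# An unindexed array name is the same as element 0.
first="$matches"
count="${#matches[@]}"

# Writing every match, one per line, instead of only the first.
printf '%s\n' "${matches[@]}" > "$demo_root/clang.conf"
```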

									
.ci/docker/common/install_conda.sh
												
				@@ -62,11 +62,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then

				  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30

				  # which is provided in libstdcxx 12 and up.

				  conda_install libstdcxx-ng=12.3.0 -c conda-forge

				  conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge

				  # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README

				  if [[ $(uname -m) == "aarch64" ]]; then

				    conda_install "openblas==0.3.28=*openmp*"

				    conda_install "openblas==0.3.29=*openmp*"

				  else

				    conda_install "mkl=2021.4.0 mkl-include=2021.4.0"

				  fi
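The conda hunk above chooses the BLAS package by machine architecture: OpenBLAS with OpenMP on aarch64, MKL elsewhere. A small sketch of that choice as a function follows. `blas_conda_specs` is a hypothetical name, and the optional argument exists only so the aarch64 branch can be exercised on any host.

```shell
# Hypothetical helper: print the conda package specs for the BLAS backend.
# $1: machine architecture; defaults to the output of `uname -m`
blas_conda_specs() {
  local arch="${1:-$(uname -m)}"
  if [ "$arch" = "aarch64" ]; then
    # The build-string selector keeps the OpenMP variant of OpenBLAS.
    echo "openblas==0.3.29=*openmp*"
  else
    echo "mkl=2021.4.0 mkl-include=2021.4.0"
  fi
}
```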

									
.ci/docker/common/install_cpython.sh
												
				@@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea

				GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py

				# Python versions to be installed in /opt/$VERSION_NO

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}

				function check_var {

				    if [ -z "$1" ]; then

									
.ci/docker/common/install_cuda.sh
												
				@@ -2,7 +2,6 @@

				set -ex

				NCCL_VERSION=v2.21.5-1

				CUDNN_VERSION=9.5.1.17

				function install_cusparselt_040 {

				@@ -16,17 +15,6 @@ function install_cusparselt_040 {

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_052 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_062 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				@@ -51,7 +39,7 @@ function install_cusparselt_063 {

				function install_118 {

				    CUDNN_VERSION=9.1.0.70

				    echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"

				    echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"

				    rm -rf /usr/local/cuda-11.8 /usr/local/cuda

				    # install CUDA 11.8.0 in the same container

				    wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run

				@@ -69,56 +57,16 @@ function install_118 {

				    cd ..

				    rm -rf tmp_cudnn

				    # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				    # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				    git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				    cd nccl && make -j src.build

				    cp -a build/include/* /usr/local/cuda/include/

				    cp -a build/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf nccl

				    CUDA_VERSION=11.8 bash install_nccl.sh

				    install_cusparselt_040

				    ldconfig

				}

				function install_121 {

				    echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				    rm -rf /usr/local/cuda-12.1 /usr/local/cuda

				    # install CUDA 12.1.0 in the same container

				    wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run

				    chmod +x cuda_12.1.1_530.30.02_linux.run

				    ./cuda_12.1.1_530.30.02_linux.run --toolkit --silent

				    rm -f cuda_12.1.1_530.30.02_linux.run

				    rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn && cd tmp_cudnn

				    wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf tmp_cudnn

				    # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				    # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				    git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				    cd nccl && make -j src.build

				    cp -a build/include/* /usr/local/cuda/include/

				    cp -a build/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf nccl

				    install_cusparselt_052

				    ldconfig

				}

				function install_124 {

				  CUDNN_VERSION=9.1.0.70

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run

				@@ -136,14 +84,7 @@ function install_124 {

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  CUDA_VERSION=12.4 bash install_nccl.sh

				  install_cusparselt_062

				@@ -151,7 +92,7 @@ function install_124 {

				}

				function install_126 {

				  echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"

				  echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.6 /usr/local/cuda

				  # install CUDA 12.6.3 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run

				@@ -169,14 +110,7 @@ function install_126 {

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  CUDA_VERSION=12.6 bash install_nccl.sh

				  install_cusparselt_063

				@@ -214,37 +148,6 @@ function prune_118 {

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/

				}

				function prune_121 {

				  echo "Pruning CUDA 12.1"

				  #####################################################################################

				  # CUDA 12.1 prune static libs

				  #####################################################################################

				    export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"

				    export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"

				    export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    if [[ -n "$OVERRIDE_GENCODE" ]]; then

				        export GENCODE=$OVERRIDE_GENCODE

				    fi

				    # all CUDA libs except CuDNN and CuBLAS

				    ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				    # prune CuDNN and CuBLAS

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				    #####################################################################################

				    # CUDA 12.1 prune visual tools

				    #####################################################################################

				    export CUDA_BASE="/usr/local/cuda-12.1/"

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				@@ -313,18 +216,45 @@ function prune_126 {

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				function install_128 {

				  CUDNN_VERSION=9.8.0.87

				  echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.8 /usr/local/cuda

				  # install CUDA 12.8.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run

				  chmod +x cuda_12.8.0_570.86.10_linux.run

				  ./cuda_12.8.0_570.86.10_linux.run --toolkit --silent

				  rm -f cuda_12.8.0_570.86.10_linux.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  CUDA_VERSION=12.8 bash install_nccl.sh

				  install_cusparselt_063

				  ldconfig

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    11.8) install_118; prune_118

				        ;;

				    12.1) install_121; prune_121

				        ;;

				    12.4) install_124; prune_124

				        ;;

				    12.6) install_126; prune_126

				        ;;

				    12.8) install_128;

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac
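The argument loop above maps each CUDA version on the command line to an install step and, for most versions, a prune step. Because this view carries no +/- markers, the reading that 12.1 is dropped and that 12.8 is added without a prune step is an inference from the removed `install_121` and `prune_121` functions. A sketch of the resulting dispatch, with stub functions standing in for the real installers:

```shell
# Stubs standing in for install_118, prune_118, and so on; they only log.
install_cuda_stub() { echo "install $1"; }
prune_cuda_stub() { echo "prune $1"; }

# Process every version given on the command line, in order.
dispatch_cuda() {
  while test $# -gt 0; do
    case "$1" in
      11.8|12.4|12.6) install_cuda_stub "$1"; prune_cuda_stub "$1" ;;
      12.8) install_cuda_stub "$1" ;;
      *) echo "bad argument $1"; return 1 ;;
    esac
    shift
  done
}
```

For example, `dispatch_cuda 12.6 12.8` runs install and prune for 12.6 and install only for 12.8, while `dispatch_cuda 12.1` falls through to the error arm.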

									
.ci/docker/common/install_cuda_aarch64.sh
												
				@@ -3,19 +3,7 @@

				set -ex

				NCCL_VERSION=v2.21.5-1

				CUDNN_VERSION=9.5.1.17

				function install_cusparselt_062 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz

				    tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz

				    cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				CUDNN_VERSION=9.8.0.87

				function install_cusparselt_063 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				@@ -28,16 +16,15 @@ function install_cusparselt_063 {

				    rm -rf tmp_cusparselt

				}

				function install_124 {

				  CUDNN_VERSION=9.1.0.70

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run

				  chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run

				  ./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.4.1_550.54.15_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda

				function install_128 {

				  echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.8 /usr/local/cuda

				  # install CUDA 12.8.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run

				  chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run

				  ./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.8.0_570.86.10_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				@@ -48,125 +35,18 @@ function install_124 {

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_062

				  ldconfig

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				  # CUDA 12.4 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.4 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.4/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/

				}

				function install_126 {

				  echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.6 /usr/local/cuda

				  # install CUDA 12.6.3 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run

				  chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run

				  ./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.6.3_560.35.05_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  CUDA_VERSION=12.8 bash install_nccl.sh

				  install_cusparselt_063

				  ldconfig

				}

				function prune_126 {

				  echo "Pruning CUDA 12.6"

				  #####################################################################################

				  # CUDA 12.6 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then

				      export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.6 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.6/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    12.4) install_124; prune_124

				        ;;

				    12.6) install_126; prune_126

				        ;;

				    12.8) install_128;

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

									
										4

.ci/docker/common/install_cudnn.sh
									
												
				@ -4,7 +4,9 @@ if [[ -n "${CUDNN_VERSION}" ]]; then

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn

				    pushd tmp_cudnn

				    if [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then

				    if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
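The version checks in this hunk amount to a lookup from a CUDA version prefix to a cuDNN archive name. A minimal sketch of that lookup follows. The function name `cudnn_archive_name` and the error branch are hypothetical, and because the hunk is cut off before any remaining branches, only the three archive names visible above are covered. It uses `case` patterns instead of the script's `${CUDA_VERSION:0:4}` substring tests so that it also runs under plain `sh`.

```shell
# Hypothetical helper (not part of install_cudnn.sh): map a CUDA version
# string to the cuDNN archive name shown in the hunk above.
cudnn_archive_name() {
  case "$1" in
    12.8*) echo "cudnn-linux-x86_64-9.8.0.87_cuda12-archive" ;;
    12.6*) echo "cudnn-linux-x86_64-9.5.1.17_cuda12-archive" ;;
    12*)   echo "cudnn-linux-x86_64-9.1.0.70_cuda12-archive" ;;
    *)
      # Assumption: the sketch rejects anything it does not know about.
      echo "no cuDNN archive known for CUDA $1" >&2
      return 1
      ;;
  esac
}

cudnn_archive_name "12.8.1"
```

Order matters here just as in the `if`/`elif` chain: the more specific `12.8` and `12.6` patterns must come before the generic `12` fallback.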

									
										20

.ci/docker/common/install_cusparselt.sh
									
												
				@ -5,7 +5,15 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				@ -13,17 +21,11 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then

				    CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz

				else

				    echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"

				fi

				tar xf ${CUSPARSELT_NAME}.tar.xz
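The branch structure above can be condensed into one function that picks the cuDNN-style archive name for cuSPARSELt from the CUDA version and the target architecture. This is a sketch under one stated interpretation of the hunk, since the +/- markers are lost: CUDA 12.5 to 12.8 selects 0.6.3.2, 12.4 selects 0.6.2.3, 11.8 selects the x86_64-only 0.4.0.7, and the 12.1 branch is treated as removed. The function name `cusparselt_archive_name` is hypothetical. Unlike the script's `else` branch, which only prints a message, the sketch returns a non-zero status for an unknown version.

```shell
# Hypothetical helper (not part of install_cusparselt.sh).
# $1: CUDA version, $2: target architecture (defaults to `uname -m`).
cusparselt_archive_name() {
  cuda_version=$1
  target_arch=${2:-$(uname -m)}

  # amd64/x86_64 use the x86_64 redist directory; everything else is sbsa.
  case "$target_arch" in
    amd64|x86_64) arch_path=x86_64 ;;
    *)            arch_path=sbsa ;;
  esac

  case "$cuda_version" in
    12.[5-8]*) version=0.6.3.2 ;;
    12.4*)     version=0.6.2.3 ;;
    11.8*)     version=0.4.0.7; arch_path=x86_64 ;;  # x86_64 only in the script
    *)
      echo "Not sure which libcusparselt version to install for ${cuda_version}" >&2
      return 1
      ;;
  esac

  echo "libcusparse_lt-linux-${arch_path}-${version}-archive"
}

cusparselt_archive_name "12.8" "x86_64"
```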

									
										38

.ci/docker/common/install_db.sh
									
											
				@ -1,38 +0,0 @@

				#!/bin/bash

				set -ex

				install_ubuntu() {

				  apt-get update

				  # Cleanup

				  apt-get autoclean && apt-get clean

				  rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

				}

				install_centos() {

				  # Need EPEL for many packages we depend on.

				  # See http://fedoraproject.org/wiki/EPEL

				  yum --enablerepo=extras install -y epel-release

				  # Cleanup

				  yum clean all

				  rm -rf /var/cache/yum

				  rm -rf /var/lib/yum/yumdb

				  rm -rf /var/lib/yum/history

				}

				# Install base packages depending on the base OS

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  ubuntu)

				    install_ubuntu

				    ;;

				  centos)

				    install_centos

				    ;;

				  *)

				    echo "Unable to determine OS..."

				    exit 1

				    ;;

				esac

									
										12

.ci/docker/common/install_executorch.sh
									
												
				@ -37,7 +37,12 @@ install_conda_dependencies() {

				install_pip_dependencies() {

				  pushd executorch

				  as_jenkins bash install_requirements.sh --pybind xnnpack

				  as_jenkins bash install_executorch.sh

				  # A workaround, ExecuTorch has moved to numpy 2.0 which is not compatible with the current

				  # numba and scipy version used in PyTorch CI

				  conda_run pip uninstall -y numba scipy

				  popd

				}

				@ -45,10 +50,9 @@ setup_executorch() {

				  pushd executorch

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  as_jenkins .ci/scripts/setup-linux.sh cmake || true

				  as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true

				  popd

				}

									
										4

.ci/docker/common/install_halide.sh
									
												
				@ -35,7 +35,9 @@ git clone https://github.com/halide/Halide.git

				pushd Halide

				git checkout ${COMMIT} && git submodule update --init --recursive

				pip_install -r requirements.txt

				cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build

				# NOTE: pybind has a requirement for cmake > 3.5 so set the minimum cmake version here with a flag

				#       Context: https://github.com/pytorch/pytorch/issues/150420

				cmake -G Ninja -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_BUILD_TYPE=Release -S . -B build

				cmake --build build

				test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3

				cmake --install build --prefix ${CONDA_PREFIX}

									
										6

.ci/docker/common/install_linter.sh
									
												
				@ -2,8 +2,6 @@

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				if [ -n "${UBUNTU_VERSION}" ]; then

				  apt update

				  apt-get install -y clang doxygen git graphviz nodejs npm libtinfo5

				@ -15,8 +13,8 @@ chown -R jenkins pytorch

				pushd pytorch

				# Install all linter dependencies

				pip_install -r requirements.txt

				conda_run lintrunner init

				pip install -r requirements.txt

				lintrunner init

				# Cache .lintbin directory as part of the Docker image

				cp -r .lintbin /tmp

									
										26

.ci/docker/common/install_nccl.sh
									
										Normal file
									
												
				@ -0,0 +1,26 @@

				#!/bin/bash

				set -ex

				NCCL_VERSION=""

				if [[ ${CUDA_VERSION:0:2} == "11" ]]; then

				  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)

				elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)

				else

				  echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"

				  exit 1

				fi

				if [[ -n "${NCCL_VERSION}" ]]; then

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  pushd nccl

				  make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  popd

				  rm -rf nccl

				  ldconfig

				fi
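The new install_nccl.sh keys the NCCL tag on the CUDA major version only, reading it from a pin file. The selection step can be sketched in isolation as below. The function name `nccl_pin_for` and the optional second argument for the pin directory are hypothetical additions for illustration; the file names `nccl-cu11.txt` and `nccl-cu12.txt` come from the script.

```shell
# Hypothetical helper (not part of install_nccl.sh): print the pinned NCCL
# tag for a CUDA version by reading the matching pin file.
# $1: CUDA version, $2: directory holding the pin files (assumed default).
nccl_pin_for() {
  cuda_version=$1
  pins_dir=${2:-ci_commit_pins}
  case "$cuda_version" in
    11*) cat "${pins_dir}/nccl-cu11.txt" ;;
    12*) cat "${pins_dir}/nccl-cu12.txt" ;;
    *)
      echo "Unexpected CUDA_VERSION ${cuda_version}" >&2
      return 1
      ;;
  esac
}
```

Because only the major version is inspected, any `12.x` value resolves to the same pin file, so callers can pass the full toolkit version unchanged.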

									
										9

.ci/docker/common/install_ninja.sh
									
												
				@ -4,10 +4,15 @@ set -ex

				[ -n "$NINJA_VERSION" ]

				url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"

				arch=$(uname -m)

				if [ "$arch" == "aarch64" ]; then

				    url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux-aarch64.zip"

				else

				    url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"

				fi

				pushd /tmp

				wget --no-verbose --output-document=ninja-linux.zip "$url"

				unzip ninja-linux.zip -d /usr/local/bin

				rm -f ninja-linux.zip

				popd

				popd
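The architecture switch added to install_ninja.sh can be expressed as a small function that returns the download URL. This is a sketch only: the function name `ninja_url` and the optional architecture argument are hypothetical, while the two URL patterns are the ones in the hunk above.

```shell
# Hypothetical helper (not part of install_ninja.sh).
# $1: ninja version, $2: machine architecture (defaults to `uname -m`).
ninja_url() {
  version=$1
  arch=${2:-$(uname -m)}
  base="https://github.com/ninja-build/ninja/releases/download/v${version}"
  if [ "$arch" = "aarch64" ]; then
    echo "${base}/ninja-linux-aarch64.zip"
  else
    echo "${base}/ninja-linux.zip"
  fi
}

ninja_url "1.11.1" "aarch64"
```

The version `1.11.1` in the call is only an example value; the script takes it from `NINJA_VERSION`.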

									
										6

.ci/docker/common/install_onnx.sh
									
												
				@ -31,15 +31,15 @@ pip_install \

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.18.1

				pip_install onnx==1.16.2

				pip_install onnxscript==0.1.0.dev20241124 --no-deps

				pip_install onnx==1.17.0

				pip_install onnxscript==0.2.2 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

				IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"

				as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"

				as_jenkins echo 'import transformers; transformers.GPTJForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gptj");' > "${IMPORT_SCRIPT_FILENAME}"

				# Need a PyTorch version for transformers to work

				pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

									
										2

.ci/docker/common/install_openblas.sh
									
												
				@ -4,7 +4,7 @@

				set -ex

				cd /

				git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules

				git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules

				OPENBLAS_BUILD_FLAGS="

									
										18

.ci/docker/common/install_python.sh
									
										Normal file
									
												
				@ -0,0 +1,18 @@

				#!/bin/bash

				set -ex

				apt-get update

				# Use deadsnakes in case we need an older python version

				sudo add-apt-repository ppa:deadsnakes/ppa

				apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python3-pip python${PYTHON_VERSION}-venv

				# Use a venv because uv and some other package managers don't support --user install

				ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

				python -m venv /var/lib/jenkins/ci_env

				source /var/lib/jenkins/ci_env/bin/activate

				python -mpip install --upgrade pip

				python -mpip install -r /opt/requirements-ci.txt

				if [ -n "${PIP_CMAKE}" ]; then

				  python -mpip install cmake==3.31.6

				fi

									
										4

.ci/docker/common/install_rocm.sh
									
												
				@ -8,10 +8,6 @@ ver() {

				install_ubuntu() {

				    apt-get update

				    if [[ $UBUNTU_VERSION == 18.04 ]]; then

				      # gpg-agent is not available by default on 18.04

				      apt-get install -y --no-install-recommends gpg-agent

				    fi

				    if [[ $UBUNTU_VERSION == 20.04 ]]; then

				      # gpg-agent is not available by default on 20.04

				      apt-get install -y --no-install-recommends gpg-agent

									
										6

.ci/docker/common/install_rocm_drm.sh
									
												
				@ -25,7 +25,9 @@ python3 -m pip install meson ninja

				###########################

				### clone repo

				###########################

				GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git

				# TEMPORARY FIX: https://gitlab.freedesktop.org/mesa/drm.git is down until 2025/03/22

				# GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git

				GIT_SSL_NO_VERIFY=true git clone git://anongit.freedesktop.org/mesa/drm

				pushd drm

				###########################

				@ -115,7 +117,7 @@ index a5007ffc..13fa07fc 100644

				 	if (!fp) {

				-		fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,

				-			strerror(errno));

				+		fprintf(stderr, "amdgpu.ids: No such file or directory\n");

				+		//fprintf(stderr, "amdgpu.ids: No such file or directory\n");

				 		return;

				 	}

									
										68

.ci/docker/common/install_rocm_magma.sh
									
												
				@ -1,50 +1,28 @@

				#!/bin/bash

				# Script used in CI and CD pipeline

				#!/usr/bin/env bash

				# Script used only in CD pipeline

				set -ex

				set -eou pipefail

				# Magma build scripts need `python`

				ln -sf /usr/bin/python3 /usr/bin/python

				function do_install() {

				    rocm_version=$1

				    rocm_version_nodot=${1//./}

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  almalinux)

				    yum install -y gcc-gfortran

				    ;;

				  *)

				    echo "No preinstalls to build magma..."

				    ;;

				esac

				    # Version 2.7.2 + ROCm related updates

				    MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6

				    magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"

				MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}

				    rocm_dir="/opt/rocm"

				    (

				        set -x

				        tmp_dir=$(mktemp -d)

				        pushd ${tmp_dir}

				        curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}

				        tar -xvf "${magma_archive}"

				        mkdir -p "${rocm_dir}/magma"

				        mv include "${rocm_dir}/magma/include"

				        mv lib "${rocm_dir}/magma/lib"

				        popd

				    )

				}

				# "install" hipMAGMA into /opt/rocm/magma by copying after build

				git clone https://bitbucket.org/icl/magma.git

				pushd magma

				# Version 2.7.2 + ROCm related updates

				git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6

				cp make.inc-examples/make.inc.hip-gcc-mkl make.inc

				echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

				if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then

				    echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc

				fi

				echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc

				echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc

				export PATH="${PATH}:/opt/rocm/bin"

				if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

				  amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`

				else

				  amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`

				fi

				for arch in $amdgpu_targets; do

				  echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc

				done

				# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition

				sed -i 's/^FOPENMP/#FOPENMP/g' make.inc

				make -f make.gen.hipMAGMA -j $(nproc)

				LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"

				make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"

				popd

				mv magma /opt/rocm

				do_install $1
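The `do_install` function in this diff derives a prebuilt archive name from the ROCm version by stripping the dots. A minimal sketch of just that naming step follows. The function name `magma_archive_name` is hypothetical, and `tr -d '.'` stands in for the script's bash-only `${1//./}` expansion so the sketch also runs under plain `sh`. The commit hash default is the `MAGMA_VERSION` value from the diff.

```shell
# Hypothetical helper (not part of install_rocm_magma.sh).
# $1: ROCm version such as "6.3", $2: MAGMA commit (defaults to the pinned one).
magma_archive_name() {
  rocm_version=$1
  magma_version=${2:-a1625ff4d9bc362906bd01f805dbbe12612953f6}
  # "6.3" becomes "63", matching ${1//./} in the script.
  rocm_version_nodot=$(printf '%s' "$rocm_version" | tr -d '.')
  echo "magma-rocm${rocm_version_nodot}-${magma_version}-1.tar.bz2"
}

magma_archive_name "6.3"
```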

									
										24

.ci/docker/common/install_swiftshader.sh
									
											
				@ -1,24 +0,0 @@

				#!/bin/bash

				set -ex

				[ -n "${SWIFTSHADER}" ]

				retry () {

				    $*  || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)

				}

				_https_amazon_aws=https://ossci-android.s3.amazonaws.com

				# SwiftShader

				_swiftshader_dir=/var/lib/jenkins/swiftshader

				_swiftshader_file_targz=swiftshader-abe07b943-prebuilt.tar.gz

				mkdir -p $_swiftshader_dir

				_tmp_swiftshader_targz="/tmp/${_swiftshader_file_targz}"

				curl --silent --show-error --location --fail --retry 3 \

				  --output "${_tmp_swiftshader_targz}" "$_https_amazon_aws/${_swiftshader_file_targz}"

				tar -C "${_swiftshader_dir}" -xzf "${_tmp_swiftshader_targz}"

				export VK_ICD_FILENAMES="${_swiftshader_dir}/build/Linux/vk_swiftshader_icd.json"

									
										18

.ci/docker/common/install_triton.sh
									
												
				@ -2,6 +2,12 @@

				set -ex

				mkdir -p /opt/triton

				if [ -z "${TRITON}" ] && [ -z "${TRITON_CPU}" ]; then

				  echo "TRITON and TRITON_CPU are not set. Exiting..."

				  exit 0

				fi

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				get_conda_version() {

				@ -52,6 +58,7 @@ cd triton

				as_jenkins git checkout ${TRITON_PINNED_COMMIT}

				as_jenkins git submodule update --init --recursive

				cd python

				pip_install pybind11==2.13.6

				# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

				as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py

				@ -60,17 +67,22 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"

				  # Triton needs at least gcc-9 to build

				  apt-get install -y g++-9

				  CXX=g++-9 pip_install -e .

				  CXX=g++-9 conda_run python setup.py bdist_wheel

				elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then

				  # Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain

				  add-apt-repository -y ppa:ubuntu-toolchain-r/test

				  apt-get install -y g++-9

				  CXX=g++-9 pip_install -e .

				  CXX=g++-9 conda_run python setup.py bdist_wheel

				else

				  pip_install -e .

				  conda_run python setup.py bdist_wheel

				fi

				# Copy the wheel to /opt for multi stage docker builds

				cp dist/*.whl /opt/triton

				# Install the wheel for docker builds that don't use multi stage

				pip_install dist/*.whl

				if [ -n "${CONDA_CMAKE}" ]; then

				  # TODO: This is to make sure that the same cmake and numpy version from install conda

				  # script is used. Without this step, the newer cmake version (3.25.2) downloaded by

									
										26

.ci/docker/common/install_ucc.sh
									
												
				@ -8,6 +8,12 @@ else

				  with_cuda=no

				fi

				if [[ -d "/opt/rocm" ]]; then

				  with_rocm=/opt/rocm

				else

				  with_rocm=no

				fi

				function install_ucx() {

				  set -ex

				  git clone --recursive https://github.com/openucx/ucx.git

				@ -19,6 +25,7 @@ function install_ucx() {

				  ./configure --prefix=$UCX_HOME      \

				      --enable-mt                     \

				      --with-cuda=$with_cuda          \

				      --with-rocm=$with_rocm          \

				      --enable-profiling              \

				      --enable-stats

				  time make -j

				@ -36,12 +43,29 @@ function install_ucc() {

				  git submodule update --init --recursive

				  ./autogen.sh

				  # We only run distributed tests on Tesla M60 and A10G

				  NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"

				  if [[ -n "$ROCM_VERSION" ]]; then

				    if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

				      amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`

				    else

				      amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`

				    fi

				    for arch in $amdgpu_targets; do

				      HIP_OFFLOAD="$HIP_OFFLOAD --offload-arch=$arch"

				    done

				  else

				    HIP_OFFLOAD="all-arch-no-native"

				  fi

				  ./configure --prefix=$UCC_HOME          \

				    --with-ucx=$UCX_HOME                  \

				    --with-cuda=$with_cuda                \

				    --with-nvcc-gencode="${NVCC_GENCODE}"

				    --with-nvcc-gencode="${NVCC_GENCODE}" \

				    --with-rocm=$with_rocm                \

				    --with-rocm-arch="${HIP_OFFLOAD}"

				  time make -j

				  sudo make install
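The ROCm branch added to `install_ucc` turns a semicolon-separated `PYTORCH_ROCM_ARCH` list into a string of `--offload-arch` flags. The transformation can be sketched on its own as below. The function name `hip_offload_flags` is hypothetical, and the sketch differs from the script in two ways: it uses `tr` instead of `sed 's/;/ /g'`, and it omits the leading space that the script's `"$HIP_OFFLOAD --offload-arch=$arch"` concatenation produces.

```shell
# Hypothetical helper (not part of install_ucc.sh).
# $1: semicolon-separated GPU targets, e.g. "gfx90a;gfx942".
hip_offload_flags() {
  flags=""
  # Unquoted expansion is intentional: split the list on whitespace.
  for arch in $(printf '%s' "$1" | tr ';' ' '); do
    # ${flags:+...} adds the separating space only after the first flag.
    flags="${flags:+$flags }--offload-arch=$arch"
  done
  echo "$flags"
}

hip_offload_flags "gfx90a;gfx942"
```

The result is meant to be passed as a single quoted argument, as the script does with `--with-rocm-arch="${HIP_OFFLOAD}"`.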

									
										24

.ci/docker/common/install_vulkan_sdk.sh
									
											
				@ -1,24 +0,0 @@

				#!/bin/bash

				set -ex

				[ -n "${VULKAN_SDK_VERSION}" ]

				retry () {

				    $*  || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)

				}

				_vulkansdk_dir=/var/lib/jenkins/vulkansdk

				_tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz

				curl \

				  --silent \

				  --show-error \

				  --location \

				  --fail \

				  --retry 3 \

				  --output "${_tmp_vulkansdk_targz}" "https://ossci-android.s3.amazonaws.com/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"

				mkdir -p "${_vulkansdk_dir}"

				tar -C "${_vulkansdk_dir}" -xzf "${_tmp_vulkansdk_targz}" --strip-components 1

				rm -rf "${_tmp_vulkansdk_targz}"

									
										15

.ci/docker/libtorch/Dockerfile
									
												
				@ -49,6 +49,8 @@ RUN bash ./install_mkl.sh && rm install_mkl.sh

				FROM cpu as cuda

				ADD ./common/install_cuda.sh install_cuda.sh

				ADD ./common/install_magma.sh install_magma.sh

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				ENV CUDA_HOME /usr/local/cuda

				FROM cuda as cuda11.8

				@ -56,11 +58,6 @@ RUN bash ./install_cuda.sh 11.8

				RUN bash ./install_magma.sh 11.8

				RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda

				FROM cuda as cuda12.1

				RUN bash ./install_cuda.sh 12.1

				RUN bash ./install_magma.sh 12.1

				RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda

				FROM cuda as cuda12.4

				RUN bash ./install_cuda.sh 12.4

				RUN bash ./install_magma.sh 12.4

				@ -71,7 +68,13 @@ RUN bash ./install_cuda.sh 12.6

				RUN bash ./install_magma.sh 12.6

				RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda

				FROM cuda as cuda12.8

				RUN bash ./install_cuda.sh 12.8

				RUN bash ./install_magma.sh 12.8

				RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda

				FROM cpu as rocm

				ARG ROCM_VERSION

				ARG PYTORCH_ROCM_ARCH

				ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				ENV MKLROOT /opt/intel

				@ -90,7 +93,7 @@ RUN apt-get update -y && \

				    apt-get clean

				RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh

				FROM ${BASE_TARGET} as final

				COPY --from=openssl            /opt/openssl           /opt/openssl

									
										4

.ci/docker/libtorch/build.sh
									
												
				@ -39,8 +39,8 @@ case ${GPU_ARCH_TYPE} in

				        BASE_TARGET=rocm

				        DOCKER_TAG=rocm${GPU_ARCH_VERSION}

				        GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"

				        DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"

				        DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"

				        ;;

				    *)

				        echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"

									
										26

.ci/docker/linter-cuda/Dockerfile
									
												
				@ -18,28 +18,30 @@ COPY ./common/install_user.sh install_user.sh

				RUN bash ./install_user.sh && rm install_user.sh

				# Install conda and other packages (e.g., numpy, pytest)

				ARG ANACONDA_PYTHON_VERSION

				ARG CONDA_CMAKE

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH

				COPY requirements-ci.txt /opt/conda/requirements-ci.txt

				COPY ./common/install_conda.sh install_conda.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_magma_conda.sh install_magma_conda.sh

				RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt

				ARG PYTHON_VERSION

				ARG PIP_CMAKE

				# Put venv into the env vars so users don't need to activate it

				ENV PATH /var/lib/jenkins/ci_env/bin:$PATH

				ENV VIRTUAL_ENV /var/lib/jenkins/ci_env

				COPY requirements-ci.txt /opt/requirements-ci.txt

				COPY ./common/install_python.sh install_python.sh

				RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt

				# Install cuda and cudnn

				ARG CUDA_VERSION

				COPY ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*

				ENV DESIRED_CUDA ${CUDA_VERSION}

				ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

				# Note that Docker build forbids copying file outside the build context

				COPY ./common/install_linter.sh install_linter.sh

				COPY ./common/common_utils.sh common_utils.sh

				RUN bash ./install_linter.sh

				RUN rm install_linter.sh common_utils.sh

				RUN rm install_linter.sh

				RUN chown -R jenkins:jenkins /var/lib/jenkins/ci_env

				USER jenkins

				CMD ["bash"]

									
										18

.ci/docker/linter/Dockerfile
									
												
				@ -15,20 +15,18 @@ COPY ./common/install_user.sh install_user.sh

				RUN bash ./install_user.sh && rm install_user.sh

				# Install conda and other packages (e.g., numpy, pytest)

				ARG ANACONDA_PYTHON_VERSION

				ARG CONDA_CMAKE

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH

				COPY requirements-ci.txt /opt/conda/requirements-ci.txt

				COPY ./common/install_conda.sh install_conda.sh

				COPY ./common/common_utils.sh common_utils.sh

				RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt

				ARG PYTHON_VERSION

				ARG PIP_CMAKE

				ENV PATH /var/lib/jenkins/ci_env/bin:$PATH

				ENV VIRTUAL_ENV /var/lib/jenkins/ci_env

				COPY requirements-ci.txt /opt/requirements-ci.txt

				COPY ./common/install_python.sh install_python.sh

				RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt

				# Note that Docker build forbids copying file outside the build context

				COPY ./common/install_linter.sh install_linter.sh

				COPY ./common/common_utils.sh common_utils.sh

				RUN bash ./install_linter.sh

				RUN rm install_linter.sh common_utils.sh

				RUN rm install_linter.sh

				USER jenkins

				CMD ["bash"]

									
										6

.ci/docker/manywheel/Dockerfile
									
												
				@ -64,7 +64,9 @@ FROM base as cuda

				ARG BASE_CUDA_VERSION=10.2

				# Install CUDA

				ADD ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*

				FROM base as intel

				# MKL

				@ -195,6 +197,6 @@ RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

				# cmake3 is needed for the MIOpen build

				RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3

				ADD ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

153

.ci/docker/manywheel/Dockerfile_2014


 @ -1,153 +0,0 @@
 # syntax = docker/dockerfile:experimental
 ARG ROCM_VERSION=3.7
 ARG BASE_CUDA_VERSION=10.2
 ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
 FROM quay.io/pypa/manylinux2014_x86_64 as base
 ENV LC_ALL en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US.UTF-8
 RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
 RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
 RUN yum install -y yum-utils centos-release-scl sudo
 RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
 RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
 ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
 # cmake
 RUN yum install -y cmake3 && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 FROM base as openssl
 # Install openssl (this must precede `build python` step)
 # (In order to have a proper SSL module, Python is compiled
 # against a recent openssl [see env vars above], which is linked
 # statically. We delete openssl afterwards.)
 ADD ./common/install_openssl.sh install_openssl.sh
 RUN bash ./install_openssl.sh && rm install_openssl.sh
 # remove unnecessary python versions
 RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
 RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
 RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 FROM base as cuda
 ARG BASE_CUDA_VERSION=10.2
 # Install CUDA
 ADD ./common/install_cuda.sh install_cuda.sh
 RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
 FROM base as intel
 # MKL
 ADD ./common/install_mkl.sh install_mkl.sh
 RUN bash ./install_mkl.sh && rm install_mkl.sh
 FROM base as magma
 ARG BASE_CUDA_VERSION=10.2
 # Install magma
 ADD ./common/install_magma.sh install_magma.sh
 RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
 FROM base as jni
 # Install java jni header
 ADD ./common/install_jni.sh install_jni.sh
 ADD ./java/jni.h jni.h
 RUN bash ./install_jni.sh && rm install_jni.sh
 FROM base as libpng
 # Install libpng
 ADD ./common/install_libpng.sh install_libpng.sh
 RUN bash ./install_libpng.sh && rm install_libpng.sh
 FROM ${GPU_IMAGE} as common
 RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
 RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
 ENV LC_ALL en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US.UTF-8
 RUN yum install -y \
         aclocal \
         autoconf \
         automake \
         bison \
         bzip2 \
         curl \
         diffutils \
         file \
         git \
         make \
         patch \
         perl \
         unzip \
         util-linux \
         wget \
         which \
         xz \
         yasm
 RUN yum install -y \
     https://repo.ius.io/ius-release-el7.rpm \
     https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
 RUN yum swap -y git git236-core
 # git236+ would refuse to run git commands in repos owned by other users
 # Which causes version check to fail, as pytorch repo is bind-mounted into the image
 # Override this behaviour by treating every folder as safe
 # For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
 RUN git config --global --add safe.directory "*"
 ENV SSL_CERT_FILE=/opt/_internal/certs.pem
 # Install LLVM version
 COPY --from=openssl            /opt/openssl                          /opt/openssl
 COPY --from=base               /opt/python                           /opt/python
 COPY --from=base               /opt/_internal                        /opt/_internal
 COPY --from=base               /usr/local/bin/auditwheel             /usr/local/bin/auditwheel
 COPY --from=intel              /opt/intel                            /opt/intel
 COPY --from=base               /usr/local/bin/patchelf               /usr/local/bin/patchelf
 COPY --from=libpng             /usr/local/bin/png*                   /usr/local/bin/
 COPY --from=libpng             /usr/local/bin/libpng*                /usr/local/bin/
 COPY --from=libpng             /usr/local/include/png*               /usr/local/include/
 COPY --from=libpng             /usr/local/include/libpng*            /usr/local/include/
 COPY --from=libpng             /usr/local/lib/libpng*                /usr/local/lib/
 COPY --from=libpng             /usr/local/lib/pkgconfig              /usr/local/lib/pkgconfig
 COPY --from=jni                /usr/local/include/jni.h              /usr/local/include/jni.h
 FROM common as cpu_final
 ARG BASE_CUDA_VERSION=10.2
 RUN yum install -y yum-utils centos-release-scl
 RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
 RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
 ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
 # cmake
 RUN yum install -y cmake3 && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 # ninja
 RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
 RUN yum install -y ninja-build
 FROM cpu_final as cuda_final
 RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=cuda     /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 COPY --from=magma    /usr/local/cuda-${BASE_CUDA_VERSION}  /usr/local/cuda-${BASE_CUDA_VERSION}
 FROM common as rocm_final
 ARG ROCM_VERSION=3.7
 # Install ROCm
 ADD ./common/install_rocm.sh install_rocm.sh
 RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
 # cmake is already installed inside the rocm base image, but both 2 and 3 exist
 # cmake3 is needed for the later MIOpen custom build, so that step is last.
 RUN yum install -y cmake3 && \
     rm -f /usr/bin/cmake && \
     ln -s /usr/bin/cmake3 /usr/bin/cmake
 ADD ./common/install_miopen.sh install_miopen.sh
 RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

6

.ci/docker/manywheel/Dockerfile_2_28


 @ -36,7 +36,9 @@ FROM base as cuda
 ARG BASE_CUDA_VERSION=11.8
 # Install CUDA
 ADD ./common/install_cuda.sh install_cuda.sh
 RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
 COPY ./common/install_nccl.sh install_nccl.sh
 COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
 RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu*
 FROM base as intel
 # MKL
 @ -158,7 +160,7 @@ ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
 RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
 ENV MKLROOT /opt/intel
 ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
 RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
 RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
 ADD ./common/install_miopen.sh install_miopen.sh
 RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

6

.ci/docker/manywheel/Dockerfile_2_28_aarch64


 @ -38,6 +38,12 @@ RUN yum install -y \
   sudo \
   gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
 # (optional) Install non-default Ninja version
 ARG NINJA_VERSION
 COPY ./common/install_ninja.sh install_ninja.sh
 RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
 RUN rm install_ninja.sh
 # Ensure the expected devtoolset is used
 ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
 ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH

4

.ci/docker/manywheel/Dockerfile_cuda_aarch64


 @ -67,7 +67,9 @@ FROM base as cuda
 ARG BASE_CUDA_VERSION
 # Install CUDA
 ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
 RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
 COPY ./common/install_nccl.sh install_nccl.sh
 COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
 RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh install_nccl.sh ci_commit_pins/nccl-cu*
 FROM base as magma
 ARG BASE_CUDA_VERSION

40

.ci/docker/manywheel/Dockerfile_s390x


 @ -42,6 +42,7 @@ RUN yum install -y \
   llvm-devel \
   libzstd-devel \
   python3.12-devel \
   python3.12-test \
   python3.12-setuptools \
   python3.12-pip \
   python3-virtualenv \
 @ -101,24 +102,33 @@ CMD ["/bin/bash"]
 # install test dependencies:
 # - grpcio requires system openssl, bundled crypto fails to build
 # - ml_dtypes 0.4.0 requires some fixes provided in later commits to build
 RUN dnf install -y \
   protobuf-devel \
   protobuf-c-devel \
   protobuf-lite-devel \
   wget \
   patch
   hdf5-devel \
   python3-h5py \
   git
 RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio==1.65.4
 RUN cd ~ && \
   git clone https://github.com/jax-ml/ml_dtypes && \
   cd ml_dtypes && \
   git checkout v0.4.0 && \
 RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
 # cmake-3.28.0 from pip for onnxruntime
 RUN python3 -mpip install cmake==3.28.0
 # build onnxruntime 1.21.0 from sources.
 # it is not possible to build it from sources using pip,
 # so just build it from upstream repository.
 # h5py is dependency of onnxruntime_training.
 # h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
 # install newest flatbuffers version first:
 # for some reason old version is getting pulled in otherwise.
 # packaging package is required for onnxruntime wheel build.
 RUN pip3 install flatbuffers && \
   pip3 install h5py==3.11.0 && \
   pip3 install packaging && \
   git clone https://github.com/microsoft/onnxruntime && \
   cd onnxruntime && git checkout v1.21.0 && \
   git submodule update --init --recursive && \
   wget https://github.com/jax-ml/ml_dtypes/commit/b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
   wget https://github.com/jax-ml/ml_dtypes/commit/d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
   patch -p1 < b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
   patch -p1 < d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
   python3 setup.py bdist_wheel && \
   pip3 install dist/*.whl && \
   rm -rf ml_dtypes
   ./build.sh --config Release --parallel 0 --enable_pybind --build_wheel --enable_training --enable_training_apis --enable_training_ops --skip_tests --allow_running_as_root && \
   pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
   cd .. && /bin/rm -rf ./onnxruntime

									
										9

.ci/docker/manywheel/build.sh
									
												
				@ -48,7 +48,7 @@ case ${GPU_ARCH_TYPE} in

				        TARGET=final

				        DOCKER_TAG=cpu-aarch64

				        GPU_IMAGE=arm64v8/almalinux:8

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"

				        DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"

				        MANY_LINUX_VERSION="2_28_aarch64"

				        ;;

				    cpu-cxx11-abi)

				@ -97,7 +97,7 @@ case ${GPU_ARCH_TYPE} in

				            DEVTOOLSET_VERSION="11"

				            GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete

				        fi

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"

				        PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"

				        DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"

				        ;;

				    xpu)

				@ -121,7 +121,8 @@ fi

				(

				    set -x

				    if [ "$(uname -m)" != "s390x" ]; then

				    # Only activate this if in CI

				    if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then

				        # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712

				        # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.

				        sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

				@ -139,7 +140,7 @@ fi

				        "${TOPDIR}/.ci/docker/"

				)

				GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}

				GITHUB_REF=${GITHUB_REF:-"dev"}

				GIT_BRANCH_NAME=${GITHUB_REF##*/}

				GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}

				DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
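The branch-tag derivation in the hunk above can be tried in isolation. The following is a minimal sketch, assuming only standard bash parameter expansion; `branch_from_ref` is a hypothetical helper name and is not part of `build.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in build.sh): reduce a git ref to its last
# path component, falling back to "dev" when the ref is empty or unset.
branch_from_ref() {
    local ref="${1:-dev}"   # same default idea as GITHUB_REF:-"dev"
    echo "${ref##*/}"       # strip the longest prefix ending in "/"
}

branch_from_ref "refs/heads/main"
branch_from_ref "refs/tags/v1.0"
branch_from_ref ""
```

With an empty ref the helper falls back to `dev`, which is the value the script would then append to `DOCKER_IMAGE` as the branch tag.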

									
										2

.ci/docker/manywheel/build_scripts/build_utils.sh
									
												
				@ -3,7 +3,7 @@

				# Script used only in CD pipeline

				OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/

				CURL_DOWNLOAD_URL=https://curl.askapache.com/download

				CURL_DOWNLOAD_URL=https://curl.se/download

				AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

34

.ci/docker/requirements-ci.txt


 @ -41,11 +41,14 @@ fbscribelogger==0.1.7
 #Pinned versions: 0.1.6
 #test that import:
 flatbuffers==2.0
 flatbuffers==2.0 ; platform_machine != "s390x"
 #Description: cross platform serialization library
 #Pinned versions: 2.0
 #test that import:
 flatbuffers ; platform_machine == "s390x"
 #Description: cross platform serialization library; Newer version is required on s390x for new python version
 hypothesis==5.35.1
 # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
 #Description: advanced library for generating parametrized tests
 @ -90,10 +93,10 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
 mypy==1.13.0
 mypy==1.14.0
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
 #Pinned versions: 1.10.0
 #Pinned versions: 1.14.0
 #test that import: test_typing.py, test_type_hints.py
 networkx==2.8.8
 @ -102,10 +105,10 @@ networkx==2.8.8
 #Pinned versions: 2.8.8
 #test that import: functorch
 #ninja
 #Description: build system.  Note that it install from
 #here breaks things so it is commented out
 #Pinned versions: 1.10.0.post1
 ninja==1.11.1.3
 #Description: build system. Used in some tests. Used in build to generate build
 #time tracing information
 #Pinned versions: 1.11.1.3
 #test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
 numba==0.49.0 ; python_version < "3.9"
 @ -294,7 +297,7 @@ ghstack==0.8.0
 #Pinned versions: 0.8.0
 #test that import:
 jinja2==3.1.5
 jinja2==3.1.6
 #Description: jinja2 template engine
 #Pinned versions: 3.1.4
 #test that import:
 @ -329,7 +332,7 @@ lxml==5.3.0
 PyGithub==2.3.0
 sympy==1.13.1 ; python_version >= "3.9"
 sympy==1.13.3
 #Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
 #Pinned versions:
 #test that import:
 @ -339,7 +342,7 @@ onnx==1.17.0
 #Pinned versions:
 #test that import:
 onnxscript==0.1.0.dev20240817
 onnxscript==0.2.2
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 @ -353,7 +356,7 @@ parameterized==0.8.1
 #Pinned versions: 1.24.0
 #test that import: test_sac_estimator.py
 pwlf==2.2.1 ; python_version >= "3.8"
 pwlf==2.2.1
 #Description: required for testing torch/distributed/_tools/sac_estimator.py
 #Pinned versions: 2.2.1
 #test that import: test_sac_estimator.py
 @ -362,12 +365,17 @@ pwlf==2.2.1 ; python_version >= "3.8"
 # To build PyTorch itself
 astunparse
 PyYAML
 pyzstd
 setuptools
 ninja==1.11.1 ; platform_machine == "aarch64"
 scons==4.5.2 ; platform_machine == "aarch64"
 pulp==2.9.0 ; python_version >= "3.8"
 pulp==2.9.0
 #Description: required for testing ilp formulation under torch/distributed/_tools
 #Pinned versions: 2.9.0
 #test that import: test_sac_ilp.py
 dataclasses_json==0.6.7
 #Description: required for data pipeline and scripts under tools/stats
 #Pinned versions: 0.6.7
 #test that import:
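The `platform_machine` markers on the two `flatbuffers` lines make pip pick exactly one of them per architecture. A rough shell analogue of that selection is sketched below; `flatbuffers_requirement` is a hypothetical name, and the sketch assumes `uname -m` reports the same string that pip sees as `platform_machine`, which is typical on Linux but not guaranteed everywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the flatbuffers requirement that the
# environment markers would select for a given machine string.
# Assumption: `uname -m` matches pip's platform_machine value.
flatbuffers_requirement() {
    local machine="${1:-$(uname -m)}"
    if [ "$machine" = "s390x" ]; then
        echo "flatbuffers"          # unpinned on s390x
    else
        echo "flatbuffers==2.0"     # pinned everywhere else
    fi
}

flatbuffers_requirement s390x
flatbuffers_requirement x86_64
```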

2

.ci/docker/triton_version.txt


 @ -1 +1 @@
 .2.0
 .3.0

									
										29

.ci/docker/ubuntu-cuda/Dockerfile
									
												
				@ -2,7 +2,7 @@ ARG UBUNTU_VERSION

				ARG CUDA_VERSION

				ARG IMAGE_NAME

				FROM ${IMAGE_NAME}

				FROM ${IMAGE_NAME} as base

				ARG UBUNTU_VERSION

				ARG CUDA_VERSION

				@ -50,13 +50,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi

				RUN rm install_protobuf.sh

				ENV INSTALLED_PROTOBUF ${PROTOBUF}

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				COPY ./common/install_db.sh install_db.sh

				RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				@ -97,14 +90,20 @@ RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi

				RUN rm install_cmake.sh

				ARG TRITON

				FROM base as triton-builder

				# Install triton, this needs to be done before sccache because the latter will

				# try to reach out to S3, which docker build runners don't have access

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton.txt triton.txt

				COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				RUN bash ./install_triton.sh

				FROM base as final

				COPY --from=triton-builder /opt/triton /opt/triton

				RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi

				RUN rm -rf /opt/triton

				ARG HALIDE

				# Build and install halide

				@ -159,6 +158,16 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh

				RUN bash install_cusparselt.sh

				RUN rm install_cusparselt.sh

				# Install NCCL

				ARG CUDA_VERSION

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				RUN bash install_nccl.sh

				RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*

				ENV USE_SYSTEM_NCCL=1

				ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"

				ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"

				# Install CUDSS

				ARG CUDA_VERSION

				COPY ./common/install_cudss.sh install_cudss.sh

									
										63

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -14,21 +14,20 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				COPY ./common/install_base.sh install_base.sh

				RUN bash ./install_base.sh && rm install_base.sh

				# Install clang

				ARG LLVMDEV

				ARG CLANG_VERSION

				COPY ./common/install_clang.sh install_clang.sh

				RUN bash ./install_clang.sh && rm install_clang.sh

				# Install user

				COPY ./common/install_user.sh install_user.sh

				RUN bash ./install_user.sh && rm install_user.sh

				# Install katex

				ARG KATEX

				COPY ./common/install_docs_reqs.sh install_docs_reqs.sh

				RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh

				# Install conda and other packages (e.g., numpy, pytest)

				ARG ANACONDA_PYTHON_VERSION

				ARG CONDA_CMAKE

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH

				ARG CONDA_CMAKE

				COPY requirements-ci.txt /opt/conda/requirements-ci.txt

				COPY ./common/install_conda.sh install_conda.sh

				COPY ./common/common_utils.sh common_utils.sh

				@ -39,6 +38,11 @@ ARG GCC_VERSION

				COPY ./common/install_gcc.sh install_gcc.sh

				RUN bash ./install_gcc.sh && rm install_gcc.sh

				# Install clang

				ARG CLANG_VERSION

				COPY ./common/install_clang.sh install_clang.sh

				RUN bash ./install_clang.sh && rm install_clang.sh

				# (optional) Install protobuf for ONNX

				ARG PROTOBUF

				COPY ./common/install_protobuf.sh install_protobuf.sh

				@ -46,13 +50,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi

				RUN rm install_protobuf.sh

				ENV INSTALLED_PROTOBUF ${PROTOBUF}

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				COPY ./common/install_db.sh install_db.sh

				RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				@ -66,7 +63,7 @@ COPY ./common/install_rocm.sh install_rocm.sh

				RUN bash ./install_rocm.sh

				RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}

				RUN rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

				@ -85,6 +82,32 @@ COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				# (optional) Install UCC

				ARG UCX_COMMIT

				ARG UCC_COMMIT

				ENV UCX_COMMIT $UCX_COMMIT

				ENV UCC_COMMIT $UCC_COMMIT

				ENV UCX_HOME /usr

				ENV UCC_HOME /usr

				ADD ./common/install_ucc.sh install_ucc.sh

				RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi

				RUN rm install_ucc.sh

				COPY ./common/install_openssl.sh install_openssl.sh

				ENV OPENSSL_ROOT_DIR /opt/openssl

				RUN bash ./install_openssl.sh

				ENV OPENSSL_DIR /opt/openssl

				ARG INDUCTOR_BENCHMARKS

				ARG ANACONDA_PYTHON_VERSION

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/timm.txt timm.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

				@ -107,17 +130,17 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				# This is needed by sccache

				COPY ./common/install_openssl.sh install_openssl.sh

				ENV OPENSSL_ROOT_DIR /opt/openssl

				RUN bash ./install_openssl.sh

				ENV OPENSSL_DIR /opt/openssl

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

				RUN bash ./install_cache.sh && rm install_cache.sh

				# Install Open MPI for ROCm

				COPY ./common/install_openmpi.sh install_openmpi.sh

				RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi

				RUN rm install_openmpi.sh

				# Include BUILD_ENVIRONMENT environment variable in image

				ARG BUILD_ENVIRONMENT

				ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

									
										7

.ci/docker/ubuntu-xpu/Dockerfile
									
												
				@ -77,13 +77,6 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				COPY ./common/install_db.sh install_db.sh

				RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

									
										51

.ci/docker/ubuntu/Dockerfile
									
												
				@ -1,6 +1,6 @@

				ARG UBUNTU_VERSION

				FROM ubuntu:${UBUNTU_VERSION}

				FROM ubuntu:${UBUNTU_VERSION} as base

				ARG UBUNTU_VERSION

				@ -52,9 +52,16 @@ RUN  bash ./install_lcov.sh && rm install_lcov.sh

				# Install cuda and cudnn

				ARG CUDA_VERSION

				COPY ./common/install_cuda.sh install_cuda.sh

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh

				COPY ./common/install_nccl.sh install_nccl.sh

				COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/

				RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu*

				ENV DESIRED_CUDA ${CUDA_VERSION}

				ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

				# No effect if cuda not installed

				ENV USE_SYSTEM_NCCL=1

				ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"

				ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"

				# (optional) Install UCC

				ARG UCX_COMMIT

				@ -74,13 +81,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi

				RUN rm install_protobuf.sh

				ENV INSTALLED_PROTOBUF ${PROTOBUF}

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				COPY ./common/install_db.sh install_db.sh

				RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				@ -88,18 +88,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				RUN rm install_vision.sh cache_vision_models.sh common_utils.sh

				ENV INSTALLED_VISION ${VISION}

				# (optional) Install Vulkan SDK

				ARG VULKAN_SDK_VERSION

				COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh

				RUN if [ -n "${VULKAN_SDK_VERSION}" ]; then bash ./install_vulkan_sdk.sh; fi

				RUN rm install_vulkan_sdk.sh

				# (optional) Install swiftshader

				ARG SWIFTSHADER

				COPY ./common/install_swiftshader.sh install_swiftshader.sh

				RUN if [ -n "${SWIFTSHADER}" ]; then bash ./install_swiftshader.sh; fi

				RUN rm install_swiftshader.sh

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

				@ -127,20 +115,21 @@ RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_d

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				ARG TRITON

				# Install triton, this needs to be done before sccache because the latter will

				# try to reach out to S3, which docker build runners don't have access

				ARG TRITON_CPU

				# Create a separate stage for building Triton and Triton-CPU.  install_triton

				# will check for the presence of env vars

				FROM base as triton-builder

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton.txt triton.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt

				ARG TRITON_CPU

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt

				RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-cpu.txt

				RUN bash ./install_triton.sh

				FROM base as final

				COPY --from=triton-builder /opt/triton /opt/triton

				RUN if [ -n "${TRITON}" ] || [ -n "${TRITON_CPU}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi

				RUN rm -rf /opt/triton

				ARG EXECUTORCH

				# Build and install executorch

2

.ci/magma-rocm/.gitignore vendored Normal file


 @ -0,0 +1,2 @@
 output/
 magma-rocm*/

									
										35

.ci/magma-rocm/Makefile
									
										Normal file
									
												
				@ -0,0 +1,35 @@

				SHELL=/usr/bin/env bash

				DOCKER_CMD ?= docker

				DESIRED_ROCM ?= 6.3

				DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))

				PACKAGE_NAME = magma-rocm

				# inherit this from underlying docker image, do not pass this env var to docker

				#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201

				DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \

					-v $(shell git rev-parse --show-toplevel)/.ci:/builder \

					-w /builder \

					-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_ROCM_SHORT} \

					-e DESIRED_ROCM=${DESIRED_ROCM} \

					"pytorch/manylinux2_28-builder:rocm${DESIRED_ROCM}-main" \

					magma-rocm/build_magma.sh

				.PHONY: all

				all: magma-rocm63

				all: magma-rocm624

				.PHONY: clean

				clean:

					$(RM) -r magma-*

					$(RM) -r output

				.PHONY: magma-rocm63

				magma-rocm63: DESIRED_ROCM := 6.3

				magma-rocm63:

					$(DOCKER_RUN)

				.PHONY: magma-rocm624

				magma-rocm624: DESIRED_ROCM := 6.2.4

				magma-rocm624:

					$(DOCKER_RUN)

									
										48

.ci/magma-rocm/README.md
									
										Normal file
									
												
				@ -0,0 +1,48 @@

				# Magma ROCm

				This folder contains the scripts and configurations to build libmagma.so, linked for various versions of ROCm.

				## Building

				Look in the `Makefile` for available targets to build. To build any target, for example `magma-rocm63`, run

				```

				# Using `docker`

				make magma-rocm63

				# Using `podman`

				DOCKER_CMD=podman make magma-rocm63

				```

				This spawns a `pytorch/manylinux2_28-builder:rocm<version>-main` docker image, which has the required `devtoolset` and ROCm versions installed.

				Within the docker image, it runs `build_magma.sh` with the correct environment variables set, which packages the necessary files

				into a tarball, with the following structure:

				```

				.

				├── include       # header files

				├── lib           # libmagma.so

				├── info

				│   ├── licenses  # license file

				│   └── recipe    # build script

				```

				More specifically, `build_magma.sh` copies over the relevant files from the `package_files` directory depending on the ROCm version.

				Outputted binaries should be in the `output` folder.

				## Pushing

				Packages can be uploaded to an S3 bucket using:

				```

				aws s3 cp output/*/magma-rocm*.bz2 <bucket-with-path>

				```

				If you do not have upload permissions, please ping @seemethere or @soumith to gain access

				## New versions

				New ROCm versions can be added by creating a new make target with the next desired version. For ROCm version N.n, the target should be named `magma-rocmNn`.

				Make sure to edit the appropriate environment variables (e.g., DESIRED_ROCM) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
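The target-naming rule described under "New versions" amounts to dropping the dots from the ROCm version, as the Makefile does with `$(subst .,,$(DESIRED_ROCM))`. A minimal bash sketch of the same rule follows; `rocm_target_name` is a hypothetical helper used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a ROCm version such as 6.2.4 to the make
# target name by removing every dot, like $(subst .,,...) in the Makefile.
rocm_target_name() {
    local version="$1"
    echo "magma-rocm${version//./}"
}

rocm_target_name 6.3
rocm_target_name 6.2.4
```

The two calls reproduce the existing target names `magma-rocm63` and `magma-rocm624`.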

									
										42

.ci/magma-rocm/build_magma.sh
									
										Executable file
									
												
				@ -0,0 +1,42 @@

				#!/usr/bin/env bash

				set -eou pipefail

				# Environment variables

				# The script expects DESIRED_ROCM and PACKAGE_NAME to be set

				ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"

				# Version 2.7.2 + ROCm related updates

				MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6

				# Folders for the build

				PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata

				PACKAGE_DIR=${ROOT_DIR}/magma-rocm/${PACKAGE_NAME} # build workspace

				PACKAGE_OUTPUT=${ROOT_DIR}/magma-rocm/output # where tarballs are stored

				PACKAGE_BUILD=${PACKAGE_DIR} # where the content of the tarball is prepared

				PACKAGE_RECIPE=${PACKAGE_BUILD}/info/recipe

				PACKAGE_LICENSE=${PACKAGE_BUILD}/info/licenses

				mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RECIPE} ${PACKAGE_LICENSE}

				# Fetch magma sources and check out the pinned commit

				pushd ${PACKAGE_DIR}

				git clone https://bitbucket.org/icl/magma.git

				pushd magma

				git checkout ${MAGMA_VERSION}

				popd

				popd

				# build

				pushd ${PACKAGE_DIR}/magma

				# The build.sh script expects to be executed from the sources root folder

				INSTALL_DIR=${PACKAGE_BUILD} ${PACKAGE_FILES}/build.sh

				popd

				# Package recipe, license and tarball

				# Folder and package name are backward compatible for the build workflow

				cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh

				cp ${PACKAGE_DIR}/magma/COPYRIGHT ${PACKAGE_LICENSE}/COPYRIGHT

				pushd ${PACKAGE_BUILD}

				tar cjf ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2 include lib info

				echo Built in ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2

				popd

									
										38

.ci/magma-rocm/package_files/build.sh
									
										Executable file
									
												
				@ -0,0 +1,38 @@

				# Magma build scripts need `python`

				ln -sf /usr/bin/python3 /usr/bin/python

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  almalinux)

				    yum install -y gcc-gfortran

				    ;;

				  *)

				    echo "No preinstalls to build magma..."

				    ;;

				esac

				MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}

				cp make.inc-examples/make.inc.hip-gcc-mkl make.inc

				echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

				if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then

				    echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc

				fi

				echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc

				echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc

				export PATH="${PATH}:/opt/rocm/bin"

				if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

				  amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`

				else

				  amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`

				fi

				for arch in $amdgpu_targets; do

				  echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc

				done

				# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition

				sed -i 's/^FOPENMP/#FOPENMP/g' make.inc

				make -f make.gen.hipMAGMA -j $(nproc)

				LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"

				make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"

				cp -R lib ${INSTALL_DIR}

				cp -R include ${INSTALL_DIR}
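A minimal Python sketch of how the script above appears to turn the semicolon-separated `PYTORCH_ROCM_ARCH` value into one `--offload-arch` line per target in `make.inc`. The function name `offload_arch_lines` is hypothetical; it only mirrors the `sed`/`for` loop and does not model the `rocm_agent_enumerator` fallback.

```python
def offload_arch_lines(pytorch_rocm_arch: str) -> list[str]:
    """Mirror `sed 's/;/ /g'` followed by the shell for-loop over the targets."""
    # Replacing ';' with spaces and splitting on whitespace matches the
    # shell's word splitting of $amdgpu_targets.
    targets = pytorch_rocm_arch.replace(";", " ").split()
    return [f"DEVCCFLAGS += --offload-arch={arch}" for arch in targets]


if __name__ == "__main__":
    # Hypothetical target list, used only for illustration.
    for line in offload_arch_lines("gfx90a;gfx942"):
        print(line)
```

An empty value yields no lines, which is why the script falls back to enumerating the local GPUs in that case.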

									
										15

.ci/magma/Makefile
									
				@ -12,13 +12,13 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \

					-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \

					-e DESIRED_CUDA=${DESIRED_CUDA} \

					-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \

					"pytorch/manylinux-builder:cuda${DESIRED_CUDA}-main" \

					"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \

					magma/build_magma.sh

				.PHONY: all

				all: magma-cuda128

				all: magma-cuda126

				all: magma-cuda124

				all: magma-cuda121

				all: magma-cuda118

				.PHONY:

				@ -26,6 +26,12 @@ clean:

					$(RM) -r magma-*

					$(RM) -r output

				.PHONY: magma-cuda128

				magma-cuda128: DESIRED_CUDA := 12.8

				magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120

				magma-cuda128:

					$(DOCKER_RUN)

				.PHONY: magma-cuda126

				magma-cuda126: DESIRED_CUDA := 12.6

				magma-cuda126:

				@ -36,11 +42,6 @@ magma-cuda124: DESIRED_CUDA := 12.4

				magma-cuda124:

					$(DOCKER_RUN)

				.PHONY: magma-cuda121

				magma-cuda121: DESIRED_CUDA := 12.1

				magma-cuda121:

					$(DOCKER_RUN)

				.PHONY: magma-cuda118

				magma-cuda118: DESIRED_CUDA := 11.8

				magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37

									
										12

.ci/manywheel/build_common.sh
									
				@ -111,12 +111,6 @@ case ${DESIRED_PYTHON} in

				    ;;

				esac

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    export _GLIBCXX_USE_CXX11_ABI=1

				else

				    export _GLIBCXX_USE_CXX11_ABI=0

				fi

				if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				    echo "Calling build_amd.py at $(date)"

				    python tools/amd_build/build_amd.py

				@ -209,12 +203,6 @@ if [[ -n "$BUILD_PYTHONLESS" ]]; then

				    mkdir -p /tmp/$LIBTORCH_HOUSE_DIR

				    if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				        LIBTORCH_ABI="cxx11-abi-"

				    else

				        LIBTORCH_ABI=

				    fi

				    zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch

				    cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \

				       /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip

									
										33

.ci/manywheel/build_cuda.sh
									
				@ -14,6 +14,7 @@ export USE_CUDA_STATIC_LINK=1

				export INSTALL_TEST=0 # dont install test binaries into site-packages

				export USE_CUPTI_SO=0

				export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build

				export USE_CUFILE=${USE_CUFILE:-1}

				# Keep an array of cmake variables to add to

				if [[ -z "$CMAKE_ARGS" ]]; then

				@ -52,8 +53,12 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

				TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"

				case ${CUDA_VERSION} in

				    12.8)

				        TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.6)

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.4)

				@ -114,7 +119,16 @@ if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then

				        )

				fi

				if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are

				# not available in PYPI

				if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then

				    export USE_CUFILE=0

				fi

				# CUDA_VERSION 12.4, 12.6, 12.8

				if [[ $CUDA_VERSION == 12* ]]; then

				    export USE_STATIC_CUDNN=0

				    # Try parallelizing nvcc as well

				    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				@ -155,6 +169,16 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				            "libnvrtc.so.12"

				            "libnvrtc-builtins.so"

				        )

				        if [[ $USE_CUFILE == 1 ]]; then

				            DEPS_LIST+=(

				                "/usr/local/cuda/lib64/libcufile.so.0"

				                "/usr/local/cuda/lib64/libcufile_rdma.so.1"

				            )

				            DEPS_SONAME+=(

				                "libcufile.so.0"

				                "libcufile_rdma.so.1"

				            )

				        fi

				    else

				        echo "Using nvidia libs from pypi."

				        CUDA_RPATHS=(

				@ -171,6 +195,11 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/nvtx/lib'

				        )

				        if [[ $USE_CUFILE == 1 ]]; then

				            CUDA_RPATHS+=(

				                '$ORIGIN/../../nvidia/cufile/lib'

				            )

				        fi

				        CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")

				        export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'

				        export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
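The RPATH assembly in the hunk above can be sketched in Python as follows. This is an illustration under assumptions, not the build script's code: `build_rpaths` is a hypothetical helper, and the base list passed to it stands in for whatever `CUDA_RPATHS` holds before the `USE_CUFILE` check.

```python
def build_rpaths(cuda_rpaths: list[str], use_cufile: bool) -> tuple[str, str]:
    """Return (C_SO_RPATH, LIB_SO_RPATH) as colon-separated strings."""
    paths = list(cuda_rpaths)
    if use_cufile:
        # Appended only when cuFile support is enabled, as in the diff.
        paths.append("$ORIGIN/../../nvidia/cufile/lib")
    # Same effect as `IFS=: ; echo "${CUDA_RPATHS[*]}"` in bash.
    joined = ":".join(paths)
    return joined + ":$ORIGIN:$ORIGIN/lib", joined + ":$ORIGIN"


if __name__ == "__main__":
    base = ["$ORIGIN/../../nvidia/nccl/lib", "$ORIGIN/../../nvidia/nvtx/lib"]
    c_so, lib_so = build_rpaths(base, use_cufile=True)
    print(c_so)
    print(lib_so)
```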

									
										12

.ci/manywheel/build_libtorch.sh
									
				@ -95,12 +95,6 @@ python setup.py clean

				retry pip install -qr requirements.txt

				retry pip install -q numpy==2.0.1

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    export _GLIBCXX_USE_CXX11_ABI=1

				else

				    export _GLIBCXX_USE_CXX11_ABI=0

				fi

				if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then

				    echo "Calling build_amd.py at $(date)"

				    python tools/amd_build/build_amd.py

				@ -169,12 +163,6 @@ fi

				)

				if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    LIBTORCH_ABI="cxx11-abi-"

				else

				    LIBTORCH_ABI=

				fi

				(

				    set -x

									
										17

.ci/pytorch/build.sh
									
				@ -35,7 +35,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then

				fi

				if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then

				  if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then

				  if [[ "$BUILD_ENVIRONMENT" != *clang* ]]; then

				    # TODO: there is a linking issue when building with UCC using clang,

				    # disable it for now, to be fixed later.

				    # TODO: disable UCC temporarily to enable CUDA 12.1 in CI

				@ -173,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				  source /opt/intel/oneapi/compiler/latest/env/vars.sh

				  # XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA

				  export USE_KINETO=0

				  export TORCH_XPU_ARCH_LIST=pvc

				fi

				# sccache will fail for CUDA builds if all cores are used for compiling

				@ -191,7 +192,7 @@ fi

				# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of

				# memory to build and will OOM

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then

				  echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"

				  echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"

				  export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"

				@ -276,10 +277,8 @@ else

				    # or building non-XLA tests.

				    if [[ "$BUILD_ENVIRONMENT" != *rocm*  &&

				          "$BUILD_ENVIRONMENT" != *xla* ]]; then

				      if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then

				        # Install numpy-2.0.2 for builds which are backward compatible with 1.X

				        python -mpip install numpy==2.0.2

				      fi

				      # Install numpy-2.0.2 for builds which are backward compatible with 1.X

				      python -mpip install numpy==2.0.2

				      WERROR=1 python setup.py clean

				@ -377,8 +376,10 @@ else

				    # This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization

				    # is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has

				    # 16 CPUs

				    MAX_JOBS=$(nproc --ignore=4)

				    export MAX_JOBS

				    if [ -z "$MAX_JOBS_OVERRIDE" ]; then

				      MAX_JOBS=$(nproc --ignore=4)

				      export MAX_JOBS

				    fi

				    # NB: Install outside of source directory (at the same level as the root

				    # pytorch folder) so that it doesn't get cleaned away prior to docker push.
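The two `MAX_JOBS` computations in the hunks above can be sketched in Python. The helper names are hypothetical, the CPU count is passed in explicitly so the arithmetic is easy to check, and the `MAX_JOBS_OVERRIDE` guard (which skips the assignment entirely) is not modeled.

```python
def nproc_ignore(cpu_count: int, ignore: int) -> int:
    """Approximate GNU `nproc --ignore=N`, which never reports fewer than 1."""
    return max(1, cpu_count - ignore)


def flash_attention_max_jobs(cpu_count: int) -> int:
    # Mirrors $(( $(nproc --ignore=2) / 3 )); shell arithmetic truncates,
    # so the result is 0 on machines with fewer than five CPUs.
    return nproc_ignore(cpu_count, 2) // 3


def libtorch_max_jobs(cpu_count: int) -> int:
    # Mirrors MAX_JOBS=$(nproc --ignore=4).
    return nproc_ignore(cpu_count, 4)


if __name__ == "__main__":
    # The script's comment mentions a 16-CPU 4xlarge runner.
    print(flash_attention_max_jobs(16), libtorch_max_jobs(16))
```

For the 16-CPU runner the sketch gives 4 parallel jobs for the FlashAttention case and 12 for the libtorch case.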

									
										114

.ci/pytorch/check_binary.sh
									
				@ -59,78 +59,16 @@ else

				  export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"

				fi

				###############################################################################

				# Setup XPU ENV

				###############################################################################

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				  set +u

				  # Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html

				  source /opt/intel/oneapi/compiler/latest/env/vars.sh

				  source /opt/intel/oneapi/pti/latest/env/vars.sh

				fi

				###############################################################################

				# Check GCC ABI

				###############################################################################

				# NOTE [ Building libtorch with old vs. new gcc ABI ]

				#

				# Packages built with one version of ABI could not be linked against by client

				# C++ libraries that were compiled using the other version of ABI. Since both

				# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:

				#

				# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.

				# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.

				# NOTE: As of https://github.com/pytorch/pytorch/issues/126551 we only produce

				#       wheels with cxx11-abi

				echo "Checking that the gcc ABI is what we expect"

				if [[ "$(uname)" != 'Darwin' ]]; then

				  function is_expected() {

				    if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then

				      if [[ "$1" -gt 0 || "$1" == "ON " ]]; then

				        echo 1

				      fi

				    else

				      if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then

				        echo 1

				      fi

				    fi

				  }

				  # First we check that the env var in TorchConfig.cmake is correct

				  # We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake

				  torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"

				  if [[ ! -f "$torch_config" ]]; then

				    echo "No TorchConfig.cmake found!"

				    ls -lah "$install_root/share/cmake/Torch"

				    exit 1

				  fi

				  echo "Checking the TorchConfig.cmake"

				  cat "$torch_config"

				  # The sed call below is

				  #   don't print lines by default (only print the line we want)

				  # -n

				  #   execute the following expression

				  # e

				  #   replace lines that match with the first capture group and print

				  # s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p

				  #   any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one any character, a

				  #   quote, any characters

				  #   Note the exactly one single character after the '='. In the case that the

				  #     variable is not set the '=' will be followed by a '"' immediately and the

				  #     line will fail the match and nothing will be printed; this is what we

				  #     want.  Otherwise it will capture the 0 or 1 after the '='.

				  # /.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/

				  #   replace the matched line with the capture group and print

				  # /\1/p

				  actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"

				  if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then

				    echo "gcc ABI $actual_gcc_abi not as expected."

				    exit 1

				  fi

				  # We also check that there are [not] cxx11 symbols in libtorch

				  # We also check that there are cxx11 symbols in libtorch

				  #

				  echo "Checking that symbols in libtorch.so have the right gcc abi"

				  python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"

				@ -208,35 +146,11 @@ setup_link_flags () {

				TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"

				build_and_run_example_cpp () {

				  if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    GLIBCXX_USE_CXX11_ABI=1

				  else

				    GLIBCXX_USE_CXX11_ABI=0

				  fi

				  setup_link_flags

				  g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1

				  g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1

				  ./$1

				}

				build_example_cpp_with_incorrect_abi () {

				  if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    GLIBCXX_USE_CXX11_ABI=0

				  else

				    GLIBCXX_USE_CXX11_ABI=1

				  fi

				  set +e

				  setup_link_flags

				  g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1

				  ERRCODE=$?

				  set -e

				  if [ "$ERRCODE" -eq "0" ]; then

				    echo "Building example with incorrect ABI didn't throw error. Aborting."

				    exit 1

				  else

				    echo "Building example with incorrect ABI throws expected error. Proceeding."

				  fi

				}

				###############################################################################

				# Check simple Python/C++ calls

				###############################################################################

				@ -246,11 +160,6 @@ if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then

				    export LD_LIBRARY_PATH=/usr/local/cuda/lib64

				  fi

				  build_and_run_example_cpp simple-torch-test

				  # `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test

				  # the expected failure case for Ubuntu 16.04 + gcc 5.4 only.

				  if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then

				    build_example_cpp_with_incorrect_abi simple-torch-test

				  fi

				else

				  pushd /tmp

				  python -c 'import torch'

				@ -385,10 +294,19 @@ except RuntimeError as e:

				fi

				###############################################################################

				# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries

				# Check for C++ ABI compatibility to GCC-11

				###############################################################################

				if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then

				if [[ "$(uname)" == 'Linux' &&  "$PACKAGE_TYPE" == 'manywheel' ]]; then

				  pushd /tmp

				  python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"

				  # Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html gcc-11 is ABI16

				  # Though manylinux_2.28 should have been built with gcc-14, per

				  # https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based

				  # On s390x gcc 14 is used because it contains fix for interaction

				  # between precompiled headers and vectorization builtins.

				  # This fix is not available in earlier gcc versions.

				  # gcc-14 uses ABI19.

				  if [[ "$(uname -m)" != "s390x" ]]; then

				    python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1016' else 1)"

				  fi

				  popd

				fi
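The ABI strings compared in the check above (`_cxxabi1011`, `_cxxabi1016`) appear to follow a simple scheme. The Python sketch below assumes the tag is the literal `_cxxabi` followed by GCC's `__GXX_ABI_VERSION`, which GCC defines as 1000 plus the `-fabi-version` number for versions 2 and later; `pybind11_abi_tag` is a hypothetical helper, not a torch or pybind11 API.

```python
def pybind11_abi_tag(fabi_version: int) -> str:
    """Build the expected tag for a given GCC -fabi-version (assumed scheme)."""
    if fabi_version < 2:
        raise ValueError("the 1000 + N rule only holds for ABI version 2 and later")
    # __GXX_ABI_VERSION is 1000 + N for -fabi-version=N.
    return f"_cxxabi{1000 + fabi_version}"


if __name__ == "__main__":
    # ABI 16 is what the script's comments associate with gcc-11, ABI 19 with gcc-14.
    for version in (11, 16, 19):
        print(version, pybind11_abi_tag(version))
```

Under that assumption, the s390x exemption in the script corresponds to a build that would report `_cxxabi1019` instead of `_cxxabi1016`.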

									
										41

.ci/pytorch/common_utils.sh
									
				@ -169,24 +169,34 @@ function install_torchrec_and_fbgemm() {

				  torchrec_commit=$(get_pinned_commit torchrec)

				  local fbgemm_commit

				  fbgemm_commit=$(get_pinned_commit fbgemm)

				  if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then

				    fbgemm_commit=$(get_pinned_commit fbgemm_rocm)

				  fi

				  pip_uninstall torchrec-nightly

				  pip_uninstall fbgemm-gpu-nightly

				  pip_install setuptools-git-versioning scikit-build pyre-extensions

				  # TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it

				  # seems to be an sccache-related issue

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    unset CMAKE_CUDA_COMPILER_LAUNCHER

				    sudo mv /opt/cache/bin /opt/cache/bin-backup

				  fi

				  if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then

				    # install torchrec first because it installs fbgemm nightly on top of rocm fbgemm

				    pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				    pip_uninstall fbgemm-gpu-nightly

				  # See https://github.com/pytorch/pytorch/issues/106971

				  CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"

				  pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache

				    sudo mv /opt/cache/bin-backup /opt/cache/bin

				    pip_install tabulate  # needed for newer fbgemm

				    pip_install patchelf  # needed for rocm fbgemm

				    git clone --recursive https://github.com/pytorch/fbgemm

				    pushd fbgemm/fbgemm_gpu

				    git checkout "${fbgemm_commit}"

				    python setup.py install \

				      --package_variant=rocm \

				      -DHIP_ROOT_DIR="${ROCM_PATH}" \

				      -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \

				      -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"

				    popd

				    rm -rf fbgemm

				  else

				    # See https://github.com/pytorch/pytorch/issues/106971

				    CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"

				    pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				  fi

				}

				@ -216,6 +226,11 @@ function checkout_install_torchbench() {

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  # TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488

				  # is regressing speedup metric. This needs to be investigated further

				  pip install transformers==4.38.1

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

									
										50

.ci/pytorch/macos-build.sh
									
				@ -33,55 +33,11 @@ if which sccache > /dev/null; then

				  export PATH="${tmp_dir}:$PATH"

				fi

				cross_compile_arm64() {

				  # Cross compilation for arm64

				  # Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests

				  # that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448

				  USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel

				}

				compile_arm64() {

				  # Compilation for arm64

				  # TODO: Compile with OpenMP support (but this causes CI regressions as cross-compilation were done with OpenMP disabled)

				  USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel

				}

				compile_x86_64() {

				  USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel --plat-name=macosx_10_9_x86_64

				}

				build_lite_interpreter() {

				    echo "Testing libtorch (lite interpreter)."

				    CPP_BUILD="$(pwd)/../cpp_build"

				    # Ensure the removal of the tmp directory

				    trap 'rm -rfv ${CPP_BUILD}' EXIT

				    rm -rf "${CPP_BUILD}"

				    mkdir -p "${CPP_BUILD}/caffe2"

				    # It looks libtorch need to be built in "${CPP_BUILD}/caffe2 folder.

				    BUILD_LIBTORCH_PY=$PWD/tools/build_libtorch.py

				    pushd "${CPP_BUILD}/caffe2" || exit

				    VERBOSE=1 DEBUG=1 python "${BUILD_LIBTORCH_PY}"

				    popd || exit

				    "${CPP_BUILD}/caffe2/build/bin/test_lite_interpreter_runtime"

				}

				print_cmake_info

				if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then

				  if [[ $(uname -m) == "arm64" ]]; then

				    compile_arm64

				  else

				    cross_compile_arm64

				  fi

				elif [[ ${BUILD_ENVIRONMENT} = *lite-interpreter* ]]; then

				  export BUILD_LITE_INTERPRETER=1

				  build_lite_interpreter

				else

				  compile_x86_64

				fi

				# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests

				# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448

				USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel

				if which sccache > /dev/null; then

				  print_sccache_stats

									
										3

.ci/pytorch/macos-test.sh
									
				@ -18,6 +18,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(

				fi

				popd

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				setup_test_python() {

				  # The CircleCI worker hostname doesn't resolve to an address.

				  # This environment variable makes ProcessGroupGloo default to

									
										2

.ci/pytorch/run_tests.sh
									
				@ -40,7 +40,7 @@ retry () {

				if [[ "$#" != 3 ]]; then

				  if [[ -z "${DESIRED_PYTHON:-}" || -z "${DESIRED_CUDA:-}" || -z "${PACKAGE_TYPE:-}" ]]; then

				    echo "USAGE: run_tests.sh  PACKAGE_TYPE  DESIRED_PYTHON  DESIRED_CUDA"

				    echo "The env variable PACKAGE_TYPE must be set to 'conda' or 'manywheel' or 'libtorch'"

				    echo "The env variable PACKAGE_TYPE must be set to 'manywheel' or 'libtorch'"

				    echo "The env variable DESIRED_PYTHON must be set like '2.7mu' or '3.6m' etc"

				    echo "The env variable DESIRED_CUDA must be set like 'cpu' or 'cu80' etc"

				    exit 1

									
										43

.ci/pytorch/smoke_test/check_binary_symbols.py
									
				@ -6,7 +6,7 @@ import itertools

				import os

				import re

				from pathlib import Path

				from typing import Any, List, Tuple

				from typing import Any

				# We also check that there are [not] cxx11 symbols in libtorch

				@ -46,17 +46,17 @@ LIBTORCH_PRE_CXX11_PATTERNS = _apply_libtorch_symbols(PRE_CXX11_SYMBOLS)

				@functools.lru_cache(100)

				def get_symbols(lib: str) -> List[Tuple[str, str, str]]:

				def get_symbols(lib: str) -> list[tuple[str, str, str]]:

				    from subprocess import check_output

				    lines = check_output(f'nm "{lib}"|c++filt', shell=True)

				    return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]

				def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:

				def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:

				    def _grep_symbols(

				        symbols: List[Tuple[str, str, str]], patterns: List[Any]

				    ) -> List[str]:

				        symbols: list[tuple[str, str, str]], patterns: list[Any]

				    ) -> list[str]:

				        rc = []

				        for _s_addr, _s_type, s_name in symbols:

				            for pattern in patterns:

				@ -80,7 +80,7 @@ def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:

				        return functools.reduce(list.__add__, (x.result() for x in tasks), [])

				def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True) -> None:

				def check_lib_symbols_for_abi_correctness(lib: str) -> None:

				    print(f"lib: {lib}")

				    cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)

				    pre_cxx11_symbols = grep_symbols(lib, LIBTORCH_PRE_CXX11_PATTERNS)

				@ -88,28 +88,12 @@ def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True)

				    num_pre_cxx11_symbols = len(pre_cxx11_symbols)

				    print(f"num_cxx11_symbols: {num_cxx11_symbols}")

				    print(f"num_pre_cxx11_symbols: {num_pre_cxx11_symbols}")

				    if pre_cxx11_abi:

				        if num_cxx11_symbols > 0:

				            raise RuntimeError(

				                f"Found cxx11 symbols, but there shouldn't be any, see: {cxx11_symbols[:100]}"

				            )

				        if num_pre_cxx11_symbols < 1000:

				            raise RuntimeError("Didn't find enough pre-cxx11 symbols.")

				        # Check for no recursive iterators, regression test for https://github.com/pytorch/pytorch/issues/133437

				        rec_iter_symbols = grep_symbols(

				            lib, [re.compile("std::filesystem::recursive_directory_iterator.*")]

				    if num_pre_cxx11_symbols > 0:

				        raise RuntimeError(

				            f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"

				        )

				        if len(rec_iter_symbols) > 0:

				            raise RuntimeError(

				                f"recursive_directory_iterator in used pre-CXX11 binaries, see; {rec_iter_symbols}"

				            )

				    else:

				        if num_pre_cxx11_symbols > 0:

				            raise RuntimeError(

				                f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"

				            )

				        if num_cxx11_symbols < 100:

				            raise RuntimeError("Didn't find enought cxx11 symbols")

				    if num_cxx11_symbols < 100:

				        raise RuntimeError("Didn't find enought cxx11 symbols")

				def main() -> None:

				@ -121,9 +105,8 @@ def main() -> None:

				        else:

				            install_root = Path(distutils.sysconfig.get_python_lib()) / "torch"

				    libtorch_cpu_path = install_root / "lib" / "libtorch_cpu.so"

				    pre_cxx11_abi = "cxx11-abi" not in os.getenv("DESIRED_DEVTOOLSET", "")

				    check_lib_symbols_for_abi_correctness(libtorch_cpu_path, pre_cxx11_abi)

				    libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")

				    check_lib_symbols_for_abi_correctness(libtorch_cpu_path)

				if __name__ == "__main__":

									
										8

.ci/pytorch/smoke_test/max_autotune.py
									
				@ -46,7 +46,9 @@ def train(args, model, device, train_loader, optimizer, epoch):

				        optimizer.step()

				        if batch_idx % args.log_interval == 0:

				            print(

				                f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}"  # noqa: B950

				                f"Train Epoch: {epoch} "

				                f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "

				                f"({100.0 * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}"

				            )

				            if args.dry_run:

				                break

				@ -71,7 +73,9 @@ def test(model, device, test_loader):

				    test_loss /= len(test_loader.dataset)

				    print(

				        f"\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n"  # noqa: B950

				        f"\nTest set: Average loss: {test_loss:.4f}, "

				        f"Accuracy: {correct}/{len(test_loader.dataset)} "

				        f"({100.0 * correct / len(test_loader.dataset):.0f}%)\n"

				    )

									
										112

.ci/pytorch/smoke_test/smoke_test.py
									
				@ -6,6 +6,8 @@ import re

				import subprocess

				import sys

				from pathlib import Path

				from tempfile import NamedTemporaryFile

				from typing import Optional

				import torch

				import torch._dynamo

				@ -75,10 +77,13 @@ def read_release_matrix():

				def test_numpy():

				    import numpy as np

				    try:

				        import numpy as np

				    x = np.arange(5)

				    torch.tensor(x)

				        x = np.arange(5)

				        torch.tensor(x)

				    except ImportError:

				        print("Numpy check skipped. Numpy is not installed.")

				def check_version(package: str) -> None:

				@ -161,8 +166,71 @@ def test_cuda_runtime_errors_captured() -> None:

				        raise RuntimeError("Expected CUDA RuntimeError but have not received!")

				def test_cuda_gds_errors_captured() -> None:

				    major_version = int(torch.version.cuda.split(".")[0])

				    minor_version = int(torch.version.cuda.split(".")[1])

				    if target_os == "windows":

				        print(f"{target_os} is not supported for GDS smoke test")

				        return

				    if major_version < 12 or (major_version == 12 and minor_version < 6):

				        print("CUDA version is not supported for GDS smoke test")

				        return

				    cuda_exception_missed = True

				    try:

				        print("Testing test_cuda_gds_errors_captured")

				        with NamedTemporaryFile() as f:

				            torch.cuda.gds.GdsFile(f.name, os.O_CREAT | os.O_RDWR)

				    except RuntimeError as e:

				        expected_error = "cuFileHandleRegister failed"

				        if re.search(expected_error, f"{e}"):

				            print(f"Caught CUDA exception with success: {e}")

				            cuda_exception_missed = False

				        else:

				            raise e

				    if cuda_exception_missed:

				        raise RuntimeError(

				            "Expected cuFileHandleRegister failed RuntimeError but have not received!"

				        )

				def find_pypi_package_version(package: str) -> Optional[str]:

				    from importlib import metadata

				    dists = metadata.distributions()

				    for dist in dists:

				        if dist.metadata["Name"].startswith(package):

				            return dist.version

				    return None

				def cudnn_to_version_str(cudnn_version: int) -> str:

				    patch = int(cudnn_version % 10)

				    minor = int((cudnn_version / 100) % 100)

				    major = int((cudnn_version / 10000) % 10000)

				    return f"{major}.{minor}.{patch}"

				def compare_pypi_to_torch_versions(

				    package: str, pypi_version: str, torch_version: str

				) -> None:

				    if pypi_version is None:

				        raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")

				    if pypi_version.startswith(torch_version):

				        print(f"Found matching {package}. Torch: {torch_version} PyPI {pypi_version}")

				    else:

				        raise RuntimeError(

				            f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"

				        )

				def smoke_test_cuda(

				    package: str, runtime_error_check: str, torch_compile_check: str

				    package: str,

				    runtime_error_check: str,

				    torch_compile_check: str,

				    pypi_pkg_check: str,

				) -> None:

				    if not torch.cuda.is_available() and is_cuda_system:

				        raise RuntimeError(f"Expected CUDA {gpu_arch_ver}. However CUDA is not loaded.")

				@ -192,20 +260,30 @@ def smoke_test_cuda(

				            raise RuntimeError(

				                f"Wrong CUDA version. Loaded: {torch.version.cuda} Expected: {gpu_arch_ver}"

				            )

				        print(f"torch cuda: {torch.version.cuda}")

				        # todo add cudnn version validation

				        print(f"torch cudnn: {torch.backends.cudnn.version()}")

				        print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")

				        print(f"torch cuda: {torch.version.cuda}")

				        torch.cuda.init()

				        print("CUDA initialized successfully")

				        print(f"Number of CUDA devices: {torch.cuda.device_count()}")

				        for i in range(torch.cuda.device_count()):

				            print(f"Device {i}: {torch.cuda.get_device_name(i)}")

				        # nccl is available only on Linux

				        print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")

				        torch_cudnn_version = cudnn_to_version_str(torch.backends.cudnn.version())

				        print(f"Torch cuDNN version: {torch_cudnn_version}")

				        if sys.platform in ["linux", "linux2"]:

				            print(f"torch nccl version: {torch.cuda.nccl.version()}")

				            torch_nccl_version = ".".join(str(v) for v in torch.cuda.nccl.version())

				            print(f"Torch nccl; version: {torch_nccl_version}")

				        # PyPI dependencies are installed on Linux only, and nccl is available only on Linux.

				        if pypi_pkg_check == "enabled" and sys.platform in ["linux", "linux2"]:

				            compare_pypi_to_torch_versions(

				                "cudnn", find_pypi_package_version("nvidia-cudnn"), torch_cudnn_version

				            )

				            compare_pypi_to_torch_versions(

				                "nccl", find_pypi_package_version("nvidia-nccl"), torch_nccl_version

				            )

				        if runtime_error_check == "enabled":

				            test_cuda_runtime_errors_captured()

				@ -364,6 +442,13 @@ def parse_args():

				        choices=["enabled", "disabled"],

				        default="enabled",

				    )

				    parser.add_argument(

				        "--pypi-pkg-check",

				        help="Check pypi package versions cudnn and nccl",

				        type=str,

				        choices=["enabled", "disabled"],

				        default="enabled",

				    )

				    return parser.parse_args()

				@ -379,14 +464,19 @@ def main() -> None:

				    smoke_test_conv2d()

				    test_linalg()

				    test_numpy()

				    if is_cuda_system:

				        test_linalg("cuda")

				        test_cuda_gds_errors_captured()

				    if options.package == "all":

				        smoke_test_modules()

				    smoke_test_cuda(

				        options.package, options.runtime_error_check, options.torch_compile_check

				        options.package,

				        options.runtime_error_check,

				        options.torch_compile_check,

				        options.pypi_pkg_check,

				    )
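The `cudnn_to_version_str` helper in the diff above decodes the integer returned by `torch.backends.cudnn.version()`, and `compare_pypi_to_torch_versions` then does a prefix comparison against the installed PyPI package version. A minimal sketch of both steps follows. It assumes the cuDNN 9+ encoding `major * 10000 + minor * 100 + patch`, which the diff does not state. The names `decode_cudnn_version` and `matches_pypi` are hypothetical. Note that the sketch reads the patch as the last two digits, whereas the helper in the diff uses `cudnn_version % 10`; the two agree only while the patch is below 10.

```python
def decode_cudnn_version(version: int) -> str:
    # Assumed encoding (cuDNN 9+): major * 10000 + minor * 100 + patch.
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def matches_pypi(pypi_version: str, torch_version: str) -> bool:
    # Same prefix check as compare_pypi_to_torch_versions in the diff:
    # a PyPI version with an extra build component still matches.
    return pypi_version.startswith(torch_version)


if __name__ == "__main__":
    decoded = decode_cudnn_version(90100)
    print(decoded, matches_pypi("9.1.0.70", decoded))
```

The prefix check is deliberately loose: it accepts any PyPI version that begins with the Torch-reported string, so it tolerates a trailing build number but would also accept an unrelated longer version that happens to share the prefix.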

									
										149

.ci/pytorch/test.sh
									
												
				@ -46,6 +46,9 @@ BUILD_BIN_DIR="$BUILD_DIR"/bin

				SHARD_NUMBER="${SHARD_NUMBER:=1}"

				NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				export VALGRIND=ON

				# export TORCH_INDUCTOR_INSTALL_GXX=ON

				if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				@ -174,6 +177,9 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then

				  # Print GPU info

				  rocminfo

				  rocminfo | grep -E 'Name:.*\sgfx|Marketing'

				  # for benchmarks/dynamo/check_accuracy.py, we need to put results in a rocm specific directory to avoid clashes with cuda

				  MAYBE_ROCM="rocm/"

				fi

				if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				@ -308,6 +314,13 @@ test_python() {

				  assert_git_not_dirty

				}

				test_lazy_tensor_meta_reference_disabled() {

				  export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1

				  echo "Testing lazy tensor operations without meta reference"

				  time python test/run_test.py --include lazy/test_ts_opinfo.py --verbose

				  export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE

				}

				test_dynamo_wrapped_shard() {

				  if [[ -z "$NUM_TEST_SHARDS" ]]; then

				@ -411,7 +424,10 @@ test_inductor_cpp_wrapper_shard() {

				  # Run certain inductor unit tests with cpp wrapper. In the end state, we

				  # should be able to run all the inductor unit tests with cpp_wrapper.

				  python test/run_test.py --include inductor/test_torchinductor --verbose

				  python test/run_test.py \

				    --include inductor/test_torchinductor inductor/test_max_autotune inductor/test_cpu_repro \

				    --verbose

				  python test/run_test.py --inductor --include test_torch -k 'take' --verbose

				  # Run inductor benchmark tests with cpp wrapper.

				  # Skip benchmark tests if it's in rerun-disabled-mode.

				@ -424,7 +440,7 @@ test_inductor_cpp_wrapper_shard() {

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_timm_training.csv"

				    python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				      --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				@ -434,7 +450,7 @@ test_inductor_cpp_wrapper_shard() {

				      --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_torchbench_inference.csv"

				  fi

				}

				@ -467,6 +483,8 @@ elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)

				elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--export-aot-inductor)

				elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--inductor --inductor-compile-mode max-autotune)

				elif [[ "${TEST_CONFIG}" == *inductor* && "${TEST_CONFIG}" != *perf* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--inductor)

				fi

				@ -481,6 +499,59 @@ else

				  DYNAMO_BENCHMARK_FLAGS+=(--device cuda)

				fi

				test_cachebench() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  local BENCHMARK

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    local BENCHMARK=torchbench

				  elif [[ "${SHARD_NUMBER}" == 2 ]]; then

				    local BENCHMARK=huggingface

				  else

				    echo "invalid SHARD_NUMBER: ${SHARD_NUMBER}"

				    exit 1

				  fi

				  local mode_options=("training" "inference")

				  for mode in "${mode_options[@]}"; do

				    $TASKSET python "benchmarks/dynamo/cachebench.py" \

				        --mode "$mode" \

				        --device cuda \

				        --benchmark "$BENCHMARK" \

				        --repeat 3 \

				        --output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}.json"

				    $TASKSET python "benchmarks/dynamo/cachebench.py" \

				        --mode "$mode" \

				        --dynamic \

				        --device cuda \

				        --benchmark "$BENCHMARK" \

				        --repeat 3 \

				        --output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}_dynamic.json"

				  done

				}

				test_verify_cachebench() {

				  TMP_TEST_REPORTS_DIR=$(mktemp -d)

				  TEST_OUTPUT="$TMP_TEST_REPORTS_DIR/test.json"

				  $TASKSET python "benchmarks/dynamo/cachebench.py" \

				      --mode training \

				      --device cpu \

				      --model nanogpt \

				      --benchmark torchbench \

				      --output "$TEST_OUTPUT"

				  # -s checks file exists and is non empty

				  if [[ ! -s "$TEST_OUTPUT" ]]; then

				    echo "Cachebench failed to produce an output."

				    echo "Run 'python benchmarks/dynamo/cachebench.py' to make sure it works"

				    exit 1

				  fi

				}

				test_perf_for_dashboard() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				@ -509,6 +580,10 @@ test_perf_for_dashboard() {

				    test_inductor_set_cpu_affinity

				  elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then

				    device=cuda_a10g

				  elif [[ "${TEST_CONFIG}" == *h100* ]]; then

				    device=cuda_h100

				  elif [[ "${TEST_CONFIG}" == *rocm* ]]; then

				    device=rocm

				  fi

				  for mode in "${modes[@]}"; do

				@ -625,16 +700,16 @@ test_single_dynamo_benchmark() {

				      TEST_CONFIG=${TEST_CONFIG//_avx512/}

				    fi

				    python "benchmarks/dynamo/$suite.py" \

				      --ci --accuracy --timing --explain \

				      --ci --accuracy --timing --explain --print-compilation-time \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" \

				      "$@" "${partition_flags[@]}" \

				      --output "$TEST_REPORTS_DIR/${name}_${suite}.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"

				    python benchmarks/dynamo/check_graph_breaks.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"

				  fi

				}

				@ -657,7 +732,7 @@ test_inductor_halide() {

				}

				test_inductor_triton_cpu() {

				  python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose

				  python test/run_test.py --include inductor/test_triton_cpu_backend.py inductor/test_torchinductor_strided_blocks.py --verbose

				  assert_git_not_dirty

				}

				@ -687,6 +762,8 @@ test_dynamo_benchmark() {

				      fi

				    elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				    elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				    else

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				      test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"

				@ -721,7 +798,7 @@ test_inductor_torchbench_smoketest_perf() {

				      --only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_huggingface_training.csv"

				  done

				}

				@ -1096,8 +1173,9 @@ build_xla() {

				  apply_patches

				  SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				  # These functions are defined in .circleci/common.sh in pytorch/xla repo

				  retry install_deps_pytorch_xla $XLA_DIR $USE_CACHE

				  retry install_pre_deps_pytorch_xla $XLA_DIR $USE_CACHE

				  CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch:${CMAKE_PREFIX_PATH}" XLA_SANDBOX_BUILD=1 build_torch_xla $XLA_DIR

				  retry install_post_deps_pytorch_xla

				  assert_git_not_dirty

				}

				@ -1397,14 +1475,13 @@ test_executorch() {

				  pushd /executorch

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  # For llama3

				  bash examples/models/llama3_2_vision/install_requirements.sh

				  # NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch

				  # from the PR

				  bash .ci/scripts/setup-linux.sh cmake

				  bash .ci/scripts/setup-linux.sh --build-tool cmake

				  echo "Run ExecuTorch unit tests"

				  pytest -v -n auto

				@ -1428,7 +1505,7 @@ test_executorch() {

				test_linux_aarch64() {

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				        test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \

				        test_foreach test_reductions test_unary_ufuncs \

				        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \

				        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Dynamo tests

				@ -1450,6 +1527,27 @@ test_linux_aarch64() {

				       --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				}

				test_operator_benchmark() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  TEST_DIR=$(pwd)

				  test_inductor_set_cpu_affinity

				  cd benchmarks/operator_benchmark/pt_extension

				  python setup.py install

				  cd "${TEST_DIR}"/benchmarks/operator_benchmark

				  $TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \

				      --output-dir "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv"

				  pip_install pandas

				  python check_perf_csv.py \

				      --actual "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv" \

				      --expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"

				}

				if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

				  (cd test && python -c "import torch; print(torch.__config__.show())")

				  (cd test && python -c "import torch; print(torch.__config__.parallel_info())")

				@ -1480,6 +1578,19 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_rpc

				  fi

				elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then

				  TEST_MODE="short"

				  if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				    if [[ "${TEST_CONFIG}" == *long* ]]; then

				      TEST_MODE="long"

				    elif [[ "${TEST_CONFIG}" == *all* ]]; then

				      TEST_MODE="all"

				    fi

				    test_operator_benchmark cpu ${TEST_MODE}

				  fi

				elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

				@ -1496,6 +1607,16 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then

				  install_torchvision

				  id=$((SHARD_NUMBER-1))

				  test_dynamo_benchmark timm_models "$id"

				elif [[ "${TEST_CONFIG}" == cachebench ]]; then

				  install_torchaudio cuda

				  install_torchvision

				  checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_cachebench

				elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then

				  install_torchaudio cpu

				  install_torchvision

				  checkout_install_torchbench nanogpt

				  PYTHONPATH=$(pwd)/torchbench test_verify_cachebench

				elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				    install_torchaudio cpu

				@ -1532,6 +1653,7 @@ elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then

				  install_torchvision

				  checkout_install_torchbench hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				  test_inductor_aoti

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor_shard "${SHARD_NUMBER}"

				@ -1551,6 +1673,7 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then

				  test_python_shard "$SHARD_NUMBER"

				  test_aten

				elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  test_lazy_tensor_meta_reference_disabled

				  test_without_numpy

				  install_torchvision

				  test_python_shard 1
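Several hunks in `test.sh` above prepend `${MAYBE_ROCM}` to the expected-accuracy CSV path so that ROCm runs read their baselines from a `rocm/` subdirectory. A rough Python sketch of that path construction is shown here; the function name `expected_accuracy_csv` and the example build environment strings are hypothetical, and the sketch assumes the variable is empty whenever the build environment does not contain `rocm`.

```python
def expected_accuracy_csv(build_environment: str, filename: str) -> str:
    # Mirrors MAYBE_ROCM="rocm/" in test.sh: the prefix is only set when the
    # build environment name contains "rocm"; otherwise it is empty.
    maybe_rocm = "rocm/" if "rocm" in build_environment else ""
    return f"benchmarks/dynamo/ci_expected_accuracy/{maybe_rocm}{filename}"


if __name__ == "__main__":
    # Hypothetical build environment names, used only for illustration.
    print(expected_accuracy_csv("linux-rocm-example", "inductor_timm_training.csv"))
    print(expected_accuracy_csv("linux-cuda-example", "inductor_timm_training.csv"))
```

Keeping the prefix in one variable means every `check_accuracy.py` and `check_graph_breaks.py` call picks up the platform-specific directory without its own conditional.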

									
										41

.ci/pytorch/test_example_code/cnn_smoke_win_arm64.py
									
										Normal file
									
												
				@ -0,0 +1,41 @@

				r"""

				It's used to check basic cnn features with cpu-only.

				For example, it would throw exception if some components are missing

				"""

				import torch

				import torch.nn as nn

				import torch.nn.functional as F

				import torch.optim as optim

				class SimpleCNN(nn.Module):

				    def __init__(self):

				        super().__init__()

				        self.conv = nn.Conv2d(1, 1, 3)

				        self.pool = nn.MaxPool2d(2, 2)

				    def forward(self, inputs):

				        output = self.pool(F.relu(self.conv(inputs)))

				        output = output.view(1)

				        return output

				try:

				    # Mock one infer

				    net = SimpleCNN()

				    net_inputs = torch.rand((1, 1, 5, 5))

				    outputs = net(net_inputs)

				    print(outputs)

				    criterion = nn.MSELoss()

				    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.1)

				    # Mock one step training

				    label = torch.full((1,), 1.0, dtype=torch.float)

				    loss = criterion(outputs, label)

				    loss.backward()

				    optimizer.step()

				except Exception as e:

				    print(f"An error occurred: {e}")

									
										13

.ci/pytorch/test_example_code/rnn_smoke_win_arm64.py
									
										Normal file
									
												
				@ -0,0 +1,13 @@

				r"""

				It's used to check basic rnn features with cpu-only.

				For example, it would throw an exception if some components are missing

				"""

				import torch

				import torch.nn as nn

				rnn = nn.RNN(10, 20, 2)

				inputs = torch.randn(5, 3, 10)

				h0 = torch.randn(2, 3, 20)

				output, hn = rnn(inputs, h0)

									
										3

.ci/pytorch/win-test.sh
									
												
				@ -18,6 +18,9 @@ export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/w/build-result

				PYTORCH_FINAL_PACKAGE_DIR_WIN=$(cygpath -w "${PYTORCH_FINAL_PACKAGE_DIR}")

				export PYTORCH_FINAL_PACKAGE_DIR_WIN

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				mkdir -p "$TMP_DIR"/build/torch

				export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers

									
										31

.ci/pytorch/windows/arm64/bootstrap_apl.bat
									
										Normal file
									
												
				@ -0,0 +1,31 @@

				@echo off

				echo Dependency ARM Performance Libraries (APL) installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the ARM Performance Libraries (APL)

				set DOWNLOAD_URL="https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_Windows.msi"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\arm-performance-libraries.msi

				:: Download installer

				echo Downloading ARM Performance Libraries (APL)...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install ARM Performance Libraries (APL)

				echo Installing ARM Performance Libraries (APL)...

				msiexec /i "%INSTALLER_FILE%" /qn /norestart ACCEPT_EULA=1 INSTALLFOLDER="%DEPENDENCIES_DIR%"

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install ARM Performance Libraries (APL) components. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to environment

				echo ARMPL_DIR=%DEPENDENCIES_DIR%\armpl_24.10\>> %GITHUB_ENV%

				echo %DEPENDENCIES_DIR%\armpl_24.10\bin\>> %GITHUB_PATH%

				echo Dependency ARM Performance Libraries (APL) installation finished.
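The last lines of `bootstrap_apl.bat` hand values to later workflow steps by appending to the files named by `%GITHUB_ENV%` and `%GITHUB_PATH%`. A minimal Python sketch of the same mechanism is given below. The helper names `export_env` and `add_to_path` and the `C:\deps` directory are hypothetical; the only behavior relied on is that GitHub Actions reads `NAME=value` lines from the `GITHUB_ENV` file and path entries from the `GITHUB_PATH` file.

```python
import os
import tempfile


def export_env(name: str, value: str) -> None:
    # GitHub Actions reads NAME=value lines from the file named by GITHUB_ENV.
    with open(os.environ["GITHUB_ENV"], "a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")


def add_to_path(directory: str) -> None:
    # Each line appended to the GITHUB_PATH file is added to PATH
    # for subsequent steps.
    with open(os.environ["GITHUB_PATH"], "a", encoding="utf-8") as f:
        f.write(f"{directory}\n")


if __name__ == "__main__":
    # Outside a workflow the two variables are unset; point them at temp files.
    tmp = tempfile.mkdtemp()
    os.environ.setdefault("GITHUB_ENV", os.path.join(tmp, "env"))
    os.environ.setdefault("GITHUB_PATH", os.path.join(tmp, "path"))
    deps = r"C:\deps"  # hypothetical stand-in for %DEPENDENCIES_DIR%
    export_env("ARMPL_DIR", deps + "\\armpl_24.10\\")
    add_to_path(deps + "\\armpl_24.10\\bin\\")
```

Values written this way only become visible in later steps of the job, not in the step that wrote them.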

									
										41

.ci/pytorch/windows/arm64/bootstrap_buildtools.bat
									
										Normal file
									
												
				@ -0,0 +1,41 @@

				@echo off

				echo Dependency MSVC Build Tools with C++ with ARM64/ARM64EC components installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir "%DOWNLOADS_DIR%"

				if not exist "%DEPENDENCIES_DIR%" mkdir "%DEPENDENCIES_DIR%"

				:: Set download URL for the Visual Studio Installer

				set DOWNLOAD_URL=https://aka.ms/vs/17/release/vs_BuildTools.exe

				set INSTALLER_FILE=%DOWNLOADS_DIR%\vs_BuildTools.exe

				:: Download installer

				echo Downloading Visual Studio Build Tools with C++ installer...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install the Visual Studio Build Tools with C++ components

				echo Installing Visual Studio Build Tools with C++ components...

				echo Installing MSVC %MSVC_VERSION%

				"%INSTALLER_FILE%" --norestart --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^

				    --add Microsoft.VisualStudio.Workload.VCTools ^

				    --add Microsoft.VisualStudio.Component.Windows10SDK ^

				    --add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^

				    --add Microsoft.VisualStudio.Component.VC.ASAN ^

				    --add Microsoft.VisualStudio.Component.VC.CMake.Project ^

				    --add Microsoft.VisualStudio.Component.VC.CoreBuildTools ^

				    --add Microsoft.VisualStudio.Component.VC.CoreIde ^

				    --add Microsoft.VisualStudio.Component.VC.Redist.14.Latest ^

				    --add Microsoft.VisualStudio.Component.VC.Tools.ARM64EC ^

				    --add Microsoft.VisualStudio.Component.VC.Tools.ARM64 ^

				    --add Microsoft.VisualStudio.Component.VC.Tools.x86.x64

				echo exitcode = %errorlevel%

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo Failed to install Visual Studio Build Tools with C++ components.

				    exit /b 1

				)

				echo Dependency Visual Studio Build Tools with C++ installation finished.

									
										37

.ci/pytorch/windows/arm64/bootstrap_git.bat
									
										Normal file
									
												
				@ -0,0 +1,37 @@

				:: We need to install a newer version of Git manually, as the "-submodules" function is not supported by the runner's default Git version.

				@echo off

				echo Dependency Git installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the Git

				set DOWNLOAD_URL="https://github.com/git-for-windows/git/releases/download/v2.46.0.windows.1/Git-2.46.0-64-bit.exe"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\Git-2.46.0-64-bit.exe

				:: Download installer

				echo Downloading Git...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install Git

				echo Installing Git...

				"%INSTALLER_FILE%" /VERYSILENT /DIR="%DEPENDENCIES_DIR%\git"

				dir %DEPENDENCIES_DIR%\git

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Git. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Enable long paths

				call "%DEPENDENCIES_DIR%\git\cmd\git.exe" config --system core.longpaths true

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\git\cmd\;%DEPENDENCIES_DIR%\git\bin\>> %GITHUB_PATH%

				echo Dependency Git installation finished.
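In `bootstrap_git.bat` the `%errorlevel%` check runs after the `dir` command, so it reports the exit status of whichever command ran last rather than necessarily that of the installer. One way to keep the check tied to the step it belongs to is sketched here in Python. This is an illustration under stated assumptions, not the script's method: `run_step` is a hypothetical helper, and the commands are stand-in Python child processes rather than the real Git installer.

```python
import subprocess
import sys


def run_step(description: str, command: list[str]) -> None:
    # Check the exit code right after the command it belongs to, before any
    # other command can overwrite it.
    result = subprocess.run(command)
    if result.returncode != 0:
        raise RuntimeError(
            f"Failed to {description}. (exitcode = {result.returncode})"
        )


if __name__ == "__main__":
    # Stand-in commands: a Python child process that exits with a chosen code.
    run_step("run a passing step", [sys.executable, "-c", "raise SystemExit(0)"])
    try:
        run_step("install Git", [sys.executable, "-c", "raise SystemExit(3)"])
    except RuntimeError as err:
        print(err)
```

The same ordering concern applies to any script that runs a diagnostic command between an install step and its status check.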

									
										33

.ci/pytorch/windows/arm64/bootstrap_libuv.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency libuv installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				cd %DEPENDENCIES_DIR%

				git clone https://github.com/libuv/libuv.git -b v1.39.0

				echo Configuring libuv...

				mkdir libuv\build

				cd libuv\build

				cmake .. -DBUILD_TESTING=OFF

				echo Building libuv...

				cmake --build . --config Release

				echo Installing libuv...

				cmake --install . --prefix ../install

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install libuv. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				echo Dependency libuv installation finished.

									
										46

.ci/pytorch/windows/arm64/bootstrap_openblas.bat
									
										Normal file
									
												
				@ -0,0 +1,46 @@

				@echo off

				echo Dependency OpenBLAS installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: Clone OpenBLAS

				cd %DEPENDENCIES_DIR%

				git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29

				echo Configuring OpenBLAS...

				mkdir OpenBLAS\build

				cd OpenBLAS\build

				cmake .. -G Ninja ^

				  -DBUILD_TESTING=0 ^

				  -DBUILD_BENCHMARKS=0 ^

				  -DC_LAPACK=1 ^

				  -DNOFORTRAN=1 ^

				  -DDYNAMIC_ARCH=0 ^

				  -DARCH=arm64 ^

				  -DBINARY=64 ^

				  -DTARGET=GENERIC ^

				  -DUSE_OPENMP=1 ^

				  -DCMAKE_SYSTEM_PROCESSOR=ARM64 ^

				  -DCMAKE_SYSTEM_NAME=Windows ^

				  -DCMAKE_BUILD_TYPE=Release

				echo Building OpenBLAS...

				cmake --build . --config Release

				echo Installing OpenBLAS...

				cmake --install . --prefix ../install

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install OpenBLAS. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				echo Dependency OpenBLAS installation finished.

									
										44

.ci/pytorch/windows/arm64/bootstrap_python.bat
									
										Normal file
									
												
				@ -0,0 +1,44 @@

				@echo off

				echo Dependency Python installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				if "%DESIRED_PYTHON%" == "3.13" (

				    echo Python version is set to 3.13

				    set DOWNLOAD_URL=https://www.python.org/ftp/python/3.13.2/python-3.13.2-arm64.exe

				) else if "%DESIRED_PYTHON%" == "3.12" (

				    echo Python version is set to 3.12

				    set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe

				) else if "%DESIRED_PYTHON%" == "3.11" (

				    echo Python version is set to 3.11

				    set DOWNLOAD_URL=https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe

				) else (

				    echo DESIRED_PYTHON not defined, Python version is set to 3.12

				    set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe

				)

				set INSTALLER_FILE=%DOWNLOADS_DIR%\python-installer.exe

				:: Download installer

				echo Downloading Python...

				curl -L -o "%INSTALLER_FILE%" "%DOWNLOAD_URL%"

				:: Install Python

				echo Installing Python...

				"%INSTALLER_FILE%" /quiet Include_debug=1 TargetDir="%DEPENDENCIES_DIR%\Python"

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Python. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\Python\>> %GITHUB_PATH%

				echo %DEPENDENCIES_DIR%\Python\scripts\>> %GITHUB_PATH%

				echo %DEPENDENCIES_DIR%\Python\libs\>> %GITHUB_PATH%

				echo Dependency Python installation finished.
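`bootstrap_python.bat` selects an installer URL from `%DESIRED_PYTHON%` and falls back to 3.12 when the value is missing or unrecognized. The same selection can be sketched as a table lookup in Python. The URLs are copied from the script above; the names `PYTHON_INSTALLERS`, `DEFAULT_VERSION`, and `installer_url` are hypothetical.

```python
from typing import Optional

# Installer URLs as listed in bootstrap_python.bat.
PYTHON_INSTALLERS = {
    "3.13": "https://www.python.org/ftp/python/3.13.2/python-3.13.2-arm64.exe",
    "3.12": "https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe",
    "3.11": "https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe",
}
DEFAULT_VERSION = "3.12"


def installer_url(desired_python: Optional[str]) -> str:
    # Unknown or missing versions fall back to the default, as in the
    # final "else" branch of the batch script.
    return PYTHON_INSTALLERS.get(desired_python or "", PYTHON_INSTALLERS[DEFAULT_VERSION])


if __name__ == "__main__":
    print(installer_url("3.13"))
    print(installer_url(None))
```

A table keeps the default in one place, whereas the batch script repeats the 3.12 URL in two branches.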

									
										33

.ci/pytorch/windows/arm64/bootstrap_rust.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency Rust installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				set DOWNLOAD_URL="https://static.rust-lang.org/rustup/dist/x86_64-pc-windows-msvc/rustup-init.exe"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\rustup-init.exe

				set RUSTUP_HOME=%DEPENDENCIES_DIR%\rust

				set CARGO_HOME=%DEPENDENCIES_DIR%\cargo

				:: Download installer

				echo Downloading Rust...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install Rust

				echo Installing Rust...

				"%INSTALLER_FILE%" -q -y --default-host aarch64-pc-windows-msvc --default-toolchain stable --profile default

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Rust. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\cargo\bin\>> %GITHUB_PATH%

				echo RUSTUP_HOME=%DEPENDENCIES_DIR%\rust>> %GITHUB_ENV%

				echo CARGO_HOME=%DEPENDENCIES_DIR%\cargo>> %GITHUB_ENV%

				echo Dependency Rust installation finished.

									
										33

.ci/pytorch/windows/arm64/bootstrap_sccache.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency sccache installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the sccache

				set DOWNLOAD_URL="https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-pc-windows-msvc.zip"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\sccache.zip

				:: Download installer

				echo Downloading sccache.zip...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install sccache

				echo Extracting sccache.zip...

				tar -xf "%INSTALLER_FILE%" -C %DEPENDENCIES_DIR%

				cd %DEPENDENCIES_DIR%

				ren sccache-v0.8.1-x86_64-pc-windows-msvc sccache

				cd ..

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install sccache. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\sccache\>> %GITHUB_PATH%

				echo Dependency sccache installation finished.

									
										22

.ci/pytorch/windows/arm64/bootstrap_tests.bat
									
										Normal file
									
												
				@ -0,0 +1,22 @@

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				pip install pytest numpy protobuf expecttest hypothesis

				:: find file name for pytorch wheel

				for /f "delims=" %%f in ('dir /b "%PYTORCH_FINAL_PACKAGE_DIR%" ^| findstr "torch-"') do set "TORCH_WHEEL_FILENAME=%PYTORCH_FINAL_PACKAGE_DIR%\%%f"

				pip install %TORCH_WHEEL_FILENAME%
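
Note on the wheel lookup above: for readers less familiar with `for /f`, the loop can be sketched in Python. This is an illustrative equivalent only, not part of the change. `find_torch_wheel` is a hypothetical name, and iterating in sorted order is an assumption (the order `dir /b` returns depends on the filesystem). Like the batch loop, it keeps the last entry whose name contains `torch-`.

```python
from pathlib import Path
from typing import Optional


def find_torch_wheel(package_dir: str) -> Optional[Path]:
    """Return the last directory entry whose name contains 'torch-', or None."""
    match = None
    for entry in sorted(Path(package_dir).iterdir()):
        if "torch-" in entry.name:
            # Later matches overwrite earlier ones, as `set` does inside the batch loop.
            match = entry
    return match
```

If the package directory can hold more than one matching file, the last one wins in both versions, so a stricter pattern may be worth considering.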

									
										101

.ci/pytorch/windows/arm64/build_libtorch.bat
									
										Normal file
									
												
				@ -0,0 +1,101 @@

				@echo on

				:: environment variables

				set CMAKE_BUILD_TYPE=%BUILD_TYPE%

				set CMAKE_C_COMPILER_LAUNCHER=sccache

				set CMAKE_CXX_COMPILER_LAUNCHER=sccache

				set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install

				set MSSdk=1

				if defined PYTORCH_BUILD_VERSION (

				  set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%

				  set PYTORCH_BUILD_NUMBER=1

				)

				:: Set BLAS type

				if %ENABLE_APL% == 1 (

				    set BLAS=APL

				    set USE_LAPACK=1

				) else if %ENABLE_OPENBLAS% == 1 (

				    set BLAS=OpenBLAS

				    set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install

				)

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: copy libuv.dll

				copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: python install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				:: DISTUTILS_USE_SDK should be set after psutil dependency

				set DISTUTILS_USE_SDK=1

				:: start sccache server and reset sccache stats

				sccache --start-server

				sccache --zero-stats

				sccache --show-stats

				:: Prepare the environment

				mkdir libtorch

				mkdir libtorch\bin

				mkdir libtorch\cmake

				mkdir libtorch\include

				mkdir libtorch\lib

				mkdir libtorch\share

				mkdir libtorch\test

				:: Call LibTorch build script

				python ./tools/build_libtorch.py

				:: Check if there is an error

				IF ERRORLEVEL 1 exit /b 1

				IF NOT ERRORLEVEL 0 exit /b 1

				:: Move the files to the correct location

				move /Y torch\bin\*.* libtorch\bin\

				move /Y torch\cmake\*.* libtorch\cmake\

				robocopy /move /e torch\include\ libtorch\include\

				move /Y torch\lib\*.* libtorch\lib\

				robocopy /move /e torch\share\ libtorch\share\

				move /Y torch\test\*.* libtorch\test\

				move /Y libtorch\bin\*.dll libtorch\lib\

				:: Set version

				echo %PYTORCH_BUILD_VERSION% > libtorch\build-version

				git rev-parse HEAD > libtorch\build-hash

				:: Set LIBTORCH_PREFIX

				IF "%DEBUG%" == "" (

				    set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps

				) ELSE (

				    set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps-debug

				)

				:: Create output

				C:\Windows\System32\tar.exe -cvaf %LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip -C libtorch *

				:: Copy output to target directory

				if not exist ..\output mkdir ..\output

				copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\"

				copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\%LIBTORCH_PREFIX%-latest.zip"

				:: Cleanup raw data to save space

				rmdir /s /q libtorch

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed on build_libtorch. (exitcode = %errorlevel%)"

				    exit /b 1

				)

									
										60

.ci/pytorch/windows/arm64/build_pytorch.bat
									
										Normal file
									
												
				@ -0,0 +1,60 @@

				@echo on

				:: environment variables

				set CMAKE_BUILD_TYPE=%BUILD_TYPE%

				set CMAKE_C_COMPILER_LAUNCHER=sccache

				set CMAKE_CXX_COMPILER_LAUNCHER=sccache

				set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install

				set MSSdk=1

				if defined PYTORCH_BUILD_VERSION (

				  set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%

				  set PYTORCH_BUILD_NUMBER=1

				)

				:: Set BLAS type

				if %ENABLE_APL% == 1 (

				    set BLAS=APL

				    set USE_LAPACK=1

				) else if %ENABLE_OPENBLAS% == 1 (

				    set BLAS=OpenBLAS

				    set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install

				)

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: copy libuv.dll

				copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: python install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				:: DISTUTILS_USE_SDK should be set after psutil dependency

				set DISTUTILS_USE_SDK=1

				:: start sccache server and reset sccache stats

				sccache --start-server

				sccache --zero-stats

				sccache --show-stats

				:: Call PyTorch build script

				python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"

				:: Capture the build exit code before sccache overwrites errorlevel

				set BUILD_EXITCODE=%errorlevel%

				:: show sccache stats

				sccache --show-stats

				:: Check if the build was successful

				if %BUILD_EXITCODE% neq 0 (

				    echo "Failed on build_pytorch. (exitcode = %BUILD_EXITCODE%)"

				    exit /b 1

				)

									
										49

.ci/pytorch/windows/arm64/smoke_test.bat
									
										Normal file
									
												
				@ -0,0 +1,49 @@

				@echo off

				setlocal

				if "%PACKAGE_TYPE%" == "wheel" goto wheel

				if "%PACKAGE_TYPE%" == "libtorch" goto libtorch

				echo "unknown package type"

				exit /b 1

				:wheel

				call %PYTORCH_ROOT%\.ci\pytorch\windows\arm64\bootstrap_tests.bat

				echo Running python rnn_smoke.py...

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke_win_arm64.py

				if errorlevel 1 exit /b 1

				echo Checking that basic CNN works...

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke_win_arm64.py

				if errorlevel 1 exit /b 1

				goto end

				:libtorch

				echo "install and test libtorch"

				if not exist tmp mkdir tmp

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do C:\Windows\System32\tar.exe -xf "%%i" -C tmp

				if ERRORLEVEL 1 exit /b 1

				pushd tmp

				set VC_VERSION_LOWER=14

				set VC_VERSION_UPPER=36

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				set install_root=%CD%

				set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include

				set LIB=%LIB%;%install_root%\lib

				set PATH=%PATH%;%install_root%\lib

				cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\simple-torch-test.cpp c10.lib torch_cpu.lib /EHsc /std:c++17

				if ERRORLEVEL 1 exit /b 1

				.\simple-torch-test.exe

				if ERRORLEVEL 1 exit /b 1

				:end

									
										13

.ci/pytorch/windows/condaenv.bat
									
												
				@ -9,12 +9,13 @@ FOR %%v IN (%DESIRED_PYTHON%) DO (

				    set PYTHON_VERSION_STR=%%v

				    set PYTHON_VERSION_STR=!PYTHON_VERSION_STR:.=!

				    conda remove -n py!PYTHON_VERSION_STR! --all -y || rmdir %CONDA_HOME%\envs\py!PYTHON_VERSION_STR! /s

				    if "%%v" == "3.8" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=1.11 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.1.2 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13t" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2 boto3 cmake ninja typing_extensions setuptools=72.1.0 python-freethreading python=3.13

				    call conda run -n py!PYTHON_VERSION_STR! pip install pyyaml

				    call conda run -n py!PYTHON_VERSION_STR! pip install mkl-include

				    call conda run -n py!PYTHON_VERSION_STR! pip install mkl-static

				)

									
										59

.ci/pytorch/windows/cuda128.bat
									
										Normal file
									
												
				@ -0,0 +1,59 @@

				@echo off

				set MODULE_NAME=pytorch

				IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (

				    call internal\clone.bat

				    cd %~dp0

				) ELSE (

				    call internal\clean.bat

				)

				IF ERRORLEVEL 1 goto :eof

				call internal\check_deps.bat

				IF ERRORLEVEL 1 goto :eof

				REM Check for optional components

				set USE_CUDA=

				set CMAKE_GENERATOR=Visual Studio 15 2017 Win64

				IF "%NVTOOLSEXT_PATH%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib"  (

				        set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				    ) ELSE (

				        echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing

				        exit /b 1

				    )

				)

				IF "%CUDA_PATH_V128%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe" (

				        set "CUDA_PATH_V128=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"

				    ) ELSE (

				        echo CUDA 12.8 not found, failing

				        exit /b 1

				    )

				)

				IF "%BUILD_VISION%" == "" (

				    set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0

				    set TORCH_NVCC_FLAGS=-Xfatbin -compress-all

				) ELSE (

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120

				)

				set "CUDA_PATH=%CUDA_PATH_V128%"

				set "PATH=%CUDA_PATH_V128%\bin;%PATH%"

				:optcheck

				call internal\check_opts.bat

				IF ERRORLEVEL 1 goto :eof

				if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..

				call  %~dp0\internal\copy.bat

				IF ERRORLEVEL 1 goto :eof

				call  %~dp0\internal\setup.bat

				IF ERRORLEVEL 1 goto :eof
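
Note on `TORCH_CUDA_ARCH_LIST` above: the semicolon-separated list corresponds roughly to one `-gencode` flag per architecture. The Python sketch below illustrates that mapping under simplifying assumptions. `gencode_flags` is a hypothetical helper, and it always emits `code=sm_XX`, whereas the `NVCC_FLAGS` line in this script uses `code=compute_XX` for 8.0 and newer and omits 6.1.

```python
def gencode_flags(arch_list: str) -> list[str]:
    """Turn '8.0;12.0' into one -gencode flag per listed architecture."""
    flags = []
    for arch in arch_list.split(";"):
        arch = arch.strip()
        if not arch:
            continue  # tolerate empty entries such as a trailing ';'
        code = arch.replace(".", "")  # '12.0' -> '120'
        flags.append(f"-gencode=arch=compute_{code},code=sm_{code}")
    return flags


if __name__ == "__main__":
    for flag in gencode_flags("5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0"):
        print(flag)
```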

									
										32

.ci/pytorch/windows/internal/cuda_install.bat
									
												
				@ -9,7 +9,8 @@ if "%CUDA_VERSION%" == "xpu" (

				    exit /b 0

				)

				set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%

				set SRC_DIR=%~dp0\..

				if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"

				set /a CUDA_VER=%CUDA_VERSION%

				@ -23,9 +24,9 @@ set CUDNN_LIB_FOLDER="lib\x64"

				if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" goto set_cuda_env_vars

				if %CUDA_VER% EQU 118 goto cuda118

				if %CUDA_VER% EQU 121 goto cuda121

				if %CUDA_VER% EQU 124 goto cuda124

				if %CUDA_VER% EQU 126 goto cuda126

				if %CUDA_VER% EQU 128 goto cuda128

				echo CUDA %CUDA_VERSION_STR% is not supported

				exit /b 1

				@ -111,6 +112,33 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda128

				set CUDA_INSTALL_EXE=cuda_12.8.0_571.96_windows.exe

				if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (

				    curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"

				    if errorlevel 1 exit /b 1

				    set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"

				    set "ARGS=cuda_profiler_api_12.8 thrust_12.8 nvcc_12.8 cuobjdump_12.8 nvprune_12.8 nvprof_12.8 cupti_12.8 cublas_12.8 cublas_dev_12.8 cudart_12.8 cufft_12.8 cufft_dev_12.8 curand_12.8 curand_dev_12.8 cusolver_12.8 cusolver_dev_12.8 cusparse_12.8 cusparse_dev_12.8 npp_12.8 npp_dev_12.8 nvrtc_12.8 nvrtc_dev_12.8 nvml_dev_12.8 nvjitlink_12.8 nvtx_12.8"

				)

				set CUDNN_FOLDER=cudnn-windows-x86_64-9.7.0.66_cuda12-archive

				set CUDNN_LIB_FOLDER="lib"

				set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"

				if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (

				    curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"

				    if errorlevel 1 exit /b 1

				    set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"

				)

				@REM cuDNN 8.3+ requires zlib to be installed on the path

				echo Installing ZLIB dlls

				curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"

				7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"

				xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda_common

				:: NOTE: We only install CUDA if we don't have it installed already.

				:: With GHA runners these should be pre-installed as part of our AMI process
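
Note on the version handling above: the script dispatches on a compact value such as `128` while paths use the dotted `CUDA_VERSION_STR`. A minimal Python sketch of that conversion follows, mirroring the substring split `%CUDA_VERSION:~0,-1%` / `%CUDA_VERSION:~-1,1%` that appears in `smoke_test.bat` in this diff. `cuda_version_str` is a hypothetical name, and the input validation is an addition the batch code does not perform.

```python
def cuda_version_str(compact: str) -> str:
    """'128' -> '12.8': the last digit is the minor version, the rest is the major."""
    if not compact.isdigit() or len(compact) < 2:
        raise ValueError(f"unexpected CUDA version: {compact!r}")
    major, minor = compact[:-1], compact[-1]
    return f"{major}.{minor}"
```

The split assumes a single-digit minor version, which holds for every version handled in this file (118, 121, 124, 126, 128).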

									
										109

.ci/pytorch/windows/internal/smoke_test.bat
									
												
				@ -27,7 +27,6 @@ for /F "delims=" %%i in ('wmic path win32_VideoController get name') do (

				endlocal & set NVIDIA_GPU_EXISTS=%NVIDIA_GPU_EXISTS%

				if "%PACKAGE_TYPE%" == "wheel" goto wheel

				if "%PACKAGE_TYPE%" == "conda" goto conda

				if "%PACKAGE_TYPE%" == "libtorch" goto libtorch

				echo "unknown package type"

				@ -37,16 +36,23 @@ exit /b 1

				echo "install wheel package"

				set PYTHON_INSTALLER_URL=

				if "%DESIRED_PYTHON%" == "3.13t" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.13" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.12" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.12.0/python-3.12.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.11" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.11.0/python-3.11.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.10" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.10.0/python-3.10.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.9" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.9.0/python-3.9.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.8" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.8.2/python-3.8.2-amd64.exe"

				if "%PYTHON_INSTALLER_URL%" == "" (

				    echo Python %DESIRED_PYTHON% not supported yet

				)

				set ADDITIONAL_OPTIONS=""

				set PYTHON_EXEC="python"

				if "%DESIRED_PYTHON%" == "3.13t" (

				    set ADDITIONAL_OPTIONS="Include_freethreaded=1"

				    set PYTHON_EXEC="python3.13t"

				)

				del python-amd64.exe

				curl --retry 3 -kL "%PYTHON_INSTALLER_URL%" --output python-amd64.exe

				if errorlevel 1 exit /b 1

				@ -55,85 +61,39 @@ if errorlevel 1 exit /b 1

				:: the installed Python to PATH system-wide. Even calling set PATH=%ORIG_PATH% later on won't make

				:: a change. As the builder directory will be removed after the smoke test, all subsequent non-binary

				:: jobs will fail to find any Python executable there

				start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 TargetDir=%CD%\Python

				start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 %ADDITIONAL_OPTIONS% TargetDir=%CD%\Python

				if errorlevel 1 exit /b 1

				set "PATH=%CD%\Python%PYTHON_VERSION%\Scripts;%CD%\Python;%PATH%"

				if "%DESIRED_PYTHON%" == "3.13" pip install -q --pre numpy==2.1.0 protobuf

				if "%DESIRED_PYTHON%" == "3.12" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.11" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.10" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.9" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.8" pip install -q numpy protobuf

				if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install --pre numpy==2.2.1 protobuf

				if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install --pre numpy==2.1.2 protobuf

				if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.11" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.10" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf networkx

				if errorlevel 1 exit /b 1

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"

				if "%PYTORCH_BUILD_VERSION:dev=%" NEQ "%PYTORCH_BUILD_VERSION%" (

				    set "CHANNEL=nightly"

				) else (

				    set "CHANNEL=test"

				)

				set "EXTRA_INDEX= "

				if "%CUDA_VERSION%" == "xpu" set "EXTRA_INDEX=--index-url https://download.pytorch.org/whl/%CHANNEL%/xpu"

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do %PYTHON_EXEC% -m pip install "%%i" %EXTRA_INDEX%

				if errorlevel 1 exit /b 1

				goto smoke_test

				:conda

				echo "install conda package"

				:: Install Miniconda3

				set "CONDA_HOME=%CD%\conda"

				set "tmp_conda=%CONDA_HOME%"

				set "miniconda_exe=%CD%\miniconda.exe"

				set "CONDA_EXTRA_ARGS=cpuonly -c pytorch-nightly"

				if "%CUDA_VERSION%" == "118" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=11.8 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "121" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.1 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "124" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.4 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "126" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.6 -c nvidia -c pytorch-nightly"

				)

				rmdir /s /q conda

				del miniconda.exe

				curl -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o "%miniconda_exe%"

				start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%

				if ERRORLEVEL 1 exit /b 1

				set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"

				conda create -qyn testenv python=%DESIRED_PYTHON%

				if errorlevel 1 exit /b 1

				call conda install -yq conda-build

				if errorlevel 1 exit /b 1

				call %CONDA_HOME%\condabin\activate.bat testenv

				if errorlevel 1 exit /b 1

				set "NO_ARCH_PATH=%PYTORCH_FINAL_PACKAGE_DIR:/=\%\noarch"

				mkdir %NO_ARCH_PATH%

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *') do xcopy "%%i" %NO_ARCH_PATH% /Y

				if ERRORLEVEL 1 exit /b 1

				call conda index %PYTORCH_FINAL_PACKAGE_DIR%

				if errorlevel 1 exit /b 1

				call conda install -yq -c "file:///%PYTORCH_FINAL_PACKAGE_DIR%" pytorch==%PYTORCH_BUILD_VERSION% -c pytorch -c numba/label/dev -c nvidia

				if ERRORLEVEL 1 exit /b 1

				call conda install -yq numpy

				if ERRORLEVEL 1 exit /b 1

				set /a CUDA_VER=%CUDA_VERSION%

				set CUDA_VER_MAJOR=%CUDA_VERSION:~0,-1%

				set CUDA_VER_MINOR=%CUDA_VERSION:~-1,1%

				set CUDA_VERSION_STR=%CUDA_VER_MAJOR%.%CUDA_VER_MINOR%

				:: Install package we just built

				:smoke_test

				python -c "import torch"

				%PYTHON_EXEC% -c "import torch"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that MKL is available

				python -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"

				%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"

				if ERRORLEVEL 1 exit /b 1

				if "%NVIDIA_GPU_EXISTS%" == "0" (

				@ -142,24 +102,24 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (

				)

				echo Checking that CUDA archs are setup correctly

				python -c "import torch; torch.randn([3,5]).cuda()"

				%PYTHON_EXEC% -c "import torch; torch.randn([3,5]).cuda()"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that magma is available

				python -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"

				%PYTHON_EXEC% -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that CuDNN is available

				python -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"

				%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that basic RNN works

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py

				%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py

				if ERRORLEVEL 1 exit /b 1

				echo Checking that basic CNN works

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py

				%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py

				if ERRORLEVEL 1 exit /b 1

				goto end

				@ -167,7 +127,6 @@ goto end

				:libtorch

				echo "install and test libtorch"

				if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1

				if ERRORLEVEL 1 exit /b 1

				@ -179,10 +138,6 @@ pushd tmp\libtorch

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				IF "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
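
Note on the wheel-install hunk of this file: it derives `CHANNEL` from whether `PYTORCH_BUILD_VERSION` contains `dev`, and adds an extra index URL only for XPU builds. A Python sketch of that decision, for illustration only, is below. `release_channel` and `extra_index_args` are hypothetical names, and the case-insensitive match reflects how cmd's `%VAR:dev=%` substitution is understood to behave.

```python
def release_channel(build_version: str) -> str:
    """Return 'nightly' for dev builds, otherwise 'test'."""
    # cmd string substitution matches case-insensitively, hence lower().
    return "nightly" if "dev" in build_version.lower() else "test"


def extra_index_args(cuda_version: str, build_version: str) -> list[str]:
    """Extra pip arguments: only XPU builds point at a channel-specific index."""
    if cuda_version != "xpu":
        return []
    channel = release_channel(build_version)
    return ["--index-url", f"https://download.pytorch.org/whl/{channel}/xpu"]
```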

									
										5

.ci/pytorch/windows/internal/static_lib_test.bat
									
												
				@ -70,7 +70,6 @@ echo "install and test libtorch"

				pip install cmake

				echo "installing cmake"

				if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1

				if ERRORLEVEL 1 exit /b 1

				@ -83,10 +82,6 @@ pushd tmp\libtorch

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				IF "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

									
										6

.ci/pytorch/windows/internal/vc_install_helper.bat
									
												
				@ -1,12 +1,8 @@

				if "%VC_YEAR%" == "2019" powershell windows/internal/vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				if "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe"  -products Microsoft.VisualStudio.Product.BuildTools -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

									
										48

.ci/pytorch/windows/internal/vs2019_install.ps1
									
											
				@ -1,48 +0,0 @@

				# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479

				# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers

				# 16.8.6 BuildTools

				$VS_DOWNLOAD_LINK = "https://ossci-windows.s3.us-east-1.amazonaws.com/vs16.8.6_BuildTools.exe"

				$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"

				$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",

				                                                     "--add Microsoft.Component.MSBuild",

				                                                     "--add Microsoft.VisualStudio.Component.Roslyn.Compiler",

				                                                     "--add Microsoft.VisualStudio.Component.TextTemplating",

				                                                     "--add Microsoft.VisualStudio.Component.VC.CoreIde",

				                                                     "--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",

				                                                     "--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",

				                                                     "--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",

				                                                     "--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")

				curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe

				if ($LASTEXITCODE -ne 0) {

				    echo "Download of the VS 2019 Version 16.8.6 installer failed"

				    exit 1

				}

				if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {

				    $existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath

				    if ($existingPath -ne $null) {

				        if (!${env:CIRCLECI}) {

				            echo "Found correctly versioned existing BuildTools installation in $existingPath"

				            exit 0

				        }

				        echo "Found existing BuildTools installation in $existingPath, keeping it"

				    }

				}

				$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru

				Remove-Item -Path vs_installer.exe -Force

				$exitCode = $process.ExitCode

				if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {

				    echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."

				    curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe

				    if ($LASTEXITCODE -ne 0) {

				        echo "Download of the VS Collect tool failed."

				        exit 1

				    }

				    Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru

				    New-Item -Path "C:\w\build-results" -ItemType "directory" -Force

				    Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"

				    exit 1

				}

									
										23

.ci/pytorch/windows/internal/xpu_install.bat
									
												
				@ -47,9 +47,9 @@ set XPU_EXTRA_INSTALLED=0

				set XPU_EXTRA_UNINSTALL=0

				if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.0] (

				    set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/efc86abd-cb77-452e-a03f-a741895b8ece/intel-deep-learning-essentials-2025.0.0.336_offline.exe

				    set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d6d6c17-ca2d-4735-9331-99447e4a1280/intel-deep-learning-essentials-2025.0.1.28_offline.exe

				    set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product

				    set XPU_BUNDLE_VERSION=2025.0.0+335

				    set XPU_BUNDLE_VERSION=2025.0.1+20

				    set XPU_BUNDLE_INSTALLED=0

				    set XPU_BUNDLE_UNINSTALL=0

				    set XPU_EXTRA_URL=NULL

				@ -104,14 +104,6 @@ goto xpu_install_end

				:xpu_bundle_install

				:: Install Level Zero SDK

				set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

				curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				echo "Installing level zero SDK..."

				7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"

				set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"

				:: Install Bundle

				curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%

				echo "XPU Bundle installing..."

				start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle

				@ -128,3 +120,14 @@ if errorlevel 1 exit /b 1

				del xpu_extra.exe

				:xpu_install_end

				if not "%XPU_ENABLE_KINETO%"=="1" goto install_end

				:: Install Level Zero SDK

				set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

				curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				echo "Installing level zero SDK..."

				7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"

				set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"

				del "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				:install_end

									
										5

.ci/pytorch/windows/xpu.bat
									
												
				@ -28,11 +28,6 @@ call "%XPU_BUNDLE_ROOT%\compiler\latest\env\vars.bat"

				call "%XPU_BUNDLE_ROOT%\ocloc\latest\env\vars.bat"

				IF ERRORLEVEL 1 goto :eof

				:: Workaround for https://github.com/pytorch/pytorch/issues/134989

				set CMAKE_SHARED_LINKER_FLAGS=/FORCE:MULTIPLE

				set CMAKE_MODULE_LINKER_FLAGS=/FORCE:MULTIPLE

				set CMAKE_EXE_LINKER_FLAGS=/FORCE:MULTIPLE

				if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..

				call %~dp0\internal\copy_cpu.bat

				IF ERRORLEVEL 1 goto :eof

Compare commits

3126 Commits sanchitint ... mlazos/hc5

9 .ci/aarch64_linux/aarch64_ci_build.sh
6 .ci/aarch64_linux/aarch64_ci_setup.sh
37 .ci/aarch64_linux/aarch64_wheel_ci_build.py
62 .ci/aarch64_linux/build_aarch64_wheel.py
2 .ci/docker/almalinux/Dockerfile
205 .ci/docker/build.sh
9 .ci/docker/centos-rocm/Dockerfile
2 .ci/docker/ci_commit_pins/executorch.txt
1 .ci/docker/ci_commit_pins/nccl-cu11.txt Normal file
1 .ci/docker/ci_commit_pins/nccl-cu12.txt Normal file
2 .ci/docker/ci_commit_pins/timm.txt
2 .ci/docker/ci_commit_pins/triton-xpu.txt
2 .ci/docker/ci_commit_pins/triton.txt
2 .ci/docker/common/install_acl.sh
4 .ci/docker/common/install_base.sh
2 .ci/docker/common/install_cache.sh
12 .ci/docker/common/install_clang.sh
4 .ci/docker/common/install_conda.sh
2 .ci/docker/common/install_cpython.sh
140 .ci/docker/common/install_cuda.sh
144 .ci/docker/common/install_cuda_aarch64.sh
4 .ci/docker/common/install_cudnn.sh
20 .ci/docker/common/install_cusparselt.sh
38 .ci/docker/common/install_db.sh
12 .ci/docker/common/install_executorch.sh
4 .ci/docker/common/install_halide.sh
6 .ci/docker/common/install_linter.sh
26 .ci/docker/common/install_nccl.sh Normal file
9 .ci/docker/common/install_ninja.sh
6 .ci/docker/common/install_onnx.sh
2 .ci/docker/common/install_openblas.sh
18 .ci/docker/common/install_python.sh Normal file
4 .ci/docker/common/install_rocm.sh
6 .ci/docker/common/install_rocm_drm.sh
68 .ci/docker/common/install_rocm_magma.sh
24 .ci/docker/common/install_swiftshader.sh
18 .ci/docker/common/install_triton.sh
26 .ci/docker/common/install_ucc.sh
24 .ci/docker/common/install_vulkan_sdk.sh
15 .ci/docker/libtorch/Dockerfile
4 .ci/docker/libtorch/build.sh
26 .ci/docker/linter-cuda/Dockerfile
18 .ci/docker/linter/Dockerfile
6 .ci/docker/manywheel/Dockerfile
153 .ci/docker/manywheel/Dockerfile_2014
6 .ci/docker/manywheel/Dockerfile_2_28
6 .ci/docker/manywheel/Dockerfile_2_28_aarch64
4 .ci/docker/manywheel/Dockerfile_cuda_aarch64
40 .ci/docker/manywheel/Dockerfile_s390x
9 .ci/docker/manywheel/build.sh
2 .ci/docker/manywheel/build_scripts/build_utils.sh
34 .ci/docker/requirements-ci.txt
2 .ci/docker/triton_version.txt
29 .ci/docker/ubuntu-cuda/Dockerfile
63 .ci/docker/ubuntu-rocm/Dockerfile
7 .ci/docker/ubuntu-xpu/Dockerfile
51 .ci/docker/ubuntu/Dockerfile
2 .ci/magma-rocm/.gitignore vendored Normal file
35 .ci/magma-rocm/Makefile Normal file
48 .ci/magma-rocm/README.md Normal file
42 .ci/magma-rocm/build_magma.sh Executable file
38 .ci/magma-rocm/package_files/build.sh Executable file
15 .ci/magma/Makefile
12 .ci/manywheel/build_common.sh
33 .ci/manywheel/build_cuda.sh
12 .ci/manywheel/build_libtorch.sh
17 .ci/pytorch/build.sh
114 .ci/pytorch/check_binary.sh
41 .ci/pytorch/common_utils.sh
50 .ci/pytorch/macos-build.sh
3 .ci/pytorch/macos-test.sh
2 .ci/pytorch/run_tests.sh
43 .ci/pytorch/smoke_test/check_binary_symbols.py
8 .ci/pytorch/smoke_test/max_autotune.py
112 .ci/pytorch/smoke_test/smoke_test.py
149 .ci/pytorch/test.sh
41 .ci/pytorch/test_example_code/cnn_smoke_win_arm64.py Normal file
13 .ci/pytorch/test_example_code/rnn_smoke_win_arm64.py Normal file


3 .ci/pytorch/win-test.sh