pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-25 08:11:06 +08:00

Author	SHA1	Message	Date
Yunqiu Guo	2d757f6517	apply	2025-02-18 11:24:34 -08:00
angelayi	57060bebf3	[symbolic shapes] Add replacement for backed symints (#147240 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147240 Approved by: https://github.com/pianpwk ghstack dependencies: #146939	2025-02-18 18:49:51 +00:00
angelayi	84abeaad5c	[export] Log evaluate_expr (#146939 ) We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which python's `id` function already does, and then for every expression created, we can find the provenance by tracing back through its arguments ids. This logging only happens when dtrace_structured is enabled, which is only when running draft export. An example output is as follows: <img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" /> For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok? I tried doing the following: * Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M * I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M * Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix? #suppress-bc-linter Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939 Approved by: https://github.com/oulgen	2025-02-18 18:49:51 +00:00
zeshengzong	c6b331f7d9	Deprecate `skip_code_recursive_on_cache_limit_hit` config flag (#136970 ) Fixes one of #136862 Make `skip_code_recursive_on_cache_limit_hit` flag deprecated. Affected logic is in here: `6931c1644a/torch/_dynamo/convert_frame.py (L866-L876)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136970 Approved by: https://github.com/williamwen42	2025-02-18 18:48:23 +00:00
Jiang, Yanbing	6f7e67c43c	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-18 18:44:26 +00:00
Huamin Li	dd2a943e14	Fix the AOTI compile failure with ARM CPU for Meta internal (#147204 ) Summary: Fix the AOTI compile failure with ARM CPU for Meta internal Differential Revision: D69642211 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147204 Approved by: https://github.com/houseroad	2025-02-18 17:54:34 +00:00
Andy Lugo	5d675de754	Update ck (#144799 ) Updates the CK version and re-implements kernel generation Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799 Approved by: https://github.com/jianyuh	2025-02-18 17:00:27 +00:00
Aleksei Nikiforov	a00d2b5144	s390x: add cleanup for cancelled docker image builds (#147110 ) When podman image build is cancelled, a couple of processes are left behind, and their existence prevents proper shutdown of runner container. Add cleanup step at the end of workflow using new option recently introduced in podman: https://github.com/containers/podman/pull/25102 Example of job preventing s390x worker cleaning up and restarting properly: https://github.com/pytorch/pytorch/actions/runs/13289159296/job/37105230728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147110 Approved by: https://github.com/huydhn	2025-02-18 16:26:46 +00:00
Yutao Xu	6edc419d69	Update torch-xpu-ops commit pin (#147358 ) Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](`a14d1eaa83`), includes: - SparseCSR XPU support - Refine build system Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358 Approved by: https://github.com/EikanWang	2025-02-18 16:05:25 +00:00
angelayi	0c8028e877	[export] Loosen symint input serialization (#147237 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147237 Approved by: https://github.com/avikchaudhuri	2025-02-18 13:03:47 +00:00
FFFrog	b10ba0a46c	Unify all sympy versions to avoid conflicts within PyTorch (#147197 ) As the title states. There are some tiny differences between 1.13.1 and 1.13.3: 1.13.1: `2e489cf4b1/sympy/core/numbers.py (L1591)` 1.13.3: `b4ce69ad5d/sympy/core/numbers.py (L1591)` Previous PR: https://github.com/pytorch/pytorch/pull/143908 ISSUE Related: https://github.com/pytorch/pytorch/issues/147144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147197 Approved by: https://github.com/malfet	2025-02-18 10:51:43 +00:00
Michal Gallus	d9cf1debf9	[ROCm][Windows] Fix clang-cl error related to -Wmissing-prototypes enabled (#146981 ) Some of the Windows files (fused_kernels.cpp or temp_file.h) contain code that fails to compile with clang-cl when this flag is enabled. This PR resolves the issue by ensuring that, even if we build with clang-cl, it doesn't include those flags on Windows. Alternatively, if needed, I can fix the files mentioned so that they pass under this flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-02-18 07:41:12 +00:00
PyTorch MergeBot	49e8f9c965	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit 22fae4c5f94eb43f71a2eebc1904880740cb1d60. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))	2025-02-18 05:11:32 +00:00
PyTorch UpdateBot	59a08138c5	[executorch hash update] update the pinned executorch hash (#147345 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147345 Approved by: https://github.com/pytorchbot	2025-02-18 05:08:06 +00:00
Yutao Xu	6a2bb629ec	Update torch-xpu-ops commit pin (#147302 ) Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](`b421032c8f`), includes: - Correct int4 weight pack implementation - Enhance build system: only build one shared library for the user Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302 Approved by: https://github.com/EikanWang	2025-02-18 05:04:15 +00:00
ZhiweiYan-96	59915b8dec	[Intel GPU] qlinear at XPU backend (#133307 ) # Motivation The PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` at Intel GPU. We register the qlinear ops at C++ backend via `TORCH_LIBRARY_IMPL`, the op this PR registers includes `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack conduct transpose on weight for fitting oneDNN requirement on weight to acquire higher performance. Also, we remove the limitation of the corresponding annotation method in the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion. We add the kChar(`torch.int8`) dtype in the `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type at GPU side. We verified the op through UTs and e2e model testing like ResNet18, ResNet50. # UT verification ``` DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v \ -k test_qlinear_xpu \ -k test_qlinear_relu_xpu \ -k test_qlinear_gelu_xpu ``` # Runtime exemplification Here is the oneDNN verbose collected through running above UTs ``` //pure int8 gemm onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988 // post-relu fusion onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234 // post-gelu fusion onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user 
attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307 Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168 Co-authored-by: guangyey <guangye.yu@intel.com>	2025-02-18 04:02:42 +00:00
Yutao Xu	bb8c4ecc6d	Allow XPU device for validating the arguments to sparse compressed tensor factory functions (#147306 ) During Sparse tensor conversion, a validity check is performed. We need to allow XPU to pass this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147306 Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey	2025-02-18 03:55:54 +00:00
Animesh Jain	71484a2106	[pt2-benchmarks] Compiler reset on every run (#147313 ) Internal benchmarks call `run` in a loop. Compiler reset gives a clean env Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313 Approved by: https://github.com/jansel	2025-02-18 02:09:19 +00:00
Chen Lai	708428704e	patch for block-wise quantization + pt2e (#146946 ) Summary: https://github.com/pytorch/pytorch/pull/144492 was reverted due to duplicate kernel registration. This PR will re-introduce the patch Differential Revision: D69488779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146946 Approved by: https://github.com/jerryzh168, https://github.com/andrewor14	2025-02-18 01:15:26 +00:00
Tom Ritchford	59b7e52ad8	Fix non-bitwise type annotations for Tensor operators (see #145838 ) (#146845 ) Fix https://github.com/pytorch/pytorch/issues/145838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845 Approved by: https://github.com/Skylion007	2025-02-17 22:42:16 +00:00
amdfaa	1393f9a76c	[ROCm] Update inductor-perf-test-nightly-rocm.yml to use the correct labels & frequency (#147221 ) This workflow takes around 75-80hrs on ROCm, so scaling down the frequency to once per week until we get more CI capacity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147221 Approved by: https://github.com/pruthvistony, https://github.com/huydhn	2025-02-17 19:29:27 +00:00
Stonepia	6c0e7463af	Fix test_device_memory_allocated (#147311 ) Fixes #147310 The tensor allocated by `torch.ones` is released immediately, so the following assertion fails. This PR stores it in a temp variable to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147311 Approved by: https://github.com/guangyey, https://github.com/Skylion007	2025-02-17 19:00:53 +00:00
Stepan Hruda	516133ddb0	Fix arvr macOS buck pytorch builds (#147292 ) Summary: X-link: https://github.com/ctrl-labs/src2/pull/42453 buck arvr macOS builds had a few issues that needed fixing. Test Plan: build with buck Differential Revision: D69722372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147292 Approved by: https://github.com/Skylion007	2025-02-17 18:47:24 +00:00
Jiang, Yanbing	22fae4c5f9	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-17 18:39:10 +00:00
Annop Wongwathanarat	1b29de5c05	Add NEON implementation for 8 bit quantized embedding bag on aarch64 (#147322 ) This improves performance by ~5.5x on NeoverseV1 cores using the following benchmarking script: ``` import torch import torch.nn as nn import numpy as np import torch.autograd.profiler as profiler np.random.seed(0) torch.manual_seed(0) class SimpleEmbeddingBagModel(nn.Module): def __init__(self, num_embeddings, embedding_dim): super(SimpleEmbeddingBagModel, self).__init__() weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)) obs = torch.ao.quantization.PerChannelMinMaxObserver(dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0) obs(weights) qparams = obs.calculate_qparams() qweight = torch.quantize_per_channel(weights, qparams[0], qparams[1], axis=0, dtype=torch.quint8) # Defining the EmbeddingBag layer self.qembedding_bag = torch.ao.nn.quantized.EmbeddingBag(num_embeddings, embedding_dim, _weight=qweight, mode='sum', include_last_offset=True, dtype=torch.quint8) def forward(self, input, offsets): # Forward pass through the EmbeddingBag layer result = self.qembedding_bag(input, offsets, per_sample_weights=None) return result num_embeddings = 40000000 embedding_dim = 128 model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim) model.eval() multi_hot = 100 batch_size = 400 input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long) offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot)) with torch.no_grad(): # warm up _ = model(input_tensor, offsets) with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof: for i in range(100): _ = model(input_tensor, offsets) print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147322 Approved by: 
https://github.com/malfet	2025-02-17 17:10:47 +00:00
PyTorch UpdateBot	71855a1cad	Update slow tests (#147308 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147308 Approved by: https://github.com/pytorchbot	2025-02-17 12:03:40 +00:00
Nikita Shulga	e8b20f6ef3	[MPS][BE] Turn `exec_unary_kernel` as MetalShaderLibrary method (#147299 ) And delete duplicate implementations from SpecialOps and UnaryKernel. Change input and output arguments order for SpecialOps kernels to match those of UnaryOps Fixes https://github.com/pytorch/pytorch/issues/146770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147299 Approved by: https://github.com/dcci ghstack dependencies: #147296, #147297	2025-02-17 08:31:24 +00:00
ZhiweiYan-96	ae5f7fec82	[Intel GPU] Enable fp64 GEMM (#140677 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire	2025-02-17 08:15:55 +00:00
Nikita Shulga	2b30e94fc0	[BE] Make `exec_unary_kernel` take TensorIterator as argument (#147297 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147297 Approved by: https://github.com/dcci ghstack dependencies: #147296	2025-02-17 07:34:35 +00:00
Nikita Shulga	3d251e6512	[BE] Switch all structured funcs to stubs (#147296 ) No need to have a separate foobar_out_mps when registering a dispatch to foobar_stub will do. This also makes `exec_unary_kernel` defined in UnaryKernel.mm and SpecialOps.mm look very similar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147296 Approved by: https://github.com/dcci	2025-02-17 07:34:34 +00:00
leslie-fang-intel	424c1b82e0	[Inductor][CPP] Add the legalize low fp support for index expr (#147298 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/147279. The test case produced a low-precision floating-point value using `ops.index_expr`, but the CPP backend did not handle its legalization. This PR adds support for it. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_low_fp_index_expr_issue_147279 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147298 Approved by: https://github.com/jgong5	2025-02-17 07:11:20 +00:00
PyTorch UpdateBot	359165734b	[executorch hash update] update the pinned executorch hash (#147294 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147294 Approved by: https://github.com/pytorchbot	2025-02-17 05:03:05 +00:00
Yan Zhiwei	ae351d4d0e	[Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570 ) # Motivation Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type can improve the performance of deep learning workloads during training/inference. This PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger the tf32 acceleration in convolution kernels. # Validation * UT to test the context variable `python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set` * Runtime exemplification ``` onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709 onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 
wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971 ``` According to the field `fpmath:tf32` in verbose, we could see that, current context setting utils could successfully trigger tf32 computation in conv forward/backward_data/backward_weights kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2025-02-17 01:46:43 +00:00
Nikita Shulga	198ffbdf11	[MPS] Implement and test round.decimals (#147266 ) If inductor can do it, why not eager Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266 Approved by: https://github.com/Skylion007 ghstack dependencies: #147286	2025-02-16 23:17:13 +00:00
Aaron Gokaslan	e738f7ba23	[BE]: Enable ruff rule SIM113 (#147290 ) A lint rule that tells the user to avoid keeping track of their own counter and to use the builtin `enumerate` when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290 Approved by: https://github.com/jansel	2025-02-16 22:41:16 +00:00
Zhou Fang	a8fa4bcfd2	[StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347) %getattr_256.1 : int = prim::dtype(%11175) %to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12) %lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8) ``` After optimization: ``` %11199 : int = prim::dtype(%int_66.1) %11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199) ``` It is similar to https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4. Differential Revision: D69627793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189 Approved by: https://github.com/hanyilou123	2025-02-16 22:16:02 +00:00
Nikita Shulga	5c0c99f658	[MPS][BE] Use stubs for floor/ceil/round/trunc (#147286 ) To avoid duplicating logic that those ops are no-ops for integral dtypes (And in preparation of adding `round_decimals` that calls round_stub if decimals are 0) Tested for the corner cases by manually invoking `round`, `trunc`, `floor` and `ceil` for int dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/147286 Approved by: https://github.com/Skylion007	2025-02-16 17:22:49 +00:00
Dmitry Rogozhkin	d27ecf85db	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for building sycl kernels via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and the (new) `class SyclExtension` APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension. Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing. cmake is also considering eventually introducing some file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly. By default, SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled, at the moment `pvc,xe-lpg`. This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-16 16:50:59 +00:00
PyTorch MergeBot	dd5d0ea6bb	Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945 )" This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab. Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see `b16ae97ad0/1` ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))	2025-02-16 16:03:42 +00:00
lzhang2	b16ae97ad0	Generalize mixed precision in DDP (#146808 ) Motivation: 1. Generalize mixed precision in DDP. 2. Enable `SyncBatchNorm` for XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146808 Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/wconstab	2025-02-16 11:59:40 +00:00
Xuehai Pan	ee38a32c55	[Dynamo] support `isinstance(...)` check for type tuple (#146984 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146984 Approved by: https://github.com/jansel	2025-02-16 10:41:49 +00:00
Dmitry Rogozhkin	607379960b	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for building sycl kernels via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and the (new) `class SyclExtension` APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension. Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing. cmake is also considering eventually introducing some file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly. By default, SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled, at the moment `pvc,xe-lpg`. This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey	2025-02-16 10:16:09 +00:00
Nikita Shulga	ed3b119c40	Skip unsupported types by MPS in `test_torchinductor.py` (#147211 ) - Skip unsupported dtypes in `test_split_cumsum` (and manually skip int64 for MacOS13) - Adapt `test_cat` to use `torch.half` instead of `torch.double` on MPS - Skip `test_adaptive_avg_pool1d_argmax` if avgpool is not implemented for all sizes Pull Request resolved: https://github.com/pytorch/pytorch/pull/147211 Approved by: https://github.com/jansel, https://github.com/Skylion007, https://github.com/dcci	2025-02-16 10:15:53 +00:00
Saurabh Mishra	0fb5b224b7	[DCP] Cache save plans: planner helpers and interface updates (#147116 ) Summary: This PR updates the planner interface and introduces the class variables to cache the local and global plans. Two new helpers are also introduced which will be used to compare if the plans have changed across save attempts and merge the delta plans. Test Plan: UTs Differential Revision: D69224488 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147116 Approved by: https://github.com/MeetVadakkanchery, https://github.com/huydhn	2025-02-16 07:18:26 +00:00
PyTorch UpdateBot	4bacd13c92	[executorch hash update] update the pinned executorch hash (#147273 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147273 Approved by: https://github.com/pytorchbot	2025-02-16 05:11:33 +00:00
cfgfung	8f20026bcb	[Intel GPU] Support SparseCsrXPU codegen (#144722 ) Adding a new dispatch key - `SparseCsrXPU` to enable Intel GPU support for SparseCsr Tensor. Similar PR: https://github.com/pytorch/pytorch/pull/139267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144722 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD Co-authored-by: Kanya-Mo <kanya.mo@intel.com>	2025-02-16 03:16:12 +00:00
Blaine Burton Rister	1677a31019	[Inductor] Fix 3D tiling with permute (#147249 ) This PR adds a test case and tiny fix for 3D tiling. Before this PR, tiling would crash because one of the candidates lacked a `"y"` dimension. Now, when we're calculating 3D tiling candidates, we assume the y size is 1 if it's missing. The test case implements a 3D permute using block pointers. ``` @triton.jit def triton_poi_fused_add_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr): znumel = 51 ynumel = 51 xnumel = 51 zoffset = tl.program_id(2) * ZBLOCK zindex = zoffset + tl.arange(0, ZBLOCK)[None, None, :] zmask = zindex < znumel yoffset = tl.program_id(1) * YBLOCK yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None] ymask = yindex < ynumel xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = xindex < xnumel x2 = xindex y1 = yindex z0 = zindex tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2]) tmp1 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[51, 1, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2]) tmp2 = tmp0 + tmp1 tmp3 = tmp0 + tmp0 tmp4 = tmp2 + tmp3 tl.store(tl.make_block_ptr(out_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), tl.broadcast_to(tmp4, [XBLOCK, YBLOCK, ZBLOCK]).to(tl.float32), boundary_check=[0, 1, 2]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147249 Approved by: https://github.com/jansel	2025-02-15 23:28:36 +00:00
Tom Ritchford	44ee9ca593	[inductor] Add type annotations to _inductor/utils.py (#144108 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108 Approved by: https://github.com/eellison	2025-02-15 23:13:41 +00:00
Avik Chaudhuri	4ab967c44d	all reduce non strict (#147133 ) Summary: Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow. Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed. Test Plan: moved and updated test Differential Revision: D69607140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133 Approved by: https://github.com/tugsbayasgalan	2025-02-15 19:37:08 +00:00
Eli Uriegas	75a4b73816	utils: Update md5 call to be fips compliant (#147252 ) Updates the md5 call to be fips compliant according to this issue: * https://github.com/pytorch/pytorch/issues/147236 Not going to add a conditional here because the minimum Python version that we support is already 3.9 Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252 Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet	2025-02-15 15:19:08 +00:00
PyTorch MergeBot	6ca5c22e31	Revert "Enable fp16 linear layers in PyTorch via ACL (#144992 )" This reverts commit 5b37249259ad50d9b4b32a78a5b5178a1eb3d110. Reverted https://github.com/pytorch/pytorch/pull/144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](https://github.com/pytorch/pytorch/pull/144992#issuecomment-2660902238))	2025-02-15 12:40:59 +00:00
Jing Xu	86be5d4421	remove unnecessary xpu availability check when retrieving aot flags (#146966 ) As title Retrieving xpu aot flags that the pytorch binary was compiled against is not the same as running the binary itself. Thus it doesn't seem to necessarily check if there is an xpu environment available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146966 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/albanD	2025-02-15 09:15:49 +00:00
leslie-fang-intel	9e0b3e9b6c	[Inductor] Fix Inplace Buffer inner name conflict (#147199 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/146975. When creating the `InplacedBuffer` inner name, we only count the number of unique `InplacedBuffer` or `RemovedArg` entries. The names may conflict, as in the example reported in this issue: ``` ---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2 dict_values([ InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']), InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']), InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']), InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])]) ---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2 dict_values([ <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']), InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']) <torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>, InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']), InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']) ]) ``` - The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer` entries - The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer` These are 2 different `InplacedBuffer` objects with the same name, `in_out_ptr2`. In this PR, we fix this regression by counting the number of `RemovedArg`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199 Approved by: https://github.com/jansel	2025-02-15 08:31:06 +00:00
Jason Ansel	a30f145101	[inductor] Don't leak pointers to cpp_wrapper with lru_cache (#147233 ) Putting lru_cache on methods will keep pointers to the `self` objects alive forever and leak memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147233 Approved by: https://github.com/yanboliang	2025-02-15 08:25:41 +00:00
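The leak described in #147233 is a general property of `functools.lru_cache`, not something specific to Inductor: the cache lives on the function object and stores `self` as part of each key, so every instance that ever called the method stays reachable. A minimal sketch follows; the class names are hypothetical and the per-instance dictionary is just one possible alternative, not necessarily what the commit does.

```python
import functools
import gc
import weakref


class LeakyWrapper:
    """Hypothetical class: lru_cache on a method keeps `self` in the cache key."""

    @functools.lru_cache(maxsize=None)
    def compute(self, x):
        return x * 2


class SafeWrapper:
    """Same logic, but the cache is owned by the instance and dies with it."""

    def __init__(self):
        self._cache = {}

    def compute(self, x):
        if x not in self._cache:
            self._cache[x] = x * 2
        return self._cache[x]


def is_collected_after_use(cls):
    """Create an instance, call the cached method, drop it, and report
    whether the garbage collector was able to reclaim the instance."""
    obj = cls()
    obj.compute(21)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


if __name__ == "__main__":
    print("leaky instance collected:", is_collected_after_use(LeakyWrapper))
    print("safe instance collected:", is_collected_after_use(SafeWrapper))
```

With an unbounded cache, the leaky instance is never reclaimed while the class exists; with a bounded `maxsize` it would be released only once its entries are evicted.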
Animesh Jain	9dc702875d	[dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217 ) Fixes https://github.com/pytorch/pytorch/issues/147162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217 Approved by: https://github.com/williamwen42	2025-02-15 07:59:33 +00:00
cyy	8daa742e8b	Remove code for Python < 3.9 (#147181 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181 Approved by: https://github.com/albanD	2025-02-15 06:43:26 +00:00
PyTorch UpdateBot	9919375cf1	[executorch hash update] update the pinned executorch hash (#147241 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147241 Approved by: https://github.com/pytorchbot	2025-02-15 05:02:22 +00:00
cyy	8f291e8c00	Fix clang-tidy warnings in torch/jit (#146963 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963 Approved by: https://github.com/davidberard98	2025-02-15 03:36:59 +00:00
briancoutinho	4233a77960	update kineto submodule to include fix for windows build (#147195 ) Fixes an issue causing windows builds to fail https://github.com/pytorch/kineto/pull/1039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195 Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16	2025-02-15 01:53:16 +00:00
leslie-fang-intel	c1fcba3648	[Inductor] Fix the lowering of squeeze when input is not contiguous (#146746 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/143498. The issue happens when we lower `select = torch.ops.aten.select.int(cat, 1, 0)`. For example, when `cat` is contiguous with size [2, 2] and stride [2, 1]: - in eager mode, it returns a view of size [2] and stride [2] - in Inductor lowering, it returns the wrong stride, 1 instead of 2 ``` TensorBox( ReinterpretView( StorageBox( ConcatKernel(name='buf10', layout=FixedLayout('cpu', torch.int64, size=[u0, 2], stride=[2, 1]), inputs=[ComputedBuffer(name='buf8', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b856449d0>, ranges=[u0, 1])), ComputedBuffer(name='buf9', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b85644790>, ranges=[u0, 1]))]) ), FixedLayout('cpu', torch.int64, size=[u0], stride=[1]), origins=OrderedSet([select]) ) ) ``` To fix this issue, we give the right stride when lowering `squeeze`. Test Plan ``` python -u -m pytest -s -v test/inductor/test_unbacked_symints.py -k test_issue_143498 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146746 Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/eellison	2025-02-15 01:33:04 +00:00
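The stride arithmetic in #146746 can be checked without PyTorch. The sketch below is plain Python with hypothetical helper names, not Inductor's lowering code: it computes the view that `select(dim, index)` produces from a size/stride pair and compares it with the strides a freshly allocated contiguous tensor of the same size would have.

```python
def contiguous_strides(sizes):
    """Row-major strides for a freshly allocated tensor of the given sizes."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= size
    return list(reversed(strides))


def select_view(sizes, strides, dim, index):
    """Sizes, strides and storage offset of the view returned by select(dim, index).

    The selected dimension is dropped; the remaining dimensions keep the
    strides they had in the parent tensor.
    """
    new_sizes = sizes[:dim] + sizes[dim + 1:]
    new_strides = strides[:dim] + strides[dim + 1:]
    offset = index * strides[dim]
    return new_sizes, new_strides, offset


if __name__ == "__main__":
    sizes, strides = [2, 2], contiguous_strides([2, 2])  # strides == [2, 1]
    print(select_view(sizes, strides, dim=1, index=0))   # view keeps stride 2
    print(contiguous_strides([2]))                       # a fresh size-[2] tensor has stride 1
```

The view inherits stride 2 from its parent, which matches the eager result quoted in the commit message; stride 1 is what a layout would get if the result were treated as a fresh contiguous buffer.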
Yidi Wu	bf0c89a72f	[dynamo] fix error message when logging graph that contains hops (#147227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147227 Approved by: https://github.com/zou3519	2025-02-15 00:53:44 +00:00
Shawn Xu	933f921b36	[PT][FSDP] support custom all reduce hook across FSDP units (#147114 ) This change adds an API `set_all_reduce_hook` to the `FSDPModule` to support customized all reduce either in native HSDP (2d mesh) setup or custom HSDP (1d FSDP + custom AR across replicas) * For native HSDP, the original AR would still run as is and this hook allows for additional gradient modification post all reduce. * For custom HSDP, the original AR will be skipped and all the logic is instead expected to be executed in the hook. The custom hook is expected to perform operations in place (no return value). Example basic usage: ``` model = ... fully_shard(model, mesh=...) model.set_all_reduce_hook(my_hook) ``` By default, the hook will run in the default all reduce stream post reduce scatter. When native HSDP is NOT enabled, the custom hook can be specified to run in a custom stream. This custom stream will also be synchronized post reduce scatter similarly. See tests for examples. Test Plan: CI Differential Revision: D68255583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147114 Approved by: https://github.com/awgu	2025-02-15 00:38:00 +00:00
Isuru Fernando	a9ae3340ca	Fix triton masked loading for non-block tl.loads (#144782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782 Approved by: https://github.com/eellison	2025-02-15 00:07:33 +00:00
eellison	49727bbc9d	Turn on prologue fusion (#147008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147008 Approved by: https://github.com/masnesral	2025-02-14 23:36:21 +00:00
Animesh Jain	76f57e184a	[dynamo] Make SliceVariable a subclass of VariableTracker (#147046 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147046 Approved by: https://github.com/StrongerXi ghstack dependencies: #146819, #146995	2025-02-14 23:22:27 +00:00
Mu-Chu Lee	a5c0dab900	[AOTInductor] Guard RAII_cpuMalloc with macro (#147150 ) Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function] Test Plan: Existing tests Differential Revision: D69623481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150 Approved by: https://github.com/henrylhtsang	2025-02-14 23:21:35 +00:00
Yidi Wu	1224765286	[cond] make cond call fake kernel in dynamo (#147045 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147045 Approved by: https://github.com/zou3519 ghstack dependencies: #146954	2025-02-14 23:13:15 +00:00
Yidi Wu	85a82c5bc8	[cond] make cond re-dispatch in proxy mode (#146954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954 Approved by: https://github.com/zou3519	2025-02-14 23:13:14 +00:00
atalman	eecee5863e	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 21:23:19 +00:00
Bin Bao	d38db94689	[inductor][refactor] Move _compile_file to cpp_builder (#147202 ) Summary: To further consolidate cpp build logic into cpp_builder Test Plan: CI Differential Revision: D69595327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202 Approved by: https://github.com/yushangdi	2025-02-14 21:02:30 +00:00
henrylhtsang	dd86491b35	[cutlass backend][BE] refactor tests to remove duplicate logic (#146743 ) Doing many things here: * remove duplicate hip checking logic * check for CUDA in setup * remove CUTLASS_DIR setting. That is not needed when building from source and fbcode anymore * fix some typing errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/146743 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78	2025-02-14 20:50:27 +00:00
Dan Zimmerman	6f035d8462	[torch] Make amdsmi cdll hook private (#147207 ) Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear. Test Plan: CI Differential Revision: D69665234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207 Approved by: https://github.com/PaulZhang12	2025-02-14 20:30:48 +00:00
Tom Ritchford	272ead7b5e	Make fx.node.map_arg() and .map_aggregate() generic (#146248 ) ## What's the problem? The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuples`, `list`s, etc, and return a new collection of the same type. Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](`5d55a6585d/torch/fx/node.py (L48-L58)`): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes. As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements. ## What's the solution? Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.) ## Won't it break everything? It doesn't break the type checker - one place needed an extra hint. There have been code breakages, resolved one, at least one new one... we'll see! Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2025-02-14 19:25:32 +00:00
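The typing idea in #146248 can be illustrated with a toy stand-in. The function below is a hypothetical `map_nested`, not the real `torch.fx.node.map_aggregate`; it only shows how a `TypeVar` on the first argument and the return type lets a type checker carry the caller's type through the call instead of widening it to a catch-all union.

```python
from typing import Any, Callable, TypeVar, cast

T = TypeVar("T")


def map_nested(value: T, fn: Callable[[Any], Any]) -> T:
    """Apply fn to every leaf of a nested dict/list/tuple.

    The container structure is rebuilt with the same types, so the
    declared return type is the same T the caller passed in.
    """
    if isinstance(value, tuple):
        return cast(T, tuple(map_nested(item, fn) for item in value))
    if isinstance(value, list):
        return cast(T, [map_nested(item, fn) for item in value])
    if isinstance(value, dict):
        return cast(T, {key: map_nested(item, fn) for key, item in value.items()})
    return cast(T, fn(value))


# The annotation on `args` is what a type checker would propagate to `doubled`.
args: tuple[int, list[int]] = (1, [2, 3])
doubled = map_nested(args, lambda leaf: leaf * 2)

if __name__ == "__main__":
    print(doubled)
```

As in the commit description, the visitor function's signature stays untyped (`Any` to `Any`); only the outer argument and the return value are tied together.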
Justin Chu	58f654b5ad	[ONNX] Consolidate constants to a single location (#147166 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147166 Approved by: https://github.com/titaiwangms ghstack dependencies: #147164, #147165	2025-02-14 19:08:19 +00:00
Justin Chu	765bc30ab9	[ONNX] Set warning stacklevel so it appears at the torch.onnx call site (#147165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147165 Approved by: https://github.com/Skylion007 ghstack dependencies: #147164	2025-02-14 19:04:43 +00:00
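The effect that the title of #147165 refers to can be seen with the standard `warnings` module alone. In this sketch the function names are hypothetical and nothing here is taken from `torch.onnx`; it only shows that `stacklevel=2` attributes a warning to the line that called the API rather than to the line inside the API that raised it.

```python
import warnings


def export_default():
    warnings.warn("legacy path", DeprecationWarning)  # reported at this line


def export_call_site():
    warnings.warn("legacy path", DeprecationWarning, stacklevel=2)  # reported at the caller


def reported_line(func):
    """Call func and return the line number attached to the warning it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return caught[0].lineno


if __name__ == "__main__":
    print("default stacklevel ->", reported_line(export_default))
    print("stacklevel=2       ->", reported_line(export_call_site))
```

With the default stack level, the reported line is inside `export_default`; with `stacklevel=2`, it is the `func()` line in `reported_line`, which is the call site a user would want to see.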
Justin Chu	9a1eac6704	[ONNX] Handle number of outputs in builder (#147164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147164 Approved by: https://github.com/titaiwangms	2025-02-14 19:03:51 +00:00
PyTorch MergeBot	5517eb4398	Revert "[cutlass backend] Do not change dtype of GEMM template (#146877 )" This reverts commit 260b21b8bca6edd3e0b89b800d6efa8243f0d122. Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to let me resubmit ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2660053270))	2025-02-14 18:58:18 +00:00
PyTorch MergeBot	aac5d1a289	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))	2025-02-14 18:31:54 +00:00
Henry Tsang	20a9938069	try print stacktrace for error (#147061 ) Differential Revision: D69573525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147061 Approved by: https://github.com/Skylion007	2025-02-14 18:28:03 +00:00
Nikita Shulga	8b5ee275fb	[MPS] Fix cholesky_ex for empty inputs (#147159 ) By making sure that `info` is actually initialized if the input is empty (there is no need to do anything about `out`, as it's guaranteed to be an empty tensor). Also move the output resizing logic before the `input.numel()` check. Fixes https://github.com/pytorch/pytorch/issues/147128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147159 Approved by: https://github.com/albanD	2025-02-14 17:44:08 +00:00
Catherine Lee	0d16188c06	[CI] Use job name to index into test times json (#147154 ) When the test times are generated, it doesn't know what the build environment is because it's an environment variable. But when we index into the test times, we (previously) didn't know what the job name is. These are usually the same, but sometimes they're different, and when they're different it ends up using default, which can result in unbalanced sharding. I think the job name was added at some point to most of the CI environments but I didn't realize, so we can now update this code to use the job name instead, so that the generation and the indexing match. Also upload stats workflow for mps. Checked that inductor_amx doesn't use default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147154 Approved by: https://github.com/huydhn	2025-02-14 17:06:56 +00:00
Mikayla Gawarecki	e8fbc86de0	Make torch.cuda.gds APIs public (#147120 ) Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120 Approved by: https://github.com/albanD	2025-02-14 17:06:50 +00:00
Jack Taylor	c3853d924f	Introduce new template heuristic for triton autotune configs (#144985 ) Initial PR to refactor bulkiness of mm_common to allow for better device-specific specialisation e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm specific optimisations in. This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs: - CPUConfigHeuristic() - CUDAConfigHeuristic() - ROCmConfigHeuristic() - XPUConfigHeuristic() These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs. The mm_common, mm_plus_mm and conv configurations are implemented in this class, in the future we plan to bring in flex attention configurations also so all of the tuning config logic for templated triton kernels are handled in this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985 Approved by: https://github.com/jansel	2025-02-14 17:01:06 +00:00
PyTorch MergeBot	e06ee4aa9f	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))	2025-02-14 16:44:46 +00:00
PyTorch MergeBot	059dfe2081	Revert "update kineto submodule (#147015 )" This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927. Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))	2025-02-14 16:11:08 +00:00
atalman	06f4a5c0e5	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 15:29:59 +00:00
Guilherme Leobas	cefd9805de	Add `RAISE_VARARGS 0` (#146493 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146493 Approved by: https://github.com/zou3519 ghstack dependencies: #146498, #146492	2025-02-14 13:37:23 +00:00
Guilherme Leobas	134723ee1c	Add `WITH_EXCEPT_START` opcode (#146492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146492 Approved by: https://github.com/anijain2305, https://github.com/zou3519 ghstack dependencies: #146498	2025-02-14 13:37:23 +00:00
Guilherme Leobas	dbb86b78ad	Add `sys.exc_info` and `sys.exception` (#146498 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146498 Approved by: https://github.com/anijain2305, https://github.com/zou3519	2025-02-14 13:37:14 +00:00
angelayi	ea188ac0c7	[export] Add meta for aten.bincount (#147129 ) Fixes https://github.com/pytorch/pytorch/issues/147094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147129 Approved by: https://github.com/pianpwk	2025-02-14 10:33:54 +00:00
Yutao Xu	de26ddfbdc	Update torch-xpu-ops commit pin (#146671 ) Update the torch-xpu-ops commit to [80c375570e2b6b2989a8610da1871f8a50dfddc7](`80c375570e`), includes: - Aten operator coverage improvement - SYCL kernel optimization - Nested Tensor OPs support Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671 Approved by: https://github.com/EikanWang	2025-02-14 09:30:36 +00:00
leslie-fang-intel	bd019c0bb4	[Inductor][CPP] Fix node name for wgt delete (#147056 ) Summary This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056 Approved by: https://github.com/jgong5	2025-02-14 06:27:41 +00:00
Nikita Shulga	10bc8f25b2	[MPS][BE] Migrate polar to use functor (#147184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147184 Approved by: https://github.com/dcci ghstack dependencies: #147182, #147183	2025-02-14 06:25:36 +00:00
Nikita Shulga	278ffd84fc	[MPS][BE] Add copysign integral flavors as functor (#147183 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147183 Approved by: https://github.com/dcci ghstack dependencies: #147182	2025-02-14 06:25:36 +00:00
Nikita Shulga	2ef51cfb9d	[BE][MPS] Infer results of functor (#147182 ) Do not assume that a functor will return the same type as its arguments, but rather dynamically infer it using `decltype` and `std::declval`. This is a no-op that prepares for the migration of `copysign` of integral arguments, which would return a float. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147182 Approved by: https://github.com/dcci	2025-02-14 06:25:27 +00:00
Wu, Chunyuan	331d5cf560	[inductor] [cpp] Support vectorization for score and mask in FlexAttention CPU (#143638 ) ## Description With this PR, we generate vectorized kernels for score and mask in FlexAttention. ## Modification The main changes include: - For the input and output buffers of the mask and score functions, we pass tensors instead of scalars. - For the mask function, the original function, which works on a scalar, only includes the logic of calculating the mask value. This PR adds the logic of applying the mask to the qk_data tensor into the graph and then leverages the CPP backend to generate vectorized kernels. The original mask graph: ```python def mask_fn(b, h, q_idx, kv_idx): mask = q_idx >= kv_idx return mask ``` The converted_mask_graph should be: ```python def converted_mask_fn(qk_data, b, h, q_idx, kv_idx): mask = q_idx >= kv_idx qk_data = torch.where(mask, qk_data, torch.full_like(qk_data, -float("inf"))) return qk_data ``` ## Benchmark For q, k, v of shape: `[1, 32, 1024, 128]`, using 40 CPU cores, we observe over 20x speedup compared with the non-vectorized version for both `is_causal` = `False` and `True`. ## Test plan The existing FlexAttention UTs (`test/inductor/test_flex_attention.py`, `test/inductor/test_flex_decoding.py`) can cover the change in this PR. 
## Output code Code before this PR is in scalar version: ```cpp // apply score mod function for (int64_t row = 0; row < cur_qSplitSize; ++row) { for (int64_t col = 0; col < cur_kvSplitSize; col++) { std::vector<int64_t> b_idx = {i}; std::vector<int64_t> h_idx = {j}; std::vector<int64_t> q_idx = {m+row}; int64_t phisical_kv_idx = n+col; if (use_kv_indice) { phisical_kv_idx= *kv_logical_data * kvBlockSize + col; } std::vector<int64_t> kv_idx = {phisical_kv_idx}; accum_t* in_ptr0 = qk_data + row * cur_kvSplitSize + col; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); accum_t* out_ptr0 = in_ptr0; { { { auto tmp0 = in_ptr0[static_cast<int64_t>(0L)]; out_ptr0[static_cast<int64_t>(0L)] = tmp0; } } } } } // Apply block mask, fill unused with -inf for (int64_t row = 0; row < cur_qSplitSize; ++row) { for (int64_t col = 0; col < cur_kvSplitSize; col++) { std::vector<int64_t> b_idx = {i}; std::vector<int64_t> h_idx = {j}; std::vector<int64_t> q_idx = {m+row}; int64_t phisical_kv_idx = n+col; if (use_kv_indice) { phisical_kv_idx= *kv_logical_data * kvBlockSize + col; } std::vector<int64_t> kv_idx = {phisical_kv_idx}; accum_t* qk_block = qk_data + row * cur_kvSplitSize + col; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); std::vector<int64_t> temp = {0}; int64_t* out_ptr1 = temp.data(); { { { auto tmp0 = static_cast<bool>(true); out_ptr1[static_cast<int64_t>(0L)] = tmp0; } } } *qk_block = *out_ptr1 != 0 ? 
*qk_block : -std::numeric_limits<accum_t>::infinity(); } } ``` Code after this PR will be vectorized: ```cpp accum_t* in_ptr0 = qk_data; auto in_ptr1 = b_idx.data(); auto in_ptr2 = h_idx.data(); auto in_ptr3 = q_idx.data(); auto in_ptr4 = kv_idx.data(); // apply score mod function { accum_t* out_ptr0 = in_ptr0; { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16)); tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0)); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))); tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))); } } } } } } // Apply block mask, fill unused with -inf { accum_t* out_ptr1 = in_ptr0; { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < 
static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16)); auto tmp1 = static_cast<bool>(true); auto tmp2 = -std::numeric_limits<float>::infinity(); auto tmp3 = at::vec::VecMask<float,1>::from(tmp1); auto tmp4 = at::vec::Vectorized<float>(tmp2); auto tmp5 = decltype(tmp0)::blendv(tmp4, tmp0, tmp3.template cast<float,1>()); tmp5.store(out_ptr1 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0)); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize))) { for (int64_t x1_tail = static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))));x1_tail < static_cast<int64_t>(cur_kvSplitSize); x1_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)]; auto tmp1 = static_cast<bool>(true); auto tmp2 = -std::numeric_limits<float>::infinity(); auto tmp3 = tmp1 ? tmp0 : tmp2; out_ptr1[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)] = tmp3; } } } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143638 Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel	2025-02-14 05:26:18 +00:00
PyTorch UpdateBot	ce38bfd299	[executorch hash update] update the pinned executorch hash (#147157 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147157 Approved by: https://github.com/pytorchbot	2025-02-14 05:04:17 +00:00
Nikita Shulga	92f669e39c	[BE] Use `c10::multiply_integers` in cholesky_impl (#147163 ) That replaces explicit for loop Pull Request resolved: https://github.com/pytorch/pytorch/pull/147163 Approved by: https://github.com/huydhn	2025-02-14 03:59:17 +00:00
Animesh Jain	2d089a5697	[dynamo] Remove unintended lru_cache (#147147 ) I forgot to remove it while adding the frozenset __contains__ method in this PR - https://github.com/pytorch/pytorch/pull/146062 This is causing a memory leak. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147147 Approved by: https://github.com/williamwen42	2025-02-14 03:55:39 +00:00
Aaron Gokaslan	6344ca1dd4	[BE][Ez]: Apply FURB188: use str remove(pre\|suf)fix (#146997 ) Since we are on 3.9, we can use this nice str builtin which is more readable and more efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997 Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel	2025-02-14 03:38:07 +00:00
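For context on #146997: `str.removeprefix` and `str.removesuffix` were added in Python 3.9 and replace the usual `startswith`/slice idiom. The helper names below are hypothetical and only contrast the two spellings.

```python
def strip_affixes_manual(name: str, prefix: str, suffix: str) -> str:
    """Pre-3.9 idiom: test, then slice by length."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    # Guard against an empty suffix: name[:-0] would return "".
    if suffix and name.endswith(suffix):
        name = name[:-len(suffix)]
    return name


def strip_affixes_builtin(name: str, prefix: str, suffix: str) -> str:
    """Python 3.9+ spelling; each call is a no-op when the affix is absent."""
    return name.removeprefix(prefix).removesuffix(suffix)


if __name__ == "__main__":
    print(strip_affixes_manual("test_foo.py", "test_", ".py"))
    print(strip_affixes_builtin("test_foo.py", "test_", ".py"))
```

Unlike `str.strip`-style methods, these remove one exact occurrence of the given string rather than any run of its characters, which is why they are a safe replacement for the slicing idiom.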
cyy	d473c212fd	Remove code for Python < 3.9 (#147097 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/147097 Approved by: https://github.com/albanD	2025-02-14 03:22:49 +00:00
kareem mohiddeen shaik	880e176544	[inductor] Fix for pattern file contains 'getitem' fails during impor… (#144980 ) …t of the pattern module For example, any pattern module that has the following pattern generated fails to import because the name getitem is undefined. native_dropout_default = CallFunction(aten.native_dropout.default, div_Tensor_1, KeywordArg('dropout_p'), True, _users=2) getitem = CallFunction(getitem, native_dropout_default, 0) This fix resolves the error. Fixes #144674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144980 Approved by: https://github.com/eellison	2025-02-14 02:30:24 +00:00
Zhengxu Chen	0b84311842	[export] Generate printers/parsers for serialization enum values. (#147126 ) Summary: Generate two helper functions for enum classes in generated_serialization_types.h printEnum: will convert enum values into strings. parseEnum: will convert strings into enum values. Test Plan: CI Differential Revision: D69604850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126 Approved by: https://github.com/yiming0416	2025-02-14 02:14:35 +00:00
Basil Wong	05001f0459	Add Structured Tracing for Traced Graph Edge Details for AC Debugging (#146634 ) Summary: Updating the structured trace infrastructure so that we are able to output to Zoomer and have an E2E solution. Context Doc: https://docs.google.com/document/d/1T6omIBEWVhbOiwDLSLffgQwjxiT2rQv8QvvQwXkw4fY/edit?usp=sharing Test Plan: ### Testing Structured Log + tlparse locally Command: ``` TORCH_TRACE=/data/users/basilwong/fbsource/fbcode/log_torch_trace buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 ``` Torch Trace Logs (local then sent to paste): P1686419449 ``` cat log_torch_trace/dedicated_log_torch_trace_rank_0_2lg012xo.log \| pastry P1686419449 ``` tlparse output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 tlparse graph edge details output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/9_0_0/joint_graph_information_397.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 Differential Revision: D61557220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146634 Approved by: https://github.com/jansel, https://github.com/Yuzhen11	2025-02-14 02:04:26 +00:00
Colin L Reliability Rice	486fc12d7e	torch: Log a unified waitcounter for torch.compile and triton.autotune (#146723 ) Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up. Test Plan: Waitcounter only change, relying on existing tests. Differential Revision: D69215401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723 Approved by: https://github.com/davidberard98	2025-02-14 02:04:13 +00:00
Jiang, Yanbing	f0bdc27f74	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2025-02-14 02:03:53 +00:00
leslie-fang-intel	c5a9e4a6a0	[Inductor][CPP] Fix a CPP GEMM Template output data type issue (#146958 ) Summary Issue found when fixing https://github.com/pytorch/ao/issues/1662. A FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used the slice of global buffer `Y` as the output of micro GEMM instead of a `local buffer`. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix will ensure the use of a local buffer in such cases. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958 Approved by: https://github.com/jgong5	2025-02-14 01:40:08 +00:00
xinan.lin	d3524ecdd6	[Break XPU] Align meta calculation for fft_r2c with _fft_r2c_mkl (#146763 ) Fix #146761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146763 Approved by: https://github.com/jansel ghstack dependencies: #146762, #145248, #146880	2025-02-14 01:39:18 +00:00
xinan.lin	ade5af9430	[XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880 ) Fix #146879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880 Approved by: https://github.com/EikanWang, https://github.com/jansel ghstack dependencies: #146762, #145248	2025-02-14 01:39:18 +00:00
xinan.lin	9befdf565a	[Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_indutor.py (#145248 ) Fix #145247 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #146762	2025-02-14 01:39:11 +00:00
xinan.lin	972e927134	[Break XPU][Inductor UT] Fix XPU Inductor UT failures introduced from community. (#146762 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146762 Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel	2025-02-14 01:38:50 +00:00
Dan Zimmerman	6419076db9	[torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117 ) Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of rocm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi. Test Plan: CI ``` NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024') RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce ``` Differential Revision: D69597647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117 Approved by: https://github.com/malfet	2025-02-14 01:11:59 +00:00
Zhang, Jianyi	20a369aa3a	[Intel GPU] Avoid copy when the input of Matmul is broadcasted (#143784 ) Avoid copy when the input of Matmul is 3D and broadcasted on batch dim. oneDNN support implicit broadcast semantics i.e., src can be broadcasted into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 can improve from 42ms to 13.7 ms on torch.compile and 57.5ms to 32ms on eager mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143784 Approved by: https://github.com/EikanWang	2025-02-14 00:48:07 +00:00
Simon Fan	057bcd3a45	[ca] eliminate duplicate getitem graph nodes for shape inputs (#146875 ) should reuse existing proxies instead of creating new ones before: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpL7hmHe/0_-_-_0/compiled_autograd_graph_3.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class CompiledAutograd0(torch.nn.Module): def forward(self, inputs, sizes, scalars, hooks): # No stacktrace found for following nodes getitem = inputs[0] getitem_1 = inputs[1] getitem_2 = inputs[2]; inputs = None getitem_3 = sizes[0]; getitem_3 = None getitem_4 = sizes[1]; getitem_4 = None getitem_5 = sizes[2]; getitem_5 = None getitem_6 = sizes[3]; getitem_6 = None getitem_7 = sizes[4]; getitem_7 = None getitem_8 = sizes[5]; getitem_8 = None getitem_9 = sizes[6]; getitem_9 = None getitem_10 = sizes[7]; getitem_10 = None getitem_11 = sizes[8]; getitem_11 = None getitem_12 = sizes[9]; getitem_12 = None getitem_13 = sizes[10]; getitem_13 = None getitem_14 = sizes[11]; getitem_14 = None getitem_15 = sizes[12]; getitem_15 = None getitem_16 = sizes[13]; getitem_16 = None getitem_17 = sizes[14]; getitem_17 = None getitem_18 = sizes[15]; getitem_18 = None getitem_19 = sizes[0] getitem_20 = sizes[1] getitem_21 = sizes[2] getitem_22 = sizes[3] getitem_23 = sizes[4] getitem_24 = sizes[5] getitem_25 = sizes[6] getitem_26 = sizes[7] getitem_27 = sizes[8] getitem_28 = sizes[9] getitem_29 = sizes[10] getitem_30 = sizes[11] getitem_31 = sizes[12] getitem_32 = sizes[13] getitem_33 = sizes[14] getitem_34 = sizes[15]; sizes = None ``` after: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpCo5T6B/0_-_-_0/compiled_autograd_graph_1.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class CompiledAutograd0(torch.nn.Module): def forward(self, inputs, sizes, scalars, hooks): # No stacktrace found for following nodes getitem = inputs[0] getitem_1 = inputs[1] getitem_2 = 
inputs[2]; inputs = None getitem_3 = sizes[0] getitem_4 = sizes[1] getitem_5 = sizes[2] getitem_6 = sizes[3] getitem_7 = sizes[4] getitem_8 = sizes[5] getitem_9 = sizes[6] getitem_10 = sizes[7] getitem_11 = sizes[8] getitem_12 = sizes[9] getitem_13 = sizes[10] getitem_14 = sizes[11] getitem_15 = sizes[12] getitem_16 = sizes[13] getitem_17 = sizes[14] getitem_18 = sizes[15]; sizes = None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146875 Approved by: https://github.com/jansel ghstack dependencies: #146720, #146735	2025-02-13 21:41:33 +00:00
Simon Fan	76dacd5fc7	[ca] log graph before reordering passes (#146735 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146735 Approved by: https://github.com/jansel ghstack dependencies: #146720	2025-02-13 21:41:33 +00:00
Gajanan Choudhary	cdbf677cdd	Remove outdated comment in ATen/mkl/Sparse.h about lack of Windows support (#147125 ) Fixes #147124. * #102604 added support for Intel oneMKL Sparse BLAS APIs so there was an outdated comment left around in the codebase that can now be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147125 Approved by: https://github.com/janeyx99	2025-02-13 21:34:05 +00:00
Aaron Gokaslan	1f41ceb713	[BE][Ez]: Enable ruff rule banning print in assert (#146615 ) Enables a few ruff rules * Ban print statements within asserts (likely bugs) * ~Use string for Decimal literal to prevent loss of precision~ * ~Do not use default args for __post__init__ in dataclasses, they likely were meant to go into the factory method, the __init__, or somewhere else. The default values are useless here.~ Wait until ruff upgrade for the last 2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615 Approved by: https://github.com/jansel	2025-02-13 21:14:00 +00:00
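To illustrate why a `print` call inside an `assert` message is "likely a bug", here is a minimal sketch. The function names `check_buggy`, `check_fixed`, and `failure_message` are hypothetical and do not come from the PyTorch codebase. `print` returns `None`, so the raised `AssertionError` carries no message; the text only goes to stdout as a side effect.

```python
def check_buggy(value):
    # print() runs only when the assertion fails and returns None,
    # so the AssertionError ends up with None as its message.
    assert value > 0, print("value must be positive")


def check_fixed(value):
    # Passing the string directly attaches it to the AssertionError.
    assert value > 0, "value must be positive"


def failure_message(check, value):
    """Run a check and return the message attached to its AssertionError."""
    try:
        check(value)
    except AssertionError as exc:
        return exc.args[0] if exc.args else None
    return "no failure"


buggy_message = failure_message(check_buggy, -1)
fixed_message = failure_message(check_fixed, -1)
```

Note that this sketch assumes Python is run without `-O`, since that flag strips `assert` statements entirely.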
angelayi	5469e5c556	[export] Minor fix to locals (#146955 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146955 Approved by: https://github.com/bobrenjc93	2025-02-13 20:29:15 +00:00
Bin Bao	7b4efb492b	[inductor][refactor] Make _compile_file only used for fbcode (#147106 ) Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder. Test Plan: CI Differential Revision: D69592025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106 Approved by: https://github.com/yushangdi	2025-02-13 20:22:31 +00:00
Chen Lai	2d3db4509a	fix pt2e block wise quantization test (#147035 ) Differential Revision: D69559217 https://github.com/pytorch/pytorch/pull/145941 breaks the unit test added for prepare pt2e + block wise quantization. Fixing Pull Request resolved: https://github.com/pytorch/pytorch/pull/147035 Approved by: https://github.com/andrewor14	2025-02-13 19:44:56 +00:00
Yang Wang	b0553cee6b	[Utilization] post-test-process workflow (#145310 ) # Overview Add reusable workflow to trigger the post-test right after each test job is complete. Cousion with pr to setup the runner permissions: Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files add to lix fleet:https://github.com/pytorch/ci-infra/pull/322/files Currently I turn on the debug flag for testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310 Approved by: https://github.com/huydhn	2025-02-13 18:51:19 +00:00
Henry Tsang	260b21b8bc	[cutlass backend] Do not change dtype of GEMM template (#146877 ) I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69085556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877 Approved by: https://github.com/ColinPeppler	2025-02-13 18:36:16 +00:00
rzou	92d448ff62	Add self to CODEOWNERS for fx/proxy.py; warn against adding new node arg types (#147031 ) Not sure if there's a better way Pull Request resolved: https://github.com/pytorch/pytorch/pull/147031 Approved by: https://github.com/StrongerXi ghstack dependencies: #147016, #147012, #147013	2025-02-13 18:21:21 +00:00
PyTorch MergeBot	9a883007a2	Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 )" This reverts commit c7515da7b00de40942c83dc5856b6daec727e280. Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))	2025-02-13 18:04:26 +00:00
PyTorch MergeBot	65e8862b9a	Revert "[cond] make cond re-dispatch in proxy mode (#146954 )" This reverts commit 2ce6de2415fb6592dd4447ebea334fd12b8c31ea. Reverted https://github.com/pytorch/pytorch/pull/146954 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert it to cleanly revert 140979 ([comment](https://github.com/pytorch/pytorch/pull/146954#issuecomment-2657357742))	2025-02-13 18:02:33 +00:00
Nikhil Gupta	1f8ff6812d	[Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836 ) Fixes: https://github.com/pytorch/pytorch/issues/146740 Description: 1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases. Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836 Approved by: https://github.com/malfet	2025-02-13 17:49:26 +00:00
Brian Hirsh	447a142de2	support input mutations on tangents in compile (#141131 ) Fixes https://github.com/pytorch/pytorch/issues/141111. We previously supported mutations on saved activations that happened in the backward. This PR extends the support to tangents Pull Request resolved: https://github.com/pytorch/pytorch/pull/141131 Approved by: https://github.com/zou3519	2025-02-13 17:48:56 +00:00
Saurabh Mishra	7077d0ac8c	[DCP] Introduce modules metadata in the storage_meta (#146654 ) Summary: Introduce the list of modules in the storage_meta which is shared between the planner and the storage writer. We will use it to let the storage writer know about the modules in the state dict and create module directories in the checkpoint. Test Plan: UTs Pull Request resolved: https://github.com/pytorch/pytorch/pull/146654 Approved by: https://github.com/MeetVadakkanchery	2025-02-13 17:44:30 +00:00
PyTorch MergeBot	938209fb6f	Revert "Use 2022 as default VC_YEAR for windows builds (#147053 )" This reverts commit 858bc0cea50614d1e190e6991d974ddb0f53fc88. Reverted https://github.com/pytorch/pytorch/pull/147053 on behalf of https://github.com/atalman due to Broke windows tests ([comment](https://github.com/pytorch/pytorch/pull/147053#issuecomment-2657239501))	2025-02-13 17:09:37 +00:00
Will Constable	683178fabc	[cuda] fix printing of num_gpus (#146838 ) Previously on machines with less than 8 gpus, the device==7 case would trigger the assert inside getDeviceProperties, and print `num_gpus=BEL` which is ascii for 7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146838 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-02-13 15:23:35 +00:00
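The `num_gpus=BEL` symptom can be reproduced in miniature. This is only a sketch in Python, assuming the count was held in a one-byte integer that the C++ stream formatted as a character; the commit message does not spell out the exact mechanism, and the variable names are hypothetical.

```python
# Hypothetical illustration; the real code is C++ inside PyTorch's CUDA layer.
num_gpus = 7  # a small count, as it might be stored in a one-byte integer type

# Formatting the value as a character yields ASCII 7, the BEL control code.
as_char = "num_gpus=" + chr(num_gpus)

# Formatting it as an integer yields the intended, readable text.
as_int = "num_gpus=" + str(num_gpus)

print(repr(as_char))
print(repr(as_int))
```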
Nikhil Gupta	020232ec9f	[Submodule]: Update KleidiAI submodule to v1.3.0 (#146480 ) Change-Id: I687255982c72ee7daca438a15b718f07298963cc Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480 Approved by: https://github.com/digantdesai, https://github.com/malfet	2025-02-13 15:23:04 +00:00
Ivan Skorokhodov	df776d64f7	chore: fix typos in error messages in FSDP (#146805 ) Fixes two small typos in FSDP error messages Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805 Approved by: https://github.com/awgu, https://github.com/Skylion007	2025-02-13 15:22:13 +00:00
Anatoly Myachev	345f556628	Fix `DispatchStub.cpp` compilation for gcc 14 (#146512 ) Otherwise I get the following error: ```bash .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: error: no matching function for call to ‘find(std::array<c10::DeviceType, 7>::const_iterator, std::array<c10::DeviceType, 7>::const_iterator, const c10::DeviceType&)’ 152 \| if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) { \| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /usr/include/c++/14/bits/locale_facets.h:48, from /usr/include/c++/14/bits/basic_ios.h:37, from /usr/include/c++/14/ios:46, from /usr/include/c++/14/ostream:40, from .../intel-xpu-backend-for-triton/pytorch/c10/core/DeviceType.h:13, from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.h:3, from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:2: /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)’ 435 \| find(istreambuf_iterator<_CharT> __first, \| ^~~~ /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: template argument deduction/substitution failed: .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: note: mismatched types ‘std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >’ and ‘const std::array<c10::DeviceType, 7>::value_type’ {aka ‘const c10::DeviceType’} 152 \| if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) { \| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Pull Request 
resolved: https://github.com/pytorch/pytorch/pull/146512 Approved by: https://github.com/Skylion007	2025-02-13 15:21:59 +00:00
IvanKobzarev	7c3b2a29ec	[subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146897 Approved by: https://github.com/bdhirsh	2025-02-13 15:21:19 +00:00
PyTorch UpdateBot	e2479d7809	Update slow tests (#146822 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146822 Approved by: https://github.com/pytorchbot	2025-02-13 15:20:58 +00:00
Yidi Wu	aeabbffe15	Disable test with dynamo for schema gen (#146865 ) Fixes https://github.com/pytorch/pytorch/issues/141202. 1. We skip the schema gen tests under dynamo. https://github.com/pytorch/pytorch/issues/141202 fails in a weird way: it claims node is an integer, even though we run isinstance checks [here](https://github.com/pytorch/pytorch/blob/main/torch/_library/utils.py#L234-L241). This is probably dynamo messing with the stacks, and checking fx.Node isn't really what dynamo is designed for. 2. We move some of the legitimate cond tests out of schema gen and put them back into the control flow tests. Also rename _test_export to lengthier names. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146865 Approved by: https://github.com/zou3519	2025-02-13 15:20:52 +00:00
angelayi	67c4c39b4f	[docs] Minor fixes to export and aoti docs (#144513 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144513 Approved by: https://github.com/yushangdi, https://github.com/desertfire	2025-02-13 15:19:35 +00:00
briancoutinho	d1997b610f	update kineto submodule (#147015 ) Fix https://github.com/pytorch/kineto/issues/1032 See https://github.com/pytorch/kineto/pull/1035 for testplan Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015 Approved by: https://github.com/sraikund16, https://github.com/Skylion007	2025-02-13 15:13:18 +00:00
Aaron Gokaslan	8d94eb1e3b	[BE]: Make OrderedSet reversible (#146904 ) It's rather trivial to make OrderedSet reversible, so let's do it and unlock that additional functionality for downstream users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146904 Approved by: https://github.com/eellison	2025-02-13 15:11:48 +00:00
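As a rough sketch of why reversibility is "rather trivial" for an insertion-ordered set: if the set is backed by a `dict` (an assumption here, not a statement about PyTorch's actual implementation), `__reversed__` can delegate to the dict, which has supported `reversed()` since Python 3.8. `TinyOrderedSet` is a hypothetical stand-in class.

```python
class TinyOrderedSet:
    """Hypothetical dict-backed ordered set; not PyTorch's OrderedSet."""

    def __init__(self, items=()):
        # dict.fromkeys keeps first-insertion order and drops duplicates.
        self._dict = dict.fromkeys(items)

    def __iter__(self):
        return iter(self._dict)

    def __reversed__(self):
        # Dicts are reversible, so the set can simply delegate.
        return reversed(self._dict)

    def __len__(self):
        return len(self._dict)

    def __contains__(self, item):
        return item in self._dict


s = TinyOrderedSet([3, 1, 2, 1])
forward = list(s)
backward = list(reversed(s))
```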
Andrey Talman	858bc0cea5	Use 2022 as default VC_YEAR for windows builds (#147053 ) New Windows AMI does not have Visual Studio 2019. Hence use 2022 as default. See: https://github.com/pytorch/test-infra/pull/6226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147053 Approved by: https://github.com/huydhn	2025-02-13 14:37:55 +00:00
FFFrog	f95bdf5e6c	Make GetCPUAllocatorMaybePinned to be Device-Agnostic (#146687 ) ---- - Keep cuda first to preserve BC - Remove cuda first if it is possible to have only one accelerator at a time in the future Pull Request resolved: https://github.com/pytorch/pytorch/pull/146687 Approved by: https://github.com/ngimel	2025-02-13 13:09:48 +00:00
Mu-Chu Lee	e21181642f	[AOTInductor] Align behavior between CPU and GPU (#145459 ) Summary: (1) Make sure CPU and GPU don't have different implementations and behavior when called from the same path and API. The only difference between CPU and GPU after this PR should be the running hardware. (2) This PR fixes the issue of memory access when it==constants_map.end() (3) This PR resolves T179437596 Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu Differential Revision: D68540744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459 Approved by: https://github.com/desertfire, https://github.com/hl475	2025-02-13 09:50:18 +00:00
Xia, Weiwen	ca3aabc8e6	[Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs. Test plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4 python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #145245	2025-02-13 08:40:12 +00:00
Zhang, Jianyi	17d3a69c32	[Intel GPU] fix memory leak in deconv backward (#144385 ) Fixes #143807 We need to manage the oneDNN scratchpad in PyTorch; otherwise oneDNN will always allocate scratchpad memory during primitive execution, which causes a memory leak. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144385 Approved by: https://github.com/liangan1, https://github.com/EikanWang	2025-02-13 07:41:34 +00:00
David Berard	43496e9b90	[NJT] fix flop counter for SDPA & test (#147032 ) Fixes 3 issues: 1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed. 2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added) 3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor. Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032 Approved by: https://github.com/jbschlosser	2025-02-13 07:14:58 +00:00
tim	b9a22b3f37	bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623 ) This pr addresses the issue in the MPS backend for `_scaled_dot_product_attention_math_mps` where a 3d input like (num_heads, seq_len, query_dim) cannot be automatically treated as (1, num_heads, seq_len, query_dim), which can be inferred on cpu or cuda, which can be circumvented by adding a util function to ensure a 4d shape. The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835, in [transformers qwen2_vl](`1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)`), 3d q/k/v were passed into sdpa function, which lead to an error. Considering consistency, since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same intuition across all platforms. --- reproduce code: ``` import torch import torch.nn.functional as F head_num, seq_len, embed_dim = 16, 16, 80 bsz = 1 q = torch.randn(head_num, seq_len, embed_dim) k = torch.randn(head_num, seq_len, embed_dim) v = torch.randn(head_num, seq_len, embed_dim) attention_mask = torch.ones(1, seq_len, seq_len) oo_cpu = F.scaled_dot_product_attention( q.to("cpu"), k.to("cpu"), v.to("cpu"), attention_mask.to("cpu"), dropout_p=0.0 ) if torch.backends.mps.is_available(): oo_mps = F.scaled_dot_product_attention( q.to("mps"), k.to("mps"), v.to("mps"), attention_mask.to("mps"), dropout_p=0.0 ) assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5) ``` error outputs: ``` Traceback (most recent call last): File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module> oo_mps = F.scaled_dot_product_attention( IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3) ``` hardware and envs: ``` torch 2.6.0 apple m3 max ``` 
--- Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623 Approved by: https://github.com/malfet Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-13 07:00:51 +00:00
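The "ensure a 4d shape" idea from the commit above can be sketched at the level of shape tuples, without touching tensors. This is an illustrative assumption about what such a utility might do, not the actual MPS implementation; `ensure_4d_shape` is a hypothetical name.

```python
def ensure_4d_shape(shape):
    """Return (shape_4d, was_3d) for an attention input shape.

    A 3D shape (num_heads, seq_len, head_dim) gets a leading batch
    dimension of 1; a 4D shape is passed through unchanged.
    """
    if len(shape) == 3:
        return (1, *shape), True
    if len(shape) == 4:
        return tuple(shape), False
    raise ValueError(f"expected a 3D or 4D shape, got {len(shape)}D")


# Shapes taken from the reproduction script in the commit message.
shape_4d, was_3d = ensure_4d_shape((16, 16, 80))
```

The `was_3d` flag is there so a caller could drop the added batch dimension from the output again, keeping the result shape consistent with the input.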
Isalia20	17a808557c	[MPS] cholesky ex version (#146799 ) PR #145701 didn't have experimental version of cholesky. This PR adds that version Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799 Approved by: https://github.com/malfet	2025-02-13 07:00:21 +00:00
Ke Wen	4879f8f919	[TP] Add warning when module is distributed twice (#147006 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006 Approved by: https://github.com/XilunWu	2025-02-13 06:49:17 +00:00
Aaron Gokaslan	3e4172d985	[BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985 ) This submodule just fixes a bunch of miscellaneous bugfix issues with ABI compatibility, compiler warning, workarounds for older compilers, performance, and edge cases in formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985 Approved by: https://github.com/drisspg	2025-02-13 06:47:11 +00:00
Yu, Guangye	aa20b4b6cf	Friendly handle mem_get_info's runtime error message (#146899 ) # Motivation Friendly handle the runtime error message if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899 Approved by: https://github.com/EikanWang	2025-02-13 06:26:19 +00:00
Nikita Shulga	66fb10fc53	[BE][OpInfo] Introduce generic `dtypesIf` (#146905 ) Use `__setattr__` and `__getattribute__` to wrap existing `dtypesIfXYZ` into it, which will allow for subsequent incremental elimination of those Also, type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality it's converted to set in post init. Test Plan: - Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script ```python from torch.testing._internal.common_methods_invocations import op_db print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']}) ``` - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905 Approved by: https://github.com/janeyx99	2025-02-13 05:33:17 +00:00
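One way to picture wrapping the existing `dtypesIfXYZ` attributes into a generic `dtypesIf` with `__setattr__` and `__getattribute__` is the miniature below. `MiniOpInfo` is hypothetical, and both the lowercase device keys and the fallback to `dtypes` for unset devices are assumptions made for this sketch, not details confirmed by the commit.

```python
_PREFIX = "dtypesIf"


class MiniOpInfo:
    """Hypothetical miniature; not the real OpInfo class."""

    def __init__(self, dtypes, **kwargs):
        # Create the generic table first, bypassing the custom __setattr__.
        object.__setattr__(self, "dtypesIf", {})
        self.dtypes = set(dtypes)
        for name, value in kwargs.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        if name.startswith(_PREFIX) and name != _PREFIX:
            # Legacy spelling: dtypesIfCUDA = ... is stored as dtypesIf["cuda"].
            self.dtypesIf[name[len(_PREFIX):].lower()] = set(value)
        else:
            object.__setattr__(self, name, value)

    def __getattribute__(self, name):
        if name.startswith(_PREFIX) and name != _PREFIX:
            table = object.__getattribute__(self, "dtypesIf")
            default = object.__getattribute__(self, "dtypes")
            # Assumption: devices without an override fall back to dtypes.
            return table.get(name[len(_PREFIX):].lower(), default)
        return object.__getattribute__(self, name)


op = MiniOpInfo(["float32"], dtypesIfCUDA=["float32", "float16"])
op.dtypesIfROCM = ["bfloat16"]
```

With this shape, old call sites that read or write `op.dtypesIfCUDA` keep working while new code can use the `op.dtypesIf` mapping directly, which is what would allow the legacy attributes to be removed incrementally.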
PyTorch UpdateBot	43eb39d7c8	[executorch hash update] update the pinned executorch hash (#145128 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145128 Approved by: https://github.com/pytorchbot	2025-02-13 05:06:44 +00:00
Rachel Guo	88d0bb0fee	[aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020 ) Summary: per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info. can be useful in identifying issues such as "wrong AOTI Lowering precisions" Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm ``` Differential Revision: D69547344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-02-13 04:41:00 +00:00
PyTorch UpdateBot	2ff3fdfdae	[audio hash update] update the pinned audio hash (#146738 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146738 Approved by: https://github.com/pytorchbot	2025-02-13 04:29:46 +00:00
amathewc	936df4571b	Update test_c10d_object_collectives.py with DistributedTestBase class (#145056 ) # MOTIVATION To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](https://github.com/pytorch/pytorch/pull/138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs. # CHANGES Replaced MultiProcessTestCase with the DistributedTestBase class. Extended test functionality to include support for HPUs. Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances. Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145056 Approved by: https://github.com/kwen2501, https://github.com/guangyey	2025-02-13 03:57:59 +00:00
Menglu Yu	a9598337b7	[Optimus] Include more corner cases in the select cat aten pass (#146662 ) Summary: Thanks to Shuai for reporting the bug in the pattern. We found there's a typo in the pass, where we should make sure all the selects will go to the cat node. Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad Buck UI: https://www.internalfb.com/buck2/2cd0888e-d803-43a8-8530-d97e6bc281b3 Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449699305108 Network: Up: 110KiB Down: 35KiB (reSessionID-687be0fa-031a-47a0-8780-5ab4cf4bbd94) Executing actions. Remaining 0/4 6.6s exec time total Command: test. Finished 2 local Time elapsed: 2:12.0s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D69278487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146662 Approved by: https://github.com/Microve	2025-02-13 03:40:26 +00:00
zeshengzong	6ca497a8e5	Replace is_same with is_same_v for concise syntax (#145450 ) Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450 Approved by: https://github.com/huydhn	2025-02-13 03:29:39 +00:00
Tugsbayasgalan Manlaibaatar	c159723c39	Fix meta impl for topk (#147017 ) Topk in this context is always size-like so we should use torch._check_is_size. Fixes some issue in https://github.com/pytorch/pytorch/issues/146990 Differential Revision: [D69545983](https://our.internmc.facebook.com/intern/diff/D69545983) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147017 Approved by: https://github.com/ydwu4	2025-02-13 03:18:47 +00:00
drisspg	821422018a	[FlexAttention] Make zero_length sequence handling better (#147010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147010 Approved by: https://github.com/Chillee	2025-02-13 03:18:24 +00:00
Nikita Shulga	54e28b2a71	[BE] Turn nextafter into functor (#147018 ) This functor is a bit more involved as nextafter is missing for MacOS13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147018 Approved by: https://github.com/dcci ghstack dependencies: #146965, #146993, #147023	2025-02-13 02:10:29 +00:00
Joona Havukainen	aaa46c0625	Add missing autoreleasepool around runUniqueGraph to prevent leaks (#145512 ) References were held onto longer than needed. Added autoreleasepool around the runUniqueGraph to allow the memory to be freed. Fixes #145151 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145512 Approved by: https://github.com/malfet Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-13 01:58:18 +00:00
Nikita Shulga	e0ca041ae3	[BE] Toward Metal Iterator (step 2) (#147023 ) Add a dense flavor of the binary ops: if the iterator is contiguous, do not build indices but run a different flavor using the same functor. This results in an almost 100% perf gain for a binary operation with 1mln elements of `torch.fmax`, as one can see from the table below, collected on an M4Pro Mini using the following benchmarking script ```python import torch from timeit import default_timer from itertools import product from torch.utils.benchmark import Measurement, Timer def bench_binary( n, binary_func, dtype=torch.float32, ) -> Measurement: t = Timer( stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()", setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)", globals = {'f': binary_func}, language="python", timer=default_timer ) return t.blocked_autorange() if __name__ == "__main__": n = 1024*2 for dtype in [torch.float32, torch.float16, torch.bfloat16]: eager_t = bench_binary(n, torch.fmax, dtype) use_msec = eager_t.mean > 1e-4 multiplier = 1e3 if use_msec else 1e6 uname = "msec" if use_msec else "usec" print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}") ``` Dtype \| Time before \| Time After \| \| ------\|------------ \| ---------- \| \| float32 \| 0.84 msec \| 0.66 msec \| \| float16 \| 0.49 msec \| 0.23 msec \| \| bfloat16 \| 0.48 msec \| 0.22 msec \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/147023 Approved by: https://github.com/dcci ghstack dependencies: #146965, #146993	2025-02-13 01:50:43 +00:00
zeshengzong	80f146dedf	Update addbmm, addmm, addmv and baddbmm description (#146689 ) Fixes #146611, following #146482 ## Test Result ![image](https://github.com/user-attachments/assets/5c1749be-1f10-4e80-a284-b1929ca340eb) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146689 Approved by: https://github.com/mikaylagawarecki	2025-02-13 01:30:50 +00:00
rzou	5dab0aeef0	[SkipFiles] Some more cleanup (#147013 ) This isn't a no-op but I think it's fine. It changes the case where a function f1 in a module in MOD_SKIPFILES calls a function f2 in one of the deleted modules. Previously f2 would have been skipped, now f2 gets inlined. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013 Approved by: https://github.com/yanboliang ghstack dependencies: #147016, #147012	2025-02-13 01:18:47 +00:00
rzou	fddaa2958b	[SkipFiles] Some more cleanup (#147012 ) I think these are all no-ops. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012 Approved by: https://github.com/yanboliang ghstack dependencies: #147016	2025-02-13 01:18:47 +00:00
rzou	87ebd77b34	Add some more docs to trace_rules.py (#147016 ) After discussing with Yanbo we wanted to record the behavior down so we don't need to rederive them in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016 Approved by: https://github.com/yanboliang	2025-02-13 01:18:39 +00:00
Animesh Jain	b77a6eb184	[dynamo] Fix tensordict regression (#146995 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146995 Approved by: https://github.com/StrongerXi ghstack dependencies: #146819	2025-02-13 00:59:59 +00:00
Yidi Wu	2ce6de2415	[cond] make cond re-dispatch in proxy mode (#146954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954 Approved by: https://github.com/zou3519	2025-02-13 00:50:33 +00:00
angelayi	67cbbb29e0	[export] Dedup expression_created logs (#146859 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146859 Approved by: https://github.com/pianpwk ghstack dependencies: #146532, #146533, #146534, #146858	2025-02-13 00:21:34 +00:00
angelayi	59bc5d0d71	[tlparse] Add stacktrace filter utility (#146858 ) Added a utility function for capturing the user stack and framework stacktrace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146858 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #146532, #146533, #146534	2025-02-13 00:21:34 +00:00
angelayi	43f5566c92	[export] Add additional tlparse logging (#146534 ) Added some additional logging so we can also run tlparse on generic export errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/146534 Approved by: https://github.com/pianpwk ghstack dependencies: #146532, #146533	2025-02-13 00:21:34 +00:00
angelayi	b4bdbce1ac	[export] Use custom stream logger in draft-export (#146533 ) Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows: ```python if key == "missing_fake_kernel": return hash((key, data["op"])) # Same ops get deduped elif key == "mismatched_fake_kernel": return hash((key, data["op"], data["reason"])) # Same op and reason for errors get deduped elif key == "propagate_real_tensors": return hash((key, json.dumps(data["stack"]))) # Guards appearing on the same stacktrace get deduped elif key == "create_unbacked_symbol": return hash((key, json.dumps(data["stack"]))) # Unbacked symbols appearing on the same stacktrace get deduped ``` Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code which creates a new unbacked symint + runs into a DDE gets called 800 times, causing 800 new symints to be created, and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this. The con of this is that if there exists multiple DDE on the same stacktrace, we will only show the first issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #146532	2025-02-13 00:21:34 +00:00
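The dedup schema in the commit above is a fragment; a minimal runnable sketch of how such keys could drive a dedup buffer follows. The log keys (`missing_fake_kernel`, `mismatched_fake_kernel`, `propagate_real_tensors`, `create_unbacked_symbol`) come from the commit message, while the `dedup_key` and `dedup` helpers and the sample records are hypothetical and are not the logger that the PR adds.

```python
import json


def dedup_key(key, data):
    """Return a hash identifying records that should be collapsed, or None."""
    if key == "missing_fake_kernel":
        return hash((key, data["op"]))  # same op
    if key == "mismatched_fake_kernel":
        return hash((key, data["op"], data["reason"]))  # same op and reason
    if key in ("propagate_real_tensors", "create_unbacked_symbol"):
        return hash((key, json.dumps(data["stack"])))  # same stacktrace
    return None  # other record kinds are never deduplicated


def dedup(records):
    """Keep only the first record for each dedup key (hypothetical helper)."""
    seen = set()
    kept = []
    for key, data in records:
        h = dedup_key(key, data)
        if h is not None:
            if h in seen:
                continue
            seen.add(h)
        kept.append((key, data))
    return kept


# Made-up sample records: the guards share one stacktrace but differ in expression.
stack = [{"filename": "model.py", "line": 10}]
records = [
    ("missing_fake_kernel", {"op": "mylib.foo"}),
    ("missing_fake_kernel", {"op": "mylib.foo"}),
    ("propagate_real_tensors", {"stack": stack, "expr": "u0 > 0"}),
    ("propagate_real_tensors", {"stack": stack, "expr": "u1 > 0"}),
]
kept = dedup(records)
print(len(kept))
```

The second guard is dropped even though its expression differs, which is the trade-off the commit message describes: only the first issue on a given stacktrace is shown.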
angelayi	be387f57b1	[symbolic shapes] Log SymNode id for provenance (#146532 ) We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse: <img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146532 Approved by: https://github.com/bobrenjc93	2025-02-13 00:21:34 +00:00
Raymond Li	21c2565f35	Document dynamo (#146736 ) Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that. Note: documentation was AI-generated and could be incorrect, please review carefully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736 Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519	2025-02-13 00:02:21 +00:00
Ting Lu	0344bf8a5a	[cuDNN] cuDNN to 9.7.1.26 for CUDA 12.8 (#146957 ) rebasing for https://github.com/pytorch/pytorch/pull/146717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146957 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/atalman Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-12 23:43:34 +00:00
Jing Shan	d5a2e4c754	[oncall] Change error message to be more readable (#146934 ) Summary: During oncall, got a debug, where the error message is a bit ambiguous, due to multiple colons, and full line cutoff ``` AssertionError: Expected order: 1 for the component: remote_request_only to be >= 2, the max order for all its ``` Update the error message to something like ``` AssertionError: Component remote_request_only order must be >= max order of its upstream components, got component order=1 and max=2 ``` Test Plan: CI Differential Revision: D69482789 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146934 Approved by: https://github.com/ColinPeppler	2025-02-12 23:33:09 +00:00
Benjamin Glass	ad4e5bf705	cpp_wrapper: handle mixed-device C-shim fallbacks (#146449 ) Fixes an error from test_torch, where a CUDA cpp_wrapper run called a CUDA native C-shim kernel with two CPU tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146449 Approved by: https://github.com/desertfire	2025-02-12 23:21:04 +00:00
Oguz Ulgen	076215944a	Turn on autograd local caches in fbcode (#146996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146996 Approved by: https://github.com/jamesjwu	2025-02-12 23:04:39 +00:00
Howard Huang	c60f587c04	Fix shape_inference for V-schedules (#147000 ) I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan. `self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules: `bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)` Will clean up / delete the use of next_rank / prev rank in follow up PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000 Approved by: https://github.com/wconstab	2025-02-12 22:56:46 +00:00
Guilherme Leobas	f954aac6be	Add `make_dynamo_test` (#146491 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146491 Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/malfet	2025-02-12 22:54:29 +00:00
Justin Chu	fd21126007	[ONNX] Deprecation message follow up (#147005 ) Follow up on https://github.com/pytorch/pytorch/pull/146923 to address comments. This pull request includes updates to the `torch/onnx` module, focusing on deprecations and documentation improvements. The most important changes involve moving version change notes within the `export` function, updating deprecation messages, and removing example code in the `dynamo_export` function. Documentation and Deprecation Updates: * [`torch/onnx/__init__.py`](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184): Moved version change notes to the correct location within the `export` function's docstring. Updated the deprecation note for the `dynamo_export` function to version 2.7 and removed example code from its docstring. [[1]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184) [[2]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R349-R357) [[3]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L434-R430) [[4]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L445-L475) * [`torch/onnx/utils.py`](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114): Enhanced deprecation messages for several functions (`select_model_mode_for_export`, `disable_apex_o2_state_dict_hook`, `setup_onnx_logging`, `unconvertible_ops`) to provide clearer guidance on their removal and suggest copying logic if needed. 
[[1]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114) [[2]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL148-R151) [[3]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL166-R173) [[4]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1180-R1189) [[5]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1190-R1199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147005 Approved by: https://github.com/titaiwangms	2025-02-12 22:48:56 +00:00
Justin Chu	f655f840b8	[ONNX][dort] Remove reference to onnxscript rewriter (#147003 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147003 Approved by: https://github.com/titaiwangms, https://github.com/gramalingam, https://github.com/shubhambhokare1	2025-02-12 22:02:07 +00:00
Link Li	995f607c74	fix doc string (#146968 ) Fixes a wrong function name in doc string Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968 Approved by: https://github.com/zackycao, https://github.com/H-Huang	2025-02-12 21:43:16 +00:00
Nikita Shulga	06a07f6018	[BE] Towards MetalTensorIterator (#146993 ) Further refactor binary kernels to replace individual implementations with a binary_indexing_kernel template that takes functors that implement the logic. According to godbolt, such refactoring should have no impact on performance, as the compiler, through dead code elimination, should just replace the functor with a direct call to the underlying function, as one can see for the clang CPU compiler here: https://godbolt.org/z/8dxv5jvz7 but to be on the safe side, run the following benchmark ```python import torch from timeit import default_timer from itertools import product from torch.utils.benchmark import Measurement, Timer def bench_binary( n, binary_func, dtype=torch.float32, ) -> Measurement: t = Timer( stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()", setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)", globals = {'f': binary_func}, language="python", timer=default_timer ) return t.blocked_autorange() if __name__ == "__main__": n = 1024*2 for dtype in [torch.float32, torch.float16, torch.bfloat16]: eager_t = bench_binary(n, torch.fmax, dtype) use_msec = eager_t.mean > 1e-4 multiplier = 1e3 if use_msec else 1e6 uname = "msec" if use_msec else "usec" print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}") ``` That reports roughly identical before and after times (1 msec for float32 and .5 msec for float16). Another interesting quirk is that functors cannot be in an anonymous namespace, otherwise they will not be visible from the library, as one can see by running the following swift sample (filed FB16490467 to clarify if this is supported) ```swift let shader_source = """ struct add_functor { template <typename T> inline T operator()(const T a, const T b) { return static_cast<T>(a + b); } }; namespace { struct sub_functor { template <typename T> inline T operator()(const T a, const T b) { return static_cast<T>(a - b); } }; } // anonymous namespace template <typename T, 
typename F> kernel void binary_executor( constant T* input [[buffer(0)]], constant T* other [[buffer(1)]], device T* out [[buffer(2)]], uint tid [[thread_position_in_grid]]) { F f; out[tid] = f(input[tid], other[tid]); } template [[host_name("add_float")]] kernel void binary_executor<float, add_functor>(constant float*, constant float*, device float*, uint); template [[host_name("sub_float")]] kernel void binary_executor<float, sub_functor>(constant float*, constant float*, device float*, uint); """ import Metal guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } let library = try! device.makeLibrary(source:shader_source, options:MTLCompileOptions()) // Expect two kernels to be printed, but see only one, with functor in global namespace for kernel_name in library.functionNames { print(kernel_name) } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146993 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #146965	2025-02-12 21:40:40 +00:00
Brian Hirsh	de964b9f8b	dont specialize symints when testing truthiness (#146731 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146731 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #146642, #146729	2025-02-12 20:57:10 +00:00
Brian Hirsh	5cda021cac	support meta_tensor.to(device='cpu') under fake_mode (#146729 ) Fixing this is actually a bit annoying: (1) FakeTensorMode sees a function where all of its inputs are real tensors, so it tries to run the real compute before converting the output to a FakeTensor. (2) We don't actually want this, because the "real compute" is supposed to error normally when you do `meta_tensor.to(device='cpu')`. Instead, we want FakeTensor to skip constant prop and run the normal FakeTensor implementation, which will not error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146729 Approved by: https://github.com/zou3519, https://github.com/SherlockNoMad, https://github.com/albanD ghstack dependencies: #146642	2025-02-12 20:57:10 +00:00
Brian Hirsh	ec0b318ddb	[poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642 ) context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/ This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box. The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode. Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :) ``` from torch._subclasses.fake_tensor import FakeTensorMode import pickle import io import torch from contextlib import nullcontext use_fake_tensor = True with FakeTensorMode() if use_fake_tensor else nullcontext(): obj = [1, 2] f = io.BytesIO() pickle.Pickler(f).dump(obj) byte_storage = torch.ByteStorage._from_buffer(f.getvalue()) # type: ignore[attr-defined] t = torch.ByteTensor(byte_storage) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642 Approved by: https://github.com/zou3519	2025-02-12 20:57:10 +00:00
PyTorch MergeBot	8a975cb247	Revert "[cutlass backend] Do not change dtype of GEMM template (#146877 )" This reverts commit 5f2714d5e7cded0eb553d5915002e03c22e01e34. Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to mistake on logging ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2654648949))	2025-02-12 19:26:45 +00:00
Chien-Chin Huang	0de27ee7e0	Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852 Approved by: https://github.com/d4l3k	2025-02-12 18:43:52 +00:00
Nikita Shulga	352484cc83	[BE] Unify kernel templates instantiation (#146965 ) By defining a `REGISTER_BINARY_OP` template that can be used to register fmin, fmax, etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146965 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-12 18:40:45 +00:00
Justin Chu	7f62616a58	[ONNX][reland2] Create deprecation warning on dynamo_export (#146923 ) Reland two PRs - https://github.com/pytorch/pytorch/pull/146425 - https://github.com/pytorch/pytorch/pull/146639 Fixed by removing the deprecation warning on a base class `ExportOptions`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146923 Approved by: https://github.com/titaiwangms	2025-02-12 18:28:37 +00:00
Henry Tsang	5f2714d5e7	[cutlass backend] Do not change dtype of GEMM template (#146877 ) I think this is a change in the right direction. Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out. However, for the dtype of the bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template. I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template? Follow-ups are needed: 1. benchmark and dashboard 2. check our logic for setting alignment with my change https://www.internalfb.com/intern/paste/P1729604119/ without my change https://www.internalfb.com/intern/paste/P1729624806/ Differential Revision: D69085556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877 Approved by: https://github.com/ColinPeppler	2025-02-12 18:16:49 +00:00
Nichols A. Romero	bfcce6984b	[ROCm][TunableOp] Close offline tuning results file when offline tuning is disabled. (#146574 ) This PR is to fix UT breakage that has been reported internally and is considered high priority. When `tunable.record_untuned_enable(False)` is invoked, we flush the results of the untuned gemm file. Offline tuning I/O currently doesn't have a set untuned results filename member function or untuned results write to file member function. When performing back-to-back unit tests, the same ofstream ends up getting reused between UTs. Due to the way the UT are executed, this can lead to unexpected failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146574 Approved by: https://github.com/jeffdaily	2025-02-12 18:03:06 +00:00
Huy Do	04011304e5	Update dynamo expected 20250210 (#146856 ) Update all the ci accuracy expect values to make trunk green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856 Approved by: https://github.com/yanboliang	2025-02-12 18:01:20 +00:00
Animesh Jain	d6513f3246	[dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819 ) This PR adds support for list subclasses. Among other things are 1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it. 2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`). 3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods. If this PR is hard to review, please let me know. I can break it into several small PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819 Approved by: https://github.com/StrongerXi, https://github.com/jansel	2025-02-12 17:46:02 +00:00
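Point 3 of the commit above (reconstruction uses `list.__getitem__` rather than an overridden `__getitem__`) can be illustrated in plain Python, without Dynamo. This is a minimal sketch, assuming a hypothetical `DoublingList` subclass that is not part of the PR.

```python
class DoublingList(list):
    """Hypothetical list subclass whose __getitem__ alters what callers see."""

    def __getitem__(self, index):
        # The override returns twice the stored value.
        return 2 * super().__getitem__(index)


items = DoublingList([1, 2, 3])

# Normal indexing dispatches to the overridden method.
via_override = [items[i] for i in range(len(items))]

# The base-class method bypasses the override and reads the stored elements,
# which is what a faithful reconstruction of the object needs.
via_base = [list.__getitem__(items, i) for i in range(len(items))]

print(via_override)  # [2, 4, 6]
print(via_base)      # [1, 2, 3]
```

Rebuilding the object from `via_override` would bake the doubled values into the copy, so reading through the base-class method is the safer choice for any code that needs the underlying contents.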
titaiwangms	6c81435f16	[ONNX] Update CI transformers cache (#146926 ) The cached models are outdated because the related tests are all deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146926 Approved by: https://github.com/justinchuby	2025-02-12 17:02:43 +00:00
titaiwangms	b894c2824b	[ONNX] Support custom axis name through dynamic_shapes (#146321 ) Fixes #143443 This PR aims to support custom dynamic axis naming through dynamic_shapes. Currently, _Dim and _DimHint do not support dynamic axis naming (#144273). 1. the original dynamic shapes guarantee The axis renaming is only applied when dynamic shapes include string instead of all _Dim and _DimHint. Thus, there will not be any inconsistent behavior to dynamic_shapes with torch.export.export if the given dynamic shapes follow torch.export.export format. 2. _DimHint.AUTO is applied to the axes that are specified with custom names to avoid exporter crash. (_DimHint.DYNAMIC crashes when the export fails.) 3. There's no need to handle cases where kwargs are out of order with the model signature, as torch.export.export supports dynamism only when kwargs and dynamic_shapes are provided in order. `49082f9dba/torch/export/_trace.py (L2034)` 4. If `torch.onnx.ExportedProgram` finds the axes share the same constraints, they will have the same name (e.g. s0, s1, ...). Therefore, even if the ONNX users specify them with different custom names, they won't be respected. Example model: ```python class NestedModel(torch.nn.Module): def forward( self, x: torch.Tensor, ys: list[torch.Tensor], zs: dict[str, torch.Tensor], c: torch.Tensor, ): y = ys[0] + ys[1] + zs["a"] + zs["b"] w = 5 if x.shape[0] < 3 and c.shape[0] != 4: return x + w, x + y, c else: return x - w, x - y, c input = ( torch.ones(5), [torch.zeros(5), torch.ones(5)], {"a": torch.zeros(5), "b": torch.ones(5)}, torch.ones(6), ) dynamic_shapes = ( {0: torch.export.Dim("dim_x", min=3)}, # _Dim [("custom_name_axis_ys_0",), (torch.export.Dim.AUTO,)], # custom name { "a": {0: torch.export.Dim.AUTO}, "b": ("custom_name_axis_zs_b_0",), }, # _DimHint {0: "custom_name_axis_c_0"}, # custom name ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146321 Approved by: https://github.com/justinchuby	2025-02-12 17:00:03 +00:00
Xuehai Pan	9abaaad6a8	[pytree][Easy] preserve `dict` keys in insertion order in CXX pytree (#130140 ) `optree` and JAX pytree traverse the `dict` in sorted key order (see [Key Ordering for Dictionaries](https://github.com/metaopt/optree#key-ordering-for-dictionaries)), while the PyTorch Python pytree traverses the `dict` in insertion order. See also: - #114392 This aligns the behavior of CXX pytree with Python pytree. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130140 Approved by: https://github.com/zou3519	2025-02-12 16:41:49 +00:00
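The difference between the two traversal orders named in the commit above can be sketched in plain Python. The helpers `flatten_insertion_order` and `flatten_sorted_order` are hypothetical stand-ins written for illustration; they are not the `torch.utils._pytree` or `optree` APIs.

```python
def flatten_insertion_order(d):
    """Return (keys, leaves) in the order the keys were inserted."""
    keys = list(d)
    return keys, [d[k] for k in keys]


def flatten_sorted_order(d):
    """Return (keys, leaves) with the keys sorted."""
    keys = sorted(d)
    return keys, [d[k] for k in keys]


tree = {"b": 1, "a": 2, "c": 3}

insertion = flatten_insertion_order(tree)
by_sorted = flatten_sorted_order(tree)

print(insertion)  # (['b', 'a', 'c'], [1, 2, 3])
print(by_sorted)  # (['a', 'b', 'c'], [2, 1, 3])
```

The leaves come out in a different order for the same input, so two pytree implementations that disagree on key ordering would pair leaves with the wrong positions when one flattens and the other unflattens.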
Aaron Orenstein	1f8ff94d4f	PEP585: Add noqa to necessary tests (#146391 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146391 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-12 15:29:50 +00:00
Aaron Gokaslan	b61032fcf7	[BE][Ez]: Remove unnecessary type ignores from orderedset (#146902 ) After #145783, we can remove some type ignores from the ordered set class Pull Request resolved: https://github.com/pytorch/pytorch/pull/146902 Approved by: https://github.com/eellison	2025-02-12 15:00:13 +00:00
PyTorch MergeBot	ce80865f13	Revert "Replace is_same with is_same_v for concise syntax (#145450 )" This reverts commit 5205158c1b0bc5c390b2a9d83fe3b2ec5edbe3f2. Reverted https://github.com/pytorch/pytorch/pull/145450 on behalf of https://github.com/jeanschmidt due to testing to see if reverting would fix timeout in inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/145450#issuecomment-2653645466))	2025-02-12 13:01:32 +00:00
Yuanhao Ji	b0042286d4	[Dynamo] Allow dynamo to handle `str.xxx()` (#146587 ) Fixes #146350 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146587 Approved by: https://github.com/zou3519	2025-02-12 08:54:10 +00:00
Xia, Weiwen	98e16012ec	[Quant][CPU] add a wrapper op for _weight_int4pack_mm_for_cpu with tensor args (#145245 ) Summary It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU. This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle. The new op is not registered to - `aten` because it would require changing `native_functions.yaml`, which is not recommended. - `quantized_decomposed` because it would only have a Python implementation, which cannot be used for cpp wrapper in Inductor. Test plan ``` python test/test_linalg.py -k test__int4_mm ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2025-02-12 08:46:38 +00:00
Tianyu Liu	ac0f206f3c	[dtensor] fix side-effect on dtype for _like ops (#146869 ) fixes https://github.com/pytorch/pytorch/issues/146749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869 Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel	2025-02-12 08:42:14 +00:00
Zhou Fang	d774a6333d	[StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931 ) Summary: Support the following new pattern for ClipRangesToGatherToOffsets: Before optimization: ``` %18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493) %getattr_368.1 : int = prim::dtype(%18267) %to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted) %lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8) ``` After optimization: ``` %18297 : int = prim::dtype(%int_77.1) %18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297) ``` Reviewed By: garroud Differential Revision: D69373835 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931 Approved by: https://github.com/hanyilou123	2025-02-12 08:19:41 +00:00
Ahmad Sharif	ae5cc19ba7	[pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866 ) This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass. Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice. On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR). Important Note: Currently the softmax backward pass consists of an [element-wise multiply operator](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)`), followed by [this function](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)`) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%. 
``` Baseline This PR N size FP32 bandwidth FP16 bandwidth N size FP32 bandwidth FP16 bandwidth fp32 diff fp16 diff 0 256 134.340966 70.042039 0 256 133.70146 70.342753 -0.48% 0.43% 1 512 233.501185 129.945803 1 512 234.057145 132.933066 0.24% 2.30% 2 1024 340.667966 229.280464 2 1024 338.833265 226.441699 -0.54% -1.24% 3 2048 379.643726 337.452058 3 2048 399.559017 338.432284 5.25% 0.29% 4 4096 416.597537 383.625364 4 4096 428.252403 396.137506 2.80% 3.26% 5 6000 431.198241 384.384384 5 6000 457.744577 406.06275 6.16% 5.64% 6 8192 462.811252 427.292573 6 8192 474.791032 428.281563 2.59% 0.23% 7 10000 464.258731 429.050294 7 10000 483.7643 446.849381 4.20% 4.15% 8 10013 465.199701 429.824179 8 10013 464.904407 428.72184 -0.06% -0.26% 9 10240 477.07359 428.853737 9 10240 485.317024 444.902586 1.73% 3.74% 10 11000 473.038785 430.778663 10 11000 488.161438 453.462162 3.20% 5.27% 11 12000 474.342475 432.594814 11 12000 490.532418 458.427653 3.41% 5.97% 12 16384 487.468854 473.611576 12 16384 488.154406 476.264631 0.14% 0.56% 13 20000 482.029793 465.666186 13 20000 482.147092 483.886193 0.02% 3.91% 14 24000 478.368093 474.159464 14 24000 478.364948 491.447921 0.00% 3.65% 15 32000 476.523796 473.18868 15 32000 476.523796 474.398962 0.00% 0.26% 16 32768 476.104723 477.493634 16 32768 476.704463 477.330606 0.13% -0.03% 17 36864 477.900663 475.472787 17 36864 477.973279 475.728454 0.02% 0.05% 18 40960 477.707561 475.559064 18 40960 478.445017 476.088067 0.15% 0.11% 19 45056 479.169812 475.865134 19 45056 479.143266 475.878202 -0.01% 0.00% 20 49152 477.804907 475.382982 20 49152 477.868404 475.976377 0.01% 0.12% 21 65536 481.274125 478.171806 21 65536 481.537733 478.703926 0.05% 0.11% 22 66000 481.64652 480.095457 22 66000 481.856013 480.466388 0.04% 0.08% 23 68608 481.745774 479.034704 23 68608 481.917596 478.856209 0.04% -0.04% 24 80000 483.409361 480.356529 24 80000 483.330481 480.375277 -0.02% 0.00% 25 98304 480.736301 481.396882 25 98304 480.789858 481.320143 0.01% 
-0.02% ``` NCU profiler shows lower DRAM fetches with the new kernel: ![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72) NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%). I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files: ![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc) libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%. I measured the compilation time for incremental development: ``` touch ./aten/src/ATen/native/cuda/SoftMax.cu time python setup.py develop real 0m10.083s user 0m8.197s sys 0m3.149s ``` Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels: ``` # baseline nvcc time for SoftMax.cu real 0m35.341s user 0m33.801s sys 0m1.289s # this PR's nvcc time for SoftMax.cu real 0m36.513s user 0m34.722s sys 0m1.408s ``` So the `nvcc` time increases by about 1 second, or ~3% of the baseline. 
[1] `nvcc` command is here: ``` # This is the nvcc command /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866 Approved by: https://github.com/ngimel	2025-02-12 07:54:41 +00:00
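The size and compile-time deltas quoted in the commit message above (#145866) can be re-derived from the raw numbers it reports. The following is only an arithmetic check; the helper name `parse_time` is made up for this illustration and is not part of the PR.

```python
def parse_time(text):
    """Convert a `time`-style value such as '0m35.341s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# libtorch_cuda.so sizes in bytes, as quoted in the commit message.
size_before = 274_752_224
size_after = 274_787_072
size_delta = size_after - size_before
size_pct = 100 * size_delta / size_before

# Wall-clock ("real") nvcc times for SoftMax.cu, as quoted in the commit message.
nvcc_before = parse_time("0m35.341s")
nvcc_after = parse_time("0m36.513s")
nvcc_delta = nvcc_after - nvcc_before
nvcc_pct = 100 * nvcc_delta / nvcc_before

print(f"size: +{size_delta} bytes ({size_pct:.3f}%)")
print(f"nvcc: +{nvcc_delta:.3f} s ({nvcc_pct:.1f}%)")
```

The byte difference is 34,848, which matches the "34kB, about 0.01%" figure, and the `nvcc` difference is a little over one second, consistent with the "~3%" estimate.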
Wang, Chuanqi	8c80c13b34	[CD] Add python 3.13t build for xpu (#146614 ) Fixes #146451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146614 Approved by: https://github.com/atalman	2025-02-12 07:01:36 +00:00
Huy Do	b30bad710d	Update octokit/request-action to 2.4.0 (#146940 ) The current version 2.1.0 has disappeared since yesterday: * https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml * https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml The latest version is 2.4.0 https://github.com/octokit/request-action Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940 Approved by: https://github.com/izaitsevfb	2025-02-12 05:36:27 +00:00
PyTorch MergeBot	6105b6f15f	Revert "Update octokit/request-action to 2.4.0 (#146940 )" This reverts commit 7aa629f1268f6944eee6e49e43071b4342bf1669. Reverted https://github.com/pytorch/pytorch/pull/146940 on behalf of https://github.com/huydhn due to This does not work ([comment](https://github.com/pytorch/pytorch/pull/146940#issuecomment-2652691614))	2025-02-12 05:21:43 +00:00
Aleksandar Samardžić	5a1c7c424d	Fix standalone runner for CUTLASS auto-tuning backend (#146764 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146764 Approved by: https://github.com/henrylhtsang ghstack dependencies: #146755	2025-02-12 04:42:08 +00:00
Aleksandar Samardžić	eb655a2d5f	Fix CUTLASS 2.x kernels for auto-tuning (#146755 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146755 Approved by: https://github.com/henrylhtsang	2025-02-12 04:42:07 +00:00
Zhengxu Chen	683bb1242c	[export][ez] Update tag_ for union setters. (#146912 ) Summary: ez fix to set tag for union type fields. Test Plan: CI Differential Revision: D69467715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912 Approved by: https://github.com/yiming0416	2025-02-12 03:52:36 +00:00
Yichen Yan	06f8f9a017	Update instructions about faster linker (#146750 ) This PR adds instructions to specify linker via cmake env `CMAKE_LINKER_TYPE` and also adds `mold` as a linker alternative. Since 3.29, cmake introduced [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html) that can specify linker without overwriting `ld` file or changing build script. `mold` is already stable and the fastest (afaict) linker out there, and also easier to install compared with `lld`. So I added it here. After switching to `mold`, the time of linking `libtorch_cuda.so` has been reduced from ~7s to ~0.6s locally. Also note `gold` has been marked deprecated recently[1]. [1] https://lwn.net/Articles/1007541/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/146750 Approved by: https://github.com/albanD	2025-02-12 03:14:08 +00:00
James Wu	28a2ab6b84	Clear CompiledTritonKernel cache after each inductor compile (#146925 ) Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created by the triton kernel should only be used once per compile. Otherwise, a long-running process that does something like:
```
compiled_1 = torch.compile(mode="max-autotune", fullgraph=True)(fn)
# run compiled_1
out_compiled = compiled_1
compiled_2 = torch.compile(mode="max-autotune", fullgraph=True)(fn2)
```
where fn and fn2 are very similar (i.e. would generate the same triton kernel source code) would result in us using the launcher for the first autotuning run, setting the launcher to None after running, and then using the same future/kernel again without regenerating the launcher. Found this bug testing internal inference models. This does not remove @eellison's caching support for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408 Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925 Approved by: https://github.com/laithsakka, https://github.com/jansel ghstack dependencies: #146417	2025-02-12 02:38:42 +00:00
Nikita Shulga	0acbf8039a	[BE] Unskip some tensor creation tests on Mac (#146952 ) Followup after https://github.com/pytorch/pytorch/pull/145367 One should never use skip, but rather xfail; otherwise one never knows when a test is finally fixed. `test_float_to_int_conversion_finite` was fixed on macOS a while back (I guess since the time Intel builds were disabled), while `test_float_to_int_conversion_nonfinite` is fixed by https://github.com/pytorch/pytorch/pull/145367 , which selects architecture-appropriate reference values for the Arm ISA. Note that the result of a floating-point to integral type cast is undefined if the floating-point value is outside of the integral type's dynamic range. "Fixes" https://github.com/pytorch/pytorch/issues/38752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146952 Approved by: https://github.com/atalman, https://github.com/seemethere	2025-02-12 01:59:15 +00:00
Camyll Harajli	78ebd3c502	Revert commit that removed windows testing in VS2019-> update (#146920 ) This reverts commit b57b38b52ede2af27d4eb1bf6ba63868a3ee7553. This commit removed windows testing for the VS build and needs to be added back in with the updated VS2022 build Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146920 Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet	2025-02-12 01:12:05 +00:00
Nikita Shulga	df5e232563	[BE] Delete NCCL slimming (#146943 ) It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant, as nccl is now a dynamic dependency of libtorch_cuda.so. Besides, it does not work with the CXX11 ABI anyway, and creates problems with newer versions of NCCL, when two `collectives.o` are packaged into the library archive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-02-12 00:35:55 +00:00
Eddie Yan	a58f421f4b	[CUDA][CUDNN][SDPA] Pass dropout seed and offset to cuDNN in `int64` (#146734 ) Workaround for limitation in cuDNN that does not accept dropout seed/offset in `int32` for SM 10.0 kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146734 Approved by: https://github.com/Skylion007	2025-02-12 00:24:38 +00:00
Dan Zimmerman	281249ba54	[torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324 ) Summary: amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from only python that's fine, but when you try to interact with `libamd_smi.so` from native code too this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK. This means you'll end up with 2 copies of `libamd_smi.so` in your process, and potentially (Murphy's law says you will, as does our CI) violate ODR. In order to avoid this issue from the PT side of the world we can hook the `dlopen("path/to/bundled/libamd_smi.so")` and try to use the already loaded/SDK version of `libamd_smi.so` first, before proceeding to use the `path/to/bundled/libamd_smi.so`. Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded Differential Revision: D69064038 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324 Approved by: https://github.com/malfet	2025-02-12 00:01:02 +00:00
Huy Do	7aa629f126	Update octokit/request-action to 2.4.0 (#146940 ) The current version 2.1.0 has disappeared since yesterday: * https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml * https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml The latest version is 2.4.0 https://github.com/octokit/request-action Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940 Approved by: https://github.com/izaitsevfb	2025-02-11 23:50:24 +00:00
ankurneog	f50d359ce2	[ c10d ] modify API to get device string from device with torch.device (#146290 ) Modify the ```get_default_backend_for_device()``` API to extract the device string using ```torch.device()``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290 Approved by: https://github.com/guangyey, https://github.com/H-Huang	2025-02-11 23:30:57 +00:00
Thomas Bohnstingl	3a29992ee6	[associative_scan] Lifted arguments (#140043 ) This PR implements lifted arguments for associative_scan Pull Request resolved: https://github.com/pytorch/pytorch/pull/140043 Approved by: https://github.com/ydwu4	2025-02-11 23:25:55 +00:00
Robert Hardwick	f59a56e56f	[ARM] Fix `test_float_to_int_conversion_nonfinite` (#145367 ) We have broken tests on Aarch64 which are not enabled upstream, this PR will fix and enable those tests. ``` AssertionError: Tensor-likes are not equal! Mismatched elements: 2 / 3 (66.7%) Greatest absolute difference: 1 at index (1,) Greatest relative difference: 1.0842021724855044e-19 at index (1,) To execute this test, run the following from the base repo dir: python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_float_to_int_conversion_nonfinite_cpu_int64 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145367 Approved by: https://github.com/malfet	2025-02-11 22:22:10 +00:00
wz337	a20055288f	[DTensor][Test] Create a simple unit test for tensordot (#146514 ) Fixes #ISSUE_NUMBER The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514 Approved by: https://github.com/tianyu-l, https://github.com/XilunWu	2025-02-11 21:57:56 +00:00
PyTorch MergeBot	443437648a	Revert "Introduce new template heuristic for triton autotune configs (#144985 )" This reverts commit 69301fb10eb3f7fd49af5c681a2e386af115baba. Reverted https://github.com/pytorch/pytorch/pull/144985 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it needs a small tweak to avoid breaking some internal code ([comment](https://github.com/pytorch/pytorch/pull/144985#issuecomment-2652021045))	2025-02-11 20:42:41 +00:00
Xu Han	b1ff90ae8a	remove Windows XPU build workaround. (#144644 ) From the RFC: https://github.com/pytorch/pytorch/issues/141946 Fixes https://github.com/pytorch/pytorch/issues/134989 After we land these fixing PRs: 1. https://github.com/pytorch/pytorch/pull/142245 2. https://github.com/pytorch/pytorch/pull/141943 We can remove the Windows XPU workaround. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644 Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman	2025-02-11 20:39:51 +00:00
Zhengxu Chen	664550ecbf	[export] Serialize special values of float into strings for json. (#146490 ) Summary: Currently inf is serialized as Infinity in JSON which is not standard compliant. Instead we will tweak all special floating points into strings and handle them at json layer. Test Plan: see D69060784 CI Differential Revision: D69186425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490 Approved by: https://github.com/yiming0416	2025-02-11 20:01:27 +00:00
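The idea in the commit above (#146490) is to turn special float values into strings before they reach the JSON layer. A minimal sketch, using only the standard library, is shown here. The function names and the chosen string spellings (`"Infinity"`, `"-Infinity"`, `"NaN"`) are assumptions made for this example, not the format used by the PR.

```python
import json
import math

# Assumed string spellings for the special values (illustrative only).
_SPECIAL = {"Infinity": math.inf, "-Infinity": -math.inf, "NaN": math.nan}


def encode_float(value):
    """Return a JSON-safe representation of a float."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_float(value):
    """Invert encode_float for a field that is known to hold a float."""
    if isinstance(value, str):
        return _SPECIAL[value]
    return float(value)


values = [1.5, math.inf, -math.inf, math.nan]

# allow_nan=False makes json.dumps raise instead of emitting the
# non-standard bare tokens Infinity / NaN.
payload = json.dumps({"values": [encode_float(v) for v in values]}, allow_nan=False)
decoded = [decode_float(v) for v in json.loads(payload)["values"]]
print(payload)
```

Because the decoder relies on knowing that the field is float-typed, a schema-driven serializer can tell a special float apart from an ordinary string field that happens to contain the same text.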
Shunting Zhang	110638f702	[inductor] skip _test_insignificant_strides on rocm (#146849 ) Check https://github.com/pytorch/pytorch/issues/146848 , the rocm kernel for _scaled_dot_product_attention does not match the meta kernel regarding output shape. cuda kernel is fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146849 Approved by: https://github.com/eellison, https://github.com/atalman, https://github.com/jansel ghstack dependencies: #145904	2025-02-11 19:55:43 +00:00
Ding, Yi	b18e3c01aa	[Inductor] Unifiy Low Precision FP Legalization for to_dtype_bitcast & constant (#144646 ) The upcast in `to_dtype_bitcast()` breaks subsequent operations that only work with the target type (I use `bitwise_and` in the updated UT). ![image](https://github.com/user-attachments/assets/77a6f3b6-b5e7-4ed8-ab65-09d76f077376) This PR fixes this problem. Let's check the CI results to make sure it doesn't bring accuracy problems. - Unified the type promotion of low-precision FP operations in the legalize func, grouping ops into sources (whose results may be promoted) and sinks (whose input may be cast back). (The terms _sink_ and _source_ are from [graph theory](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree).) ## Test
```bash
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float32_to_int32_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144646 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-02-11 19:45:04 +00:00
drisspg	af349047c3	[FlexAttention] Bug fix broken flag (#146872 ) # Summary I somehow broke this... I think claude was trippin Pull Request resolved: https://github.com/pytorch/pytorch/pull/146872 Approved by: https://github.com/BoyuanFeng	2025-02-11 19:42:37 +00:00
Tugsbayasgalan Manlaibaatar	ebd992724f	Implement serializable getattr support for tensor subclasses (#145772 ) builtins.getattr is not serializable, so we replace it with a custom op that has more refined schema. Differential Revision: [D68899421](https://our.internmc.facebook.com/intern/diff/D68899421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145772 Approved by: https://github.com/bdhirsh	2025-02-11 19:05:14 +00:00
Andrey Talman	d5d3bdb55a	Fix var CUDA_PATH_V128 in cuda128.bat file (#146906 ) Followup after: https://github.com/pytorch/pytorch/pull/146653 This should fix upcoming CUDA 12.8 windows builds. Issue found during pytorch-canary Windows AMI test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146906 Approved by: https://github.com/malfet, https://github.com/tinglvv	2025-02-11 18:43:55 +00:00
Daniel Galvez	c7515da7b0	Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979 ) This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361 I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534 Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979 Approved by: https://github.com/eqy, https://github.com/eellison	2025-02-11 18:16:15 +00:00
Nikita Shulga	e3839bd603	[BE] Strip `#pragma once` when embedding the headers (#146871 ) This eliminates compiler warning, for example when compiling Metal shader with embedded headers ``` with program_source:6:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:81:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:588:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:719:9: warning: #pragma once in main file [-Wpragma-once-outside-header] #pragma once ^ program_source:829:29: error: use of undeclared identifier 'r0_2' auto tmp8 = in_ptr2[r0_2 + 768*x0]; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146871 Approved by: https://github.com/dcci	2025-02-11 16:49:00 +00:00
Mikayla Gawarecki	861bf892fb	Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748 Approved by: https://github.com/atalman	2025-02-11 15:49:01 +00:00
rzou	5235a18cd6	[SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876 Approved by: https://github.com/anijain2305 ghstack dependencies: #146854	2025-02-11 15:00:56 +00:00
Zhou Fang	fc5913b6bf	[StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728 ) (#146855 ) Summary: When a Static Runtime graph node has sub-blocks, the memory planner does not treat the sub-blocks' inputs as inputs of that node. As a result, the lifetimes of those inputs are computed incorrectly, and the corresponding tensor memory is released earlier than required, which causes errors. Differential Revision: D69195886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855 Approved by: https://github.com/swolchok	2025-02-11 13:59:54 +00:00
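The lifetime problem described in the Static Runtime commit above can be illustrated with a toy model. This is a sketch under assumptions: the `Node` class, the `last_use` function, and the example graph are all hypothetical and are not Static Runtime's data structures. It only shows why ignoring sub-block inputs makes a value look dead too early.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node: named inputs plus optional sub-blocks (lists of nodes)."""
    name: str
    inputs: list
    blocks: list = field(default_factory=list)


def all_inputs(node):
    """Inputs of a node plus, recursively, inputs of nodes in its sub-blocks."""
    collected = list(node.inputs)
    for block in node.blocks:
        for inner in block:
            collected.extend(all_inputs(inner))
    return collected


def last_use(nodes, include_subblocks=True):
    """Map each value name to the index of the last top-level node that uses it."""
    last = {}
    for index, node in enumerate(nodes):
        used = all_inputs(node) if include_subblocks else node.inputs
        for value in used:
            last[value] = index
    return last


graph = [
    Node("make_a", inputs=["x"]),
    Node("use_a", inputs=["a"]),
    # "a" is read again, but only inside the sub-block of this node.
    Node("cond", inputs=["flag"], blocks=[[Node("inner_add", inputs=["a", "y"])]]),
]

naive = last_use(graph, include_subblocks=False)
fixed = last_use(graph, include_subblocks=True)
print(naive["a"], fixed["a"])
```

With sub-blocks ignored, the value `a` appears to be last used by node 1, so a planner could free or reuse its memory before node 2's sub-block reads it. Counting sub-block inputs extends the lifetime to node 2.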
cyy	15635b14ce	[4/N] Remove unnecessary once flag usage (#146783 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783 Approved by: https://github.com/albanD	2025-02-11 13:55:06 +00:00
Jack Taylor	69301fb10e	Introduce new template heuristic for triton autotune configs (#144985 ) Initial PR to refactor bulkiness of mm_common to allow for better device-specific specialisation e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm specific optimisations in. This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs: - CPUConfigHeuristic() - CUDAConfigHeuristic() - ROCmConfigHeuristic() - XPUConfigHeuristic() These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs. The mm_common, mm_plus_mm and conv configurations are implemented in this class, in the future we plan to bring in flex attention configurations also so all of the tuning config logic for templated triton kernels are handled in this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985 Approved by: https://github.com/jansel	2025-02-11 10:48:09 +00:00
Yanbo Liang	229fb0bc83	[Dynamo][autograd.Function] Relax backward speculation strict mode: support .requires_grad (#146742 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146742 Approved by: https://github.com/zou3519 ghstack dependencies: #146571, #146741	2025-02-11 05:39:07 +00:00
Yanbo Liang	f2da810516	[Dynamo][autograd.Function] Relax backward speculation strict mode: support .data (#146741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146741 Approved by: https://github.com/zou3519 ghstack dependencies: #146571	2025-02-11 05:39:07 +00:00
Yanbo Liang	29523aa113	[Dynamo][autograd.Function] Relax backward speculation strict mode a bit (#146571 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146571 Approved by: https://github.com/zou3519	2025-02-11 05:39:00 +00:00
rzou	a7fe384d0e	Remove torch._higher_order_ops from MOD_SKIPLIST (#146853 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853 Approved by: https://github.com/williamwen42	2025-02-11 04:38:26 +00:00
Hyunho Yeo	001ebbf734	[MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751 ) Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats". Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats ``` https://www.internalfb.com/intern/testinfra/testrun/9007199321947161 Reviewed By: yuhc Differential Revision: D68989900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751 Approved by: https://github.com/nautsimon	2025-02-11 03:51:48 +00:00
James Wu	23524699d5	Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417 ) ### Big idea This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism. ### Implementation Overview In total, the diff does the following: - Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes - Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers - Extend @eellison's future cache to a class, mostly as a refactor - Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start. In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in memory cache on warm start with the existing triton kernels, avoiding calling triton altogether on warm starts. Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen. Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417 Approved by: https://github.com/jansel	2025-02-11 03:46:16 +00:00
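The "kick off early, read from an in-memory future cache later" pattern described in the commit above (#146417) can be sketched with the standard library. Everything here is an illustration under assumptions: `FutureCache`, `kick_off`, and `fake_compile` are hypothetical names, and a thread pool stands in for Inductor's worker processes so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


class FutureCache:
    """Cache of in-flight or finished compile jobs, keyed by kernel source."""

    def __init__(self, compile_fn, max_workers=2):
        self._compile_fn = compile_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def kick_off(self, source):
        """Start compiling `source` unless a job for it already exists."""
        future = self._futures.get(source)
        if future is None:
            future = self._pool.submit(self._compile_fn, source)
            self._futures[source] = future
        return future

    def result(self, source):
        """Block until the compiled artifact for `source` is available."""
        return self.kick_off(source).result()

    def clear(self):
        """Forget all futures, e.g. at the end of one compile."""
        self._futures.clear()

    def shutdown(self):
        self._pool.shutdown()


compile_calls = []


def fake_compile(source):
    # Stand-in for an expensive kernel compilation.
    compile_calls.append(source)
    return f"compiled<{source}>"


cache = FutureCache(fake_compile)
cache.kick_off("kernel_a")         # during codegen: start work early
first = cache.result("kernel_a")   # when loading generated code: cache hit
second = cache.result("kernel_a")  # still the same future, no recompilation
cache.clear()                      # new compile: futures are not reused
third = cache.result("kernel_a")   # compiled again
cache.shutdown()
print(len(compile_calls))
```

The `clear()` step mirrors the follow-up fix in #146925 listed earlier in this log, which clears the cache after each inductor compile so that a future is only used once per compile.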
PyTorch MergeBot	fe94ece375	Revert "Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791 )" This reverts commit 3d604b17d91b928c850ded83b2ec25ea066bb3f6. Reverted https://github.com/pytorch/pytorch/pull/141791 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141791#issuecomment-2649717140))	2025-02-11 03:17:59 +00:00
Ke Wen	30cbf13544	[PGNCCL] Associate tensor allocation support with NCCL version (#146842 ) This is a forward fix to #146589. For NCCL version lower than 2.19, previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`. This PR gates the support by checking link-time NCCL version via `ncclGetVersion`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842 Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #146589	2025-02-11 02:52:52 +00:00
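The version gate described in the commit above (#146842) amounts to comparing the NCCL version against 2.19. A small sketch of that comparison is shown here. It assumes the integer version code uses the encoding `major * 10000 + minor * 100 + patch`, which is an assumption stated for this example; the function names are hypothetical and this is not the code from the PR.

```python
def decode_nccl_version(code):
    """Split an integer version code into (major, minor, patch).

    Assumes the encoding major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Threshold taken from the commit message: versions lower than 2.19
# do not support the NCCL memory allocator.
MIN_ALLOCATOR_VERSION = (2, 19, 0)


def supports_mem_allocator(code):
    """Return True if the decoded version is at least 2.19."""
    return decode_nccl_version(code) >= MIN_ALLOCATOR_VERSION


for code in (21805, 21903, 22203):
    print(code, decode_nccl_version(code), supports_mem_allocator(code))
```

Comparing tuples keeps the check readable and avoids mistakes that come from comparing raw integer codes across different encodings.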
rzou	1d81ecfc54	Rename PrimHOPBase to BaseHOP + minor changes (#146727 ) This PR: - renames PrimHOPBase to BaseHOP - changes the backward pass to always return a tuple (to match the forward pass). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146727 Approved by: https://github.com/ydwu4	2025-02-11 02:43:37 +00:00
rzou	275c034b16	[SkipFiles] remove some stuff from MOD_SKIPLIST (#146854 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854 Approved by: https://github.com/yanboliang, https://github.com/anijain2305	2025-02-11 01:34:46 +00:00
zeshengzong	5205158c1b	Replace is_same with is_same_v for concise syntax (#145450 ) Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450 Approved by: https://github.com/Skylion007	2025-02-11 01:34:15 +00:00
PyTorch MergeBot	f38f1dcd82	Revert "move and fix logic to update unbacked bindings (#146115 )" This reverts commit 103c8b44bcb6fbf30b5411c5af19d312427525e7. Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/huydhn due to This change has been reverted internally D69129334 but the OSS revert failed https://github.com/pytorch/pytorch/pull/146437 ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2649610877))	2025-02-11 01:26:36 +00:00
Yuanhao Ji	0c9fdd6cfb	[Docs] Fix description of `input` in `torch.addbmm()` (#146664 ) Fixes #146613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146664 Approved by: https://github.com/mikaylagawarecki	2025-02-11 01:22:09 +00:00
PyTorch MergeBot	2fafcd37c3	Revert "cpp_wrapper: Precompile device-specific header files (#144002 )" This reverts commit de6efa1feb0e8c9073640a77afdec1a53a477aed. Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))	2025-02-11 00:42:22 +00:00
Isalia20	d763093b49	[MPS] fix lu factor for large tensors with bs>1 (#146753 ) Try this:
```python
import torch

batch_size = 2
A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps")
A_cpu = A.cpu()
LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu)
LU, pivots = torch.linalg.lu_factor(A)
torch.testing.assert_close(LU.cpu(), LU_cpu)
```
You'll get a huge difference in the LU tensors <img width="706" alt="Screenshot 2025-02-08 at 12 14 39" src="https://github.com/user-attachments/assets/b45f2b3c-e0a5-49c8-aa07-42792150b781" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-11 00:37:07 +00:00
Anant Gulati	937b41e3b5	Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472 ) In this series of PRs we intend to refactor the pipeline parallelism test cases so that they are completely device agnostic. The changes use the following approaches:
- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies

This should improve usability for all devices. For this PR we have shown support for the following devices:
- CPU (wherever applicable)
- CUDA
- HPU
- XPU

To add another device, new users can simply append their device to the device list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146472 Approved by: https://github.com/H-Huang	2025-02-11 00:13:23 +00:00
amdfaa	b6273d7f4b	[ROCm] Update periodic.yml to use 2GPU runners (#146839 ) Temporary fix for rocm workflow. The 4-GPU runners are all taken offline due to (network timeout issue), and so we aren't able to run any periodic jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146839 Approved by: https://github.com/jeffdaily	2025-02-10 23:41:11 +00:00
CK Luk	aa1622c0b6	Support ignoring parameters in FSDP2 (#146631 ) Differential Revision: D69153051 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146631 Approved by: https://github.com/awgu	2025-02-10 23:20:28 +00:00
Jason Ansel	c2bf3be011	[inductor] Remove _get_grid_fn_str (#146800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146800 Approved by: https://github.com/yanboliang	2025-02-10 23:14:30 +00:00
Henry Tsang	0d5fb0941f	[cutlass backend] check against arch >= 100 (#145812 ) Summary: Want to add a guard against silent fallback to SM90. GenerateSM100 was just added 3 days ago. https://github.com/NVIDIA/cutlass/blame/main/python/cutlass_library/generator.py#L8896 It should show up in CUTLASS 3.8 (not pinned yet). Test Plan: ci Differential Revision: D68748705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145812 Approved by: https://github.com/chenyang78, https://github.com/ColinPeppler, https://github.com/Aidyn-A	2025-02-10 22:41:08 +00:00
Gabriel Ferns	bab35eb26a	fix intermediate debug information with cpp_wrapper (#145527 ) Summary: before fix, code like: ```cpp aoti_torch_print_tensor_handle(buf0, "after_launch - triton_poi_fused_randn_0 - buf0"); aoti_torch_print_tensor_handle(buf1, "after_launch - triton_poi_fused_randn_0 - buf1"); printf("[ after_launch - triton_poi_fused_randn_0 - 0: %ld ]", 0); printf(" "); printf("[ after_launch - triton_poi_fused_randn_0 - 1228800L: %ld ]", 1228800L); printf(" "); ``` was generated, which is a syntax error. Test Plan: New unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145527 Approved by: https://github.com/desertfire	2025-02-10 22:24:26 +00:00
Huy Do	681894546b	Fix bazel job after #144489 (#146840 ) This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610 ### Testing Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840 Approved by: https://github.com/atalman	2025-02-10 22:17:36 +00:00
Daniel Vega-Myhre	652880e840	Fix logging and test files which misspell "precision" (#146113 ) Noticed this while working on something, decided to submit a quick fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146113 Approved by: https://github.com/drisspg	2025-02-10 21:54:16 +00:00
Nikhil Gupta	e65b89e4cd	[Feat]: Improve KleidiAI 4 bit kernel performance (#146476 ) Description: 1. New thread blocking accelerates GEMVs 2. We increase throughput of the lhs quant pack + matmul pipeline by decoupling the two operations. 3. The new blocking strategy blocks ```out_feature``` to accelerate GEMVs Perf improvements: 12% speedup in LLM prefill phase and up to 16% speedup in autoregressive phase Perf Benchmarking : https://github.com/pytorch/pytorch/issues/143289#issuecomment-2545773370 Change-Id: Ie574ff8459fdb75701ae366158b4e118c70694e4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146476 Approved by: https://github.com/malfet	2025-02-10 21:30:57 +00:00
Wouter Devriendt	4d626c261b	Fix workarea compute in lapackSyevd (#146456 ) Work-query APIs return floating point values that could lose precision when converted back to int. Solve this by using `nextafter` and `ceil`. Add regression test. Fixes #145801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146456 Approved by: https://github.com/malfet	2025-02-10 21:29:48 +00:00
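The precision problem described in this commit can be illustrated with a minimal sketch. It is not the C++ code from the PR; it only assumes that a workspace query reports its size as a single-precision float. The helper names `to_float32` and `next_float32_up` are hypothetical, and `struct` is used to emulate float32 rounding in plain Python.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32_up(x: float) -> float:
    """Return the smallest float32 strictly greater than a positive finite float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


# Hypothetical required workspace size that float32 cannot represent exactly.
true_size = 2**24 + 1

# What a single-precision work query would report after rounding.
reported = to_float32(float(true_size))

# Naive conversion truncates to a size that is too small.
naive_size = int(reported)

# Stepping to the next representable value and rounding up never under-allocates.
safe_size = math.ceil(next_float32_up(reported))

print(true_size, naive_size, safe_size)
```

The safe variant may over-allocate by a few elements, which is harmless for a workspace buffer, whereas under-allocating is not.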
Yidi Wu	8f073065d5	[while_loop][inductor] support sym expression as cond_fn output (#146222 ) As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expr to be returned from cond_fn. AOTI-generated output code looks like: ``` V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] bool buf7_cond_result; .... (while_loop_cond_graph_0_arg2_1_handle); V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] buf7_cond_result = u0 + u1 < 10L; V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] if (!buf7_cond_result) break; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222 Approved by: https://github.com/desertfire	2025-02-10 21:25:40 +00:00
Yidi Wu	97d4753bd3	[hop][inductor] don't promote arg type for cond and while_loop (#146660 ) HOP subgraph codegen assumes that argument types are not promoted; otherwise, we might generate the wrong kernel. Differential Revision: [D69279031](https://our.internmc.facebook.com/intern/diff/D69279031) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146660 Approved by: https://github.com/zou3519, https://github.com/eellison	2025-02-10 21:24:52 +00:00
zeshengzong	da216baaa2	Optimize inductor `Self` typing (#146669 ) Replace method return type with `Self` typing Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669 Approved by: https://github.com/jansel	2025-02-10 20:39:56 +00:00
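As a rough illustration of what a `Self` return type buys, here is a minimal sketch with hypothetical classes `Node` and `BufferNode` (they are not taken from inductor). It assumes `typing_extensions` is available to the type checker; the import is guarded so that the code runs without it.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed by the type checker; annotations are not evaluated at runtime.
    from typing_extensions import Self


class Node:
    """Hypothetical base class with a chainable setter."""

    def __init__(self) -> None:
        self.name = ""

    def with_name(self, name: str) -> Self:
        # `Self` tells the checker that a subclass call returns the subclass type.
        self.name = name
        return self


class BufferNode(Node):
    """Hypothetical subclass adding its own chainable setter."""

    def with_size(self, size: int) -> Self:
        self.size = size
        return self


# With a `-> "Node"` annotation on with_name, a type checker would reject .with_size().
node = BufferNode().with_name("buf0").with_size(8)
```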
angelayi	86b52f4209	Fix lint (#146846 ) [Fixes #ISSUE_NUMBER ](https://github.com/pytorch/pytorch/actions/runs/13248382636/job/36980294598) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146846 Approved by: https://github.com/huydhn, https://github.com/clee2000	2025-02-10 20:00:29 +00:00
Gregory Comer	3d604b17d9	Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791 ) As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table. In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option. Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and un-decomposite it, but this is no longer necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791 Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai	2025-02-10 19:30:19 +00:00
Yifu Wang	97f6480cf5	Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467 ) Fixes https://github.com/pytorch/pytorch/issues/146416 Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation manifest as silent correctness issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467 Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314	2025-02-10 19:15:49 +00:00
angelayi	3822a88d21	[symbolic shapes] Log symnode id (#146583 ) We want to log the symnode id which will help us with provenance tracking between expressions created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146583 Approved by: https://github.com/bobrenjc93	2025-02-10 19:13:06 +00:00
Camyll Harajli	b45e6fa707	Cleanup VS 2019 refs in pytorch (#145863 ) Related to: https://github.com/pytorch/pytorch/issues/128835 Follow up on PR: https://github.com/pytorch/pytorch/pull/145319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145863 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn, https://github.com/atalman	2025-02-10 19:05:35 +00:00
Zhengxu Chen	c02a1ecc1d	[export][ez] Allow math.trunc for serialization. (#146715 ) Summary: as title. Test Plan: CI Differential Revision: D69317084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146715 Approved by: https://github.com/angelayi	2025-02-10 19:05:07 +00:00
angelayi	9b7d050600	Move capture_provenance to make_node_impl (#146625 ) Previously we were only logging `make_user_impl` implementations, which only gets triggered for operations done on python SymInts, not cpp SymInts. Instead `make_node_impl` will get triggered for both python and cpp SymInt operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146625 Approved by: https://github.com/bobrenjc93	2025-02-10 19:00:51 +00:00
Zhengxu Chen	0486a996d2	[sigmoid] Implement a OSS only model runner. (#146440 ) Summary: Implement an OSS version of modelrunner with clean dependencies. The new OSS model runner only removes thrift and only uses the JSON header to load the model. Test Plan: Test will be added in the next diff separately. (D69060784) Differential Revision: D68846877 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146440 Approved by: https://github.com/SherlockNoMad	2025-02-10 18:54:05 +00:00
Ting Lu	519f547d05	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-02-10 18:34:59 +00:00
Henry Tsang	ad847da0cf	[cutlass backend] fix bug for accuminator dtype (#146356 ) Will add unit tests for accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356 Approved by: https://github.com/Chillee	2025-02-10 18:20:58 +00:00
Henry Tsang	ddcc97bb8c	Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668 ) I think it's good to have everything in the .cu file, especially the nvcc compile command. Technically, the configuration name can be found in the template already, so let me know if you think it's not needed. Differential Revision: D69281295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668 Approved by: https://github.com/chenyang78	2025-02-10 18:16:44 +00:00
Henry Tsang	6b3f51f870	use None to slice when list has one element only (#146638 ) When autotune_num_choices_displayed is None and the list of choices has length 1, slicing with `[:-1]` means getting all elements except the last one, which resulted in an empty list. Slicing with `[:None]` works. Differential Revision: D69265168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146638 Approved by: https://github.com/drisspg	2025-02-10 18:15:45 +00:00
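The slicing behaviour described in this commit can be reproduced in plain Python. The sketch below uses a hypothetical helper, `displayed_choices`, and is not the inductor code; it only contrasts substituting `-1` for a missing limit with passing `None` straight through to the slice.

```python
def displayed_choices(choices, num_displayed=None):
    """Return (buggy, fixed) slices of `choices` for a display limit that may be None."""
    # Buggy variant: substituting -1 for "no limit" drops the last element.
    buggy = choices[: -1 if num_displayed is None else num_displayed]
    # Fixed variant: a slice bound of None means "up to the end".
    fixed = choices[:num_displayed]
    return buggy, fixed


single_buggy, single_fixed = displayed_choices(["only_choice"])
limited_buggy, limited_fixed = displayed_choices(["a", "b", "c"], num_displayed=2)
print(single_buggy, single_fixed)
```

With a single choice and no limit, the buggy slice is empty while the fixed slice keeps the one element; with an explicit limit both variants agree.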
Rachel Guo	374b762bbf	[ez][BE] get rid of the extra printf('\n') (#146726 ) Summary: as title Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda ``` Differential Revision: D69328701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146726 Approved by: https://github.com/ColinPeppler	2025-02-10 17:45:55 +00:00
blorange-amd	5fd15a04b7	[ROCm] Enable inductor-periodic testing for MI300 (#144594 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594 Approved by: https://github.com/malfet, https://github.com/huydhn Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-02-10 17:42:09 +00:00
PyTorch MergeBot	b8261358ca	Revert "windows Magma build for cu128 (#146653 )" This reverts commit d0e70c4fd33d9accca2c66203c19372733a83ea1. Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/jeanschmidt due to Seems to have broken some windows tests, reverting to see if it gets green ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2648769150))	2025-02-10 17:36:32 +00:00
Animesh Jain	cbbb11d967	[dynamo][user-defined] Unify standard and non-standard __new__ codebase (#146737 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146737 Approved by: https://github.com/jansel ghstack dependencies: #146677	2025-02-10 17:31:13 +00:00
Animesh Jain	ee8a06f1f6	[dynamo][user-defined] User class.__new__ instead of special casing (#146677 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146677 Approved by: https://github.com/jansel	2025-02-10 17:31:13 +00:00
Benjamin Glass	de6efa1feb	cpp_wrapper: Precompile device-specific header files (#144002 ) This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone. Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002 Approved by: https://github.com/desertfire	2025-02-10 17:13:09 +00:00
soulitzer	3cadce7af2	[NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633 Approved by: https://github.com/jbschlosser	2025-02-10 16:59:48 +00:00
Davide Italiano	dfe3b64282	[mps] Implement eager support for spherical_bessel_j0 (#146818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-10 16:58:05 +00:00
Hyunho Yeo	5f621c5879	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710 ) Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`. Test Plan: The test is implemented in the following diff. Reviewed By: yuhc Differential Revision: D68988673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710 Approved by: https://github.com/nautsimon	2025-02-10 16:57:09 +00:00
Brian Hirsh	68c9e22ef7	FSDP: avoid resetting version counter of all_gather_output in inference_mode (#146709 ) Summary: FSDP needs to hide VC bumps on its allgather buffer, but it does not need to do this if the allgather buffer was generated under inference mode. More details here: https://www.internalfb.com/diff/D69115649?dst_version_fbid=1316814572779281&transaction_fbid=849120230625711 Test Plan: CI Differential Revision: D69311496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146709 Approved by: https://github.com/awgu	2025-02-10 16:56:40 +00:00
PyTorch MergeBot	6aa924af68	Revert "[ONNX] Create deprecation warning on dynamo_export (#146425 )" This reverts commit 41e6d189a39a40b237ab9b9ab195cec1194b331b. Reverted https://github.com/pytorch/pytorch/pull/146425 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/146425#issuecomment-2648472579))	2025-02-10 15:54:34 +00:00
PyTorch MergeBot	1557b7bf9a	Revert "[ONNX] Adjust and add deprecation messages (#146639 )" This reverts commit 63c2909ae3e293dee96bca5af88bc51d8ca0ce10. Reverted https://github.com/pytorch/pytorch/pull/146639 on behalf of https://github.com/atalman due to Sorry Need to revert https://github.com/pytorch/pytorch/pull/146425 ([comment](https://github.com/pytorch/pytorch/pull/146639#issuecomment-2648465047))	2025-02-10 15:51:52 +00:00
eellison	a36c22f2ed	futher scheduler changes for invoke_quant: prologue low prec, (slightly) more aggressive fusion (#145104 ) Respect invoke_quant low-precision options and be more aggressive in attempting fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145104 Approved by: https://github.com/shunting314, https://github.com/jansel ghstack dependencies: #139102	2025-02-10 15:50:19 +00:00
Guilherme Leobas	899066eedf	Fix round(...) with constants (#146495 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146495 Approved by: https://github.com/anijain2305	2025-02-10 15:08:09 +00:00
Nikita Shulga	611ca163fd	[MPS] Add bilineard2d_aa implementation (#145526 ) An interesting quirk of the algorithm, which is not very well documented, is that the value of align_corners is ignored in antialias mode, see arguments of `e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)` Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), as it's not implemented on any other accelerator devices at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526 Approved by: https://github.com/dcci	2025-02-10 15:03:14 +00:00
Ting Lu	d0e70c4fd3	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-02-10 13:48:55 +00:00
Tom Ritchford	6f15a609d3	Test typing of arithmetic operators on Tensor (see #145838 ) (#146426 ) See #145838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146426 Approved by: https://github.com/Skylion007	2025-02-10 12:19:56 +00:00
Jack Taylor	c24038025d	[ROCm] Unskip std:bad_alloc failures (#146407 ) Flakey MI300 issue related to memory usage should now be resolved after https://github.com/pytorch/pytorch/actions/runs/13007160888?pr=145829. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146407 Approved by: https://github.com/jeffdaily	2025-02-10 11:01:56 +00:00
yousoumar	c88ae00692	fix: replace stderr with stdout for download messages in hub.py (#146475 ) This PR addresses an issue where download logs in `hub.py` are sent to `stderr` instead of `stdout`. Hence, when running models with workers, these messages are incorrectly categorized as errors, leading to confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146475 Approved by: https://github.com/mikaylagawarecki	2025-02-10 10:46:10 +00:00
gasoonjia	6667e5d786	[dim order] solve broken doc (#146641 ) Differential Revision: [D69265340](https://our.internmc.facebook.com/intern/diff/D69265340/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146641 Approved by: https://github.com/svekars, https://github.com/Jack-Khuu	2025-02-10 07:51:26 +00:00
Xilun Wu	c4d835fbab	[DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278 ) Fixes #142058 ## Summary The DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model. The ATen convolution_backward op usually returns 3 Tensors (grad_input, grad_weight, grad_bias), and `grad_input` is actually an Optional[Tensor] which can be `None` in the case mentioned above. However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists. ## Fix Allow `grad_input` to be `None` for the `convolution_backward` op. ## Test `pytest test/distributed/tensor/test_convolution_ops.py` ## Follow-up The current implementation of the DTensor conv op also ignores `output_mask`, and this may need further care. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278 Approved by: https://github.com/bdhirsh	2025-02-10 07:06:40 +00:00
Ke Wen	effc545274	[DDP] Use NCCL allocated memory for gradient bucket (#146589 ) So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications. Less SM usage, less memory contention between NCCL kernel and compute kernels. Added env `DDP_DISABLE_COMM_MEM` as a back-out option: ``` An environment variable to disable comm-optimized memory pool. Default is 0, which means comm-optimized memory pool is enabled. Users can set it to 1 in case of seeing regression or OOM (because this comm MemPool may not share space with regular compute MemPool). ``` Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589 Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj	2025-02-10 05:23:11 +00:00
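A minimal sketch of using the back-out option named in this commit. It assumes the variable is read when the DDP wrapper is constructed, so it is set beforehand; that timing is an assumption, not something the commit message states.

```python
import os

# Assumption: set before constructing DistributedDataParallel so the setting is seen.
# "1" disables the comm-optimized memory pool; the default "0" leaves it enabled.
os.environ["DDP_DISABLE_COMM_MEM"] = "1"

print(os.environ["DDP_DISABLE_COMM_MEM"])
```

The same effect can be had from the shell by exporting the variable before launching the training script.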
Simon Fan	387c993c3b	[ca] remove private API: _compiled_autograd_should_lift (#146720 ) Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag which tries to inline make_fx style in CA initial pass. There's no more usage internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720 Approved by: https://github.com/zou3519	2025-02-10 04:29:57 +00:00
zeshengzong	e8304f08fe	Fix torch.take_along_dim param type and default description (#146474 ) ## Changes - Change type description to `LongTensor`, consistent with [`torch.take`](https://pytorch.org/docs/stable/generated/torch.take.html) - Add `dim` param default value description ## Test Result Before ![image](https://github.com/user-attachments/assets/720ce158-2bc1-48b5-a188-56fcc7188d96) After ![image](https://github.com/user-attachments/assets/05fe20bd-9476-4b97-ac2b-9b161d6532a1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146474 Approved by: https://github.com/mikaylagawarecki	2025-02-10 01:19:30 +00:00
Simon Fan	298226f358	[dynamo] check for incompatible configs (#146513 ) internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/ Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time. Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513 Approved by: https://github.com/williamwen42 Co-authored-by: Gabriel Ferns <gabeferns@meta.com>	2025-02-10 00:44:23 +00:00
Davide Italiano	2a55311773	[cuda] Simplify the sinc function a bit. (#146774 ) `else` after `return` can be removed & the indentation can be reduced, for readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146774 Approved by: https://github.com/malfet	2025-02-09 20:09:34 +00:00
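The readability change described here, dropping `else` after `return`, can be sketched in Python. This is an illustration of the pattern on a normalized sinc function, not the CUDA source touched by the PR.

```python
import math


def sinc_with_else(x: float) -> float:
    """Normalized sinc written with a redundant else branch."""
    if x == 0.0:
        return 1.0
    else:
        pi_x = math.pi * x
        return math.sin(pi_x) / pi_x


def sinc(x: float) -> float:
    """Same function with the else removed and one less indentation level."""
    if x == 0.0:
        return 1.0
    pi_x = math.pi * x
    return math.sin(pi_x) / pi_x


print(sinc(0.0), sinc(0.5))
```

Both versions compute the same values; the second simply reads as an early return followed by the general case.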
drisspg	b133907d0a	Update strided test to float32 (#146748 ) Fixes #146377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146748 Approved by: https://github.com/BoyuanFeng, https://github.com/leijurv	2025-02-09 17:41:35 +00:00
Davide Italiano	91c4bf39d3	[mps] Add a shader for spherical_bessel_j0. (#146771 ) In preparation for adding the operation to inductor/eager. Adapted from the CUDA version of the shader. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771 Approved by: https://github.com/malfet	2025-02-09 05:11:17 +00:00
Nikita Shulga	0e83e7d56e	[EZ] Add logic to build Metal shader with debug info (#146768 ) By appending `-frecord-sources -gline-tables-only` to the compilation command. Helpful when debugging shaders compiled into libtorch. Test plan: Run `python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm` and then run the following to capture the shader and check that it contains debug info:
```python
import torch
import os

os.environ["MTL_CAPTURE_ENABLED"] = "1"
inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32)
with torch.mps.profiler.metal_capture("bilinear2d"):
    # Pass the tensor created above (the original snippet referenced an undefined `x`).
    out = torch.nn.functional.interpolate(inp, scale_factor=(1.7, 0.9), mode="bilinear")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768 Approved by: https://github.com/dcci	2025-02-08 23:40:23 +00:00
Guilherme Leobas	6a9a02acbe	Set `enable_faithful_generator_behavior` flag to True (#142513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142513 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420, #145223	2025-02-08 22:42:12 +00:00
Guilherme Leobas	580a305681	Raise MutationError if there are side effects when returning generator (#145223 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145223 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420	2025-02-08 22:42:12 +00:00
Guilherme Leobas	68cfd36c11	Add `CLEANUP_THROW` bytecode (#144420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144420 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423, #144424	2025-02-08 22:42:12 +00:00
Guilherme Leobas	53ab82d8f5	Implement `generator.throw(exception)` (#144424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144424 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422, #144423	2025-02-08 22:42:12 +00:00
Guilherme Leobas	8ee095f7c1	Implement `generator.close()` (#144423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144423 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421, #144422	2025-02-08 22:42:12 +00:00
Guilherme Leobas	ca9b16e070	Implement `generator.send(..)` (#144422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144422 Approved by: https://github.com/zou3519 ghstack dependencies: #141055, #144421	2025-02-08 22:42:12 +00:00
Guilherme Leobas	d798831167	Implement `generator.__iter__()` (#144421 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144421 Approved by: https://github.com/zou3519 ghstack dependencies: #141055	2025-02-08 22:42:12 +00:00
Guilherme Leobas	8603a1c870	Suport generators (#141055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055 Approved by: https://github.com/zou3519	2025-02-08 22:42:12 +00:00
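For reference, the generator methods named in the stack above (`__iter__`, `send`, `throw`, `close`) behave as follows in plain Python. This sketch only shows the language semantics that the commits target; it does not exercise Dynamo, and `running_total` is a hypothetical example generator.

```python
events = []


def running_total():
    """Yield a running total; send() adds to it, throw(ValueError) resets it."""
    total = 0
    try:
        while True:
            try:
                value = yield total
            except ValueError:
                # generator.throw(ValueError(...)) lands here and resets the total.
                total = 0
                continue
            if value is not None:
                total += value
    finally:
        # generator.close() raises GeneratorExit at the yield, so cleanup runs here.
        events.append("closed")


gen = running_total()
is_own_iterator = iter(gen) is gen        # __iter__ returns the generator itself
first = next(gen)                         # prime the generator; yields the initial total
after_five = gen.send(5)                  # send() resumes and returns the next yielded value
after_two = gen.send(2)
after_reset = gen.throw(ValueError("reset"))
gen.close()

print(first, after_five, after_two, after_reset, events)
```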
Scott Wolchok	ade8fee512	Use c10 version of half/bfloat16 in executorch (#144111 ) Summary: X-link: https://github.com/pytorch/executorch/pull/7040 Accomplished by importing relevant files from c10 into executorch/runtime/core/portable_type/c10, and then using `using` in the top-level ExecuTorch headers. This approach should keep the ExecuTorch build hermetic for embedded use cases. In the future, we should add a CI job to ensure the c10 files stay identical to the PyTorch ones. ghstack-source-id: 260047850 exported-using-ghexport Test Plan: builds Differential Revision: D66106969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111 Approved by: https://github.com/malfet	2025-02-08 22:40:14 +00:00
eellison	92b7e610ab	[Inductor changes] Invoke Quant (#139102 ) Adds a `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0). The primary motivations are - Unifying scattered reasoning for quant operators throughout the code base - Easy of pattern matching - see this very large pattern match expression [here](`949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)`. Compared to the pattern I have in the tests: ``` @register_graph_pattern( CallFunction( torch.ops.aten.mm, CallFunction( torch.ops.higher_order.invoke_quant, Ignored(), Ignored(), Ignored(), scheme="nf4", ), Arg(), ), pass_dict=test_pass, ) ``` - Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul. Example graph: ``` Python ===== AFTER POST GRAD ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, kwargs, quant_options=self) # type: ignore[call-arg] repeated_subgraph0 = self.repeated_subgraph0 invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4'); repeated_subgraph0 = arg0_1 = arg1_1 = None return (invoke_quant,) class repeated_subgraph0(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(args, *kwargs, quant_options=self) # type: ignore[call-arg] mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = None add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1); mul = arg1_1 = None return add ``` The schema for `invoke_quant` is 
`torch.ops.higher_order.invoke_quant(subgraph, args, scheme=None)` where the scheme will not always be present. I wasn't sure exactly how the inductor-specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta of the node itself. And, following that, I wanted the invocation of the hop to match how it will show up in the graph. So I decided to have it be an object that is then invoked for the tracing. ``` invoke_quant = InvokeQuant(codegen_low_precision=True) invoke_quant(gn, (x, y), scheme="nf4") ``` Todo - not require the packing of args in a tuple, will do following https://github.com/pytorch/pytorch/pull/139162. Feedback welcome. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102 Approved by: https://github.com/Chillee	2025-02-08 19:30:19 +00:00
Blaine Burton Rister	a1bfb39a31	[Inductor] Expand Identity ops prior to block pattern matching (#146000 ) # Feature Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity`'s. This PR adds a few features to achieve this invariance. - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression. - Preprocess the expression with this expansion prior to pattern matching. - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`. # Test plan This PR adds a few new unit tests: - Added a unit test specifically for `expr.expand(identity=True)`. - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops. I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000 Approved by: https://github.com/eellison, https://github.com/jansel	2025-02-08 18:11:53 +00:00
Jason Ansel	eee5622b98	[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297	2025-02-08 18:00:49 +00:00
Jason Ansel	c098385cb3	[inductor] Refactor CaptureIndexing into global scope (#146297 ) And inline SimplifyIndexing into CaptureIndexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282	2025-02-08 18:00:49 +00:00
Jason Ansel	d35f6b2339	[inductor] Minor compile time optimizations in DefaultHandler (#146282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257	2025-02-08 18:00:40 +00:00
Jason Ansel	06604c4ec1	[inductor] Refactor op handlers part 5 (#146257 ) This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255	2025-02-08 18:00:30 +00:00
Jason Ansel	403db2faee	[inductor] Refactor op handlers part 4 (#146255 ) This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2. Some compile time wins from this as well: ``` 2025-02-02T19:46:32.2033010Z 2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2037575Z 2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones 2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50% 2025-02-02T19:46:32.2040131Z 2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2042188Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254	2025-02-08 18:00:17 +00:00
Jason Ansel	0e31e5932b	[inductor] Refactor op handlers part 3 (#146254 ) Fixes type errors that arise from typing `V.ops`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254 Approved by: https://github.com/shunting314 ghstack dependencies: #146252	2025-02-08 18:00:08 +00:00
Jason Ansel	71498aeae3	[inductor] Refactor op handlers part 2 (#146252 ) This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252 Approved by: https://github.com/yanboliang	2025-02-08 18:00:00 +00:00
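The difference between the two handler styles described in this commit can be sketched in plain Python. This is only an illustration under stated assumptions: the class names `GetattrHandler` and `PrintingHandler`, the two ops `add` and `mul`, and the `_default(name, args, kwargs)` signature are hypothetical stand-ins, not the actual Inductor code.

```python
class GetattrHandler:
    """Old style (illustrative): op names are resolved dynamically on each lookup."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for every op name.
        def inner(*args, **kwargs):
            return self._default(name, args, kwargs)

        return inner

    def _default(self, name, args, kwargs):
        return f"{name}{args}"


class DefaultHandler:
    """New style (illustrative): every op is a real method that forwards to _default()."""

    def _default(self, name, args, kwargs):
        raise NotImplementedError

    def add(self, a, b):
        return self._default("add", (a, b), {})

    def mul(self, a, b):
        return self._default("mul", (a, b), {})


class PrintingHandler(DefaultHandler):
    """A subclass only has to override _default() to handle all ops uniformly."""

    def _default(self, name, args, kwargs):
        return f"{name}{args}"


old = GetattrHandler()
new = PrintingHandler()
print(old.add(1, 2), new.add(1, 2))
```

With explicit methods, the ops are visible to type checkers and `hasattr` on the class, which is one plausible reason such a refactor helps with typing.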
cyyever	46e83bb637	Fix linter F821 error (#146665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-08 07:19:37 +00:00
Natalia Gimelshein	a3ca5c7f4e	remove incorrect warnings from min/max documentation (#146725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725 Approved by: https://github.com/wdvr, https://github.com/malfet	2025-02-08 05:10:08 +00:00
Justin Chu	63c2909ae3	[ONNX] Adjust and add deprecation messages (#146639 ) Adjust and add deprecation messages to torch.onnx utilities and verification methods because they are only related to torch script and are obsolete. Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639 Approved by: https://github.com/titaiwangms	2025-02-08 05:09:16 +00:00
Nikita Shulga	2328dcccb9	[MPSInductor] Implement Welford reduction (#146703 ) Still a work in progress: the fallback works as expected, but the custom shader does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703 Approved by: https://github.com/jansel, https://github.com/dcci	2025-02-08 05:00:00 +00:00
drisspg	69feef5a94	Fix broken meta function for flex-attention backwards (#146563 ) # Summary Fixes https://github.com/pytorch/pytorch/issues/146377 So what was the original problem: we were codegening a really weird epilogue: ```Python # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2 tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask) x5 = (xindex % ks3) tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last') tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask) ``` This epilogue was writing to and then reading from overlapping regions of memory, causing a race condition. ### Why were we generating this epilogue During the lowering we created a buffer with a different size/stride from the expected return strides. I think this added an implicit node (for permuting this wrongly strided output to the expected one from the meta func). The scheduler for some reason thought it was okay to fuse this into the epilogue; I don't know why. This fixes the broken meta func and the original repro. I will add a test but it is hard to pop, better than nothing Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563 Approved by: https://github.com/Chillee	2025-02-08 04:13:52 +00:00
David Peixotto	9c78fb920d	Fix assertion failure in gemm template lowering (#146353 ) Summary: This commit fixes a crash in the gemm template lowering caused by hitting an [assert](`fd515e4f59/torch/_inductor/codegen/common.py (L1181)`) that a buffer was previously removed. The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set. The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings. Differential Revision: D68814625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353 Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475	2025-02-08 01:52:20 +00:00
cyy	6cb2f737ee	Enable Windows tests (#146666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666 Approved by: https://github.com/albanD	2025-02-08 00:55:20 +00:00
Isalia20	0ab67299c3	[MPS] lu unpack (#146681 ) Implements lu unpack function on MPS. Haven't added new tests because they are covered by removing the lu_unpack from UNIMPLEMENTED_XFAILLIST in test_mps with `test_output_match` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-08 00:16:17 +00:00
Gregory Comer	803661526e	Update ET pin to 41e7ffa (#145831 ) ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework. This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831 Approved by: https://github.com/metascroy	2025-02-07 23:52:20 +00:00
Hyunho Yeo	dcac3c3e06	[MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659 ) Summary: Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated". Nit: The file previously contained two unit tests with the same name (due to wrong revert); I deleted a deprecated one to revamp the correct version. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/12103424065182810 Reviewed By: yuhc Differential Revision: D68988435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659 Approved by: https://github.com/nautsimon	2025-02-07 23:06:35 +00:00
Dingming Wu	fa34128435	revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453 ) Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass. Test Plan: With the change, build of APS model using rcclexp can now pass: `sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0` Reviewed By: c-p-i-o Differential Revision: D69149588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453 Approved by: https://github.com/c-p-i-o	2025-02-07 22:43:52 +00:00
Avik Chaudhuri	103c8b44bc	move and fix logic to update unbacked bindings (#146115 ) Summary: Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here). This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde. Test Plan: added test D68880766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115 Approved by: https://github.com/pianpwk	2025-02-07 22:41:19 +00:00
Lu Fang	45d35f5f5a	Clean up op BC check list (#146577 ) Summary: Remove the expired ones Test Plan: ci Differential Revision: D69226556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577 Approved by: https://github.com/hl475	2025-02-07 22:40:49 +00:00
Henry Hu	908133f682	[TreeSpec] Add custom comparision function (#146442 ) Summary: https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls. However, this made tests flaky when comparing TreeSpec for objects in local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'. Type comparison will yield False when local scopes are different due to lru_cache. Since this comparison is only used for testing purposes, we only test whether the str(type) values are equal. Test Plan: ``` PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py ``` Differential Revision: D69137706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442 Approved by: https://github.com/angelayi	2025-02-07 22:39:21 +00:00
drisspg	91dfa82981	[FlexAttention] Fix dynamic shapes in max-autotune (#146657 ) # Fixes https://github.com/pytorch/pytorch/issues/146624 ### Updated Following offline discussion, going w/ size hints. However, this does incur guards. I couldn't really think of a fancy way to do this. I was going to do `V.graph.sizevars.size_hint` w/ some default for num blocks, but we ultimately need some information about the input. I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe). For instance, in the repro, we quickly hit the recompile limit. ```Shell torch._dynamo hit config.recompile_limit (8) function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161) last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546 To log all recompilation reasons, use TORCH_LOGS="recompiles". To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657 Approved by: https://github.com/Chillee, https://github.com/yanboliang	2025-02-07 22:34:28 +00:00
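The flakiness described in this commit comes from how Python treats classes defined inside a function. A minimal sketch, assuming hypothetical names (`make_inner`, `types_match_for_tests`) that are not part of PyTorch: each call creates a distinct class object, so identity or equality comparison fails even though both classes print identically.

```python
def make_inner():
    # Each call executes the class statement again and returns a new class object.
    class Inner:
        pass

    return Inner


A = make_inner()
B = make_inner()


def types_match_for_tests(t1, t2):
    # Test-only relaxation: compare the printed form instead of object identity.
    return str(t1) == str(t2)


print(A is B)                       # distinct objects
print(types_match_for_tests(A, B))  # same qualified name
```

Comparing by string is looser than comparing by identity, which is why the commit restricts it to test-only comparisons.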
Jason Ansel	579b9f2ed9	[inductor] Better exception error messages for cache_on_self (#146652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652 Approved by: https://github.com/yanboliang	2025-02-07 21:22:21 +00:00
Jason Ansel	04ce02182b	[inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-02-07 21:21:21 +00:00
PyTorch MergeBot	80a1696679	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))	2025-02-07 21:04:23 +00:00
Henry Tsang	206ad9f4ad	[cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554 ) This PR does a few things: * set fall back to aten to False for most tests. Without this, a lot of tests would fail silently since they just use aten * Disable two subprocess related broken tests. They would crash in subprocess. More investigation needed. * remove/disable the tests on A100. Let me elaborate a bit more. There are two types of A100 tests. * normal tests that also test A100. e.g., mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore. GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out since they are of GemmKind.Universal but only GemmKind.Universal3x are supported in the 3x template. * tests for A100 only. The mixed mm and sparse semi structure tests have been failing for a while due to "TypeError: can't multiply sequence by non-int of type 'str'". Disabled them for now. Do let us know if you care about them @alexsamardzic Differential Revision: D69209929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554 Approved by: https://github.com/chenyang78	2025-02-07 19:59:28 +00:00
PyTorch MergeBot	f17109bd96	Revert "windows Magma build for cu128 (#146653 )" This reverts commit 9e27d36e2b2a4f037a7e448c2f87a9ebb0d6e628. Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))	2025-02-07 19:37:16 +00:00
Shunting Zhang	bc0191802f	[inductor] add size-asserts for fallback ops (#145904 ) Fix https://github.com/pytorch/pytorch/issues/144717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904 Approved by: https://github.com/jansel	2025-02-07 18:44:32 +00:00
Gabriel Ferns	b60f630de8	fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650 ) Summary: needed for https://github.com/pytorch/pytorch/pull/146513 Test Plan: the existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650 Approved by: https://github.com/xmfan	2025-02-07 18:25:00 +00:00
Ting Lu	9e27d36e2b	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman	2025-02-07 18:09:30 +00:00
Tristan Rice	23af9dde4d	distributed/serialization: add experimental streaming torch.save/load methods (#146555 ) Summary: This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor. This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility. Security wise this fully supports weights_only and defaults to True. It does use pickle for some metadata but uses weights_only for the metadata. Adapted from: https://github.com/pytorch/torchft/pull/101 https://github.com/pytorch/torchft/pull/54 Test Plan: pytest test/distributed/test_serialization.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555 Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>	2025-02-07 18:08:11 +00:00
Tristan Rice	68631f6e87	PyWork: preserve Python reference counting when used in functional collectives (#146376 ) @fegin found an issue where torchft is not compatible with functional collectives. Found in https://github.com/pytorch/torchtitan/pull/806 The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug. PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way Pybind works is that the Python object owns the C++ object rather than there being some form of shared ownership. Thus the PyWork Python object will be collected when returned to C++ from the PyProcessGroup, but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock as the corresponding Python object no longer exists. To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects. To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use it for all collectives in PyProcessGroup. Test plan: ``` cd pytorch pytest test/distributed/test_c10d_functional_native.py ``` ``` cd torchft pytest torchft/process_group_test.py -k functional -v -x -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376 Approved by: https://github.com/yifuwang	2025-02-07 18:07:53 +00:00
James Wu	76c8a2dc48	Fix get_top() to return the base level event of the stack, not the most recently started event (#146649 ) `get_top()` is really confusing when talking about a stack, because it can mean the most recently started event on the stack or the toplevel event in perfetto (which displays the stack upside down). Rename to `get_outermost` and fix the bug associated with it, so that it returns the correct value out of the stack. Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event: ``` tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes \| tee out.log ``` <img width="1281" alt="image" src="https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649 Approved by: https://github.com/masnesral, https://github.com/anijain2305	2025-02-07 18:04:50 +00:00
briancoutinho	f138b18d18	[inductor/profiler] add kernel kwargs instrumentation (#145573 ) ## About As above, record the kernel launch kwargs. These tend to be constexpr arguments to triton kernels like block size etc. ## Test program Note, install triton before proceeding (pip install triton) triton_test.py>>> ``` import torch from torch.profiler import profile, ProfilerActivity def foo(x, y): a = torch.sin(x) b = torch.cos(y) return a + b def main(): x = torch.randn(10, 10).cuda() y = torch.randn(10, 10).cuda() opt_foo = torch.compile(foo) z = opt_foo(x, y) # Profile the kernel function on the GPU with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True ) as prof: z = opt_foo(x, y) # Export the trace to a file prof.export_chrome_trace("my_kernel_trace.json") if __name__ == "__main__": main() ``` Run it and we should get a trace file my_kernel_trace.json Output has triton event with the kernel_kwargs attribute. ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815, "ts": 2045246693014.959, "dur": 75.662, "args": { ... "kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)", "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py", "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor" } }, ``` ## Unit Test Updated unit test: ``` pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573 Approved by: https://github.com/davidberard98, https://github.com/jansel	2025-02-07 17:44:30 +00:00
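The ambiguity this commit describes is easy to see with a plain list used as a stack. A minimal sketch, assuming hypothetical event names and a hypothetical `get_most_recent` helper for contrast; only the name `get_outermost` comes from the commit message, and its real implementation may differ.

```python
# Events are pushed as they start, so index 0 is the outermost (base level) event.
event_stack = ["dynamo", "entire_frame_compile", "inductor_compile"]


def get_outermost(stack):
    """Return the base level event: the first one pushed."""
    return stack[0] if stack else None


def get_most_recent(stack):
    """Return the most recently started event: the last one pushed."""
    return stack[-1] if stack else None


print(get_outermost(event_stack))
print(get_most_recent(event_stack))
```

A metric meant for the whole compile, such as `guard_latency_us`, belongs on the outermost event rather than on whichever event happened to start last.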
Animesh Jain	ee45ea599d	[dynamo] Actionable message on recompilations for fullgraph=True (#146550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550 Approved by: https://github.com/zou3519, https://github.com/StrongerXi ghstack dependencies: #146553	2025-02-07 17:28:43 +00:00
Animesh Jain	fa0956951c	[dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-02-07 17:28:43 +00:00
cyy	25aa7ca62d	Cleanup CallOnce.h (#146700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700 Approved by: https://github.com/albanD	2025-02-07 16:44:45 +00:00
PyTorch MergeBot	076717785c	Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222 )" This reverts commit 5ecdc428b230ab5ba44a90678f1c905e314f6ccb. Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))	2025-02-07 16:19:41 +00:00
eqy	5d7532140f	[CUDA][CUDA Graphs] Fix debug mode warning message (#145996 ) The real method is `enable_debug_mode()`, `_cuda_enable_graphs_debug_mode` does not exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996 Approved by: https://github.com/ptrblck, https://github.com/eellison	2025-02-07 08:04:49 +00:00
eellison	002accfb8d	Check meta strides for expanded dims in effn_attn_bias (#146054 ) With the `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of that dim at 0 to avoid materializing a larger tensor than we need. Previously, we checked the stride of the tensor, but if it is not realized, that will not work, so we should check the strides of the meta as well. Note: getting the exact of realizing/slicing/requiring_exact_strides was a little tricky. I commented to @exclamaforte on an example unable-to-fuse message you get if you do it incorrectly. Fix for https://github.com/pytorch/pytorch/issues/145760 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054 Approved by: https://github.com/shunting314	2025-02-07 06:35:57 +00:00
eellison	71e8a2bda4	Expand inductor codegen dtype asserts, fix scan (#146067 ) We were codegening intermediary dtype asserts in some places but not all. expands assertions, fixes newly failing assertion in `TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067 Approved by: https://github.com/shunting314, https://github.com/jansel	2025-02-07 06:35:47 +00:00
cyy	f6bd20e8a2	Enable TemporaryFileName tests on Windows (#146311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311 Approved by: https://github.com/albanD	2025-02-07 06:06:18 +00:00
Pian Pawakapan	1c872803cb	[export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378 ) Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols are logged, e.g. from the draft export CLI report (soon to be added to tlparse): 1. Guard added: ``` 1. Constraint violation error. The specified input dynamic_shapes spec was found to be incorrect during tracing. Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}. This occured at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward: assert a.shape[0] == 3 Locals: a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0) Symbols: s0: L['args'][0][0].size()[0] ... ``` 2. Real tensor propagation: ``` 1. Data dependent error. When exporting, we were unable to evaluate the value of `u2 < 0`. This was encountered 8 times. This occurred at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward: return res[:c_item] Locals: res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0) c_item: u2 ... ``` Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reported by `TracingContext`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378 Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93	2025-02-07 05:46:05 +00:00
Aaron Gokaslan	bc40ccf6aa	[BE]: Inline special functions for MPS (#146627 ) These header functions should be inlined for consistency and to avoid translation unit / symbol issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627 Approved by: https://github.com/dcci	2025-02-07 05:15:15 +00:00
Zhou32	ecf44d1002	Fixed a typo in dataset.py (#146600 ) Changed word 'Mult' to 'Multi'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600 Approved by: https://github.com/Skylion007	2025-02-07 05:09:51 +00:00
Justin Chu	41e6d189a3	[ONNX] Create deprecation warning on dynamo_export (#146425 ) Reland #146003 Deprecation of `torch.onnx.dynamo_export`: * [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425 Approved by: https://github.com/titaiwangms, https://github.com/atalman	2025-02-07 04:20:46 +00:00
cyy	fa0592b568	Remove some NOLINT (#146610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-07 01:50:06 +00:00
Nikita Shulga	624d94bdb8	[MPS] Extend `torch.special.sinc` to complex (#146648 ) And to integral data types as well. Was too lazy to deduce the formula myself (or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that the input must be multiplied by $$\pi$$: ```math \text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x + \cos(x)\sinh(y) y}{x^2 + y^2} ``` ```math \text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x - \sin(x)\cosh(y) y}{x^2 + y^2} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648 Approved by: https://github.com/dcci	2025-02-07 01:12:37 +00:00
Michal Gallus	9ea1823f96	[ROCm][Windows] Remove external linkage from an anonymous namespace (#146607 ) Fixes a clang-cl compiler error caused by an attempt to export a symbol that doesn't have any external linkage, since it is declared within a local anonymous namespace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607 Approved by: https://github.com/jeffdaily	2025-02-06 23:48:20 +00:00
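The closed form for the real and imaginary parts can be sanity-checked against `cmath`. The sketch below is not the MPS kernel. It derives the parts by multiplying sin(x + iy) = sin(x)cosh(y) + i cos(x)sinh(y) by the conjugate of x + iy, then applies the multiplication by π that the normalized sinc requires. The function names are hypothetical.

```python
import cmath
import math


def sin_over_z_parts(x, y):
    """Real and imaginary parts of sin(z)/z for z = x + iy, with z != 0."""
    r2 = x * x + y * y
    re = (math.sin(x) * math.cosh(y) * x + math.cos(x) * math.sinh(y) * y) / r2
    im = (math.cos(x) * math.sinh(y) * x - math.sin(x) * math.cosh(y) * y) / r2
    return re, im


def sinc_complex(z):
    """Normalized sinc, sin(pi*z)/(pi*z), with sinc(0) defined as 1."""
    if z == 0:
        return complex(1.0, 0.0)
    w = math.pi * z  # the input is scaled by pi before applying the closed form
    re, im = sin_over_z_parts(w.real, w.imag)
    return complex(re, im)


z = complex(0.3, 0.7)
reference = cmath.sin(math.pi * z) / (math.pi * z)
print(abs(sinc_complex(z) - reference))
```

The check at a purely imaginary input is a quick way to catch sign mistakes, since sin(iy)/(iy) = sinh(y)/y must be real and positive.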
Michal Gallus	3379c65de6	[ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606 ) Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606 Approved by: https://github.com/jeffdaily	2025-02-06 23:47:18 +00:00
Michal Gallus	0d8fc00e0a	[ROCm][Windows] Fix isnan integer overload errors on MS STL (#146605 ) Microsoft's STL has a problem with integer overloads of std::fpclassify used by std::isnan and std::isinf. These functions need a cast to double to function correctly. Otherwise, the call fails with "ambiguous call to overloaded function" error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146605 Approved by: https://github.com/jeffdaily	2025-02-06 23:44:11 +00:00
Michal Gallus	3f5ed05688	[Windows][ROCm] Fix c10 hip tests (#146599 ) - Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake. - Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors. - Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599 Approved by: https://github.com/jeffdaily	2025-02-06 23:41:25 +00:00
Fuzzkatt	e13a544b54	fix tf32 issue in test_inductor_freezing.py unit tests (#146444 ) Test is hitting numerical mismatches in NVIDIA internal CI. Add tf32_on_and_off decorator, update check to assertEqual Pull Request resolved: https://github.com/pytorch/pytorch/pull/146444 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/eqy	2025-02-06 23:34:28 +00:00
eqy	7bd7f735d4	[CUDA][SDPA] Compute reference in `test_triton_scaled_dot_product_attention_block_size_16_cuda_float32` in `float64` (#146461 ) Seems to currently fail with mismatches in the 1e-4 range presumably due to sdpa calling into the `MATH` backend here which is less fused than a triton kernel. Doing the ref computation in `float64` appears to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461 Approved by: https://github.com/drisspg	2025-02-06 23:28:56 +00:00
Jason Ansel	2834fe5e93	[inductor] Fix test error test_force_cutlass_backend_aoti_cexpr_codegen (#146564 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend -- --exact 'caffe2/test/inductor:cutlass_backend - test_force_cutlass_backend_aoti_cexpr_codegen (caffe2.test.inductor.test_cutlass_backend.TestCutlassBackend)' ``` Differential Revision: D69219873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146564 Approved by: https://github.com/yanboliang	2025-02-06 23:02:41 +00:00
Aaron Gokaslan	0c81b398ab	[BE][Ez]: Enable some additional pylint ruff warnings (#146609 ) Some additional code hardening with some pylint warnings in ruff that usually indicate bugs. All code currently conforms nicely to them, but this will ensure these errors can be detected statically before running / creating tests. The following rules are enabled: * Ban walrus operators where they would have no effect over regular assignment, making intention more clear. * Statically check for the common error of forgetting to put parens after the `super` call, which will cause an attribute error * Ban bad string literal args to builtins `open` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146609 Approved by: https://github.com/aorenste	2025-02-06 21:58:08 +00:00
Michael Suo	99dd846672	[torch] fix builds for older pybind (#146630 ) Summary: some versions of pybind we build with don't have `py::set_error`. So just use the underlying python C API. Test Plan: unit tests Differential Revision: D69254629 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630 Approved by: https://github.com/colin2328, https://github.com/ngimel	2025-02-06 21:22:00 +00:00
Huy Do	3008368b12	Honor Dr.CI classification results on auto commit hash update (#146337 ) Disabling `ignore_flaky_failures` was a safer choice, but it seems that this option doesn't work with the current state of the CI. For example, https://github.com/pytorch/pytorch/pull/125806 hasn't been merged since May because there would always be a failure of one type or another. This effectively disables the automated mechanism. My proposal here is to relax this rule and allow the bot to merge auto commit hash updates with `@pytorchbot merge` like a regular PR. Then we will at least have something working. If this causes issues, we can revert it and try the longer route of improving CI reliability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146337 Approved by: https://github.com/clee2000	2025-02-06 20:33:38 +00:00
Nichols A. Romero	44b69b80c2	[ROCm][TunableOp] Future proof TunableOp unit test. (#146548 ) TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146548 Approved by: https://github.com/jeffdaily	2025-02-06 20:26:02 +00:00
Xilun Wu	5cc1b54a91	[2/N][cp][example] flex attention in context parallel (backward pass) (#146397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397 Approved by: https://github.com/fegin ghstack dependencies: #145896	2025-02-06 19:50:02 +00:00
Xilun Wu	6220c64aea	[1/N][cp][example] flex attention in context parallel (forward pass) (#145896 ) Description This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step: 1. backward pass 2. static load balancing for causal masking 3. dynamic load balancing for other general maskings 4. automatic collective insertion solution 5. non-intrusive context parallel APIs Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896 Approved by: https://github.com/fegin, https://github.com/Skylion007	2025-02-06 19:50:02 +00:00
Yidi Wu	5ecdc428b2	[while_loop][inductor] support sym expression as cond_fn output (#146222 ) As titled. Previously, we only support tensor output of cond_fn, this PR changes to also allow a shape expr to be returned in cond_fn. aoti generated output code looks like: ``` V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] bool buf7_cond_result; .... (while_loop_cond_graph_0_arg2_1_handle); V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] buf7_cond_result = u0 + u1 < 10L; V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code] if (!buf7_cond_result) break; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222 Approved by: https://github.com/desertfire ghstack dependencies: #146194, #146195	2025-02-06 19:39:55 +00:00
Bin Bao	1b879fd0ea	[Inductor] Add a JIT Inductor unit test following #146293 (#146529 ) Summary: To follow up https://github.com/pytorch/pytorch/pull/146293, add a JIT Inductor unit test. Other Triton template may need similar fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146529 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-02-06 19:21:15 +00:00
Shunting Zhang	992388c100	[inductor] use ftz variant of exp (#146216 ) Triton compiles the exp op generated by Inductor to the following PTX snippet. ``` mul.f32 %f74, %f83, 0f3FB8AA3B; ex2.approx.f32 %f73, %f74; ``` But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as ``` mul.ftz.f32 %f2, %f1, 0f3FB8AA3B; ex2.approx.ftz.f32 %f3, %f2; ``` which uses the FTZ variant. Let Inductor generate the FTZ variant when the use_fast_math config is true. I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216 Approved by: https://github.com/jansel, https://github.com/eellison	2025-02-06 19:12:35 +00:00
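The PTX in the commit above computes exp(x) as 2^(x * log2(e)): the immediate `0f3FB8AA3B` is the float32 bit pattern of log2(e). The following is a minimal Python sketch of that identity only, not Inductor's or Triton's code generation; `exp_via_exp2` is a hypothetical helper name, and the sketch does not model the flush-to-zero (FTZ) treatment of subnormal values.

```python
import math
import struct

# Decode the PTX immediate 0f3FB8AA3B as a big-endian IEEE-754 float32.
LOG2_E = struct.unpack(">f", bytes.fromhex("3FB8AA3B"))[0]


def exp_via_exp2(x: float) -> float:
    """Compute exp(x) as 2**(x * log2(e)), mirroring the mul + ex2 pair."""
    return 2.0 ** (x * LOG2_E)


if __name__ == "__main__":
    print(LOG2_E, math.log2(math.e))
    print(exp_via_exp2(1.0), math.exp(1.0))
```

The decoded constant agrees with `math.log2(math.e)` only to float32 precision, so the sketch matches `math.exp` to about seven significant digits.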
Eddie Yan	9ee506bd93	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-02-06 19:04:50 +00:00
eqy	07b214402a	[CUDA][B200] Update the number of threads in `avg_pool2d` backward for SM 10.0 (#145669 ) Fixes register count issue when launching on SM 10.0, originally authored by @bilal2vec Pull Request resolved: https://github.com/pytorch/pytorch/pull/145669 Approved by: https://github.com/nWEIdia, https://github.com/ngimel	2025-02-06 18:57:33 +00:00
Animesh Jain	99ddbb4802	[dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527 ) Previously, if there were no ops in the graph, fullgraph=True would also fall back to eager. This hides issues in testing, where we silently fall back to eager and do not test the optimized bytecode. As can be seen in the PR, I had to fix several tests once I forced the use of the optimized bytecode in the absence of a graph. A few failing tests will be fixed in follow-up PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146527 Approved by: https://github.com/zou3519, https://github.com/StrongerXi	2025-02-06 18:56:07 +00:00
rzou	15b1ac3e86	Add torch.func.debug_unwrap (#146528 ) Use it to unwrap any functorch-wrapped tensor. I don't recommend using the output in a program since it breaks the semantics of the transforms, but it seems useful for debugging. I will note that some people have wanted to get intermediate values out of an e.g. grad transform, so this might be a way to do that... Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146528 Approved by: https://github.com/Chillee	2025-02-06 18:48:09 +00:00
Ryo Suzuki	49082f9dba	parallelize sort (#142391 ) - use __gnu_parallel::sort for gcc compilations - add a parallelized version of std::sort and std::stable_sort for non gcc compilations Using __gnu_parallel::sort: provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64 The performance is measured using the following script: ```python import torch import torch.autograd.profiler as profiler torch.manual_seed(0) N = 50000 x = torch.randn(N, dtype=torch.float) with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof: for i in range(1000): _, _ = torch.sort(x) print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391 Approved by: https://github.com/malfet	2025-02-06 18:06:40 +00:00
Isalia20	7725d0ba12	[METAL] inline bfloat min/max (#146588 ) After a recent commit 36c6e09528a7e071edecde083254da70cba26c95 , building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max: ``` FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')! ``` This PR fixes that. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588 Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet	2025-02-06 17:57:31 +00:00
Animesh Jain	e2e265e27b	[dynamo] Use polyfill to implement comparison operators (#144485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485 Approved by: https://github.com/jansel	2025-02-06 17:27:07 +00:00
Davide Italiano	1090e58687	[mps] Remove a stale comment. (#146619 ) The implementation of the function was moved to a shader, but the comment was left there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146619 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-06 17:25:29 +00:00
Davide Italiano	46390e9a37	[mps] Implement support for sinc() operator (inductor and eager). (#146539 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-06 16:37:27 +00:00
Simon Fan	a14c780c4c	[dynamo] fix dynamo_compile logging on RecompileLimitExceeded (#146544 ) Logging branches based on whether RecompileLimitExceeded was raised. If we exceed the limit, we fall back to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context: `72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935)`. dynamo_config and recompile_reason are both known before we raise the RecompileLimitExceeded, so we can add them with the rest of the "common" metrics, which are logged on metric_context decorator exit, which is always called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544 Approved by: https://github.com/masnesral	2025-02-06 16:20:42 +00:00
Taras	6ff3383157	Enable CUPTI on Windows (#141454 ) Fixes: - https://github.com/pytorch/pytorch/issues/93855 The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events. Additionally, the changes can be verified using the following script: ``` import torch from torch.profiler import profile, ProfilerActivity def check_cupti_enabled(): # Check if CUDA is available if not torch.cuda.is_available(): print("CUDA is not available on this system.") return False # Create a simple CUDA tensor x = torch.randn(1000, 1000, device="cuda") y = torch.randn(1000, 1000, device="cuda") try: # Use PyTorch profiler to perform a basic check with profile(activities=[ProfilerActivity.CUDA]) as prof: z = x @ y # Simple CUDA operation # Print profiling results print("CUPTI is enabled and profiling works.") print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) return True except RuntimeError as e: # If profiling fails, CUPTI is likely not set up correctly print("Error: CUPTI might not be enabled or accessible.") print(f"Details: {e}") return False if __name__ == "__main__": if check_cupti_enabled(): print("CUPTI is properly configured in PyTorch.") else: print("CUPTI is not configured correctly. Check your CUDA installation.") ``` Sample output: ``` CUPTI is enabled and profiling works. 
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls --------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ sgemm_128x128x8_NN_vec 0.00% 0.000us 0.00% 0.000us 0.000us 2.086ms 100.00% 2.086ms 2.086ms 1 cudaFree 9.67% 9.816ms 9.67% 9.816ms 9.816ms 0.000us 0.00% 0.000us 0.000us 1 cudaDeviceGetAttribute 0.01% 10.000us 0.01% 10.000us 0.476us 0.000us 0.00% 0.000us 0.000us 21 cudaGetDriverEntryPoint 0.00% 1.700us 0.00% 1.700us 0.850us 0.000us 0.00% 0.000us 0.000us 2 cudaGetSymbolAddress 85.15% 86.438ms 85.15% 86.438ms 86.438ms 0.000us 0.00% 0.000us 0.000us 1 cudaMalloc 0.43% 433.300us 0.43% 433.300us 144.433us 0.000us 0.00% 0.000us 0.000us 3 cudaLaunchKernel 2.61% 2.648ms 2.61% 2.648ms 2.648ms 0.000us 0.00% 0.000us 0.000us 1 cudaDeviceSynchronize 2.13% 2.163ms 2.13% 2.163ms 2.163ms 0.000us 0.00% 0.000us 0.000us 1 --------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 101.511ms Self CUDA time total: 2.086ms CUPTI is properly configured in PyTorch. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454 Approved by: https://github.com/malfet	2025-02-06 15:58:20 +00:00
FEI	8a4dd763b8	[CCA] remove TODO for hardware_destructive_interference_size (#145591 ) @zyan0 @albanD @houseroad Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591 Approved by: https://github.com/albanD	2025-02-06 14:41:25 +00:00
Jack Zhang	ed309b9156	Re-add stft option to align window for center = false (#146379 ) Skips advancing the fc window on https://github.com/pytorch/pytorch/pull/145437, since I just found that there were non-trivial efforts to do so a while ago that were eventually reverted: https://github.com/pytorch/pytorch/pull/73434 Works around the issue by keeping the stft sans center overload Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379 Approved by: https://github.com/justinchuby, https://github.com/iseeyuan	2025-02-06 14:07:13 +00:00
PyTorch MergeBot	1b79d47635	Revert "[dynamo] check for incompatible configs (#146513 )" This reverts commit aab7925418be561a8af6adfcb8cf009a8786c31b. Reverted https://github.com/pytorch/pytorch/pull/146513 on behalf of https://github.com/atalman due to inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/13174131431/job/36772837627) [HUD commit link](`4a545eb85d`) ([comment](https://github.com/pytorch/pytorch/pull/146513#issuecomment-2639860568))	2025-02-06 13:42:25 +00:00
Animesh Jain	340cfe4f28	[dynamo][fbcode] Turn on inline_inbuilt_nn_modules (#145407 ) As title. Some internal testing at https://fb.workplace.com/groups/241460628989036/permalink/411650015303429/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/145407 Approved by: https://github.com/ezyang, https://github.com/jansel	2025-02-06 13:18:35 +00:00
PyTorch MergeBot	bd7d4fb2b5	Revert "[DTensor][Test] Create a simple unit test for tensordot (#146514 )" This reverts commit 1f8baf09ea598c97f30731ddb8328b6aa8d31fe9. Reverted https://github.com/pytorch/pytorch/pull/146514 on behalf of https://github.com/albanD due to The lint failures that you ignored are real right? ([comment](https://github.com/pytorch/pytorch/pull/146514#issuecomment-2639554636))	2025-02-06 11:26:43 +00:00
zeshengzong	4a545eb85d	Fix torch.nn.functional.one_hot param num_classes optional description (#146470 ) The `torch.nn.functional.one_hot` [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) describes the `num_classes` parameter as not optional, but users can call the function without passing it. ![image](https://github.com/user-attachments/assets/4e6d4feb-691f-451f-95b5-4ac11bac7bc2) ```python >>> import torch >>> a = torch.arange(0, 5) % 3 # [0,1,2,0,1] >>> torch.nn.functional.one_hot(a) tensor([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]) ``` `num_classes` has a default value of -1 `93d98aca31/aten/src/ATen/native/native_functions.yaml (L6154-L6157)` ## Test Result ![image](https://github.com/user-attachments/assets/2c7203b7-6226-4ebc-84c8-cbf912fc48e2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146470 Approved by: https://github.com/albanD	2025-02-06 07:48:05 +00:00
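The default described in the commit above, `num_classes=-1` meaning "infer the class count from the data", can be illustrated without PyTorch. This is a pure-Python sketch over lists, assuming the class count is inferred as one more than the largest label; the `one_hot` function here is a hypothetical stand-in, not the ATen implementation.

```python
def one_hot(labels, num_classes=-1):
    """Return a one-hot encoding of non-negative integer labels as nested lists.

    num_classes=-1 mimics the default: infer the class count as max(labels) + 1.
    """
    if num_classes == -1:
        num_classes = max(labels) + 1
    return [[1 if c == y else 0 for c in range(num_classes)] for y in labels]


labels = [i % 3 for i in range(5)]  # [0, 1, 2, 0, 1], as in the commit message
encoded = one_hot(labels)

if __name__ == "__main__":
    for row in encoded:
        print(row)
```

Passing an explicit `num_classes` larger than the inferred value simply widens each row with extra zero columns.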
Simon Fan	aab7925418	[dynamo] check for incompatible configs (#146513 ) internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/ Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time. Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513 Approved by: https://github.com/williamwen42	2025-02-06 07:39:52 +00:00
eqy	5f0901e573	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs, see also #120925 + fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-02-06 05:57:33 +00:00
Nikita Shulga	36c6e09528	[MPSInductor] Fix min/max for bfloat16 (#146552 ) By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max. Test by running `test_min_max_reduction` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552 Approved by: https://github.com/dcci	2025-02-06 05:15:00 +00:00
wz337	1f8baf09ea	[DTensor][Test] Create a simple unit test for tensordot (#146514 ) Fixes #ISSUE_NUMBER The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514 Approved by: https://github.com/tianyu-l	2025-02-06 05:09:34 +00:00
Michael Diggin	e01a5e9e1e	Small improvements to NJT matrix multiplies (#146405 ) Fixes #146404 Adds changes to the matmul and matmul_backward operation for nested jagged tensors, to support back propagation when the output is a regular strided tensor. This required adding support for the nested matmul operation to work when the nested tensor wasn't 'self', i.e `A@B` where `A` isn't nested but `B` is. The operation schemas had to be updated to reflect that either input can be a strided tensor instead (and the gradient), so an extra assertion is added in an edge case where neither input is nested. Unit tests are also added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405 Approved by: https://github.com/soulitzer, https://github.com/jbschlosser	2025-02-06 04:51:12 +00:00
bobrenjc93	389c5c0842	print out partial fx graph for all data-dependent errors (#146363 ) The previous implementation didn't catch the following type of errors ``` torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not extract specialized integer from data-dependent expression u2 (unhinted: u2). (Size-like symbols: none) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146363 Approved by: https://github.com/angelayi, https://github.com/bdhirsh ghstack dependencies: #146298, #146296	2025-02-06 04:21:34 +00:00
Michael Suo	425804db2b	[torch] fix exception types in custom class magic setattr/getattr (#146516 ) Summary: `c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`). Some Python functions like `hasattr` rely on the type of the throw exception to be correct. We don't need the fully generality of those macros, so just do a targeted error type conversion here. Test Plan: added unit test Differential Revision: D69197217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516 Approved by: https://github.com/zdevito	2025-02-06 02:14:11 +00:00
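The dependence of `hasattr` on the exception type, mentioned in the commit above, can be reproduced in plain Python: `hasattr` returns `False` only when the lookup raises `AttributeError`, and any other exception propagates to the caller. The classes and the `probe` helper below are hypothetical illustrations, unrelated to the actual custom class bindings.

```python
class RaisesAttributeError:
    def __getattr__(self, name):
        # The exception type hasattr() expects for a missing attribute.
        raise AttributeError(name)


class RaisesRuntimeError:
    def __getattr__(self, name):
        # Wrong exception type: hasattr() will not swallow this.
        raise RuntimeError(f"no attribute {name!r}")


def probe(obj, name):
    """Return hasattr's result, or the string 'escaped' if an exception leaks out."""
    try:
        return hasattr(obj, name)
    except RuntimeError:
        return "escaped"


if __name__ == "__main__":
    print(probe(RaisesAttributeError(), "missing"))
    print(probe(RaisesRuntimeError(), "missing"))
```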
Pian Pawakapan	3a6a203b98	[dynamic shapes][real tensor tracing] propagate unbacked hint when creating mod replacement (#146381 ) Fixes data-dependent errors for 2 PT2I models in draft export Pull Request resolved: https://github.com/pytorch/pytorch/pull/146381 Approved by: https://github.com/angelayi	2025-02-06 01:48:40 +00:00
Pian Pawakapan	c5062cca98	[export] make stack_trace optional in insert_custom_op_guards (#146438 ) Summary: Fixes 1 PT2I exportability error Test Plan: - Differential Revision: D69132186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146438 Approved by: https://github.com/yiming0416, https://github.com/angelayi	2025-02-06 01:48:26 +00:00
Nikita Shulga	6a985d8b2e	Make `inductor_utils.requires_gpu` accept MPS (#145156 ) Not yet ready to set HAS_GPU to true, but can unskip tests that require GPU (Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped) - Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, `test_mutable_custom_op_fixed_layout2`, otherwise the GPU tests are just running for _cpu suffixes. - Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0` - UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU - Add `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it needs neither CPU nor GPU, but rather a triton backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156 Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel	2025-02-06 01:14:36 +00:00
Isalia20	0dc03134d9	[MPS] linalg solve implementation (#146531 ) Fixes #98222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-06 00:57:49 +00:00
Nikita Shulga	495049860b	[BE][Metal] Fix signed unsigned comparison warning (#146549 ) I wish I knew how to extract Metal warnings during JIT compilation but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie as `error:` stays `nil` unless shader compilation fails. But when it does following warnings are thrown ``` program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549 Approved by: https://github.com/dcci	2025-02-06 00:40:17 +00:00
PyTorch MergeBot	e0cf519ade	Revert "[inductor] Refactor op handlers part 2 (#146252 )" This reverts commit 13f0436abdff0386f33c7a8c25caa66e9af16dbd. Reverted https://github.com/pytorch/pytorch/pull/146252 on behalf of https://github.com/atalman due to Sorry need to revert, failing internally ([comment](https://github.com/pytorch/pytorch/pull/146252#issuecomment-2638305417))	2025-02-06 00:04:04 +00:00
Nikita Shulga	c7087d6b14	[BE][EZ][Metal] Do not pass tensor length as arg (#146522 ) As all devices capable of running Metal-2 support nonuniform threadgroup sizes, see https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf for more detail Pull Request resolved: https://github.com/pytorch/pytorch/pull/146522 Approved by: https://github.com/dcci ghstack dependencies: #146521	2025-02-06 00:03:41 +00:00
Nikita Shulga	54ef029532	[BE][EZ][Metal] Mark constant inputs as constant (#146521 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146521 Approved by: https://github.com/dcci	2025-02-06 00:03:41 +00:00
PyTorch MergeBot	2001066c61	Revert "[inductor] Refactor op handlers part 3 (#146254 )" This reverts commit 8e9bda8d895e80da0fe480d02e100bae8332ed57. Reverted https://github.com/pytorch/pytorch/pull/146254 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146254#issuecomment-2638300857))	2025-02-05 23:59:50 +00:00
Simon Fan	72405b0c0f	[ca] refactor compile reasons and log to tlparse (#146386 ) This PR accumulates compile reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change. sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 for compiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]" ] ``` for recompiles: ```python [ "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]", "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]" ] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386 Approved by: https://github.com/jansel ghstack dependencies: #146229	2025-02-05 23:33:21 +00:00
PyTorch MergeBot	68304dba7a	Revert "[inductor] Refactor op handlers part 4 (#146255 )" This reverts commit 7aced455c542f629ffcd4f79c6af259bb966add8. Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))	2025-02-05 23:24:20 +00:00
PyTorch MergeBot	49effa0deb	Revert "[inductor] Refactor op handlers part 5 (#146257 )" This reverts commit d3dd3eeb7f599a2816ba1a067a8fa5a1bb1c84c3. Reverted https://github.com/pytorch/pytorch/pull/146257 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146257#issuecomment-2638251994))	2025-02-05 23:20:38 +00:00
PyTorch MergeBot	93e1e6e07c	Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282 )" This reverts commit b8a529cca18ae4d21b1681c5ea3a40635aba5a83. Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))	2025-02-05 23:13:08 +00:00
PyTorch MergeBot	7dc5cfe2ad	Revert "[inductor] Refactor CaptureIndexing into global scope (#146297 )" This reverts commit 7288950bcd4c5851e003dded6ce87da643b93e49. Reverted https://github.com/pytorch/pytorch/pull/146297 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146297#issuecomment-2638234829))	2025-02-05 23:10:08 +00:00
PyTorch MergeBot	9555bfce88	Revert "[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 )" This reverts commit 84ba9c6e7844a0b457bc64ca70a9c8cf3655d03d. Reverted https://github.com/pytorch/pytorch/pull/146373 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146373#issuecomment-2638232033))	2025-02-05 23:07:08 +00:00
Yanan Cao (PyTorch)	8af31e30d7	[Codemod][AddExplicitStrictExportArg] caffe2/torch (#146439 ) Differential Revision: D69068432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146439 Approved by: https://github.com/avikchaudhuri	2025-02-05 22:56:54 +00:00
Catherine Lee	97b64f2e5c	Fix workflow for closing nonexistent disable issues (#146447 ) The workflow could not update issues because it didn't have permissions, and it looked green because it didn't check return codes. Tested by running the workflow and seeing that issues did get closed Fixes https://github.com/pytorch/pytorch/issues/145382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146447 Approved by: https://github.com/huydhn	2025-02-05 22:29:05 +00:00
Howard Huang	9b6d680131	Remove stage_index_to_group_rank from schedule (#146217 ) This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank ` and removes the `stage_index_to_group_rank ` argument from the `PipelineScheduleMulti` constructor Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217 Approved by: https://github.com/wconstab ghstack dependencies: #146193	2025-02-05 21:26:45 +00:00
Howard Huang	4ee7d0de86	Add generate_stage_to_rank_mapping utility (#146193 ) We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it. This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely. Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193 Approved by: https://github.com/wconstab	2025-02-05 21:26:45 +00:00
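A stage-to-rank mapping of the kind discussed in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the utility added by the PR: the function name `sketch_stage_to_rank`, its parameters, and the two placement styles ("loop" places stage `i` on rank `i % pp_size`; "v" reverses direction on every other pass over the ranks) are all hypothetical.

```python
def sketch_stage_to_rank(pp_size: int, num_stages: int, style: str = "loop") -> dict:
    """Map each pipeline stage index to a rank within the pipeline group.

    Hypothetical sketch: "loop" wraps around the ranks, "v" zigzags back.
    """
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be a multiple of pp_size")
    mapping = {}
    for stage in range(num_stages):
        position = stage % pp_size      # offset within the current pass
        pass_index = stage // pp_size   # which pass over the ranks this is
        if style == "loop" or pass_index % 2 == 0:
            mapping[stage] = position
        elif style == "v":
            mapping[stage] = pp_size - 1 - position
        else:
            raise ValueError(f"unknown style: {style}")
    return mapping


if __name__ == "__main__":
    print(sketch_stage_to_rank(4, 8, "loop"))
    print(sketch_stage_to_rank(4, 8, "v"))
```

Deriving the mapping from the pipeline size and stage count is what lets a schedule class drop it as a constructor argument, which is the direction this stack of PRs describes.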
rzou	98b5d455fd	[opcheck] Improve error reporting; allow atol/rtol overrides (#146488 ) This PR improves opcheck to: 1. directly use torch.testing.assert_close (without a msg override). This allows it to print the absolute and relative differences and the number of mismatched elements. 2. take in an atol/rtol tolerance (for if someone just wants to use opcheck in their testing). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488 Approved by: https://github.com/williamwen42	2025-02-05 21:25:06 +00:00
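The atol/rtol tolerances mentioned in the commit above follow the usual closeness criterion, |actual - expected| <= atol + rtol * |expected|. Below is a scalar-only Python sketch of that check; `is_close` and its default tolerances are hypothetical and do not reproduce opcheck's or `torch.testing.assert_close`'s dtype-dependent defaults.

```python
def is_close(actual: float, expected: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Scalar closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


if __name__ == "__main__":
    print(is_close(1.0, 1.001))             # fails with the tight defaults
    print(is_close(1.0, 1.001, rtol=1e-2))  # passes once rtol is loosened
```

Overriding the tolerances is useful when an operator's low-precision implementation is known to deviate slightly from its reference.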
Justin Chu	1f6b566d74	[ONNX] Bump onnx and onnxscript versions in CI (#146097 ) Bump onnx onnxscript==0.1 in CI; Skipped onnxruntime 1.19 because it has regression on avgpool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146097 Approved by: https://github.com/malfet	2025-02-05 21:00:25 +00:00
Katarzyna Fojcik	9da376daa6	Add retain-output argument (#145921 ) This PR adds a retain-output argument, which enables appending to the output file if it already exists instead of deleting it and creating a new one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921 Approved by: https://github.com/jansel	2025-02-05 19:45:09 +00:00
Raymond Li	dd349207c5	Add check that envvar configs are boolean (#145454 ) So we don't get unexpected behavior when higher typed values are passed in Pull Request resolved: https://github.com/pytorch/pytorch/pull/145454 Approved by: https://github.com/c00w, https://github.com/jamesjwu	2025-02-05 19:40:10 +00:00
Anant Gulati	9091096d6c	Refactoring Distributed test cases to be device agnostic [1/n] (#145222 ) In this series of PRs we intend to refactor distributed test cases so that they are completely device agnostic. These changes will include the following approaches: - Allowing for multiple device types using instantiate_device_type_test - Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies - Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable - Replacing explicit calls to distributed backend (NCCL,HCCL,etc) with get_default_backend_for_device (#140536). This should result in a significant improvement in usability for all devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145222 Approved by: https://github.com/kwen2501	2025-02-05 18:47:09 +00:00
eqy	6f7fda3f49	Bump `nn.functional.conv3d` tolerances for `test_comprehensive` (#135719 ) `float16` tolerance was previously set to `1e-5` which seemed very low Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719 Approved by: https://github.com/Chillee, https://github.com/albanD	2025-02-05 18:34:12 +00:00
Tugsbayasgalan Manlaibaatar	d2a2b9f8a7	Fix constants with non-functional operators (#145593 ) Previously, in non-strict path, we always error when trying to inplace update a constant tensor because those constant tensors are not actually wrapped by functional tensors. This is correct behaviour in torch.compile, because dynamo makes all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, there is no such step that registers constants as buffers so AOTDispatcher panics when it sees these dangling constant tensors when functionalizing. Due to recent change in the IR, this is no longer an issue in non-strict path because we don't call AOTDispatcher at training IR level, but now it is a problem for both strict and non-strict when we lower to inference. (lowering to inference is very similar to non-strict tracing) As a result, we have at least one external (https://github.com/pytorch/pytorch/issues/141336) and internal issues reported due to this difference. To fix this, there are two ways: 1. Make functionalization be aware of constant tensors and map them to functional tensors on the fly. This makes functionalization invariant uglier and could potentially open up a gate for more nasty bugs. 2. Special handle this in export. This seems more aligned with what dynamo does today so I think we should do it this way. I think the current state could benefit from more refactors to make run_decompositions more similar to strict export (because both of them now handle this constant registering logic), but it is a bit complicated to do it now because the strict export version of this logic is also not complete (it doesn't take into account the export graph renaming pass, etc.). I will follow up with more refactors after this PR (T213466691) to unblock users faster. For future reference: Why are we not doing "turning constants into non-persistent buffers and never de-register"? 
The reason is that some internal models rely on module.to working reliably to move params/buffers to the correct device. As a result, buffers are moved while constants are not. In the composability meeting, we agreed that export won't do device agnostic tracing going forward (it will provide a way to specify FakeTensor in CPU that can be configured to be run on GPU), so after that is done, we can always turn constants into non-persistent buffers which will simplify export's constant handling. Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593 Approved by: https://github.com/avikchaudhuri	2025-02-05 17:44:19 +00:00
Jeff Daily	44248c44eb	[ROCm] miopen benchmark behavior now better aligns with cudnn (#145294 ) The default benchmark setting is now false. The new miopen behavior means when benchmarking is disabled, for any shape that doesn't have a find hit, then it will do a quick search (same behavior as the prior default), and use that result. Now when benchmark is enabled, it will perform an exhaustive search and update any DBs. miopen immediate mode is still available and is used when deterministic is true and benchmark is false. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294 Approved by: https://github.com/BrianHarrisonAMD, https://github.com/malfet	2025-02-05 17:19:53 +00:00
PyTorch MergeBot	f27220e32a	Revert "Move get accelerator to use build time flags when possible (#146098 )" This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04. Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))	2025-02-05 16:39:37 +00:00
Jason Ansel	f55c0af37f	[inductor] Support non-power-of-2 cooperative RSPLIT (#145689 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145689 Approved by: https://github.com/eellison	2025-02-05 16:36:53 +00:00
maajidkhann	db22e9d5a2	Implement blend operation for float, double, int in VEC ATen backend for SVE (#146479 ) - Added support for SVE vectorized blend operation for float, double, int8_t, int16_t, int32_t and int64_t data types. - Utilizes SVE ACLE intrinsic (svcntb, svcntw, svcmpne, svsel) to handle different vector lengths (VL) dynamically. - Ensured compatibility with SVE128, SVE256, and SVE512 hardware configurations. - Enabled back blend SVE vec tests Testing: a) Float DType: ./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/0.Blend [Test Passed] on Graviton 3 machine (SVE256) ./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/0.Blend [Test Passed] on Graviton 4 machine (SVE128) b) Double DType: ./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/1.Blend [Test Passed] on Graviton 3 machine (SVE256) ./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/1.Blend [Test Passed] on Graviton 4 machine (SVE128) c)Int DType: python3 test/inductor/test_cpu_repro.py CPUReproTests.test_vec_remainder [Test Passed] on Graviton 3 machine (SVE256) and on Graviton 4 machine (SVE128) <img width="661" alt="grv4_test_case_passed" src="https://github.com/user-attachments/assets/5572fcc0-a861-4bd6-bf9e-356219ffe656" /> Fixes https://github.com/pytorch/pytorch/issues/146309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146479 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-05 16:29:13 +00:00
Zhengxu Chen	cd6c0707a8	[aoti] Assign proxy call args by name, and support default values. (#146263 ) Fixing the following issue when compiling the following program: ``` window = torch.hann_window(N_FFT).to(x.device) stft = torch.stft( x, N_FFT, HOP_LENGTH, window=window, return_complex=True ) magnitudes = stft[..., :-1].abs() ** 2 return magnitudes ``` ``` Traceback (most recent call last): File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor yield File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run self._callTestMethod(testMethod) File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper method(args, *kwargs) File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test return value(self) ^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft self.check_model(model, example_inputs) File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model actual = AOTIRunnerUtil.run( ^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run optimized = AOTIRunnerUtil.load(device, so_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load return torch._export.aot_load(so_path, device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device) # type: ignore[assignment, call-arg] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type 
for argument 1 but got as_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263 Approved by: https://github.com/angelayi	2025-02-05 15:43:05 +00:00
rzou	1bb977a2a4	[auto_functionalized] Support `Tensor(a!)[]?` (#145400 ) Summary: This is just updating some of the checks to allow the Tensor(a!)[]? type through. Fixes #144072 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145400 Approved by: https://github.com/laithsakka	2025-02-05 14:52:39 +00:00
PyTorch MergeBot	282d185ec1	Revert "[inductor] use ftz variant of exp (#146216 )" This reverts commit b0b3fe8bcf00f30513e9bb3e197ea4cbcc2beef0. Reverted https://github.com/pytorch/pytorch/pull/146216 on behalf of https://github.com/atalman due to inductor/test_op_completeness.py::TestOpCompleteness::test_triton_overrides [GH job link](https://github.com/pytorch/pytorch/actions/runs/13152430750/job/36702812599) [HUD commit link](`b0b3fe8bcf`) ([comment](https://github.com/pytorch/pytorch/pull/146216#issuecomment-2636961317))	2025-02-05 14:13:45 +00:00
Davide Italiano	8a2000fd42	[MPS] Implement support for zeta (both eager and inductor). (#146465 ) A test was failing in inductor (`test_pointwise_zeta`) -- and I realized the operation was missing also from eager. Implemented for both, leveraging the kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465 Approved by: https://github.com/malfet	2025-02-05 13:55:50 +00:00
Nichols A. Romero	fd0cd6a08f	[ROCm][TunableOp] Improve identification of fastest solution (#144942 ) This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300. Changes include: - An improved timer, StreamTimerNoSync - More aggressive skipping of slow solutions - Additional statistics that can be used for diagnostics PYTORCH_TUNABLEOP_VERBOSE=3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942 Approved by: https://github.com/jeffdaily	2025-02-05 11:16:49 +00:00
Simon Fan	e20b0c82d1	[ca] no longer require is_traceable annotations for c++ autograd functions (#146229 ) This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-02-05 08:49:17 +00:00
cyy	6293d1446b	[2/N] Remove NOLINT suppressions (#146402 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402 Approved by: https://github.com/soulitzer	2025-02-05 08:38:52 +00:00
bobrenjc93	e5ea7e9cdc	add support for capturing provenance of unary operations (#146413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-05 08:31:38 +00:00
Shunting Zhang	b0b3fe8bcf	[inductor] use ftz variant of exp (#146216 ) The exp op generated by Inductor is compiled by Triton to the following PTX snippet. ``` mul.f32 %f74, %f83, 0f3FB8AA3B; ex2.approx.f32 %f73, %f74; ``` But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as ``` mul.ftz.f32 %f2, %f1, 0f3FB8AA3B; ex2.approx.ftz.f32 %f3, %f2; ``` which uses the FTZ variant. Let Inductor generate the FTZ variant if the use_fast_math config is true. I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216 Approved by: https://github.com/jansel	2025-02-05 07:35:43 +00:00
clr	93d98aca31	inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122 ) If a nn.module getattr call throws, we should make sure that we don't crash with an internal error. Note that I couldn't figure out how to test this, so advice would be awesome. My best attempt is at https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122 Approved by: https://github.com/jansel	2025-02-05 05:49:32 +00:00
Angela Yi	eb832b7bcc	[export] Fix draft-export logging (#146106 ) Summary: Fix issue where the lazyTraceHandler does not exist Test Plan: CI Differential Revision: D68928070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146106 Approved by: https://github.com/yiming0416	2025-02-05 05:49:22 +00:00
PyTorch MergeBot	f242da41c7	Revert "move and fix logic to update unbacked bindings (#146115 )" This reverts commit 0144613e6ff6e018ca41085d1509dcceb80987f7. Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2635695958))	2025-02-05 04:51:39 +00:00
cyy	c6ea4425e5	Enable some tests on Windows (#146243 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146243 Approved by: https://github.com/albanD	2025-02-05 03:54:28 +00:00
PyTorch MergeBot	f35e60b21c	Revert "[cutlass backend] fix bug for accuminator dtype (#146356 )" This reverts commit 7c8ec84dab7dc10d4ef90afc93a49b97bbd04503. Reverted https://github.com/pytorch/pytorch/pull/146356 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some slow cutlass tests are failing ([comment](https://github.com/pytorch/pytorch/pull/146356#issuecomment-2635594712))	2025-02-05 03:01:50 +00:00
PyTorch MergeBot	3c0d2bc262	Revert "[Testing] Reduce `test_exp` flakiness (#146436 )" This reverts commit 4c5a9a5f949ef3019fc3ef095034ccfc973ff13d. Reverted https://github.com/pytorch/pytorch/pull/146436 on behalf of https://github.com/huydhn due to Some test_exp2 starts failing in trunk I think ([comment](https://github.com/pytorch/pytorch/pull/146436#issuecomment-2635591878))	2025-02-05 02:58:53 +00:00
Nikita Shulga	aafaf4016f	[MPS] Add error checking when dispatching kernel (#146458 ) Check that the thread-group size does not exceed the maximum thread-group size. Add a regression test to validate that. This makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458 Approved by: https://github.com/dcci	2025-02-05 02:56:40 +00:00
Ting Lu	9e45bc82e9	[aarch64] CUDA 12.8 aarch64 builds to nightly binaries (#146378 ) https://github.com/pytorch/pytorch/issues/145570 Adding Cuda 12.8 and keeping 12.6 for the sbsa build, supported CUDA_ARCH: 9.0, 10.0, 12.0 Refactor the binaries matrix for cuda sbsa build. Previously cuda-aarch64 was hardcoded to cuda 12.6. Now reads 12.6 and 12.8, new build naming example [manywheel-py3_9-cuda-aarch64-12_8-build](https://github.com/pytorch/pytorch/actions/runs/13132625006/job/36640885079?pr=146378#logs) TODO: once 12.8 is stable, remove 12.6 in sbsa Pull Request resolved: https://github.com/pytorch/pytorch/pull/146378 Approved by: https://github.com/atalman	2025-02-05 02:55:21 +00:00
Nikita Shulga	001ad5bef5	[MPSInductor] Scope-down test_prod running in MPS (#146460 ) Multi-stage reductions are not yet a thing, and the original `test_prod` just returned 0 for large reductions, so failures were reported as flaky ones. Running the same test with `MTL_DEBUG_LAYER=1` made the failure obvious: ``` 2025-02-04 11:51:30.034 Python[16594:289093] Metal API Validation Enabled test_prod (__main__.MPSBasicTests.test_prod) ... -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1266: failed assertion `(threadsPerThreadgroup.width(1) * threadsPerThreadgroup.height(2050) * threadsPerThreadgroup.depth(1))(2050) must be <= 1024. (device threadgroup size limit)' ``` Fixes https://github.com/pytorch/pytorch/issues/146430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146460 Approved by: https://github.com/dcci	2025-02-05 01:47:01 +00:00
Aaron Gokaslan	52aaadf379	[BE][Ez]: Enable ruff rule E731. use `def` instead of anonymous lambda (#146410 ) Not sure why this isn't enabled, only 1 fix is needed and it supports autofixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146410 Approved by: https://github.com/aorenste, https://github.com/albanD	2025-02-05 01:44:41 +00:00
Bert Maher	0e060342b6	[triton] Update pin to tip of 3.2 release (#145867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867 Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte, https://github.com/jansel	2025-02-05 01:42:33 +00:00
Michael Lazos	616ac94175	[Dynamo] Fix spammy optimizer warning (#146374 ) Fixes https://discuss.pytorch.org/t/torch-compile-optimizer-step-generates-excessive-warning-messages/216067/7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146374 Approved by: https://github.com/anijain2305	2025-02-05 01:03:49 +00:00
Haifeng Jin	8177fc4d33	Make regex error catching compatible with Python 3.12+. (#145945 ) In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object". The old regex would no longer catch the error. This PR makes it compatible with Python 3.12 while remaining backward compatible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945 Approved by: https://github.com/H-Huang	2025-02-05 00:57:36 +00:00
Henry Tsang	9d5bf38dec	[cpp_builder] refactor to reduce libcudart_static logs (#146394 ) Want to reduce logs from `log_msg = f'"libcudart_static.a" not found under {path}'`, which was added in https://github.com/pytorch/pytorch/pull/142175 Differential Revision: D69096354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146394 Approved by: https://github.com/benjaminglass1, https://github.com/chenyang78	2025-02-05 00:41:30 +00:00
PyTorch MergeBot	658e22d495	Revert "add support for capturing provenance of unary operations (#146413 )" This reverts commit bc33d993acdff2637bc6aee5e604fb969b11fc13. Reverted https://github.com/pytorch/pytorch/pull/146413 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but some export tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/146413#issuecomment-2635440261))	2025-02-05 00:32:40 +00:00
Angela Yi	6e03f4f90e	[export] Include metadata in FlatArgsAdapter (#146107 ) Summary: With https://github.com/pytorch/pytorch/pull/145956, which introduces storing a list of namedtuple field names when serializing, we now want to expose this list to the args adapter so that APS can utilize this information and remove extraneous inputs. Test Plan: No-op Differential Revision: D68928416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146107 Approved by: https://github.com/pianpwk	2025-02-05 00:29:58 +00:00
Jason Ansel	84ba9c6e78	[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282, #146297	2025-02-04 23:36:44 +00:00
Jason Ansel	7288950bcd	[inductor] Refactor CaptureIndexing into global scope (#146297 ) And inline SimplifyIndexing into CaptureIndexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282	2025-02-04 23:36:44 +00:00
Jason Ansel	b8a529cca1	[inductor] Minor compile time optimizations in DefaultHandler (#146282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257	2025-02-04 23:36:34 +00:00
Jason Ansel	d3dd3eeb7f	[inductor] Refactor op handlers part 5 (#146257 ) This makes OpHandler just a normal class using inheritance, and removes the typing workarounds that were needed because it wasn't one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255	2025-02-04 23:36:25 +00:00
Jason Ansel	7aced455c5	[inductor] Refactor op handlers part 4 (#146255 ) This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2. Some compile time wins from this as well: ``` 2025-02-02T19:46:32.2033010Z 2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2037575Z 2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones 2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50% 2025-02-02T19:46:32.2040131Z 2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2042188Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252, #146254	2025-02-04 23:36:17 +00:00
Jason Ansel	8e9bda8d89	[inductor] Refactor op handlers part 3 (#146254 ) Fixes type errors that arise from typing `V.ops`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226, #146235, #146252	2025-02-04 23:36:09 +00:00
Jason Ansel	13f0436abd	[inductor] Refactor op handlers part 2 (#146252 ) This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252 Approved by: https://github.com/yanboliang ghstack dependencies: #146225, #146226, #146235	2025-02-04 23:36:01 +00:00
Jason Ansel	67be5953fe	[inductor] Refactor op handlers part 1 (#146235 ) This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps. Interestingly this is a small compile time win: ``` ... WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50% WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50% WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226	2025-02-04 23:35:53 +00:00
Jason Ansel	ed03f9ca10	[inductor] Refactor CSEProxy into global scope (#146226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226 Approved by: https://github.com/shunting314 ghstack dependencies: #146225	2025-02-04 23:35:43 +00:00
Jason Ansel	5cac550ddf	[inductor] Finish typing common.py (#146225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225 Approved by: https://github.com/Skylion007	2025-02-04 23:35:33 +00:00
Henry Tsang	7c8ec84dab	[cutlass backend] fix bug for accuminator dtype (#146356 ) Will add unit tests for accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356 Approved by: https://github.com/Chillee	2025-02-04 22:10:17 +00:00
Sam Larsen	13e17aa106	Make the CUTLASS swizzle options configurable and default to 2. (#146088 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146088 Approved by: https://github.com/henrylhtsang, https://github.com/mlazos	2025-02-04 22:07:26 +00:00
Aidyn-A	aac0577796	[TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398 ) We have noticed a discrepancy between the ways `test_sparse_semi_structured.py` is called. In some of them the test falsely fails, because it attempts to run on the wrong backend. This happens because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS` as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart `8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)` When I run tests via pytest, just by sheer luck it calls `test_values_backend_cutlass_cuda` which sets the backend to CUTLASS `bb4bd5f00b/test/test_sparse_semi_structured.py (L475)` before `test_conversions_all_patterns_cuda_`: ``` test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s] [ 72%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s] [ 73%] test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s] [ 73%] ``` In this scenario everything is good. But when run as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is not the same, and it sets the cuSparseLt backend just before running `test_conversions_all_patterns_cuda_`, which causes failures: ``` test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok ... test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... 
FAIL test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398 Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy	2025-02-04 22:07:12 +00:00
Benjamin Glass	317dae95fa	cpp_wrapper: fix CPU cpp_wrapper and max-autotune tests (#145683 ) Both of these tests mostly failed due to incorrect assumptions about the generated code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145683 Approved by: https://github.com/desertfire ghstack dependencies: #145095, #145654, #145655	2025-02-04 22:05:59 +00:00
Benjamin Glass	e2a029054d	cpp_wrapper: enable all CPU repro tests (#145655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655 Approved by: https://github.com/desertfire ghstack dependencies: #145095, #145654	2025-02-04 22:05:59 +00:00
Benjamin Glass	9873319a42	cpp_wrapper: fix set_.source_Tensor lowering (#145654 ) Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654 Approved by: https://github.com/desertfire ghstack dependencies: #145095	2025-02-04 22:05:59 +00:00
Benjamin Glass	7c0fe7a045	cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095 ) Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095 Approved by: https://github.com/desertfire	2025-02-04 22:05:58 +00:00
Davide Italiano	09b0dfdc90	[metal] Add a missing cast to make the call to copysign unambiguous. (#146422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422 Approved by: https://github.com/Skylion007, https://github.com/Samkm0084	2025-02-04 22:04:25 +00:00
clr	4e194bbfd6	dynamo: fsdp throw unimplemented vs attribute error (#146188 ) Rather than throw a full exception for fsdp, instead just return unimplemented, and respect the user options (i.e. fullgraph, vs graph break). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146188 Approved by: https://github.com/jansel	2025-02-04 21:45:55 +00:00
Nikita Shulga	4c5a9a5f94	[Testing] Reduce `test_exp` flakiness (#146436 ) By setting `reference_in_float` to false, as `exp(a + b)` could yield significantly different results than `exp(a.half()+b.half())`, as one can see in the following example (which happens to use the random values generated by the macOS RNG for this test) ``` >>> import torch >>> x=torch.tensor(2.5599, dtype=torch.half) >>> y=torch.tensor(0.6970, dtype=torch.half) >>> (x + y).exp() tensor(26., dtype=torch.float16) >>> (x.float() + y.float()).exp() tensor(25.9799) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146436 Approved by: https://github.com/dcci	2025-02-04 21:24:08 +00:00
bobrenjc93	bc33d993ac	add support for capturing provenance of unary operations (#146413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-04 21:16:15 +00:00
Yanbo Liang	07b9fe0690	[Trace PyDispatcher] Add CustomFunctionHigherOrderOperatorVariable (#146272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146272 Approved by: https://github.com/zou3519 ghstack dependencies: #146270, #146271	2025-02-04 20:55:51 +00:00
bobrenjc93	d23e4f8109	use DTRACE_ENV_VAR as the trace logs directory if set (#146412 ) ``` (/home/bobren/local/a/pytorch-env) [7:47] devgpu035:/home/bobren/local/a/pytorch TORCH_DTRACE=/tmp/bb python r1.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146412 Approved by: https://github.com/angelayi ghstack dependencies: #145848	2025-02-04 20:54:28 +00:00
Aaron Gokaslan	7f65a20884	[BE]: Enable ruff SLOT checks (#146276 ) This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so that it doesn't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276 Approved by: https://github.com/aorenste	2025-02-04 19:18:23 +00:00
Nikita Shulga	3525b834f0	[MPSInductor] Implement `argmax`/`argmin` (#146429 ) TODOs: - Find test with NaN - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429 Approved by: https://github.com/dcci ghstack dependencies: #146423, #146428	2025-02-04 19:16:06 +00:00
bobrenjc93	c591ad0c03	dump partial fx graph to stderr when dynamo tracing fails with guard on data-dependent (#146296 ) As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging. The following code produces a data-dependent error ``` import torch from torch import nn # UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)). (Size-like symbols: u0) class Repro(nn.Module): def __init__(self): super().__init__() def forward(self, cache, update, pos): _, _, max_seq_len, _ = cache.shape _, _, seqlen, _ = update.shape pos_item = pos[0].item() # u0 torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 torch._check(pos_item >= 0) before = cache.narrow(2, 0, pos_item) # FAIL # Laith: why can't we make unbacked expressions size-like? after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # PASS end = torch.tensor(max_seq_len - pos_item - seqlen).item() after = cache.narrow(2, (pos_item + seqlen), end) return torch.cat([before, update, after], dim=2) repro = Repro() bsz = 1 n_heads = 4 max_seq_len = 512 head_dim = 64 seqlen = 5 pos_item = 1 cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim) update = torch.ones(bsz, n_heads, seqlen, head_dim) pos = torch.tensor([pos_item]) example_inputs = (cache, update, pos) torch.export.export(repro, example_inputs) ``` This is what it now prints out ``` class GraphModule(torch.nn.Module): def forward(self, L_cache_: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", L_update_: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", L_pos_: "i64[1][1]cpu"): l_cache_ = L_cache_ l_update_ = L_update_ l_pos_ = L_pos_ # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0 getitem: "i64[][]cpu" = l_pos_[0]; l_pos_ = None item: "Sym(u0)" = getitem.item(); getitem = None # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 add: 
"Sym(u0 + 5)" = item + 5 le: "Sym(u0 + 5 <= 512)" = add <= 512; add = None _check = torch._check(le); le = _check = None # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0) ge: "Sym(u0 >= 0)" = item >= 0 _check_1 = torch._check(ge); ge = _check_1 = None # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item) before: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = l_cache_.narrow(2, 0, item); before = None # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) add_1: "Sym(u0 + 5)" = item + 5 sub: "Sym(512 - u0)" = 512 - item; item = None sub_1: "Sym(507 - u0)" = sub - 5; sub = None narrow_1 = l_cache_.narrow(2, add_1, sub_1); l_cache_ = add_1 = sub_1 = narrow_1 = None Traceback (most recent call last): File "/data/users/bobren/a/pytorch/torch/_dynamo/utils.py", line 3075, in run_node return getattr(args[0], node.target)(*args[1:], **kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl output = self._dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl decomposition_table[func](*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward return self.as_strided(sizes, strides, storage_offset) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(*args, **kwargs) File 
"/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl entry = self._make_cache_entry(state, key, func, args, kwargs, output) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry output_info = self._get_output_info_for_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry synth_output = self._output_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry return self._get_output_tensor_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry empty.set_(storage, storage_offset, shape, stride) File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious r = self.shape_env.evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper return retlog(fn(*args, **kwargs)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr return self._evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr raise self._make_data_dependent_error( torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)). 
(Size-like symbols: u0) Caused by: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # r1.py:21 in forward (utils/_stats.py:27 in wrapper) For more information, run with TORCH_LOGS="dynamic" For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0" If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146296 Approved by: https://github.com/zou3519 ghstack dependencies: #146298	2025-02-04 19:12:39 +00:00
bobrenjc93	8f861a8dfb	[experimental] filter logs by subgraph (#146047 ) ``` TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[1/0]" python r4.py ``` ``` TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[0/0],[1/0_1]" python r4.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146047 Approved by: https://github.com/laithsakka	2025-02-04 19:11:44 +00:00
Nikita Shulga	7d60235aa6	[Metal] Small speedup for `sum`/`prod` (#146428 ) As they can not really be invoked over empty arrays Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #146423	2025-02-04 19:10:33 +00:00
Nikita Shulga	b1663b31e1	[Metal][BE] Add `#pragma once` to all headers (#146423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-04 19:10:33 +00:00
Aaron Gokaslan	292af3cc89	[BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408 ) Apply ruff rule about implicit string concatenation, this autofixes strings that are all the same type and on the same line. These lines are broken up likely as the result of autoformatters in the past. All fixes are automated using the autofixes in ISC001. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2025-02-04 19:07:04 +00:00
rzou	f38a2ea0d4	[Dynamo] Better unsupported message for Fake Tensor Exception (#146357 ) I cannot repro this. But this line shows up in internal logs, and I want to know what the exception is and the context inside it. All of the exceptions_allowed_to_be_fallback are dataclasses, so they should print nicely. Test Plan: - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/146357 Approved by: https://github.com/williamwen42	2025-02-04 18:52:11 +00:00
Yidi Wu	b0fe975521	[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 ) Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is only used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally. This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456 Approved by: https://github.com/desertfire	2025-02-04 18:47:34 +00:00
albanD	157d81c201	Move get accelerator to use build time flags when possible (#146098 ) This PR does two main things (they are in a single PR to show how the newly added APIs are used). - Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantics - Use the newly added isBuilt for the accelerator check to ensure it does not poison fork Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-02-04 18:23:24 +00:00
Sam Larsen	23fffb54d5	Use OrderedSet in _functorch/partitioners (#146102 ) In an attempt to make partitioning more deterministic, change all sets in partitioners.py to OrderedSets. Note that this change does not fix the non-determinism we're seeing in the internal model. But let's at least eliminate this potential source of non-determinism before investigating any changes to the mincut approach? Pull Request resolved: https://github.com/pytorch/pytorch/pull/146102 Approved by: https://github.com/oulgen	2025-02-04 17:43:07 +00:00
Bin Bao	53759ccca8	[AOTI] Fix an unaligned memory access issue in mm_template (#146293 ) Summary: Fixes a corner case in the Triton MM template, where the dimension M (dynamic size) being smaller than BLOCK_M (similarly for the N dimension) can trigger an unaligned memory access error. Differential Revision: D69034578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146293 Approved by: https://github.com/chenyang78, https://github.com/jansel	2025-02-04 17:12:04 +00:00
nikitaved	87a63a9886	Add `@nikitaved` to torch.linalg `CODEOWNERS/persons_of_interest` (#141803 ) As per title. I hope there is no objection :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141803 Approved by: https://github.com/albanD	2025-02-04 16:11:31 +00:00
Jason Ansel	e9f6e273e7	[inductor] Add typing to common.CSE (#145993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993 Approved by: https://github.com/yanboliang ghstack dependencies: #145916	2025-02-04 16:05:39 +00:00
Jason Ansel	7a5239afd7	[inductor] Add typing to common.KernelArgs (#145916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916 Approved by: https://github.com/yanboliang	2025-02-04 16:05:39 +00:00
Nikita Shulga	5d81bc3696	[MPSInductor] Implement `prod` reduction (#146396 ) Mostly reusing `sum` reduction logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396 Approved by: https://github.com/dcci ghstack dependencies: #146369, #146370, #146380, #146389	2025-02-04 14:08:04 +00:00
Nikita Shulga	bbe95341d9	[MPSInductor] Implement `min` and `max` reductions (#146389 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370, #146380	2025-02-04 14:04:10 +00:00
PyTorch MergeBot	106acf0eec	Revert "[aoti] Assign proxy call args by name, and support default values. (#146263 )" This reverts commit 11f69808c64a65c68a4452250ba7719dcff27c78. Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))	2025-02-04 12:57:55 +00:00
Nichols A. Romero	e0f22e54e8	[ROCm][TunableOp] Support leading dimensions in TunableOp signature. (#146358 ) This is a feature enhancement that: - May improve performance by distinguishing GEMMs with different leading dimensions. - Fixes correctness issues reported by users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146358 Approved by: https://github.com/jeffdaily	2025-02-04 10:27:43 +00:00
cyy	3f63f2bced	Use std::string_view in tests (#146120 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146120 Approved by: https://github.com/albanD	2025-02-04 09:51:36 +00:00
Angela Yi	8444fe019a	[export] Fix requires_grad deserialization (#146351 ) Test Plan: CI Differential Revision: D69072095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146351 Approved by: https://github.com/zhxchen17	2025-02-04 08:02:38 +00:00
Davide Italiano	bb4bd5f00b	[Metal][BE] Fix the arguments of `polygamma` (#146382 ) In the public API, order comes before input, while here they're reversed. Match for consistency (and make this less error prone). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382 Approved by: https://github.com/jansel, https://github.com/malfet	2025-02-04 06:40:34 +00:00
Nikita Shulga	54ceb7c565	[MPSInductor] Add support for `sum` reduction (#146380 ) - Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses barrier to compute the reductions TODOs: - Implement efficient reduction using cooperative functions such as `simd_shuffle_down` - Figure out how to merge several sum reduction together - Implement `reduction_store` that will only write results from the first thread Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370	2025-02-04 06:23:44 +00:00
cyy	1c16cf70c3	Apply ruff fixes to tests (#146140 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146140 Approved by: https://github.com/albanD	2025-02-04 05:41:01 +00:00
cyy	71e3575525	Remove unactivated test (#146233 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146233 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-04 05:26:04 +00:00
Brian Hirsh	e68f5087d8	update _unsafe_set_version_counter to accept lists of tensors (#137921 ) See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without suffering the overhead of going from python->C++ separately for every tensor in the list. I left the binding in pybind, and used a `std::vector`. If we really need to optimize overhead even further, we could write a manual cpython binding. I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921 Approved by: https://github.com/awgu, https://github.com/albanD	2025-02-04 04:51:11 +00:00
Sheng Fu	425aca40a4	Fix random crash in PyPer (#146327 ) Summary: PyPer saw random crashes when writing into the ET file. This diff checks whether the output file is in good condition before writing into it, and catches the exception if something goes wrong instead of crashing. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA Differential Revision: D69065509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327 Approved by: https://github.com/sraikund16	2025-02-04 04:50:40 +00:00
angelayi	0c37c332da	[export] Additionally save pytree namedtuple field names (#145956 ) If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context="class_fqn", children_spec=[, ])` The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field? This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able to just drop those additional fields and the program will run successfully. This is needed/used in APS, where they use the unflattener's adapter to adapt the inputs based on the previously saved treespecs. There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956 Approved by: https://github.com/zhxchen17	2025-02-04 04:42:30 +00:00
Animesh Jain	487400f47f	[dynamo] Support functools.partial variables through inspect.signature (#146339 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146339 Approved by: https://github.com/jansel ghstack dependencies: #146322, #146116	2025-02-04 04:39:39 +00:00
Justin Chu	9756c7d788	[benchmark] Remove ONNX (#146325 ) The ONNX exporter experiments in benchmark are obsolete and unmaintained. This PR removes them to unblock https://github.com/pytorch/pytorch/pull/146003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325 Approved by: https://github.com/titaiwangms	2025-02-04 04:02:47 +00:00
Doru Bercea	a79d8f8ba4	[ROCm] Tune 3d tensor sums when not using fastest dimension (#146170 ) Tune 3d tensor sums when not using fastest dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146170 Approved by: https://github.com/jeffdaily	2025-02-04 04:02:16 +00:00
David Berard	7997ecf809	[BE] reduce log spew from test_triton_kernels.py (#145895 ) One of the tests in this file was setting `self._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file. This PR puts the log-setting in a context manager so that the old behavior is restored afterwards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895 Approved by: https://github.com/nmacchioni	2025-02-04 03:44:23 +00:00
Animesh Jain	5f53889850	[dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116 Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #146322	2025-02-04 03:36:07 +00:00
Ke Wen	762a05b3b3	[DCP] Remove all-gather of state dict keys (#145998 ) The original `_all_gather_keys` call was for a safety check, but could be costly as things scale, and it blocks CPU. Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys, otherwise the API may hang. In addition, we move the check to a utility function: `utils.assert_same_keys`. Users who are unsure whether their state dicts have the same keys can optionally call this API to check. Resolves #145965 (as a workaround). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998 Approved by: https://github.com/mhorowitz, https://github.com/fegin	2025-02-04 03:16:13 +00:00
PyTorch MergeBot	7f796eb8b7	Revert "[inductor] Add typing to common.KernelArgs (#145916 )" This reverts commit 68cf36d5ab6165372160f65eb84e13d0f8dbc5dc. Reverted https://github.com/pytorch/pytorch/pull/145916 on behalf of https://github.com/atalman due to Failing internally, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/145916#issuecomment-2632715678))	2025-02-04 03:07:12 +00:00
PyTorch MergeBot	d3c7e4bb9c	Revert "[inductor] Add typing to common.CSE (#145993 )" This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7. Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))	2025-02-04 03:04:01 +00:00
PyTorch MergeBot	ecbc725fad	Revert "[inductor] Finish typing common.py (#146225 )" This reverts commit 3a67c0e48d29578aeeaa872275e730020bb5cbc2. Reverted https://github.com/pytorch/pytorch/pull/146225 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146225#issuecomment-2632709707))	2025-02-04 03:01:36 +00:00
PyTorch MergeBot	0061eb5b70	Revert "[inductor] Refactor CSEProxy into global scope (#146226 )" This reverts commit 18380ab877711f2e651c69c78675f0d0b31d2ceb. Reverted https://github.com/pytorch/pytorch/pull/146226 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146226#issuecomment-2632707618))	2025-02-04 02:58:50 +00:00
cyy	f397c72697	Remove NOLINTNEXTLINE (#146238 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238 Approved by: https://github.com/albanD	2025-02-04 02:45:32 +00:00
Nikita Shulga	5451c9b7c9	[MPSInductor] Add support for any reduction (#146370 ) - Add `_new_accvar` function that creates a threadgroup variable - As threadgroup variables can not be initialized in place, add explicit initialization for reduction var Pull Request resolved: https://github.com/pytorch/pytorch/pull/146370 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #146369	2025-02-04 02:45:03 +00:00
Nikita Shulga	71179772cd	[MPSInductor] Prep change for reduction support (#146369 ) Add `group_pos` parameter and set `group_size` when invoking reduction kernels. Separate loads and stores, and insert a threadgroup barrier if the reduction is in place. Should be a no-op right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146369 Approved by: https://github.com/dcci, https://github.com/jansel	2025-02-04 02:38:07 +00:00
Henry Tsang	3dcbd04d1d	[cutlass backend] Add instantiation level for generating configs (#146230 ) Passing through instantiation level to generate more configs. I do see some C++ compilation errors, but running is fine. Using 2222 generates 1k+ configs. Differential Revision: D68989194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146230 Approved by: https://github.com/Chillee, https://github.com/mlazos	2025-02-04 02:36:04 +00:00
bobrenjc93	0e49f35e3d	Integrate sympy expression provenance logging with structured logs (#145848 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145848 Approved by: https://github.com/angelayi	2025-02-04 01:21:37 +00:00
Aaron Orenstein	4168982dad	PEP585: .github release triggers (#145708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145708 Approved by: https://github.com/malfet	2025-02-04 01:02:46 +00:00
Davide Italiano	cf6c5b8fa8	[mps/inductor] Adjust more tests that expect float64 as input. (#146366 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146366 Approved by: https://github.com/malfet	2025-02-04 00:48:02 +00:00
PyTorch MergeBot	2f40f789da	Revert "[inductor] Refactor op handlers part 1 (#146235 )" This reverts commit 204be4e0a2e4509bd2457bfb295c429dd92c241f. Reverted https://github.com/pytorch/pytorch/pull/146235 on behalf of https://github.com/atalman due to Breaks lint, sorry: Definition of polygamma in base class MetalOverrides is incompatible with definition in base class OpsHandler. Please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/146235#issuecomment-2632444514))	2025-02-04 00:00:08 +00:00
Stas Bekman	3aeccf2a28	DeepSpeed github repo move sync (#146320 ) DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed This PR updates this repo to use the new URL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320 Approved by: https://github.com/awgu	2025-02-03 23:20:49 +00:00
Jason Ansel	204be4e0a2	[inductor] Refactor op handlers part 1 (#146235 ) This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps. Interestingly this is a small compile time win: ``` ... WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50% WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50% WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results. please update all results that changed significantly, and not only the failed ones WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235 Approved by: https://github.com/shunting314 ghstack dependencies: #146225, #146226	2025-02-03 23:15:13 +00:00
Jason Ansel	18380ab877	[inductor] Refactor CSEProxy into global scope (#146226 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226 Approved by: https://github.com/shunting314 ghstack dependencies: #146225	2025-02-03 23:15:13 +00:00
Natalia Gimelshein	0bc036a9e9	use copy2d in h2d/d2h copy when possible (#146256 ) A rewrite of #138964. In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964: 1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d 2) copy2d should record even for the host pinned memory, like the regular copy does 3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking. In this PR copy2d behaves in exactly the same way as copy does with respect to those additional syncs, except it calls a different underlying cuda call. Tests are added for multiple cases that go through copy2d and for cases that avoid copy2d because its conditions are not satisfied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 23:07:54 +00:00
Henry Tsang	35af193408	[easy] Add type annotation for autotune_num_choices_displayed (#146323 ) Test Plan: ci Differential Revision: D69064447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146323 Approved by: https://github.com/ColinPeppler	2025-02-03 23:04:21 +00:00
Davide Italiano	0463cb6ca5	[mps/inductor] Add support for digamma(). (#146292 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146292 Approved by: https://github.com/malfet, https://github.com/jansel	2025-02-03 22:48:13 +00:00
titaiwangms	178531c95e	[ONNX] torch.onnx.export(dynamo=True) changes optimization to default (#146187 ) Fixes #145897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146187 Approved by: https://github.com/justinchuby	2025-02-03 22:44:54 +00:00
bobrenjc93	d69c181d77	log out partial fx graph when guard on data dependent during non-strict tracing (#146298 ) As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging. The following code produces a data-dependent error ``` import torch from torch import nn # UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)). (Size-like symbols: u0) class Repro(nn.Module): def __init__(self): super().__init__() def forward(self, cache, update, pos): _, _, max_seq_len, _ = cache.shape _, _, seqlen, _ = update.shape pos_item = pos[0].item() # u0 torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 torch._check(pos_item >= 0) before = cache.narrow(2, 0, pos_item) # FAIL # Laith: why can't we make unbacked expressions size-like? after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) # PASS end = torch.tensor(max_seq_len - pos_item - seqlen).item() after = cache.narrow(2, (pos_item + seqlen), end) return torch.cat([before, update, after], dim=2) repro = Repro() bsz = 1 n_heads = 4 max_seq_len = 512 head_dim = 64 seqlen = 5 pos_item = 1 cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim) update = torch.ones(bsz, n_heads, seqlen, head_dim) pos = torch.tensor([pos_item]) example_inputs = (cache, update, pos) torch.export.export(repro, example_inputs, strict=False) ``` This is what it now prints out ``` class GraphModule(torch.nn.Module): def forward(self, arg0_1: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", arg1_1: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", arg2_1: "i64[1][1]cpu"): # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0 select: "i64[][]cpu" = torch.ops.aten.select.int(arg2_1, 0, 0); arg2_1 = None item: "Sym(u0)" = torch.ops.aten.item.default(select); select = None # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507 add: "Sym(u0 
+ 5)" = item + 5 le: "Sym(u0 + 5 <= 512)" = add <= 512; add = le = None # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0) ge: "Sym(u0 >= 0)" = item >= 0; ge = None # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item) narrow: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = torch.ops.aten.narrow.default(arg0_1, 2, 0, item); narrow = None # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) add_1: "Sym(u0 + 5)" = item + 5 sub: "Sym(512 - u0)" = 512 - item; item = None sub_1: "Sym(507 - u0)" = sub - 5; sub = None narrow_1 = torch.ops.aten.narrow.default(arg0_1, 2, add_1, sub_1); arg0_1 = add_1 = sub_1 = narrow_1 = None Traceback (most recent call last): File "/data/users/bobren/a/pytorch/r1.py", line 45, in <module> torch.export.export(repro, example_inputs, strict=False) File "/data/users/bobren/a/pytorch/torch/export/__init__.py", line 368, in export return _export( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper raise e File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper ep = fn(*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper return fn(*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 2079, in _export return _export_for_training( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper raise e File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper ep = fn(*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper return fn(*args, **kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1944, in _export_for_training export_artifact = export_func( # type: ignore[operator] File "/data/users/bobren/a/pytorch/torch/export/_trace.py", 
line 1879, in _non_strict_export aten_export_artifact = _to_aten_func( # type: ignore[operator] File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1665, in _export_to_aten_ir_make_fx gm, graph_signature = transform(_make_fx_helper)( File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1809, in _aot_export_non_strict gm, sig = aot_export(wrapped_mod, args, kwargs=kwargs, flags) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1585, in _make_fx_helper gm = make_fx( File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2194, in wrapped return make_fx_tracer.trace(f, args) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2132, in trace return self._trace_inner(f, args) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2103, in _trace_inner t = dispatch_trace( File "/data/users/bobren/a/pytorch/torch/_compile.py", line 51, in inner return disable_fn(args, kwargs) File "/data/users/bobren/a/pytorch/torch/_dynamo/eval_frame.py", line 749, in _fn return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1136, in dispatch_trace graph = tracer.trace(root, concrete_args) # type: ignore[arg-type] File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1692, in trace res = super().trace(root, concrete_args) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 834, in trace (self.create_arg(fn(args)),), File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1191, in wrapped out = f(tensors) # type:ignore[call-arg] File "<string>", line 1, in <lambda> File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1488, in wrapped_fn return tuple(flat_fn(args)) File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/utils.py", line 184, in flat_fn tree_out = fn(args, kwargs) File 
"/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 879, in functional_call out = mod(args[params_len:], *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper return self.call_module(mod, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module return Tracer.call_module(self, m, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module ret_val = forward(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward return _orig_module_call(mod, args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl return forward_call(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1793, in forward tree_out = mod(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper return self.call_module(mod, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module return Tracer.call_module(self, m, forward, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module ret_val = forward(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward return _orig_module_call(mod, args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl return forward_call(args, *kwargs) File "/data/users/bobren/a/pytorch/r1.py", line 21, in forward after = 
cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1239, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1286, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_export/non_strict_utils.py", line 654, in __torch_function__ return func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_ops.py", line 866, in handler return torch._library.utils.handle_dispatch_mode( File "/data/users/bobren/a/pytorch/torch/_library/utils.py", line 296, in handle_dispatch_mode return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1341, in __torch_dispatch__ return proxy_call(self, func, self.pre_dispatch, args, kwargs) File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 910, in proxy_call out = func(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_ops.py", line 749, in __call__ return self._op(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl output = self._dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl decomposition_table[func](args, *kwargs) File 
"/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward return self.as_strided(sizes, strides, storage_offset) File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(args, *kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl entry = self._make_cache_entry(state, key, func, args, kwargs, output) File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry output_info = self._get_output_info_for_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry synth_output = self._output_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry return self._get_output_tensor_from_cache_entry( File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry empty.set_(storage, storage_offset, shape, stride) File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious r = self.shape_env.evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper return retlog(fn(args, **kwargs)) File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr return self._evaluate_expr( File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr raise self._make_data_dependent_error( torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could 
not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)). (Size-like symbols: u0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146298 Approved by: https://github.com/bdhirsh	2025-02-03 22:16:03 +00:00
Animesh Jain	0da07a6d1d	[dynamo][skip-function] Add missing unimplemented line (#146322 ) This is a missing line from the merged PR in the stack below. Let's try to get this in quickly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146322 Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos	2025-02-03 22:11:55 +00:00
PyTorch MergeBot	00dc5b10f6	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit 2fd1b6b3610eb84cd615360a8fd23756a7f2e743. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))	2025-02-03 22:04:28 +00:00
Yanbo Liang	15e12d5ec3	[Trace PyDispatcher] Support temporarily_pop_interpreter_stack ctx manager (#146271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146271 Approved by: https://github.com/zou3519 ghstack dependencies: #146270	2025-02-03 21:47:54 +00:00
Yanbo Liang	bd8d7b1b74	[Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146270 Approved by: https://github.com/zou3519	2025-02-03 21:47:54 +00:00
Yang Wang	fd73ae2068	[Utilization] Convert timestamp to str for datetime64 (#145985 ) Convert all float timestamps to int timestamps in the data pipeline for the db type datetime64. Floats do not work when inserting into ClickHouse using jsonExtract. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985 Approved by: https://github.com/huydhn	2025-02-03 21:05:18 +00:00
Simon Fan	1d4adf4e1f	[dynamo] log recompile reason to dynamo_compile (#146117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146117 Approved by: https://github.com/bobrenjc93	2025-02-03 21:04:04 +00:00
Zhengxu Chen	11f69808c6	[aoti] Assign proxy call args by name, and support default values. (#146263 ) Fixing the following issue when compiling the following program: ``` window = torch.hann_window(N_FFT).to(x.device) stft = torch.stft( x, N_FFT, HOP_LENGTH, window=window, return_complex=True ) magnitudes = stft[..., :-1].abs() ** 2 return magnitudes ``` ``` Traceback (most recent call last): File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor yield File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run self._callTestMethod(testMethod) File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper method(args, *kwargs) File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test return value(self) ^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft self.check_model(model, example_inputs) File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model actual = AOTIRunnerUtil.run( ^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run optimized = AOTIRunnerUtil.load(device, so_path) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load return torch._export.aot_load(so_path, device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device) # type: ignore[assignment, call-arg] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type 
for argument 1 but got as_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263 Approved by: https://github.com/angelayi	2025-02-03 20:15:59 +00:00
Henry Tsang	e67ce67498	[cutlass backend] update try_import_cutlass to accommodate pip install (#145891 ) The goal of this PR is to provide 3 ways for people to try out CUTLASS backend: 1. fbcode / internal 2. pip install torch (nightly) and pip install nvidia-cutlass 3. build from source I will go into more detailed combos between building from source and downloading via pip for torch and cutlass. repro: ``` import torch import torch.nn as nn import torch._inductor.config as config config.force_disable_caches = True config.max_autotune = True config.max_autotune_gemm_backends = "CUTLASS" # the following is only needed if you use a custom cutlass library # config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass" class TestModule(nn.Module): def forward(self, A, B): return A @ B model = TestModule().cuda() M, K, N = 2048, 2048, 2048 A = torch.randn(M, K).cuda().half() B = torch.randn(K, N).cuda().half() C = torch.compile(model, fullgraph=True)(A, B) ``` ## pre-requisite This assumes you have the right CUDA toolkit; 12.4 is recommended. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are good. ## combo 1: pip install torch + pip install nvidia-cutlass Check https://pytorch.org/get-started/locally/ for nightly install command. ``` pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124 pip install nvidia-cutlass ``` Then try running the script above. It should work. ## combo 2: build torch from source + pip install nvidia-cutlass This is going to be pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences. ## combo 3: build torch from source + use pytorch/third_party/cutlass This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own. 
## combo 4: any torch version + cutlass library from somewhere else This is probably the only case where you need to pass in cutlass_dir. Just set cutlass_dir to the cutlass repo library. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891 Approved by: https://github.com/Chillee, https://github.com/ColinPeppler	2025-02-03 20:05:41 +00:00
Isalia20	f237172768	Fix not inlining functions used in metal files (#146316 ) Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231 ``` FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')! [3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o ninja: build stopped: subcommand failed. ``` AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316 Approved by: https://github.com/malfet, https://github.com/atalman	2025-02-03 19:33:52 +00:00
Yidi Wu	674e0b668a	Add non-strict export while_loop test back (#146195 ) This is fixed by https://github.com/pytorch/pytorch/pull/145762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146195 Approved by: https://github.com/zou3519 ghstack dependencies: #146194	2025-02-03 19:28:22 +00:00
Yidi Wu	1138d0c4f6	[hop] enable while_loop return torch.ones with unbacked symbol expression. (#146194 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146194 Approved by: https://github.com/zou3519	2025-02-03 19:28:22 +00:00
Animesh Jain	57b1fc35f6	[dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146219 Approved by: https://github.com/zou3519 ghstack dependencies: #146075, #146283	2025-02-03 18:02:48 +00:00
PyTorch MergeBot	64fc9ff09c	Revert "[ONNX] Create deprecation warning on dynamo_export (#146003 )" This reverts commit e6c39d37e90242692cf25ea849abd47d11932cd7. Reverted https://github.com/pytorch/pytorch/pull/146003 on behalf of https://github.com/atalman due to Broke internally ([comment](https://github.com/pytorch/pytorch/pull/146003#issuecomment-2631599314))	2025-02-03 17:17:14 +00:00
Tugsbayasgalan Manlaibaatar	041e08f9dc	Add buffers to parameterization rule (#145991 ) Differential Revision: [D68959513](https://our.internmc.facebook.com/intern/diff/D68959513) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145991 Approved by: https://github.com/bdhirsh	2025-02-03 16:49:03 +00:00
PyTorch MergeBot	c0979d72b5	Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 )" This reverts commit 68a363548409a3ff17965770304ee5e12fe718d9. Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))	2025-02-03 16:25:58 +00:00
Harmen Stoppels	01554c7b5a	fix incorrect literal strings / accidental tuples (#146037 ) * `expr,` is short for `(expr,)` * literal strings over multiple lines need to escape the newline `\` or use `(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037 Approved by: https://github.com/Skylion007	2025-02-03 15:08:11 +00:00
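The two pitfalls named in the commit above can be sketched in plain Python. This is an illustrative example only; the variable and function names are hypothetical and are not taken from the PR's diff.

```python
# Hypothetical names, for illustration only (not from the PR).

# Pitfall 1: a trailing comma makes a one-element tuple.
timeout = 30,        # accidental tuple: (30,)
fixed_timeout = 30   # the intended int


# Pitfall 2: a string literal on its own line is a separate statement,
# so it is not concatenated with the literal on the previous line.
def broken_message():
    msg = "first part "
    "second part"    # no-op expression statement, silently dropped
    return msg


def fixed_with_parens():
    # Adjacent literals inside parentheses are concatenated.
    msg = (
        "first part "
        "second part"
    )
    return msg


def fixed_with_backslash():
    # Escaping the newline keeps both literals in one logical line.
    msg = "first part " \
          "second part"
    return msg
```

Both fixes rely on implicit concatenation of adjacent string literals, which only applies when the literals belong to the same logical line.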
PyTorch UpdateBot	550441a87b	Update slow tests (#146301 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146301 Approved by: https://github.com/pytorchbot	2025-02-03 11:37:16 +00:00
Isuru Fernando	08b14936ae	Disable has_relational_guards check for dict_tag optimization for now (#146232 ) has_relational_guards evaluates to true almost always, and leads to a slowdown in guards runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232 Approved by: https://github.com/anijain2305	2025-02-03 07:56:06 +00:00
Isalia20	e3643e1e0e	[MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279 ) Requested in #77764. This PR adds support for linalg.det on MPS and fixes lu factor for non-contiguous tensors. The current implementation crashed on any kind of non-contiguous tensor with an error: ``` -[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer' zsh: abort python det.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 06:06:43 +00:00
Zhengxu Chen	1580f47bf4	[export][ez] Fix generated header file. (#146208 ) Summary: as title. Test Plan: CI Differential Revision: D68978788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208 Approved by: https://github.com/yiming0416	2025-02-03 06:01:05 +00:00
cyy	7b512095ef	Enable some tests on MacOS (#146268 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146268 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-03 05:04:24 +00:00
Animesh Jain	fa48757180	[dynamo] misc fixes for inspect (#146283 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146283 Approved by: https://github.com/jansel ghstack dependencies: #146075	2025-02-03 04:26:10 +00:00
cyy	6ac8bc0cd2	Remove unused import in tests (#146266 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146266 Approved by: https://github.com/Skylion007	2025-02-03 03:40:18 +00:00
Davide Italiano	d80eef7c6d	[inductor] Guard a member variable with a define. (#146278 ) It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind: /Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field] 412 | size_t blob_size_; | ^ 1 warning generated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278 Approved by: https://github.com/Skylion007, https://github.com/jansel	2025-02-03 02:20:08 +00:00
Animesh Jain	c0ec2e0a0d	[dynamo][functions] Improve getattr on functions (#146075 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146075 Approved by: https://github.com/jansel	2025-02-03 02:01:57 +00:00
Davide Italiano	d28fe3ed47	[metal] Move digamma to special_math.h (#146284 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284 Approved by: https://github.com/jansel	2025-02-03 01:29:14 +00:00
Davide Italiano	1f21f699ba	[metal] Refactor digamma in preparation for moving it. (#146281 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146281 Approved by: https://github.com/jansel	2025-02-02 23:54:45 +00:00
Yanbo Liang	511d0dd558	[Dynamo][Trace PyDispatcher] Support calling id function over class (#146269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146269 Approved by: https://github.com/anijain2305	2025-02-02 22:29:30 +00:00
Lancelot Normand	02fd4868d6	Fix unreachable code (#146262 ) Fixes #146261 Removed unreachable code Pull Request resolved: https://github.com/pytorch/pytorch/pull/146262 Approved by: https://github.com/Skylion007	2025-02-02 21:35:26 +00:00
Isalia20	5d55a6585d	[MPS] lu factor ex implementation (#144651 ) Implements `torch.linalg.lu_factor_ex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651 Approved by: https://github.com/malfet	2025-02-02 15:09:49 +00:00
Avik Chaudhuri	0144613e6f	move and fix logic to update unbacked bindings (#146115 ) Summary: Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here). This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde. Test Plan: added test Differential Revision: D68880766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115 Approved by: https://github.com/pianpwk	2025-02-02 10:43:55 +00:00
PyTorch UpdateBot	a44a8a7d3a	[audio hash update] update the pinned audio hash (#145988 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145988 Approved by: https://github.com/pytorchbot	2025-02-02 04:19:29 +00:00
cyy	8543d8395b	[2/N] Enable ruff F841 on distributed tests (#146132 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146132 Approved by: https://github.com/Skylion007, https://github.com/rec	2025-02-02 03:44:48 +00:00
Animesh Jain	cef856faa9	[dynamo][enum] Trace through enum.py for enum construction (#146070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146070 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198, #146258, #146214	2025-02-02 03:12:36 +00:00
Animesh Jain	31fb691782	[dynamo] Graph break on tensor.retain_grad (#146214 ) Fixes https://github.com/pytorch/pytorch/issues/146212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146214 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198, #146258	2025-02-02 03:12:36 +00:00
Animesh Jain	529eb8d558	[dynamo] Add return to python_type (#146258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146258 Approved by: https://github.com/jansel ghstack dependencies: #146062, #146198	2025-02-02 03:12:36 +00:00
Davide Italiano	7854299b27	[mps/inductor] Implement support for polygamma(). (#146259 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259 Approved by: https://github.com/jansel	2025-02-02 01:54:23 +00:00
Burak Turk	d89c7ea401	add WaitCounter type interface and get rid of type errors (#146175 ) Summary: as titled Differential Revision: D68960123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146175 Approved by: https://github.com/andriigrynenko, https://github.com/Skylion007	2025-02-01 23:24:52 +00:00
Jason Ansel	3a67c0e48d	[inductor] Finish typing common.py (#146225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225 Approved by: https://github.com/Skylion007	2025-02-01 22:53:35 +00:00
Davide Italiano	dca5cc0255	[mps] Move polygamma to special_math.h. (#146253 ) In preparation to implement it in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 21:45:23 +00:00
Aaron Gokaslan	07dbd539b4	[BE][Ez]: Make c10/special arrays constexpr (#146246 ) No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246 Approved by: https://github.com/dcci, https://github.com/malfet	2025-02-01 21:03:18 +00:00
Davide Italiano	d4ad7b91ad	[mps] Move zeta() to special_math.h. (#146231 ) In preparation for implementing digamma/polygamma Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 19:22:59 +00:00
Sahdev Zala	f97307f463	[Docs] Add clarification for target types in CrossEntropyLoss doc (#145444 ) CrossEntropyLoss requires that targets given as class indices be of a long dtype and that targets given as class probabilities be of a float dtype. It distinguishes the two cases (indices and probabilities) by comparing shapes: when the input and target shapes are the same, the target is treated as class probabilities; otherwise it is treated as class indices, as already covered in the doc. The related code is here, https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624 I think the current documentation is great, but it seems it can confuse users about types, as reported in the issues, so this PR adds a bit more clarification. Fixes #137188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145444 Approved by: https://github.com/mikaylagawarecki	2025-02-01 18:55:58 +00:00
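A minimal sketch of the two target forms described in the commit above, assuming a small batch of made-up logits (the values are illustrative and not from the PR). With one-hot probabilities, both forms should give the same loss.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Made-up logits for illustration: batch of N=2 samples, C=3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

# Form 1: class indices, shape (N,), dtype torch.long.
target_idx = torch.tensor([0, 2])
loss_idx = loss_fn(logits, target_idx)

# Form 2: class probabilities, same shape as the input (N, C), float dtype.
# One-hot rows here encode the same labels as target_idx.
target_prob = torch.tensor([[1.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0]])
loss_prob = loss_fn(logits, target_prob)

print(loss_idx.item(), loss_prob.item())
```

The shape of the target, not its dtype, selects the code path, which is why each form needs the matching dtype.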
Nikita Shulga	5ed5793016	Temp disable MKL in DistributionKernels.cpp (#146174 ) Until https://github.com/pytorch/pytorch/issues/132395 is addressed Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential ) ```python import torch high_bits_for_seed = 16000000000000000000 # to use "good quality" seed _ = torch.manual_seed (high_bits_for_seed + 2024) prob = torch.ones (26) dups_mult = 0 perm_counts_mult = {} for _ in range (1_000_000): p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist()) if p in perm_counts_mult: dups_mult += 1 perm_counts_mult[p] += 1 else: perm_counts_mult[p] = 1 print ('duplicate multinomial perms: ', dups_mult) print ('multiple multinomial perms: ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item()) print ('max of perm_counts_mult: ', torch.tensor (list (perm_counts_mult.values())).max().item()) print ('len (perm_counts_mult): ', len (perm_counts_mult)) ``` This is a reland of https://github.com/pytorch/pytorch/pull/132532 but excluding internal builds that already have some hardcoded values Pull Request resolved: https://github.com/pytorch/pytorch/pull/146174 Approved by: https://github.com/ngimel	2025-02-01 18:53:11 +00:00
Nikita Shulga	e56dcf2772	[CPUInductor] Fix SVE256 detection (#146207 ) This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`. I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256. Fixes https://github.com/pytorch/pytorch/issues/145441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207 Approved by: https://github.com/angelayi	2025-02-01 18:51:34 +00:00
Jason Ansel	8c657ae4be	[inductor] Add typing to common.CSE (#145993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914, #145915, #145916	2025-02-01 16:34:18 +00:00
Jason Ansel	68cf36d5ab	[inductor] Add typing to common.KernelArgs (#145916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914, #145915	2025-02-01 16:34:18 +00:00
Jason Ansel	8e56d713c9	[inductor] Add typing to common.OpDecompositions (#145915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145915 Approved by: https://github.com/yanboliang ghstack dependencies: #145913, #145914	2025-02-01 16:34:11 +00:00
Jason Ansel	79f9f62e3a	[inductor] Combine regexp checks in OpOverrides.paren (#145914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145914 Approved by: https://github.com/Skylion007 ghstack dependencies: #145913	2025-02-01 16:34:03 +00:00
Jason Ansel	4c004caa76	[inductor] Add types to DeviceOpOverrides (#145913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145913 Approved by: https://github.com/Skylion007	2025-02-01 16:33:49 +00:00
rzou	0f768c7866	Barebones flat_apply HOP (#146060 ) This PR: - adds pytree.register_constant for registering a class to be treated as a constant by torch.compile/torch.fx - adds a very barebones flat_apply HOP. This should be sufficient to get mark_traceable working. A lot more work is necessary to get the custom operator case working (when make_fx sees a custom operator with PyTree arg types, it needs to emit a call to the flat_apply HOP). - I expect the flat_apply HOP to change a lot, I want to ship this in the current state to unblock the mark_traceable and custom ops work. Test Plan: - It's kind of difficult to test the barebones flat_apply HOP "works" so I added a really simple test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060 Approved by: https://github.com/StrongerXi, https://github.com/yanboliang ghstack dependencies: #146059	2025-02-01 16:17:48 +00:00
rzou	373606928b	Add torch.utils._pytree.register_dataclass (#146059 ) This is an API that registers a dataclass as a pytree node. It directly calls torch.export.register_dataclass, but we should eventually inline that implementation here. I want to use this API for something in compile and feel weird calling torch.export.register_dataclass. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146059 Approved by: https://github.com/StrongerXi, https://github.com/angelayi, https://github.com/yanboliang	2025-02-01 16:17:48 +00:00
cyy	2fd1b6b361	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-02-01 12:33:41 +00:00
Aleksandar Samardžić	2b00d211f0	Build RowwiseScaledMM.cu for SM89 (#145676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676 Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy	2025-02-01 11:44:58 +00:00
Shangdi Yu	f40e013787	Fix aten.to when input is a tensor constant (#146220 ) Summary: Fix aten.to when input is a tensor constant. In this case, `args_unwrapped` could just be a constant, so not a functional tensor. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r tensor_constant_aten_to ``` Differential Revision: D68984244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146220 Approved by: https://github.com/JacobSzwejbka	2025-02-01 11:07:33 +00:00
bobrenjc93	30f091da44	add speculation log divergence test (#145659 ) Followup from a SEV. Confirmed that this breaks when stacked on top of https://github.com/pytorch/pytorch/pull/145660 (offending PR that caused the SEV) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145659 Approved by: https://github.com/laithsakka	2025-02-01 09:39:22 +00:00
Shangdi Yu	a4e4368157	add node mapping processing (#146103 ) Summary: Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)`, to produce an `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create provenance visualization. The `inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty, so they are removed from some of the `test_structured_trace` tests. Test Plan: CI
```
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
python test/dynamo/test_structured_trace.py
```
Differential Revision: D68190173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103 Approved by: https://github.com/chenyang78	2025-02-01 08:29:29 +00:00
Huy Do	f38d5b4a74	Update TorchBench commit to main (#145455 ) I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use the latest TorchBench commit. The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584 The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 is regressing several models. This needs to be investigated further, and I pin the version to unblock this change. * `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279, not sure why it fails accuracy on A10G, but it works fine on A100 * `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455 Approved by: https://github.com/kit1980	2025-02-01 06:44:26 +00:00
Shangdi Yu	a97a906dd9	Add "//caffe2:libtorch" to minifier TARGET file (#146203 ) Summary: as title. To avoid errors like "undefined symbol: aoti_torch_device_type_cpu" when compiling minifier_launcher.py Test Plan: CI Differential Revision: D68978430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146203 Approved by: https://github.com/desertfire	2025-02-01 05:37:23 +00:00
Mingming Ding	bcd0ba0f69	Adding the best autotuner config (#146121 ) Summary: Adding logs to log the best config for autotune configs Test Plan: Testing in Mast : aps-omnifmv1-5_32_test_with_best_config-c5e9ceccf8 {F1974838864} Reviewed By: oulgen Differential Revision: D68931164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146121 Approved by: https://github.com/oulgen	2025-02-01 03:43:33 +00:00
Yiming Zhou	549e230c33	[draft_export] Clear pending unbacked symbols when overriding mismatched fake kernels (#146089 ) Summary: When encountering a mismatched fake kernel that also creates unbacked symbols, draft export will fail with `PendingUnbackedSymbolNotFound` error. Clearing `shape_env.pending_fresh_unbacked_symbols` fixes this issue. Test Plan: ``` buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_override_mismatched_fake_kernel_with_unbacked_symbols ``` Differential Revision: D68920990 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146089 Approved by: https://github.com/pianpwk	2025-02-01 03:32:50 +00:00
cyy	4d2056efb5	Enable ruff F841 on numpy tests (#146126 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146126 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-01 03:07:28 +00:00
cyy	985a78e9df	Enable ruff F841 on distributed tests (#146131 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146131 Approved by: https://github.com/rec, https://github.com/albanD	2025-02-01 03:06:16 +00:00
Animesh Jain	1de41e6918	[dynamo][exceptions][3.10] Clean symbolic stack on exception handling (#146198 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146198 Approved by: https://github.com/williamwen42 ghstack dependencies: #146062	2025-02-01 02:51:44 +00:00
angelayi	6023684311	[export] Fix symfloat serialization (#146112 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146112 Approved by: https://github.com/pianpwk	2025-02-01 02:28:44 +00:00
David Berard	8326d27093	[inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515 ) This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants). Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between. To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515 Approved by: https://github.com/SamGinzburg	2025-02-01 02:11:48 +00:00
briancoutinho	6e734bab93	execution trace export supports gzip format (#146179 ) As above, allows Chakra Execution Trace observer to support compressing files. Usage is straightforward, just add ".gz" suffix to the output file name
```
et = ExecutionTraceObserver()
et.register_callback("my_trace.json.gz")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146179 Approved by: https://github.com/shengfukevin, https://github.com/davidberard98, https://github.com/sraikund16	2025-02-01 01:25:25 +00:00
Brian Hirsh	57c45340e7	include entire GraphModule instead of current node when erroring inside of fx interpreter (#146197 ) This seems like it would make it easier to diagnose PT2 issues where the user cannot easily repro, and we need more info in the backtrace, e.g. in https://github.com/pytorch/pytorch/issues/134182#issuecomment-2628076114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146197 Approved by: https://github.com/jamesjwu	2025-02-01 01:09:27 +00:00
Sam Larsen	73d90d66a4	Cap size of thread pool in select_algorithm to cpu count (#146071 ) Summary: With changes from https://github.com/pytorch/pytorch/pull/144829, we can see more autotune configs and the size of the pool can get outta hand when using the cutlass backend. See internal discussion at: https://fburl.com/workplace/7g4vz0zy Test Plan: `python test/inductor/test_cutlass_backend.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146071 Approved by: https://github.com/Chillee	2025-02-01 00:41:36 +00:00
Avik Chaudhuri	cde5ddfd14	fix internal error with reorder submodules (#146181 ) Test Plan: hard to isolate as small repro Differential Revision: D68963033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146181 Approved by: https://github.com/angelayi	2025-02-01 00:30:42 +00:00
Alexander Kurakin	35f113e2a0	torch/nn/utils/rnn.py: docs: improvements (#138628 ) Fix constants highlighting in generated documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138628 Approved by: https://github.com/mikaylagawarecki	2025-02-01 00:10:30 +00:00
Bin Bao	a78c796f0b	[AOTI] Support composed dynamic shape constraint (#146044 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/145500. When export takes a dynamic shape constraint as an expression containing a symbol, we should be able to solve the symbol at run time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146044 Approved by: https://github.com/angelayi ghstack dependencies: #146043	2025-02-01 00:02:12 +00:00
Laith Sakka	43372e70c2	enhance logging statically known by adding size_oblivious(..) (#145354 ) after the diff
```
[0/0_1] eval size_oblivious(Eq(s1, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(u0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0s1u0, 0)) == False [statically known]
```
before
```
[0/0_1] eval (Eq(s1, 1)) == False [statically known]
[0/0_1] eval (Eq(u0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0s1u0, 0)) == False [statically known]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145354 Approved by: https://github.com/ezyang	2025-01-31 23:26:37 +00:00
Animesh Jain	f25f1163dc	[dynamo] Support frozenset({..}).__contains__ (#146062 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146062 Approved by: https://github.com/Skylion007, https://github.com/jansel	2025-01-31 23:22:58 +00:00
fduwjj	eb029fba13	[c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789 ) (#144794 ) Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794 Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman	2025-01-31 22:39:56 +00:00
Bin Bao	af2a39849d	[AOTI] Refactor codegen_input_symbol_assignment (#146043 ) Summary: Extract the common logic for size and stride symbol generation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146043 Approved by: https://github.com/angelayi	2025-01-31 21:55:18 +00:00
PyTorch MergeBot	c39c679813	Revert "Tensor .cuda() very slow with specific array sizes (#138964 )" This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744. Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))	2025-01-31 21:48:51 +00:00
atalman	a7cc6d3e84	Manylinux 2.28 migration - remove pre-cxx11 abi libtorch builds (#146200 ) Related to: https://github.com/pytorch/pytorch/issues/123649 Removing pre-cxx11 abi builds. As per announcement : https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146200 Approved by: https://github.com/kit1980, https://github.com/huydhn	2025-01-31 21:43:12 +00:00
Andrew Or	8203894eff	Resolve affine quantization namespace collision with torchao (#145941 ) Summary: https://github.com/pytorch/pytorch/pull/141421 duplicated affine quantization custom ops from torchao into the PT2E quantization flow, but these ops are registered under the same namespace with the same name, causing "Duplicate registration" errors for the new ops for use cases that import from both repos. This commit fixes this by moving the PT2E versions of the ops to a new namespace. In the long term, we expect to migrate PT2E into torchao so users can migrate back to the old namespace if they wish to. Test Plan: python test/test_quantization.py -k test_channel_group_quantization Differential Revision: D68838437 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145941 Approved by: https://github.com/cccclai	2025-01-31 21:29:47 +00:00
Animesh Jain	781aceee9c	[dynamo] Revert abc change due to internal failures (#146177 ) xref - https://www.internalfb.com/tasks/?t=191383874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177 Approved by: https://github.com/StrongerXi ghstack dependencies: #146141	2025-01-31 21:28:06 +00:00
Jessica Vandebon	a0d1393b1a	[MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842 ) Differential Revision: D68560256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145842 Approved by: https://github.com/chaos5958, https://github.com/nautsimon	2025-01-31 21:21:00 +00:00
Simon Fan	06850e624a	[ca][hop] test CA on all HOPs (#145429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145429 Approved by: https://github.com/zou3519 ghstack dependencies: #145422	2025-01-31 20:45:22 +00:00
Simon Fan	2e197c8a2d	[dynamo][hop] test torch.compiling all HOPs (#145422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145422 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-01-31 20:45:22 +00:00
William Wen	5b1abdbf5d	[dynamo] remove always-failing eval_frame.c debug check (#145982 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145981	2025-01-31 20:40:59 +00:00
William Wen	49df8de8be	[dynamo] disable eval_frame callback in _TorchDynamoContext __enter__/__exit__ (#145981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145981 Approved by: https://github.com/jansel	2025-01-31 20:40:59 +00:00
Wei Wang	3a4e7a589b	[CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099 ) To match the expected tensor (i.e. 2nd element in the array). Making rank0 receive [False, False] Fixes one of the issues reported in #146094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099 Approved by: https://github.com/eqy	2025-01-31 20:31:13 +00:00
Jane Xu	8b8c596503	Remove trivial dispatch_key_allowlist_check function (#146169 ) Hmmm...this _is_ removing a public function from a public C++ file. But the GH counts for this function total 83, seemingly all copying pytorch: https://github.com/search?q=dispatch_key_allowlist_check&type=code&p=1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146169 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-01-31 19:59:40 +00:00
Irakli Salia	ec2522e200	[MPS] optimize cholesky (#145722 ) Followup to #145701 Optimizes the SYRK and TRSM kernels of Cholesky decomposition on MPS. The SYRK kernel now does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and the TRSM kernel does vectorized loads. This PR also puts the command encoder inside of the stream queue dispatch (as discussed on the last PR). Script to collect perf
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")
        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Observed speedups on M1 Pro ![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145722 Approved by: https://github.com/malfet	2025-01-31 19:52:31 +00:00
Mwiza Kunda	6a0138fcc1	Torch device backend autoload fix (#145611 ) This causes an import failure if an external backend imports a module that uses `torch._as_tensor_fullprec` when it is being loaded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145611 Approved by: https://github.com/albanD	2025-01-31 19:27:42 +00:00
cyy	18380836eb	Remove outdated test skipif conditions for Python3.9 (#146144 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146144 Approved by: https://github.com/albanD	2025-01-31 19:01:04 +00:00
Yidi Wu	68a3635484	[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456 ) Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is only used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally. This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456 Approved by: https://github.com/desertfire	2025-01-31 18:29:27 +00:00
Zhengxu Chen	aad9f44b2e	[export] Sync model container types to schema.py (#145959 ) Summary: Synced from D68840230 Test Plan: No behavior changes to existing API. Will be tested internally. Differential Revision: D68846532 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959 Approved by: https://github.com/yiming0416	2025-01-31 18:17:56 +00:00
PyTorch MergeBot	16f44fee25	Revert "[inductor/profiler] add kernel kwargs instrumentation (#145573 )" This reverts commit 720b8d0d8dac98f89499bc6b251d1f34dbf68dfe. Reverted https://github.com/pytorch/pytorch/pull/145573 on behalf of https://github.com/ZainRizvi due to Sorry, but this is failing internally. It's a bit weird since this PR doesn't really appear related at first glance, but despite retries it fails pretty consistently. Please see D68930742 for details ([comment](https://github.com/pytorch/pytorch/pull/145573#issuecomment-2628013872))	2025-01-31 18:13:23 +00:00
Catherine Lee	67ed47d886	Binary upload checksum (#144887 ) Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887 Approved by: https://github.com/atalman	2025-01-31 17:51:27 +00:00
Aleksei Nikiforov	d0748566b4	s390x ci: ensure CI starts correctly if token pipe is not removed (#145840 ) Mark stop actions as "may fail". The container is expected to stop on its own in the normal case. Remove "may fail" mark from token generation steps. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145840 Approved by: https://github.com/huydhn	2025-01-31 17:46:09 +00:00
Aleksei Nikiforov	44ecbcbd5a	s390x: disable test_model_exports_to_core_aten.py test (#145835 ) It often gets killed by OOM. Disable it while investigating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145835 Approved by: https://github.com/huydhn	2025-01-31 17:45:10 +00:00
Animesh Jain	667b94d1c2	[hotfix][dynamo] Skip linecache due to a flaky issue (#146141 ) A large number of jit + dynamo wrapped tests fail in linecache tracing. We need further debugging. Skipping for now to stem the bleeding. https://github.com/pytorch/pytorch/issues/146076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141 Approved by: https://github.com/StrongerXi	2025-01-31 17:45:06 +00:00
PyTorch MergeBot	c3f71eb61b	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error. @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))	2025-01-31 17:43:09 +00:00
PyTorch MergeBot	f5a61ba0a3	Revert "inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122 )" This reverts commit d100e9ae744322a74d9fd05d0851caaf36f19c24. Reverted https://github.com/pytorch/pytorch/pull/145122 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. See D68924977 for details ([comment](https://github.com/pytorch/pytorch/pull/145122#issuecomment-2627880860))	2025-01-31 17:39:23 +00:00
Aleksei Nikiforov	eb5a0718c2	S390x nightly builds timeouts (#146041 ) Sometimes the build times out at the end. Increasing the timeout should fix this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146041 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-01-31 17:29:11 +00:00
Mikayla Gawarecki	001e355a56	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )
## Background
This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it. The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)`
## How does this work
The format for the checkpoint is as such
```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```
Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them. For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage. 
Note that we always use the `miniz` writer in zip64 mode per [here](`7796e308d0/caffe2/serialize/inline_container.cc (L701)`) A zipfile record written by miniz looks as such
```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```
- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`
- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)`). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524)`)
- `m` is determined by [`getPadding`](`7796e308d0/caffe2/serialize/inline_container.cc (L254)`), which accounts for filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined based on [this snippet ](`7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)`): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16. 
When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following
- We keep track of where the "cursor" is in the file using `current_offset`; after each persistent_load call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above
## Testing strategy
The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.
Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-31 17:09:20 +00:00
Donald Tolley	98f87edd23	Tensor .cuda() very slow with specific array sizes (#138964 ) ### Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch #### Summary This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead. #### Tracing the Issue To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process: 1. Python Call: `tensor.cuda()` - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU. 2. `TensorBody.h: cuda()` - The `cuda()` method calls `to()`, specifying the target device as CUDA. 3. `Tensor.cpp: TensorBase::to()` - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`. 4. Operator Call: `_ops::to_dtype_layout::call()` - This operator dispatches the request to the backend-specific function responsible for managing the transfer. 5. `Copy.cpp: copy_()` - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`. 6. `Copy.cpp: copy_impl()` - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`. 7. Dispatch to CUDA: `copy_stub` - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`. 8. `Copy.cu: copy_kernel_cuda()` - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. 
This behavior is managed by the `copy_requires_temporaries()` function. #### Solution To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance: - Efficiency of `cudaMemcpy2DAsync`: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors. - Reduction of Overhead: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers. - Asynchronous Execution: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts. #### Performance Results In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks: ```plaintext Average time for contiguous slice (rows 10,000-50,000): 66 ms Average time for non-contiguous slice (rows 10,000-50,000): 66 ms Validation of contiguous and non-contiguous tensor copies: ✅ PASS: Tensor shapes match. ✅ PASS: Tensor contiguity matches. ✅ PASS: Tensor contents match. ✅ PASS: Tensor data types match. ✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU. 
``` #### Conclusion This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964 Approved by: https://github.com/jeffdaily Co-authored-by: Natalia Gimelshein <ngimel@gmail.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-31 17:05:02 +00:00
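The stride-aware transfer described in the commit above can be illustrated with a minimal pure-Python sketch of pitched 2D copy semantics. This is not the code from the PR: `pitched_copy_2d` is a hypothetical helper, and its parameter names only mirror the leading arguments of `cudaMemcpy2DAsync` (`dst`, `dpitch`, `src`, `spitch`, `width`, `height`). It assumes row-major byte buffers and runs entirely on the host.

```python
def pitched_copy_2d(dst, dpitch, src, spitch, width, height):
    """Copy `height` rows of `width` bytes each.

    Consecutive rows start `spitch` bytes apart in `src` and `dpitch`
    bytes apart in `dst`, so a strided source needs no staging buffer.
    """
    for row in range(height):
        s = row * spitch
        d = row * dpitch
        dst[d:d + width] = src[s:s + width]
    return dst


# A 3x4 row-major byte matrix; copy only the first 2 columns of each row,
# which is a non-contiguous view of the source.
src = bytes(range(12))
dst = pitched_copy_2d(bytearray(6), 2, src, 4, width=2, height=3)
print(list(dst))  # [0, 1, 4, 5, 8, 9]
```

The source pitch (4) exceeds the copied width (2), which is exactly the situation a non-contiguous slice produces; the destination is packed because its pitch equals the width.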
Mikayla Gawarecki	2d6f6637d3	Remove lexicographical sorting of storage keys in torch.save (#143879 ) Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted). This makes the order in which storages are written the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879 Approved by: https://github.com/albanD	2025-01-31 17:00:23 +00:00
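The difference between the two orders in the commit above can be sketched in plain Python. The dictionary below is a hypothetical stand-in for the storage-key mapping, not the actual `torch.save` data structure; it only relies on the language guarantee that `dict` preserves insertion order in Python 3.7+.

```python
# String keys inserted in numeric order, as storages would be during pickling.
keys = [str(i) for i in range(12)]
records = {k: f"storage-{k}" for k in keys}

insertion_order = list(records)       # order of insertion, numerically sorted here
lexicographic_order = sorted(records)  # string sort: "10" and "11" come before "2"

print(insertion_order)
print(lexicographic_order)
```

Iterating the dictionary directly therefore visits the keys in the same order they were pickled, while `sorted` on string keys interleaves them.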
Ting Lu	9232355bb0	Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix (#145792 ) https://github.com/pytorch/pytorch/issues/145570 Adding cuda 12.8.0 x86 builds first Pull Request resolved: https://github.com/pytorch/pytorch/pull/145792 Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman	2025-01-31 16:12:02 +00:00
Jackson	a7c2d85c18	Add overloads to diagonal docs (#144214 ) Fixes #126827. Refactored doc to demonstrate when none of the optional values are passed in. Added another example so that all overloads of the function are covered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144214 Approved by: https://github.com/albanD	2025-01-31 15:53:59 +00:00
Bin Bao	2af876707b	[AOTI] Fix a memory leak in package boxed_run (#146100 ) Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100 Approved by: https://github.com/cpuhrsch	2025-01-31 13:32:28 +00:00
Pian Pawakapan	7b07415aaa	[export] nested terms in nn_module_stack deserialization (#145901 ) Summary: accounting for terms like "getattr(getattr(a[0], b), c)". Test Plan: test_serialize Differential Revision: D68784736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145901 Approved by: https://github.com/angelayi	2025-01-31 10:00:13 +00:00
Haifeng Jin	1f1a9965d5	fix a small typo in comments (#145323 ) A minor typo fix. The description was confusing with the typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145323 Approved by: https://github.com/Skylion007	2025-01-31 06:45:44 +00:00
Nikita Shulga	c55af2b567	[CMake] Delete Caffe2 inspect_gpu binary (#146105 ) As it's unbuildable right now, as headers it depends on are gone Fixes https://github.com/pytorch/pytorch/issues/146042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146105 Approved by: https://github.com/atalman, https://github.com/seemethere	2025-01-31 06:42:52 +00:00
Aidyn-A	e84bf88dde	[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 ) This is a re-base PR to my previous one #141959. Description from the original PR: This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100. <details> <summary>The benchmark code used </summary> ```Python import time import torch from torch.profiler import profile, ProfilerActivity def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False): device = torch.device("cuda") shapes = [] for p in range(24, 30): shape = 1<<p shapes.append(shape) for shape in shapes: for _ in range(6): x = torch.randn(shape, device=device, dtype=dtype) y = function(x) if print_profile: x = torch.randn(shape, device=device, dtype=dtype) with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof: y = function(x) print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) x = torch.randn(shape, device=device, dtype=dtype) torch.cuda.synchronize() t1 = time.perf_counter() for _ in range(6): y = function(x) torch.cuda.synchronize() t2 = time.perf_counter() perf_time = (t2 - t1) / 6 print(f"{function.__name__}, {dtype}, {shape}, {perf_time}") if check_numerics: x_cpu = x.cpu() y_cpu = function(x_cpu).cuda() try: torch.testing.assert_allclose(y_cpu, y) except AssertionError as error: print("An exception occurred:", error) def main(): ops = [ torch.relu, torch.sigmoid, torch.tanh, torch.nn.functional.gelu, torch.sin, torch.exp, ] dtypes = [ torch.float16, torch.bfloat16, torch.float32, ] for op in ops: for dtype in dtypes: benchmark(op, dtype=dtype) torch.cuda.empty_cache() if __name__ == "__main__": main() ``` </details> <details> <summary> Results </summary> \| op \| dtype \| size \| time after \| time before \| % improvement \| \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| relu \| torch.float16 \| 33554432 \| 4.84E-05 \| 5.06E-05 \| 4.66296539127052 \| \| relu \| torch.float16 \| 
67108864 \| 9.22E-05 \| 9.64E-05 \| 4.56491432752297 \| \| relu \| torch.float16 \| 134217728 \| 0.000180343495837102 \| 0.000187981834945579 \| 4.23543919508829 \| \| relu \| torch.float16 \| 268435456 \| 0.000355071155354381 \| 0.000370856161074092 \| 4.44558942107169 \| \| relu \| torch.float16 \| 536870912 \| 0.000704489842367669 \| 0.000736006341564159 \| 4.47366268483987 \| \| relu \| torch.bfloat16 \| 16777216 \| 3.03E-05 \| 3.04E-05 \| 0.166504085842689 \| \| relu \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.45848238875716 \| \| relu \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.65E-05 \| 3.56122651631445 \| \| relu \| torch.bfloat16 \| 134217728 \| 0.000180805509444326 \| 0.000187998676362137 \| 3.97840029317567 \| \| relu \| torch.bfloat16 \| 268435456 \| 0.000356242332297067 \| 0.000371279485989362 \| 4.22104627356745 \| \| relu \| torch.bfloat16 \| 536870912 \| 0.000708114336399982 \| 0.000736773828975856 \| 4.04729732229083 \| \| relu \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.61E-05 \| 0.0442587268354941 \| \| relu \| torch.float32 \| 33554432 \| 9.33E-05 \| 9.30E-05 \| -0.259070913799022 \| \| relu \| torch.float32 \| 67108864 \| 0.000181321326332788 \| 0.000181289506144822 \| -0.0175490597877115 \| \| relu \| torch.float32 \| 134217728 \| 0.000356896334172537 \| 0.000356570177245885 \| -0.0913870206618981 \| \| relu \| torch.float32 \| 268435456 \| 0.000709421835684528 \| 0.000707465515006334 \| -0.275762681635911 \| \| relu \| torch.float32 \| 536870912 \| 0.00141372415237129 \| 0.00141036518228551 \| -0.237597276678471 \| \| sigmoid \| torch.float16 \| 16777216 \| 3.10E-05 \| 3.16E-05 \| 2.10012593866895 \| \| sigmoid \| torch.float16 \| 33554432 \| 4.91E-05 \| 5.23E-05 \| 6.37710600666122 \| \| sigmoid \| torch.float16 \| 67108864 \| 9.30E-05 \| 0.000100057009452333 \| 7.61866144555331 \| \| sigmoid \| torch.float16 \| 134217728 \| 0.000180928347011407 \| 0.000194982004662355 \| 7.76752669390248 \| \| sigmoid \| torch.float16 \| 
268435456 \| 0.000355658994521946 \| 0.00038468533117945 \| 8.16128288742412 \| \| sigmoid \| torch.float16 \| 536870912 \| 0.000705982849467546 \| 0.000764021339515845 \| 8.22094900634937 \| \| sigmoid \| torch.bfloat16 \| 16777216 \| 3.08E-05 \| 3.17E-05 \| 2.90965915673149 \| \| sigmoid \| torch.bfloat16 \| 33554432 \| 4.87E-05 \| 5.24E-05 \| 7.63503884668234 \| \| sigmoid \| torch.bfloat16 \| 67108864 \| 9.33E-05 \| 0.000100019678939134 \| 7.21238137428013 \| \| sigmoid \| torch.bfloat16 \| 134217728 \| 0.000180786165098349 \| 0.000194868014659733 \| 7.78922964250206 \| \| sigmoid \| torch.bfloat16 \| 268435456 \| 0.000355564659306159 \| 0.000384909333661199 \| 8.25297835063321 \| \| sigmoid \| torch.bfloat16 \| 536870912 \| 0.000705831005082776 \| 0.000764102345177283 \| 8.2557070566308 \| \| sigmoid \| torch.float32 \| 16777216 \| 4.93E-05 \| 5.65E-05 \| 14.5314136197766 \| \| sigmoid \| torch.float32 \| 33554432 \| 9.32E-05 \| 9.31E-05 \| -0.120169865610833 \| \| sigmoid \| torch.float32 \| 67108864 \| 0.000181328505277634 \| 0.000180455681402236 \| -0.481349512069855 \| \| sigmoid \| torch.float32 \| 134217728 \| 0.000357362829769651 \| 0.000356093340087682 \| -0.35523831137877 \| \| sigmoid \| torch.float32 \| 268435456 \| 0.000708921831877281 \| 0.000707052337626616 \| -0.263709504574663 \| \| sigmoid \| torch.float32 \| 536870912 \| 0.00141358317341656 \| 0.0014090768333214 \| -0.318788464654745 \| \| tanh \| torch.float16 \| 16777216 \| 3.03E-05 \| 3.03E-05 \| -0.0912564658661808 \| \| tanh \| torch.float16 \| 33554432 \| 4.90E-05 \| 5.07E-05 \| 3.46644442974484 \| \| tanh \| torch.float16 \| 67108864 \| 9.30E-05 \| 9.68E-05 \| 3.99871369815531 \| \| tanh \| torch.float16 \| 134217728 \| 0.00018052199933057 \| 0.000188717152923346 \| 4.53969799978138 \| \| tanh \| torch.float16 \| 268435456 \| 0.000355684508879979 \| 0.000373026006855071 \| 4.8755280430115 \| \| tanh \| torch.float16 \| 536870912 \| 0.000706660988119741 \| 0.000740105014604827 \| 
4.73268328765002 \| \| tanh \| torch.bfloat16 \| 16777216 \| 2.99E-05 \| 3.03E-05 \| 1.21049563135981 \| \| tanh \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.48836101041744 \| \| tanh \| torch.bfloat16 \| 67108864 \| 9.28E-05 \| 9.69E-05 \| 4.39944918036626 \| \| tanh \| torch.bfloat16 \| 134217728 \| 0.000180710999605556 \| 0.000189167990659674 \| 4.67984299382829 \| \| tanh \| torch.bfloat16 \| 268435456 \| 0.000356062994493792 \| 0.000372666652159144 \| 4.66312363882606 \| \| tanh \| torch.bfloat16 \| 536870912 \| 0.000707100164921333 \| 0.000740134331863374 \| 4.67178040408393 \| \| tanh \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.64E-05 \| 0.439595755746353 \| \| tanh \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.31E-05 \| 0.00287633090228212 \| \| tanh \| torch.float32 \| 67108864 \| 0.000181465332085888 \| 0.000180895323865116 \| -0.31411411437098 \| \| tanh \| torch.float32 \| 134217728 \| 0.000356963835656643 \| 0.000356073161431899 \| -0.249513854283251 \| \| tanh \| torch.float32 \| 268435456 \| 0.000709201170442005 \| 0.00070707315656667 \| -0.300057862849997 \| \| tanh \| torch.float32 \| 536870912 \| 0.00141367283261692 \| 0.00141030051357423 \| -0.238550176877922 \| \| gelu \| torch.float16 \| 16777216 \| 2.73E-05 \| 3.17E-05 \| 15.921079070745 \| \| gelu \| torch.float16 \| 33554432 \| 5.06E-05 \| 5.55E-05 \| 9.76345374333098 \| \| gelu \| torch.float16 \| 67108864 \| 9.65E-05 \| 0.000106600326641152 \| 10.4308039074712 \| \| gelu \| torch.float16 \| 134217728 \| 0.000187776672343413 \| 0.000208565829476962 \| 11.0712139447915 \| \| gelu \| torch.float16 \| 268435456 \| 0.000370216167842348 \| 0.000412251994324227 \| 11.3544005187205 \| \| gelu \| torch.float16 \| 536870912 \| 0.000737301345604161 \| 0.000819394170927505 \| 11.1342296895002 \| \| gelu \| torch.bfloat16 \| 16777216 \| 3.02E-05 \| 3.08E-05 \| 1.78405479367653 \| \| gelu \| torch.bfloat16 \| 33554432 \| 5.13E-05 \| 5.69E-05 \| 10.9929393318302 \| \| gelu \| 
torch.bfloat16 \| 67108864 \| 9.76E-05 \| 0.00010968199543034 \| 12.3420807512356 \| \| gelu \| torch.bfloat16 \| 134217728 \| 0.000189661824454864 \| 0.000214487663470209 \| 13.0895287371091 \| \| gelu \| torch.bfloat16 \| 268435456 \| 0.000374197009174774 \| 0.000423670164309442 \| 13.2211519391275 \| \| gelu \| torch.bfloat16 \| 536870912 \| 0.000743675006863972 \| 0.000842577001700799 \| 13.299088166737 \| \| gelu \| torch.float32 \| 16777216 \| 5.06E-05 \| 5.04E-05 \| -0.413385894716413 \| \| gelu \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.32E-05 \| 0.134157041722546 \| \| gelu \| torch.float32 \| 67108864 \| 0.000181480175039421 \| 0.000180836669945469 \| -0.354586992112075 \| \| gelu \| torch.float32 \| 134217728 \| 0.000356874331676712 \| 0.000356305002545317 \| -0.159532104402047 \| \| gelu \| torch.float32 \| 268435456 \| 0.000708909006789327 \| 0.000706991491218408 \| -0.270488250615287 \| \| gelu \| torch.float32 \| 536870912 \| 0.00141321367118508 \| 0.00140937082081412 \| -0.271922813181618 \| \| sin \| torch.float16 \| 16777216 \| 3.04E-05 \| 3.11E-05 \| 2.21834939018859 \| \| sin \| torch.float16 \| 33554432 \| 4.85E-05 \| 5.23E-05 \| 7.72165512511596 \| \| sin \| torch.float16 \| 67108864 \| 9.31E-05 \| 9.98E-05 \| 7.24947099480072 \| \| sin \| torch.float16 \| 134217728 \| 0.000180371008658161 \| 0.000194791161144773 \| 7.99471744039613 \| \| sin \| torch.float16 \| 268435456 \| 0.000355454161763191 \| 0.000384903668115536 \| 8.28503630574026 \| \| sin \| torch.float16 \| 536870912 \| 0.000705183832906187 \| 0.000764360166310022 \| 8.39161799270973 \| \| sin \| torch.bfloat16 \| 16777216 \| 3.11E-05 \| 3.10E-05 \| -0.257677954940036 \| \| sin \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.24E-05 \| 7.34808420323539 \| \| sin \| torch.bfloat16 \| 67108864 \| 9.26E-05 \| 0.000100248667877167 \| 8.22347488801205 \| \| sin \| torch.bfloat16 \| 134217728 \| 0.000180674154156198 \| 0.00019567032965521 \| 8.30012215584937 \| \| sin \| torch.bfloat16 
\| 268435456 \| 0.000355360486234228 \| 0.000386023331278314 \| 8.62865913118873 \| \| sin \| torch.bfloat16 \| 536870912 \| 0.00070483615854755 \| 0.000766805159704139 \| 8.79197248964745 \| \| sin \| torch.float32 \| 16777216 \| 5.67E-05 \| 5.64E-05 \| -0.441348534920039 \| \| sin \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.30E-05 \| -0.496458540364117 \| \| sin \| torch.float32 \| 67108864 \| 0.000181706990891447 \| 0.000180556671693921 \| -0.633062708199702 \| \| sin \| torch.float32 \| 134217728 \| 0.000356894995396336 \| 0.000356046327700218 \| -0.237791985616354 \| \| sin \| torch.float32 \| 268435456 \| 0.000708777321657787 \| 0.000707602652255446 \| -0.165731798471427 \| \| sin \| torch.float32 \| 536870912 \| 0.00141263716310884 \| 0.00140912582476934 \| -0.248566187496451 \| \| exp \| torch.float16 \| 16777216 \| 3.00E-05 \| 3.04E-05 \| 1.40099098901014 \| \| exp \| torch.float16 \| 33554432 \| 4.86E-05 \| 5.03E-05 \| 3.44611943643906 \| \| exp \| torch.float16 \| 67108864 \| 9.37E-05 \| 9.55E-05 \| 1.96412400380129 \| \| exp \| torch.float16 \| 134217728 \| 0.000180913504057874 \| 0.000187193179347863 \| 3.47109262113439 \| \| exp \| torch.float16 \| 268435456 \| 0.00035607748820136 \| 0.000369079003576189 \| 3.65131630210701 \| \| exp \| torch.float16 \| 536870912 \| 0.000707551507124056 \| 0.000732363162872692 \| 3.50669251620789 \| \| exp \| torch.bfloat16 \| 16777216 \| 2.98E-05 \| 3.04E-05 \| 1.74345594341654 \| \| exp \| torch.bfloat16 \| 33554432 \| 4.88E-05 \| 5.04E-05 \| 3.40217856534821 \| \| exp \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.62E-05 \| 3.29219958210226 \| \| exp \| torch.bfloat16 \| 134217728 \| 0.000180999826019009 \| 0.000187239318620414 \| 3.44723679499521 \| \| exp \| torch.bfloat16 \| 268435456 \| 0.000355944503098726 \| 0.000369370992605885 \| 3.77207384585864 \| \| exp \| torch.bfloat16 \| 536870912 \| 0.000707135167128096 \| 0.000733066000975668 \| 3.66702648277075 \| \| exp \| torch.float32 \| 16777216 \| 4.89E-05 
\| 5.63E-05 \| 15.1245314346532 \| \| exp \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.31E-05 \| -0.259945454477446 \| \| exp \| torch.float32 \| 67108864 \| 0.000181152504713585 \| 0.000180474346658836 \| -0.374357536939058 \| \| exp \| torch.float32 \| 134217728 \| 0.000356771342922002 \| 0.000355627329554409 \| -0.3206573034212 \| \| exp \| torch.float32 \| 268435456 \| 0.000708404501589636 \| 0.00070713268360123 \| -0.179532736671163 \| \| exp \| torch.float32 \| 536870912 \| 0.00141283582585553 \| 0.00140944866385932 \| -0.23974208002295 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746 Approved by: https://github.com/eqy, https://github.com/ngimel Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-01-31 06:42:08 +00:00
Henry Hu	eeb5e1bf20	[AOTI] Cache treespec_loads calculation (#145815 ) Summary: Treespec can be reused instead of calculated from str every AOTI module call. Using cached result saves 0.2ms for each module call. Test Plan: Before: {F1974751578} After: {F1974751667} Differential Revision: D68749539 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145815 Approved by: https://github.com/henrylhtsang	2025-01-31 06:38:21 +00:00
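The caching idea in the commit above (parse a serialized spec once and reuse the result on later calls) can be sketched with the standard library. This is an illustration only: `load_spec` is a hypothetical name, and `json.loads` stands in for the real treespec deserializer, whose implementation is not shown in the commit message.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_spec(serialized: str):
    # Stand-in for an expensive string-to-spec parse; runs once per distinct string.
    return json.loads(serialized)


spec_str = '{"type": "tuple", "children": 2}'
for _ in range(3):
    spec = load_spec(spec_str)  # only the first call actually parses

info = load_spec.cache_info()
print(info.hits, info.misses)  # 2 1
```

One design caveat of this sketch: the cached object is shared between callers, so it should be treated as read-only.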
Aaron Orenstein	57d8278ab9	pickler for GraphModule (#141659 ) Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that. Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659 Approved by: https://github.com/jamesjwu	2025-01-31 05:34:28 +00:00
Manav Avlani	f9227e7c33	Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087 ) Summary: TSIA Test Plan: Tested up the stack but existing unittests Reviewed By: suo Differential Revision: D68917233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087 Approved by: https://github.com/suo	2025-01-31 05:08:11 +00:00
Sherlock Huang	cf2de4e230	Introduce aoti_call_delegate HOP (#145630 ) Summary: Previously, an AOTI compile node was represented as a kernel-less custom op in the exported program. The node was not eager-runnable, even though running eagerly is a common practice for numerical validation during lowering. I introduce a new HOP to address this. The schema is as follows: ``` aoti_call_delegate(lower_moduel: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor]) ``` There are a few problems exposed by the HOP: - AOTI expects an FX graph with weights as getattr nodes, i.e. a stateful graph. The HOP expects graph_module arguments to be stateless. The export serializer also expects a stateless graph. Currently, to make AOTI happy, I am making `original_gm` stateful and bypassing the serialization for `original_gm`. - As a result, the HOP is not re-traceable, as functionalization on a stateful graph module argument will fail. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test Reviewed By: zhxchen17 Differential Revision: D68359391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630 Approved by: https://github.com/zou3519	2025-01-31 04:57:36 +00:00
titaiwangms	f358d4d004	[ONNX] Migrate test_torch_export_with_onnxruntime.py to test_small_models_e2e.py (#146095 ) With [the deprecation of torch.onnx.dynamo_export](https://github.com/pytorch/pytorch/pull/146003), this PR moves the torch.export-related tests to torch.onnx.export(..., dynamo=True) and places them in test_small_models_e2e.py. NOTE: test_exported_program_as_input_from_file and test_onnx_program_supports_retraced_graph are not kept, because they mostly test whether an exported program stays the same after save/load and retrace. In torch.onnx.export(..., dynamo=True), however, we focus more on the export from nn.Module to ONNX proto. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146095 Approved by: https://github.com/justinchuby	2025-01-31 03:40:26 +00:00
angelayi	27e35de6c2	[export] Add distributed test (#146050 ) Reland https://github.com/pytorch/pytorch/pull/145886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146050 Approved by: https://github.com/avikchaudhuri	2025-01-31 02:56:42 +00:00
Pian Pawakapan	ffb424eab6	[dynamo/export] call local_scalar_dense when full() value is scalar tensor (#144999 ) Fixes https://github.com/pytorch/pytorch/issues/144907 ``` class Foo(torch.nn.Module): def forward(self, val): return torch.full((80, 2), val, dtype=torch.float32) export(Foo(), args=(torch.tensor(1),)) ``` When we have a `torch.full` call like above, where the fill value is a scalar Tensor and not a scalar value, the FX graph from `_dynamo.export()` contains a single node: the full op. We run into a `PendingUnbackedSymbolNotFound` error, because the `item()` call is implicit; the UnbackedSymInt is extracted but goes directly into the data of the output tensor value, and we're then unable to locate it when we try to compute unbacked bindings. On the other hand, non-strict export doesn't face this, because an explicit `item()`, or `local_scalar_dense` node is inserted, and the unbacked binding is directly the example value of that node. This adds a dynamo handler to imitate what happens in non-strict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144999 Approved by: https://github.com/angelayi	2025-01-31 02:45:43 +00:00
Menglu Yu	e01c898e51	[Customized Optimus] Add select cat aten pass (#145918 ) Summary: This is a follow up work of D68695717, where we can further reduce the number of cat kernels in the backward by designing new aten pass in the aten level. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/6943087f-91be-4dbd-9693-df0a11a50b73 Test UI: https://www.internalfb.com/intern/testinfra/testrun/11821949087998233 Network: Up: 101KiB Down: 132KiB (reSessionID-60e898af-f366-4247-a9f7-d8d7cd129fe0) Analyzing targets. Remaining 0/78148 Executing actions. Remaining 0/476147 Command: test. Finished 2 local Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to add the config ``` post_grad_fusion_options: { "normalization_aten_pass": {}, "split_cat_aten_pass": {}, "select_cat_aten_pass": {}, } ``` {F1974778773} baseline: aps-recgpt_ranking_1115_pt2_optimus-e52c1f277e proposal aps-recgpt_ranking_1115_pt2_optimus-1b0047ee0e Differential Revision: D68803384 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145918 Approved by: https://github.com/Yuzhen11	2025-01-31 02:35:10 +00:00
Ting Lu	08d88127fe	Use Magma-cuda 12.8 for libtorch (#146019 ) https://github.com/pytorch/pytorch/issues/145570 Build failure for libtorch wheel `CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax collect2: error: ld returned 1 exit status` Unsure if this is related, fixing as a start Pull Request resolved: https://github.com/pytorch/pytorch/pull/146019 Approved by: https://github.com/eqy	2025-01-31 02:19:23 +00:00
Sam Larsen	2811f33d12	Fix code cache + freezing compile-time regression (#145868 ) Summary: The current implementation introduces a compile-time regression due to overhead hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR Explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor value+metadata vs. metadata only. Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868 Approved by: https://github.com/eellison	2025-01-31 02:04:15 +00:00
Yu, Guangye	bf9d053fb8	[Break XPU] Fix Inductor cuda bias UT (#145934 ) # Motivation [Break XPU] inductor ut: `inductor/test_inplace_padding.py::InplacePaddingTest::test_pad_non_zero - RuntimeError: Expected to find "empty_strided_cuda((2048, 2048), (2048, 1), torch.float32).as_strided((2048, 2047), (2048, 1))" but did not find it` With this PR, `test_pad_non_zero` will pass on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145934 Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/desertfire	2025-01-31 01:39:39 +00:00
Oguz Ulgen	ccd27e8129	Turn on fx graph cache and automatic dynamic pgo local caches in fbcode (#146065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146065 Approved by: https://github.com/jamesjwu	2025-01-31 01:11:48 +00:00
Scott Wolchok	3fae5c8509	torchgen: support exception boundary for ExecuTorch functions (#144341 ) Needed for ExecuTorch diff D67904052. Differential Revision: [D67906411](https://our.internmc.facebook.com/intern/diff/D67906411/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144341 Approved by: https://github.com/Jack-Khuu	2025-01-31 01:05:21 +00:00
cyy	d94d816d96	Simplify handling of max jobs in CMake builds (#145820 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145820 Approved by: https://github.com/malfet	2025-01-31 00:55:39 +00:00
Yifu Wang	c70362fac8	[AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011 ) [D68734067](https://our.internmc.facebook.com/intern/diff/D68734067) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-31 00:48:51 +00:00
Animesh Jain	1e3d1738a4	[dynamo][polyfills]Support getrecursionlimit (#145989 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145989 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145986, #145987, #145994	2025-01-31 00:47:31 +00:00
Animesh Jain	e7bb608d02	[dynamo][dicts] Support construction of types.MappingProxyType (#145994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145994 Approved by: https://github.com/StrongerXi, https://github.com/jansel ghstack dependencies: #145986, #145987	2025-01-31 00:47:31 +00:00
Animesh Jain	4665bc2cc0	[dynamo][functions] Support `id` on function (#145987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145987 Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #145986	2025-01-31 00:47:23 +00:00
Animesh Jain	56307dc370	[dynamo][dicts] Raise exception on pop (#145986 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145986 Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel	2025-01-31 00:47:13 +00:00
Colin Peppler	e6704a2447	Allow replacing unbacked with very large upperbound by returning no-op for FloorToInt(int) (#146001 ) * Let's say x is an integer beyond 2^53 where Python floats lose precision i.e. can't increment by 1. * Therefore, float(x) will lose precision and won't retain the exact value of x even though it's an integer. * That means `FloorToInt(very_large_number)` will lose precision if we cast it to float ``` >>> int(float(1000000007999999992)) 1000000008000000000 ``` This means when we try to do this in set_replacement(): `32bb6f83d5/torch/fx/experimental/symbolic_shapes.py (L6011-L6019)` We run into this: ``` TORCH_LOGS="+torch.fx.experimental.symbolic_shapes" pytest -s test_export.py -k test_replace_unbacked_with_very_large_upperbound File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6258, in _maybe_guard_rel self._set_replacement(rhs, self._find(lhs), "trivial_rhs") File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6039, in _set_replacement assert tgt_bound.issubset( torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>((FakeTensor(..., size=(2s0,)), FakeTensor(..., size=(u0,))), **{}): tgt_bound=VR[4, 1000000008000000000] not a subset of src_bound=VR[4, 1000000007999999992] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146001 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145898	2025-01-31 00:25:20 +00:00
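The precision loss described in the commit above can be reproduced with plain Python integers and floats. The `floor_to_int` helper below is hypothetical and only illustrates the no-op idea (return an integer unchanged rather than routing it through a float); it is not the symbolic `FloorToInt` implementation.

```python
import math

x = 1000000007999999992  # an integer above 2**53
assert x > 2**53

# Round-tripping through float changes the value.
print(int(float(x)))  # 1000000008000000000

# Above 2**53, adjacent integers can map to the same float.
print(float(2**53) == float(2**53 + 1))  # True


def floor_to_int(value):
    """Hypothetical helper: integers pass through untouched, floats are floored."""
    if isinstance(value, int):
        return value
    return math.floor(value)


print(floor_to_int(x) == x)  # True
```

Skipping the float conversion for integer inputs keeps the upper bound exact, which is what the subset check on the value ranges requires.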
soulitzer	c72b536420	Add manual override flag for core ATen op detection during bc check (#146052 ) Fixes https://github.com/pytorch/pytorch/issues/146049 Today the bc detection logic ignores allow_list for core ATen ops (a PR landed 4 months ago to enable this). The problem is that if I have a PR that removes an op, the script can no longer check whether that op is a core ATen op (today we just error out). With my fix, the script (1) conservatively assumes the op is a core ATen op in such cases and (2) allows the user to specify in their ALLOW_LIST entry that their op is not a core ATen op. Test plan: - This is tested 2 PRs above `016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052 Approved by: https://github.com/albanD	2025-01-30 23:57:01 +00:00
briancoutinho	720b8d0d8d	[inductor/profiler] add kernel kwargs instrumentation (#145573 ) ## About As above, record the kernel launch kwargs. These tend to be constexpr arguments to triton kernels, such as block size. ## Test program Note: install triton before proceeding (pip install triton) triton_test.py>>> ``` import torch from torch.profiler import profile, ProfilerActivity def foo(x, y): a = torch.sin(x) b = torch.cos(y) return a + b def main(): x = torch.randn(10, 10).cuda() y = torch.randn(10, 10).cuda() opt_foo = torch.compile(foo) z = opt_foo(x, y) # Profile the kernel function on the GPU with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True ) as prof: z = opt_foo(x, y) # Export the trace to a file prof.export_chrome_trace("my_kernel_trace.json") if __name__ == "__main__": main() ``` Run it and we should get a trace file my_kernel_trace.json. The output has a triton event with the kernel_kwargs attribute. ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815, "ts": 2045246693014.959, "dur": 75.662, "args": { ... "kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)", "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py", "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor" } }, ``` ## Unit Test Updated unit test: ``` pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573 Approved by: https://github.com/davidberard98, https://github.com/jansel	2025-01-30 23:51:44 +00:00
Avik Chaudhuri	8117656162	nonzero_static with symint size (#146006 ) Summary: Previously `nonzero_static` would force specialization on the `size` argument. This PR enables it to be used with a dynamic `size` argument. Test Plan: added test Differential Revision: D68874784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146006 Approved by: https://github.com/angelayi	2025-01-30 23:42:42 +00:00
Ke Wen	9fdc20809a	[PGNCCL] Simplify support macro definition (#145964 ) - Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964 Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang ghstack dependencies: #145893	2025-01-30 23:26:32 +00:00
PyTorch MergeBot	4280232f21	Revert "Advance past fc window for stft center (#145437 )" This reverts commit 3ef1551f5a745c1d37ff421eb4678814ef4483e4. Reverted https://github.com/pytorch/pytorch/pull/145437 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks some slow trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145437#issuecomment-2625840742))	2025-01-30 23:14:16 +00:00
Murray Steele	f85e4c1360	Enable C++ API parity tests on AArch64 (#145370 ) Re-enables C++ API parity tests on AArch64 which now pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145370 Approved by: https://github.com/albanD	2025-01-30 22:42:49 +00:00
Pat Vignola	2f60f12f8b	[Torch] Extract arange_out resizing logic into a helper function that can be used by other devices (#145747 ) Summary: We want to use the resizing implementation for arange_out in other devices (in this case MTIA), to make sure that the computations match and to avoid off-by-one-errors. Test Plan: Existing CI tests pass. Differential Revision: D68694489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145747 Approved by: https://github.com/mortzur	2025-01-30 22:37:00 +00:00
Nikita Shulga	99a0940991	[MPS] Fix regression in non-contig bitwise ops (#146085 ) Caused by https://github.com/pytorch/pytorch/pull/128393 that changed the semantics of `needsGather`, which resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous Fixes https://github.com/pytorch/pytorch/issues/145203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085 Approved by: https://github.com/dcci	2025-01-30 22:36:56 +00:00
Eddie Yan	e2917245fb	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee, https://github.com/malfet	2025-01-30 22:33:50 +00:00
PyTorch MergeBot	7391cea857	Revert "[triton] Update pin to tip of 3.2 release (#145867 )" This reverts commit 5e5da9bd9afdbb51da3dcc39947347279ccd9130. Reverted https://github.com/pytorch/pytorch/pull/145867 on behalf of https://github.com/ZainRizvi due to Sorry, this PR may have been written correctly, but something is clearly broken with the infra that's making CI very unhappy with this new triton version. Since this has been blocking viable/strict upgrades for a couple days now, I'm reverting this PR. I'll sync with @atalman on how we should fix this. ([comment](https://github.com/pytorch/pytorch/pull/145867#issuecomment-2625720817))	2025-01-30 22:24:09 +00:00
Aaron Orenstein	23695ea002	Fix dynamo use of `list[int]` in graph break (#145554 ) This reintroduces the change backed out by #145393 and fixes the underlying problem. Although using a BuiltinVariable was better than nothing when we saw a GenericAlias it had problems if there was a graph break and we had to reconstruct the original python code which BuiltinVariable did as a simple `list` instead of a `list[int]`. This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct. Original commit changeset: 77b9193acb23 python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554 Approved by: https://github.com/anijain2305 ghstack dependencies: #145551, #145552, #145553	2025-01-30 22:21:40 +00:00
Aaron Orenstein	fbb076cc45	Fix call to create_load_global (#145553 ) There is no version of create_load_global() that takes three parameters - any use of this function will fail. I think this is probably the correct fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145553 Approved by: https://github.com/anijain2305 ghstack dependencies: #145551, #145552	2025-01-30 22:21:40 +00:00
Aaron Orenstein	ccbbc88bbb	Turn on mypy for _dynamo/variables/builtin.py (#145552 ) The fact that mypy errors were ignored was hiding several bugs in builtin.py (for example the previous diff's incorrect override and use of `call_getattr`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145552 Approved by: https://github.com/anijain2305, https://github.com/Skylion007 ghstack dependencies: #145551	2025-01-30 22:21:32 +00:00
Aaron Orenstein	f3120f6d26	Remove incorrect BuiltinVariable.call_hasattr() (#145551 ) BuiltinVariable.call_hasattr() overrides the base class - but actually behaves differently. The base is `obj.call_hasattr(tx, attr)` but BuiltinVariable's version is `<unused>.call_hasattr(tx, obj, attr)`. The BuiltinVariable version is used as a pattern from `call_self_handler()` for `BuiltinVariable(hasattr)`. I think the other version is just used for internal `hasattr(obj, name)` so I renamed that one to `call_obj_hasattr`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145551 Approved by: https://github.com/anijain2305	2025-01-30 22:21:19 +00:00
clr	d100e9ae74	inductor: Don't throw an internal error when an nn.module is missing an attribute (#145122 ) If an nn.module getattr call throws, we should make sure that we don't crash with an internal error Note that I couldn't figure out how to test this, so advice would be awesome. I have my best case attempt at https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122 Approved by: https://github.com/jansel	2025-01-30 21:55:29 +00:00
Natalia Gimelshein	08ff11e9d0	initialize device when pinning memory on this device, short circuit i… (#145752 ) …s_pinned if device is not initialized Do not land RFC potential fix for #144687 Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752 Approved by: https://github.com/albanD Co-authored-by: albanD <albandes@fb.com>	2025-01-30 21:37:29 +00:00
Michael Lazos	1252c1933d	Update to remind users to use torch.compile template (#145960 ) Users have been submitting fuzzer issues without meeting the requirements outlined in the torch.compile issue template. This updates the note to remind users to use the torch.compile template for torch.compile bugs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145960 Approved by: https://github.com/eellison	2025-01-30 21:34:40 +00:00
Michael Lazos	d14046b58d	Update fuzzer guidance to include rng (#145962 ) Add another condition to fuzzer issue guidance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145962 Approved by: https://github.com/eellison	2025-01-30 21:33:57 +00:00
Yidi Wu	7e7341bddd	[hop] fix unbacked_bindings meta for while_loop (#143559 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143559 Approved by: https://github.com/zou3519	2025-01-30 21:33:09 +00:00
Thomas Bohnstingl	9f9904172d	[scan] scan dim handling in user-facing scan() (#145179 ) This PR introduces the capability that the scan dim is handled in the user facing scan() call. Internally, the scan dim is always shifted to dim 0 and then the scan is performed over that dim. This is a follow-up PR from https://github.com/bohnstingl/pytorch/pull/3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145179 Approved by: https://github.com/ydwu4	2025-01-30 21:09:07 +00:00
Ankita George	70f6aaa786	[OSS] Add kwargs to fsspec reader/writer (#145845 ) Summary: Add kwargs to fsspec reader/writer. This will be used when reading/writing from huggingface because it needs a token to access the repositories Test Plan: https://fburl.com/anp/agkrlas1 ability to read write to hf with fsspec Differential Revision: D68738777 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145845 Approved by: https://github.com/mhorowitz	2025-01-30 21:00:58 +00:00
Justin Chu	e6c39d37e9	[ONNX] Create deprecation warning on dynamo_export (#146003 ) Deprecation of `torch.onnx.dynamo_export`: * [`torch/onnx/_internal/_exporter_legacy.py`](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86): Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. [[1]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86) [[2]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR231-R234) [[3]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR442-R445) [[4]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR700-R703) This PR also removed the `**_` kwarg on onnx.export such that users get an error when they supply an unexpected argument. Updated to emit deprecation warning because it is more appropriate: https://docs.python.org/3/library/exceptions.html#DeprecationWarning Pull Request resolved: https://github.com/pytorch/pytorch/pull/146003 Approved by: https://github.com/titaiwangms	2025-01-30 20:13:32 +00:00
Nikita Shulga	1fdb4d65c0	[MPS] Extend `torch.mm`/`torch.bmm` to integral types (#145809 ) By using the `naive_mm` kernel, but making sure that accumulation is done over int32 for smaller int types (and float for half and bfloat), as well as adding `naive_bmm` that follows the same pattern. Remove stale restriction on `torch.dot` (which works fine on MacOS-14/15) This also enables integer op flavors for: - `addmv` - `einsum` - `inner` - `linalg.multi_dot` - `matmul` - `mv` - `tensordot` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809 Approved by: https://github.com/dcci	2025-01-30 19:35:25 +00:00
Jack Zhang	3ef1551f5a	Advance past fc window for stft center (#145437 ) Long overdue follow-up on https://github.com/pytorch/pytorch/pull/73432/files#diff-5f3d4caa0693a716fc46fd7f6339312f1b5f0bf89e3a3ff58e9dc13a9486b17aR719 Onnx stft doesn't support centering, [and all of the existing tests are for center = False](https://github.com/pytorch/pytorch/blob/main/test/onnx/test_pytorch_onnx_onnxruntime.py#L8026). I will open a follow-up issue to address this, this is just a nice-to-have. Pr chain: - -> [Advance past fc window for stft center #145437](https://github.com/pytorch/pytorch/pull/145437) - [Add stft option to align window for center = false #145324](https://github.com/pytorch/pytorch/pull/145324) - [Add istft option to align window for center = false](https://github.com/pytorch/pytorch/pull/145510) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145437 Approved by: https://github.com/justinchuby, https://github.com/iseeyuan	2025-01-30 19:09:18 +00:00
Yidi Wu	a3698ebd5c	[while_loop] specialize when cond_fn return constants (#144515 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515 Approved by: https://github.com/zou3519	2025-01-30 19:02:34 +00:00
Bin Bao	16420a78eb	[AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039 ) Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now. Test Plan: CI Differential Revision: D68871245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039 Approved by: https://github.com/yushangdi, https://github.com/hl475	2025-01-30 19:01:19 +00:00
Yidi Wu	d1143c4b37	[export] fix non-strict pre_dispatch exporting while_loop (#145762 ) fix https://github.com/pytorch/pytorch/issues/145737. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145762 Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519, https://github.com/avikchaudhuri	2025-01-30 18:58:34 +00:00
clr	f746bb6311	config: Don't spam warnings about reference type configs (#145800 ) Summary: https://github.com/pytorch/pytorch/issues/145755 The is_dynamic check for reference types was subtly broken, causing log spam after it was accessed Added an explicit type for is_default for reference types to make sure this behaviour is correct Pull Request resolved: https://github.com/pytorch/pytorch/pull/145800 Approved by: https://github.com/eellison	2025-01-30 18:57:16 +00:00
Gabriel Ferns	5a527fa5ee	Make sure not using cpp wrapper when setting nvtx training annotation (#145538 ) Longer term would be good to add as a feature to cpp_wrapper, but this makes sure it doesn't fail on main. Not sure if this needs a test because it's not meant to compose, but will add one if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538 Approved by: https://github.com/desertfire	2025-01-30 18:34:22 +00:00
Luca Wehrstedt	3ee655e4d4	[async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846 ) There's a sleep that is issued in order to "nudge" CUDA to do the right scheduling decision, but this is issued on iteration number 2. However, when the world size is 2, we never reach that iteration, which led to a suboptimal scheduling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846 Approved by: https://github.com/yifuwang	2025-01-30 18:26:34 +00:00
Ke Wen	51ee9b154e	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-30 18:19:00 +00:00
eellison	7796e308d0	Record inputs at time of tracing, constrain to them for triton fn (#145448 ) Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context. We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448 Approved by: https://github.com/zou3519 ghstack dependencies: #145953	2025-01-30 16:54:08 +00:00
PyTorch MergeBot	967cf85f3a	Revert "Update mi300 labels to account for multiple clusters. (#145923 )" This reverts commit 3e135993bd0fa08cbff565ae76bb15cb08e1d6d0. Reverted https://github.com/pytorch/pytorch/pull/145923 on behalf of https://github.com/atalman due to reverting back to one cluster ([comment](https://github.com/pytorch/pytorch/pull/145923#issuecomment-2625022826))	2025-01-30 16:45:50 +00:00
eellison	1c3df9ca8c	Fix signif_strides_equal for symints, dedupe (#145953 ) Previous impl would take a size hint, which was failing internally with a
```
    strides1 = [V.graph.sizevars.size_hint(strides1[i]) for i in non_1_indices]
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/torch/_inductor/sizevars.py", line 554, in size_hint
    return int(out)
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
```
There are unbacked tests in test_triton which should exercise this, as well as other tests for these functions when they were added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145953 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-30 16:44:32 +00:00
matthewhagraphcore	aaddfc5a7f	Add TORCHINDUCTOR_VEC_ISA_OK env var for vec_isa_ok (#134667 ) Adds a `TORCHINDUCTOR_VEC_ISA_OK` env var for `vec_isa_ok` for A|B testing purposes. Similar setup to `fx_graph_remote_cache` to allow for default `None`. No tests were present for any other config settings here, nor for `vec_isa_ok` so I didn't add any. Motivation: PyTorch uses filelock with a timeout to determine if the CPU supports particular intrinsics: pytorch/torch/_inductor/cpu_vec_isa.py Therefore if 2 processes are running, each process encounters the HAS_CPU test, and if it cannot acquire the lock for checking vec_isa_ok the main thread will be put to sleep. Hence there is a bias towards non-sleeping processes in acquiring the lock, i.e. newly spawned processes. To avoid this, use an env variable so that each process is aware of this without going through the check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134667 Approved by: https://github.com/eellison	2025-01-30 16:22:48 +00:00
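The "default `None`" setup mentioned in the commit above could look roughly like the following tri-state lookup. This is a sketch only: the helper name `vec_isa_ok_default` and the `"1"`/`"0"` string convention are assumptions made for illustration, not taken from the PyTorch source.

```python
import os
from typing import Mapping, Optional


def vec_isa_ok_default(environ: Mapping[str, str] = os.environ) -> Optional[bool]:
    """Hypothetical tri-state default: True/False for "1"/"0", otherwise None."""
    value = environ.get("TORCHINDUCTOR_VEC_ISA_OK")
    if value == "1":
        return True
    if value == "0":
        return False
    # Unset or unrecognized: None means "run the normal (lock-protected) ISA check".
    return None
```

Returning `None` rather than `False` keeps "not configured" distinguishable from "explicitly disabled", which is what lets a process skip the lock-protected check only when the variable is set.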
PyTorch MergeBot	5fa28bbe40	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit 18a7a04c4adecda3be17dd364d48d484fd1dcdba. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))	2025-01-30 16:01:52 +00:00
titaiwangms	50086ab537	[ONNX] Delete `rename_dynamic_shapes_with_model_inputs` (#146002 ) Basically, this function brings more cons than pros. It was nice to have an automation help users to convert top-level key of dynamic shapes to arg names. However, this function has a bug when the model input has the same amount as dynamic_shapes in coincidence: ```python input_names # 'input_ids', 'past_key_values.0.key', 'past_key_values.0.value', 'past_key_values.1.key', 'past_key_values.1.value', 'past_key_values.2.key', 'past_key_values.2.value', 'past_key_values.3.key', 'past_key_values.3.value', 'past_key_values.4.key', 'past_key_values.4.value', 'attention_mask', 'position_ids' inspect.sig(model.forward).parameters # mappingproxy(OrderedDict([('input_ids', <Parameter "input_ids: Optional[torch.LongTensor] = None">), ('past_key_values', <Parameter "past_key_values: Union[transformers.cache_utils.Cache, Tuple[Tuple[torch.Tensor]], NoneType] = None">), ('attention_mask', <Parameter "attention_mask: Optional[torch.FloatTensor] = None">), ('token_type_ids', <Parameter "token_type_ids: Optional[torch.LongTensor] = None">), ('position_ids', <Parameter "position_ids: Optional[torch.LongTensor] = None">), ('head_mask', <Parameter "head_mask: Optional[torch.FloatTensor] = None">), ('inputs_embeds', <Parameter "inputs_embeds: Optional[torch.FloatTensor] = None">), ('labels', <Parameter "labels: Optional[torch.LongTensor] = None">), ('use_cache', <Parameter "use_cache: Optional[bool] = None">), ('output_attentions', <Parameter "output_attentions: Optional[bool] = None">), ('output_hidden_states', <Parameter "output_hidden_states: Optional[bool] = None">), ('return_dict', <Parameter "return_dict: Optional[bool] = None">), ('cache_position', <Parameter "cache_position: Optional[torch.LongTensor] = None">)])) ``` In the above case, the given input_names is following onnx graph, while it has the same length as torch model forward call. 
This kind of case makes it difficult to detect, and automate for users. On the other hand, the error message from torch.export.export is quite informative that I believe users will know how to go from there:
```python
import torch


class Model(torch.nn.Module):
    def forward(self, x=None, y=None):
        return x + y


dim = torch.export.Dim("x", min=1, max=6)
onnx_program = torch.export.export(
    Model(),
    (),
    kwargs={"x": torch.randn(2, 3), "y": torch.randn(2, 3)},
    dynamic_shapes={"custom_input_x": {0: dim}, "custom_input_y": {0: dim}},
)
# torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['x', 'y'] of `inputs`, but here they are ['custom_input_x', 'custom_input_y']. Alternatively, you could also ignore arg names entirely and specify `dynamic_shapes` as a list/tuple matching `inputs`. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146002 Approved by: https://github.com/justinchuby	2025-01-30 16:01:38 +00:00
IvanKobzarev	894ef8c1e3	[torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623 ) Issue: https://github.com/pytorch/pytorch/issues/144888 Torchbench of timm lcnet_050 model fails on accuracy in case of `--freezing` `--inference` `--bfloat16` `res_error==0.12` If convolution inductor constant folding is turned off - `res_error==0.016` `float16 error ~ 0.00669` `float16 without conv folding ~ 0.0018` Convolution folding increases the error by almost one order of magnitude. I think we should revisit and try to do something to improve the accuracy for conv folding, e.g. doing conv folding at compilation time with float64? At the moment I am adding counters to identify if convolution folding happened, and in case of bfloat16 and conv_folding - increase multiplier to the max level (10) to pass accuracy test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623 Approved by: https://github.com/eellison	2025-01-30 12:46:35 +00:00
Aidyn-A	ffa628169d	[ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 (#145728 ) The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not re-usable on newer devices without modifications. This PR adds a guard for this matmul to be sm_90 specific. Once the kernel is there, the guard may be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145728 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-01-30 11:19:58 +00:00
shangdiy	6bd19e65b1	add inductor_triton_kernel_mapping_post_grad.json to tlparseadd changes (#145954 ) Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same). Add debug trace artifact to inductor_triton_kernel_mapping_post_grad.json (debug artifact for provenance tracking) to tlparse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954 Approved by: https://github.com/YUNQIUGUO	2025-01-30 06:18:48 +00:00
cyyever	8a6e9a88e9	Let PYTORCH_NO_CUDA_MEMORY_CACHING have effect only when its value is 1 (#145905 ) Fixes #145661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905 Approved by: https://github.com/eqy, https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-30 05:11:10 +00:00
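The behavior named in the title above can be illustrated with a short Python sketch. The actual check is not shown in this log, so the helper `caching_disabled` is hypothetical, and treating only the exact string `"1"` as "on" is one reading of the title rather than a confirmed detail.

```python
import os
from typing import Mapping


def caching_disabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical check: caching is disabled only when the variable is exactly "1"."""
    # A mere presence test (e.g. `"PYTORCH_NO_CUDA_MEMORY_CACHING" in environ`)
    # would also fire for "0" or an empty string, which is the pitfall to avoid.
    return environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
```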
Boyuan Feng	58cc6693cb	[BE] Type annotate wrapper_benchmark.py and cuda_combined_scheduling.py (#145542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145542 Approved by: https://github.com/eellison	2025-01-30 03:53:52 +00:00
Nikita Shulga	8cc6f17334	[CD] Install OpenMP from homebrew (#145889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145889 Approved by: https://github.com/atalman ghstack dependencies: #145871, #145870	2025-01-30 03:19:51 +00:00
Nikita Shulga	0d5f0a81c5	[CMake] Find HomeBrew OpenMP on MacOS (#145870 ) Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870 Approved by: https://github.com/Skylion007, https://github.com/atalman ghstack dependencies: #145871	2025-01-30 03:19:51 +00:00
cyy	116af809eb	Use std::string_view (#145906 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906 Approved by: https://github.com/albanD	2025-01-30 03:14:27 +00:00
Benjamin Glass	933b6d9830	cpp_wrapper: enable in aarch64 and x86 nightly dashboard performance runs (#145791 ) Adds `cpp_wrapper` mode to the nightly inductor benchmark runs, as well as optionally for manually triggered runs. This is justified by `aot_inductor` already being in those runs. Additionally, re-enables `aot_inductor` in the nightly aarch64 runs. It was disabled 5 months ago to deal with a performance instability, which has likely gone away at this point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145791 Approved by: https://github.com/desertfire	2025-01-30 02:55:45 +00:00
Gabriel Ferns	32bb6f83d5	Make sure that benchmark_harness is set before running (#145532 ) Running torch compile with these options causes an error, because the benchmark code isn't generated but is still called. ``` options={'profile_bandwidth_output': 'foo', 'benchmark_harness': False} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145532 Approved by: https://github.com/eellison	2025-01-30 01:25:53 +00:00
Ke Wen	25ca05eebf	[PGNCCL] Correct some ifdef's (#145893 ) `create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893 Approved by: https://github.com/c-p-i-o	2025-01-30 01:05:21 +00:00
Vasu Agrawal	73dde451b7	[pytorch] Sprinkle in a few `template` keywords (#145877 ) Summary: These seem to be necessary to get compilation working on Windows with CUDA 12.8. I'm not sure whether this means that all of the previous compilers were broken, and the new one is better, or whether this is a regression in NVCC 12.8. Either way, as long as the CI passes for existing versions, this should unblock us from CUDA 12.8 enablement on Windows. See D68663662 for more details on the CUDA 12.8 enablement. Test Plan: CI! Reviewed By: akrieger Differential Revision: D68787925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145877 Approved by: https://github.com/Skylion007	2025-01-30 00:57:40 +00:00
angelayi	72699950b0	Copy model before benchmark warmup runs (#145858 ) Fixes https://github.com/pytorch/pytorch/issues/144772 The eager warmup runs cause the model to change state, so that later when we export it, the model is different from when we export it directly out of the box. For some reason exporting the model with the changed state causes issues but exporting the initial model is ok. This is the reason why the accuracy checks pass but the performance check fails when exporting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858 Approved by: https://github.com/desertfire	2025-01-30 00:36:33 +00:00
clr	6b41f310c2	config: Support str env variables (#145980 ) Summary: This allows us to use environment variables to set string values. We've added tests for the specific functionality implemented here. Note that we already accidentally started setting up configs to use this, so we're just adding the feature. Additionally, we're not fully validating the underlying type when we set the value (and in general, it's more difficult than we would like to do this). Let me know if people feel strongly, and we can add a PR to do this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980 Approved by: https://github.com/yushangdi, https://github.com/oulgen	2025-01-30 00:13:02 +00:00
Yang Wang	a9ed7bd78e	[utilization] pipeline to create clean db records (#145327 ) upload_utilization_script to generate db-ready-insert records to s3 - generate two files: metadata and timeseries in ossci-utilization buckets - convert log record to db format ones - add unit test job for tools/stats/ Related Prs: setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310 add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595 add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-01-29 23:48:50 +00:00
Ke Wen	18a7a04c4a	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it. UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 23:20:22 +00:00
PyTorch MergeBot	b60120d0df	Revert "[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 )" This reverts commit 81685d81eb86595d169f55a564da26eaafb2ddf5. Reverted https://github.com/pytorch/pytorch/pull/145746 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking in trunk. See functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/13032483748/job/36358184032) [HUD commit link](`81685d81eb`) ([comment](https://github.com/pytorch/pytorch/pull/145746#issuecomment-2623108958))	2025-01-29 23:02:23 +00:00
Colin Peppler	521588519d	re-use FloorDiv for RShift (#145898 ) I encountered this C++ compilation error.
```
579 | int64_t var_6 = (static_cast<int64_t>(std::floor((1.0/2.0)u0)) | static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0))))) | std::floor((1.0/16.0)(static_cast<int64_t>(std::floor((1.0/2.0)u0)) | static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0))))));
    |                 ~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~
    |                 |                  |
    |                 int64_t {aka long int}   double
```
Then, I figured out where this std::floor came from with the help of Bob's guard provenance tool. It comes from RShift which is used in `triton.next_power_of_2`. --- Before, we used `std::floor`
```
int64_t var_6 = (
    static_cast<int64_t>(std::floor((1.0/2.0)u0))
    | static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0)))))
    | std::floor((1.0/16.0)(static_cast<int64_t>(std::floor((1.0/2.0)u0)) # no cast to int here.
    | static_cast<int64_t>(std::floor((1.0/4.0)static_cast<int64_t>(std::floor((1.0/2.0)u0))))));
```
Now, we use `c10::div_floor_integer` instead
```
int64_t var_6 = (
    (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L)))
    | (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L))))
    | (c10::div_floor_integer(static_cast<int64_t>((c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) | (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))), static_cast<int64_t>(16L)));
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145898 Approved by: https://github.com/desertfire, https://github.com/bobrenjc93 ghstack dependencies: #145802	2025-01-29 22:50:22 +00:00
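The arithmetic behind this change can be illustrated in plain Python. This is only a sketch of why an integer floor division is a safer lowering for a right shift than a float multiply followed by `floor`; it is not Inductor's codegen, and both helper names are made up for the example.

```python
import math


def rshift_via_floordiv(x: int, n: int) -> int:
    """Integer-only lowering: x >> n equals floor division by 2**n."""
    return x // (1 << n)


def rshift_via_float_floor(x: int, n: int) -> float:
    """Float-based lowering, mimicking std::floor((1.0 / 2**n) * x), which yields a double."""
    return float(math.floor((1.0 / (1 << n)) * x))


if __name__ == "__main__":
    # Integer floor division agrees with >> for negative values too.
    print(rshift_via_floordiv(-5, 1), -5 >> 1)
    # The float path returns a float and loses low bits for large inputs.
    big = 2**62 + 2
    print(rshift_via_floordiv(big, 1), rshift_via_float_floor(big, 1))
```

The float variant has two drawbacks: its result type is a floating-point value, which is what makes the bitwise `|` in the generated C++ ill-formed, and values above 2**53 can no longer be represented exactly.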
eellison	3df961d99b	give emulate_precision_casts an envar (#145948 ) this was requested internally Pull Request resolved: https://github.com/pytorch/pytorch/pull/145948 Approved by: https://github.com/mlazos	2025-01-29 22:43:32 +00:00
rzou	2e5886dcc4	Add fake_impl for unique_consecutive (#145649 ) Summary: It's fairly similar to torch.unique and torch.unique_dim. Test Plan: New test Pull Request resolved: https://github.com/pytorch/pytorch/pull/145649 Approved by: https://github.com/ezyang, https://github.com/eellison	2025-01-29 22:33:16 +00:00
rzou	1e57154af3	Require that all HOPs be imported at `import torch` time (#145939 ) E.g. torch.ops.higher_order.cond does not exist until it is imported, which is bad if it shows up in an FX graph or is used in some code somewhere. This PR also makes some more HOPs get imported at `import torch` time. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939 Approved by: https://github.com/ydwu4 ghstack dependencies: #145938	2025-01-29 22:27:52 +00:00
rzou	2141c1aebe	Better hop_db comment; move test to a non-export test file (#145938 ) Goal is for people to better test their HOPs. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145938 Approved by: https://github.com/ydwu4	2025-01-29 22:27:52 +00:00
Simon Fan	e02c038a23	[dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590 ) FIXES https://github.com/pytorch/pytorch/issues/144775 frfr See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385 We fixed some silent incorrectness, but it results in less nodes DCE'd. The benchmark iteration loop had some dead code which could contain side effect ops that aren't safe to DCE. The regression is expected. This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and aligns with the benchmarking used by performance tests New benchmark results: ```python dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364 # after https://github.com/pytorch/pytorch/pull/144319 cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257 # before https://github.com/pytorch/pytorch/pull/144319 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590 Approved by: https://github.com/jansel ghstack dependencies: #145447	2025-01-29 22:14:47 +00:00
Jason Ansel	793dfc27e0	[inductor] Add some typing to triton.py (#145688 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145688 Approved by: https://github.com/Skylion007, https://github.com/eellison ghstack dependencies: #145671, #145695	2025-01-29 21:56:40 +00:00
Jason Ansel	5db0ad92e3	[inductor] Remove mask_str from IndexingOptions (#145695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145695 Approved by: https://github.com/eellison ghstack dependencies: #145671	2025-01-29 21:56:40 +00:00
Jason Ansel	23ff899164	[inductor] Fix handling of fixed XBLOCK larger than xnumel=1 (#145671 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145671 Approved by: https://github.com/eellison	2025-01-29 21:56:32 +00:00
Aaron Gokaslan	bb2fb554a9	[BE]: Update CUTLASS submodule to 3.7.0 (#145172 ) * This has a couple of new features, but mostly has a lot of bugfixes for the prior releases * This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it. * Most of the remaining diff noise is copyright year updates on the CUTLASS submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172 Approved by: https://github.com/eqy, https://github.com/henrylhtsang	2025-01-29 21:48:01 +00:00
James Wu	d0aa1386b8	Disable AOTAutogradCache for triton version < 3.2 (#145937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937 Approved by: https://github.com/bdhirsh	2025-01-29 21:32:16 +00:00
PyTorch MergeBot	1185b81c51	Revert "[dynamo] Use polyfill to implement comparison operators (#144485 )" This reverts commit d1f82de2bf4ce4d4461791a9c9b2e759202db0bb. Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))	2025-01-29 21:30:42 +00:00
Catherine Lee	953e80936e	[linter] Grep linter batches long command (#145950 ) If the command is too long, the linter fails with ``` Failed due to OSError: [Errno 7] Argument list too long: 'grep' ``` Fix this by batching the command so it is shorter Limit of 750k was chosen due to `getconf ARG_MAX` returns ~1M on my mac. My guess is that most people shouldn't hit this unless they run --all-files and the directory length is long. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950 Approved by: https://github.com/wdvr	2025-01-29 21:23:27 +00:00
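The batching idea in the commit above can be sketched as follows. This is a minimal illustration, not the linter's actual code: the function name `batch_by_length` and the constant `MAX_ARGS_LENGTH` are hypothetical, and only the 750k figure comes from the commit message.

```python
from typing import Iterable, Iterator

# Hypothetical constant: the commit message cites a 750k limit, chosen
# because `getconf ARG_MAX` reported roughly 1M on the author's machine.
MAX_ARGS_LENGTH = 750_000


def batch_by_length(
    args: Iterable[str], limit: int = MAX_ARGS_LENGTH
) -> Iterator[list[str]]:
    """Yield lists of args whose combined length stays within `limit`.

    Each argument is charged len(arg) + 1 to account for its separator.
    An argument longer than the limit still gets a batch of its own.
    """
    batch: list[str] = []
    size = 0
    for arg in args:
        cost = len(arg) + 1
        # Start a new batch when adding this arg would exceed the limit.
        if batch and size + cost > limit:
            yield batch
            batch, size = [], 0
        batch.append(arg)
        size += cost
    if batch:
        yield batch


example_batches = list(batch_by_length(["aaaa", "bbbb", "cc"], limit=10))
```

Each batch would then be passed to a separate `grep` invocation, so no single command line exceeds the operating system's argument-length limit.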
Zain Rizvi	a6e3f294f1	Don't use mypy daemon in CI (#145961 ) This is an attempt to fix flaky mypy errors in CI that look like: ``` dmypy status --verbose connection_name : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock pid : 32233 error : timed out Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill ``` "Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961 Approved by: https://github.com/malfet	2025-01-29 21:15:29 +00:00
bglass@quansight.com	40ccb7a86d	cpp_wrapper: Move #includes to per-device header files (#145932 ) Summary: This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts. Co-authored-by: Benjamin Glass <bglass@quansight.com> Differential Revision: D68656960 Pulled By: benjaminglass1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932 Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1 Co-authored-by: bglass@quansight.com <bglass@quansight.com>	2025-01-29 21:08:45 +00:00
sanchitintel	8bd7bf3269	[Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894 ) ### Summary `RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data. Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels. ### Caveat #### _Before_ No corresponding results in PyTorch profiler profiling data #### _After_ \| Inductor config settings \| What kernel name looks like in profiling data \| Comments\| \|-------------------\|------------------------------------\|--------------------\| \| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code \| `graph_x_cpp_fused_y` \| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| \| `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper \| `graph_x_kernel` \| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| \| Both `inductor_config.cpp.descriptive_names = "inductor_node"` & Inductor CPP Wrapper \| `graph_x_cpp_fused_flex_attention_y`\| Easy to interpret data \| \| Neither of the two configs \| `graph_x_kernel`\| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel	2025-01-29 20:54:46 +00:00
Danial Javady	bb4964013f	Add deterministic kernel for reflection2d (#136241 ) Adds feature for #98925 Tests pass for both the existing reflectionpad2d and the new one I inserted. Summary of the work: A simple conditional check for deterministic mode dispatches to a different kernel. This kernel does not use any atomic operations and produces deterministic results because, instead of going from each output to its input (a 1:1 relationship), I am doing the opposite: going from each input -> all of its outputs, which is 1 to many. These operations are done in the same order every execution, as I simply traverse the data set with a grid-stride loop and use simple linearized indexing into the input tensor. Each thread computes the 4 conditionals, which are then used to see if the input has an output in the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, bottom right. I did not focus on performance for this PR as that would expand the scope heavily. If there are any performance questions, though, I can answer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241 Approved by: https://github.com/eqy, https://github.com/albanD	2025-01-29 20:34:03 +00:00
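The input -> all outputs traversal described above can be illustrated in plain Python. This is a sketch under assumptions, not the CUDA kernel from the PR: it assumes the commit concerns gradient accumulation for reflection padding, and every function name here is hypothetical. `grad_scatter` mimics the output-driven order that needs atomics on a GPU; `grad_gather` visits each input element once and sums the outputs that read it (the interior copy plus up to 8 reflected regions).

```python
def reflect(x, size):
    """Map an out-of-range index back into [0, size) by reflection."""
    if x < 0:
        return -x
    if x >= size:
        return 2 * (size - 1) - x
    return x


def grad_scatter(grad_out, H, W, pl, pr, pt, pb):
    """Output-driven: every output element adds into one input slot."""
    grad_in = [[0.0] * W for _ in range(H)]
    for oy in range(H + pt + pb):
        for ox in range(W + pl + pr):
            grad_in[reflect(oy - pt, H)][reflect(ox - pl, W)] += grad_out[oy][ox]
    return grad_in


def sources(i, size, before, after):
    """Output indices along one axis that read input index i."""
    out = [i + before]  # interior copy, always present
    if 1 <= i <= before:  # conditional: reflected into the leading pad
        out.append(before - i)
    if size - 1 - after <= i <= size - 2:  # conditional: trailing pad
        out.append(before + 2 * (size - 1) - i)
    return out


def grad_gather(grad_out, H, W, pl, pr, pt, pb):
    """Input-driven: each input element sums its outputs in a fixed order."""
    grad_in = [[0.0] * W for _ in range(H)]
    for i in range(H):
        rows = sources(i, H, pt, pb)  # two row conditionals
        for j in range(W):
            cols = sources(j, W, pl, pr)  # two column conditionals
            grad_in[i][j] = sum(grad_out[oy][ox] for oy in rows for ox in cols)
    return grad_in
```

Because each input slot is written by exactly one loop iteration, the summation order is fixed, which is the property the commit relies on for determinism.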
Ankita George	2b8c28099a	[OSS] Add no dist as an argument to DCP top level apis (#145754 ) Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP Test Plan: existing tests pass Differential Revision: D68714246 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754 Approved by: https://github.com/daulet-askarov	2025-01-29 20:33:37 +00:00
chilli	2d5d022594	Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-29 20:27:39 +00:00
Nikita Shulga	6aed6c042e	[CD] Install ninja and setuptools from PyPI (#145871 ) As well as typing extensions, they are available from PyPI, no reason to install them from Anaconda Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871 Approved by: https://github.com/Skylion007	2025-01-29 19:47:16 +00:00
PyTorch MergeBot	b80482988f	Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870 )" This reverts commit c26bb9ba5bd40d256a25436212279bc7e4b436ae. Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
PyTorch MergeBot	b52e8d521e	Revert "[CD] Install ninja and setuptools from PyPI (#145871 )" This reverts commit eea7d395e5faa9a4be5b60f6668c0bdf5163e3a0. Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))	2025-01-29 19:34:27 +00:00
Jack Taylor	082fab0fc7	[64-bit] Int64 casting for UpSampleNearest3D (#144865 ) Fixes #144855 Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865 Approved by: https://github.com/eqy	2025-01-29 19:30:09 +00:00
angelayi	1c9014a135	[export] Add tlparse to draft-export (#145810 ) Dependent on https://github.com/ezyang/tlparse/pull/87/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810 Approved by: https://github.com/pianpwk	2025-01-29 19:26:00 +00:00
PyTorch MergeBot	6371c25b91	Revert "[c10d] Add NCCL memory allocator (#145675 )" This reverts commit 9fd6722fc9068eeaa176754acb315fc7e0f6416c. Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))	2025-01-29 18:30:30 +00:00
PyTorch MergeBot	e0525dbca9	Revert "inductor.config.descriptive_names = False is not actually supported (#145523 )" This reverts commit edf266e9bbbf6063f7c4a336ffb50234e11a0a82. Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))	2025-01-29 18:27:44 +00:00
PyTorch MergeBot	284f217011	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit 97b3b73f3e96bb8684064715b93c825ba0395475. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))	2025-01-29 18:24:29 +00:00
PyTorch MergeBot	0d6343347f	Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448 )" This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed. Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))	2025-01-29 18:07:12 +00:00
Avik Chaudhuri	1a613c3342	bump counters for unbacked binding names (#145882 ) Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning based on what the symbols in the original shape env before serialization were. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match. Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen). Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882 Approved by: https://github.com/pianpwk	2025-01-29 17:46:21 +00:00
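Why the starting point of the counter matters can be shown with a small sketch. It assumes symbol names of the form `u<N>`; the helper `next_free_index` is hypothetical and is not the shape-env API.

```python
import itertools
import re


def next_free_index(symbol_names, prefix="u"):
    """Return the smallest index above every existing `<prefix><N>` name."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    indices = [
        int(m.group(1)) for name in symbol_names if (m := pattern.match(name))
    ]
    return max(indices, default=-1) + 1


# Names from the original shape env; note the gap (u1 and u2 are unused).
existing = ["u0", "u3", "s1"]

# Seed the counter from the highest existing index, not from the number
# of bindings seen, so fresh names can never collide with u3.
counter = itertools.count(next_free_index(existing))
fresh = f"u{next(counter)}"
```

Bumping once per processed binding would leave the counter at 2 here, so a later fresh symbol could be named `u3` and clash with the existing one.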
rpsilva	4abff4b271	Introduce cache clearing APIs for the lazy graph executor (#144489 ) This PR introduces two new methods to the LazyGraphExecutor class: - ClearComputationCache(): Allows clearing the entire computation cache. - RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash. The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance: - Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently. - Selectively remove cache entries to analyze the impact on subsequent computations. - Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations. On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash. Simultaneously, another motivation, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489 Approved by: https://github.com/wconstab	2025-01-29 17:38:01 +00:00
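The two operations named in the commit above (clearing the whole cache and removing one entry by hash) can be illustrated on a toy LRU cache. This is a Python sketch with hypothetical names; the real `LazyGraphExecutor` is a C++ class and its internals are not shown in the commit message.

```python
from collections import OrderedDict


class ComputationCache:
    """Toy LRU cache keyed by a computation hash (illustrative only)."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._entries = OrderedDict()

    def add(self, key, computation):
        self._entries[key] = computation
        self._entries.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def get(self, key):
        if key not in self._entries:
            return None  # cache miss
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def remove(self, key):
        """Drop one entry; returns True if it was present."""
        if key in self._entries:
            del self._entries[key]
            return True
        return False

    def clear(self):
        """Drop every entry, e.g. between test cases."""
        self._entries.clear()

    def __len__(self):
        return len(self._entries)
```

A test can call `clear()` between runs to force a recompilation, or `remove(hash)` to check that only the targeted computation misses afterwards.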
Animesh Jain	d1f82de2bf	[dynamo] Use polyfill to implement comparison operators (#144485 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485 Approved by: https://github.com/jansel	2025-01-29 17:37:40 +00:00
saienduri	3e135993bd	Update mi300 labels to account for multiple clusters. (#145923 ) We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923 Approved by: https://github.com/jeffdaily	2025-01-29 16:56:43 +00:00
Animesh Jain	4499d60d56	[dynamo][builtin-skipfiles-cleanup] Remove types (#145909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909 Approved by: https://github.com/zou3519 ghstack dependencies: #145856, #145875, #145878, #145892	2025-01-29 16:47:02 +00:00
Brian Hirsh	ed141d7d1a	dont assign a size to _assert_scalar in partitioner (#143877 ) Fixes https://github.com/pytorch/pytorch/issues/143876 Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877 Approved by: https://github.com/zou3519	2025-01-29 16:21:37 +00:00
Yu, Guangye	3b3aac0cde	Filter out iGPU if dGPU is found on XPU (#144378 ) # Motivation for https://github.com/pytorch/pytorch/issues/143914 On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context. Now I generalize the logic as below: 1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform. 2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform. 3. No GPU is found (neither iGPU nor dGPU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378 Approved by: https://github.com/EikanWang, https://github.com/gujinghui	2025-01-29 15:53:16 +00:00
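The three-step selection rule above can be sketched as follows. The `Platform` and `Device` classes are hypothetical stand-ins for SYCL objects, and the input list is assumed to contain only Level Zero (L0) platforms.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    is_integrated: bool  # True for an iGPU, False for a dGPU


@dataclass
class Platform:
    name: str
    devices: list = field(default_factory=list)


def enumerate_devices(platforms):
    """Pick devices from a single platform, preferring dGPUs over iGPUs."""
    # Pass 1 looks for dGPUs, pass 2 for iGPUs.
    for want_integrated in (False, True):
        for platform in platforms:
            matches = [
                d for d in platform.devices if d.is_integrated == want_integrated
            ]
            if matches:
                # All returned devices come from one platform, so they
                # can share the same context.
                return matches
    return []  # step 3: no GPU found at all
```

Returning devices from only one platform is what keeps every enumerated device in the same SYCL context, as the commit message requires.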
Bert Maher	5e5da9bd9a	[triton] Update pin to tip of 3.2 release (#145867 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867 Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte	2025-01-29 15:17:58 +00:00
Aidyn-A	81685d81eb	[ATen][CUDA] Implement 128 bit vectorization v2 (#145746 ) This is a re-base PR to my previous one #141959. Description from the original PR: This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100. <details> <summary>The benchmark code used </summary> ```Python import time import torch from torch.profiler import profile, ProfilerActivity def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False): device = torch.device("cuda") shapes = [] for p in range(24, 30): shape = 1<<p shapes.append(shape) for shape in shapes: for _ in range(6): x = torch.randn(shape, device=device, dtype=dtype) y = function(x) if print_profile: x = torch.randn(shape, device=device, dtype=dtype) with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof: y = function(x) print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) x = torch.randn(shape, device=device, dtype=dtype) torch.cuda.synchronize() t1 = time.perf_counter() for _ in range(6): y = function(x) torch.cuda.synchronize() t2 = time.perf_counter() perf_time = (t2 - t1) / 6 print(f"{function.__name__}, {dtype}, {shape}, {perf_time}") if check_numerics: x_cpu = x.cpu() y_cpu = function(x_cpu).cuda() try: torch.testing.assert_allclose(y_cpu, y) except AssertionError as error: print("An exception occurred:", error) def main(): ops = [ torch.relu, torch.sigmoid, torch.tanh, torch.nn.functional.gelu, torch.sin, torch.exp, ] dtypes = [ torch.float16, torch.bfloat16, torch.float32, ] for op in ops: for dtype in dtypes: benchmark(op, dtype=dtype) torch.cuda.empty_cache() if __name__ == "__main__": main() ``` </details> <details> <summary> Results </summary> \| op \| dtype \| size \| time after \| time before \| % improvement \| \| ---- \| ---- \| ---- \| ---- \| ---- \| ---- \| \| relu \| torch.float16 \| 33554432 \| 4.84E-05 \| 5.06E-05 \| 4.66296539127052 \| \| relu \| torch.float16 \| 
67108864 \| 9.22E-05 \| 9.64E-05 \| 4.56491432752297 \| \| relu \| torch.float16 \| 134217728 \| 0.000180343495837102 \| 0.000187981834945579 \| 4.23543919508829 \| \| relu \| torch.float16 \| 268435456 \| 0.000355071155354381 \| 0.000370856161074092 \| 4.44558942107169 \| \| relu \| torch.float16 \| 536870912 \| 0.000704489842367669 \| 0.000736006341564159 \| 4.47366268483987 \| \| relu \| torch.bfloat16 \| 16777216 \| 3.03E-05 \| 3.04E-05 \| 0.166504085842689 \| \| relu \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.45848238875716 \| \| relu \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.65E-05 \| 3.56122651631445 \| \| relu \| torch.bfloat16 \| 134217728 \| 0.000180805509444326 \| 0.000187998676362137 \| 3.97840029317567 \| \| relu \| torch.bfloat16 \| 268435456 \| 0.000356242332297067 \| 0.000371279485989362 \| 4.22104627356745 \| \| relu \| torch.bfloat16 \| 536870912 \| 0.000708114336399982 \| 0.000736773828975856 \| 4.04729732229083 \| \| relu \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.61E-05 \| 0.0442587268354941 \| \| relu \| torch.float32 \| 33554432 \| 9.33E-05 \| 9.30E-05 \| -0.259070913799022 \| \| relu \| torch.float32 \| 67108864 \| 0.000181321326332788 \| 0.000181289506144822 \| -0.0175490597877115 \| \| relu \| torch.float32 \| 134217728 \| 0.000356896334172537 \| 0.000356570177245885 \| -0.0913870206618981 \| \| relu \| torch.float32 \| 268435456 \| 0.000709421835684528 \| 0.000707465515006334 \| -0.275762681635911 \| \| relu \| torch.float32 \| 536870912 \| 0.00141372415237129 \| 0.00141036518228551 \| -0.237597276678471 \| \| sigmoid \| torch.float16 \| 16777216 \| 3.10E-05 \| 3.16E-05 \| 2.10012593866895 \| \| sigmoid \| torch.float16 \| 33554432 \| 4.91E-05 \| 5.23E-05 \| 6.37710600666122 \| \| sigmoid \| torch.float16 \| 67108864 \| 9.30E-05 \| 0.000100057009452333 \| 7.61866144555331 \| \| sigmoid \| torch.float16 \| 134217728 \| 0.000180928347011407 \| 0.000194982004662355 \| 7.76752669390248 \| \| sigmoid \| torch.float16 \| 
268435456 \| 0.000355658994521946 \| 0.00038468533117945 \| 8.16128288742412 \| \| sigmoid \| torch.float16 \| 536870912 \| 0.000705982849467546 \| 0.000764021339515845 \| 8.22094900634937 \| \| sigmoid \| torch.bfloat16 \| 16777216 \| 3.08E-05 \| 3.17E-05 \| 2.90965915673149 \| \| sigmoid \| torch.bfloat16 \| 33554432 \| 4.87E-05 \| 5.24E-05 \| 7.63503884668234 \| \| sigmoid \| torch.bfloat16 \| 67108864 \| 9.33E-05 \| 0.000100019678939134 \| 7.21238137428013 \| \| sigmoid \| torch.bfloat16 \| 134217728 \| 0.000180786165098349 \| 0.000194868014659733 \| 7.78922964250206 \| \| sigmoid \| torch.bfloat16 \| 268435456 \| 0.000355564659306159 \| 0.000384909333661199 \| 8.25297835063321 \| \| sigmoid \| torch.bfloat16 \| 536870912 \| 0.000705831005082776 \| 0.000764102345177283 \| 8.2557070566308 \| \| sigmoid \| torch.float32 \| 16777216 \| 4.93E-05 \| 5.65E-05 \| 14.5314136197766 \| \| sigmoid \| torch.float32 \| 33554432 \| 9.32E-05 \| 9.31E-05 \| -0.120169865610833 \| \| sigmoid \| torch.float32 \| 67108864 \| 0.000181328505277634 \| 0.000180455681402236 \| -0.481349512069855 \| \| sigmoid \| torch.float32 \| 134217728 \| 0.000357362829769651 \| 0.000356093340087682 \| -0.35523831137877 \| \| sigmoid \| torch.float32 \| 268435456 \| 0.000708921831877281 \| 0.000707052337626616 \| -0.263709504574663 \| \| sigmoid \| torch.float32 \| 536870912 \| 0.00141358317341656 \| 0.0014090768333214 \| -0.318788464654745 \| \| tanh \| torch.float16 \| 16777216 \| 3.03E-05 \| 3.03E-05 \| -0.0912564658661808 \| \| tanh \| torch.float16 \| 33554432 \| 4.90E-05 \| 5.07E-05 \| 3.46644442974484 \| \| tanh \| torch.float16 \| 67108864 \| 9.30E-05 \| 9.68E-05 \| 3.99871369815531 \| \| tanh \| torch.float16 \| 134217728 \| 0.00018052199933057 \| 0.000188717152923346 \| 4.53969799978138 \| \| tanh \| torch.float16 \| 268435456 \| 0.000355684508879979 \| 0.000373026006855071 \| 4.8755280430115 \| \| tanh \| torch.float16 \| 536870912 \| 0.000706660988119741 \| 0.000740105014604827 \| 
4.73268328765002 \| \| tanh \| torch.bfloat16 \| 16777216 \| 2.99E-05 \| 3.03E-05 \| 1.21049563135981 \| \| tanh \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.06E-05 \| 3.48836101041744 \| \| tanh \| torch.bfloat16 \| 67108864 \| 9.28E-05 \| 9.69E-05 \| 4.39944918036626 \| \| tanh \| torch.bfloat16 \| 134217728 \| 0.000180710999605556 \| 0.000189167990659674 \| 4.67984299382829 \| \| tanh \| torch.bfloat16 \| 268435456 \| 0.000356062994493792 \| 0.000372666652159144 \| 4.66312363882606 \| \| tanh \| torch.bfloat16 \| 536870912 \| 0.000707100164921333 \| 0.000740134331863374 \| 4.67178040408393 \| \| tanh \| torch.float32 \| 16777216 \| 5.61E-05 \| 5.64E-05 \| 0.439595755746353 \| \| tanh \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.31E-05 \| 0.00287633090228212 \| \| tanh \| torch.float32 \| 67108864 \| 0.000181465332085888 \| 0.000180895323865116 \| -0.31411411437098 \| \| tanh \| torch.float32 \| 134217728 \| 0.000356963835656643 \| 0.000356073161431899 \| -0.249513854283251 \| \| tanh \| torch.float32 \| 268435456 \| 0.000709201170442005 \| 0.00070707315656667 \| -0.300057862849997 \| \| tanh \| torch.float32 \| 536870912 \| 0.00141367283261692 \| 0.00141030051357423 \| -0.238550176877922 \| \| gelu \| torch.float16 \| 16777216 \| 2.73E-05 \| 3.17E-05 \| 15.921079070745 \| \| gelu \| torch.float16 \| 33554432 \| 5.06E-05 \| 5.55E-05 \| 9.76345374333098 \| \| gelu \| torch.float16 \| 67108864 \| 9.65E-05 \| 0.000106600326641152 \| 10.4308039074712 \| \| gelu \| torch.float16 \| 134217728 \| 0.000187776672343413 \| 0.000208565829476962 \| 11.0712139447915 \| \| gelu \| torch.float16 \| 268435456 \| 0.000370216167842348 \| 0.000412251994324227 \| 11.3544005187205 \| \| gelu \| torch.float16 \| 536870912 \| 0.000737301345604161 \| 0.000819394170927505 \| 11.1342296895002 \| \| gelu \| torch.bfloat16 \| 16777216 \| 3.02E-05 \| 3.08E-05 \| 1.78405479367653 \| \| gelu \| torch.bfloat16 \| 33554432 \| 5.13E-05 \| 5.69E-05 \| 10.9929393318302 \| \| gelu \| 
torch.bfloat16 \| 67108864 \| 9.76E-05 \| 0.00010968199543034 \| 12.3420807512356 \| \| gelu \| torch.bfloat16 \| 134217728 \| 0.000189661824454864 \| 0.000214487663470209 \| 13.0895287371091 \| \| gelu \| torch.bfloat16 \| 268435456 \| 0.000374197009174774 \| 0.000423670164309442 \| 13.2211519391275 \| \| gelu \| torch.bfloat16 \| 536870912 \| 0.000743675006863972 \| 0.000842577001700799 \| 13.299088166737 \| \| gelu \| torch.float32 \| 16777216 \| 5.06E-05 \| 5.04E-05 \| -0.413385894716413 \| \| gelu \| torch.float32 \| 33554432 \| 9.31E-05 \| 9.32E-05 \| 0.134157041722546 \| \| gelu \| torch.float32 \| 67108864 \| 0.000181480175039421 \| 0.000180836669945469 \| -0.354586992112075 \| \| gelu \| torch.float32 \| 134217728 \| 0.000356874331676712 \| 0.000356305002545317 \| -0.159532104402047 \| \| gelu \| torch.float32 \| 268435456 \| 0.000708909006789327 \| 0.000706991491218408 \| -0.270488250615287 \| \| gelu \| torch.float32 \| 536870912 \| 0.00141321367118508 \| 0.00140937082081412 \| -0.271922813181618 \| \| sin \| torch.float16 \| 16777216 \| 3.04E-05 \| 3.11E-05 \| 2.21834939018859 \| \| sin \| torch.float16 \| 33554432 \| 4.85E-05 \| 5.23E-05 \| 7.72165512511596 \| \| sin \| torch.float16 \| 67108864 \| 9.31E-05 \| 9.98E-05 \| 7.24947099480072 \| \| sin \| torch.float16 \| 134217728 \| 0.000180371008658161 \| 0.000194791161144773 \| 7.99471744039613 \| \| sin \| torch.float16 \| 268435456 \| 0.000355454161763191 \| 0.000384903668115536 \| 8.28503630574026 \| \| sin \| torch.float16 \| 536870912 \| 0.000705183832906187 \| 0.000764360166310022 \| 8.39161799270973 \| \| sin \| torch.bfloat16 \| 16777216 \| 3.11E-05 \| 3.10E-05 \| -0.257677954940036 \| \| sin \| torch.bfloat16 \| 33554432 \| 4.89E-05 \| 5.24E-05 \| 7.34808420323539 \| \| sin \| torch.bfloat16 \| 67108864 \| 9.26E-05 \| 0.000100248667877167 \| 8.22347488801205 \| \| sin \| torch.bfloat16 \| 134217728 \| 0.000180674154156198 \| 0.00019567032965521 \| 8.30012215584937 \| \| sin \| torch.bfloat16 
\| 268435456 \| 0.000355360486234228 \| 0.000386023331278314 \| 8.62865913118873 \| \| sin \| torch.bfloat16 \| 536870912 \| 0.00070483615854755 \| 0.000766805159704139 \| 8.79197248964745 \| \| sin \| torch.float32 \| 16777216 \| 5.67E-05 \| 5.64E-05 \| -0.441348534920039 \| \| sin \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.30E-05 \| -0.496458540364117 \| \| sin \| torch.float32 \| 67108864 \| 0.000181706990891447 \| 0.000180556671693921 \| -0.633062708199702 \| \| sin \| torch.float32 \| 134217728 \| 0.000356894995396336 \| 0.000356046327700218 \| -0.237791985616354 \| \| sin \| torch.float32 \| 268435456 \| 0.000708777321657787 \| 0.000707602652255446 \| -0.165731798471427 \| \| sin \| torch.float32 \| 536870912 \| 0.00141263716310884 \| 0.00140912582476934 \| -0.248566187496451 \| \| exp \| torch.float16 \| 16777216 \| 3.00E-05 \| 3.04E-05 \| 1.40099098901014 \| \| exp \| torch.float16 \| 33554432 \| 4.86E-05 \| 5.03E-05 \| 3.44611943643906 \| \| exp \| torch.float16 \| 67108864 \| 9.37E-05 \| 9.55E-05 \| 1.96412400380129 \| \| exp \| torch.float16 \| 134217728 \| 0.000180913504057874 \| 0.000187193179347863 \| 3.47109262113439 \| \| exp \| torch.float16 \| 268435456 \| 0.00035607748820136 \| 0.000369079003576189 \| 3.65131630210701 \| \| exp \| torch.float16 \| 536870912 \| 0.000707551507124056 \| 0.000732363162872692 \| 3.50669251620789 \| \| exp \| torch.bfloat16 \| 16777216 \| 2.98E-05 \| 3.04E-05 \| 1.74345594341654 \| \| exp \| torch.bfloat16 \| 33554432 \| 4.88E-05 \| 5.04E-05 \| 3.40217856534821 \| \| exp \| torch.bfloat16 \| 67108864 \| 9.32E-05 \| 9.62E-05 \| 3.29219958210226 \| \| exp \| torch.bfloat16 \| 134217728 \| 0.000180999826019009 \| 0.000187239318620414 \| 3.44723679499521 \| \| exp \| torch.bfloat16 \| 268435456 \| 0.000355944503098726 \| 0.000369370992605885 \| 3.77207384585864 \| \| exp \| torch.bfloat16 \| 536870912 \| 0.000707135167128096 \| 0.000733066000975668 \| 3.66702648277075 \| \| exp \| torch.float32 \| 16777216 \| 4.89E-05 
\| 5.63E-05 \| 15.1245314346532 \| \| exp \| torch.float32 \| 33554432 \| 9.34E-05 \| 9.31E-05 \| -0.259945454477446 \| \| exp \| torch.float32 \| 67108864 \| 0.000181152504713585 \| 0.000180474346658836 \| -0.374357536939058 \| \| exp \| torch.float32 \| 134217728 \| 0.000356771342922002 \| 0.000355627329554409 \| -0.3206573034212 \| \| exp \| torch.float32 \| 268435456 \| 0.000708404501589636 \| 0.00070713268360123 \| -0.179532736671163 \| \| exp \| torch.float32 \| 536870912 \| 0.00141283582585553 \| 0.00140944866385932 \| -0.23974208002295 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-01-29 13:32:59 +00:00
Ting Lu	354fe48db9	Add magma cuda build 12.8 (#145765 ) https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765 Approved by: https://github.com/malfet	2025-01-29 08:43:38 +00:00
gasoonjia	501c5972f0	[pytorch] raise exception when calling dim order on sparse tensor (#145888 ) This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor. Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888 Approved by: https://github.com/Jack-Khuu	2025-01-29 06:15:44 +00:00
David Berard	2e8c080ab1	[inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583 ) Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`. This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583 Approved by: https://github.com/jansel	2025-01-29 05:46:05 +00:00
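The difference between the two signature forms can be illustrated with a small sketch. The `ArgName` class below is a hypothetical reconstruction whose field and method names are assumptions; the example types (`*fp32`, `i32`) are also illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgName:
    """An argument name that can optionally be marked as constexpr."""

    name: str
    is_constexpr: bool = False

    def full_name(self):
        # Form used in the kernel's parameter list.
        if self.is_constexpr:
            return f"{self.name} : tl.constexpr"
        return self.name


args = [
    (ArgName("in_ptr0"), "*fp32"),
    (ArgName("xnumel"), "i32"),
    (ArgName("XBLOCK", is_constexpr=True), "constexpr"),
]

# The signature is keyed by the bare name, without the annotation.
signature = {arg.name: ty for arg, ty in args}

# The parameter list keeps the annotation.
params = ", ".join(arg.full_name() for arg, _ in args)
```

Keeping the bare name and the annotation separate avoids keys such as `"XBLOCK : tl.constexpr"` leaking into the signature dictionary.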
Animesh Jain	3f77002b96	[dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875, #145878	2025-01-29 05:30:06 +00:00
Animesh Jain	236793684d	[dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi ghstack dependencies: #145856, #145875	2025-01-29 05:30:06 +00:00
Animesh Jain	a479656cd2	[dynamo][builtin-skipfiles-removal] Remove logging (#145875 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875 Approved by: https://github.com/williamwen42 ghstack dependencies: #145856	2025-01-29 05:29:58 +00:00
Animesh Jain	64ee57847b	[dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856 ) [dynamo][builtin-skipfiles-cleanup] Remove more builtins Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856 Approved by: https://github.com/zou3519	2025-01-29 05:29:47 +00:00
Aaron Orenstein	7178b827d7	PEP585: Missed conversions (#145342 ) Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342 Approved by: https://github.com/bobrenjc93	2025-01-29 05:24:36 +00:00
bobrenjc93	8696e59ae2	add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821 ) Followup from https://github.com/pytorch/pytorch/issues/130290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821 Approved by: https://github.com/eellison, https://github.com/ezyang	2025-01-29 04:36:32 +00:00
Justin Chu	776bdb962c	[ONNX] Support subgraphs with 1+ outputs (#145860 ) Fixed a bug in _handle_output_node where additional output values were not added as graph outputs Fixes #145734 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860 Approved by: https://github.com/titaiwangms	2025-01-29 04:13:23 +00:00
cyy	fd515e4f59	Fix C++20 Wambiguous-reversed-operator warnings (#144126 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126 Approved by: https://github.com/albanD	2025-01-29 03:13:57 +00:00
Simon Mahns	90a6db4a9c	[be][pytorch] Fix backend in autocast (#145859 ) Summary: fixing backend typo (BAKCNEDS -> BACKENDS) Test Plan: ci Differential Revision: D68573324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859 Approved by: https://github.com/jvandebon	2025-01-29 03:13:08 +00:00
Mwiza Kunda	9be2e88d41	Fix lowering to inductor IR for triton CPU (#144389 ) Example failing test: `pytest -s test_torchinductor_opinfo.py -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU. Failure: ```shell triton.compiler.errors.CompilationError: at 10:11: def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 25 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = 1.0 tl.static_assert(tmp1.dtype == tl.float32) tmp2 = ops.polygamma(tmp1, tmp0) ^ NameError('ops is not defined') ``` This occurs because the registered triton fallbacks are not used during the lowering to inductor IR. Marked the problematic code in the excerpt below from `6bc17b0725/torch/_inductor/lowering.py (L572)` ```python def make_pointwise( fn, override_return_dtype=None, override_device=None, override_fn_when_input_bool=None, override_fn_when_gpu_float64=None, allow_alpha=False, triton_fallback=None, ): def inner(inputs: TensorBox, alpha=None): if triton_fallback is not None and any( isinstance(inp, IRNode) and is_triton(inp) for inp in inputs <--- is_triton should return True when using triton CPU ): assert not allow_alpha # not implemented return triton_fallback(inputs) inputs = promote_constants(inputs, override_return_dtype) if allow_alpha: if alpha is not None and alpha != 1: inputs = list(inputs) ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389 Approved by: https://github.com/jansel	2025-01-29 03:10:53 +00:00
Colin Peppler	50f834f134	[export] allow bit shift builtin ops (#145802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802 Approved by: https://github.com/pianpwk	2025-01-29 03:05:48 +00:00
Ting Lu	f4ca98950e	Add CUDA 12.8 libtorch image (#145789 ) https://github.com/pytorch/pytorch/issues/145570 Builds 12.8 libtorch docker/deprecate 12.1 meanwhile Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-01-29 02:59:37 +00:00
Sam Larsen	9330b6d098	Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 ) Summary: Test Plan: Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829 Approved by: https://github.com/Chillee	2025-01-29 02:52:55 +00:00
Ke Wen	9fd6722fc9	[c10d] Add NCCL memory allocator (#145675 ) This PR implements a small UI improvement over #133603. It prepares an NCCL memory allocator in torch cpp and then exposes it via pybind, so that users can use it directly. UI: ``` pool = torch.cuda.MemPool(backend.mem_allocator) with torch.cuda.use_mem_pool(pool): tensor = torch.arange(1024 * 1024 * 2, device=device) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675 Approved by: https://github.com/syed-ahmed, https://github.com/wconstab	2025-01-29 02:48:56 +00:00
Menglu Yu	29521256e1	[Customized Optimus][Inductor] Add split cat pattern in aten level (#145721 ) Summary: Thanks Microve for discovering that recGPT has some repeated similar kernels that might be optimized through optimus. After investigation, I designed a pattern in the aten level to remove such excessive kernels. trace: https://fburl.com/perfdoctor/82fauil7 tlparse: https://fburl.com/98q6tadx Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad ``` Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567 Network: Up: 341KiB Down: 359KiB (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d) Executing actions. Remaining 0/3 Command: test. Finished 2 local Time elapsed: 3:04.8s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # local run ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115 ``` https://www.internalfb.com/mlhub/pipeline/1630903954173593 # E2E ``` buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18] ``` ### how to add the config Add the following patterns to the dynamo config ``` post_grad_fusion_options: { "normalization_aten_pass": {}, "split_cat_aten_pass": {}, } ``` {F1974700331} baseline: aps-recgpt_ranking_1115_pt2_5-8cb4905c7d {F1974700216} proposal: Differential Revision: D68695717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721 Approved by: https://github.com/Yuzhen11	2025-01-29 01:59:06 +00:00
Natalia Gimelshein	331f49057d	Removes threadfence from topk kernel to improve AMD performance (#145536 ) Also marginally improves cuda perf Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536 Approved by: https://github.com/eqy	2025-01-29 01:29:15 +00:00
wz337	6f5c8fb128	[DTensor] Add pointwise ops strategy for `aten.minimum` (#145816 ) Need it for Shampoo optimizer. `9c5700ad5e/matrix_functions.py (L240-L242)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816 Approved by: https://github.com/XilunWu	2025-01-29 01:19:01 +00:00
Pian Pawakapan	15e37e4253	[export] don't always print GM in serdes logging (#145857 ) Summary: Didn't realize print_readable() would also print and not just return string Test Plan: . Differential Revision: D68781525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857 Approved by: https://github.com/angelayi, https://github.com/yiming0416	2025-01-29 01:03:02 +00:00
fan.mo	a24b25942a	Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 ) Fixes #140092 Here's what this PR does: Case 1: no `eps` is passed to the python frontend: Use the `eps` associated with `opmath_t` instead of the `eps` associated with `scalar_t` for intermediate computation. Case 2: `eps` is passed to the python frontend: Avoid downcasting `eps` to `scalar_t` and then implicitly upcasting it again in the `rqrst_input` computation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848 Approved by: https://github.com/albanD	2025-01-29 01:01:44 +00:00
Bert Maher	ae0f305bf9	[inductor] Make triton kernel autotune config defaults backward-compatible (#145494 ) If a model was torch.packaged using triton<=3.1, any user-defined autotuned kernels will have reps/warmups burned in with the old defaults (100/25). If this model is loaded with triton>=3.2, inductor's checks for unsupported non-default autotune args will fail, because triton.Autotuner's defaults for these parameters have changed to `None`. Let's explicitly support those values for backward compatibility with these older models. Differential Revision: [D68561014](https://our.internmc.facebook.com/intern/diff/D68561014/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145494 Approved by: https://github.com/aorenste	2025-01-29 00:31:39 +00:00
Mwiza Kunda	9036a22c83	[Inductor][Triton] Change propagated dtype for fp16/bf16 unwrapped 0d tensors (#145613 ) Fixes TestInductorOpInfoCPU.test_comprehensive_max_binary_cpu_float16 and related tests for Triton CPU. TestInductorOpInfoCPU is currently not run in the CI. See https://github.com/pytorch/pytorch/pull/144389#issuecomment-2608050755 for some additional context. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145613 Approved by: https://github.com/davidberard98, https://github.com/eellison, https://github.com/jansel	2025-01-29 00:23:44 +00:00
Aaron Orenstein	2f24f2eb46	Make sure to evaluate annotation strings in the context of where the prototype was created (#145667 ) This was incorrectly evaluating the annotation in the context of infer_schema - make sure to evaluate annotation strings in the context of where the prototype was created instead. Fixes #145481 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145667 Approved by: https://github.com/zou3519	2025-01-29 00:14:45 +00:00
Thomas Bohnstingl	82859f6185	[associative_scan] scan dim handling in user-facing associative_scan() (#139864 ) This PR implements the user-facing dim change, i.e., that the scan dim provided by the user is always moved to dim 0 and then the associative_scan operation always operates on dim 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139864 Approved by: https://github.com/ydwu4	2025-01-28 23:58:10 +00:00
Brian Hirsh	7ca156f0ee	partitioner: avoid inserting duplicates into heap (#145082 ) Fixes https://github.com/pytorch/pytorch/issues/145081 This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There's some code in the partitioner that iteratively adds users of a node to a heap, and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, then this can result in: (1) a node getting added to the heap many times (2) each time we pop that node, we add (duplicates of) each of that node's users to the heap (3) repeat with each user Pull Request resolved: https://github.com/pytorch/pytorch/pull/145082 Approved by: https://github.com/xmfan	2025-01-28 23:44:45 +00:00
albanD	02dd7a7803	Extend abi-stable nitpick message to all the c stable files (#145862 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145862 Approved by: https://github.com/ezyang	2025-01-28 23:22:23 +00:00
Nikita Shulga	049f042e52	Update build_wheel.sh	2025-01-28 15:14:41 -08:00
Nikita Shulga	eea7d395e5	[CD] Install ninja and setuptools from PyPI (#145871 ) Rather than Conda Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871 Approved by: https://github.com/Skylion007 ghstack dependencies: #145870	2025-01-28 23:09:38 +00:00
Nikita Shulga	c26bb9ba5b	[CMake] Find HomeBrew OpenMP on MacOS (#145870 ) Either via `OMP_PREFIX` envvar or just searching in that folder Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870 Approved by: https://github.com/Skylion007	2025-01-28 23:09:37 +00:00
Aaron Gokaslan	f388ba5986	Update CUDNN frontend submodule to 1.10.0 (#145780 ) Update to CUDNN 1.10. Most of this release is about supporting some new APIs needed for Blackwell integration and new features in the corresponding CUDNN version Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet	2025-01-28 22:54:24 +00:00
Justin Chu	af43b445a5	[ONNX] Set USE_EXPERIMENTAL_LOGIC to True (#137296 ) This sets dynamo_export to use the new export logic. The legacy dynamo export logic will be removed as a follow up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137296 Approved by: https://github.com/titaiwangms	2025-01-28 22:35:11 +00:00
Benjamin Glass	5aa5a5763e	[inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684 ) Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by: 1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format. 2. Using that function to explicitly disable TF32 generation when calling Triton, where needed. To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684 Approved by: https://github.com/eqy	2025-01-28 22:01:08 +00:00
Colin Peppler	1ffed44b42	[aotinductor] update unbacked symint runtime assertion msg (#145569 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145569 Approved by: https://github.com/chenyang78	2025-01-28 21:42:58 +00:00
Dan Zimmerman	a06a18b1bb	[ATen] Implement exception handling for hipsolver APIs (#145839 ) Summary: TSA Test Plan: CI Differential Revision: D68741194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145839 Approved by: https://github.com/Mellonta	2025-01-28 21:37:23 +00:00
Zheng, Zhaoqiong	9003d81144	change the test wheel to release wheel when release wheel available (#145252 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252 Approved by: https://github.com/seemethere, https://github.com/atalman Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-28 21:23:53 +00:00
fduwjj	4f949f282d	[c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855 ) While working on PGNCCL I found that the code triggers some lint warnings so this PR is to address them or add lint suppressor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501	2025-01-28 21:19:49 +00:00
Wei Wang	6bcb545d9c	[CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793 ) Make ci cusparselt installation be consistent with nightly binary Remove cu121 related docker build jobs and inductor runs Update test failures relating to cu121 Retry of https://github.com/pytorch/pytorch/pull/145696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793 Approved by: https://github.com/eqy, https://github.com/tinglvv	2025-01-28 21:01:58 +00:00
Isuru Fernando	ccc2878c97	Fix fractional_max_pool lowering in inductor (#144395 ) Fixes https://github.com/pytorch/pytorch/issues/141538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-28 21:00:18 +00:00
cyyever	ef28df5c9e	[Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593 ) Reland of #137843 , after checking the code again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-01-28 20:51:49 +00:00
PyTorch MergeBot	3481c2aec4	Revert "[dynamo] save/restore system random state more carefully (#145750 )" This reverts commit e3d3f2b22e4b75c64eaa2f940a2dd80c1e43435c. Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))	2025-01-28 20:51:07 +00:00
Paul Saab	28982ceb3b	[aarch64] Rebuild everything with ArmPL (#145742 ) Summary: Rebuild everything that used OpenBLAS with ArmPL Test Plan: CI, prod test Reviewed By: Nicoshev Differential Revision: D68219559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145742 Approved by: https://github.com/malfet	2025-01-28 20:48:42 +00:00
Gabriel Ferns	edf266e9bb	inductor.config.descriptive_names = False is not actually supported (#145523 ) Summary: This config is not supported (it throws an error when set), and doesn't really make sense imo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145523 Approved by: https://github.com/eellison	2025-01-28 20:22:23 +00:00
Jane Xu	515e55e692	Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764 ) This could be BC breaking, because there was a period of time when we use py_limited_api=True but don't enforce the flag, and now that we will start enforcing the flag, people's custom extensions may fail to build. This is strictly still better behavior, as it is sketchy to claim CPython agnosticism without the flag, but calling this out as potential people yelling at us. Ways to mitigate this risk + reasons this may not be too big a deal: - People haven't known about py_limited_api for extensions much due to lack of docs from python so usage is low right now - My current tutorial is in store to make new users of py_limited_api pass this flag, so it'd be a noop for them. Test plan: * Locally i'm confident as I tried rebuilding ao with this change and it reliably failed (cuz importing torch/extension.h is a nono) * Unit test wise, the normal python_agnostic one I added should work Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764 Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD	2025-01-28 20:11:05 +00:00
Nikita Shulga	8d91bfd965	[BE] Include CheckFunctionExists in `FindBLAS.cmake` (#145849 ) It's used in the script, so it must be included Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849 Approved by: https://github.com/Skylion007	2025-01-28 19:47:05 +00:00
Ryan Guo	eaff13275e	[dynamo] Properly branch on an unspecialized NN module (#145786 ) User defined NN module might have their own `__len__` or `__bool__` methods which Dynamo needs to trace through, so that side effects and/or reads to buffered writes are properly handled. This patch removes the special `UnspecializedNNModuleVariable` branch in Dynamo's branch handling, and lets these cases fall into the `UserDefinedObjectVariable` branch, which handles the aforementioned cases correctly. Fixes #145284. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145786 Approved by: https://github.com/williamwen42	2025-01-28 19:45:17 +00:00
James Wu	d9ffa5da65	Log info for AOTAutogradCache bypasses instead of warning (#145768 ) Fixes #145767 FxGraphCache also logs to info instead of warning, so let's do that Pull Request resolved: https://github.com/pytorch/pytorch/pull/145768 Approved by: https://github.com/eellison, https://github.com/bdhirsh	2025-01-28 19:25:36 +00:00
Camyll Harajli	6c09954a9e	Windows builds with VS2022 (#145319 ) [Fixes #ISSUE_NUMBER ](https://github.com/pytorch/pytorch/issues/128835) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145319 Approved by: https://github.com/huydhn	2025-01-28 19:07:24 +00:00
Pian Pawakapan	cbc4094298	[draft_export] add LOC for data-dep error logging (#145443 ) Summary: maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like: ``` 2. Data dependent error. When exporting, we were unable to evaluate the value of `u306`. This occurred at the following stacktrace: File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl: `if offsets[-1] > len(input):` As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph. Please add `torch._check(...)` to the original code to assert this data-dependent assumption. Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details. ``` This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing Test Plan: . Differential Revision: D68534017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443 Approved by: https://github.com/angelayi	2025-01-28 18:55:16 +00:00
Xinya Zhang	c32bafeb0b	[ROCm] Bump AOTriton to 0.8.2b (#145508 ) We received reports that AOTriton kernels mishandle the bias pointer, causing NaNs while fine-tuning the llama3.2-11b vision model. This PR fixes the problem. Note: this AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases, but it is considered experimental and will not be enabled right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508 Approved by: https://github.com/jeffdaily	2025-01-28 18:34:25 +00:00
eellison	621604ce46	Maintain multiple configs (#145103 ) Previously, we would finalize the config of a triton template after its first fusion. This maintains multiple configs, in case we epilogue fuse, then prologue fuse, and prologue fusion has a new better config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145103 Approved by: https://github.com/jansel, https://github.com/shunting314 ghstack dependencies: #143408	2025-01-28 18:32:14 +00:00
Ryan Guo	eaec97ab1f	[dynamo] Properly prune dead input cell object (#145781 ) This patch models input cell object as "newly created" rather than "pre-existing" python object (see added documentation for why this actually captures the semantics more accurately). This enables the `SideEffects.prune_dead_object_new` algorithm to prune away writes to input cell objects which are no longer relevant; this didn't happen prior to this patch because we modelled them as pre-existing objects, which forces us to codegen their attribute mutations. Fixes #145564. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145781 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-28 18:28:13 +00:00
eellison	8e258e2ecd	Parallelize epilogue/prologue benchmarking (#143408 ) When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked. This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node. Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel). Compilation speedups: <img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" /> This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408 Approved by: https://github.com/jansel, https://github.com/shunting314	2025-01-28 18:18:24 +00:00
Nikita Shulga	3fd4691908	[MPS] Add `op_math_t` (#145808 ) Similar to `at::opmath_t` to be used for reduction (and int mms) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808 Approved by: https://github.com/dcci	2025-01-28 18:03:52 +00:00
atalman	5382ab57d7	Move trunk windows builds to CUDA-12.4 (#145844 ) Same as : https://github.com/pytorch/pytorch/pull/130446 That should catch build regressions that were previously only detectable during the nightly builds for 12.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145844 Approved by: https://github.com/janeyx99, https://github.com/malfet	2025-01-28 18:00:51 +00:00
Huy Do	56915b093a	Fix environment deployment spam (#145823 ) With https://github.com/pytorch-labs/pytorch-gha-infra/pull/598 in place, the environment can now be removed. Fixes https://github.com/pytorch/pytorch/issues/145704 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145823 Approved by: https://github.com/clee2000	2025-01-28 17:46:31 +00:00
PyTorch MergeBot	cfbb27462e	Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373 )" This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3. Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))	2025-01-28 17:46:11 +00:00
PyTorch MergeBot	dbef2a9bc9	Revert "Remove lexicographical sorting of storage keys in torch.save (#143879 )" This reverts commit 7db0afabaaff17dd37cf846cd786610ebf6aedd3. Reverted https://github.com/pytorch/pytorch/pull/143879 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68746524 for details ([comment](https://github.com/pytorch/pytorch/pull/143879#issuecomment-2619661492))	2025-01-28 17:40:16 +00:00
Zain Rizvi	097ccd9c39	Move ROCm MI300 jobs to unstable to make CI green (#145790 ) This is a temporary change to reduce intermittent test failures. Jobs can be moved back once those machines get better runner isolation. This also sneaks in a small fix so that the build step of all the rocm jobs runs on Linux Foundation runners (the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790 Approved by: https://github.com/yangw-dev	2025-01-28 17:25:15 +00:00
saienduri	7eb51e5464	Ensure GPU isolation for kubernetes pod MI300 runners. (#145829 ) Fixes the reason behind moving the tests to unstable initially. (https://github.com/pytorch/pytorch/pull/145790) We ensure gpu isolation for each pod within kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in pytorch here. Now we stick with the GPUs assigned to the pod in the first place and there is no overlap between the test runners. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829 Approved by: https://github.com/jeffdaily	2025-01-28 17:20:46 +00:00
cyy	c751541e79	Fix cppcoreguidelines-init-variables ignorance (#141795 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795 Approved by: https://github.com/albanD	2025-01-28 17:11:37 +00:00
Mu-Chu Lee	ac87388e61	[AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639 ) Summary: Remove #ifdef USE_CUDA macros through some refactor Test Plan: Refactor code, existing tests. Differential Revision: D68636743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639 Approved by: https://github.com/desertfire	2025-01-28 16:46:00 +00:00
Dmitry Nikolaev	6967ef1b07	[ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227 ) gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that size is required only for MI300. Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300. Now `default_workspace_size=32MB` is used for gfx12 and the test passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2025-01-28 16:19:27 +00:00
Animesh Jain	80a0412b76	[dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753, #145826	2025-01-28 16:14:34 +00:00
Animesh Jain	6824a4a75d	[dynamo][builtin-skipfiles-cleanup] Remove re (#145826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826 Approved by: https://github.com/zou3519 ghstack dependencies: #145744, #145753	2025-01-28 16:14:34 +00:00
Animesh Jain	4307e6c008	[dynamo][builtin-skipfile-cleanup] Remove signal (#145753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753 Approved by: https://github.com/zou3519 ghstack dependencies: #145744	2025-01-28 16:14:23 +00:00
eellison	3a56089217	fix unbacked + view incorrectness (#145548 ) fix for https://github.com/pytorch/pytorch/issues/143498 We were incorrectly using contiguous strides for a non-contiguous tensor. There are two separate causes: 1. https://github.com/pytorch/pytorch/pull/110520 made it so we turn Views contiguous with unbacked symints because `dynamic_reshape_indexer below will fail due to the size_hint's inability to process unbacked SymInts`. Seems like we should fix. Regardless - it will make the input contiguous if input is unbacked to work around this. 2. We weren't actually making it contiguous! I filed an issue for this here: https://github.com/pytorch/pytorch/issues/145561. This is still worth landing as a fix, even though we should fix those issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145548 Approved by: https://github.com/desertfire	2025-01-28 16:03:45 +00:00
cyyever	97b3b73f3e	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-01-28 15:21:12 +00:00
Zhenbin Lin	a08f7f3266	OpenReg: fix issue of pin_memory (#145046 ) Fix issue of `pin_memory` when rewrapping a storage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145046 Approved by: https://github.com/albanD	2025-01-28 09:41:04 +00:00
Chirag Pandya	bdf6dfa17d	[chore][ez] change alloc buffer size from 4000 to 4096 (#145759 ) Summary: Allocations typically happen as a power of 2 anyway. Change the default alloc size to 4096 to eke out a bit more perf. Test: unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756, #145757	2025-01-28 09:14:07 +00:00
Animesh Jain	5c5306e8bc	[dynamo][builtin-skiplist-cleanup] Remove weakref (#145744 ) WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744 Approved by: https://github.com/jansel	2025-01-28 07:55:12 +00:00
Avik Chaudhuri	45f64e770a	relax assertion to warning for unbacked binding names (#145777 ) Summary: Quick fix following up on https://github.com/pytorch/pytorch/pull/144894 to unblock internal tests. Will keep investigating a more principled fix. Test Plan: Failures in T213563826 now pass Differential Revision: D68731710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145777 Approved by: https://github.com/angelayi	2025-01-28 07:52:40 +00:00
Michael Graczyk	0a8a0ef767	[inductor] Fix crash running wrapper_benchmark with no device (#145644 ) Fixes #145434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145644 Approved by: https://github.com/shunting314	2025-01-28 07:31:36 +00:00
eellison	a699034eec	Record inputs at time of tracing, constrain to them for triton fn (#145448 ) Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context. We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448 Approved by: https://github.com/zou3519	2025-01-28 07:07:14 +00:00
Nikita Shulga	0f5a68344a	[BE][Inductor] Simplify `custom_op` tests (#145814 ) Not sure what the motivation was behind repeating the same function over and over again for different backends. Change `test_custom_op_[123]` from accepting separate (but identical) implementations for CPU, CUDA and XPU to taking just `fn` and `fn_meta` args. Test that it is also extendable to MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/145814 Approved by: https://github.com/jansel	2025-01-28 05:58:51 +00:00
cyyever	23eb0a3201	Improve typing in torch/types.py (#145237 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2025-01-28 05:29:12 +00:00
Aaron Gokaslan	8e46d0f595	[BE]: Update typing of OrderedSet ancestor (#145783 ) Now that we are on python 3.9 minimum version we can properly use Generics in the superclass Pull Request resolved: https://github.com/pytorch/pytorch/pull/145783 Approved by: https://github.com/eellison	2025-01-28 04:43:49 +00:00
cyy	67fcc7cf02	[3/N] Remove unnecessary once flag usage (#145672 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672 Approved by: https://github.com/albanD	2025-01-28 04:28:18 +00:00
Burak Turk	01a4d86b31	add pt2 callbacks for backward pass and prevent duplicate callbacks (#145732 ) Summary: This change adds callbacks for lazy backwards compilation while preventing duplicate callbacks from being fired. Differential Revision: D68577593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145732 Approved by: https://github.com/mlazos	2025-01-28 03:50:02 +00:00
Pian Pawakapan	1a26cdd5cb	[cond] remove warning for unsupported tuple returns (#145766 ) I guess this is supported now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145766 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-01-28 03:13:36 +00:00
PyTorch MergeBot	9010649292	Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 )" This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53. Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))	2025-01-28 03:07:17 +00:00
Chirag Pandya	78f02bf07c	[bug] handle case when remote peer closes connection (#145757 ) Summary: In the case where the remote peer closes the connection, nread returns 0. In this case, we still want to free up the allocated buffer. Also, reorder the if so that the likely success case (nread > 0) is at the top of the function with an early return. Test Plan: unit tests Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757 Approved by: https://github.com/XilunWu, https://github.com/fduwjj ghstack dependencies: #145756	2025-01-28 03:06:38 +00:00
Pian Pawakapan	4be831ba2d	[draft_export] fix dense-in-memory check for inferring fakes (#145653 ) Test Plan: fixes check for dense tensors with size-1 dimensions Differential Revision: D68644028 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145653 Approved by: https://github.com/zou3519	2025-01-28 02:52:14 +00:00
James Wu	7c1fc0a047	Log cache state for AOTAutograd in title of file (#145715 ) Differential Revision: [D68692755](https://our.internmc.facebook.com/intern/diff/D68692755/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145715 Approved by: https://github.com/bobrenjc93	2025-01-28 02:14:18 +00:00
Jason Ansel	78a94c9114	[inductor] Remove type ignores from scheduler.py (#145712 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145712 Approved by: https://github.com/yanboliang, https://github.com/Skylion007 ghstack dependencies: #145692	2025-01-28 01:44:32 +00:00
Jason Ansel	2df2f9d895	[inductor] Change type of get_backend_features to OrderedSet (#145692 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145692 Approved by: https://github.com/yanboliang	2025-01-28 01:44:32 +00:00
Yifu Wang	db33d23aa8	[SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886 Approved by: https://github.com/awgu	2025-01-28 01:43:37 +00:00
William Wen	e3d3f2b22e	[dynamo] save/restore system random state more carefully (#145750 ) Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up. Note: I have verified that the previously failing internal tests now pass internally. Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750 Approved by: https://github.com/StrongerXi	2025-01-28 01:34:13 +00:00
Gabriel Ferns	f16ce3c7e9	Refactor fuzzer and add support for Dynamo (#145565 ) ## Summary: Dynamo now works with config fuzzer. For BE week, we also found and fixed 5 different bugs (in inductor): - https://github.com/pytorch/pytorch/pull/145426 - https://github.com/pytorch/pytorch/pull/145523 - https://github.com/pytorch/pytorch/pull/145527 - https://github.com/pytorch/pytorch/pull/145532 - https://github.com/pytorch/pytorch/pull/145538 ## Test Plan: New Dynamo Unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145565 Approved by: https://github.com/masnesral	2025-01-28 00:44:27 +00:00
Syed Tousif Ahmed	6eb74fbec6	Updates NCCL user buffer registration test for NCCL 2.24.3 (#145285 ) NCCL 2.24.3 changed the content of the debug output for NVLS registration. We use this debug output in our test suite to check if NVLS was successfully registered or not. Hence we need to specialize for the NCCL version in the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145285 Approved by: https://github.com/kwen2501	2025-01-28 00:24:53 +00:00
Ryan Guo	5a4d959cdb	[dynamo] Properly model torch profiler context objects (#145537 ) Prior to this patch, Dynamo conveniently modelled torch profiler context objects (e.g., `torch.profiler.profile`) as `NullContextVariable` because `torch.compile` ignores the effect of these profiler contexts. However, the semantics of these profiler contexts diverge from `contextlib.nullcontext` in the `__enter__` function, where the former returns `self` and the latter returns `None`. This causes a subtle error as observed in #125021. This patch adds back a `ProfilerContextVariable`, which addresses the aforementioned semantic discrepancy. Fixes #125021. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145537 Approved by: https://github.com/zou3519, https://github.com/williamwen42	2025-01-28 00:03:36 +00:00
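The `__enter__` difference described in #145537 can be reproduced with plain Python. This is a minimal sketch, not Dynamo code: `SelfReturningContext` is a hypothetical stand-in for a profiler-style context manager and is not the real `torch.profiler.profile`.

```python
import contextlib


class SelfReturningContext:
    """Hypothetical stand-in for a profiler-like context manager."""

    def __enter__(self):
        # Profiler-style contexts hand back the context object itself.
        return self

    def __exit__(self, exc_type, exc, tb):
        # Do not swallow exceptions.
        return False


# contextlib.nullcontext() binds its enter_result, which defaults to None.
with contextlib.nullcontext() as null_ctx:
    null_result = null_ctx

profiler_like = SelfReturningContext()
with profiler_like as ctx:
    self_result = ctx

print(null_result is None)
print(self_result is profiler_like)
```

Code that calls a method on the object bound by `with ... as` works with the second form and fails with the first, which is the kind of discrepancy the patch addresses.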
Mikayla Gawarecki	db3685a35c	Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880 ) ## Background This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`. When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine this. The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases. `6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)` ## Testing strategy The agreed upon testing strategy was as follows: - Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False) - This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested. Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880 Approved by: https://github.com/albanD ghstack dependencies: #143879	2025-01-27 23:57:30 +00:00
Mikayla Gawarecki	7db0afabaa	Remove lexicographical sorting of storage keys in torch.save (#143879 ) Currently the order is lexicographical (i.e. 0, 10, 11, ...19, 2, ....) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted). This makes the order in which storages are written the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads. Differential Revision: [D67673025](https://our.internmc.facebook.com/intern/diff/D67673025) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879 Approved by: https://github.com/albanD	2025-01-27 23:57:30 +00:00
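The ordering difference described in #143879 is easy to see with plain Python strings. This is an illustrative sketch only: the key values below are made up for the example and the code is not taken from `torch.save`.

```python
# Storage keys are stringified integers; these values are illustrative.
keys = [str(i) for i in range(12)]  # insertion order: "0", "1", ..., "11"

# Sorting strings compares character by character, so "10" sorts before "2".
lexicographic = sorted(keys)

# Sorting by integer value recovers the insertion order used above.
numeric = sorted(keys, key=int)

# Since Python 3.7, dicts preserve insertion order, so no sort is needed at all.
insertion_ordered = list(dict.fromkeys(keys))

print(lexicographic)
print(numeric)
print(insertion_ordered)
```

Dropping the sort and relying on insertion order keeps the written order aligned with the numeric order in which the keys were created.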
Colin L. Rice	c1161957a4	inductor_config_logging: Don't drop keys (#144700 ) This bit me while I was trying to debug some trace issues. In general this config is already quite large when dumping, so adding more fields doesn't make it significantly worse. Also a number of the items we are type checking for (except the test configs), don't even show up. Primarily this will help us when debugging rocm, halide, and trace configs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700 Approved by: https://github.com/ezyang	2025-01-27 23:47:25 +00:00
Jane (Yuan) Xu	7d01f6e6f2	Add ignorable commits on run_test.py to git blame ignore (#145787 ) Chanced upon it while searching through cpp_extension related code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145787 Approved by: https://github.com/malfet	2025-01-27 23:24:48 +00:00
Chirag Pandya	3ce68dc61e	[c10d] Flush file in file recorder (#145458 ) Summary: Flushing file to hopefully prevent file corruptions as reported in https://github.com/pytorch/pytorch/pull/145125 Test Plan: Couldn't get file corruption to occur in my tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458 Approved by: https://github.com/kwen2501	2025-01-27 23:15:52 +00:00
Chirag Pandya	5534c270db	[chore] fix new linter (#145756 ) Summary: Fix new linter that's complaining when I made changes to this file: class 'LibUVStoreDaemon' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator Test Plan: make lint passes Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756 Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj	2025-01-27 22:48:12 +00:00
PyTorch MergeBot	2de53b3b65	Revert "pickler for GraphModule (#141659 )" This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5. Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))	2025-01-27 22:39:30 +00:00
Huy Do	006397fac3	Remove FBGEMM sccache hack (#145664 ) Testing https://github.com/pytorch/pytorch/actions/runs/12959358756, sccache is working correctly now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145664 Approved by: https://github.com/wdvr	2025-01-27 22:00:06 +00:00
David Berard	69e82d02d3	[inductor][3/N] triton support post-#5512, tt.divisibility format (#145575 ) 1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}`. This was an oversight in the first PR I added. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR. 2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575 Approved by: https://github.com/SamGinzburg	2025-01-27 21:48:58 +00:00
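The format change in item 1 of #145575 amounts to giving each argument index tuple its own dictionary entry. The sketch below only illustrates the two layouts quoted in the message; `split_hint_keys` is a hypothetical helper and is not the code in `hints.py`.

```python
# Old layout quoted in the commit message: one key groups several index tuples.
old_hints = {((0,), (1,)): [["tt.divisibility", 16]]}


def split_hint_keys(hints):
    """Hypothetical helper: give each argument index tuple its own entry."""
    return {
        index: [list(attr) for attr in attrs]  # copy so entries do not share lists
        for group, attrs in hints.items()
        for index in group
    }


new_hints = split_hint_keys(old_hints)
print(new_hints)
```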
Animesh Jain	993b229665	[dynamo][dicts] Fix dict.__new__ bug (#145723 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145723 Approved by: https://github.com/jansel, https://github.com/StrongerXi ghstack dependencies: #145519, #145547, #145558	2025-01-27 21:42:43 +00:00
Animesh Jain	7e1c7253e9	[dynamo][builtin-skipfile-cleanup] Support tuple.__new__ (#145558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145558 Approved by: https://github.com/jansel, https://github.com/StrongerXi ghstack dependencies: #145519, #145547	2025-01-27 21:42:43 +00:00
Joel Schlosser	1ba1b7b597	Support remaining _like factory functions for NJT (#144889 ) Fixes #144761 This PR adds NJT impls for those _like functions that were previously missing: * `full_like()` * `rand_like()` * `randint_like()` It also fixes a bug in existing *_like functions when a new device is specified. Fix is to also transfer `offsets` / `lengths` to the new device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889 Approved by: https://github.com/soulitzer	2025-01-27 21:33:51 +00:00
Nikita Shulga	3a23d75b37	[MPS] Fix `c0::metal::log_gamma` correctness on M4 (#145740 ) To work around a bug where `abs` method call seems to be ignored before calling log, which could be reproduced by running the following code (submitted as FB16415011 ) ```swift import Metal func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) { guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") } let device = library.device guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") } guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") } guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self) for i in 0..<nelem { ibuf_data[i] = T(sin(Float(2 + i))) } guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let obuf_data = obuf.contents().assumingMemoryBound(to: T.self) computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc)) computeEncoder.setBuffer(obuf, offset:0, index: 0) computeEncoder.setBuffer(ibuf, offset:0, index: 1) computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1)) computeEncoder.endEncoding() cmdBuffer.commit() cmdBuffer.waitUntilCompleted() print("Results for \(String(describing: T.self)):", terminator: " ") for i in 0..<nelem { print(obuf_data[i], terminator: " ") } print() } let shader_source = """ #include <metal_stdlib> template<typename T> float foo(T x) { const auto abs_x = ::metal::abs(static_cast<float>(x)); auto rc = ::metal::log(abs_x); return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x))); } kernel void half_kernel( device half* out_ptr0, constant half* in_ptr0, uint xindex [[thread_position_in_grid]] ) { auto inp = in_ptr0[xindex]; auto out = foo(inp); out_ptr0[xindex] = static_cast<half>(out); } kernel void float_kernel( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { auto inp = in_ptr0[xindex]; auto out = foo(inp); out_ptr0[xindex] = static_cast<float>(out); } """ let options = MTLCompileOptions() options.mathMode = .safe options.mathFloatingPointFunctions = .precise guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } let library = try! device.makeLibrary(source:shader_source, options:options) run_shader(library:library, kernel_name:"half_kernel", type: Float16.self) run_shader(library:library, kernel_name:"float_kernel", type: Float.self) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740 Approved by: https://github.com/dcci	2025-01-27 21:24:22 +00:00
Aaron Orenstein	60f98262f1	PEP585: .github (#145707 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145707 Approved by: https://github.com/huydhn	2025-01-27 21:21:01 +00:00
Ryan Guo	bfaf76bfc6	[dynamo] clear out traced frames at the start of `test_log_traced_frames` (#145640 ) The test was being flaky in CI, and this patch fixes it. Fixes #137461. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145640 Approved by: https://github.com/williamwen42	2025-01-27 20:49:59 +00:00
Ting Lu	93dd6bc4d8	Add CUDA 12.8 installation and manylinux-cuda12.8 (#145567 ) Breaking https://github.com/pytorch/pytorch/pull/145557 into two parts. Need to have manylinux-cuda12.8 in order to build magma. Issue: https://github.com/pytorch/pytorch/issues/145570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145567 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-01-27 20:49:07 +00:00
Randolf Scholz	64cd81712d	`torch.distributions`: replace `numbers.Number` with `torch.types.Number`. (#145086 ) Fixes #144788 (partial) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145086 Approved by: https://github.com/malfet	2025-01-27 20:24:55 +00:00
Huy Do	2f8ad8f4b9	Run inductor perf benchmark on ROCm (#145763 ) This requires https://github.com/pytorch/pytorch/pull/144594. The test run on PT2 dashboard is at https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2020%20Jan%202025%2019%3A46%3A14%20GMT&stopTime=Mon%2C%2027%20Jan%202025%2019%3A46%3A14%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=rocm&lBranch=144594&lCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b&rBranch=144594&rCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b Pull Request resolved: https://github.com/pytorch/pytorch/pull/145763 Approved by: https://github.com/jeffdaily	2025-01-27 20:19:03 +00:00
Ryan Guo	66631bc84b	[dynamo] Fix read/write conflicts in a cuda test (#145658 ) Prior to this patch, `test_cuda_event_created_outside_of_graph` was flaky in CI because it read and wrote the same `foo` tensor buffer from 2 different streams. This patch eliminates that by adding a synchronization that waits until the read finishes before starting the write. Fixes #133837, #133828. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145658 Approved by: https://github.com/yifuwang	2025-01-27 19:55:57 +00:00
PyTorch MergeBot	c986eba560	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))	2025-01-27 19:38:26 +00:00
leslie-fang-intel	9728e900dc	[Inductor][CPP] fix torch logit decomposition (#145576 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/145379. The current decomposition uses `self = torch.clamp(self, lo, hi)`, which gives a wrong result when `lo` is larger than `hi` compared to the eager implementation: `cd68d54911/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L165)` Align their behavior in this PR. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_torch_logit ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145576 Approved by: https://github.com/jgong5, https://github.com/eellison	2025-01-27 19:37:51 +00:00
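The mismatch described in #145576 can be sketched in pure Python. Two assumptions are made here and neither is quoted from the source: `clamp_like` mirrors the documented `torch.clamp` rule that every element becomes `max` when `min > max`, and `eager_like` assumes the eager kernel applies nested comparisons of the form `x < lo ? lo : (x > hi ? hi : x)`. Both function names are hypothetical.

```python
def clamp_like(x, lo, hi):
    # Assumed model of torch.clamp(x, lo, hi): min(max(x, lo), hi).
    # When lo > hi this always yields hi.
    return min(max(x, lo), hi)


def eager_like(x, lo, hi):
    # Assumed model of the eager kernel: check the lower bound first,
    # then the upper bound.
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


# With eps = 0.7 the bounds are inverted: lo = eps is larger than hi = 1 - eps.
eps = 0.7
lo, hi = eps, 1.0 - eps
x = 0.5

print(clamp_like(x, lo, hi), eager_like(x, lo, hi))

# With ordinary bounds (lo <= hi) the two formulations agree.
print(clamp_like(x, 0.1, 0.9) == eager_like(x, 0.1, 0.9))
```

Under these assumptions the two formulations only disagree when the bounds are inverted, which matches the condition named in the commit message.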
Edward Z. Yang	635b98fa08	Add nitpick warning that aoti_torch/c/shim.h is ABI stable (#145745 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145745 Approved by: https://github.com/albanD	2025-01-27 19:25:37 +00:00
Yanbo Liang	bc377c503e	[Custom Ops] Fix f-strings in custom ops error message (#145673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145673 Approved by: https://github.com/zou3519 ghstack dependencies: #145588	2025-01-27 19:22:43 +00:00
Yanbo Liang	ec91b7720f	[Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588 ) Fixes #137033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588 Approved by: https://github.com/zou3519	2025-01-27 19:22:43 +00:00
Simon Mahns	f951d216e0	[autocast][pytorch] Support autocast for MTIA (policy) (#145666 ) Summary: Add autocast support for MTIA (policy) Reviewed By: egienvalue Differential Revision: D68604796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145666 Approved by: https://github.com/chaos5958	2025-01-27 18:26:04 +00:00
Sam Larsen	1835e1eb98	[BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145307 Approved by: https://github.com/zou3519, https://github.com/FindHao	2025-01-27 18:12:39 +00:00
Randolf Scholz	835e770bad	Use `typing.IO[bytes]` instead of `io.BytesIO` in annotations (#144994 ) Fixes #144976 Using approach ① `IO[bytes]`, but could also try with a protocol. ## Notes: - moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike` - Use `FileLike` annotation where it makes sense - made sure those functions also support `os.PathLike` - Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate. - Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`) - needed to make `torch.serialization._opener` generic to avoid LSP violations. - skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly...) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-01-27 18:08:07 +00:00
Eddie Yan	abf28982a8	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-27 18:05:23 +00:00
Nikita Shulga	30dea8429d	[MPS][BE] Use convenience methods to set args (#145736 ) It's better to call `mtl_setArgs` rather than set arguments one by one with the risk of making a typo. Also, all interactions with MTLCommandBuffer must be serialized, which is commonly done using dispatch queues Pull Request resolved: https://github.com/pytorch/pytorch/pull/145736 Approved by: https://github.com/Skylion007	2025-01-27 17:42:01 +00:00
Mikayla Gawarecki	7db20ffd68	Remove `public_allowlist` from `TestPublicBindings.test_correct_module_names` and ensure private_allowlist-ed things are actually private (#145620 ) This passes locally, also sanity checked importing these modules on [colab](https://colab.research.google.com/drive/1edynWX1mlQNZIBxtb3g81_ZeTpAqWi19?usp=sharing) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145620 Approved by: https://github.com/albanD	2025-01-27 17:30:02 +00:00
Huy Do	5d01a2874f	Increase the number of perf benchmark shards (#145534 ) Per the discussion on https://github.com/pytorch/pytorch/issues/140332#issuecomment-2610805551, this adds 2 more shards for HF, 2 more for TorchBench, and 1 more for TIMM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145534 Approved by: https://github.com/jeanschmidt	2025-01-27 16:20:42 +00:00
Nikita Shulga	639dd54ef7	[BE] Use copy_method to import all tests (#145718 ) Fewer chances for a typo when doing the imports Pull Request resolved: https://github.com/pytorch/pytorch/pull/145718 Approved by: https://github.com/dcci	2025-01-27 16:01:12 +00:00
leslie-fang-intel	2e80093306	setitem node shouldn't be deadcode eliminated (#145714 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/145697. The `operator.setitem` has been eliminated as dead code, causing a correctness issue. Mark it as impure in this PR to avoid this side effect. TestPlan ``` python -u -m pytest -s -v test/fx/test_dce_pass.py -k test_keep_setitem ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145714 Approved by: https://github.com/ezyang	2025-01-27 15:08:21 +00:00
Stefan-Alin Pahontu	0674ab7e33	solve apl dependency issue (#145215 ) According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries. When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls. However, the current APL implementation uses the libomp.dll (LLVM) variant. As a result, there are unexpected behaviors at runtime. --- For Example: ```python import torch # Create a sparse tensor # Input (Sparse Tensor): # [[0, 1], # [1, 0]] indices = torch.tensor([[0, 1], [1, 0]]) values = torch.tensor([1, 1], dtype=torch.float32) size = torch.Size([2, 2]) sparse_tensor = torch.sparse_coo_tensor(indices, values, size) # Convert sparse tensor to dense tensor dense_tensor = sparse_tensor.to_dense() # Expected Output (Dense Tensor): # [[0, 1], # [1, 0]] print("\nDense Tensor:") print(dense_tensor) ``` However, it prints unexpected outputs such as: ```python # [[0, 11], # [10, 0]] ``` The issue arises because the following code does not function as expected at runtime: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30 ```c++ // returns 1 , however since OpenMP is enabled it should return total number of threads int64_t num_threads = omp_get_num_threads(); ``` --- At runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) causes unexpected behaviour. So we changed the libraries from the `_mp` to the non-`_mp` versions and used `vcomp` for OpenMP calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215 Approved by: https://github.com/ozanMSFT, https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-27 13:02:16 +00:00
PyTorch UpdateBot	7b6029dcc2	Update slow tests (#145206 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145206 Approved by: https://github.com/pytorchbot	2025-01-27 11:40:39 +00:00
H. Vetinari	e6c1e6e20e	simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480 ) While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function. Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480 Approved by: https://github.com/ezyang	2025-01-27 07:19:42 +00:00
Jason Ansel	e90cf4abcf	[inductor] Add some typing to common.py (#145691 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145691 Approved by: https://github.com/malfet ghstack dependencies: #145690	2025-01-27 06:27:13 +00:00
Jason Ansel	ddae87f792	[inductor] Add some typing to simd.py (#145690 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145690 Approved by: https://github.com/malfet	2025-01-27 06:27:13 +00:00
Nikita Shulga	71caac2b30	[MPSInductor] Add rand support (#145705 ) Using Philox4 as PRNG Test plan (other than CI) Run ```python import torch from torch._inductor.utils import run_and_get_code from contextlib import nullcontext def foo(x): return x * torch.randn_like(x) foo_c = torch.compile(foo) x = torch.ones(100, 100, device="mps") y = foo_c(x) print(y.mean().item(), y.std().item()) for i in range(25): print(y[i].mean(), y[i].std()) ``` And observe that printed values are close to 0 and 1 TODO: Better `randint` algorithm for large ranges Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705 Approved by: https://github.com/dcci, https://github.com/jansel	2025-01-27 06:07:36 +00:00
rzou	ea141d8134	functional compiled autograd (#144707 ) This PR squashes together the following commits: https://github.com/pytorch/pytorch/pull/144115 https://github.com/pytorch/pytorch/pull/143417 https://github.com/pytorch/pytorch/pull/143405 https://github.com/pytorch/pytorch/pull/143387 https://github.com/pytorch/pytorch/pull/143304 https://github.com/pytorch/pytorch/pull/143296 This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses. For more information, please read the commit messages for each PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707 Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel	2025-01-27 05:20:56 +00:00
Edward Z. Yang	87fdadde1d	Remove FFT from stride incorrect ops (#145080 ) I gotta say, the FFT implementation is completely insane, there's gotta be a better way to do this than repeatedly inplace restriding the output tensor. Anyway, this is a faithful translation of both the MKL and cuFFT paths to Python. Fixes https://github.com/pytorch/pytorch/issues/135087 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145080 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #145530	2025-01-27 04:26:04 +00:00
Isalia20	b75afa2e2e	[MPS] cholesky implementation (#145701 ) Requested in #77764 Closed #144193 due to a lot of conflicts when rebasing Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701 Approved by: https://github.com/malfet	2025-01-27 01:53:03 +00:00
Aaron Orenstein	c6ad08357b	pickler for GraphModule (#141659 ) Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659 Approved by: https://github.com/jamesjwu	2025-01-26 19:29:13 +00:00
Arash Pakbin	f3ddc08ddc	Additional operators in operator benchmark (#145625 ) The list of added operators: add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625 Approved by: https://github.com/jeffdaily	2025-01-26 19:20:02 +00:00
PyTorch MergeBot	6a4fb4b615	Revert "Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 )" This reverts commit cb814c0b961369a7ab154c58856c730cafaa2307. Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see `5cd2b34e82/1` ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822))	2025-01-26 17:49:05 +00:00
Davide Italiano	5cd2b34e82	[inductor] Adjust test_log_fp64 to only run when float64 is supported. (#145686 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145686 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-26 15:58:19 +00:00
Yichen Yan	ed015143ef	Set RUNPATH on CUDA and XPU tests (#144305 ) #136627 fixed most of the cases where the RUNPATH of test binaries was not set correctly, with a few cases left. This PR fixes the rest. The affected binaries were found by running `auditwheel repair` on a wheel built with `BUILD_TEST=1`. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305 Approved by: https://github.com/malfet	2025-01-26 08:40:22 +00:00
Aaron Orenstein	c4523999a1	Fix incorrect type comparison (#145449 ) Summary: This change was incorrectly made as part of #145166 Differential Revision: D68536221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145449 Approved by: https://github.com/bobrenjc93	2025-01-26 04:40:26 +00:00
PyTorch MergeBot	09ae69a364	Revert "Fix type annotation of `Linear.bias` (#142326 )" This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a. Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see `73622fc5fa/1` ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))	2025-01-26 03:41:00 +00:00
wengshiy	73622fc5fa	Fix Throughputbenchmark issue (#144669 ) Fixes [144461](https://github.com/pytorch/pytorch/issues/144461) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669 Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel	2025-01-26 03:37:20 +00:00
Wu, Chunyuan	cb814c0b96	Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 ) Fixes https://github.com/pytorch/pytorch/issues/142466. Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case. Test plan: ``` python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32 python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859 Approved by: https://github.com/mingfeima, https://github.com/malfet	2025-01-26 01:56:40 +00:00
Edward Z. Yang	90448f0128	Output of nonzero is transposed, fix fake tensor (#144695 ) Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695 Approved by: https://github.com/bobrenjc93, https://github.com/albanD	2025-01-26 01:07:22 +00:00
Miroslaw Oksiucik	76bec878da	Remove unnecessary HPUHooksInterface method (#145272 ) getDefaultHPUGenerator is no longer necessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/145272 Approved by: https://github.com/ezyang	2025-01-26 01:06:34 +00:00
Nikita Shulga	3cf7874ebe	[MPS][BE] Implement bilineard2d as shader (#145581 ) That significantly improves performance and addresses a correctness problem (to the extent permitted by reducing the precision of the scale factor computation to float32). The uint8 scaling algorithm mimics the CPU/Pillow implementation `569b785371/src/libImaging/Resample.c (L306-L309)`, i.e. it uses fixed-precision integral arithmetic and rounds the results of horizontal interpolation back to integers before performing the vertical one, which yields technically less accurate results. But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when the scale factor is `1/3` or `2/3`, because those values are represented differently as floats and doubles. Changes in performance can be measured using the following script
```python
import torch
import time
import subprocess


def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
        atol = (y.cpu() - z).abs().max()
        rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
        print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    torch.mps.synchronize
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time


outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device : MPS | CPU")
print("Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8")
print(f"Outputs Match : ", " | ".join(outputs_match_list))
print(f"Average Time (us) :", " |".join(average_time_list))
```
Benchmark results before
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8
Outputs Match : True | True | True | False | True | True | True | True
Average Time (us) : 277.3 | 197.2 | 188.0 | 163.5 | 302.8 | 248.1 | 308.7 | 650.9
```
After (almost 100x perf gain):
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device : MPS | CPU
Dtype : FP32 | FP16 | BF16 | U8 | FP32 | FP16 | BF16 | U8
Outputs Match : True | True | True | True | True | True | True | True
Average Time (us) : 1.7 | 1.5 | 1.7 | 1.5 | 296.5 | 236.0 | 310.8 | 642.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581 Approved by: https://github.com/Skylion007 ghstack dependencies: #145578	2025-01-25 21:09:46 +00:00
Xuehai Pan	0afdee4c39	[dynamo] raise IndexError when inserting into a full `deque` (#139379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139379 Approved by: https://github.com/jansel	2025-01-25 18:04:49 +00:00
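For context on #139379, the eager CPython behavior that the title refers to can be shown with the standard library alone. This is a minimal sketch of `collections.deque` semantics, not of the Dynamo change itself: `insert` on a bounded deque that is already at `maxlen` raises `IndexError`, whereas `append` silently drops an element from the opposite end.

```python
from collections import deque

d = deque([1, 2, 3], maxlen=3)  # bounded deque that is already full

try:
    d.insert(0, 0)  # inserting would grow the deque past maxlen
    raised = False
except IndexError:
    raised = True

# append() on a full bounded deque discards from the other end instead.
d.append(4)

print(raised, list(d))
```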
Max Podkorytov	513f889a36	[Rocm][Inductor][CK] silence ck package not installed warning when CK backend is not used to autotune bmm (#145626 ) As titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/145626 Approved by: https://github.com/coconutruben	2025-01-25 08:44:35 +00:00
Simon Fan	c5216d2b6c	[ca] add test_reset for 2.6 release validation (#145549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145549 Approved by: https://github.com/atalman	2025-01-25 06:28:58 +00:00
Sheng Fu	bbe7f53218	Save integral tensor data for ET (#144508 ) Summary: et_replay uses random data to run operators; however, operators that use index tensors to access memory do not work with random data. They usually run into one of two exceptions: 1. Illegal memory access because an index is out of range. This has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE, which records the min/max values of index tensors. 2. Unaligned memory access, because FBGEMM ops have special requirements for the memory layout. To fix the second exception, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added so that users can specify node names, separated by commas, and ET will save the integral tensor data for those nodes. The saved data is then used in et_replay. Enable this option with care, since saving the extra data uses more space. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda Differential Revision: D67989856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508 Approved by: https://github.com/briancoutinho	2025-01-25 05:38:10 +00:00
Jason Ansel	3d506491b9	[inductor] Fix duplicate detection in _dynamic_scale_rblock (#145577 ) Before this the code was doing nothing because Config doesn't define `__hash__` or `__eq__` (so it was based on object id). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145577 Approved by: https://github.com/shunting314 ghstack dependencies: #142026	2025-01-25 04:58:54 +00:00
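The failure mode described in #145577 follows from Python's default object semantics: a class that defines neither `__eq__` nor `__hash__` compares and hashes by identity, so two instances with equal fields are never treated as duplicates. Below is a minimal sketch of that effect. `PlainConfig` and `config_key` are hypothetical names for illustration only; they are not the real `Config` class or the fix applied in `_dynamic_scale_rblock`.

```python
class PlainConfig:
    """Hypothetical stand-in for a config class without __eq__/__hash__."""

    def __init__(self, kwargs, num_warps):
        self.kwargs = kwargs
        self.num_warps = num_warps


def config_key(cfg):
    """Hypothetical helper: a hashable key built from the config's fields."""
    return (tuple(sorted(cfg.kwargs.items())), cfg.num_warps)


a = PlainConfig({"XBLOCK": 8, "RBLOCK": 64}, num_warps=4)
b = PlainConfig({"XBLOCK": 8, "RBLOCK": 64}, num_warps=4)

# Identity-based hashing: both equal-looking configs survive in the set.
naive = {a, b}

# Keying on field values keeps only the first config for each key.
seen = {}
for cfg in (a, b):
    seen.setdefault(config_key(cfg), cfg)
deduped = list(seen.values())

print(len(naive), len(deduped))
```

Keying on an explicit tuple avoids changing the class itself, which may matter when the class comes from a third-party package.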
Jason Ansel	9007eb5f8e	[inductor] Kernel memory analysis for use in heuristics (#142026 ) This computes statistics about each kernel's memory usage that should allow us to write more precise heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142026 Approved by: https://github.com/eellison	2025-01-25 04:58:54 +00:00
Yuanhao Ji	cc1ecead07	[Dynamo] Allow `format()` to handle int (#144956 ) Fixes #144830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144956 Approved by: https://github.com/jansel	2025-01-25 04:12:45 +00:00
Joel Schlosser	b2a0feac85	Update OSS nested tensor docs to focus on NJT (#145402 ) Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include: * High-level description of NST vs. NJT + a recommendation to use NJT * General NJT construction / usage * torch.compile() integration w/ dynamic shapes * Common errors and how to fix them * Contribution guide * Data layout / shape information (with diagram) * Links to more extensive tutorials involving Transformers / SDPA / FlexAttention Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402 Approved by: https://github.com/soulitzer	2025-01-25 04:08:19 +00:00
Zhenbin Lin	392dc177a9	OpenReg: Refactor impl_registry (#145465 ) Refactor impl_registry to use `driver.exec` as fallback. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145465 Approved by: https://github.com/albanD	2025-01-25 03:31:49 +00:00
Simon Mahns	6939a56e13	[autocast][pytorch] Support autocast for MTIA (#145627 ) Summary: Add autocast support to MTIA Reviewed By: egienvalue Differential Revision: D68572548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627 Approved by: https://github.com/egienvalue	2025-01-25 03:24:59 +00:00
Animesh Jain	ef60de07a0	[dynamo] Log guard latency (#145132 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132 Approved by: https://github.com/ezyang ghstack dependencies: #145509	2025-01-25 03:01:18 +00:00
Avik Chaudhuri	42b8e233d9	serde unbacked bindings (#144894 ) Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata. Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) doesn't do so, and skipping this makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names. This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because AOTAutograd runs as an additional step after Dynamo. So we have to sync those back. Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894 Approved by: https://github.com/pianpwk	2025-01-25 02:34:27 +00:00
soulitzer	5725462cd8	Update NJT linear_backward to return non-aliased tensor bias grad (#145399 ) Fixes https://github.com/pytorch/pytorch/issues/141292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145399 Approved by: https://github.com/jbschlosser ghstack dependencies: #145520, #145531, #145533	2025-01-25 00:58:04 +00:00
soulitzer	3a3e2cf90a	Remove det_singular OpInfo (#145533 ) Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044 From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this. Some older attempts: - https://github.com/pytorch/pytorch/pull/102581 - https://github.com/pytorch/pytorch/pull/109249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533 Approved by: https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #145520, #145531	2025-01-25 00:58:03 +00:00
soulitzer	c7ca1df37e	Disable slow gradcheck for nn.Transformer ModuleInfo (#145531 ) Fixes https://github.com/pytorch/pytorch/issues/117140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145531 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #145520	2025-01-25 00:58:03 +00:00
soulitzer	9e0ee152e5	Fix allow_mutation_on_saved_tensors for inplace foreach (#145520 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145520 Approved by: https://github.com/albanD	2025-01-25 00:58:03 +00:00
clr	b4fe3c159d	inductor: Explicitly test that torch.compile(option=...) does something (#145321 ) This would have prevented https://github.com/pytorch/pytorch/pull/139833 from dropping the triggers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145321 Approved by: https://github.com/jansel	2025-01-25 00:48:26 +00:00
Marc Horowitz	efebec5ef5	[dcp] Add ZStandard transformer (#143360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360 Approved by: https://github.com/saumishr, https://github.com/albanD ghstack dependencies: #145528	2025-01-25 00:14:07 +00:00
Marc Horowitz	f2ad2cdf1c	[utils] add try_import method for importing optional modules (#145528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528 Approved by: https://github.com/albanD	2025-01-25 00:14:07 +00:00
Aaron Gokaslan	f3304571fc	[BE][Ez]: FURB148 - remove useless enumerate calls (#145619 ) Remove useless enumerate calls Pull Request resolved: https://github.com/pytorch/pytorch/pull/145619 Approved by: https://github.com/drisspg	2025-01-24 23:37:15 +00:00
Wei Wang	0741963e01	[CI][CUDA][Blackwell] sm_\d\d no longer matches sm_100. (#145641 ) Therefore making it sm_\d+ Fixes this unit test failure: python test/test_cpp_extensions_jit.py -k TestCppExtensionJIT.test_jit_cuda_archflags Pull Request resolved: https://github.com/pytorch/pytorch/pull/145641 Approved by: https://github.com/eqy, https://github.com/malfet	2025-01-24 23:20:22 +00:00
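The regex change in #145641 can be illustrated with Python's `re` module. A pattern with exactly two digits cannot fully match a three-digit architecture name such as `sm_100`, while `\d+` accepts any number of digits. The architecture strings below are illustrative inputs, not the exact strings checked by `test_jit_cuda_archflags`.

```python
import re

OLD = re.compile(r"sm_\d\d")  # exactly two digits
NEW = re.compile(r"sm_\d+")   # one or more digits

for arch in ("sm_90", "sm_100", "sm_120"):
    old_ok = OLD.fullmatch(arch) is not None
    new_ok = NEW.fullmatch(arch) is not None
    print(f"{arch}: old={old_ok} new={new_ok}")

# With search(), the old pattern silently truncates the name instead.
partial = OLD.search("sm_100").group(0)
print(partial)
```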
Shangdi Yu	4cc5e880f9	Add accuracy issue support in AOTI Minifier (#145539 ) Summary: Add three more repro levels for AOTI minifier (level 2 already exists). They are the same as the existing dynamo minifier repro levels. Now AOTI minifier can minify and repro programs that have numerical accuracy issues as well. 1: Dumps the original graph out to repro.py if compilation fails 2: Dumps a minifier_launcher.py if aoti fails. 3: Always dumps a minifier_launcher.py. Good for segfaults. 4: Dumps a minifier_launcher.py if the accuracy fails. Refactor AOTI minifier unit tests to be cleaner and better re-use the existing minifier testing code. We do not need to manually patch {"aot_inductor.dump_aoti_minifier": True} to each test now, this config is generated in the test code. Differential Revision: D68294638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145539 Approved by: https://github.com/desertfire	2025-01-24 23:07:19 +00:00
zeshengzong	5b988ac4fa	[Easy] Replace paper description with link to make a concise description. (#145031 ) Description in [Transformer,](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain authors and paper details seems redundant for users who want to know how to use it, replace with a link to paper content, users can go to the paper detail if they want to learn more. Test Result Before ![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490) ![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f) ![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed) After ![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968) ![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744) ![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031 Approved by: https://github.com/mikaylagawarecki	2025-01-24 23:01:02 +00:00
Davide Italiano	57591edca1	[mps/inductor] Add support for `erfinv`. (#145643 ) After several rounds of refactoring, this seems to be done now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-24 22:55:44 +00:00
Joel Schlosser	46e06e1d09	Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588 Approved by: https://github.com/soulitzer ghstack dependencies: #144586, #144587	2025-01-24 22:45:01 +00:00
Fabian Keller	81e370fc6b	Fix type annotation of `Linear.bias` (#142326 ) Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__`, which types it as `Tensor | Module`. This has two issues: - It hides the fact that `bias` is optional and can be `None`, which in turn can hide actual bugs on the user side. - It blurs the type due to having `Module` in the union, which can require unnecessary `isinstance(linear.bias, Tensor)` checks on the user side. This PR types the `bias` attribute explicitly to fix these issues. CC @ezyang @Skylion007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326 Approved by: https://github.com/ezyang	2025-01-24 22:43:52 +00:00
Aidyn-A	70577d335e	[ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602 ) This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines. Special thanks to @Fuzzkatt for co-authoring these changes! Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602 Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>	2025-01-24 22:27:39 +00:00
Tan Hoang	5bf5ce0e15	Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478 ) Summary: Since `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler during compile time to loading it dynamically), we change the logic for enabling the profiler to use `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for NCCL Collective Communication Profiler. For HCCL, we still keep the same logic Test Plan: See https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler Differential Revision: D68515945 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478 Approved by: https://github.com/sraikund16	2025-01-24 22:21:09 +00:00
Yichen Yan	d4171b724e	Let `tensor_a.new_tensor()` be on `tensor_a.device` by default (#144958 ) Fixes #144957 Closes #73838 cc @albanD @ezyang Currently, `tensor_a.new_tensor()` returns an on-CPU tensor no matter where `tensor_a` is. This differs from the documentation and is a side effect of https://github.com/pytorch/pytorch/pull/41984. See #144957 for how the current logic breaks dynamo. This PR restores the documented behavior and adds tests for `new_tensor`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958 Approved by: https://github.com/ezyang	2025-01-24 22:12:31 +00:00
Wei Wang	2a70de7e92	[CUDA] Change slim-wheel libraries load order (#145638 ) There is no libnvjitlink in CUDA-11.x , so attempts to load it first will abort the execution and prevent the script from preloading nvrtc Fixes issues reported in https://github.com/pytorch/pytorch/pull/145614#issuecomment-2613107072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145638 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-24 22:00:56 +00:00
FEI	615bdd9c81	Improve the caching allocator test for raw alloc (#145269 ) 1 Prevent block allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests 2 Additionally, tested the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114 @jeffdaily @albanD @houseroad @eqy @aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269 Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily	2025-01-24 21:07:17 +00:00
Fernando Pérez-García	d79c6f4946	Improve torchrun documentation (#144354 ) Fixes #142042: - #142042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354 Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang	2025-01-24 20:40:05 +00:00
IvanKobzarev	caf60395f4	[torchbench] Increase tolerance for amp only poolformer_m36 (#145375 ) https://github.com/pytorch/pytorch/issues/144893 ``` python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validatio --training --amp --device cuda --backend inductor ``` `--float32` and `--bfloat16` pass the accuracy check. `--disable-cudagraph` does not change the result. accuracy_fail occurs only for `--amp`, which gives a res_error of `0.048` on a 1-element result Tensor. This fails with a tolerance of `0.01`; with the tolerance increased to 0.04 it passes. I have not reproduced "eager_two_runs_differ" on H100. I think this is a true distribution of results with `--amp`, so increasing the tolerance to 0.04 for the amp case only makes it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375 Approved by: https://github.com/desertfire	2025-01-24 19:56:21 +00:00
Aishwarya Sivaraman	457facf7e2	[caffe2] Use the manifold cache backend as the default (#144773 ) Test Plan: CI D68155591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144773 Approved by: https://github.com/izaitsevfb	2025-01-24 19:48:34 +00:00
Sam Larsen	c16866a582	[BE] mv test/inductor_skips/* to test/inductor_expected_failures/ (#145572 ) Summary: I think skipping these tests is suboptimal. If we categorize as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As a skip, they quietly continue to skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145572 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-24 19:41:38 +00:00
Edward Z. Yang	cf063d41f8	Spruce up docs for emulate_precision_casts (#145579 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145579 Approved by: https://github.com/gchanan	2025-01-24 19:28:37 +00:00
Shunting Zhang	96149a201a	[Inductor] be able to disable cache for test (#141195 ) Let TORCHINDUCTOR_FX_GRAPH_CACHE=0 be respected in unit tests. This is helpful when I want compilation to actually happen during testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that causes the generated wrapper file to be deleted, and we may want to check those files after running a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141195 Approved by: https://github.com/masnesral, https://github.com/desertfire	2025-01-24 19:15:55 +00:00
IvanKobzarev	2fd2a950e6	[torchbench] Add meta function for _cudnn_rnn_flatten_weight (#145488 ) https://github.com/pytorch/pytorch/issues/144989 This fixes tts_angular model on torchbench for `--export-aot-inductor` I put meta function in cpp, as shape calculation requires cudnn API calls. I've extracted shape calculation to be used in implementation as this logic has some non-trivial actions and comments. ``` └─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead loading model: 0it [00:01, ?it/s] WARNING:common:Model tts_angular does not support bfloat16, running with amp instead cuda eval tts_angular WARNING:common:Model tts_angular does not support bfloat16, running with amp instead pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145488 Approved by: https://github.com/eqy, https://github.com/zou3519	2025-01-24 19:08:14 +00:00
PyTorch MergeBot	ad36f4f42c	Revert "Add generator parameter to rand*_like functions (#136780 )" This reverts commit c7b2f7dd142fc97c8ce4ad7ad591687cf295fcda. Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933))	2025-01-24 19:00:21 +00:00
c8ef	a989a0b13a	[NFC] Fix some minor typos. (#145599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599 Approved by: https://github.com/Skylion007	2025-01-24 18:58:59 +00:00
Davide Italiano	6cda572c98	[mps] Hoist erfinv logic out of the kernel in preparation for moving. (#145568 ) Will be used in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145568 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-01-24 18:51:09 +00:00
Michael Lazos	8eea554332	[Dynamo] Fix names collisions with foreach decomps (#145479 ) Fixes https://github.com/pytorch/pytorch/issues/138698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145479 Approved by: https://github.com/yanboliang	2025-01-24 18:46:58 +00:00
amdfaa	e57cdb8402	[ROCm] trunk.yml only runs pre-merge via ciflow/trunk label (#145629 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145629 Approved by: https://github.com/jeffdaily	2025-01-24 18:31:33 +00:00
Bin Bao	b8087747f5	[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373 ) Differential Revision: D68278174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373 Approved by: https://github.com/Skylion007	2025-01-24 17:59:13 +00:00
Animesh Jain	74cfb4f364	[dynamo][refactor] Move collections.namedtuple out of SkipFunctionVariable (#145547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145547 Approved by: https://github.com/zou3519 ghstack dependencies: #145519	2025-01-24 17:39:33 +00:00
David Peixotto	97c0b7cb0a	Add unique identifer to bmm thread_mm functions (#145303 ) Summary: The bmm template generates code like this
```
template<bool accum>
void cpp_fused_bmm_66_micro_gemm(...) { ... }

void single_thread_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void threaded_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void cpp_fused_bmm_66(...) {
    ...
    single_thread_mm(...);
    ...
    threaded_mm(...);
    ...
}
```
The generated `fused_bmm` and `fused_bmm_microgemm` functions both have unique identifiers added to their names, but `single_thread_mm` and `threaded_mm` do not. This diff adds unique identifiers to those generated functions as well. The identifier is based on the kernel name, so for the example above we would generate a bmm template name like `cpp_fused_bmm_66_single_thread_mm()`. Differential Revision: D68364772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303 Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475	2025-01-24 17:35:50 +00:00
jainapurva	547c18ee9f	Add Torchao docs link to Pytorch libraries (#145412 ) Add Torchao docs link to the libraries section in torch docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412 Approved by: https://github.com/svekars	2025-01-24 17:11:20 +00:00
amdfaa	ce371ab4c6	[ROCm] Create inductor-rocm-mi300 (#145621 ) - Adds an mi300 inductor workflow to main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145621 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-24 17:04:17 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498 ) Summary: This PR is basically a replacement of https://github.com/pytorch/pytorch/pull/140087, which caused a perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check to the monitoring thread. If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API applies at the PG level, in contrast to the work.get_future_result() API, which applies at the Work level. Error detection at the PG level makes it much more convenient for users to handle a PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for attaching work-specific context and debugging the RC of the specific failing work/collective. Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
Animesh Jain	9132f4b7ce	[dynamo][guards] Log guard latency to tlparse (#145509 ) Example ![image](https://github.com/user-attachments/assets/1503ee59-ff35-46d9-9b61-16352a4a30e2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145509 Approved by: https://github.com/ezyang	2025-01-24 16:33:29 +00:00
Aaron Orenstein	1335882b2a	If mypy fails it should report the error back to lintrunner (#145550 ) This happened to me because I had a bad LD_LIBRARY_PATH and mypy was failing to run (.so load error) - but lintrunner was silent about the underlying problem. Differential Revision: [D68593081](https://our.internmc.facebook.com/intern/diff/D68593081) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145550 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2025-01-24 15:40:30 +00:00
min-jean-cho	7c314bfed4	[Intel GPU] Add TORCH_API macro to export symbol NestedTensor_to_mask for libtorch_xpu (#145467 ) Part of https://github.com/intel/torch-xpu-ops/issues/1141. The `TORCH_API` macro is added to export the symbol `NestedTensor_to_mask`, which is needed by libtorch_xpu for `NestedTensor_softmax_dropout_xpu`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145467 Approved by: https://github.com/guangyey, https://github.com/ezyang	2025-01-24 15:38:46 +00:00
atalman	5d24a9a274	Advance docker release latest version to cuda 12.4 (#145566 ) Fixed latest tag in ghcr.io to be the cuda 12.4 docker image. Todo: need to add this to https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD and check whether we can automate it by introducing a cuda_stable variable or something similar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145566 Approved by: https://github.com/nWEIdia, https://github.com/kit1980, https://github.com/malfet	2025-01-24 15:27:25 +00:00
Hongtao Yu	5c64aaea40	[triton] Update triton pin to include warp specialization support (#145120 ) The warp specialization work has been landed to the triton rc/3.2.x branch as `b2684bf3b0` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120 Approved by: https://github.com/bertmaher	2025-01-24 14:45:12 +00:00
Edward Z. Yang	bc62930765	Work around buggy use_const_ref_for_mutable_tensors (#145530 ) See https://github.com/pytorch/pytorch/issues/145522 for context This doesn't fix the problem with use_const_ref_for_mutable_tensors and the boxed wrapper, instead it just gets all of our out kernels off of this flag so that the mutable matching pattern works correctly. I also add a check in torchgen to prevent people from making this mistake in the future. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145530 Approved by: https://github.com/albanD, https://github.com/bdhirsh	2025-01-24 14:38:49 +00:00
PyTorch MergeBot	9d6927715f	Revert "Fix triton masked loading for non-block tl.loads (#144782 )" This reverts commit 31c2f36989e35ccf023a8e35c4bc21aca077d344. Reverted https://github.com/pytorch/pytorch/pull/144782 on behalf of https://github.com/ezyang due to This regresses compile time for one of our internal models by 20%, internal xref https://fb.workplace.com/groups/1075192433118967/posts/1591490218155850 ([comment](https://github.com/pytorch/pytorch/pull/144782#issuecomment-2612660287))	2025-01-24 14:28:48 +00:00
cyy	6a35d9aaa4	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-24 12:22:13 +00:00
Digant Desai	f08b9bc7e4	[WIP] Move XNNPACKQuantizer from PyTorch to ExecuTorch (#144940 ) Summary: This replicates XNNPACKQuantizer from PyTorch to ExecuTorch. Rationale: The main motivation is to avoid a PyTorch pin update in OSS after updating XNNPACKQuantizer, which can be rather frequent. Other impact and considerations: The PT2e flow (which lives in PyTorch) relies heavily on XNNPACKQuantizer as an "example" quantizer implementation and, more importantly, for tests. For now, we will keep torch.ao.quantization.xnnpack_quantizer as is, but mark it as not BC and deprecated to discourage new dependencies on it. Other OSS repositories using XNNPACKQuantizer from PyTorch now have to take an additional dependency on ExecuTorch. Differential Revision: D68191752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144940 Approved by: https://github.com/jerryzh168, https://github.com/mcr229	2025-01-24 10:06:07 +00:00
Oguz Ulgen	d3989ca636	Add multi env variable support to configs (#145288 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288 Approved by: https://github.com/c00w	2025-01-24 10:04:24 +00:00
Yiming Zhou	10bdd0a1cc	[BE][export] Fix hop tests with flaky memory leak (#145391 ) Summary: As title. Added `torch._dynamo.reset()` for each test This should fix several flaky tests in `test_hop.py` such as https://github.com/pytorch/pytorch/issues/139073 Test Plan: ``` PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/export/test_hop.py TestHOPCUDA.test_serialize_export_scan_simple_cuda_float32 ``` Differential Revision: D68506280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145391 Approved by: https://github.com/ydwu4	2025-01-24 09:53:21 +00:00
drisspg	72da0a8a42	[Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502 ) # Context Prototyped here: https://github.com/pytorch/pytorch/pull/144120, we are going to make flash-attention a 3rd party submodule. We will then use the c++ sources and include into our build of libtorch.so This requires various changes to work including external and internal changes. Since these require internal changes we need to co-dev and in the co-dev environment I haven't found a way to sync submodule changes + internal only changes. This is unused for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502 Approved by: https://github.com/Skylion007	2025-01-24 09:21:41 +00:00
Wei Wang	d62e900d8c	[CI][CUDA][MultiGPU][Regression] Skip a failure due to https://github.com/pytorch/pytorch/issues/139520 (#145318 ) Related: https://github.com/pytorch/pytorch/issues/139520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145318 Approved by: https://github.com/eqy	2025-01-24 06:58:05 +00:00
Wei Wang	0e98b26b28	[CI][CUDA][Dynamic Shape] xfail: DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda (#145204 ) python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda failed to generate triton kernels, causing assert failures on 2x H100 systems (and 2x Grace H100 systems). Failures like below: Finline_call [] stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] FAIL: test_linspace4_dynamic_shapes_cuda (__main__.DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda) [61/1892]---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper method(args, kwargs) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 12212, in new_test return value(self) ^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/testing.py", line 420, in _fn return fn(args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 2603, in test_linspace4 self.common(fn, (torch.Tensor([]),)) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 424, in common return check_codegen( ^^^^^^^^^^^^^^ File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 82, in check_codegen self.assertTrue("def triton" in code, f"Failed to find triton kernel\n{code}") AssertionError: False is not true : Failed to find triton kernel # AOT ID: ['0_inference'] [42/1892]from ctypes import c_void_p, c_long, c_int import torch import math import random import os import tempfile from math import inf, nan from torch._inductor.hooks import run_intermediate_hooks from torch._inductor.utils import 
maybe_profile from torch._inductor.codegen.memory_planning import _align as align from torch import device, empty_strided from torch._inductor.async_compile import AsyncCompile from torch._inductor.select_algorithm import extern_kernels from torch._inductor.codegen.multi_kernel import MultiKernelCall aten = torch.ops.aten inductor_ops = torch.ops.inductor _quantized = torch.ops._quantized assert_size_stride = torch._C._dynamo.guards.assert_size_stride empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor alloc_from_pool = torch.ops.inductor._alloc_from_pool async_compile = AsyncCompile() empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p async_compile.wait(globals()) del async_compile def call(args): with torch.cuda._DeviceGuard(1): torch.cuda.set_device(1) buf0 = empty_strided_cuda((0, ), (1, ), torch.float32) return (buf0, ) def benchmark_compiled_module(times=10, repeat=10): from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance fn = lambda: call([]) return print_performance(fn, times=times, repeat=repeat) if __name__ == "__main__": from torch._inductor.wrapper_benchmark import compiled_module_main compiled_module_main('None', benchmark_compiled_module) To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145204 Approved by: https://github.com/eellison	2025-01-24 06:57:35 +00:00
Boyuan Feng	817fd14714	[BE] Type annotation for `_inductor/dependencies.py` (#145311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145311 Approved by: https://github.com/eellison	2025-01-24 06:32:48 +00:00
Xilun Wu	2ce70da96c	[cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421 ) ## Description Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is `False` only when `requires_grad=False`, and since PP's [shape inference](`d95a6babcc/torch/distributed/pipelining/stage.py (L1387)`) happens under the `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation. ## Test - Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan - `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa` Before: <img width="1880" alt="image" src="https://github.com/user-attachments/assets/872ff583-295e-4751-a280-cf7f2d41c61a" /> After: <img width="2988" alt="image" src="https://github.com/user-attachments/assets/4bdcc2e5-22a5-427a-91a5-82206d5bd78f" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421 Approved by: https://github.com/H-Huang, https://github.com/tianyu-l	2025-01-24 06:17:54 +00:00
Animesh Jain	53fc921ce2	[dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2025-01-24 06:02:03 +00:00
atalman	9752c7c1c8	[CD] Fix slim-wheel cuda_nvrtc import problem (#145582 ) Similar fix as: https://github.com/pytorch/pytorch/pull/144816 Fixes: https://github.com/pytorch/pytorch/issues/145580 Found during testing of https://github.com/pytorch/pytorch/issues/138340 Please note both nvrtc and nvjitlink exist for cuda 11.8, 12.4 and 12.6 hence we can safely remove if statement. Preloading can apply to all supporting cuda versions. CUDA 11.8 path: ``` (.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib __init__.py __pycache__ libnvrtc-builtins.so.11.8 libnvrtc-builtins.so.12.4 libnvrtc.so.11.2 libnvrtc.so.12 (.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib __init__.py __pycache__ libnvJitLink.so.12 ``` Test with rc 2.6 and CUDA 11.8: ``` python cudnn_test.py 2.6.0+cu118 ---------------------------------------------SDPA-Flash--------------------------------------------- ALL GOOD ---------------------------------------------SDPA-CuDNN--------------------------------------------- ALL GOOD ``` Thank you @nWEIdia for discovering this issue Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582 Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-24 04:47:57 +00:00
Johnny	732c4998f3	[NVIDIA] Full Family Blackwell Support codegen (#145436 ) More references: https://github.com/NVIDIA/nccl Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436 Approved by: https://github.com/ezyang, https://github.com/drisspg	2025-01-24 04:36:00 +00:00
Nikita Shulga	c184055743	[BE] Use `value_or` in layer_norm.cpp (#145417 ) Now that we have proper optional, no need to do `if (has_value) value else default_value;` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145417 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-24 04:02:23 +00:00
Nikita Shulga	4799ebf326	[MPS][BE] Turn `bicubic2d` into generic metal template (#145578 ) In preparation for more metal shaders to come Pull Request resolved: https://github.com/pytorch/pytorch/pull/145578 Approved by: https://github.com/Skylion007	2025-01-24 04:01:23 +00:00
Avik Chaudhuri	68a1505985	serde and_ operator (#145506 ) Differential Revision: [D68565887](https://our.internmc.facebook.com/intern/diff/D68565887/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145506 Approved by: https://github.com/zhxchen17, https://github.com/Skylion007	2025-01-24 03:48:03 +00:00
albanD	29ddf9a63e	Document dispatch trace build flag (#145517 ) Ok, the build flag seems to have been broken for a while since the function it calls doesn't exist anymore. Repurposed it to enable dispatcher printing (which requires a full (and slow) debug build otherwise). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145517 Approved by: https://github.com/bdhirsh	2025-01-24 03:19:39 +00:00
Sam Larsen	a40ead1fd6	Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513 ) Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions. Fixes #141761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513 Approved by: https://github.com/eellison	2025-01-24 03:17:03 +00:00
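The cleanup policy described in #145513 (tolerate a failure to remove the temporary directory, but do not swallow unrelated exceptions) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code from the PR: `fresh_tmp_dir` is a hypothetical name, and the real `fresh_inductor_cache` does more than manage a directory.

```python
import contextlib
import logging
import shutil
import tempfile

log = logging.getLogger(__name__)


@contextlib.contextmanager
def fresh_tmp_dir():
    """Yield a new temporary directory; never fail just because cleanup failed.

    Hypothetical helper for illustration only.
    """
    path = tempfile.mkdtemp()
    try:
        # Exceptions raised by the caller's body propagate unchanged.
        yield path
    finally:
        try:
            shutil.rmtree(path)
        except OSError as exc:
            # Only the cleanup failure is tolerated (e.g. a file still held open).
            log.warning("could not remove %s: %s", path, exc)
```

Catching only `OSError` around the `rmtree` call keeps the tolerance narrow: errors raised inside the `with` body still surface to the caller.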
Henry Tsang	36fcf98db6	[cutlass backend tests] Manually clear cache, test more tests in fbcode and limit configs in some tests (#145545 ) Summary: Manually clear cache: You want to clear cache in most tests. Otherwise link command won't work and you have multiple .o files and you get something like `ld.lld: error: duplicate symbol: cuda_fused_0`. test more tests in fbcode: A few tests have been skipping in fbcode. Unskip them. limit configs in some tests: to reduce time spent on each test Differential Revision: D68584071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145545 Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler	2025-01-24 03:06:59 +00:00
Robert Hardwick	386650353b	[ARM] Fix bf32 and tf32 precision for tensordot unit test (#141136 ) Fixes unit test failure on aarch64 ( neoverse-v1 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141136 Approved by: https://github.com/malfet	2025-01-24 02:59:45 +00:00
Manuel Candales	d6bea398ac	Only include RMSNorm.h in layer_norm.cpp for MPS (#145524 ) Test Plan: CI Differential Revision: D68578213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145524 Approved by: https://github.com/malfet	2025-01-24 02:08:49 +00:00
Benjamin Glass	d5629889f1	cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910 ) Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910 Approved by: https://github.com/desertfire	2025-01-24 02:06:35 +00:00
Zhenbin Lin	47e65077b1	OpenReg: Remove REGISTER_GENERATOR_PRIVATEUSE1 (#144841 ) Replace REGISTER_GENERATOR_PRIVATEUSE1 with new API in AcceleratorHooksInterface. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144841 Approved by: https://github.com/albanD	2025-01-24 01:52:10 +00:00
Sam Larsen	cd68d54911	Inductor cache: Revamp how we handle frozen params (#143808 ) Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing): 1) When creating a cache entry, store the names of any frozen params, but the values of any other constants. 2) On a cache hit, restore the values of the frozen params by looking up in the current GM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808 Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison	2025-01-24 01:20:07 +00:00
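The two steps listed in #143808 can be sketched with plain dictionaries. This is a hypothetical illustration only: `make_cache_entry`, `restore_constants`, and the dict layout are invented for the example and are not the inductor cache's actual data structures.

```python
def make_cache_entry(constants, frozen_names):
    """Step 1: store the names of frozen params, but the values of other constants."""
    return {
        "frozen_param_names": sorted(n for n in constants if n in frozen_names),
        "constants": {n: v for n, v in constants.items() if n not in frozen_names},
    }


def restore_constants(entry, current_module_attrs):
    """Step 2: on a cache hit, look the frozen params up in the current module."""
    restored = dict(entry["constants"])
    for name in entry["frozen_param_names"]:
        restored[name] = current_module_attrs[name]
    return restored
```

In this sketch, constants that are not frozen params travel with the cache entry, so a cache hit does not depend on finding them in the current module.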
zeshengzong	54e2f4b201	Fix lerp weight type promotion (#141117 ) Fixes #140601 Enable `promote_inputs_to_common_dtype` when the tensors do not share a dtype when invoking `lerp`. For `lerp_Tensor` - Check whether the tensors have the same `dtype`; enable promotion if not - Remove the type check assert For `lerp_Scalar` - `promote_inputs_to_common_dtype` already seems to be enabled by default, so just remove the type check. Make sure the promotion behavior is consistent with `lerp_Tensor` `lerp_Scalar` gets its TensorIteratorConfig from here `c37185c76a/aten/src/ATen/TensorIterator.cpp (L979-L985)` Test Result Test case in issue passed
```python
>>> import torch
>>>
>>> x = torch.ones(2, 2, dtype=torch.float64)
>>> w = torch.ones(2, 2, dtype=torch.float64)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)
>>> x = torch.ones(2, 2, dtype=torch.float16)
>>> w = torch.ones(2, 2, dtype=torch.float16)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float16)
```
```bash
$ pytest test/test_binary_ufuncs.py -k 'test_lerp_tensor_type_promotion or test_lerp_scalar_type_promotion'
```
![image](https://github.com/user-attachments/assets/288a5294-a9ee-47f3-bbf7-d4ff986f3ba8)
```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d469836f-5c49-4d89-a2fd-379cad4db3af) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141117 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-24 01:18:20 +00:00
David Berard	b2c89bc115	[inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348 ) Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits. What this PR fixes: * in triton_kernel_wrap.py, AST->TTIR parsing was to be updated for the new triton API * ir.py - don't remove None args when using newer triton versions * wrapper.py - update signature & constant handling What this doesn't fix: * correct None handling - I want to do a closer look at constant handling (including None, equal_to_1, and other constants). * cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels) test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: `1374074098`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348 Approved by: https://github.com/jansel ghstack dependencies: #145051	2025-01-24 00:34:01 +00:00
David Berard	b963ab5325	[inductor][1/N] triton support post-#5512, main components (#145051 ) Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed. The main changes in 5220 and 5512 that need to be supported: * AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now. * "signature" changes: the signature now contains _all_ args, including constexpr and constant args. * ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants". What this PR supports: * Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025 * (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling. What this PR doesn't support (TODO in follow-up PRs): * Triton versions between Dec 9, 2024 and before Jan 1, 2025 * (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch) * (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation) * (triton jan 1+) AOTI / cpp wrapper thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051 Approved by: https://github.com/jansel	2025-01-24 00:34:01 +00:00
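The raw-dict format described in #145051 can be written out directly. The helper below is a hypothetical illustration (`divisibility_attrs` is not a Triton or PyTorch API); it only builds a dictionary in the shape the commit message describes.

```python
def divisibility_attrs(arg_indices, divisor=16):
    """Build {(TUPLES): [["tt.divisibility", divisor]]} for the given arg positions.

    Hypothetical helper for illustration only.
    """
    # Each index is wrapped in its own tuple, which the message says leaves
    # room for nested inputs (e.g. an argument that is itself a tuple).
    key = tuple((i,) for i in arg_indices)
    return {key: [["tt.divisibility", divisor]]}


attrs = divisibility_attrs([0, 1, 3])
print(attrs)  # {((0,), (1,), (3,)): [['tt.divisibility', 16]]}
```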
PyTorch MergeBot	714f64329b	Revert "Add multi env variable support to configs (#145288 )" This reverts commit a8b7cb6a2ddbba4924b6b2531f1ecd2f5ed6d512. Reverted https://github.com/pytorch/pytorch/pull/145288 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint from a landrace with some recent PEP585 changes ([comment](https://github.com/pytorch/pytorch/pull/145288#issuecomment-2611278428))	2025-01-24 00:20:00 +00:00
PyTorch MergeBot	6a2b4db0a1	Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 )" This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79. Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))	2025-01-24 00:17:34 +00:00
PyTorch MergeBot	6f60c65a3a	Revert "[dynamo] Log guard latency (#145132 )" This reverts commit 0a310d738819ae000f49b32298305724117634c2. Reverted https://github.com/pytorch/pytorch/pull/145132 on behalf of https://github.com/anijain2305 due to CI failures observed after PR was merged ([comment](https://github.com/pytorch/pytorch/pull/145132#issuecomment-2611268421))	2025-01-24 00:11:50 +00:00
Davide Italiano	f0e9f87a9b	[BE/mps] Mark input args as `constant` to prevent incorrect usage. (#145535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145535 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-24 00:11:44 +00:00
Howard Huang	6aaae9d78f	Make torchelastic etcd rendezvous publicly importable (#145396 ) Make torchelastic publicly importable by raising the etcd import error lazily, [BE task, row 7](https://docs.google.com/spreadsheets/d/1TtATnLJf1rVXaBQd3X3yYqm9xNN9BIWG7QqRgrFiRRI/edit?gid=1748512924#gid=1748512924) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145396 Approved by: https://github.com/albanD ghstack dependencies: #145387	2025-01-23 23:56:45 +00:00
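Deferring an import error until the optional dependency is actually used is a common Python pattern. A minimal sketch, assuming nothing about the code in #145396: `optional_import` and `require` are hypothetical helper names.

```python
import importlib


def optional_import(name):
    """Return the module, or None if it is not installed; never raises at import time."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(module, name):
    """Raise only at the point where the optional dependency is actually needed."""
    if module is None:
        raise ModuleNotFoundError(
            f"{name} is required for this feature but is not installed"
        )
    return module
```

With this shape, importing the package always succeeds, and only code paths that call `require` fail when the dependency is missing.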
Chirag Pandya	f8a4f16634	[c10d] fix memory leak on shutdown (#145507 ) Summary: Fix memory leak on shutdown when socket is closed. We still need to free the buffer to make valgrind happy. Test Plan: Use `mtiavm`. Repro steps provided by cristianlume. on window 1: ``` vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 ``` on window 2: ``` vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1 ``` without the fix: ``` ==8766==ERROR: LeakSanitizer: detected memory leaks ``` With fix, no leak Differential Revision: D68566104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507 Approved by: https://github.com/XilunWu, https://github.com/d4l3k	2025-01-23 23:36:15 +00:00
PyTorch MergeBot	6dd8283381	Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296 )" This reverts commit 5531fafffefc45cd894040b2b07b0d5227430082. Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	c3fadacf84	Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304 )" This reverts commit 8c7c5f7bfcbc55638a0e4aed6eaa27f6194dbebe. Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	9553301ade	Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387 )" This reverts commit 784bb2127ca9729c646f1650ecc2cf946a583da8. Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	16c4f8c395	Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405 )" This reverts commit ec820fe57c2d6a2847569a107856e7fcff87dc5c. Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:13 +00:00
PyTorch MergeBot	3f6cfd0156	Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417 )" This reverts commit 99dd1bf1b93bc26080e611af54497a73a618e02a. Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:12 +00:00
PyTorch MergeBot	ab082863a1	Revert "[compiled autograd] support Tensor Subclasses in AOTBackward (#144115 )" This reverts commit 082c28c3c655984ce65c13336cff822db95ee470. Reverted https://github.com/pytorch/pytorch/pull/144115 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))	2025-01-23 23:34:12 +00:00
Animesh Jain	0a310d7388	[dynamo] Log guard latency (#145132 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132 Approved by: https://github.com/ezyang ghstack dependencies: #145351, #145420	2025-01-23 23:30:07 +00:00
PyTorch MergeBot	bf62222d81	Revert "[compiled_autograd] Rename interface to pyinterface (#145495 )" This reverts commit e1407f5aeb658c8c959d33158f465e975799a3d0. Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))	2025-01-23 23:07:17 +00:00
Oguz Ulgen	a8b7cb6a2d	Add multi env variable support to configs (#145288 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288 Approved by: https://github.com/c00w	2025-01-23 23:00:23 +00:00
PyTorch MergeBot	dad9bc3461	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))	2025-01-23 22:59:25 +00:00
cyy	42f4fda2eb	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-23 22:47:18 +00:00
bobrenjc93	6f07847efe	Bail on checking internal overlap when dealing with unbacked symints (#145385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145385 Approved by: https://github.com/ezyang	2025-01-23 22:31:31 +00:00
Richard Zou	e1407f5aeb	[compiled_autograd] Rename interface to pyinterface (#145495 ) Summary: interface is a reserved word in some MSVC variants. Test Plan: build Differential Revision: D68561379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495 Approved by: https://github.com/xmfan	2025-01-23 21:40:59 +00:00
Shangdi Yu	302b07f166	Implement deepcopy for AOTICompiledModel (#145423 ) Summary: Fix https://github.com/pytorch/pytorch/issues/145411 Support deepcopying AOTICompiledModel. The `loader` is shallow copied. Test Plan: ``` buck2 run fbcode//mode/opt //caffe2/test/inductor:aot_inductor_package -- -r deepcopy ``` Differential Revision: D68524673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145423 Approved by: https://github.com/desertfire	2025-01-23 21:05:30 +00:00
Davide Italiano	e924ddbef1	[BE] [mps] Refactor UnaryConstants to be its own kernel. (#145230 ) In preparation for using this file for inductor (for erfinv). Pull Request resolved: https://github.com/pytorch/pytorch/pull/145230 Approved by: https://github.com/malfet	2025-01-23 20:58:43 +00:00
Daulet Askarov	881eb86692	Fix staging for CPU tensors in OSS DCP async_save (#145408 ) Summary: As found in https://github.com/pytorch/pytorch/issues/144657 for CPU tensors we accidentally skip copying during staging due to using offload to cpu helper, which does a no-op for CPU tensors. This means that if the trainer changes the original source CPU tensor value after launch async save but before the actual writing/uploading to the destination commences, the writing/uploading logic will accidentally pick up the latest state of the tensor, while it should have dealt with its own dedicated copy saved earlier. Dropping _offload_state_dict_to_cpu in favor of _copy_state_dict fixes this bug. Test Plan: Running the user script from the linked GitHub issue verifies the fix:
```
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        return self.layer(x)


os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"

dist.init_process_group()

model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")

try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)

        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )

    future.result()

    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
passes all asserts with this fix. If the script is run in trunk, confirmed that it fails the first assert. Differential Revision: D68518689	2025-01-23 12:49:26 -08:00
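The bug described above comes down to staging by reference versus staging by copy. The following sketch reproduces the effect with plain Python lists instead of tensors; `stage_by_reference` and `stage_by_copy` are hypothetical helpers that only mimic the behavior the commit message attributes to a no-op offload and to a real copy. They are not the DCP functions themselves.

```python
import copy


def stage_by_reference(state):
    # Mimics a no-op "offload": the staged state aliases the live state.
    return state


def stage_by_copy(state):
    # Mimics a real staging copy: the staged state is independent.
    return copy.deepcopy(state)


def value_seen_by_writer(stage):
    state = {"weight": [10.0]}
    staged = stage(state)        # async save is launched here
    state["weight"][0] = 20.0    # trainer mutates before the write happens
    return staged["weight"][0]   # what the deferred writer would persist


print(value_seen_by_writer(stage_by_reference))  # 20.0: picks up the later update
print(value_seen_by_writer(stage_by_copy))       # 10.0: the value at save time
```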
Bin Bao	6a44a61514	[BE] Bump TIMM pin (#145320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145320 Approved by: https://github.com/Skylion007	2025-01-23 20:44:26 +00:00
Pian Pawakapan	99367ecbed	[draft export] count how many times a data-dep error shows up (#145030 ) Summary: maybe this is helpful? Test Plan: draft_export Differential Revision: D68303934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145030 Approved by: https://github.com/angelayi	2025-01-23 20:27:31 +00:00
Aaron Gokaslan	5ebca3015d	[BE]: Simplify set add with set update (#145152 ) Simplifies the set update slightly to be more readable and efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-01-23 20:18:13 +00:00
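The commit above does not show the changed code, so the following is only a generic illustration of the refactor its title names: replacing a loop of `set.add` calls with a single `set.update`. The function names are made up for the example.

```python
def collect_with_add(groups):
    seen = set()
    for group in groups:
        for item in group:
            seen.add(item)  # one call per element
    return seen


def collect_with_update(groups):
    seen = set()
    for group in groups:
        seen.update(group)  # one call per iterable
    return seen


groups = [[1, 2], [2, 3], []]
print(collect_with_add(groups) == collect_with_update(groups))  # True
```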
PyTorch MergeBot	d7b6746470	Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347 )" This reverts commit c27dd9cf72265161f85a18c0b19f365097f7a1ac. Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))	2025-01-23 20:06:07 +00:00
Pian Pawakapan	d53f2067fe	[BE][export] add "+export" logging to de/serialization (#145283 ) adds de/serialization debug logging to `TORCH_LOGS="+dynamic"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145283 Approved by: https://github.com/ydwu4, https://github.com/angelayi	2025-01-23 19:47:48 +00:00
PyTorch MergeBot	ce4a097bf7	Revert "Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 )" This reverts commit 55084443cabbaf6c28d8c546d8988cf3ed0f3d1c. Reverted https://github.com/pytorch/pytorch/pull/144829 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/144829#issuecomment-2610855579))	2025-01-23 19:37:54 +00:00
iremyux	527101fa95	Move Windows arm64 scripts from pytorch/builder (#144317 ) This PR moves the Windows Arm64 scripts from the builder repository to the main repository. The corresponding PR to pytorch/builder that removes them is here : https://github.com/pytorch/builder/pull/2058 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144317 Approved by: https://github.com/Skylion007, https://github.com/seemethere Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-23 19:29:29 +00:00
Irem Yuksel	66bf7da446	Enable sleef for Win Arm64 (#144876 ) Sleef module was disabled for Windows Arm64 on `b021486405` This PR enables it again since the issue is no longer valid. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876 Approved by: https://github.com/albanD, https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-23 19:22:58 +00:00
Xu Zhao	991a4b5925	[dynamo] Add `--profile-details` and `--export-perfdoctor` option (#144751 ) Summary: Add `--profile-details` option to add shapes and other details to the Kineto profile. Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz Differential Revision: D68134547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751 Approved by: https://github.com/drisspg	2025-01-23 19:09:40 +00:00
Renato Arantes	5b37249259	Enable fp16 linear layers in PyTorch via ACL (#144992 ) This pull request aims to enable the use of linear layers with the fp16 data type through the ACL. On a Graviton3 instance running with 16 threads, `torch.randn(2048, 4096, dtype=torch.half)` will take 50+% less time to complete compared with `torch.randn(2048, 4096, dtype=torch.float32)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144992 Approved by: https://github.com/ng-05, https://github.com/digantdesai, https://github.com/malfet	2025-01-23 19:07:54 +00:00
Yang Wang	6d4f5f7688	[Utilization][Usage Log] Add data model for record (#145114 ) Add data model for consistency and data model change in the future. The data model will be used during the post-test-process pipeline Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114 Approved by: https://github.com/huydhn	2025-01-23 19:04:41 +00:00
Joona Havukainen	2f317bbdbc	Missing autorelease in lstm_mps caused a ton of leaked memory (#145503 ) The dictionary held onto the new MPSGraphTensorData objects and MPSNDArrays. Regression caused by https://github.com/pytorch/pytorch/pull/95137 Fixes #145374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145503 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-23 18:54:30 +00:00
Nikhil Gupta	41b38f755c	Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392 )" (#145505 ) https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue. 1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392 )and Fixes KleidiAI mirror issue. 2. KleidiAI is now cloned from github mirror instead of arm gitlab Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2 Fixes https://github.com/pytorch/pytorch/issues/145273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505 Approved by: https://github.com/malfet	2025-01-23 18:50:59 +00:00
Simon Fan	34b8d8b0c0	update compile time benchmarks to dump compile times to stdout and csv (#145447 )
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```
```python
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447 Approved by: https://github.com/ezyang	2025-01-23 18:49:19 +00:00
Boyuan Feng	629fb1590c	[BE] Type annotate pad_mm.py (#145409 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145409 Approved by: https://github.com/Skylion007	2025-01-23 18:34:24 +00:00
Animesh Jain	015c6d6fdb	[dynamo][guards] Turn on profiling of guard manager (#145420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420 Approved by: https://github.com/ezyang ghstack dependencies: #145351	2025-01-23 18:17:43 +00:00
Zheng, Zhaoqiong	fef92c9447	Fix IndentationError of code example (#145251 ) I found there is an IndentationError when trying to copy and paste the example of inference with torch.compile. This PR fixes the format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251 Approved by: https://github.com/mikaylagawarecki Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-23 18:17:11 +00:00
Boyuan Feng	9a5bc7b6dd	[BE] Type annotate metrics.py (#145418 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145418 Approved by: https://github.com/Skylion007	2025-01-23 18:13:59 +00:00
Yidi Wu	bdc2c2a237	[be] fix flaky test aot_export_ cond caused by free symbol lifting and automatic dynamic shape (#145330 ) Fixes https://github.com/pytorch/pytorch/issues/139998#issuecomment-2605908426. It seems to be an issue caused by the interaction between dynamoed hop X automatic dynamic shape X auto_lift_free symbols. The immediate error is that the assertExpectedInline of the graph can sometimes be different, e.g. see https://hud.pytorch.org/flakytest?name=test_aot_export_with_torch_cond&suite=TestAOTExport&limit=100, where sometimes the shapes are lifted as input to the cond and sometimes they're not. The root cause of the flakiness is that the two invocations of torch.cond trigger two torch.compile calls on the same code object ([code](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/cond.py#L192)), which triggers automatic dynamic shape because in test_aot_export_with_torch_cond, x has shape (3, 4) while the pre_dispatch one has shape (2, 2). Because we auto-lift free symbols for dynamic-shaped input, cond sometimes has the shape as arguments and sometimes does not. This PR adds a simple fix by adding a _dynamo.reset before each torch.cond test. This fixes the error by not triggering automatic dynamic shape. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145330 Approved by: https://github.com/zou3519	2025-01-23 18:12:58 +00:00
Yidi Wu	3c247ee8c4	[hop][be] add utils for more comprehensive input alias and mutation (#145298 ) This PR implements the idea from @zou3519 of checking input mutations through the tensor version and checking aliasing via storage. Previously, we relied on whether there is an in-place op that takes a placeholder input, which doesn't take views into account. When writing the PR, I also noticed a bug in the previous input mutation checking logic: we were checking whether there are mutating operators in functionalized_f, where all the mutating ops have already been replaced, so we would not be able to detect anything. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298 Approved by: https://github.com/zou3519	2025-01-23 18:12:28 +00:00
Manuel Candales	b0f3597133	Add fused rms_norm implementation for MPS backend (#145301 ) Adding a fused rms_norm implementation for MPS backend. This eliminates most of the current CPU overhead, making this computation GPU bound and improving latency of rms_norm by 30x-40x on MPS backend. The metal shader was adapted from MLX: `e6a7ab9675/mlx/backend/metal/kernels/rms_norm.metal` The numbers below are averages over 1000 runs of RMSNorm, obtained on an M1 Pro.
Benchmarking Results (Before):
```
Device            :           MPS           |           CPU
Dtype             :   FP32 |   FP16 |   BF16 |   FP32 |   FP16 |   BF16
Outputs Match     :   True |   True |   True |   True |   True |   True
Average Time (us) :  140.5 |  171.0 |  170.4 |   10.9 |   13.3 |   13.5
```
Benchmarking Results (After):
```
Device            :           MPS           |           CPU
Dtype             :   FP32 |   FP16 |   BF16 |   FP32 |   FP16 |   BF16
Outputs Match     :   True |   True |   True |   True |   True |   True
Average Time (us) :    4.0 |    3.9 |    3.9 |   10.0 |   12.4 |   13.0
```
Profiling Results (Before):
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::rms_norm         2.35%       3.284ms       100.00%     140.038ms     140.038us          1000
                    aten::mul        33.61%      47.068ms        33.61%      47.068ms      23.534us          2000
                    aten::pow        17.04%      23.868ms        17.43%      24.402ms      24.402us          1000
                   aten::add_        16.52%      23.130ms        16.78%      23.497ms      23.497us          1000
                   aten::mean        15.82%      22.151ms        15.82%      22.151ms      22.151us          1000
                  aten::rsqrt        13.63%      19.085ms        13.71%      19.198ms      19.198us          1000
                   aten::item         0.46%     639.370us         0.56%     788.376us       0.394us          2000
                aten::type_as         0.21%     295.507us         0.27%     371.291us       0.371us          1000
                     aten::to         0.13%     177.742us         0.13%     177.742us       0.059us          3000
    aten::_local_scalar_dense         0.11%     149.006us         0.11%     149.006us       0.075us          2000
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 140.038ms
```
Profiling Results (After):
```
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
         aten::rms_norm        63.21%     832.875us       100.00%       1.318ms       1.318us          1000
       aten::empty_like        16.06%     211.631us        36.79%     484.681us       0.485us          1000
    aten::empty_strided        20.72%     273.050us        20.72%     273.050us       0.273us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.318ms
```
Benchmarking and profiling script:
```python
import torch
import torch.nn as nn
from torch.profiler import profile
import time

def benchmark(device, dtype):
    model = nn.RMSNorm(2048, device=device)

    # Create example inputs
    x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)
    w = torch.randn(2048, requires_grad=False, device=device, dtype=dtype)
    eps = 1e-5

    # Check output
    y = torch.ops.aten.rms_norm(x, [2048], w, eps)
    z = torch.ops.aten.rms_norm(x.cpu(), [2048], w.cpu(), eps)
    outputs_match = torch.allclose(y.cpu(), z)

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        with torch.no_grad():
            y = model(x)
    # NOTE: written without parentheses in the commit message, so this line
    # only references the function and does not call it.
    torch.mps.synchronize
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return outputs_match, average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

print("\nBenchmarking Results:")
print("---------------------")
print("Device : MPS | CPU")
print("Dtype : FP32 | FP16 | BF16 | FP32 | FP16 | BF16")
print(f"Outputs Match : ", " | ".join(outputs_match_list))
print(f"Average Time (us) :", " |".join(average_time_list))

device = "mps"
dtype = torch.float32
model = nn.RMSNorm(2048, device=device)
x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)

# Run and profile the model
with profile() as prof:
    with torch.no_grad():
        for _ in range(1000):
            y = model(x)
    # NOTE: same un-called reference as above, kept as in the commit message.
    torch.mps.synchronize

# Print profiling results
print("\n\nProfiling Results (MPS/FP32):")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145301 Approved by: https://github.com/malfet	2025-01-23 18:07:10 +00:00
Ryan Guo	a86fa779ce	[BE] Fix edge case in translation validation bisector (#145414 ) This patch fixes a small bug for the binary-search algorithm in translation validation bisector. Fixes #131303. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145414 Approved by: https://github.com/ysiraichi, https://github.com/zou3519	2025-01-23 17:35:28 +00:00
Sam Larsen	045698653a	[BE] Remove test_ops_gradients from FIXME_inductor_dont_reset_dynamo (#145308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145308 Approved by: https://github.com/zou3519 ghstack dependencies: #145306	2025-01-23 17:25:04 +00:00
Bartlomiej Stemborowski	3a8d3785f7	[ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482 ) PR#141153 exposed the option to collect sizes as dynamic. After this change, the function set_autograd_compiler returns PyTuple object which is populated using PyTuple_SET_ITEM function. Yet, that function steals reference to the object and doesn't INCREF it. So currently we are missing INCREF on prior_compiler when it is Py_None and INCREF on prior_dynamic which is either Py_False or Py_True. This bug may lead to the possible memory corruption. @xmfan @jansel @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482 Approved by: https://github.com/albanD, https://github.com/jansel	2025-01-23 17:13:56 +00:00
drisspg	c6707734de	Enable non power of 2 head_dim for FlexAttention (#133495 ) # Summary - Adds support for non-power of 2 headdim by launching blocks w/ head_dim rounded to the next valid power. - Other option I considered was building up the final dot_products with smaller blocks (this would probably work but for sake of code complexity going with this option for now) ### Corollary We had a bug in our backwards kernel where we were using index_k instead of index_v. This should have shown up for the qk_head_dim != v_head_dim cases.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495 Approved by: https://github.com/Chillee	2025-01-23 17:05:38 +00:00
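The summary above describes launching blocks with `head_dim` rounded up to the next valid power. As a rough illustration of that rounding step only, assuming "valid power" means the next power of two, a sketch could look like this; `next_power_of_2` here is a hypothetical helper written for the example and is not necessarily the one the kernel uses.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


for head_dim in (64, 96, 100, 128):
    block = next_power_of_2(head_dim)
    # Lanes beyond head_dim would have to be masked out by the kernel.
    print(head_dim, "->", block, "padding lanes:", block - head_dim)
```

The trade-off named in the commit is simplicity over efficiency: a head_dim of 96 runs with 128-wide blocks, so a quarter of each block is masked padding.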
Howard Huang	bf4f8919df	Fix test_modules_can_be_imported (#145387 ) `test_modules_can_be_imported` test is currently failing due to a few missing private modules and this PR gets it working before I start to clean up the public allow list Pull Request resolved: https://github.com/pytorch/pytorch/pull/145387 Approved by: https://github.com/albanD	2025-01-23 16:03:00 +00:00
PyTorch MergeBot	768ad0886f	Revert "Binary upload checksum (#144887 )" This reverts commit 2efa98d69d362e4ee6f15938ec8ded30bf5c40fd. Reverted https://github.com/pytorch/pytorch/pull/144887 on behalf of https://github.com/atalman due to Broke nightly index ([comment](https://github.com/pytorch/pytorch/pull/144887#issuecomment-2610066277))	2025-01-23 15:10:42 +00:00
Wang, Chuanqi	0802e78315	[CD] Disable Kineto for XPU Windows CD (#145255 ) Due to issue #145155, disable Kineto for XPU Windows CD temporarily. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145255 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-01-23 14:09:52 +00:00
Aaron Orenstein	629840e038	Backout PEP585 use of Iterable (#145438 ) Summary: Importing Iterable from collections.abc here causes an internal product to fail MRO discovery causing a collision between Iterable and Generic. This fixes the failure on D68461304 Differential Revision: D68531443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438 Approved by: https://github.com/izaitsevfb	2025-01-23 11:45:37 +00:00
cyy	29f52e3972	[2/N] Remove unnecessary once flag usage (#145057 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057 Approved by: https://github.com/albanD	2025-01-23 09:48:46 +00:00
Shunting Zhang	b6941d4e42	[inductor] fix autotuning memory usage (#145410 ) We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` will need to allocate a temporary tensor on GPU to store a contiguous copy of `gpu_tensor`: `6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)` Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB peak memory usage. Note that, with all of the following efforts (1. donated buffer, 2. inplace padding, 3. this PR), we save 3GB peak memory (18.6GB -> 15.5GB) for the GPT2 model with torch.compile. The peak memory of GPT2 has a '...\_M\_...' shape. There are 2 places where we reach the peak. Donated buffer removes the first peak by computing grad_softmax inplace, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding. Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile. With 1 & 2, the peak memory is 17.7GB with a cold cache and 15.5GB with a warm cache (since the autotuning overhead is skipped). With 1 & 2 & 3, we save 3GB peak memory (18.6GB -> 15.5GB) whether or not autotuning happens. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410 Approved by: https://github.com/masnesral, https://github.com/jansel ghstack dependencies: #140249, #145325	2025-01-23 09:34:23 +00:00
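Whether the temporary GPU buffer mentioned above is needed depends on the source tensor being contiguous. A small sketch of a row-major contiguity check on sizes and strides, written in plain Python as an illustration, follows. It is a simplified assumption of the rule (dimensions of size 1 are ignored, empty tensors are not handled) and is not PyTorch's own implementation.

```python
def is_row_major_contiguous(size, stride):
    """True if the strides describe a dense row-major layout for the sizes."""
    expected = 1
    for dim_size, dim_stride in zip(reversed(size), reversed(stride)):
        if dim_size == 1:
            continue  # stride of a size-1 dimension never matters
        if dim_stride != expected:
            return False
        expected *= dim_size
    return True


print(is_row_major_contiguous([2048, 2048], [2048, 1]))  # True
print(is_row_major_contiguous([2048, 2047], [2048, 1]))  # False: rows are padded
```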
amathewc	638903aeee	Adapt Dynamo tests to HPUs using instantiate_device_type_tests (#144387 ) MOTIVATION We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices. Other accelerators can also extend the functionality by adding the device in the devices list. ( For eg: xpu ) CHANGES Create a separate class for test functions running on CUDA devices Extend the functionality of these tests to include HPUs Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices Previously we had submitted some changes in https://github.com/pytorch/pytorch/pull/140131 . However, deleted that PR due to merge conflicts and other issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387 Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey	2025-01-23 09:24:42 +00:00
Shunting Zhang	d3f196909d	[inductor] let inplace-padding support cpp-wrapper (#145325 ) Some context: Inplace padding is an optimization to do padding in place. E.g., suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element to the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding inplace. This saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor. It only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and cause OOB memory access (although I never saw this happen, maybe due to overallocation in the CUDACachingAllocation, this should better be fixed). The fix is that when we allocate the tensor, instead of doing something like:
```
buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do some small overallocation
```
buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
cpp_wrapper needs special handling since memory allocation goes through a different code path than the python wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325 Approved by: https://github.com/desertfire, https://github.com/jansel ghstack dependencies: #140249	2025-01-23 09:22:38 +00:00
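The one-element shortfall described above can be checked with a little arithmetic. The sketch below computes how many storage elements a strided layout needs, assuming the usual rule that the last addressable offset is the sum of `(size - 1) * stride` over all dimensions; `required_elements` is a hypothetical helper for illustration, not an inductor function.

```python
def required_elements(size, stride):
    """Minimum number of storage elements a strided view can touch."""
    if any(s == 0 for s in size):
        return 0
    return 1 + sum((s - 1) * st for s, st in zip(size, stride))


original = required_elements([2048, 2047], [2048, 1])
padded = required_elements([2048, 2048], [2048, 1])
print(original)           # 4194303, i.e. 2048 * 2048 - 1
print(padded)             # 4194304, i.e. 2048 * 2048
print(padded - original)  # 1: the last row is one element short
```

Allocating with size [2048, 2048] and then viewing as [2048, 2047], as the commit does, guarantees the extra element exists before any in-place padding writes to it.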
Justin Chu	f52901a0a7	[ONNX] Remove LegacyDynamoStrategy (#145442 ) It's legacy. So remove. Shouldn't affect anything and will facilitate cleaning up our legacy code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145442 Approved by: https://github.com/titaiwangms	2025-01-23 07:56:04 +00:00
Sam Larsen	28c251dd0b	[BE] Remove test_modules from FIXME_inductor_dont_reset_dynamo (#145306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145306 Approved by: https://github.com/zou3519	2025-01-23 06:37:21 +00:00
Davide Italiano	f56c638849	[c10/metal] Add a vectype variant for `short`/`int`/`long` (#145430 ) Some of the kernels (exp_complex/atan_complex) need the specialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-23 04:52:56 +00:00
Animesh Jain	c58198184b	[dynamo][dicts] Insert LENGTH guard on an if condition on dict (#145432 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145432 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-23 04:40:56 +00:00
Andy Lugo	faa10faa2c	[ROCm] CK SDPA - Move arch check to CK patch (#144777 ) __gfxXXX__ should only be visible by device code. Move the check to the ck kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/144777 Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell, https://github.com/jianyuh	2025-01-23 04:12:25 +00:00
Chirag Pandya	5e6451ea78	[c10] catch c10 error and log message (#145413 ) Summary: Explicitly catch c10 error and log the error message only. The standard exception `e.what()` below ends up logging the stack trace that is confusing users. See S477887 for details. Test Plan: tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB Down: 4.7GiB (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets. Remaining 0/3024 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining 0/31483 1481904 actions, 1719048 artifacts declared
Executing actions. Remaining 0/250391 77:11:29.7s exec time total
Command: test. Finished 2031 local, 45445 remote, 51473 cache (52% hit) 20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D68516080 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413 Approved by: https://github.com/fduwjj	2025-01-23 03:45:47 +00:00
Yu, Guangye	719938c77f	Generalize pin memory logic for accelerator when non blocking copy happened (#143783 ) # Motivation fix https://github.com/pytorch/pytorch/issues/143641 Generalize pin memory logic for accelerator when non-blocking copy happened. Each accelerator has its implementation on `empty_strided`. The accelerator which doesn't have pin memory mechanism could ignore or mimic when pin_out is True. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143783 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #144959	2025-01-23 03:43:05 +00:00
Yu, Guangye	28b6430823	Introduce a new API isAcceleratorExcluded (#144959 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144959 Approved by: https://github.com/albanD	2025-01-23 03:43:05 +00:00
Animesh Jain	5a18f1e1eb	[dynamo] Support fx map_aggregate (#145351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351 Approved by: https://github.com/zou3519	2025-01-23 03:19:30 +00:00
PyTorch MergeBot	d95a6babcc	Revert "Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 )" This reverts commit 0bff37788043626ee5e472389f88cbbbf7add997. Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019))	2025-01-23 01:10:31 +00:00
albanD	0d28188cc8	Move privateuse1 test out of test_utils and make them serial (#145380 ) Fixes https://github.com/pytorch/pytorch/issues/132720 The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-01-23 00:31:39 +00:00
amdfaa	c9e12d6a3b	[ROCm] Update rocm.yml and add rocm-mi300.yml (#145398 ) - Added another workflow to run the mi300 jobs post-merge. - Updated rocm.yml to use mi200s instead of mi300s. - Required to get an idea of how PRs are landing on our mi200s and mi300s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-23 00:07:50 +00:00
Wenqin Yang	1e32842324	Improve softmax's perf in cuda (#144679 ) Fixes #144645 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144679 Approved by: https://github.com/eqy	2025-01-23 00:02:57 +00:00
Yiming Zhou	d0a2e11284	[BE][export] Change custom_op registration style (#145315 ) Summary: `test_unbacked_bindings_for_divisible_u_symint` has been flaky for a while due to ``` Tried to register an operator (mylib::foo(Tensor a, Tensor b) -> Tensor) with the same name and overload name multiple times. ``` It is likely due to all variants of this test (non-strict, retrace, serdes) being run simultaneously. In later tests, the operator has already been registered. In this diff, we change the registration style. Test Plan: ``` buck2 test mode/dev-nosan caffe2/test:test_export -- -r test_unbacked_bindings_for_divisible_u_symint ``` Differential Revision: D68465258 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145315 Approved by: https://github.com/zou3519	2025-01-22 23:46:51 +00:00
Hyunho Yeo	4803e20bc7	[S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322 ) Summary: As titled Test Plan: - Passed CI tests ``` buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled ``` https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/ Differential Revision: D68463242 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322 Approved by: https://github.com/yuhc, https://github.com/albanD	2025-01-22 23:39:43 +00:00
Aaron Orenstein	35c8c31f11	Fix for failure in D68425364 (#145304 ) Summary: Back out change from #145166 which causes an internal model to fail. Differential Revision: D68459095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145304 Approved by: https://github.com/izaitsevfb	2025-01-22 23:33:02 +00:00
Li Yu (ads)	e6a84be3d3	[PyTorch] Add backend aot_eager_decomp_partition_with_mode (#143250 ) Summary: ## Why To make it possible to run torch dispatch mode inside compiled modules. This is to enable running MemoryTrackerMode (in next diff) to collect memory usage of compiled modules. ## What Add a backend aot_eager_decomp_partition_with_mode. Add an enable_log to the backend to control the compilation logging (which can be very verbose and slow the run of mode) Test Plan: unittest E2e tested in the next diff which shows the memory read from the mode passed to this backend is very close to the actual job's memory snapshot. Differential Revision: D67227144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143250 Approved by: https://github.com/bdhirsh	2025-01-22 23:20:59 +00:00
PyTorch MergeBot	f0a210bf5d	Revert "Output of nonzero is transposed, fix fake tensor (#144695 )" This reverts commit 693d8c7e945cc494bd31ad694ae4f4b6ff13b82a. Reverted https://github.com/pytorch/pytorch/pull/144695 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68461259 ([comment](https://github.com/pytorch/pytorch/pull/144695#issuecomment-2608443589))	2025-01-22 23:04:50 +00:00
Eddie Yan	de945d78da	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-22 22:42:48 +00:00
PyTorch MergeBot	6e53588789	Revert "[BE]: Simplify set add with set update (#145152 )" This reverts commit 0cb9b2284a31fa497d684dbc2f56398c1d1e3114. Reverted https://github.com/pytorch/pytorch/pull/145152 on behalf of https://github.com/davidberard98 due to land race with https://github.com/pytorch/pytorch/pull/145165 broke lint ([comment](https://github.com/pytorch/pytorch/pull/145152#issuecomment-2608378172))	2025-01-22 22:14:26 +00:00
PyTorch MergeBot	dddf52b1b9	Revert "Enable grep_linter to use -a (#144589 )" This reverts commit 3c55669b8814237e018a613a494564da5bea9f15. Reverted https://github.com/pytorch/pytorch/pull/144589 on behalf of https://github.com/clee2000 due to the line parameter is kind of important and -a is not as important as I thought it was so I'm going to revert this ([comment](https://github.com/pytorch/pytorch/pull/144589#issuecomment-2608349155))	2025-01-22 21:55:27 +00:00
rzou	082c28c3c6	[compiled autograd] support Tensor Subclasses in AOTBackward (#144115 ) Compiled autograd's initial trace traces through the AOTBackward epilogue. The Tensor Subclass code is not traceable. This PR changes it so that when we see Tensor Subclass constructors, we proxy nodes for their construction into the graph. Test Plan: - New basic test with TwoTensor - Existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/144115 Approved by: https://github.com/jansel, https://github.com/xmfan, https://github.com/bdhirsh ghstack dependencies: #143296, #143304, #143387, #143405, #143417	2025-01-22 21:51:07 +00:00
rzou	99dd1bf1b9	[compiled autograd] stop specializing on metadata during initial trace (#143417 ) The previous PRs built up to this. We change compiled autograd's initial trace to stop baking in metadata. While tracing, we allocate some weirdly shaped tensors that we can put proxies on. The initial trace should not be accessing any metadata of these tensors (it will likely error out if it does because of how weird the shapes are). This involved fixing some various sites where we do specialize on the metadata, like: - we change CopySlices's apply_with_saved to proxy some calls into the graph (this change is fairly hard to split out by itself). - we stop calling InputBuffer::add - we delete the weird metadata from the graph so that no graph passes can make use of it. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304, #143387, #143405	2025-01-22 21:51:07 +00:00
rzou	ec820fe57c	[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405 ) We will always proxy autograd.Function nodes in compiled autograd's initial graph capture (previously there was an option to proxy vs trace into the autograd.Function) We have some requirements for the AOTBackward. Compiled Autograd runs accumulate grad reordering passes on the AOTBackward graph directly after the initial graph capture, so we can't just proxy a single node for it. Instead, we: - proxy the AOTBackward prologue function into the CA graph - copy-paste the AOTBackward graph into the CA graph - trace directly through the epilogue (the traced nodes go into the CA graph). Tracing through the epilogue is safe (assuming no Tensor subclasses) because the only thing the epilogue does is drop some outputs. The Tensor subclass situation was already broken so this doesn't regress anything but this PR sets it up to be fixed (in a followup, where we will proxy "make_subclass" calls into the graph from the epilogue). Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304, #143387	2025-01-22 21:50:56 +00:00
rzou	784bb2127c	[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387 ) We define a functional version of a C++ torch::autograd::Function. The functional version reconstructs the ctx object and then calls backward with it. Some more details: - we define how to pack/unpack ctx.saved_data into an IValue. It's a Dict[str, IValue], so it wasn't difficult. - every call to CppNode::apply_with_saved binds a new function to Python. This is because we're unable to reuse a previously bound function (the schema may change depending on what the user actually puts into their Dict[str, IValue]). Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387 Approved by: https://github.com/jansel, https://github.com/xmfan ghstack dependencies: #143296, #143304	2025-01-22 21:50:47 +00:00
rzou	8c7c5f7bfc	[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304 ) CopyBackwards is a manual C++ torch::autograd::Node; we update its apply_with_saved to proxy a functional version of it into the graph instead of inlining into it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304 Approved by: https://github.com/xmfan, https://github.com/jansel ghstack dependencies: #143296	2025-01-22 21:50:37 +00:00
rzou	5531fafffe	[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296 ) This PR is on the way to getting compiled autograd's initial capture to stop specializing on Tensor metadata. This PR changes compiled autograd's initial capture to proxy an opaque (w.r.t. Dynamo) function into the graph for all built-in codegen'ed autograd nodes and validate_outputs. We changed each codegen'ed apply_with_saved (e.g. MulBackward0::apply_with_saved) to call into Python to proxy a function (compiled_autograd.ops.MulBackward0) into the graph. Then, we use the node's InputMetadata to "guess" at the properties of the output Tensors to create some new FakeTensors. Some details: - MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to call into Python via libtorch_python. There is an indirection (PyCompilerInterface) to do this. - MulBackward0::apply_with_saved passes a C++ function to Python. To make our lives easier, every codegen'ed apply_with_saved passes a C++ function with the same signature `(variable_list, ivalue_list) -> variable_list`. - We define how to pack arbitrary C++ types into IValue via a helper IValuePacker struct and codegen functional variants of each builtin C++ autograd node (e.g. MulBackward0_apply_functional_ivalue). MulBackward0 before this PR: https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de MulBackward0 after this PR: https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0 Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296 Approved by: https://github.com/jansel	2025-01-22 21:50:29 +00:00
Aaron Gokaslan	0cb9b2284a	[BE]: Simplify set add with set update (#145152 ) Simplifies the set update slightly to be more readable and efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152 Approved by: https://github.com/XuehaiPan, https://github.com/albanD Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-01-22 21:31:13 +00:00
Ryan Guo	9f150786bb	[dynamo] Fix numpy test accuracy error induced by randomness divergence (#145293 ) Previously `TestGradient.test_second_order_accurate` was failing because of a small tolerance error (0.03... which is above the 0.03 tolerance). Upon investigating, `np.random.random` caused some divergence between eager and compiled randomness because in compiled we are not using `np.random`'s random seed, rather we end up using `torch`'s. This in turn caused numerical divergence and aforementioned accuracy issue. This patch fixes the failure by patching the test case with `use_numpy_random_stream=True`, which forces a graph break on `np.random.random()` and thereby falling back to eager to ensure consistency of the numpy randomness. Fixes #116746. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145293 Approved by: https://github.com/lezcano	2025-01-22 20:53:02 +00:00
Catherine Lee	2efa98d69d	Binary upload checksum (#144887 ) Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887 Approved by: https://github.com/atalman	2025-01-22 20:46:04 +00:00
Johnny	a57133e3c7	[NVIDIA] Jetson Thor Blackwell Support codegen (#145395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395 Approved by: https://github.com/eqy, https://github.com/malfet	2025-01-22 20:13:19 +00:00
albanD	0940eb6d44	Reverting the PR adding Kleidiai-based int4 kernels (#145392 ) Mitigation for https://github.com/pytorch/pytorch/issues/145273 Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392 Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai	2025-01-22 20:11:49 +00:00
Nikita Shulga	95ff9f0340	[Doc] Add period at the end of the sentence (#145384 ) Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/145384/generated/torch.compiler.disable.html#torch-compiler-disable Fixes https://github.com/pytorch/pytorch/issues/145365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145384 Approved by: https://github.com/huydhn, https://github.com/svekars, https://github.com/kit1980	2025-01-22 19:56:31 +00:00
PyTorch UpdateBot	3917053f63	[audio hash update] update the pinned audio hash (#145328 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145328 Approved by: https://github.com/pytorchbot	2025-01-22 19:39:03 +00:00
Nikita Shulga	70ccbade83	[MPSInductor] Add `gamma` op (#145341 ) By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #145309	2025-01-22 19:37:45 +00:00
Aaron Orenstein	b81209557b	Fix tests broken by #145176 (#145393 ) #145176 broke test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes test/dynamo/test_repros.py::ReproTests::test_graph_break_on_jit_isinstance this backs out the offending change until it can be fixed properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145393 Approved by: https://github.com/ZainRizvi	2025-01-22 19:33:16 +00:00
Aidyn-A	e8e3c03f96	[Test][Inductor] Fix test_tma_graph_breaks (#145271 ) Per title. Before these changes, below tests: ``` test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ``` fail with the following message: ``` ________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ________ Traceback (most recent call last): File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/usr/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper method(*args, **kwargs) File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test test(self, **param_kwargs) File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks eager_out = f(a, b) ^^^^^^^ File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f t.element_size(), ^ UnboundLocalError: cannot access local variable 't' where it is not associated with a value To execute this test, run the following from the base repo dir: python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271 Approved by: https://github.com/jansel	2025-01-22 19:18:59 +00:00
Zhengxu Chen	ac8ddf1150	[export][be] Clean up local imports from export [1/n] (#145287 ) Summary: as title Test Plan: CI Differential Revision: D68449844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145287 Approved by: https://github.com/pianpwk	2025-01-22 19:09:17 +00:00
rzou	30717d25fe	Move Dynamo test to skip from expected_failures (#145390 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/116105 This test is consistently failing. It shouldn't be marked as a flaky test in the CI using the disabled tests mechanism. I'm skipping the test for now. Test Plan: - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/145390 Approved by: https://github.com/williamwen42	2025-01-22 19:06:39 +00:00
Wu, Chunyuan	0bff377880	Align CPU behavior with CUDA for `ConvTranspose` when `out_channels=0` (#142859 ) Fixes https://github.com/pytorch/pytorch/issues/142466. Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case. Test plan: ``` python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32 python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859 Approved by: https://github.com/mingfeima, https://github.com/malfet	2025-01-22 17:52:53 +00:00
Ryan Guo	698106951e	[dynamo] Re-enable `test_fs` family for dynamo (#145302 ) Fixes #91467. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145302 Approved by: https://github.com/zou3519	2025-01-22 17:50:05 +00:00
Hyunho Yeo	057d9aff39	[S481486] [MTIA] Correct mtia.device_count() API (#145338 ) Summary: Prev: Count the number of "general" accelerators Curr: Count the number of MTIA devices by using the MTIA runtime API Test Plan: ``` buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_get_device_count ``` https://www.internalfb.com/intern/testinfra/testrun/8162774572631995 Reviewed By: BoyueZheng Differential Revision: D68472668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338 Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue	2025-01-22 17:45:15 +00:00
Huy Do	c27dd9cf72	Fix deprecated pytorch_sphinx_theme editable installation (#145347 ) Fixes https://github.com/pytorch/pytorch/issues/145221 Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457. The fix here is just to remove it and install `pytorch_sphinx_theme` normally. ### Testing Doc build is working with the change: * PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347 * Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347 Approved by: https://github.com/ZainRizvi	2025-01-22 17:28:16 +00:00
Nikita Shulga	288f21cc11	[MPS][BE] Prepare Gamma funcs to be moved ot headers (#145309 ) ---- - Use `float y = 1.0 + metal::frac(x)` instead of complex ```metal float y = x; int n = 0; bool less_than_one = (y < 1.0); // Add or subtract integers as necessary to bring y into (1,2) if (less_than_one) { y += 1.0; } else { n = static_cast<int>(floor(y)) - 1; y -= n; } ``` - Declare them all as templates, to avoid instantiation - Move global arrays to be local to the specific functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/145309 Approved by: https://github.com/dcci	2025-01-22 16:14:06 +00:00
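The range reduction described in the commit above can be checked with a small Python sketch. It assumes that `frac(x)` behaves like `x - floor(x)`; the function names `reduce_branchy` and `reduce_frac` are hypothetical and exist only for this illustration. The sketch only covers non-negative inputs, where the two forms agree; it makes no claim about negative inputs or about the actual Metal kernel.

```python
import math


def reduce_branchy(x: float) -> float:
    """Branchy form from the commit message: shift x by an integer into [1, 2)."""
    y = x
    if y < 1.0:
        y += 1.0
    else:
        n = math.floor(y) - 1
        y -= n
    return y


def reduce_frac(x: float) -> float:
    """Compact form: 1 + fractional part, assuming frac(x) == x - floor(x)."""
    return 1.0 + (x - math.floor(x))


# Exactly representable, non-negative sample points.
for x in (0.0, 0.25, 0.75, 1.0, 1.5, 2.0, 7.5, 100.25):
    assert reduce_branchy(x) == reduce_frac(x), x
```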
IvanKobzarev	c2b401933f	[torchbench] Fix mobilenetv2 inductor freezing fail_accuracy (#145296 ) Issue: https://github.com/pytorch/pytorch/issues/144891 inductor freezing effectively enables inductor conv-batchnorm fusion. This fusion increases the accuracy error. More context about this: https://github.com/pytorch/pytorch/issues/120545 For Timm models that are run through benchmarks/dynamo/timm_models.py with TimmRunner the tolerance was increased here: https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L367 If conv-batchnorm fusion is commented out, as Elias suggested in the context issue, the accuracy is restored. => Increase the tolerance for mobilenetv2 to the same value by introducing a special tolerance configuration that applies to freezing only Pull Request resolved: https://github.com/pytorch/pytorch/pull/145296 Approved by: https://github.com/eellison, https://github.com/zou3519	2025-01-22 15:54:09 +00:00
CaoE	0dbff7e4be	Add MKLDNN support for Half GELU (#145339 ) Add MKLDNN support for Half GELU to align with BFloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145339 Approved by: https://github.com/yanbing-j, https://github.com/leslie-fang-intel, https://github.com/Skylion007	2025-01-22 15:14:51 +00:00
Isuru Fernando	0efa843392	Dynamic shape guards in C++ (#139899 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899 Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel ghstack dependencies: #143385, #143164	2025-01-22 14:58:35 +00:00
Isuru Fernando	fbaef0ac03	Add a language option for symbolic shape guards (#143164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143164 Approved by: https://github.com/ezyang ghstack dependencies: #143385	2025-01-22 14:58:35 +00:00
Isuru Fernando	4b77ff9784	Fix PythonMod printing for C++ (#143385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385 Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305	2025-01-22 14:58:35 +00:00
Boyuan Feng	079a3e0f75	[BE] Add type annotations to cudagraph_utils.py and test_cases.py (#145291 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145291 Approved by: https://github.com/Skylion007	2025-01-22 14:54:45 +00:00
Isuru Fernando	31c2f36989	Fix triton masked loading for non-block tl.loads (#144782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782 Approved by: https://github.com/eellison	2025-01-22 14:30:56 +00:00
Yiming Zhou	3cbc8c54fd	[BE][export] Remove disabled floordiv test in export (#145292 ) Summary: Removing `test_slice_with_floordiv` as it doesn't raise the Runtime Error as expected and it has been disabled since the time it was added https://github.com/pytorch/pytorch/issues/131101 For the case that we expect to fail, it actually returns an empty tensor. This is consistent with the following snippet which prints an empty tensor ``` a = torch.ones(4) print(a[5:]) ``` Test Plan: CI Differential Revision: D68450650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145292 Approved by: https://github.com/pianpwk	2025-01-22 05:17:56 +00:00
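The slicing behaviour mentioned in the commit above (a slice that starts past the end yields an empty result rather than an error) also holds for plain Python sequences. The sketch below uses a built-in list instead of a tensor so that it runs without any dependencies; it illustrates the slicing rule only and is not the removed test.

```python
a = [1.0] * 4          # four elements, valid indices 0..3

tail = a[5:]           # slice start is past the end of the sequence
assert tail == []      # slicing clamps to the length and returns an empty list

try:
    a[5]               # plain indexing, by contrast, is not clamped
except IndexError:
    out_of_range = True
else:
    out_of_range = False
assert out_of_range
```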
Aaron Orenstein	99dbc5b0e2	PEP585 update - test (#145176 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176 Approved by: https://github.com/bobrenjc93	2025-01-22 04:48:28 +00:00
Chris Sidebottom	40e27fbcf2	Refactor CPUReproTests to be more vector-length agnostic (#141245 ) This changes the hardcoded assumptions of a `256-bit` vector length to querying from `cpu_vec_isa` and changes relevant tests to share the logic. Also refactored the `config.cpp.simdlen != 1` into the assertion so we stop duplicating it throughout the test cases. Fixes issues on `128-bit` machines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141245 Approved by: https://github.com/desertfire, https://github.com/malfet	2025-01-22 04:24:45 +00:00
Ryan Guo	dcd9de79e7	[dynamo] Re-enable a AOT-Dispatch test with Dynamo (#145299 ) Fixes #124590. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145299 Approved by: https://github.com/zou3519	2025-01-22 03:47:05 +00:00
Shunting Zhang	3a58512613	[Inductor] inplace padding (#140249 ) https://github.com/pytorch/pytorch/issues/139865 This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests to do inplace update. Looks like thanks to functionalization, this works fine. Perf for `test_linear_and_cel`: ``` # TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=False ms=83.311 # TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=True ms=79.827 ``` The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c. - Without the feature: 182.151ms per batch, 180.9K tokens/s - With the feature: 178.278ms per batch, 183.9K tokens/s. That is a 3K tokens/s increase. Perf test shows compilation time regression. I'm not sure if that's real. Will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7). UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration. Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249 Approved by: https://github.com/jansel, https://github.com/eellison	2025-01-22 03:37:06 +00:00
sanchitintel	46851022ff	[Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187 ) ## Summary Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner . Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token. Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case). Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of such UTs would have to be explicitly enabled with `max-autotune` Inductor config. ## Performance data The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used. For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x. Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5 Co-authored-by: Leslie Fang <leslie.fang@intel.com>	2025-01-22 02:27:53 +00:00
Simon Fan	27598cd154	[fx] move DCE rand check to import time (#145118 ) Mitigates the deterministic benchmark regression: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844. and maybe the dashboard issue. fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, EXCEPT we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (>O(parameters)) for the subgraphs traced by the pattern matcher. I considered addressing this problem by reducing the amount of times DCE is called, but I think we can only trim the ones from the pattern matcher, which will require some refactor/caching solution that I leave out of this PR. torch.Tag.nondeterministic_seeded is provided by native_functions.yml and is implemented as a list. Most of the time, it has <=2 elements, so it's not really worth it to turn it into a set for fast lookup. Using the deterministic instruction count benchmarks ```python # before aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946 aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058 # after aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314 aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-01-22 02:23:02 +00:00
Aaron Orenstein	f2cfe8b59f	PEP585 update - mostly toplevels (#145178 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178 Approved by: https://github.com/bobrenjc93	2025-01-22 02:21:14 +00:00
Aaron Orenstein	1ce533867f	Teach dynamo to handle GenericAlias without a graph break (#145240 ) Dynamo wasn't handling the new PEP585 type annotations: ``` x = list[Foo] ``` Although this worked in py3.9 this was causing an `unimplemented` (Unexpected type in sourceless builder) in py3.12. This fixes it to treat them as a BuiltinVariable. Fixes #145226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145240 Approved by: https://github.com/anijain2305	2025-01-22 01:55:51 +00:00
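For readers unfamiliar with the object involved in the commit above: subscripting a built-in container such as `list` produces a `types.GenericAlias` instance (PEP 585, Python 3.9+). The sketch below inspects such an object using only the standard library; `Foo` is a placeholder class, and nothing here reflects how Dynamo handles the value internally.

```python
import types
import typing


class Foo:
    """Placeholder element type for the illustration."""


x = list[Foo]

# The subscripted builtin is a GenericAlias, not a plain class.
assert isinstance(x, types.GenericAlias)

# The original container and its type arguments can be recovered.
assert typing.get_origin(x) is list
assert typing.get_args(x) == (Foo,)

# Calling the alias constructs an instance of the origin type.
assert x() == []
```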
PyTorch MergeBot	2d1649bc2a	Revert "[triton] Update triton pin to include warp specialization support (#145120 )" This reverts commit e261629dc85c061ee35f539ee8bd35aec9971215. Reverted https://github.com/pytorch/pytorch/pull/145120 on behalf of https://github.com/ZainRizvi due to Reverting since the test failures are about not being able to find a version of triton to install, and this is breaking trunk as well ([comment](https://github.com/pytorch/pytorch/pull/145120#issuecomment-2606107792))	2025-01-22 01:52:36 +00:00
Nikita Shulga	f2d7fe12d8	[BE][MPS] Mark gamma inputs as const (#145295 ) Doubt it will change the perf, but it's good to correctly mark const inputs as const Pull Request resolved: https://github.com/pytorch/pytorch/pull/145295 Approved by: https://github.com/manuelcandales ghstack dependencies: #145289	2025-01-22 01:00:53 +00:00
Nikita Shulga	c106e9b4c6	[BE][MPS] Move Gamma kernels to its own file (#145289 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145289 Approved by: https://github.com/manuelcandales, https://github.com/dcci	2025-01-22 01:00:53 +00:00
Nikita Shulga	1908116ace	[MPS][BE] Move vectypes from Quantized to utils (#145312 ) That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312 Approved by: https://github.com/dcci	2025-01-22 00:37:28 +00:00
Huy Do	266fd35c58	Fix ExecuTorch, XLA, Triton hash updates (#145314 ) Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb * XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding the job names, which are no longer correct and cause the bot to wait forever * The Triton commit hash hasn't been updated automatically since 2023, and people have been updating the pin manually with their own testing from time to time, so I doubt that it would be a useful thing to keep. The vision update failures look more complex, though, and I would need to take a closer look. So, I will keep that for another PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314 Approved by: https://github.com/izaitsevfb	2025-01-21 23:24:21 +00:00
rzou	1e8d6d6f0e	[SkipFiles] New modules added to torch.* are inlined by default (#145279 ) This PR: - makes it so that new modules added to torch are inlined by default - adds a list of the previously "skipped by default" modules to avoid regressing anything. This is a new MOD_SKIPLIST list that is consulted in trace_rules.check_file. - Follow-up work will go through this list, one-by-one, and try to delete modules. I think we should be able to delete almost everything, except for torch._dynamo. Test Plan - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279 Approved by: https://github.com/yanboliang	2025-01-21 23:24:12 +00:00
Hongtao Yu	e261629dc8	[triton] Update triton pin to include warp specialization support (#145120 ) The warp specialization work has been landed to the triton rc/3.2.x branch as `b2684bf3b0` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120 Approved by: https://github.com/bertmaher	2025-01-21 22:14:13 +00:00
Jane Xu	19c3ba44a2	Use TORCH_CHECK instead of std::runtime_error in stack.h and ivalue.h (#145280 ) TORCH_CHECK will preserve the stacktrace for when TORCH_CPP_SHOW_STACKTRACES=1, whereas std::runtime_error will not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145280 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-21 21:58:59 +00:00
Catherine Lee	7dd9d1f243	Update clickhouse-connect to 0.8.14 (#144915 ) Corresponds to https://github.com/pytorch/test-infra/pull/6177 I only tested the slow test script but I also did testing on the new version with scripts in https://github.com/pytorch/test-infra/pull/6177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144915 Approved by: https://github.com/huydhn	2025-01-21 21:43:18 +00:00
johnnynunez	35f5668f7e	[NVIDIA] RTX50 Blackwell Support codegen (#145270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270 Approved by: https://github.com/ezyang	2025-01-21 21:10:05 +00:00
PyTorch MergeBot	895659cb41	Revert "Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 )" This reverts commit 07e23653cd9ef8cfda01773d94d9f76e5072528d. Reverted https://github.com/pytorch/pytorch/pull/142848 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68355212 ([comment](https://github.com/pytorch/pytorch/pull/142848#issuecomment-2605734067))	2025-01-21 21:04:45 +00:00
Aaron Orenstein	bac62341eb	PEP585 update - torch/_inductor (#145198 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198 Approved by: https://github.com/bobrenjc93	2025-01-21 21:04:33 +00:00
Aaron Orenstein	2f9d378f7b	PEP585 update - torch/utils (#145201 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201 Approved by: https://github.com/bobrenjc93	2025-01-21 21:04:10 +00:00
Edward Z. Yang	693d8c7e94	Output of nonzero is transposed, fix fake tensor (#144695 ) Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695 Approved by: https://github.com/bobrenjc93, https://github.com/albanD	2025-01-21 20:50:09 +00:00
Edward Z. Yang	323fb4dad0	Unconditionally exclude upper bound in all size oblivious tests (#144867 ) I was thinking about https://github.com/pytorch/pytorch/pull/144471 some more and I thought, "Hmm, why not just always exclude the constant upper bound." So here it is. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144867 Approved by: https://github.com/bobrenjc93	2025-01-21 20:44:09 +00:00
Wei Wang	df67ac4c86	[CI][CUDA][Distributed][FSDP] Remove hardcoded world size of 2 (#145195 ) These unit tests would fail if run on a single GPU, i.e. `skip_if_lt_x_gpu(2)` seems to view the world size as 2 even on platforms with 1 GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145195 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-01-21 20:25:52 +00:00
Jason Ansel	505ade7471	[inductor] Simplify mode options, only apply CompilerBisector changes once (#145232 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145232 Approved by: https://github.com/yanboliang	2025-01-21 19:25:46 +00:00
RanTao123	85811631d7	[Intel CPU] Fix issue #143489 . (#145062 ) Fix issue in https://github.com/pytorch/pytorch/issues/143489. kernel_height * kernel_weight will cause Floating point exception, so we will divide by them one by one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145062 Approved by: https://github.com/soulitzer	2025-01-21 18:38:33 +00:00
Joel Schlosser	128f3627b1	Implement backward for NJT matmul (#144587 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587 Approved by: https://github.com/soulitzer ghstack dependencies: #144586	2025-01-21 18:27:50 +00:00
Joel Schlosser	af204135d8	Fix NJT fill.Scalar for contiguous inputs (#144586 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586 Approved by: https://github.com/soulitzer	2025-01-21 18:22:08 +00:00
Edward Z. Yang	efa88e04e1	Don't overspecialize float when propagating cache guards to ShapeEnv (#145078 ) Fixes https://github.com/pytorch/pytorch/issues/142507 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078 Approved by: https://github.com/Skylion007	2025-01-21 18:05:43 +00:00
Edward Z. Yang	b3e90c8c33	Add support for torch function on dtype arguments (#145085 ) Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085 Approved by: https://github.com/vmoens, https://github.com/Skylion007	2025-01-21 17:44:47 +00:00
Huy Do	eb553ae3cf	Fix broken gpt_fast micro benchmark after #144315 (#145235 ) The benchmark is failing with the following error ``` File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module> main(output_file=args.output, only_model=args.only) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main lst = func(device) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper return fn(self, *args, **kwargs) TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs' ``` An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555 I also assign `oncall: pt2` as the owner of this job going forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235 Approved by: https://github.com/nmacchioni	2025-01-21 17:42:24 +00:00
atalman	2cffbff7da	Add 3.13t Windows and MacOS binary builds (#141806 ) Related to: https://github.com/pytorch/pytorch/issues/130249 For conda uses approach described here: https://conda-forge.org/blog/2024/09/26/python-313/ Create Python 3.13t conda env like so: ``` conda create -n py313 python=3.13 python-freethreading -c conda-forge ``` For windows executable installation we need to pass additional parameter to enable 3.13t: ``` Include_freethreaded=1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141806 Approved by: https://github.com/albanD	2025-01-21 17:16:19 +00:00
Aaron Orenstein	0afd335174	PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175 Approved by: https://github.com/bobrenjc93	2025-01-21 16:57:27 +00:00
Shunting Zhang	803017f3cb	[inductor] fix MA on poor gpu (#145133 ) Found this bug when debugging a MA issue in CI that cannot be reproduced on devgpu. On GPUs with fewer than 68 SMs (like the NVIDIA L4 used in CI), running torch.compile in max-autotune mode may result in the following confusing error https://gist.github.com/shunting314/370f42f547e3367a3773237942725a86 complaining about layout: ``` torch._inductor.exc.InductorError: LoweringException: AssertionError: convert FlexibleLayout to FixedLayout first ``` The reason is that even if we don't pick a Triton template, Inductor still returns a MultiTemplateBuffer for tuned addmm. MultiTemplateBuffer.get_reads, called from Reduction.num_splits, may index a FlexibleLayout, which results in the aforementioned error. The issue does not appear on devgpu because we freeze the layout of addmm inputs when rendering Triton templates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145133 Approved by: https://github.com/jansel	2025-01-21 09:31:34 +00:00
Aaron Orenstein	b5655d9821	PEP585 update - .ci android aten (#145177 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145177 Approved by: https://github.com/Skylion007	2025-01-21 06:31:26 +00:00
Aaron Orenstein	00ffeca1b1	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-21 04:23:29 +00:00
PyTorch MergeBot	c6986ca2e1	Revert "[dcp] Add ZStandard transformer (#143360 )" This reverts commit 7b56b039afe2b4a4038c09d8b6cb1597823f3d5f. Reverted https://github.com/pytorch/pytorch/pull/143360 on behalf of https://github.com/atalman due to Broke 3.13t builds please test with ciflow/binaries label attached ([comment](https://github.com/pytorch/pytorch/pull/143360#issuecomment-2603433066))	2025-01-21 01:10:16 +00:00
PyTorch MergeBot	5fd881a5b6	Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 )" This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6. Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))	2025-01-21 00:49:55 +00:00
Aaron Orenstein	dea7ad3371	PEP585 update - torch/testing (#145200 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200 Approved by: https://github.com/bobrenjc93	2025-01-20 22:42:42 +00:00
Aaron Orenstein	805c4b597a	PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202 Approved by: https://github.com/bobrenjc93	2025-01-20 22:37:26 +00:00
Aaron Orenstein	54a00af2c6	PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175 Approved by: https://github.com/bobrenjc93	2025-01-20 22:32:59 +00:00
Aaron Orenstein	bd97ce0b45	PEP585 update - torch/ao (#145199 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145199 Approved by: https://github.com/bobrenjc93	2025-01-20 22:32:35 +00:00
Aaron Gokaslan	cf05f6a134	[BE]: Improve typing for torch/fx/_pytree.py and torch/utils/_pytree.py (#145173 ) Improve type inference in _pytree.py utility functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/145173 Approved by: https://github.com/bobrenjc93	2025-01-20 22:18:19 +00:00
Wang, Chuanqi	225a10febe	[CI] Add xpu linux build into pull workflow (#145084 ) To mitigate the XPU build failure risk introduced by non-XPU specific PRs. Refer #144967 & #143803 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145084 Approved by: https://github.com/huydhn, https://github.com/atalman	2025-01-20 19:31:48 +00:00
Zhengxu Chen	d0100050dd	[aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [2/n] (#145091 ) Summary: Following up D68122536 to remove configurable aot_mode for inner_compile Test Plan: CI Reviewed By: desertfire Differential Revision: D68158512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145091 Approved by: https://github.com/ydwu4	2025-01-20 19:09:10 +00:00
Aaron Orenstein	0b2a3687b9	PEP585 update - torch/fx (#145166 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145166 Approved by: https://github.com/bobrenjc93	2025-01-20 18:11:54 +00:00
PyTorch MergeBot	6374332d33	Revert "PEP585 update - torch/distributed (#145164 )" This reverts commit 6cb186e279bc179a6bb63f0226e24ab42a07b394. Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))	2025-01-20 16:46:46 +00:00
Dmitry Nikolaev	57b2b64acf	Fix always true scaled_mm test (#143912 ) Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with. Scales were optional arguments before PR https://github.com/pytorch/pytorch/pull/128683 Then test_float8_scale started comparing two identical results and lost its meaning. Reason for making scales required: https://github.com/pytorch/pytorch/pull/128683#issuecomment-2169146402 This PR uses scale=1.0 to compare the result with scaled matmul Pull Request resolved: https://github.com/pytorch/pytorch/pull/143912 Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony	2025-01-20 16:17:46 +00:00
Aleksei Nikiforov	53e2408015	Improve cleanup of cancelled jobs on s390x for tests too (#144968 ) Follow up to https://github.com/pytorch/pytorch/pull/144149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968 Approved by: https://github.com/huydhn	2025-01-20 12:56:07 +00:00
Sun, Jiayi	92b9da1fc2	fix torch.atan for torch.complex datatypes on CPU (#144749 ) Fix https://github.com/pytorch/pytorch/issues/141487. This issue is caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `atan`. For correctness, I temporarily fall back to the scalar implementation of `atan`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2025-01-20 08:45:03 +00:00
Sun, Jiayi	ed669a9db7	fix torch.div for torch.complex datatypes on CPU (#140375 ) Fix https://github.com/pytorch/pytorch/issues/135428. Fix https://github.com/pytorch/pytorch/issues/106845. These two issues are caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `div`. For correctness, I temporarily fall back to the scalar implementation of `div`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140375 Approved by: https://github.com/mingfeima	2025-01-20 08:34:29 +00:00
Sun, Jiayi	c922ccb7c4	fix sigmoid for torch.complex datatypes on CPU (#140391 ) Fix https://github.com/pytorch/pytorch/issues/135777. This issue is caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `reciprocal`. For correctness, I temporarily fall back to the scalar implementation of `reciprocal`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391 Approved by: https://github.com/mingfeima, https://github.com/Skylion007 ghstack dependencies: #140358	2025-01-20 08:23:58 +00:00
Sun, Jiayi	507bf65c6a	fix torch.exp for torch.complex datatypes on CPU (#140358 ) Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063. These two issues are caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `exp`. For correctness, I temporarily fall back to the scalar implementation of `exp`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2025-01-20 08:03:17 +00:00
ankurneog	972d4a154d	Add facility to run dynamo UTs for non-cuda devices (#140929 ) This is in line with changes introduced with https://github.com/pytorch/pytorch/pull/130714, additional files are included to support non-cuda devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140929 Approved by: https://github.com/kwen2501, https://github.com/EikanWang, https://github.com/guangyey	2025-01-20 05:56:38 +00:00
Aaron Orenstein	2b809e58ad	PEP585 update - torch/onnx (#145174 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145174 Approved by: https://github.com/justinchuby	2025-01-20 05:48:52 +00:00
Animesh Jain	19584b28fd	[dynamo][dicts] Consolidate dict(..) construction (#144342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342 Approved by: https://github.com/StrongerXi	2025-01-20 04:42:06 +00:00
Nikita Shulga	980c75fe6e	[MPSInductor] Add `TrueDiv` and `Round[Int|Decimal]` (#145160 ) That fixes `test_builtins_round_float_ndigits_neg` and `test_builtins_round` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145160 Approved by: https://github.com/jansel, https://github.com/dcci	2025-01-20 04:29:42 +00:00
Aaron Orenstein	6cb186e279	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-20 00:19:01 +00:00
Aaron Orenstein	b6c5562c1f	PEP585 update - torch/export (#145165 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145165 Approved by: https://github.com/bobrenjc93	2025-01-19 20:56:55 +00:00
Aaron Orenstein	316808e4e9	PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163 Approved by: https://github.com/Skylion007	2025-01-19 20:55:59 +00:00
Aaron Orenstein	c64e657632	PEP585 update - torch/distributed/fsdp (#145162 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162 Approved by: https://github.com/bobrenjc93	2025-01-19 20:04:05 +00:00
Nikita Shulga	371a361db9	Enable bfloat16 testing on MacOS14+ (#145159 ) As Metal-3.1 supports this dtype Pull Request resolved: https://github.com/pytorch/pytorch/pull/145159 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #145157	2025-01-19 19:35:31 +00:00
Aaron Orenstein	97d4d3c40a	PEP585 update - torch/_export (#145138 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145138 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145154	2025-01-19 18:48:35 +00:00
Aaron Orenstein	cd8d0fa20c	Tweak schema_check to handle annotated builtin types (#145154 ) As of python 3.9 annotated lists can be written as `list[T]` and `List[T]` has been deprecated. However schema_check was converting `list[T]` to simply be `list`. This change teaches it to handle `list[T]` the same as `List[T]`. A couple small drive-by changes I noticed as well: - Path concatenation should use `os.path.join`, not `+` - Spelling in error message Pull Request resolved: https://github.com/pytorch/pytorch/pull/145154 Approved by: https://github.com/bobrenjc93	2025-01-19 18:48:35 +00:00
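The `list[T]` versus `List[T]` handling described in #145154 can be illustrated with the standard `typing` helpers. This is a minimal sketch, not the actual `schema_check` code: `describe_annotation` and the file names are hypothetical, chosen only to show that both spellings reduce to the same origin and arguments, and that `os.path.join` replaces `+` for paths.

```python
import os.path
import typing
from typing import List


def describe_annotation(tp):
    """Return (origin, args) so that list[T] and List[T] look the same.

    Hypothetical helper for illustration; not part of schema_check.
    """
    origin = typing.get_origin(tp)  # `list` for both list[int] and List[int]
    args = typing.get_args(tp)      # (int,) for both spellings
    if origin is None:
        # Bare types such as `int` or un-parameterized `list` have no origin.
        return tp, ()
    return origin, args


builtin_form = describe_annotation(list[int])
typing_form = describe_annotation(List[int])

# Path concatenation via os.path.join instead of `+` (names are made up).
schema_path = os.path.join("schema_dir", "schema.yaml")
```

Dropping the arguments and keeping only the origin would lose the element type, which appears to be the behaviour the commit message describes for `list[T]` before the fix.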
Aaron Orenstein	9e0437a04a	PEP585 update - torch/ao/quantization (#145140 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145140 Approved by: https://github.com/bobrenjc93	2025-01-19 10:20:00 +00:00
Aaron Orenstein	78bff1e8c1	PEP585 update - torch/_functorch (#145139 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145139 Approved by: https://github.com/bobrenjc93	2025-01-19 07:06:10 +00:00
cassanof	10e4d3aebb	[DCP] Fix fsspec fsync bug on .finish() (#144753 ) Fixes #144752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144753 Approved by: https://github.com/Skylion007, https://github.com/saumishr	2025-01-19 03:21:00 +00:00
Davide Italiano	8cc415774f	[mps/inductor] Introduce a metal approx for erf() and use it. (#145161 ) Probably we can do better, but this is a start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161 Approved by: https://github.com/malfet	2025-01-19 02:29:05 +00:00
Aaron Orenstein	893ca1dfe1	PEP585 update - torch/_inductor/[_-i]* (#145137 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137 Approved by: https://github.com/bobrenjc93	2025-01-19 01:22:47 +00:00
Nikita Shulga	cede43e06b	[MPSInductor][BE] NaN-propagating min/max to header (#145157 ) May be reused later from the eager op as well. Also, didn't know that Metal already has type_traits. And use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation for the specific architecture Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157 Approved by: https://github.com/dcci	2025-01-18 22:52:44 +00:00
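The NaN-propagating min/max idea from #145157 can be sketched in plain Python. This is an illustration under assumptions, not the Metal header: `is_unordered`, `nan_propagating_max` and `nan_propagating_min` are hypothetical names, and the unordered check is the `a != a || b != b` form quoted in the commit message.

```python
import math


def is_unordered(a: float, b: float) -> bool:
    # True when either operand is NaN, because NaN never equals itself.
    return a != a or b != b


def nan_propagating_max(a: float, b: float) -> float:
    # Return NaN if either input is NaN, otherwise the larger value.
    if is_unordered(a, b):
        return math.nan
    return a if a > b else b


def nan_propagating_min(a: float, b: float) -> float:
    # Return NaN if either input is NaN, otherwise the smaller value.
    if is_unordered(a, b):
        return math.nan
    return a if a < b else b
```

For comparison, Python's built-in `max(1.0, float("nan"))` returns `1.0` while `max(float("nan"), 1.0)` returns NaN, so the result depends on argument order. The explicit unordered check removes that asymmetry.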
Aaron Orenstein	5b5766665d	PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #145105	2025-01-18 20:47:12 +00:00
Aaron Orenstein	a79100ab11	PEP585 update - torch/_dynamo (#145105 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105 Approved by: https://github.com/bobrenjc93	2025-01-18 20:47:11 +00:00
Aaron Orenstein	c95efc37ba	PEP585 update - torch/distributed/tensor (#145141 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141 Approved by: https://github.com/bobrenjc93	2025-01-18 20:01:59 +00:00
Davide Italiano	4f8237dbad	[mps/inductor] Skip "double" tests as 64-bits FP is not supported. (#145123 ) 257 tests failed (before) -> 242 tests failed (after) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145123 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-18 19:13:34 +00:00
PyTorch MergeBot	5802be698e	Revert "parametrized test name handles class arguments (#133546 )" This reverts commit 4e4b8592a32f701b4974679ab1381ba7cccd4844. Reverted https://github.com/pytorch/pytorch/pull/133546 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but trying to disable the new tests doesn't seem to fully cover all the cases and some are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/133546#issuecomment-2599814339))	2025-01-18 18:12:18 +00:00
Joel Schlosser	b63b81410c	Fix NJT frexp() to handle both outputs (#144585 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585 Approved by: https://github.com/soulitzer ghstack dependencies: #144582, #144583, #144584	2025-01-18 15:59:56 +00:00
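Why a single-output pointwise fallback cannot handle `frexp()` can be seen with the standard library's `math.frexp`, which returns a `(mantissa, exponent)` pair. The sketch below uses a list of variable-length lists as a stand-in for a nested tensor; `frexp_nested` is a hypothetical helper and says nothing about the actual NJT implementation.

```python
import math


def frexp_nested(components):
    """Apply frexp to ragged rows and wrap BOTH outputs in the same layout.

    Hypothetical stand-in: a list of lists plays the role of a nested tensor.
    """
    mantissas, exponents = [], []
    for row in components:
        pairs = [math.frexp(value) for value in row]  # each pair is (m, e)
        mantissas.append([m for m, _ in pairs])
        exponents.append([e for _, e in pairs])
    return mantissas, exponents


mantissa, exponent = frexp_nested([[8.0, 1.0], [0.75]])
```

Each input satisfies `value == m * 2**e`, and both results keep the ragged shape of the input.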
Joel Schlosser	3ee531f8b9	Support NJT chunk() backward on batch dim (#144584 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584 Approved by: https://github.com/soulitzer ghstack dependencies: #144582, #144583	2025-01-18 15:58:24 +00:00
Nikita Shulga	8a57234033	[MPSInductor] Implement `i0` and `i1` ops (#145092 ) Using shared definitions with eager op Pull Request resolved: https://github.com/pytorch/pytorch/pull/145092 Approved by: https://github.com/dcci, https://github.com/jansel ghstack dependencies: #145023, #145087	2025-01-18 15:41:02 +00:00
Edward Z. Yang	1d9fc9df38	Downgrade ignored guard to info level (#145075 ) Fixes https://github.com/pytorch/pytorch/issues/101265 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145075 Approved by: https://github.com/Skylion007	2025-01-18 15:30:01 +00:00
chilli	5e4cf3e6ad	Moved .all() checks for distributions to _is_all_true (#145029 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145029 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-18 07:55:48 +00:00
Aaron Orenstein	2bf772d1ba	PEP585 update - torch/_inductor/codegen (#145106 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106 Approved by: https://github.com/bobrenjc93	2025-01-18 06:56:03 +00:00
Shangdi Yu	4bf29f44b7	[aoti] Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass (#145028 ) Summary: Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass because this op is blocking fusion. This should not have any effect on the result, because the op would not show up in the final aoti compiled model anyway (the assertion has no effect). A real example where this improves performance: In the example below, the post grad graph would contain `torch.ops.aten._assert_tensor_metadata.default`, because of PR https://github.com/pytorch/pytorch/pull/142420. This op is added when functionalizing aten.to. We want the `add` node from `linear` to be fused with the rest of the pointwise ops, instead of fused with the `mm` from `linear`. ``` class Model(torch.nn.Module): def __init__(self, input_dim, hidden_dim): super(Model, self).__init__() self.linear = nn.Linear(input_dim, hidden_dim).half() self.rms_norm = nn.RMSNorm(hidden_dim) def forward(self, x): linear_458 = self.linear(x) # Linear layer with weights' # mimic the torchtune rms norm: /torchtune/torchtune/modules/rms_norm.py linear_458 = linear_458.to(torch.float32) rms_norm_34 = self.rms_norm(linear_458) # RMS Normalization sigmoid_168 = torch.sigmoid(rms_norm_34) # Sigmoid activation function mul_168 = sigmoid_168 * rms_norm_34 # Element-wise multiplication return mul_168 def main(): with torch.no_grad(): input_dim = 512 hidden_dim = 256 batch_size = 32 model = Model(input_dim, hidden_dim).to("cuda") example_inputs = ( torch.randn(batch_size, input_dim).to("cuda").to(torch.float16), ) ep = torch.export.export(model, example_inputs) package_path = torch._inductor.aoti_compile_and_package(ep) ``` Test Plan: CI Differential Revision: D68303114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145028 Approved by: https://github.com/angelayi	2025-01-18 06:06:25 +00:00
Nikita Shulga	dc9b77cc55	[MPS] Support includes in metal objects (#145087 ) Useful for code reuse in Metal shader builds for both eager mode and MPSInductor, but it requires one to implement a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers for the shader to be used in dynamic compilation. Test using: - `TestMetalLibrary.test_metal_include` - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and call it from `SpecialOps.metal` shader, which now looks much more compact: ```metal template <typename T, typename Tout = T> void kernel i0(constant T* input, device Tout* output, uint index [[thread_position_in_grid]]) { output[index] = c10::i0(static_cast<Tout>(input[index])); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087 Approved by: https://github.com/dcci ghstack dependencies: #145023	2025-01-18 05:35:22 +00:00
Will Constable	2859b11bdb	[pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045 ) Summary: PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided by NCCLX and RCCL) assumes that a single contiguous buffer is used. These are fundamentally mismatched. The list of tensors might not be contiguous or even ordered (buffer addresses might not be in increasing order). This patch removes the ncclAllToAllv specialization for PTD all_to_all, and instead lets it directly call ncclSend/ncclRecv. Co-authored by @pavanbalaji Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045 Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang	2025-01-18 05:26:55 +00:00
Aaron Orenstein	07669ed960	PEP585 update - benchmarks tools torchgen (#145101 ) This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows: Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes: ``` --- a/tools/linter/adapters/ruff_linter.py +++ b/tools/linter/adapters/ruff_linter.py @@ -313,6 +313,7 @@ "ruff", "check", "--fix-only", + "--unsafe-fixes", "--exit-zero", *([f"--config={config}"] if config else []), "--stdin-filename", ``` Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent): ``` --- a/pyproject.toml +++ b/pyproject.toml @@ -40,7 +40,7 @@ [tool.ruff] -target-version = "py38" +target-version = "py39" line-length = 88 src = ["caffe2", "torch", "torchgen", "functorch", "test"] @@ -87,7 +87,6 @@ "SIM116", # Disable Use a dictionary instead of consecutive `if` statements "SIM117", "SIM118", - "UP006", # keep-runtime-typing "UP007", # keep-runtime-typing ] select = [ ``` Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101 Approved by: https://github.com/bobrenjc93	2025-01-18 05:05:07 +00:00
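As a small illustration of what the PEP 585 series changes (`Dict` -> `dict`, `List` -> `list`), here is the same function annotated both ways. The function names are made up for this sketch; only the annotation style differs, and the built-in generics require Python 3.9 or newer.

```python
from typing import Dict, List  # pre-PEP 585 spellings from the typing module


def count_words_old(words: List[str]) -> Dict[str, int]:
    # Annotated with typing.List / typing.Dict.
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_words_new(words: list[str]) -> dict[str, int]:
    # Same logic, annotated with built-in generics (PEP 585).
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Runtime behaviour is identical in both versions. The rewrite only affects annotations, which is why most of the series could be automated.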
Will Constable	2c4281d7da	Make MultiProcContinuousTest timeout configurable (#145099 ) Allows test classes using MPCT to set their own timeout as a class property, which is good enough since the processgroup is shared across test instances and the timeout is set at processgroup init. Also sets a default timeout of 2 minutes, which is probably (?) long enough for reasonable tests, but can be changed if it causes flakiness. It's preferable to have as short a default timeout as possible, since getting a timeout quickly helps when debugging tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099 Approved by: https://github.com/d4l3k, https://github.com/fduwjj ghstack dependencies: #145010, #145011	2025-01-18 04:37:12 +00:00
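The class-property pattern described in #145099 might look roughly like the sketch below. All names here (`ContinuousTestBase`, `init_process_group_kwargs`, `SlowCollectiveTest`) are hypothetical and no process group is created. The sketch only shows how a class-level default of 2 minutes can be overridden per test class and read once at setup.

```python
import datetime


class ContinuousTestBase:
    """Hypothetical stand-in for a test base class sharing one process group."""

    # Default timeout of 2 minutes; subclasses may override this attribute.
    timeout = datetime.timedelta(seconds=120)

    @classmethod
    def init_process_group_kwargs(cls):
        # A real test would pass this value to the process group constructor
        # once, at init. Here we only return it to show which value wins.
        return {"timeout": cls.timeout}


class SlowCollectiveTest(ContinuousTestBase):
    # A test class that needs more headroom overrides the class attribute.
    timeout = datetime.timedelta(minutes=10)


default_kwargs = ContinuousTestBase.init_process_group_kwargs()
slow_kwargs = SlowCollectiveTest.init_process_group_kwargs()
```

A class attribute is enough here because, as the commit message notes, the timeout is fixed at process group init and shared by every test in the class.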
Will Constable	bdfeda5c9a	composability test cleanup (#145011 ) minor changes to test public PP api instead of internal/private one and also save a few lines of code for microbatch splitting in the process Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011 Approved by: https://github.com/H-Huang, https://github.com/fduwjj ghstack dependencies: #145010	2025-01-18 04:37:12 +00:00
Jason Ansel	4eea2f7496	[inductor] Fix ignored options for torch.compile (#145131 ) #139833 broke `torch.compile(options=...)` so that many (all?) options passed in get completely ignored. @alexreinking pointed this out when `options={"cpu_backend":"halide"}` did nothing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145131 Approved by: https://github.com/exclamaforte	2025-01-18 03:39:49 +00:00
Simon Fan	668fb7dfba	[ca] Use aot_eager on flex attention test (#145097 ) FIXES https://github.com/pytorch/pytorch/issues/144912 The flex attention lowering incompatibilities are covered by https://github.com/pytorch/pytorch/blob/main/test/inductor/test_flex_attention.py. For the CA + flex integration, we don't actually need to test the lowering, only the frontend graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145097 Approved by: https://github.com/drisspg	2025-01-18 02:47:13 +00:00
Sam Larsen	55084443ca	Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829 Approved by: https://github.com/Chillee	2025-01-18 02:39:22 +00:00
Nicolas Macchioni	2f51d06210	basic InductorBenchmarker (#133058 ) This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle, and very closely follows Triton's `do_bench` implementation with slight changes such as flushing the exact L2 cache size (Triton defaults to 256mb), using a buffer zero for warmup (Triton uses the benchmarked kernel itself, I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things, currently Inductor picks median). Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058 Approved by: https://github.com/eellison ghstack dependencies: #144315	2025-01-18 02:35:00 +00:00
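The difference between reporting the minimum and the median runtime, mentioned in #133058, can be shown with a CPU-only timing loop. This is a rough sketch and not how `InductorBenchmarker` works: `benchmark` is a hypothetical helper, it times with `time.perf_counter`, it warms up by calling the function itself, and it does no cache flushing.

```python
import statistics
import time


def benchmark(fn, warmup_iters=3, timed_iters=10):
    """Time `fn` and return (min_ms, median_ms). Hypothetical CPU-only helper."""
    for _ in range(warmup_iters):
        fn()  # warm-up runs are executed but not timed
    samples_ms = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # The minimum is less sensitive to one-off slow runs than the median.
    return min(samples_ms), statistics.median(samples_ms)


min_ms, median_ms = benchmark(lambda: sum(range(10_000)))
```

The minimum can never exceed the median, so switching the reported statistic from median to min can only lower the measured runtime for the same samples.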
Nicolas Macchioni	ee3e89190a	refactor benchmarking to use dynamo_timed (#144315 ) use dynamo_timed for all our wrapped calls, instead of our custom timer Pull Request resolved: https://github.com/pytorch/pytorch/pull/144315 Approved by: https://github.com/eellison	2025-01-18 02:35:00 +00:00
Aaron Orenstein	17c3a10cbd	PEP585 update - torch/_inductor/fx_passes (#145107 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145107 Approved by: https://github.com/oulgen, https://github.com/bobrenjc93	2025-01-18 02:04:29 +00:00
Huy Do	8e4539245e	Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112 ) TSIA, it looks like an upstream change, but I'm not sure from where yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112 Approved by: https://github.com/izaitsevfb, https://github.com/malfet	2025-01-18 01:55:34 +00:00
Bin Bao	0b151f260f	[AOTI] Add an option to skip optimizing generated wrapper code (#144866 ) Summary: In some cases, generated wrapper code faces a long cpp compilation time. As an alleviation, this PR adds an option to skip cpp compiler optimizers for the generated main wrapper function body. D68174038 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144866 Approved by: https://github.com/chenyang78, https://github.com/hl475	2025-01-18 01:44:21 +00:00
Jason Ansel	7c1fb9b1ae	[inductor] Refactor CachingAutotuner so that it can pickle (#144044 ) These are refactors needed for #144288 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044 Approved by: https://github.com/eellison	2025-01-18 01:44:16 +00:00
xinan.lin	02385ed625	[Break XPU][Inductor UT] Fix broken XPU CI introduced by community changes (#145058 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145058 Approved by: https://github.com/jansel	2025-01-18 01:30:24 +00:00
rzou	c434a64f31	Delete torch._library.register_functional_op (#145110 ) Fixes #117816, #117834, #117871 This has been superseded by auto_functionalized_v2. There are no internal usages and this is private API so it is safe to delete. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145110 Approved by: https://github.com/williamwen42 ghstack dependencies: #145109	2025-01-18 00:58:25 +00:00
rzou	712d9882d2	Skip test responsible for causing flakiness (#145109 ) Investigation is a separate issue. For now I want to get the CI back up and running on the other tests. The problem seems to be that IncludeDispatchKeyGuard doesn't actually reset the state, which seems very, very wrong. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145109 Approved by: https://github.com/williamwen42	2025-01-18 00:58:25 +00:00
eellison	c338dda6be	fix test_rng bisector test (#143662 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143662 Approved by: https://github.com/zou3519	2025-01-18 00:15:38 +00:00
Daniel Vega-Myhre	d02c396fbb	add fp8 support to index_cuda (#144747 ) Fixes #133605 Summary This PR adds support for FP8 data types to the `index_cuda` op. It uses `AT_DISPATCH_V2` which is a new macro that can handle arbitrary number of dtypes, as opposed to the old implementations which had a separate macro for each possible number of dtype arguments (e.g. `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3,4,5...}`). Test plan Updated test `index_cuda_with_cpu` in `test/test_fake_tensor.py` to have cases for all dtypes handled by `index_cuda`, including fp8 dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144747 Approved by: https://github.com/vkuzo	2025-01-17 22:53:23 +00:00
Nicolas Macchioni	4e4b8592a3	parametrized test name handles class arguments (#133546 ) Previously, parametrized tests with class arguments, for example ``` @parametrize("this_cls", (Foo, Bar)) ``` would create parametrized tests with names `test_foo_this_cls0` and `test_foo_this_cls1`. With this change, we instead should get `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133546 Approved by: https://github.com/eellison	2025-01-17 22:48:38 +00:00
Will Constable	64e54d5af6	[Pipelining] Relax scale_grads assert (#145010 ) The assert felt morally valid: if no gradients are scaled, then something is definitely wrong with the setup. In one instance, PP + optimizer-in-backward (in torchtitan) resulted in grad=None after running .backward() and before scaling grads. On the other hand, the existing assert is too restrictive. It's possible that a model used with pipelining would have some parameters that do not receive gradients, and we shouldn't hard-error in these cases. (E.g. if the parameter is literally not used, or is frozen). In the extreme case, the whole stage could be frozen. So we do not complain if no grads are scaled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010 Approved by: https://github.com/mori360, https://github.com/tianyu-l	2025-01-17 21:33:28 +00:00
fan.mo	07e23653cd	Fix RMSNorm epsilon value type for BF16 or FP16 (#142848 ) Fixes #140092 Here's what this PR does: Previously, we created a `scalar_t eps_val;` variable, while `eps` is usually a double scalar passed from the Python frontend, such as 1e-6. When we do `eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();` or `eps_val = eps.value();`, we downcast this epsilon to match the input tensor dtype (`scalar_t`); in the case of BFloat16, the 1e-6 double would be cast to `1.00136e-05`. However, when we evaluate `auto rqrst_input = rsqrt(at::pow(upcasted_input, 2).mean(dims_to_reduce_ref, /*keepdim=*/true).add_(eps_val));`, we upcast `eps_val` to match `opmath_t`. The round trip between these two dtypes is unnecessary, so we can simply declare `opmath_t eps_val` instead of `scalar_t`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848 Approved by: https://github.com/mikaylagawarecki	2025-01-17 21:30:54 +00:00
Joel Schlosser	a8ef423fed	Fix NJT min / max backward() for non-ragged reductions (#144583 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. `value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583 Approved by: https://github.com/soulitzer ghstack dependencies: #144582	2025-01-17 20:57:11 +00:00
Joel Schlosser	cac10b8190	Fix NJT OpInfo entry for nn.functional.prelu (#144582 ) Part of my BE project addressing NJT bugs surfaced via OpInfo tests. The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582 Approved by: https://github.com/soulitzer	2025-01-17 20:36:15 +00:00
Tom Ritchford	eaef613688	Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda (#145067 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>	2025-01-17 20:31:25 +00:00
Mikayla Gawarecki	0eda02a94c	Prevent legacy_load when weights_only=True (correctly) (#145020 ) Only prevent `legacy_load` (.tar format removed in https://github.com/pytorch/pytorch/pull/713), not the whole of `_legacy_load` (.tar format + _use_new_zipfile_serialization=False) Differential Revision: [D68301405](https://our.internmc.facebook.com/intern/diff/D68301405) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145020 Approved by: https://github.com/kit1980, https://github.com/albanD	2025-01-17 20:10:22 +00:00
Colin Peppler	2ef7b68666	[inductor] fix TORCH_LOGS="benchmarking" (#144997 ) Saw this error with TORCH_LOGS="benchmarking" ``` File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 37, in wrapper result = fn(*args, **kwargs) File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper return fn(self, *args, **kwargs) torch._inductor.exc.InductorError: TypeError: Benchmarker.benchmark() missing 1 required positional argument: 'fn_kwargs' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144997 Approved by: https://github.com/eellison, https://github.com/nmacchioni	2025-01-17 19:41:18 +00:00
Wouter Devriendt	d996d7ec13	upgrade to sccache 0.9.1 - dealing with nvcc -E correctly (#145012 ) sccache 0.9.1 should be dealing with `nvcc -E` correctly see https://github.com/mozilla/sccache/pull/2300 If this works as expected, we can get rid of this code: https://github.com/pytorch/pytorch/pull/142813/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/145012 Approved by: https://github.com/malfet	2025-01-17 19:26:01 +00:00
Tom Ritchford	46fbd63405	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-17 18:21:22 +00:00
Mingming Ding	18638b91fe	Adding more compile time logging in pad_mm (#144884 ) Summary: As title Test Plan: [midin@6262.od /data/sandcastle/boxes/fbsource/fbcode (99e64d2e4)]$ tlp buck run mode/opt caffe2/test/inductor:pad_mm -- -r test_exclude_padding https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json&local_cache_key {F1974355662} Pull Request resolved: https://github.com/pytorch/pytorch/pull/144884 Approved by: https://github.com/oulgen	2025-01-17 17:35:55 +00:00
Yidi Wu	567552b98b	fix typo in doc and import for torch._library.triton (#144882 ) Previously, the doc's suggested `from torch._library.triton import wrap_triton, triton_op` didn't work because wrap_triton was not imported in torch/_library/__init__.py, although `from torch.library import wrap_triton` worked. This PR imports wrap_triton and fixes the doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144882 Approved by: https://github.com/zou3519	2025-01-17 17:32:12 +00:00
Stonepia	18eba9575f	[Accelerator] Use uniform `GetAllocator` for devices in `new_qtensor` function (#144849 ) Fixes #144848 This PR is intended to use a uniform `GetAllocator()` call for all the accelerators for `new_qtensor` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144849 Approved by: https://github.com/guangyey, https://github.com/albanD	2025-01-17 16:37:37 +00:00
atalman	a215e174a1	[BE] Remove conda from scripts and build files Part 2 (#145015 ) Continuation of https://github.com/pytorch/pytorch/pull/144870 Remove conda logic from scripts: 1. Remove conda build from triton build script 2. Remove conda checks from setup.py 3. Remove conda from release scripts 4. Script read_conda_versions.sh is not used (checked via git grep) Related to: https://github.com/pytorch/pytorch/issues/138506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145015 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-01-17 16:26:24 +00:00
Aleksandar Samardžić	b7af210d8d	Add SM89 support for f8f8bf16_rowwise() (#144348 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144348 Approved by: https://github.com/drisspg	2025-01-17 15:12:35 +00:00
PyTorch MergeBot	f522502b97	Revert "patch for block-wise quantization + pt2e (#144492 )" This reverts commit 1d43b8150852cdfcbe754edcf027d6e25f33ac63. Reverted https://github.com/pytorch/pytorch/pull/144492 on behalf of https://github.com/albanD due to Broke a few things in trunk ([comment](https://github.com/pytorch/pytorch/pull/144492#issuecomment-2598485291))	2025-01-17 14:27:53 +00:00
Wang, Eikan	dbed747aae	Add Intel GPU specific CMake files to merge rules (#135110 ) As the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135110 Approved by: https://github.com/atalman	2025-01-17 09:44:13 +00:00
Luca Wehrstedt	a0d2c09115	Add flop formula for _scaled_mm (#144973 ) This will make it work correctly with the partitioner's AutoAC Pull Request resolved: https://github.com/pytorch/pytorch/pull/144973 Approved by: https://github.com/jeffdaily	2025-01-17 09:38:30 +00:00
Laith Sakka	96c0dbbe97	Enhance running pr time benchmarks locally experience. (#144838 ) Summary: title Test Plan: NA Differential Revision: D68195894 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838 Approved by: https://github.com/huydhn	2025-01-17 07:57:40 +00:00
ZhaoqiongZ	465a1cfe2e	update get start xpu (#143183 ) - Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) - Support vision/audio prebuilt wheels on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183 Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-17 06:31:40 +00:00
Davide Italiano	fd8e0e3e10	[mps/inductor] Introduce is_mps_backend/skip_if_mps decorators. (#145035 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145035 Approved by: https://github.com/jansel	2025-01-17 05:36:38 +00:00
PyTorch UpdateBot	cfd9cc19a3	[executorch hash update] update the pinned executorch hash (#145022 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145022 Approved by: https://github.com/pytorchbot	2025-01-17 04:51:56 +00:00
Gabriel Ferns	f13c864eda	Fuzzer Improvements (#144952 ) Added more tests and cleaned up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144952 Approved by: https://github.com/masnesral	2025-01-17 04:46:58 +00:00
Chen Lai	1d43b81508	patch for block-wise quantization + pt2e (#144492 ) Summary: As title, needed for enable qcom block-wise quantization kernel Test Plan: local test Differential Revision: D67985303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144492 Approved by: https://github.com/angelayi, https://github.com/billmguo	2025-01-17 04:10:49 +00:00
Zhenbin Lin	adbbcd87d9	OpenReg: Split Allocator (#144843 ) Split the Allocator into HostAllocator and DeviceAllocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144843 Approved by: https://github.com/albanD	2025-01-17 03:38:15 +00:00
Yanbo Liang	43a00d73b3	[Trace Python Dispatcher] Support FuncTorchInterpreter (#144444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #144439	2025-01-17 02:26:37 +00:00
Yanbo Liang	5d02575aa1	[Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439 Approved by: https://github.com/zou3519	2025-01-17 02:26:36 +00:00
William Wen	3a50aba7d3	[dynamo] add option to not skip on empty graph (#144885 ) Temporary fix to https://github.com/pytorch/pytorch/issues/144360. Turning the config on globally will cause a bunch of tests to fail, which needs to be addressed in followups. I had a previous attempt at https://github.com/pytorch/pytorch/pull/144712, but this is a more complicated change and will likely be absorbed into work to refactor Dynamo's exception handling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144885 Approved by: https://github.com/jansel	2025-01-17 02:12:20 +00:00
Marc Horowitz	7b56b039af	[dcp] Add ZStandard transformer (#143360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360 Approved by: https://github.com/saumishr ghstack dependencies: #143358, #143359	2025-01-17 01:51:37 +00:00
Marc Horowitz	9c909bf3bb	[dcp] Integrate stream extensions into DCP impl (#143359 ) Summary: Updates FileSystemReader/Writer, Planner, DefaultLoad/SavePlanner Pull Request resolved: https://github.com/pytorch/pytorch/pull/143359 Approved by: https://github.com/saumishr ghstack dependencies: #143358	2025-01-17 01:51:37 +00:00
Marc Horowitz	ba3f1c29ee	[dcp] Add extension mechanism (#143358 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358 Approved by: https://github.com/saumishr	2025-01-17 01:51:37 +00:00
Yu, Guangye	176cde6240	Use torch with statement in torch distributed module (#144951 ) # Motivation In https://github.com/pytorch/pytorch/pull/137678, we help use the device-agnostic APIs to generalize distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the with statement of `torch.Stream` once https://github.com/pytorch/pytorch/pull/140138 is landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951 Approved by: https://github.com/kwen2501, https://github.com/albanD	2025-01-17 01:49:28 +00:00
Nikita Shulga	a61a65ff82	[MPSInductor] Add `Worker.current_device` method (#145023 ) That just returns 0, as multi-gpu is not currently supported by MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/145023 Approved by: https://github.com/dcci	2025-01-17 01:41:01 +00:00
PyTorch MergeBot	55b0819bee	Revert "Add tests for different dtypes with max autotune (#144721 )" This reverts commit d2a77f48c9dc6df056051de270ce5875d8d2edd0. Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/kit1980 due to breaking internal builds, max autotune tests are failing, see D68297606 ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2597250605))	2025-01-17 01:36:14 +00:00
Andrew Gu	45e6647268	[FSDP2] Make post-backward condition more robust (#144781 ) Fixes https://github.com/pytorch/pytorch/issues/144755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144781 Approved by: https://github.com/fegin	2025-01-17 01:28:56 +00:00
Chien-Chin Huang	6077102415	[DSD][BE] Rewrite some tests to remove `with_comms` (#143241 ) Summary: This saves ~ 1 minute test time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143241 Approved by: https://github.com/mori360, https://github.com/XilunWu ghstack dependencies: #143240	2025-01-17 01:15:55 +00:00
Will Constable	5d54e7b812	[Pipelining] move scale_grads to base class, add docs (#144833 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833 Approved by: https://github.com/H-Huang	2025-01-17 01:07:12 +00:00
Driss Guessous	3afc5170d4	[Submodule] Upgrade to Cutlass 3.6 part deux (#144911 ) # Summary Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269) Main change is that we identified and fixed the FA2 regression. More details can be found here https://github.com/pytorch/pytorch/issues/144729 and have landed that before this here: [D68194635](https://www.internalfb.com/diff/D68194635) Differential Revision: D68194470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-01-17 00:53:42 +00:00
PyTorch MergeBot	6c713ccb5e	Revert "Make functionalization `ViewMeta` serializable with pickle. (#143712 )" This reverts commit b8abdaa286fd161af48af57a675827f4f849914d. Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))	2025-01-17 00:52:50 +00:00
Nikita Shulga	42c64bd35c	[MPSInductor] More is_dtype_supported gating (#144981 ) This makes `GPUTest.test_scalar_cpu_tensor_arg_mps` pass Pull Request resolved: https://github.com/pytorch/pytorch/pull/144981 Approved by: https://github.com/dcci ghstack dependencies: #144971	2025-01-17 00:48:02 +00:00
PyTorch MergeBot	94c0f15302	Revert "cpp_wrapper: Move #includes to per-device header files (#143909 )" This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125. Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch/_inductor/codegen/aoti_runtime/implementation.cpp ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))	2025-01-17 00:36:38 +00:00
PyTorch MergeBot	5e6e6200bf	Revert "[dynamo][dicts] Consolidate dict(..) construction (#144342 )" This reverts commit a54a784b8207617d2b99fbded9bb34c94fb6dd23. Reverted https://github.com/pytorch/pytorch/pull/144342 on behalf of https://github.com/kit1980 due to breaking internal builds, see D68125388 ([comment](https://github.com/pytorch/pytorch/pull/144342#issuecomment-2597184167))	2025-01-17 00:32:09 +00:00
cyy	2ea394ba29	Modernize C++ code (#144603 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603 Approved by: https://github.com/malfet	2025-01-17 00:25:18 +00:00
Laith Sakka	c3fcb3606d	Profile compile_inner instead of _compile_inner (#144930 ) Summary: title Test Plan: NA Reviewed By: jamesjwu Differential Revision: D67990492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144930 Approved by: https://github.com/jamesjwu	2025-01-16 23:59:27 +00:00
Chien-Chin Huang	573fc42f25	[BE][CP] Use run_subtests instead of parametrize (#143240 ) Summary: This provides a 15X increase in test performance speed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143240 Approved by: https://github.com/XilunWu	2025-01-16 23:55:05 +00:00
Yang Wang	fea9d18d5a	[Utilization Log] Concurrently collect aggregate data during the output interval (#143235 ) # Overview Add a worker to collect metrics in short intervals. 1. Worker: add a worker to collect usage metrics, by default every 500ms; note that this is configurable. 2. Calculate avg and max as a data point, by default every 5 seconds. # Other Clean up the log format to what is needed; currently we do not need to track GPU processors etc., or all pids from psutil. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235 Approved by: https://github.com/huydhn	2025-01-16 23:52:43 +00:00
shaoyuyoung	288d67d6c2	[inductor] [bug fix] align `avg_pool` with eager when handling `uint` (#144313 ) Fixes #144310 ~~We just need to add a check in lowering~~ updated: we add the error checking in `meta registration` ### UT ``` pytest -s -v test/inductor/test_torchinductor.py -k test_avg_pool_errors_with_uint ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144313 Approved by: https://github.com/jansel, https://github.com/jgong5	2025-01-16 23:37:51 +00:00
Gabriel Ferns	d2a77f48c9	Add tests for different dtypes with max autotune (#144721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721 Approved by: https://github.com/cpuhrsch, https://github.com/etaf	2025-01-16 23:04:56 +00:00
clr	171fb7f358	easy: Fix missing tab in test/dynamo/test_compile.py (#145013 ) It turns out that if you request a merge on a pytorch PR, and then push a fix for a bad rebase, and the test is relatively new, the merge will go through with the previous commit and not notice the test break. Explicitly running the test now passes vs failing, and this is just the last missing commit from https://github.com/pytorch/pytorch/pull/144817 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145013 Approved by: https://github.com/masnesral, https://github.com/jansel	2025-01-16 22:51:51 +00:00
Nikita Shulga	181d93b4f2	[BE] Move `is_device_supported` to helper function (#144971 ) And extend `test_inf` to check half (explicitly instead of check_lowp) and bfloat16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144971 Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel	2025-01-16 22:44:03 +00:00
PyTorch UpdateBot	a33e02cb26	[executorch hash update] update the pinned executorch hash (#144813 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144813 Approved by: https://github.com/pytorchbot, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-01-16 22:39:00 +00:00
Fuzzkatt	7c7bcb1e33	update IS_JETSON check (#144725 ) update IS_JETSON check to include the latest SM Pull Request resolved: https://github.com/pytorch/pytorch/pull/144725 Approved by: https://github.com/eqy	2025-01-16 22:34:48 +00:00
Colin L. Rice	95c363cc9b	dynamo: Don't crash with internal error if getattr on a tensor fails (#144817 ) This prevents crashes when getattr is called on a tensor for something which doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144817 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-16 22:04:06 +00:00
Mwiza Kunda	0e6d44df3f	Add heuristic to fail block pointer match early (#144681 ) This PR adds a heuristic to potentially fail the block pointer match early. Expressions like below take a long time to match using sympy (e.g. > 100 seconds) ```python # torch._inductor.config.triton.use_block_ptr = True # torch._inductor.config.triton.prefer_nd_tiling = True # Expression from pytest -k test_max_pool2d1_dynamic_shapes_cuda: ((xindex//ps1))*((s2 - 3//2))**2 + 2*((xindex//ps1))*((s2 - 3//2)) + ((xindex//ps1)) + ((s2 - 3//2))*(ModularIndexing(xindex, ps0, ps0)) + (ModularIndexing(xindex, 1, ps0)) + (ModularIndexing(xindex, ps0, ps0)) ``` Additionally, the heuristic for the number of dimensions based on the indexing expression is refined to only add dimensions for FloorDiv(index, denom) and ModularIndexing(index, denom, modulo) instead of including FloorDiv/ModularIndexing expressions that don't involve the index. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144681 Approved by: https://github.com/jansel	2025-01-16 21:57:30 +00:00
PyTorch MergeBot	46b92c025d	Revert "Cholesky mps implementation (#144193 )" This reverts commit 727ae1331820bb3d83d70e9cd3c9d3cd4c79ff56. Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see `aa4a1ff027/1` ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))	2025-01-16 21:37:32 +00:00
PyTorch MergeBot	aa4a1ff027	Revert "Prevent _legacy_load with weights_only=True (#144914 )" This reverts commit 7c3aa1da1c97812af54d41f3f0eff2ef922c0f32. Reverted https://github.com/pytorch/pytorch/pull/144914 on behalf of https://github.com/izaitsevfb due to breaking inductor on trunk ([comment](https://github.com/pytorch/pytorch/pull/144914#issuecomment-2596922781))	2025-01-16 21:29:50 +00:00
PyTorch MergeBot	4ea189422d	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))	2025-01-16 21:12:41 +00:00
garfield1997	3a5bf0bc36	expose extra torch_python apis (#144746 ) Fixes #144302 After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746 Approved by: https://github.com/albanD	2025-01-16 20:50:31 +00:00
iupaikov-amd	577708e6de	Unskipped multiple inductor tests for ROCm (#143581 ) All of them should be fine to run now after the triton fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-16 20:46:06 +00:00
CaoE	a9bfc5f70c	Fix boundary conditions for hardswish backward (#143899 ) Fixes #136345. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899 Approved by: https://github.com/jgong5, https://github.com/ezyang	2025-01-16 20:26:27 +00:00
Davide Italiano	aad5f600ff	[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 ) Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877 Approved by: https://github.com/jansel, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-16 19:51:45 +00:00
Jane Xu	3908be676c	Fix loading older state_dict into AdamW after refactor (#144972 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144972 Approved by: https://github.com/albanD	2025-01-16 19:50:31 +00:00
Yukio Siraichi	b8abdaa286	Make functionalization `ViewMeta` serializable with pickle. (#143712 ) Fix: #141974 This PR makes the `ViewMeta` sequence, present in functional tensors, serializable with pickle. In order to accomplish that, it makes `ViewMeta` an abstract class with overridable `forward` and `reverse` functions. In this context, each operation that once instantiated `ViewMeta` should now create a new specialized class that inherits from `ViewMeta`. Therefore, this PR also uses codegen for creating these specializations. In summary, these are the changes this PR introduces: - `ViewMeta` is turned into an abstract class (see _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual functions that need to be implemented. `to_out_index` should be implemented by operations that might return more than 1 output. - New `ViewMeta` specializations for `resize_` and `_unsafe_view` are created (see _FunctionalizeFallbackKernel.h_). - New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the declaration and definition of the `ViewMeta` specializations, which are automatically generated in the ATen codegen (see _gen.py_). - New `_functionalization` Python sub-module is created (see _Module.cpp_). It serves as namespace for the `ViewMeta` specializations and `InverseReturnMode` enum. - New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds the automatically generated Python bindings for the `ViewMeta` specialization, which are generated in the torch codegen (see _generate_code.py_). Note that this PR makes use of codegen at 2 different moments: - ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes. - Torch codegen (_generate_code.py_): generates the Python bindings for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712 Approved by: https://github.com/bdhirsh	2025-01-16 19:41:41 +00:00
Mikayla Gawarecki	7c3aa1da1c	Prevent _legacy_load with weights_only=True (#144914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144914 Approved by: https://github.com/malfet, https://github.com/albanD	2025-01-16 19:33:46 +00:00
Huy Do	cf28d613f1	Allow ROCm runner to upload benchmark results if found (#144710 ) https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database. This will unblock AMD when they try to run MI300 benchmarks on CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144710 Approved by: https://github.com/kit1980	2025-01-16 19:31:45 +00:00
Natalia Gimelshein	31a73eb712	fix acquire pattern in topk (#144945 ) Similar to #128455, topk needs another threadfence to complete acquire pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144945 Approved by: https://github.com/Skylion007	2025-01-16 19:20:43 +00:00
Yanbo Liang	3004b657f0	[Inductor][FlexAttention] Supports dynamic shapes with custom kernel options (#144938 ) Fixes #144815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144938 Approved by: https://github.com/drisspg	2025-01-16 19:02:35 +00:00
Jane Xu	e32d2bf853	Document decoupled_weight_decay for Adam for consistency with N/RAdam (#144984 ) Followup from #144972 and #143710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144984 Approved by: https://github.com/albanD	2025-01-16 18:58:29 +00:00
Nikita Shulga	ad15436db6	Fix `pt2-bug-report.yml` formatting (#144987 ) This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574 Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"` Before it printed ``` % python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])" {'type': 'markdown', 'attributes': {'value': ''}} ``` After ``` % python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])" {'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}} ``` Fixes https://github.com/pytorch/pytorch/issues/144970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2025-01-16 18:58:07 +00:00
PyTorch MergeBot	829c4570ca	Revert "[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 )" This reverts commit 1b34665767fcc35ae4a8f211945a24701c79df79. Reverted https://github.com/pytorch/pytorch/pull/144877 on behalf of https://github.com/malfet due to Actually no, lint is red ([comment](https://github.com/pytorch/pytorch/pull/144877#issuecomment-2596385712))	2025-01-16 18:10:37 +00:00
Tom Ritchford	13d35ea67a	[BE] Add missing throw of `std::runtime_error` in scrc/cuda/utils.cpp (#144962 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962 Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet	2025-01-16 17:35:39 +00:00
Zhengxu Chen	53256edff9	[export] Support module inputs for non strict mode. (#143925 ) Summary: Add experimental support for torch.nn.Module as input types. Before this change, we don't support module inputs but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where we directly pass in a module input for different variants of the same models. Since we don't really care about non-param or non-buffer states in non strict mode, we don't care about those either and pretend they are like plain constants during tracing. We treat any module input like a nested container of tensor, and each time we will automatically register a pytree handler for these module types to flatten its state dict into a group of tensors. We will just inline any module method call during tracing like we did for `self` module in export_for_training. This will make input modules' behavior very similar to the training module in typical case, except that we don't record the inputs as parameter or buffers but rather just plain user inputs. Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input Differential Revision: D67680827 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925 Approved by: https://github.com/tugsbayasgalan	2025-01-16 17:30:36 +00:00
atalman	519269a415	[BE] - Remove conda test and upload scripts and env variables from Workflows Part 1 (#144870 ) Remove conda test and upload scripts and env variables from Workflows Related to: https://github.com/pytorch/pytorch/issues/138506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144870 Approved by: https://github.com/malfet	2025-01-16 17:20:14 +00:00
Isalia20	727ae13318	Cholesky mps implementation (#144193 ) Requested in #77764 PR is still in draft because it needs some cleanups and optimizations to at least reach CPU performance. Tasks: - [x] Make `upper=True` work, only `upper=False` works now - [x] Code cleanup - [x] Optimizations (though might need some help on this) (tried my best, maybe there is still some more to squeeze out) - [x] Checks for positive definite input - [x] Support for (*, N, N) input, currently only supports (B, N, N) input - [x] Support other dtypes (float16, bfloat16) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-16 16:26:46 +00:00
Davide Italiano	1b34665767	[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877 ) Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877 Approved by: https://github.com/jansel, https://github.com/malfet	2025-01-16 16:22:06 +00:00
Zhengxu Chen	3d29de3ac8	[aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [1/n] (#144709 ) Summary: According to angelayi, these two flags indicated different things when we had two-pass codegen, but since we now keep the two flags identical, we should merge them. This can prevent some bugs (e.g. we change the value of aot_mode, which will not cover branches like `if V.aot_compilation is True`) from happening when we try to add different code paths that tweak the value of aot_mode in the future. Test Plan: CI Differential Revision: D68122536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144709 Approved by: https://github.com/angelayi, https://github.com/desertfire	2025-01-16 16:02:18 +00:00
Scott Wolchok	241a8a101b	Fix erroneous at_vreinterpretq_u16_bf16 call (#144883 ) Here, `mask` is definitely a `uint16x8_t`, not an `at_bfloat16x8_t`, so we shouldn't be reinterpreting it. Candidate fix for #144818 . Differential Revision: [D68224128](https://our.internmc.facebook.com/intern/diff/D68224128/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144883 Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/malfet	2025-01-16 15:16:28 +00:00
PyTorch MergeBot	6559374494	Revert "Add flop formula for _scaled_mm (#144872 )" This reverts commit f31452268bf9f7e395f263cd8a9d693633ea75ce. Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))	2025-01-16 15:16:18 +00:00
Yutao Xu	6470b0ea6f	Update torch-xpu-ops commit pin (#144739 ) Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](`22cc419e4e`), which includes: - Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279 - Aten operator coverage improvement Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739 Approved by: https://github.com/EikanWang, https://github.com/malfet	2025-01-16 15:12:37 +00:00
Luca Wehrstedt	f31452268b	Add flop formula for _scaled_mm (#144872 ) This will make it work correctly with the partitioner's AutoAC Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872 Approved by: https://github.com/vkuzo	2025-01-16 13:57:54 +00:00
PyTorch MergeBot	1c290912e4	Revert "Add tests for different dtypes with max autotune (#144721 )" This reverts commit 9e568cbaa22df89b77e112f1a373d82acb2e6219. Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2595210355))	2025-01-16 10:59:30 +00:00
Shunting Zhang	0c0583254e	[inductor] fix index.Tensor fallback (#144736 ) The original issue is we see accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/). The debugging is hard but the root cause is relatively simple. The root cause is that the model has mix-device inputs for index.Tensor which causes Inductor to fallback. And the meta kernel for index.Tensor returns a tensor with inconsistent strides to the eager kernel. The following code snippet ``` import torch from torch._subclasses import FakeTensorMode device = "cuda" x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last) x = x.view(2, 12, 16, 32, 32) i1 = torch.arange(2).unsqueeze(-1) i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3] print(f"Eager stride: {x[i1, i2].stride()}") mode = FakeTensorMode() with mode: f_x = mode.from_tensor(x) f_i1 = mode.from_tensor(i1) f_i2 = mode.from_tensor(i2) f_out = f_x[f_i1, f_i2] print(f"Meta stride: {f_out.stride()}") ``` would output: ``` Eager stride: (49152, 16384, 1, 512, 16) Meta stride: (49152, 16384, 1024, 32, 1) ``` In this PR, I fix the problem to run eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change meta/eager kernel implementation so that their output layout matches. But I'm not sure how to properly do that. In the index.Tensor meta kernel, we always produce dense output: `6d56277682/torch/_meta_registrations.py (L3184)` . While the eager kernel seems to leverage TensorIteratorBase to decide some dimension permutation: `6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308)` . We can duplicate this logic to the meta kernel implementation if we really want meta matches eager. I can follow up on this if people have strong opinion to do this. And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. 
With that, the issue debugged here would be much easier to root cause. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736 Approved by: https://github.com/jansel	2025-01-16 09:38:29 +00:00
Huy Do	57d5659c3b	XFAIL test_save_load_checkpoint (#144927 ) Fixes https://github.com/pytorch/pytorch/issues/137771 The issue keeps showing up and rerun disable tests couldn't reproduce the issue. So, XFAIL it while waiting for distributed team to investigate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144927 Approved by: https://github.com/kit1980, https://github.com/malfet	2025-01-16 07:31:56 +00:00
Will Constable	7d8c087e24	[Pipelining] Improve shape inference debug logging (#144929 ) Remove log that just said "running forward" since that is not so useful in itself, replace with somewhat equivalent log that reports both input and output shapes after running forward. Note: enabled by `TORCH_LOGS=+pp` Example: ``` [rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929 Approved by: https://github.com/H-Huang	2025-01-16 07:30:11 +00:00
Natalia Gimelshein	0b17c09893	restore rng generation for fbcode (#144819 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819 Approved by: https://github.com/malfet, https://github.com/kit1980	2025-01-16 06:46:26 +00:00
Mario Vasilev	49bdc418be	Add strict kwarg to `nn.Module.set_submodule` and fix bug for non dot delineated strings (#143455 ) Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix it will not create a new attribute if target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`. Fixes https://github.com/pytorch/pytorch/issues/143441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455 Approved by: https://github.com/mikaylagawarecki	2025-01-16 05:06:33 +00:00
fduwjj	e3c4d1b7d6	[c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916 ) When we introduced partial match, we accidentally introduced the mismatch mark for the full-match case. This is wrong, and this PR fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916 Approved by: https://github.com/c-p-i-o	2025-01-16 04:36:30 +00:00
Gabriel Ferns	9e568cbaa2	Add tests for different dtypes with max autotune (#144721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721 Approved by: https://github.com/cpuhrsch, https://github.com/etaf	2025-01-16 04:29:44 +00:00
Zhenbin Lin	52a620845b	OpenReg: Use device agnostic API (#144840 ) Use `torch.accelerator.device_count()` to get the number of devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144840 Approved by: https://github.com/albanD	2025-01-16 03:31:52 +00:00
Xia, Weiwen	1230de4c1b	[Quant][Inductor][X86] Separate binary post op fusion and lowering for qconv (#144318 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves binary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise` 2. Fuse `onednn.qconv2d_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144318 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #144224, #144312	2025-01-16 03:30:36 +00:00
cyy	843627b7b1	Remove unnecessary once flag usage (#143255 ) Static variables in C++11 are guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration) ``` If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once). Usual implementations of this feature use variants of the double-checked locking pattern, which reduces runtime overhead for already-initialized local statics to a single non-atomic boolean comparison. ``` Given that static c10::once_flag is used before, why not just use the associated function to initialise the related static variables? That is the motivation behind this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255 Approved by: https://github.com/albanD	2025-01-16 02:36:11 +00:00
Nikita Shulga	41ec2e8d3e	[MPSInductor] Fix codegen regression (#144924 ) Caused by https://github.com/pytorch/pytorch/pull/144649 Do not try to insert anything into the header if wrapper is not ready yet Fixes `test_sort_mps` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144924 Approved by: https://github.com/dcci ghstack dependencies: #144827, #144917	2025-01-16 02:12:42 +00:00
Nikita Shulga	05505771a0	[MPSInductor] Properly convert index (#144917 ) By calling `self.index_to_str` from `load`,`store` and `check_bounds` in order to properly handle sizevars variables renames Pull Request resolved: https://github.com/pytorch/pytorch/pull/144917 Approved by: https://github.com/dcci ghstack dependencies: #144827	2025-01-16 02:12:41 +00:00
PyTorch MergeBot	d595b96059	Revert "restore rng generation for fbcode (#144819 )" This reverts commit 2bc18a905544f4e25cfbd354351418b36a0f5fc1. Reverted https://github.com/pytorch/pytorch/pull/144819 on behalf of https://github.com/ngimel due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/144819#issuecomment-2594298941))	2025-01-16 01:52:29 +00:00
Colin L. Rice	6492851125	symbolic_convert: Don't fail when we hit a undefined name (#144784 ) We're using a Python builtin NameError here, instead of throwing an Unsupported exception. This causes the NameError to get wrapped in an InternalTorchDynamoError instead of just causing a graph break and letting the user code fail directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144784 Approved by: https://github.com/williamwen42, https://github.com/jansel	2025-01-16 01:47:48 +00:00
Driss Guessous	c8bcb22e5f	Default Copies are not vectorized in v3.6.0 of cutlass (#144837 ) Summary: FlashAttentionV2 perf tanked in v3.6.0; see https://github.com/pytorch/pytorch/issues/144729 for more details. This PR makes it possible to land the v3.6.0 update and fixes the perf regression. See https://github.com/pytorch/pytorch/issues/144729#issuecomment-2591644076 for analysis; we also have various internal tests to verify. Differential Revision: D68194635 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144837 Approved by: https://github.com/Skylion007, https://github.com/eqy	2025-01-16 01:12:46 +00:00
Colin L. Rice	926f9056a9	speculation_log: Raise a unique error for divergence issues (#144785 ) This is primarily sent for discussion and to see what tests fail due to this. The idea is that rather than capturing this as a regex on the fail_reason, just give it a unique failure type Pull Request resolved: https://github.com/pytorch/pytorch/pull/144785 Approved by: https://github.com/ezyang	2025-01-16 00:49:43 +00:00
David Berard	b90231a189	[inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807 ) motivation: Ed's advice to avoid `except ImportError` (i.e. based on the fact that your target module/class might in fact exist, but you might run into some different ImportError whose stacktrace you now ignore). additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern: ``` try: ... except ImportError: try: ... except ImportError: try: ... ``` suggestions on better ways to do this would be appreciated! test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in june, when AttrsDescriptor was still in triton.compiler.compiler) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807 Approved by: https://github.com/ezyang	2025-01-16 00:32:29 +00:00
cyy	ee97d80be2	Apply Ruff fixes and pyupgrade to torch/jit (#144208 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144208 Approved by: https://github.com/davidberard98	2025-01-16 00:28:50 +00:00
Pian Pawakapan	774f21a370	[export] handle buffer/input mutations for joint-graph (#144806 ) Summary: previous construction of GraphSignature output specs didn't consider buffer/user input mutations Test Plan: test_experimental Differential Revision: D68177409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144806 Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri	2025-01-16 00:22:16 +00:00
Brian Hirsh	d7f45fc575	dynamic shape support for interpolate(antialias=True) backward (#141198 ) Fixes https://github.com/pytorch/pytorch/issues/141187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198 Approved by: https://github.com/ezyang, https://github.com/Chillee ghstack dependencies: #141161	2025-01-16 00:08:25 +00:00
Brian Hirsh	4831f89790	support numbers as tensors for aten.copy(Tensor, Tensor) (#141161 ) Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161 Approved by: https://github.com/ezyang	2025-01-16 00:08:25 +00:00
Xu Han	2645fc45b1	export AOTI_TORCH_EXPORT on Windows. (#140030 ) Fixes #139954 reproduce UT: ```cmd pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu ``` Issue: <img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe"> After fixing: ![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a) Reland: 1. Declare export on Windows explicitly. 2. Support cpu, cuda and xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030 Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-15 23:43:41 +00:00
Justin Chu	fb4b5a9299	[ONNX] Use python_dispatcher in type promotion (#144801 ) Fix #143118 Use python_dispatcher in the type promotion pass to preserve symbolic shapes according to @angelayi 's suggestions. (Thanks!) Tested locally. I wasn't able to create a minimal repro except for using the full model Pull Request resolved: https://github.com/pytorch/pytorch/pull/144801 Approved by: https://github.com/titaiwangms	2025-01-15 23:25:19 +00:00
Annop Wongwathanarat	7265dc0622	Enable s8s8s8 for qlinear with mkl-dnn (#139887 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887 Approved by: https://github.com/huydhn	2025-01-15 23:20:10 +00:00
Natalia Gimelshein	4e1834f5f3	use cooperative schedule in scaled_mm for fast_accum=false (#144809 ) This improves perf for large matrices by more than 2x, more detailed benchmark coming. On master ![image](https://github.com/user-attachments/assets/fc6a0987-5b82-475d-a2ff-b46641bb17dc) On this branch <img width="601" alt="image" src="https://github.com/user-attachments/assets/7f55152b-1110-45e4-b2ea-6f274d543869" /> A plot similar to https://github.com/pytorch/ao/pull/1325#discussion_r1868193786 <details> <summary>Benchmarking code:</summary> ```python import torch from triton.testing import do_bench import itertools def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False): return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16) def fn_aten(a, b, scale, use_fast_accum=False): return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16) for i,j,k in itertools.product(range(9, 15), range(9, 15), range(9, 15)): m = 2**i n = 2**j k = 2**k a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn) b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn) scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32) scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32) scale_0 = torch.randn((), device="cuda", dtype=torch.float32) ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50) ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50) ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50) ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50) print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow /ms_tensor_fast}, 
ratio_rw={ms_rowwise_slow / ms_rowwise_fast}") ``` </details> Higher N/K values still have about 40% penalty, perhaps some additional heuristics tweaks would be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144809 Approved by: https://github.com/drisspg	2025-01-15 23:04:14 +00:00
PyTorch MergeBot	0f051eaf66	Revert "Fix global namespace pollution in ATen/Dispatch.h (#138626 )" This reverts commit 326c7cae28783f29c577b5a5d3ac38a3b61188bd. Reverted https://github.com/pytorch/pytorch/pull/138626 on behalf of https://github.com/malfet due to This broke inductor tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](https://github.com/pytorch/pytorch/pull/138626#issuecomment-2594021436))	2025-01-15 21:59:04 +00:00
Sam	c7b2f7dd14	Add generator parameter to rand*_like functions (#136780 ) Fixes #128786 Fixes #101974 Fixes #27072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780 Approved by: https://github.com/Chillee, https://github.com/ezyang	2025-01-15 21:16:52 +00:00
Benjamin Glass	d62b3979da	cpp_wrapper: Move #includes to per-device header files (#143909 ) This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time. Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909 Approved by: https://github.com/desertfire	2025-01-15 21:14:02 +00:00
Huy Do	05095a45f2	Fix the wrong artifact in remaining workflows (#144812 ) I missed them in https://github.com/pytorch/pytorch/pull/144694 as they weren't run often. But they are still failing nonetheless, i.e. https://github.com/pytorch/pytorch/actions/runs/12762640334/job/35578870178 The issue was from https://github.com/pytorch/pytorch/pull/125401 where it added `use-gha: ${{ inputs.use-gha }}` to linux_test workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144812 Approved by: https://github.com/clee2000	2025-01-15 20:36:40 +00:00
Colin L. Rice	b88dcb4835	dynamo: Don't crash when tracing a missing attr on a constant. (#144593 ) dynamo: Don't crash when tracing a missing attr on a constant. This throws an InternalTorchDynamoError: AttributeError: 'NoneType' object has no attribute 'max' instead of just skipping the bad call when tracing, and throwing a normal AttributeError instead. There are two questions that I would love reviewer comments on. 1) Is throwing unimplemented the right thing here? Or should I throw something like ObservedAttributeError? 2) Do we need to worry about performance with this code? In particular, should we just catch the exception? Or maybe cache the lookup result? Pull Request resolved: https://github.com/pytorch/pytorch/pull/144593 Approved by: https://github.com/jansel	2025-01-15 20:23:43 +00:00
Avik Chaudhuri	d812fdd490	fix as_bool serde (#144791 ) Differential Revision: [D68167701](https://our.internmc.facebook.com/intern/diff/D68167701/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144791 Approved by: https://github.com/pianpwk	2025-01-15 20:22:26 +00:00
Nikita Shulga	904641769e	[MPSInductor] Implement `pow()` (#144827 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144827 Approved by: https://github.com/dcci, https://github.com/jansel	2025-01-15 20:11:34 +00:00
Runming Lu	b410378d93	Register nonzero for meta device for FBLSim (#144727 ) Summary: Fix `nonzero is not registered to meta` issue: ``` "NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered". ``` Reviewed By: ezyang Differential Revision: D66525640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727 Approved by: https://github.com/ezyang	2025-01-15 19:40:42 +00:00
Zhengxu Chen	834086c023	[export] Load side info about pos/kw argument kind for serialization. (#144686 ) Summary: Fixing issue of nodes like ``` torch.ops.aten.linear.default(x, w, b) ``` being deserialized as ``` torch.ops.aten.linear.default(x, w, bias=b) ``` which breaks roundtripping. Test Plan: buck test mode/opt caffe2/test:test_export -- -r TestDeserialize buck test mode/opt caffe2/test:test_export -- -r TestSerialize Differential Revision: D67991410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686 Approved by: https://github.com/angelayi	2025-01-15 19:08:38 +00:00
Simon Fan	898a90c6bb	[dynamo][hop] Introduce FlexAttentionBackwardHighOrderVariable (#144533 ) FIXES https://github.com/pytorch/pytorch/issues/143180 This PR adds a new variable mapping to SourcelessBuilder to represent the flex attention intermediates. The variable proxies a call to HOP, and carryovers the graph state (subgraphs represented as UnspecializedNNModuleVariable) to the dynamo output graph. This is safe to do because the nn modules used in flex attention have either been speculated on before, or are outputs of make_fx of the forward. tlparse of `TestCompiledAutograd.test_flex_attention`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpiWendk/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ```python class GraphModule(torch.nn.Module): def forward(self, L_inputs_ : list): ... # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1) ... 
fw_graph0_0 = self.fw_graph0_0 joint_graph0_0 = self.joint_graph0_0 mask_graph0_0 = self.mask_graph0_0 flex_attention_backward = torch.ops.higher_order.flex_attention_backward(aot0_primals_1, aot0_primals_1, aot0_primals_1, aot0_detach_3, aot0_detach_5, aot0_expand_5, aot0_zeros_1, fw_graph0_0, joint_graph0_0, (1, 1, aot0_ones, aot0_zeros, None, None, aot0__to_copy_1, aot0__to_copy_2, None, None, 1073741824, 1073741824, mask_graph0_0), 0.125, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True}, (), ()); aot0_primals_1 = aot0_detach_3 = aot0_detach_5 = aot0_expand_5 = aot0_zeros_1 = fw_graph0_0 = joint_graph0_0 = aot0_ones = aot0_zeros = aot0__to_copy_1 = aot0__to_copy_2 = mask_graph0_0 = None aot0_getitem_4: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[0] aot0_getitem_5: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[1] aot0_getitem_6: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[2]; flex_attention_backward = None ... 
class fw_graph0_0(torch.nn.Module): def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0"): return arg0_1 class joint_graph0_0(torch.nn.Module): def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0", arg5_1: "bf16[][]cuda:0"): return [arg5_1, None, None, None, None] class mask_graph0_0(torch.nn.Module): def forward(self, arg0_1: "i32[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0"): # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1) new_ones: "b8[][]cuda:0" = torch.ops.aten.new_ones.default(arg0_1, [], dtype = torch.bool, device = device(type='cuda', index=0), pin_memory = False); arg0_1 = None return new_ones ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144533 Approved by: https://github.com/zou3519	2025-01-15 18:40:57 +00:00
eqy	a6763b7b81	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-15 18:37:55 +00:00
Jeff Daily	6ac0616504	[ROCm] hipblaslt rowwise f8 gemm (#144432 ) hipblaslt added rowwise f8 gemm support. Integrate with scaled_mm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432 Approved by: https://github.com/drisspg	2025-01-15 18:23:44 +00:00
Boyuan Feng	069419569d	[PagedAttention] Support different input position for each batch index (#144693 ) In LLM inference, each request usually has different prefill length, leading to different input position for each batch index. This PR adds such support for paged attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144693 Approved by: https://github.com/drisspg	2025-01-15 18:03:52 +00:00
Boyuan Feng	7e80758efc	[CUDAGraph][Docs] add `cuda` to `torch.randn` (#144793 ) Previous doc example created `torch.randn` tensor on cpu so CUDAGraph was skipped. Fixes #144386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793 Approved by: https://github.com/eellison	2025-01-15 18:02:10 +00:00
Edward Z. Yang	ee8f833d13	Undo leading underscore on ctx for breakpoint (#144864 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144864 Approved by: https://github.com/Skylion007	2025-01-15 18:00:58 +00:00
PyTorch MergeBot	443de667b1	Revert "Enable s8s8s8 for qlinear with mkl-dnn (#139887 )" This reverts commit dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d. Reverted https://github.com/pytorch/pytorch/pull/139887 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to have broken trunk. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/12788709683/job/35651699934) [HUD commit link](`dc8692b0eb`) ([comment](https://github.com/pytorch/pytorch/pull/139887#issuecomment-2593597977))	2025-01-15 17:58:33 +00:00
Sahan Paliskara	d065e8a9de	[ez] add lint commits to .git-blame-ignore-revs (#144576 ) Test Plan: Ran git blame on .lintrunner.toml and github's linter (+ manual testing) shows all commits exist Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576 Approved by: https://github.com/janeyx99	2025-01-15 17:39:29 +00:00
wizzniu	c07dc64017	Update pin memory related APIs to not pass 'device' argument (#131858 ) Based on https://github.com/pytorch/pytorch/pull/126376, this PR tries to update all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument. As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument, but passing `device` is discouraged. If not given, the default `device` is still 'cuda' for BC. Additionally, based on device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective. If not given, the default `device` will be the current accelerator. Fixes #124908 Relates https://github.com/pytorch/pytorch/pull/126376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-01-15 17:23:35 +00:00
Catherine Lee	0dca756832	Revert "Upload METADATA file with whl binaries (#143677 )" (#144706 ) This reverts commit 3eb3f4ed5580010a7961d996ccc6ee19c7ccbb5e. Also reverts https://github.com/pytorch/pytorch/pull/144164 Manual revert because the above causes merge conflicts Reverting in favor of https://github.com/pytorch/test-infra/pull/6159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144706 Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet	2025-01-15 17:20:21 +00:00
Aaron Orenstein	d782e46a36	[BE] typing for decorators - library (#138969 ) Test Plan: unit tests Differential Revision: D62302678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138969 Approved by: https://github.com/zou3519	2025-01-15 17:08:55 +00:00
Kaustubh Vartak	c7a9599100	Handle meta tensors in FX quantization (#144726 ) Summary: D66895899 got reverted in D67565250 because of pytorch OSS linter failure. Adding back with the format the linter suggested https://github.com/pytorch/pytorch/actions/runs/12443655335/job/34743090791 Test Plan: buck run fbcode//mode/dev-nosan fbcode//torchrec/fb/quant/tests:test_embedding_modules Reviewed By: emlin Differential Revision: D68132568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144726 Approved by: https://github.com/iamzainhuda, https://github.com/janeyx99	2025-01-15 16:49:43 +00:00
Natalia Gimelshein	2bc18a9055	restore rng generation for fbcode (#144819 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819 Approved by: https://github.com/malfet, https://github.com/kit1980	2025-01-15 16:34:25 +00:00
PyTorch MergeBot	154185dcd0	Revert "Removed unused _RequiredParameter (#144771 )" This reverts commit 6a5f895e549665a6895c84881a35736677071048. Reverted https://github.com/pytorch/pytorch/pull/144771 on behalf of https://github.com/malfet due to It broke number of cpuinductor tests ([comment](https://github.com/pytorch/pytorch/pull/144771#issuecomment-2593293542))	2025-01-15 15:51:33 +00:00
dilililiwhy	7c52c97a65	Expose several APIs to public (torch python APIs) (#144525 ) Fixes #144302 Try to expose several APIs to public for privateuse1 scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-01-15 14:34:45 +00:00
Annop Wongwathanarat	dc8692b0eb	Enable s8s8s8 for qlinear with mkl-dnn (#139887 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/ng-05, https://github.com/digantdesai	2025-01-15 12:51:21 +00:00
Sujoy Saraswati	7e1c1e65eb	Graph freezing preparation for non-Inductor backends (#139902 ) Enable preparing module named parameters and buffers in tracing context for non-Inductor backends to implement graph freezing. Fixes #139272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139902 Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/gujinghui	2025-01-15 11:25:04 +00:00
Laith Sakka	62ce3e6e84	refresh benchmarks results after recent regressions (#143075 ) refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075 Approved by: https://github.com/bobrenjc93, https://github.com/huydhn	2025-01-15 09:11:57 +00:00
Edward Z. Yang	e263f0af23	[BE] Make a SymbolInfo NamedTuple (#144745 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144745 Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007	2025-01-15 08:59:27 +00:00
Taher	d9d7cca009	make eval_frame safe (#141357 ) Fixes #108942 this PR converts eval_frame.c's static extension types to heap types, making it thread and sub-interpreter safe. the current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR todo / suggestions: 1. uplift `eval_frame_callback_key` to module state 2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init` 3. define configurations for module allowing sub-interpreters or not ```c static int module_exec(PyObject *m) {} static PyModuleDef_Slot module_slots[] = { {Py_mod_exec, module_exec}, {0, NULL} }; static struct PyModuleDef module = { PyModuleDef_HEAD_INIT, .... .m_slots = module_slots }; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357 Approved by: https://github.com/jansel Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2025-01-15 07:37:50 +00:00
Xiaodong Wang	6ba53a5f1c	[AMD] De-noise tf32 warnings (#144797 ) Summary: This is way too noisy especially during unit tests. So just log once. Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ). Differential Revision: D68167633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144797 Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu	2025-01-15 07:10:38 +00:00
Scott Wolchok	69b883d7ac	Remove C10_EMBEDDED (#144808 ) I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are multiple implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808 Approved by: https://github.com/janeyx99, https://github.com/malfet	2025-01-15 06:08:53 +00:00
Sam Larsen	b801210035	Restore support for other types of async_compile pools (spawn, fork) (#144491 ) Summary: https://github.com/pytorch/pytorch/pull/142001 removed support for process pools other than "subprocess", but some OSS users still find it useful; put it back. Test Plan: New unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/144491 Approved by: https://github.com/jansel, https://github.com/haifeng-jin	2025-01-15 06:04:49 +00:00
Arnie Yuan	326c7cae28	Fix global namespace pollution in ATen/Dispatch.h (#138626 ) Summary: Was it a typo? Since we already have `at::detail::record_kernel_function_dtype()` in `ATen/Dispatch.h` Test Plan: just build Differential Revision: D64642080 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138626 Approved by: https://github.com/malfet	2025-01-15 05:43:54 +00:00
James Wu	7d71ddbe5d	Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802 ) Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/) This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return). A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as: - It's functional (i.e. does not access global state); - or its value is constant folded away (and guarded against by dynamo) The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache. That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache. The list of safe handlers there either: - Don't access global state; - Guard on global state; or - Always returns a constant that never changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802 Approved by: https://github.com/bdhirsh	2025-01-15 05:41:36 +00:00
Howard Huang	79312ddb73	[PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702 ) There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation it makes sense to check that the number of stages should be >= number of microbatches otherwise there will be an even larger bubble. This can be removed when we have the single stage schedules to use an IR and updated to run with schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702 Approved by: https://github.com/kwen2501	2025-01-15 05:35:29 +00:00
fduwjj	ae7df51232	[c10d] Fix CudaEventCache for dangling references (#144496 ) As reported in https://github.com/pytorch/pytorch/issues/143470, we have dangling references in `CudaEventCache`, so we want to fix them. 1. We add a unit test to repro the problem described in the issue. 2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent gets deleted. (thanks for the suggestion from @kwen2501 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496 Approved by: https://github.com/kwen2501	2025-01-15 05:11:48 +00:00
Simon Fan	9cd6f46130	[ca] raise error message on AOT Autograd caching (#144595 ) FIXES https://github.com/pytorch/pytorch/issues/144175, bandaid Pull Request resolved: https://github.com/pytorch/pytorch/pull/144595 Approved by: https://github.com/bdhirsh	2025-01-15 05:09:42 +00:00
fduwjj	e0bbff6019	[c10d][ez] Add comments to the end of Macro for better readability (#144789 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789 Approved by: https://github.com/c-p-i-o	2025-01-15 05:06:41 +00:00
Nikita Shulga	d2ca8163c0	[MPSInductor] Support `abs` in MetalPrintExpr (#144826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144826 Approved by: https://github.com/dcci ghstack dependencies: #144509, #144798, #144795, #144796	2025-01-15 05:01:25 +00:00
Nikita Shulga	9610a22e94	Fix FakeTensor device creation for MPS (#144796 ) By promoting torch.device("mps") to `torch.device("mps:0")`, but skipping `is_initialized` check, as MPS does not really support multi-GPU right now This fixes `GPUTests.test_remove_no_ops_mps` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144796 Approved by: https://github.com/ezyang ghstack dependencies: #144509, #144798, #144795	2025-01-15 05:01:25 +00:00
Nikita Shulga	18786c65e5	[BE] Extend `test_remove_no_ops` (#144795 ) ---- - Use `is_dtype_supported` to skip dtype promotions portion of the test on unsupported device - Extend it to use `torch.float16` so promotions could be checked there - Implement `CpuInterface.is_bfloat16_supported` that returns true (which looks like the case, even if it's supported via emulation) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795 Approved by: https://github.com/Skylion007 ghstack dependencies: #144509, #144798	2025-01-15 05:00:26 +00:00
Riley Dulin	48f7e7c378	[torch][ao][EASY] Change print to log in numeric debugger to avoid large output (#144790 ) Summary: This print statement was spewing a bunch of data in logs by default, but it should be silenceable. Use `log.debug` instead. Differential Revision: D68166823 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144790 Approved by: https://github.com/tarun292	2025-01-15 04:58:56 +00:00
Piergiacomo De Marchi	6a5f895e54	Removed unused _RequiredParameter (#144771 ) As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used. The `required` object was initially introduced in this [PR](`4db6667923`) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`. Everything not included in this PR is Not a Contribution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771 Approved by: https://github.com/janeyx99	2025-01-15 04:11:17 +00:00
cyy	d87aad6877	[5/N] Apply Ruff fixes and pyupgrade to Python 3.9 (#144205 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144205 Approved by: https://github.com/albanD	2025-01-15 04:00:47 +00:00
Driss Guessous	db787181b5	Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738 ) Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729 Test Plan: sand castle Differential Revision: D68137326 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738 Approved by: https://github.com/huydhn	2025-01-15 02:57:14 +00:00
Nikita Shulga	e2251fffbb	[MPSInductor] Add `min`/`max` to MetalExprPrinter (#144798 ) After that `GPUTests::test_avg_pool2d8_mps` and `GPUTests::test_avg_pool2d5_mps` passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/144798 Approved by: https://github.com/dcci ghstack dependencies: #144509	2025-01-15 01:43:42 +00:00
Xia, Weiwen	9199c79a9c	[Quant][Inductor][X86] Separate unary post op fusion and lowering for qconv (#144312 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves unary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise` 2. Fuse `onednn.qconv2d_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144312 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #144224	2025-01-15 00:50:54 +00:00
Tugsbayasgalan Manlaibaatar	825fe15024	EZ fix to make sure local pytest run succeeds in export (#144764 ) Previously run_tests() was protected under IS_FBCODE flag so that following works: ``` python test/export/test_export_legacy.py ``` But it fails on: ``` pytest test/export/test_export_legacy.py ``` This is because pytest doesn't seem to get triggered through run_tests(). Differential Revision: [D68152737](https://our.internmc.facebook.com/intern/diff/D68152737) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144764 Approved by: https://github.com/avikchaudhuri	2025-01-15 00:43:40 +00:00
Henry Tsang	8c2aa0c533	[cutlass backend] cexpr the arg before writing to cpp file (#144714 ) Summary: The problem is for certain shapes, see unit test, one of the dimensions is like `s0 // 2`. If we use cutlass backend, this means writing that to C++ file, which would lead to C++ compilation error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144714 Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78, https://github.com/desertfire	2025-01-14 23:09:44 +00:00
Aaron Orenstein	8ad37ed710	Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483 Approved by: https://github.com/Skylion007	2025-01-14 22:32:51 +00:00
Jerry Mannil	ea3395e4f2	[ROCm] Improvements for vectorized elementwise kernels (#143269 ) * Calculate io_size as the minimum of the input and output sizes, rather than the summation of all sizes * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4 * But elems_per_thread = 8 works better on half dtypes for AMD gpus * Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD gpus by using vector size of 8 and 16 respectively Co-author: @akadutta Pull Request resolved: https://github.com/pytorch/pytorch/pull/143269 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>	2025-01-14 22:09:21 +00:00
soulitzer	c000214826	Allow GradientEdge as torch.autograd.backward outputs (#144744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144744 Approved by: https://github.com/albanD	2025-01-14 21:31:44 +00:00
fan.mo	64829b356a	[PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325 ) PyTorch now supports many PrivateUse1 backend names such as `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend. However, users who implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`. In that case, none of the `PrivateUse1` backend strings can be found when other functions that depend on them are called. For example, when we use `torch.library` to register custom functions for our new backend, with "my_backend" as the backend name instead of "PrivateUse1", the following error is thrown: ``` could not parse dispatch key 'my_backend' ``` So this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)` to double-check whether `PrivateUse1` has been renamed and, if so, adapt `k` to the new backend name and look it up again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325 Approved by: https://github.com/albanD	2025-01-14 21:21:29 +00:00
Will Constable	130452dad6	[Pipelining] fix test_schedule.py (missing destroy_process_group (#144734 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144734 Approved by: https://github.com/H-Huang ghstack dependencies: #144352, #144596	2025-01-14 21:16:09 +00:00
Will Constable	aa57f0c663	[Pipelining] Refactor common utils from test_pp_dp (#144596 ) Split test_pp_dp into pp_ddp and pp_fsdp so its a bit more concise and easier to add CP to the FSDP one. Realize that 'use_new_runtime' parametrization was not even being used, removing it saves a bunch of test time. We should migrate schedules to the new runtime and have them be covered that way. (And test_schedule*.py are testing new runtime too). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596 Approved by: https://github.com/H-Huang ghstack dependencies: #144352	2025-01-14 20:13:17 +00:00
Will Constable	6f5dce3035	[Pipelining] Fix PP grad scaling (#144352 ) Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches. Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352 Approved by: https://github.com/H-Huang	2025-01-14 20:13:17 +00:00
Nikita Shulga	9157a748a6	[MPSInductor] Add dummy properties (#144509 ) For compute capability (which is an empty string, same as CPU). And for multicore count, return 8, as this is the smallest number of GPU cores on Apple silicon Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509 Approved by: https://github.com/jansel	2025-01-14 20:12:38 +00:00
PyTorch MergeBot	bdd942efd7	Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138 )" This reverts commit 6cfc08167595e27ee9a5701c6426a7a8a7e387ef. Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))	2025-01-14 19:04:12 +00:00
Wang, Chuanqi	b4b4e57469	[CD] Enable profiling for XPU Windows nightly wheels (#144316 ) PR https://github.com/pytorch/pytorch/pull/144034 added profiling support for torch XPU Windows binary, enable it in PyTorch XPU Windows CD Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144316 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-01-14 19:01:27 +00:00
Bin Bao	2683691237	[AOTI] Add a boxed_run API (#142213 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213 Approved by: https://github.com/ezyang	2025-01-14 18:47:42 +00:00
Richard Barnes	e2891d43a8	[codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +1 (#144783 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/144783 Approved by: https://github.com/albanD, https://github.com/malfet	2025-01-14 18:34:54 +00:00
Mwiza Kunda	ec1c3ab3b2	[inductor][triton] skip test_data_type_propagation if triton (#142054 ) Non-cpp inductor backends don't have a `DataTypePropagation` pass on the scheduler nodes, so skip the test. CUDA only passes because the device is currently not changed to "cuda" in the test body. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142054 Approved by: https://github.com/eellison	2025-01-14 18:03:00 +00:00
Nikhil Gupta	e666807653	[Fix]: Enable support for Arm Neon & SVE support for FP32 Gemm Wrapper (#144327 ) Performance Improvements: Linear Layer [ 1x512 * 512x512 ] -> 2x - 4x Linear Layer [ 3x512 * 512x512 ] -> 2x - 4x Pull Request resolved: https://github.com/pytorch/pytorch/pull/144327 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/cfRod, https://github.com/malfet Co-authored-by: Crefeda Rodrigues <crefeda.Rodrigues@arm.com>	2025-01-14 17:52:00 +00:00
soulitzer	eee7a47e94	Support FunctionalTensor subclass in is_fake and maybe_get_fake_mode (#144719 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144719 Approved by: https://github.com/bdhirsh	2025-01-14 17:49:11 +00:00
PyTorch MergeBot	d21738f24a	Revert "Fix torch.normal ignores default_device (#144070 )" This reverts commit 184549b2d7e59acfc6e47d121e9ebb50648945b3. Reverted https://github.com/pytorch/pytorch/pull/144070 on behalf of https://github.com/ezyang due to broken a specific use case ([comment](https://github.com/pytorch/pytorch/pull/144070#issuecomment-2590681953))	2025-01-14 17:41:58 +00:00
PyTorch UpdateBot	7977a3638e	[executorch hash update] update the pinned executorch hash (#140769 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140769 Approved by: https://github.com/pytorchbot	2025-01-14 17:38:07 +00:00
Nikita Shulga	f2975717f3	[CD] Fix slim-wheel nvjit-link import problem (#141063 ) When other toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to there, import torch will fail with ``` ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12 ``` It could not be worked around by tweaking rpath, as it also depends on the library load order, which are not guaranteed by any linker. Instead solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following ```python if version.cuda in ["12.4", "12.6"]: with open("/proc/self/maps") as f: _maps = f.read() # libtorch_global_deps.so always depends in cudart, check if its installed via wheel if "nvidia/cuda_runtime/lib/libcudart.so" in _maps: # If all abovementioned conditions are met, preload nvjitlink _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]") ``` Fixes https://github.com/pytorch/pytorch/issues/140797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063 Approved by: https://github.com/kit1980 Co-authored-by: Sergii Dymchenko <sdym@meta.com>	2025-01-14 17:33:07 +00:00
Shangdi Yu	5c727d5679	[minifier] Fix config generator for callables (#144518 ) Summary: When config contains callables, the current configs generated cannot be run: ``` torch._dynamo.config.reorderable_logging_functions = {<built-in function print>, <function warning at 0x7f774c595630>, <function log at 0x7f774c595870>, <function error at 0x7f774c595510>, <function info at 0x7f774c595750>, <built-in function warn>, <function exception at 0x7f774c5955a0>, <function debug at 0x7f774c5957e0>, <function critical at 0x7f774c5953f0>} ``` We fix the config to generate the right string, so the config is runnable, like below ``` import logging import warnings torch._dynamo.config.reorderable_logging_functions = { warnings.warn, logging.warn, print } ``` Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config ``` Differential Revision: D67998703 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144518 Approved by: https://github.com/desertfire	2025-01-14 17:18:13 +00:00
Zhenbin Lin	cbb1ed2966	[1/N] OpenReg: Replace `open_registration_extension.cpp` with openreg (#141815 ) As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg. The current `open_registration_extension.cpp` contains two parts: 1. Implementations to support the `PrivateUse1` backend. 2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`. For the first part, we'll replace it with openreg. For the second part, we'll migrate them to UT files step by step. @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815 Approved by: https://github.com/albanD	2025-01-14 15:59:00 +00:00
Nikita Shulga	347a74b8f5	Mark CUDA-12.6 as experimental for 2.6 release (#144769 ) Because that's the first time we are trying to release it, and it also is the first release to use manylinux2_28 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144769 Approved by: https://github.com/atalman	2025-01-14 15:30:00 +00:00
Edward Z. Yang	60d2e32fa4	[BE] Remove lambda from str (#144743 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144743 Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007 ghstack dependencies: #144471	2025-01-14 15:10:57 +00:00
Edward Z. Yang	ffb3f32693	Add max kwarg to torch._check with alternate size oblivious semantics (#144471 ) Fixes https://github.com/pytorch/pytorch/issues/120288 for the static bound case I had been tying myself in knots in the original issue about the fact that we can't really do symbolic bounds like u0 < s0. But then I realized, "Wait, but the static bounds are easy!" So this makes it so you can also exclude a specific upper bound when doing size oblivious tests, which is enough to solve https://github.com/pytorch/pytorch/issues/123592#issuecomment-2574556708 It's written very dirtily, maybe there's some cleanup. Bikeshed on the public API name also welcome. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144471 Approved by: https://github.com/avikchaudhuri	2025-01-14 15:10:57 +00:00
RAHUL SINGH	95b41d2aa4	Tests Generalization for multiple accelerator devices (#139749 ) Motivation: Generalize unit tests so that they can be executed on cuda and non-cuda devices. Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non-cuda devices too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749 Approved by: https://github.com/kwen2501	2025-01-14 08:52:46 +00:00
lzhang2	1800f5f461	Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735 ) Motivation: - Enable coalescing path on XPU for `batch_isend_irecv`. - If XCCL backend is specified, then construct an XPU tensor to ensure `barrier` dispatches to the XCCL backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735 Approved by: https://github.com/kwen2501	2025-01-14 08:37:48 +00:00
Daulet Askarov	21cbee5d9b	Drop unused num_elements variable (#144723 ) Summary: With the recent enforcement of unused variable as an error in D67329035, certain tests like https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0 can't build citing: ``` Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic)) Remote command returned non-zero exit code 1 Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145` Stdout: <empty> Stderr: fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable] 3757 \| size_t num_elements = output.numel(); \| ^~~~~~~~~~~~ 1 error generated. ``` This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock. Test Plan: ``` buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test ``` https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33 Reviewed By: c-p-i-o Differential Revision: D68126236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723 Approved by: https://github.com/fduwjj, https://github.com/Skylion007 Co-authored-by: Daulet Askarov <dauleta@meta.com>	2025-01-14 08:29:01 +00:00
Isalia20	80eff6e720	[MPS] fix triangular for >3D tensors (#144545 ) The old implementation produced incorrect output because it handled only 3D tensors (B, M, N) and not other batch shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144545 Approved by: https://github.com/malfet	2025-01-14 08:25:01 +00:00
Xia, Weiwen	8436a5c2cb	[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224 ) Summary The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because - it looks better in terms of design - we need the post op fusion pass for PT2E quantization eager mode As one of a series of PRs which do the separation, this PR moves binary post op fusion of qlinear out of the lowering pass to after the weight-prepack pass. The workflow is 1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise` 2. Fuse `onednn.qlinear_pointwise` and post ops 3. Lower to cpp backend This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused. Test plan It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2025-01-14 06:46:38 +00:00
Yu, Guangye	c031defe0b	[RELAND] Generalize at::manual_seed for all accelerators (#144370 ) # Additional Context This is a reland PR originated from eeb57394f93d720bca498c3fa9d167fc7b9cca46 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2025-01-14 06:09:36 +00:00
leslie-fang-intel	9d98b66e7b	[Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897 ) Summary In this PR, we enable epilogue fusion and code generation for Grouped GEMM. Here is a high-level description of how we implement it. Fusion - The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM. - During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design. - In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles. Code Gen We maintain a list of epilogues and codegen them one by one. - If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the epilogue list. - If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list. TestPlan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897 Approved by: https://github.com/jansel, https://github.com/jgong5 ghstack dependencies: #143796	2025-01-14 06:07:50 +00:00
leslie-fang-intel	25de671ea8	[Inductor][CPP] Enable Grouped GEMM Template (#143796 ) Summary Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012 - Support flexible number of GEMMs - Share activation across GEMMs - The Grouped GEMM Template supports independent activations - However, the pattern matcher requires an anchor node, which is the shared activation across GEMMs - Each GEMM can have a unique weight but same sizes - Each GEMM can have a unique bias or None - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR - Each GEMM has its own epilogues - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear ``` Example Here is the example and generated code ``` import torch batch_size = 4 in_features = 512 out_features = 1024 dtype = torch.bfloat16 class M(torch.nn.Module): def __init__(self, bias): super().__init__() self.linear0 = torch.nn.Linear(in_features, out_features, bias=False) self.linear1 = torch.nn.Linear(in_features, out_features, bias=False) def forward(self, x): return self.linear0(x), self.linear1(x) if __name__ == "__main__": with torch.no_grad(): input = torch.randn(batch_size, in_features, dtype=dtype) m = M(bias=False).to(dtype=dtype).eval() cm = torch.compile(m) act_res = cm(input) ``` Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py Next Step - Support Epilogue fusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796 Approved by: https://github.com/jgong5, https://github.com/jansel	2025-01-14 05:59:07 +00:00
Davide Italiano	35b46a75f1	[mps/inductor] Add support for `round()` (#144731 ) With this change, inductor/test_view_on_aliased passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144731 Approved by: https://github.com/malfet	2025-01-14 05:56:13 +00:00
Jagadish Krishnamoorthy	17e05cde0c	ROCm: Skip tests in elastic/utils/distributed_test (#144692 ) The tests are failing on ROCm machines with the following error: "The client socket has timed out after 1000ms while trying to connect to (gpu4f67.jax.cs.cpe.ice.amd.com, 0)". Disabling the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144692 Approved by: https://github.com/jeffdaily	2025-01-14 03:49:06 +00:00
James Wu	e58c823ab8	Implement increment and add_to_set for CompileEventLogger (#143427 ) This diff implements `increment` and `add_to_set`, which are features of MetricsContext, but not ChromiumEventLogger. This allows us to add a bunch of other metricscontext callsites to use CompileEventLogger instead. Differential Revision: [D67354867](https://our.internmc.facebook.com/intern/diff/D67354867/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143427 Approved by: https://github.com/masnesral	2025-01-14 02:42:49 +00:00
Nikita Shulga	6053242890	[CD] Enable python3.13t builds for aarch64 (#144698 ) But make sure that right numpy version is picked (2.0.2 does not support 3.13) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698 Approved by: https://github.com/atalman ghstack dependencies: #144696, #144697, #144716	2025-01-14 02:29:01 +00:00
Huy Do	b221f88fc1	Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704 ) This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214. After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time. So, let's try @malfet's suggestion of removing the prefix altogether to keep it simple. After this lands, I will circle back to see if there are any improvements. Otherwise, it's still a simple BE change I guess. Here is the query I'm using to gather build time data for reference: ``` with jobs as ( select id, name, DATE_DIFF('minute', created_at, completed_at) as duration, DATE_TRUNC('week', created_at) as bucket from workflow_job where name like '%/ build' and html_url like concat('%', {repo: String }, '%') and conclusion = 'success' and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS) ), aggregated_jobs_in_bucket as ( select --groupArray(duration) as durations, --quantiles(0.9)(duration), avg(duration), bucket from jobs group by bucket ) select * from aggregated_jobs_in_bucket order by bucket desc ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704 Approved by: https://github.com/clee2000	2025-01-14 02:19:38 +00:00
Yiming Zhou	6d56277682	[export] Fix torchbind constant folding (#144684 ) Summary: `CallTorchBind` should not be folded during constant folding Test Plan: ``` buck2 run mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding_torchbind ``` Reviewed By: henryoier Differential Revision: D67721272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144684 Approved by: https://github.com/zhxchen17	2025-01-14 01:58:44 +00:00
Nikita Shulga	eaa8a97b39	[RelEng] Add `--ami` option to build_aarch64 (#144685 ) Which should be mutually exclusive with OS. For example, one can use the following to allocate a one-off instance ``` ./build_aarch64_wheel.py --alloc-instance --instance-type g5.4xlarge --key-name nshulga-key --ami ami-0f51103893c02957c --ebs-size 200 ``` TODO: - Figure out EBS volume name depending on the AMI (for `ami-05576a079321f21f8` (al2023) it's `/dev/xvda`, but for `ami-0f51103893c02957c` (deep learning container) it's `/dev/sda1`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144685 Approved by: https://github.com/atalman	2025-01-14 01:48:27 +00:00
Davide Italiano	de9d6a25d7	[mps/inductor] Add support for `ceil` (#144715 ) inductor/test_index_dynamic_shapes passes after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144715 Approved by: https://github.com/malfet	2025-01-14 01:16:47 +00:00
PyTorch MergeBot	64bcf39180	Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 )" This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf. Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))	2025-01-14 00:48:28 +00:00
PyTorch MergeBot	dfe06e555d	Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 )" This reverts commit dcc04e9237292de10e9cedd8213253e253b1e91c. Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/kit1980 due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/144441 ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2588515018))	2025-01-14 00:46:48 +00:00
Nikita Shulga	58302c4eaa	[BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716 ) As it's incompatible with 3.13t and only used to fetch the branch name, which could be done by running ``` git rev-parse --abbrev-ref HEAD ``` Also, remove yet another reference to the long-gone `master` branch. Test plan: Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside a docker container and check its version ``` # pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl ... Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2 root@434f2540345e:/# python Python 3.11.9 (main, Aug 1 2024, 23:33:10) [GCC 12.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> torch.__version__ '2.7.0.dev20250113+cpu' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716 Approved by: https://github.com/atalman ghstack dependencies: #144696, #144697	2025-01-14 00:43:46 +00:00
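The commit above (#144716) replaces a pygit2 lookup with a plain `git rev-parse --abbrev-ref HEAD` call. A minimal sketch of what such a replacement might look like from Python, assuming only the standard library; the helper name `current_branch` is hypothetical and is not taken from the PR:

```python
import subprocess
from typing import Optional


def current_branch(repo_dir: str = ".") -> Optional[str]:
    """Return the checked-out branch name, or None if it cannot be determined.

    Hypothetical helper: it shells out to git instead of importing pygit2.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, repo_dir does not exist,
        # or repo_dir is not inside a git work tree.
        return None
    # On a detached HEAD, git prints the literal string "HEAD".
    return result.stdout.strip()


if __name__ == "__main__":
    print(current_branch())
```

Returning None instead of raising keeps the caller free to fall back to a default branch name; whether the real build script does this is not stated in the commit message.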
Aaron Orenstein	dcc04e9237	Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483 Approved by: https://github.com/Skylion007	2025-01-13 23:19:44 +00:00
atalman	c15d6508bd	Binary builds Docker images - remove cuda 12.1 (#144575 ) Remove cuda 12.1 from manylinux, libtorch and almalinux builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/144575 Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet, https://github.com/Skylion007	2025-01-13 22:44:59 +00:00
PyTorch MergeBot	4f74864c94	Revert "[AOTI] Add a boxed_run API (#142213 )" This reverts commit 868984c3e324dedeac04cf10e2bbfbf912dac3b1. Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))	2025-01-13 22:43:47 +00:00
Animesh Jain	a54a784b82	[dynamo][dicts] Consolidate dict(..) construction (#144342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342 Approved by: https://github.com/StrongerXi	2025-01-13 22:24:56 +00:00
bobrenjc93	0373cd9950	remove allow-untyped-defs from torch/distributed/checkpoint/api.py (#144653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144653 Approved by: https://github.com/Skylion007	2025-01-13 21:57:19 +00:00
Richard Barnes	1dab79470d	c10::string_view -> std::string_view in pytorch (#143591 ) Test Plan: Sandcastle Differential Revision: D67312322 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591 Approved by: https://github.com/malfet	2025-01-13 21:44:05 +00:00
Huy Do	5129d6ef51	Fix inductor periodic smoke test wrong artifact (#144694 ) I'm not entirely sure why this failure starts to show up in periodic since Friday https://github.com/pytorch/pytorch/actions/runs/12716967189/job/35463656803. The artifact was uploaded to S3, but `use-gha: anything-non-empty-to-use-gh` was set and it was working. Maybe this is related to https://github.com/pytorch/pytorch/issues/144479 I also clean up the GCP/AWS A100 selection logic as the GCP cluster doesn't exist anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144694 Approved by: https://github.com/clee2000	2025-01-13 21:42:39 +00:00
Shangdi Yu	e15f91337b	[inductor] Add unbacked symints binding in ShapeProp (#144605 ) Summary: ShapeProp doesn't know how to propagate unbacked. Patch it up to propagate unbacked symints like PropagateUnbackedSymInts. Test Plan: ``` buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_shape_prop_unbacked_sym ``` Differential Revision: D68050073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144605 Approved by: https://github.com/guowentian, https://github.com/pianpwk	2025-01-13 21:30:20 +00:00
Catherine Lee	3c55669b88	Enable grep_linter to use -a (#144589 ) Lintrunner can only apply changes (-a) if only one suggestion is made per file. The grep_linter makes a suggestion for every line it finds incorrect, so it creates multiple suggestions per file if there are multiple lines that it wants to change. This sets the `line` parameter of the LintMessage to None for all of grep_linter, but I'm not sure if that entry did anything. I'm not sure if enabling -a is the best idea, since it's currently used for tabs and tab width might differ each time. I had one instance where running with -a caused the spacing to change. On the other hand, -a would have already worked if only one line was bad. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144589 Approved by: https://github.com/huydhn	2025-01-13 21:18:24 +00:00
Aaron Gokaslan	91dbd7b75c	[BE]: Improve typing inference with TypeIs (#144682 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144682 Approved by: https://github.com/albanD Co-authored-by: Aaron Orenstein <aorenste@meta.com>	2025-01-13 21:14:31 +00:00
Ryan Guo	4ceca4d60f	[dynamo] Avoid graph break on updates to `obj.__dict__` (#144419 ) `obj.__dict__` is handled specially in Dynamo, and prior to this patch we only support read and membership check on that dictionary object. This patch adds support for writes and some documentation. Fixes #143756. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-01-13 21:04:10 +00:00
Bin Bao	684d015c2f	[AOTI] Support _int_mm (#144571 ) Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015 Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571 Approved by: https://github.com/yushangdi	2025-01-13 20:32:29 +00:00
Nikhil Gupta	b7f95df65b	[Feat]: Add Multithreading support for kleidiai groupwise GEMM kernels (#144074 ) The KleidiAI Groupwise GEMM Kernel was not 2D blocked. This change adds support for 2D blocking of the GEMM kernel to efficiently split the workload and speed up the GEMM kernel over multiple threads. Performance improvements: 7B model pre-fill speedup from 145 t/s to 175 t/s Pull Request resolved: https://github.com/pytorch/pytorch/pull/144074 Approved by: https://github.com/digantdesai	2025-01-13 20:32:23 +00:00
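To illustrate the general idea of 2D blocking mentioned in the commit above (#144074), here is a rough sketch in plain Python. It is not the KleidiAI kernel and does not model groupwise quantization; the names `tiles_2d` and `blocked_matmul` and the default block sizes are assumptions made for illustration. The output matrix is cut into rectangular tiles, and each tile is an independent unit of work that can be handed to a thread:

```python
from concurrent.futures import ThreadPoolExecutor


def tiles_2d(m, n, block_m, block_n):
    """Yield (row_start, row_end, col_start, col_end) tiles covering an m x n output."""
    for r in range(0, m, block_m):
        for c in range(0, n, block_n):
            yield r, min(r + block_m, m), c, min(c + block_n, n)


def blocked_matmul(a, b, block_m=2, block_n=2, workers=4):
    """Multiply a (m x k) by b (k x n), computing one output tile per task."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]

    def compute(tile):
        r0, r1, c0, c1 = tile
        # Tiles never overlap, so threads write to disjoint output cells.
        for i in range(r0, r1):
            for j in range(c0, c1):
                out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that any exception raised in a task surfaces here.
        list(pool.map(compute, tiles_2d(m, n, block_m, block_n)))
    return out


if __name__ == "__main__":
    print(blocked_matmul([[1, 2], [3, 4], [5, 6]], [[1, 0, 2], [0, 1, 3]]))
```

Splitting along both dimensions yields more tiles than splitting rows alone, which helps keep every thread busy when one dimension is small. A native kernel would additionally choose tile sizes to suit cache and SIMD width, which this sketch ignores.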
Mwiza Kunda	5a2e8fce9d	Fix block pointer test module for triton CPU and add to CI (#144474 ) - Fix for BlockPointerTestBase._discontiguous_tensor. It defaults to constructing CUDA tensors, causing a failure if CUDA is not available. - Add test module to CI to prevent errors like the above from occurring. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144474 Approved by: https://github.com/jansel	2025-01-13 20:25:05 +00:00
bobrenjc93	80c286cbec	remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655 Approved by: https://github.com/StrongerXi	2025-01-13 20:03:25 +00:00
bobrenjc93	18deff0262	remove allow-untyped-defs from torch/ao/nn/intrinsic/__init__.py (#144652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144652 Approved by: https://github.com/Skylion007	2025-01-13 19:36:08 +00:00
Nikita Shulga	d44c3906b8	[EZ] [CD] Add 3.13 to FULL_PYTHON_VERSIONS (#144697 ) Separation was necessary for Conda codegen, but now it's gone Pull Request resolved: https://github.com/pytorch/pytorch/pull/144697 Approved by: https://github.com/atalman, https://github.com/izaitsevfb ghstack dependencies: #144696	2025-01-13 19:12:12 +00:00
Nikita Shulga	d2f905760d	[EZ] [CD] Eliminate stale TODO (#144696 ) As 3.13 has been enabled across the board, which one can verify by running `./github/regenerate.sh` and observing that none of the configs have changed Pull Request resolved: https://github.com/pytorch/pytorch/pull/144696 Approved by: https://github.com/izaitsevfb, https://github.com/atalman	2025-01-13 19:12:12 +00:00
bobrenjc93	cd477cdd1d	remove allow-untyped-defs from torch/ao/nn/quantized/reference/modules/linear.py (#144656 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144656 Approved by: https://github.com/Skylion007	2025-01-13 19:03:05 +00:00
bobrenjc93	f93d786f73	remove allow-untyped-defs from torch/nn/parameter.pyi (#144654 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144654 Approved by: https://github.com/Skylion007	2025-01-13 19:02:31 +00:00
Randolf Scholz	983bf604e5	ReshapeTransform: added missing argument in docstring (#144401 ) See https://github.com/pytorch/pytorch/pull/144197#discussion_r1907336339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144401 Approved by: https://github.com/janeyx99, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-13 17:59:59 +00:00
George Wigley	fe8c5c7a2d	Update the Triton DeviceInterface in test/inductor/extension_backends/triton/device_interface.py (#144399 ) Following the changes to how `DeviceInterface` is used in this [PR](https://github.com/pytorch/pytorch/pull/142033), the `DeviceInterface` in `extension_backend/triton/device_interface.py` should be updated to return the `DeviceProperties` instead of raising a NotImplementedError. This PR mirrors the [changes](https://github.com/pytorch/pytorch/pull/142033/files#diff-06553e25e48e1d60f3030458bc46d52067d3d0c3eef2d5fcea29f7e8126bd7c9L112-R114) made in Dynamo when the PR landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144399 Approved by: https://github.com/jansel	2025-01-13 17:19:58 +00:00
Xuehai Pan	bee84e88f8	[BE][Easy] improve submodule discovery for `torch.ao` type annotations (#144680 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144680 Approved by: https://github.com/Skylion007	2025-01-13 17:16:19 +00:00
Nikita Shulga	c40d917182	[MPSInductor] Fix maximum/minimum for int types (#144665 ) `metal::isnan` is only defined for floats, so provide a generic wrapper that is false for integral types. TODO: Figure out why type propagation is not working (or should it?) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665 Approved by: https://github.com/dcci	2025-01-13 15:14:01 +00:00
Isuru Fernando	8633845090	Support nanj in inductor (#144064 ) Fixes https://github.com/pytorch/pytorch/issues/144029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-13 14:29:38 +00:00
Davide Italiano	417354d953	[mps/inductor] Add support for truncdiv(). (#144666 ) Two other inductor tests pass after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144666 Approved by: https://github.com/malfet	2025-01-13 13:39:38 +00:00
Nikita Shulga	7e2239f1f0	[MPSInductor] Better error when kernel fails to compile (#144649 ) Now error message looks as follows: ``` % python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call [] stats [('calls_captured', 6)] inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)] aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)] ERROR ====================================================================== ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper method(args, kwargs) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test return value(self) File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d self.common( File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu check_model( File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model actual = run(example_inputs, *kwargs) File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner raise InductorError(e, currentframe()).with_traceback( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile compiled_fn = graph.compile_to_module().call File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module return self._compile_to_module() File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module mod = PyCodeCache.load_by_key_path( File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path mod = _reload_python_module(key, path) File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module> File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader torch._inductor.exc.InductorError: SyntaxError: failed to compile kernel void generated_kernel( device float out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { long x1 = (xindex) / (3); auto tmp0 = x1; auto tmp1 = static_cast<long>(tmp0); auto tmp2 = 0; auto tmp3 = tmp1 >= tmp2; auto tmp4 = 2; auto tmp5 = tmp1 < tmp4; long x0 = (xindex) % (3); auto tmp6 = in_ptr0[x0 + 3*(x1)]; auto tmp7 = tmp5 ? tmp6 : 0.0; auto tmp8 = tmp1 >= tmp4; auto tmp9 = 2 + ks0; auto tmp10 = static_cast<long>(tmp9); auto tmp11 = tmp1 < tmp10; auto tmp12 = 1.0; auto tmp13 = tmp8 ? tmp12 : 0.0; auto tmp14 = tmp5 ? tmp7 : tmp13; long x2 = xindex; out_ptr0[x2] = static_cast<float>(tmp14); } with program_source:18:25: error: use of undeclared identifier 'ks0' auto tmp9 = 2 + ks0; ^ Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.472s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649 Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci ghstack dependencies: #144647, #144648	2025-01-13 13:38:03 +00:00
PyTorch UpdateBot	a85d1ee106	Update slow tests (#144670 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144670 Approved by: https://github.com/pytorchbot	2025-01-13 12:06:22 +00:00
James Wu	6e77d7cac5	Add AOTAutogradCache support for cache hot loading APIs (#144499 ) This diff adds AOTAutogradCache support to the mega cache. Differential Revision: [D67991059](https://our.internmc.facebook.com/intern/diff/D67991059/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67991059/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/144499 Approved by: https://github.com/oulgen	2025-01-13 07:07:18 +00:00
Nikita Shulga	a08bd8154e	[MPSInductor] Add support for sizevars (#144662 ) Just pass them as kernel arguments. After this change `pytest test/inductor/test_torchinductor.py -v -k _mps` reports 330 failed, 429 passed, compared with 335 failed, 424 passed before. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144662 Approved by: https://github.com/jansel	2025-01-13 06:22:38 +00:00
Yiming Zhou	87843ee9ab	[export] Unify single and multiple return for hops (#143227 ) Summary: Introduce `is_hop_single_tensor_return` field to the `Node` class in serialization so that during deserialization when there is a single return, we know whether it is a tuple of a single element or a single element. Test Plan: ``` buck2 run @mode/dev-nosan sigmoid/inference/test:e2e_test_cpu -- -r E2ETestCPUCond buck2 run @mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding2 ``` Differential Revision: D66991624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143227 Approved by: https://github.com/zhxchen17	2025-01-13 03:31:14 +00:00
PyTorch MergeBot	0aa34e9591	Revert "Collect packages with importlib in collect_env (#144616 )" This reverts commit 3541d2a2aaacc4f15ea865c815ce8882577a439c. Reverted https://github.com/pytorch/pytorch/pull/144616 on behalf of https://github.com/malfet due to Somehow this change causes test_bottleneck_cuda to fail ([comment](https://github.com/pytorch/pytorch/pull/144616#issuecomment-2586095595))	2025-01-13 03:11:04 +00:00
Nikita Shulga	46eeef9130	[MPS][BE] Surface syntax errors shader compilation (#144648 ) Before this change ```python >>> import torch >>> torch.mps._compile_shader('What') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader return torch._C._mps_compileShader(source) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ " UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ } ``` After this change ```python >>> import torch >>> torch.mps._compile_shader('What') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader return torch._C._mps_compileShader(source) SyntaxError: program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648 Approved by: https://github.com/Skylion007 ghstack dependencies: #144647	2025-01-13 02:03:19 +00:00
Nikita Shulga	9ae35b8bb1	[BE] Introduce `c10::SyntaxError` (#144647 ) Which will be translated into Python's SyntaxError Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647 Approved by: https://github.com/Skylion007	2025-01-12 23:23:54 +00:00
Sv. Lockal	3541d2a2aa	Collect packages with importlib in collect_env (#144616 ) If pytorch is installed systemwide (via os package manager) or by alternative package manager like `uv`, pip is not available, causing error in `collect_env`. However it is still possible to collect exactly the same list using `importlib` API, which is always available. Fixes #144615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144616 Approved by: https://github.com/malfet	2025-01-12 23:21:08 +00:00
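The commit above (#144616) notes that the installed-package list can be collected through the `importlib` API when pip is unavailable. A minimal sketch of that idea using the standard-library `importlib.metadata` module; the function name `installed_packages` is hypothetical and this is not the code from the PR:

```python
from importlib import metadata


def installed_packages():
    """Return (name, version) pairs for installed distributions, sorted by name.

    Hypothetical helper: it reads distribution metadata directly,
    so it does not need the pip executable or the pip module.
    """
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a Name field
            pairs.add((name, dist.metadata.get("Version")))
    return sorted(pairs, key=lambda pair: pair[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name}=={version}")
```

The `name==version` output mimics the familiar `pip list --format=freeze` layout; the set removes duplicates that can appear when the same distribution is visible on more than one path entry.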
Gabriel Ferns	1376116ab1	Config fuzzer (#139736 ) This tool makes it easy to search through config state-space with a minimal reproduction or test. It presents a similar interface to the config bisector by taking a test_function that should either raise an Exception or return False upon failure. It has two entry points: `fuzz_n_tuple`, which tries every combination of n configs, and `bisect`, which randomly flips configs and tries to find the minimal reproduction upon failure. `bisect` is a much more efficient way to search the space, but `fuzz_n_tuple` can give you peace of mind that a new config will compose with every other config. It's been used to find three bugs so far in the inductor config: https://github.com/pytorch/pytorch/issues/140220 https://github.com/pytorch/pytorch/issues/140219 https://github.com/pytorch/pytorch/issues/143524 This PR also adds a bunch of missing types to the inductor config to get them to play nice with the fuzzer, so it can be a good forcing function for adding types to config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139736 Approved by: https://github.com/eellison	2025-01-12 22:59:02 +00:00
Wenqin Yang	334ee8ba40	Fix a bug for conj_physical (#144391 ) Fixes #141426 fix a bug in previous [PR](https://github.com/pytorch/pytorch/pull/141427), which shouldn't convert the data type for conj. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144391 Approved by: https://github.com/jansel	2025-01-12 21:18:17 +00:00
Aaron Gokaslan	cb66146f2b	[BE]: Update literal typing for torch/fx/graph nodelist (#144650 ) Mentioned in discussion for #144631 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144650 Approved by: https://github.com/jansel	2025-01-12 21:02:13 +00:00
Nikita Shulga	91a65cbd31	[MPSInductor] Implement `check_bounds` (#144635 ) Although at the moment it returns rather than raises an assert due to https://github.com/pytorch/pytorch/pull/144632 `pytest test/inductor/test_torchinductor.py -v -k _mps` score is `368 failed, 391 passed, 32 skipped` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144635 Approved by: https://github.com/jansel	2025-01-12 21:01:20 +00:00
Jason Ansel	fd382f1269	Micro-optimization in Graph.nodes.__iter__ (#144631 ) This generates slightly better code (removing a generator frame) and drops a redundant assert. ```py >>> import timeit >>> def a(): ... yield from range(3) ... >>> def b(): ... return range(3) ... >>> timeit.timeit(lambda: [a()]) 0.2714634328149259 >>> timeit.timeit(lambda: [b()]) 0.12076826114207506 >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144631 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2025-01-12 17:46:46 +00:00
Sam Larsen	de04acaca9	Disable scuba logging for autotuning (#144568 ) Summary: the compile IDs are currently null, which is confusing. Turn it off until we have a solution. Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/g2d2g5xs Pull Request resolved: https://github.com/pytorch/pytorch/pull/144568 Approved by: https://github.com/jamesjwu	2025-01-12 15:47:14 +00:00
Yanbo Liang	1664033e13	[Functorch] Refactor vmapify autograd function: remove cell mutation (#143811 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143811 Approved by: https://github.com/zou3519	2025-01-12 10:31:23 +00:00
Nikita Shulga	cec245806e	[MPSInductor] Implement bitcasts (#144638 ) That will be used to compile something like `torch.rand(32, device='mps').view(dtype=torch.int32)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144638 Approved by: https://github.com/dcci	2025-01-12 06:11:28 +00:00
Nikita Shulga	32a91dedc5	[MPSInductor] Properly generate index expressions (#144632 ) Now test_slice_scatter4_mps passes Before this change test_torchinductor.py reported 422 failed and 337 passed, after this change 412 failed 347 passed. Fixes https://github.com/pytorch/pytorch/issues/144630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144632 Approved by: https://github.com/dcci	2025-01-12 06:10:05 +00:00
Yanbo Liang	3355103233	[Dynamo] Supports autograd.Function forward returns constant (#144597 ) Fixes #144142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144597 Approved by: https://github.com/jansel	2025-01-12 03:53:10 +00:00
Davide Italiano	e0f67405a1	[mps/inductor] Add support for exp(). (#144606 ) inductor/test_silu now passes after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-12 00:38:11 +00:00
Nikita Shulga	10887fc139	[BE] Enable test_public_bindings on MacOS (#144591 ) I've tried it locally and it works.. (One more reason to xfail rather than skip) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144591 Approved by: https://github.com/Skylion007	2025-01-12 00:34:47 +00:00
Davide Italiano	5e858254d2	[mps/inductor] Add support for trunc(). (#144629 ) inductor/test_div1 passes after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144629 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-12 00:11:03 +00:00
bobrenjc93	f6688ac81d	remove allow-untyped-defs from torch/distributed/_shard/sharded_tensor/shard.py (#144623 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144623 Approved by: https://github.com/Skylion007	2025-01-12 00:10:42 +00:00
bobrenjc93	b8aae2773f	remove allow-untyped-defs from torch/distributed/_checkpointable.py (#144627 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144627 Approved by: https://github.com/Skylion007	2025-01-12 00:07:26 +00:00
bobrenjc93	b5485c9f41	remove allow-untyped-defs from torch/_functorch/utils.py (#144626 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144626 Approved by: https://github.com/Skylion007	2025-01-12 00:07:16 +00:00
bobrenjc93	ad221269b0	remove allow-untyped-defs from torch/distributions/pareto.py (#144624 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144624 Approved by: https://github.com/Skylion007	2025-01-12 00:06:56 +00:00
bobrenjc93	80b756ed91	remove allow-untyped-defs from torch/jit/_pickle.py (#144625 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144625 Approved by: https://github.com/Skylion007	2025-01-12 00:06:25 +00:00
PyTorch MergeBot	4f406d22a2	Revert "[mps/inductor] Add support for exp(). (#144606 )" This reverts commit 2ccbacfa24cae724ec1ea3bc7de189e5bf948d46. Reverted https://github.com/pytorch/pytorch/pull/144606 on behalf of https://github.com/malfet due to It now passes MPS-not-supported test ([comment](https://github.com/pytorch/pytorch/pull/144606#issuecomment-2585482477))	2025-01-11 23:51:35 +00:00
PyTorch MergeBot	eaa24821f2	Revert "[ez] add lint commits to .git-blame-ignore-revs (#144576 )" This reverts commit 49c1f81be84466d015705b1882320919eecffa82. Reverted https://github.com/pytorch/pytorch/pull/144576 on behalf of https://github.com/janeyx99 due to need to redo with better testing ([comment](https://github.com/pytorch/pytorch/pull/144576#issuecomment-2585456893))	2025-01-11 21:53:00 +00:00
Davide Italiano	2ccbacfa24	[mps/inductor] Add support for exp(). (#144606 ) inductor/test_silu now passes after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606 Approved by: https://github.com/malfet	2025-01-11 18:09:33 +00:00
eqy	63569d9745	[CUDA][TF32] Add some missing TF32 decorators to `test_nn.py` (#144592 ) Original authored by @bilal2vec Pull Request resolved: https://github.com/pytorch/pytorch/pull/144592 Approved by: https://github.com/Skylion007	2025-01-11 16:20:59 +00:00
eqy	388b75edec	[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441 ) Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs... Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441 Approved by: https://github.com/Chillee	2025-01-11 15:30:38 +00:00
Ding, Yi1	2e3b051154	[XPU] Fix TRITON_XPU_BUILD_FROM_SOURCE (#142850 ) Fixes #142849 The idea is to remove the redundant 'git' in the TRITON_XPU_BUILD_FROM_SOURCE=1 case (L29) while keeping it for the pre-built whl installation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142850 Approved by: https://github.com/chuanqi129, https://github.com/benjaminglass1, https://github.com/EikanWang, https://github.com/atalman	2025-01-11 13:11:55 +00:00
Ting Lu	b7bef1ca84	[aarch64] fix TORCH_CUDA_ARCH_LIST for cuda arm build (#144436 ) Fixes #144037 The root cause is that the CUDA ARM build does not call `.ci/manywheel/build_cuda.sh` but calls `.ci/aarch64_linux/aarch64_ci_build.sh` instead. Therefore, https://github.com/pytorch/pytorch/blob/main/.ci/manywheel/build_cuda.sh#L56 was not executed for the CUDA ARM build. This adds the equivalent code to `.ci/aarch64_linux/aarch64_ci_build.sh` as a workaround. In the future, we should aim to integrate .ci/aarch64_linux/aarch64_ci_build.sh back into .ci/manywheel/build_cuda.sh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144436 Approved by: https://github.com/atalman	2025-01-11 09:00:46 +00:00
Blaine Burton Rister	e1d0a2ff30	[Inductor] Restrict ND tiling analysis to MemoryDeps (#144497 ) # Issue https://github.com/pytorch/pytorch/pull/137243 introduced a feature where the ND tiling algorithm analyzes memory dependencies. It iterates over all `Dep`'s of the kernel. However, the analysis is only applicable to `MemoryDep` instances, which are a subclass of `Dep`. In particular, it doesn't work for `StarDep`'s, for the reasons described here: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/simd.py#L1653 # Fix This PR changes the algorithm to only iterate over `MemoryDep` instances. # Testing Parameterized an existing test for `torch.bucketize` to also run with ND tiling. This test emits a node with `StarDep`'s. Without this PR, the compiler would crash on this test case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144497 Approved by: https://github.com/eellison	2025-01-11 05:16:47 +00:00
Huy Do	e4b2e90e54	Fix broken YAML template after #144574 (#144604 ) The YAML syntax is wrong and GitHub complains about it https://github.com/pytorch/pytorch/blob/main/.github/ISSUE_TEMPLATE/pt2-bug-report.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/144604 Approved by: https://github.com/wdvr	2025-01-11 05:09:06 +00:00
Will Constable	11082aead3	[Pipelining] Fix FSDP+PP stream sync bug (#144535 ) This bug could cause gradient corruption as a race condition exists between FSDP's reduce-scatter and any operations reading .grad on the main stream. The root cause is that pipelining stage .backward implementation got modified to support zero-bubble and in doing so, invoked .grad() instead of .backward(), and performed manual gradient accumulation and manually called into hooks for FSDP. But one key hook was missed for FSDP, the '_root_post_backward_final_callback' hook, which is responsible for syncing the grad reduction ops after the last layer's backward completes. Note: this fix applies to both zero-bubble and non-zero-bubble schedules. This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks. However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered. A better fix as a follow up PR would be to invoke .backward() for the weight grad, so that we never have to disable or manually invoke hooks. Modified test_pp_dp to intentionally race against FSDP's reduce by modifying the parameters inplace in a mathematically identical way, and confirmed it fails intermittently when the FSDP sync is not applied and passes with the FSDP sync added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535 Approved by: https://github.com/awgu ghstack dependencies: #144534	2025-01-11 03:42:15 +00:00
Will Constable	1d3cd7bd09	[Pipelining] Improve test_pp_dp (#144534 ) Some refactoring, but important changes include - initializing the weights properly so there are more nonzero gradients flowing, which helped catch the DDP+PP+ZB bug - make the DDP+ZB+PP bug skip for now and file an issue - tighten the tolerances to defaults - use separate targets instead of same inputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534 Approved by: https://github.com/H-Huang	2025-01-11 03:27:16 +00:00
Simon Fan	8fa47c9455	[dynamo] log compiler collective duration to tlparse chromium trace (#144372 ) To show wall time in tlparse for the synchronous compiler collective. Can eliminate the leading hypothesis from https://fb.workplace.com/groups/1075192433118967/permalink/1578670289437843. <img width="1296" alt="image" src="https://github.com/user-attachments/assets/b17d4efb-8573-43e5-af58-c51af05acb54" /> sample: https://gist.github.com/xmfan/19eeaa80d55a4e7c168e150355ec7392 rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10 rank 1: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144372 Approved by: https://github.com/ezyang	2025-01-11 03:10:39 +00:00
Colin L. Rice	0cd9320c7f	easy: dynamo_config: sort keys and set values (#143317 ) This will create consistent ordering of keys when writing, as well as sorting sets before serializing Pull Request resolved: https://github.com/pytorch/pytorch/pull/143317 Approved by: https://github.com/masnesral ghstack dependencies: #143307	2025-01-11 03:08:04 +00:00
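The idea behind #143317 (sorted keys, sets sorted before serializing) can be illustrated with a minimal sketch. This is not the code from the PR; `normalize` and `dump_config` are hypothetical helper names, and the sketch assumes each set holds mutually comparable values.

```python
import json


def normalize(value):
    """Recursively replace sets with sorted lists so output order is stable."""
    if isinstance(value, (set, frozenset)):
        # Assumption: set members are mutually comparable (e.g. all strings).
        return sorted(value)
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    return value


def dump_config(config):
    """Serialize a config dict deterministically (hypothetical helper)."""
    # sort_keys=True fixes the key order at every nesting level.
    return json.dumps(normalize(config), sort_keys=True)


print(dump_config({"b": {3, 1, 2}, "a": {"z": 1, "y": {"q", "p"}}}))
```

Two configs with the same contents then serialize to the same string regardless of insertion order or set iteration order, which makes the output usable as a cache key or in a diff.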
Sam Ginzburg	074aca3ed2	[user triton] add support for @triton.heuristics after @triton.autotune (#142208 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142208 Approved by: https://github.com/zou3519	2025-01-11 02:18:26 +00:00
PyTorch MergeBot	3753d30273	Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 )" This reverts commit 9f09b719d33c61224ebb85baa369a8364063aa6f. Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it somehow breaks memory leak checks ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2585004792))	2025-01-11 02:10:16 +00:00
Sahan Paliskara	49c1f81be8	[ez] add lint commits to .git-blame-ignore-revs (#144576 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576 Approved by: https://github.com/janeyx99	2025-01-11 02:09:46 +00:00
Nikita Shulga	92ddb3d3d3	[MPS] Expose `MPSProfiler::start/stopCapture` to Python (#144561 ) I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager. For example, code below: ```python import torch import os def foo(x): return x[:,::2].sin() + x[:, 1::2].cos() if __name__ == "__main__": os.environ["MTL_CAPTURE_ENABLED"] = "1" x = torch.rand(32, 1024, device="mps") with torch.mps.profiler.metal_capture("compiled_shader"): torch.compile(foo)(x) ``` should capture the execution of a `torch.compile` generated shader <img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561 Approved by: https://github.com/manuelcandales ghstack dependencies: #144559, #144560	2025-01-11 02:05:36 +00:00
Yidi Wu	c7dbee5106	[reland][export] don't decompose custom triton op when exporting (#144284 ) Summary: A reland of https://github.com/pytorch/pytorch/pull/142426. Copying the description over here: For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable. The alternative: If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jited python functions and triton dtypes. This is undesirable because: it can be tedious to maintain a layer that serializes the jited function (e.g. with a string) and dtypes; changes to triton or the serialization logic for triton arguments can be BC breaking; the exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction. Future plans: After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file on the same machine where users call export, which does autotuning, removes the triton dependency and serves the model with the Cubin. This guarantees that triton changes won't break BC. In the long term, we may export multiple cubins for the triton op directly. Test Plan: see new tests. Differential Revision: D67879685 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144284 Approved by: https://github.com/zou3519	2025-01-11 01:34:35 +00:00
Marc Horowitz	95d333f52e	[distributed] Fix _ReaderView.read() and readinto() to stop reading at the end of the slice (#143357 ) _ReaderView doesn't work correctly if the slice ends past the view. read(-1) would call read(-1) on the base_stream, which would consume the entire underlying stream, even if the view ended before that. read(n) would read n bytes, even if the view ended before that. The new implementation clamps the size read to the size of the view. readinto(b) would read len(b) bytes, even if the view ended before that. Since the interface depends on the size of b, we use a (potentially) shortened view into b to avoid a copy. If the view doesn't contain enough data to fill the view, then this will appear as end of stream to the caller, which is the desired behavior. This fix should not be user facing, since the bug is in an internal helper, and is only visible with new code down the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143357 Approved by: https://github.com/saumishr	2025-01-11 00:22:10 +00:00
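The clamping behavior described in #143357 can be sketched as follows. This is an illustration only, not the `_ReaderView` implementation; `SliceReader` is a hypothetical class, and it assumes a seekable base stream whose `readinto` returns an integer (as `io.BytesIO` does).

```python
import io


class SliceReader:
    """Read-only window of `length` bytes starting at `offset` (hypothetical)."""

    def __init__(self, base, offset, length):
        self.base = base
        self.remaining = length  # bytes still available inside the window
        base.seek(offset)

    def read(self, size=-1):
        # Clamp both read(-1) and oversized reads to the end of the window.
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.base.read(size)
        self.remaining -= len(data)
        return data

    def readinto(self, b):
        # A shortened memoryview limits the read without copying the buffer.
        view = memoryview(b)[: self.remaining]
        count = self.base.readinto(view)
        self.remaining -= count
        return count


base = io.BytesIO(b"0123456789")
reader = SliceReader(base, offset=2, length=5)
first = reader.read(3)    # stays inside the window
rest = reader.read(-1)    # stops at the window end, not the stream end
after = reader.read(4)    # empty: the window is exhausted
```

Once the window is exhausted, both methods report end of stream to the caller (an empty `bytes` object or a count of 0) even though the base stream still holds data.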
Xu Han	c9afa00a85	update sleef for disable libm on Windows [submodule Sleef] (#142245 ) This PR implements the RFC: https://github.com/pytorch/pytorch/issues/141946 Changes: 1. Update `Sleef` to include its PR: https://github.com/shibatch/sleef/pull/603 2. Set `SLEEF_BUILD_WITH_LIBM` to `OFF`, which turns off the CMake find_library(libm) lookup in `Sleef`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142245 Approved by: https://github.com/EikanWang, https://github.com/atalman Co-authored-by: Eikan Wang <eikan.wang@intel.com>	2025-01-11 00:11:55 +00:00
cyy	6cfc081675	Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138 ) To facilitate further possible changes of DeviceIndex to int16_t. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144138 Approved by: https://github.com/albanD	2025-01-10 23:53:19 +00:00
PyTorch MergeBot	b80ecc4457	Revert "Fix poision child process issue when call getAccelerator() (#144368 )" This reverts commit 2583d831d40d6fa64f0b637d5bc7598e484a3283. Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))	2025-01-10 23:36:43 +00:00
PyTorch MergeBot	db2a30932a	Revert "Generalize at::manual_seed for all accelerators (#144370 )" This reverts commit eeb57394f93d720bca498c3fa9d167fc7b9cca46. Reverted https://github.com/pytorch/pytorch/pull/144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))	2025-01-10 23:36:43 +00:00
Sahan Paliskara	9ec8ecea71	Update documentation.yml	2025-01-10 15:27:28 -08:00
Sahan Paliskara	1ff8a1c4eb	Update documentation.yml to request english	2025-01-10 15:26:43 -08:00
Nikita Shulga	c7f12a4a7b	[MPSInductor] Speedup maximum/minumum ops (#144581 ) By relying on the fact that if either `a` or `b` is NaN (or both), then `a + b` would also be NaN. I.e. it replaces ```metal auto tmp2 = metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp0))) | metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp1))) ? static_cast<decltype(tmp0+tmp1)>(NAN) : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1)); ``` with ```metal auto tmp2 = metal::isnan(tmp0 + tmp1) ? tmp0 + tmp1 : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1)); ``` which according to MetalProfiler takes fewer instructions: <img width="520" alt="image" src="https://github.com/user-attachments/assets/54659392-012b-453e-9c02-c3c5f332074a" /> vs <img width="1031" alt="image" src="https://github.com/user-attachments/assets/55fcfa78-1ea5-4b0a-8154-d79b3e3cc400" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144581 Approved by: https://github.com/dcci, https://github.com/jhavukainen	2025-01-10 22:58:00 +00:00
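The NaN shortcut in #144581 can be mirrored on scalar floats as a rough sketch. The Metal code above is the actual change; the two Python functions below are hypothetical stand-ins that only demonstrate the arithmetic identity.

```python
import math


def maximum_explicit(a, b):
    """NaN-propagating maximum with one isnan check per operand."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)


def maximum_via_sum(a, b):
    """Same idea with a single check: a + b is NaN if either operand is."""
    total = a + b
    return total if math.isnan(total) else max(a, b)


# The two variants agree on ordinary values and on NaN inputs.
for a, b in [(1.0, 2.0), (-3.5, -7.0), (math.nan, 1.0), (1.0, math.nan)]:
    left, right = maximum_explicit(a, b), maximum_via_sum(a, b)
    assert (math.isnan(left) and math.isnan(right)) or left == right
```

One caveat of the sum-based form, under IEEE 754 arithmetic: `inf + (-inf)` is also NaN, so the two variants differ when the operands are infinities of opposite sign.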
Angela Yi	a94ec0a9a5	[aoti] Remove example inputs from aoti_compile_and_package (#144520 ) Summary: The args were removed in https://github.com/pytorch/pytorch/pull/140991 Test Plan: CI Differential Revision: D67998954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144520 Approved by: https://github.com/yushangdi	2025-01-10 21:56:23 +00:00
Sahan Paliskara	6b902e6e1a	Update bug-report.yml to make it not look weird Seems like https://github.com/pytorch/pytorch/pull/144574 did not format as expected.	2025-01-10 13:53:27 -08:00
Sahan Paliskara	4daf007b64	Request English for Issues (#144574 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144574 Approved by: https://github.com/albanD	2025-01-10 21:51:15 +00:00
Alexander Kurakin	68dad26b95	torch/nn/modules/linear.py: docs: improvements (#138484 ) torch/nn/modules/linear.py: docs: improvements Pull Request resolved: https://github.com/pytorch/pytorch/pull/138484 Approved by: https://github.com/mikaylagawarecki	2025-01-10 20:03:43 +00:00
angelayi	7a81ba18b9	[export] Add support for serializing symint inputs (#142284 ) Fixes https://github.com/pytorch/pytorch/issues/142167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142284 Approved by: https://github.com/avikchaudhuri	2025-01-10 20:03:26 +00:00
Alexander Kurakin	18c1dcb8f3	docs: get rid of copyright year (#144562 ) Fixes https://github.com/pytorch/pytorch/pull/144153#pullrequestreview-2540418083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144562 Approved by: https://github.com/albanD	2025-01-10 19:57:25 +00:00
Shangdi Yu	be5afe16a6	Fix deepcopy hooks (#144531 ) Summary: As title, fix bug when a GraphModule doesn't have _deepcopy_hooks attribute Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//torchmultimodal/tests:tests -- --exact 'torchmultimodal/tests:tests - test_albef.py::test_dequeue_and_enqueue' ``` Differential Revision: D68002767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144531 Approved by: https://github.com/BoyuanFeng	2025-01-10 19:55:22 +00:00
angelayi	10ff6b8894	[export] Add pickle protocol (#142253 ) Fixes https://github.com/pytorch/pytorch/issues/142004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142253 Approved by: https://github.com/avikchaudhuri	2025-01-10 19:49:07 +00:00
Huy Do	396630ed78	Update the accuracy results for moco and llama (#144523 ) This has been failing in trunk for some time, so let's just update the accuracy results first. The command I ran: `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py 127f836881e75e0c688619b54a35b018a69d7ee7`. I also fixed the update script a bit to make it work after https://github.com/pytorch/pytorch/pull/139337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144523 Approved by: https://github.com/kit1980, https://github.com/Skylion007	2025-01-10 19:40:49 +00:00
Max Podkorytov	99600789c3	[ROCm][Inductor][CK] hackfix for segfault in addmm op (#144519 ) This snippet used to cause segfault on GPU due to incorrect input order when invoking the kernel ``` import os import torch import torch.nn as nn from torch._inductor import config as inductor_config from torch._inductor.utils import fresh_inductor_cache M, N, K = 128, 128, 4096 dtype = torch.float16 X = torch.randn(M, N, dtype=dtype).cuda() A = torch.randn(M, K, dtype=dtype).cuda() B = torch.randn(K, N, dtype=dtype).cuda() class SimpleModel(nn.Module): def __init__(self): super().__init__() def forward(self, b, x, y): return torch.addmm(b, x, y) import ck4inductor ck_dir = os.path.dirname(ck4inductor.__file__) with fresh_inductor_cache(): with inductor_config.patch( { "max_autotune_gemm_backends": "CK", "autotune_fallback_to_aten": False, "compile_threads": 144, "rocm.ck_dir": ck_dir, } ): compiled_model = torch.compile(SimpleModel(), mode="max-autotune") res = compiled_model(X, A, B) res_eager = torch.addmm(X, A, B) torch.testing.assert_close(res, res_eager) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144519 Approved by: https://github.com/chenyang78	2025-01-10 19:29:14 +00:00
Arash Pakbin	a37db5ae39	operator benchmark change parsing from regex based to manual (#144297 ) The regex-based parser would erroneously split on commas in nested brackets, for example, it would do the following parse which is wrong: 'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2'] The new manual parser handles this situation the right way: 'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2'] Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297 Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily	2025-01-10 19:15:36 +00:00
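The bracket-aware split described in #144297 can be sketched with a depth counter. This is a minimal illustration, not the parser from the PR; `split_top_level` is a hypothetical name, and the sketch assumes brackets in the input are balanced.

```python
def split_top_level(config):
    """Split on commas that are not nested inside (), [] or {}."""
    parts, current, depth = [], [], 0
    for ch in config:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            # Top-level comma: close the current field.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current).strip())
    return parts


print(split_top_level("M: [(32, 16), (64, 32)], ZPB: 2"))
# → ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
```

A regex split on `,` has no notion of nesting depth, which is why it cuts inside `[(32, 16), (64, 32)]`. Tracking depth explicitly keeps each bracketed value in one piece.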
Hamza Butt	4f04078aec	[CI] Ensure ACL is obtained from GitHub (#141804 ) - GitHub tagged releases are the preferred method to obtain ACL. Please merge this before https://github.com/pytorch/pytorch/pull/138889 so that PyTorch can take GitHub releases going forward instead of mlplatform. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141804 Approved by: https://github.com/snadampal, https://github.com/ng-05, https://github.com/digantdesai	2025-01-10 19:05:02 +00:00
cyy	4abf554882	Use structure binding (#144524 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144524 Approved by: https://github.com/Skylion007	2025-01-10 18:47:35 +00:00
Feny Patel	1ce3524277	use collective_comm activity for hccl traces (#144490 ) Summary: Use existing collective_comm (currently used for nccl traces) for hccl traces as well. Only init the nccl profiler when KINETO_HAS_NCCL_PROFILER is defined so as to not init it when the build is for MTIA/HCCL Test Plan: CIs Differential Revision: D67285333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144490 Approved by: https://github.com/sraikund16	2025-01-10 18:39:35 +00:00
Bin Bao	868984c3e3	[AOTI] Add a boxed_run API (#142213 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213 Approved by: https://github.com/ezyang	2025-01-10 18:27:00 +00:00
Scott Wolchok	b46d00c1b7	Shard RegisterDispatchKey (#144364 ) Should fix https://github.com/pytorch/pytorch/issues/143952 . Testing: built PyTorch on Raspberry Pi 5; this seemed to alleviate high peak memory requirement. (I did increase shard counts for other generated files along the way, but I need to go back and figure out how much of that was strictly necessary vs. needing to use -j1 or -j2.) Differential Revision: [D67925496](https://our.internmc.facebook.com/intern/diff/D67925496/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144364 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #144363	2025-01-10 18:21:19 +00:00
Aleksei Nikiforov	4143312e67	S390x ci periodic tests (#125401 ) Periodically run testsuite for s390x Dependencies update Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected. The reason for update is build on s390x. pypi doesn't provide binary build for z3-solver for versions 4.12.2.0 or 4.12.6.0 for s390x. Unfortunately, version 4.12.2.0 fails to build with newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0. Due to this minor version bump fixes build on s390x. ``` # pip3 install z3-solver==4.12.2.0 ... In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53: /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’: /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type 29 \| #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \ \| ^~~~~~~~~ /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’ 82 \| m_curr_ptr = ALIGN(char , new_curr_ptr); \| ^~~~~ /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’? 
56 \| #include "util/page.h" +++ \|+#include <cstdint> 57 \| ``` Python paths update* On AlmaLinux 8 s390x, old paths: ``` python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())' /usr/lib/python3.12/site-packages ``` Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages` New paths: ``` python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))' /usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch ``` ``` # python -c 'import torch ; print(torch)' <module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'> ``` `pip3 install dist/.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths: ``` CMake Error at CMakeLists.txt:9 (find_package): By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "Torch", but CMake did not find one. ``` https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401 Builders availability* Build took 60 minutes Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards) 60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes. We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes. That leaves 28800 - 1100 = 27700 minutes total. 
Periodic tests would use 1950 minutes, which will leave 25750 minutes. Nightly binaries build + nightly tests = 3050 minutes. 25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with a pretty good safety margin. Skip test_tensorexpr On s390x, pytorch is built without llvm. Even if it were built with llvm, llvm currently doesn't support the required features on s390x, and the test fails with errors like: ``` JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer unknown file: Failure C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) } ``` Disable cpp/static_runtime_test on s390x Quantization is not fully supported on s390x in pytorch yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-10 18:21:07 +00:00
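The builder-capacity estimate in the commit message above can be re-derived in a few lines. This is only a sketch of the arithmetic: every figure is taken from the message, and the variable names are ours.

```python
# All durations are in minutes and come from the commit message above.
build = 60
test_shards = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]  # 10 shards

periodic_once = build + sum(test_shards)   # one periodic run
periodic_budget = 2 * periodic_once        # doubled as a safety margin

machines = 20
capacity = machines * 24 * 60              # machine-minutes available per day

nightly_once = 5 * (90 + 15 + 5)           # 5 binaries: build + test + upload
nightly_budget = 2 * nightly_once          # doubled as a safety margin

left_after_nightly = capacity - nightly_budget
left_after_periodic = left_after_nightly - periodic_budget
combined = nightly_budget + periodic_budget
extra_runs = left_after_periodic / combined

print(periodic_once, periodic_budget, capacity)
print(left_after_nightly, left_after_periodic, combined, round(extra_runs, 2))
```

The rounded ratio matches the 8.44 quoted in the message, so roughly eight additional combined runs per day fit into the remaining capacity.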
Scott Wolchok	603e1c0b02	torchgen: move dispatch_helpers out of RegisterDispatchDefinitions.ini (#144363 ) The dispatch_helpers should be generated once, not once per kernel namespace. Differential Revision: [D67925497](https://our.internmc.facebook.com/intern/diff/D67925497/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144363 Approved by: https://github.com/bdhirsh	2025-01-10 18:13:06 +00:00
Masaki Kozuki	7a93a58b3c	fix typo: "assumbed" (#144543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144543 Approved by: https://github.com/Skylion007	2025-01-10 17:16:01 +00:00
Alexander Grund	fdc4f9dde2	Avoid running helper functions as test (#144544 ) Pytest considers all symbols starting with `test_` as a test case/function and runs them. `test_compiled_fsdp` is a decorator, but because it is imported into test modules, pytest discovers it and runs it as a test. Rename it to avoid this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544 Approved by: https://github.com/Skylion007	2025-01-10 17:15:50 +00:00
Nikita Shulga	8dba1ce73b	[MPS] Make MPSProfiler usable from C++ (#144560 ) By moving `buildTensorString` implementation away from the header Pull Request resolved: https://github.com/pytorch/pytorch/pull/144560 Approved by: https://github.com/Skylion007 ghstack dependencies: #144559	2025-01-10 17:13:34 +00:00
Nikita Shulga	f604338e31	[MPS] Make sure that MPSStream is usable from C++ (#144559 ) It's intended to be, but this was never tested. This change introduces no new functionality, just properly isolates ObjC implementation details from the potential C++ caller Pull Request resolved: https://github.com/pytorch/pytorch/pull/144559 Approved by: https://github.com/Skylion007	2025-01-10 17:13:34 +00:00
PyTorch MergeBot	473b745cb9	Revert "[dynamo] Avoid graph break on updates to `obj.__dict__` (#144419 )" This reverts commit c8595ba7d02fea9a5642ebbb60a810d18dc60666. Reverted https://github.com/pytorch/pytorch/pull/144419 on behalf of https://github.com/clee2000 due to newly added test fails internally D68004708 ([comment](https://github.com/pytorch/pytorch/pull/144419#issuecomment-2583265412))	2025-01-10 16:59:38 +00:00
Nikita Shulga	e6b9e67465	[BE][Opinfo] Delete redundant `dtypesIfCUDA` (#144512 ) If they are the same as CPU, no need to have that extra line Discovered while reviewing https://github.com/pytorch/pytorch/pull/143833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144512 Approved by: https://github.com/Skylion007	2025-01-10 15:15:38 +00:00
Avik Chaudhuri	a222029f4e	retracing in strict doesn't like dataclass registration (#144487 ) Retracing in strict doesn't seem to like dataclass registration. Just refactoring some tests to make this explicit (whereas other export testing variants work fine). Differential Revision: [D67985149](https://our.internmc.facebook.com/intern/diff/D67985149/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144487 Approved by: https://github.com/angelayi	2025-01-10 12:31:53 +00:00
fan.mo	b2fde28283	[Profiler] Fix device setting error of other backends in torch.profiler (#144237 ) In earlier implementation, if `self.use_device != "cuda"` and `device is None`, we would get a `device = "cpu"` from line401, which is not as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144237 Approved by: https://github.com/sraikund16	2025-01-10 10:41:11 +00:00
Yu, Guangye	eeb57394f9	Generalize at::manual_seed for all accelerators (#144370 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370 Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui ghstack dependencies: #144368	2025-01-10 09:28:28 +00:00
Yu, Guangye	2583d831d4	Fix poison child process issue when calling getAccelerator() (#144368 ) # Motivation fix https://github.com/pytorch/pytorch/issues/144152 # Solution - Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch. - Define `at::hasXXX` to determine if accelerator XXX is available at runtime. - Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui	2025-01-10 09:28:27 +00:00
bobrenjc93	08be9ec312	Migrate from Tuple -> tuple in torch/distributed (#144258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258 Approved by: https://github.com/aorenste	2025-01-10 08:34:54 +00:00
zeshengzong	184549b2d7	Fix torch.normal ignores default_device (#144070 ) Fixes #122886 1. Enable `torch.normal` to work with `DeviceContext` so that it picks up the default device set via `set_default_device`. 2. Add a hint to the `set_default_device` doc suggesting the `torch.Tensor.to` method for moving tensors to the desired device explicitly. Test Result 1. Doc Preview ![image](https://github.com/user-attachments/assets/eb69c334-be2b-4dc5-bdce-567da21e1635) 2. Local Test ```python >>> import torch >>> torch.normal(0.,1., (10,10)).device device(type='cpu') >>> torch.set_default_device('cuda') >>> torch.normal(0.,1., (10,10)).device device(type='cuda', index=0) ``` ```bash pytest test/test_tensor_creation_ops.py ``` ![image](https://github.com/user-attachments/assets/8b466b55-f162-4b83-8b20-71de2c1d0914) ```bash lintrunner ``` ![image](https://github.com/user-attachments/assets/5b269c50-da57-47ed-8500-4edf2c2295e4) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144070 Approved by: https://github.com/ezyang	2025-01-10 08:19:55 +00:00
bobrenjc93	1fe3af2c68	Migrate from Tuple -> tuple in torch/_dynamo (#144261 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144261 Approved by: https://github.com/aorenste, https://github.com/zou3519	2025-01-10 07:45:57 +00:00
Shivam Raikundalia	f295eff512	[Profiler] Hide Kineto Step Tracker Behind Env Var (#144494 ) Summary: To support iteration-based on-demand we have step tracker hooks for both the scheduler and for the optimizer to control Kineto's backend FSM. We already hide the optimizer step tracker behind an ENV_VAR to prevent any extra overhead from the frontend profiler down to the kineto backend, but we don't do any such thing for the profiler step tracker. Having both auto-trace and on-demand occurring at the same time also seems to occasionally cause errors in the FSM. To remedy this issue, let's put in a patch to guard the step incrementer for the frontend step function. This will bypass all of the on-demand logic, which shouldn't occur in auto-trace. Test Plan: Ran `buck run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test -- --enable_profiling --trace_handler=auto_trace --with_stack` and added prints in on-demand functions (performLoopStep and collectTrace) and saw that neither were called even though they were called on main. Also got following healthy traces: Auto-Trace (schedule-based): https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jan_09_12_43_37.1122140.pt.trace.json.gz&bucket=gpu_traces Timing Based On-demand: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456722/localhost/libkineto_activities_1286261.json.gz&bucket=gpu_traces Iteration Based On-demand: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456889/localhost/libkineto_activities_1304781.json.gz&bucket=gpu_traces Differential Revision: D67990080 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144494 Approved by: https://github.com/ngimel	2025-01-10 07:00:56 +00:00
xinan.lin	8cc8989b26	[Inductor UT] Generalize newly introduced device-bias hard code in (#144456 ) Re-land #143975. Fix "cuda" hard code in test_pattern_matcher.py introduced by #139321 Fix #143974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144456 Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel ghstack dependencies: #144457	2025-01-10 06:55:44 +00:00
xinan.lin	e5111d0430	[Inductor UT] Add expected failure for newly added case on XPU, align CUDA. (#144457 ) The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU. We add the expected failure here because it fails for the same reason as on CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144457 Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1	2025-01-10 06:55:44 +00:00
Zhang, Jianyi	eddf83559e	[Intel GPU][Inductor] Convert Conv1D to 2D in inductor (#144140 ) Layout optimization in inductor does not apply to Conv1D. We convert Conv1D to channel last Conv2D for better performance on Intel GPU. For example, demucs fp16 inference in torchbench can improve from 149ms to 91ms on Max 1100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144140 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire	2025-01-10 06:50:46 +00:00
bobrenjc93	fbad833538	Migrate from Tuple -> tuple in test/distributed/_composable (#144254 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144254 Approved by: https://github.com/aorenste	2025-01-10 06:38:05 +00:00
bobrenjc93	3b6b306b71	Migrate from Tuple -> tuple in torch/testing (#144256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256 Approved by: https://github.com/aorenste	2025-01-10 06:37:55 +00:00
Yu, Guangye	493a52cb72	Refine torch.xpu.get_device_properties API error message (#144379 ) # Motivation Remove the redundant error message. Without this PR: ```python >>> import torch >>> torch.xpu.get_device_name(1) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name return get_device_properties(device).name File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 258, in get_device_properties raise AssertionError("Invalid device index") AssertionError: Invalid device index ``` With this PR: ```python >>> import torch >>> torch.xpu.get_device_name(1) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name return get_device_properties(device).name File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 257, in get_device_properties return _get_device_properties(device) # type: ignore[name-defined] # noqa: F821 RuntimeError: The device index is out of range. It must be in [0, 1), but got 1. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144379 Approved by: https://github.com/EikanWang	2025-01-10 06:27:51 +00:00
Nicolas Macchioni	4375c2c534	Cleanup gpt_fast benchmark (#144517 ) This is an exact copy of https://github.com/pytorch/pytorch/pull/144484, I bricked the last PR running ghstack land :( Pull Request resolved: https://github.com/pytorch/pytorch/pull/144517 Approved by: https://github.com/davidberard98, https://github.com/huydhn	2025-01-10 05:22:13 +00:00
Ryan Guo	c8595ba7d0	[dynamo] Avoid graph break on updates to `obj.__dict__` (#144419 ) `obj.__dict__` is handled specially in Dynamo, and prior to this patch we only support read and membership check on that dictionary object. This patch adds support for writes and some documentation. Fixes #143756. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419 Approved by: https://github.com/jansel, https://github.com/anijain2305	2025-01-10 05:22:04 +00:00
Valentine233	d100a92d33	[CPU][Brgemm] add support for int8 brgemm (#143384 ) For INT8 SDPA kernel usage, we add support for INT8 Brgemm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143384 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/ezyang	2025-01-10 04:20:26 +00:00
Scott Wolchok	0529908f13	Remove is_reduced_floating_point from namespace std (#144502 ) Partial fix for #144495. Avoiding BC-break using existing practice of removing only if FBCODE_CAFFE2 and C10_NODEPRECATED are not defined. Differential Revision: [D67992342](https://our.internmc.facebook.com/intern/diff/D67992342/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144502 Approved by: https://github.com/malfet	2025-01-10 03:24:10 +00:00
cyy	9a841f9321	Enable bugprone-unchecked-optional-access (#144226 ) We can actually enable bugprone-unchecked-optional-access without the risk of hang. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226 Approved by: https://github.com/albanD	2025-01-10 03:16:56 +00:00
Aaron Orenstein	9f09b719d3	Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483 Approved by: https://github.com/Skylion007	2025-01-10 02:31:43 +00:00
Scott Wolchok	898fcb4590	Simplify vec128 bfloat16/half fmadds (#144486 ) I was being silly when I wrote these; it doesn't make sense to do four conversions and two FMAs when we could do a multiply and an add. Differential Revision: [D67985074](https://our.internmc.facebook.com/intern/diff/D67985074/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144486 Approved by: https://github.com/malfet	2025-01-10 02:25:57 +00:00
Yiming Zhou	d1b64ec326	[export] Fix sym_bool serialization (#144295 ) Summary: When there is a `torch._check()` that checks if a sym_int is equal to some constant, it will generate 3 nodes in the graph with target `operator.ge`, `operator.le` and `operator.eq`. These operators belong to `_SYM_BOOL_OPS` but the `meta_val` of these nodes is `bool` instead of `torch.SymBool`. Similar things can happen to `torch.SymInt`, where a `node.target` belongs to `_SYM_INT_OPS` but `node.meta["val"]` is an `int` instead of `torch.SymInt`. Therefore, we need to check both `meta_val` type and `node.target` type during serialization. Test Plan: ``` buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_bool_torch_check_equal buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_int_torch_check_equal ``` Differential Revision: D67883754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144295 Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi	2025-01-10 02:07:54 +00:00
Yu, Guangye	6de110b862	Support with statement on torch.Stream (#140138 ) # Motivation We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`. With this PR, we can do like this ```python s1= torch.Stream() # Set s1 to the current stream torch.accelerator.set_stream(s1) with torch.Stream() as s2: # Inside with statement, we set s2 to the current stream assert torch.accelerator.current_stream() == s2 # Here the current stream should be s1 assert torch.accelerator.current_stream() == s1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138 Approved by: https://github.com/albanD	2025-01-10 02:05:19 +00:00
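The save-and-restore behaviour described in the commit message above is the usual context-manager pattern. The following is a pure-Python sketch of that pattern only, not the `torch.Stream` implementation: `FakeStream` and its class-level `_current` slot are hypothetical stand-ins for a stream and the per-device current stream.

```python
class FakeStream:
    """Hypothetical stand-in for a stream that can be used in a with statement."""

    _current = None  # stand-in for the "current stream" of the device

    def __enter__(self):
        # Remember whatever was current, then make this stream current.
        self._prev = FakeStream._current
        FakeStream._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the previous stream, even if the body raised.
        FakeStream._current = self._prev
        return False  # do not swallow exceptions


s1 = FakeStream()
FakeStream._current = s1          # s1 is the current stream

with FakeStream() as s2:
    inside = FakeStream._current  # s2 is current inside the block

outside = FakeStream._current     # s1 is current again after the block
print(inside is s2, outside is s1)
```

Storing the previous stream on the instance keeps the sketch short, but it means the same object cannot be safely nested inside itself; a real implementation would need a stack or similar bookkeeping.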
drisspg	04cb19d225	Add instantiation level to CutlassArgs (#144506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144506 Approved by: https://github.com/huydhn	2025-01-10 02:01:40 +00:00
PyTorch MergeBot	87c1f76e63	Revert "Migrate from Tuple -> tuple in torch/_decomp (#144260 )" This reverts commit 8db67e03193dd1dbf7ca80cf0eb2f904e18e25ec. Reverted https://github.com/pytorch/pytorch/pull/144260 on behalf of https://github.com/kit1980 due to Lots of inductor failures ([comment](https://github.com/pytorch/pytorch/pull/144260#issuecomment-2581572235))	2025-01-10 01:47:29 +00:00
Guilherme Leobas	bf6dd955cd	Fix max(map(...)) (#142443 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142443 Approved by: https://github.com/zou3519	2025-01-10 01:44:37 +00:00
Nikita Shulga	1dd1d532ba	[BE] Fix extra-semi warnings in int4mm_kernel.cpp (#144510 ) Fixes ``` In file included from /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/native/cpu/int4mm_kernel.cpp.DEFAULT.cpp:1: /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cpu/int4mm_kernel.cpp:998:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi] }; ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144510 Approved by: https://github.com/kit1980	2025-01-10 01:17:31 +00:00
Xu Han	bd1f5d1c32	update xnnpack for disable libm on Windows [submodule XNNPACK] (#141943 ) This PR implements the RFC: https://github.com/pytorch/pytorch/issues/141946 Changes: 1. Update `XNNPACK` to include its PRs: https://github.com/google/XNNPACK/pull/7456, https://github.com/google/XNNPACK/pull/7535 and other build-fixing PRs. 2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, which turns off the CMake find_library(libm) call in `XNNPACK`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943 Approved by: https://github.com/atalman	2025-01-10 00:47:41 +00:00
bobrenjc93	8db67e0319	Migrate from Tuple -> tuple in torch/_decomp (#144260 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144260 Approved by: https://github.com/aorenste	2025-01-10 00:13:15 +00:00
bobrenjc93	3607ff2c1d	Migrate from Tuple -> tuple in benchmarks/instruction_counts/core (#144253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144253 Approved by: https://github.com/aorenste	2025-01-10 00:12:23 +00:00
bobrenjc93	a55977f763	Migrate from Tuple -> tuple in torch/ao (#144265 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144265 Approved by: https://github.com/aorenste	2025-01-10 00:12:06 +00:00
Benjamin Glass	08eaaa61ea	Inductor dashboard benchmarks: swap unused freeze_autotune_cudagraphs workflow for cppwrapper workflow (#144427 ) GitHub limits us to 10 inputs per workflow_dispatch job, so this PR swaps out an input that is no longer used for the cppwrapper input. See [the HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2002%20Jan%202025%2016%3A30%3A07%20GMT&stopTime=Thu%2C%2009%20Jan%202025%2016%3A30%3A07%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/53/orig&lCommit=4c3d3ad3c7886cbda9705b41c6db5fa7da0d6fe9&rBranch=main&rCommit=00df63f09f07546bacec734f37132edc58ccf574) for an example showing that it works and displays sane output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144427 Approved by: https://github.com/desertfire, https://github.com/huydhn	2025-01-09 23:56:00 +00:00
Shangdi Yu	66ce13b497	Revert D67299312: Multisect successfully blamed "D67299312: [AoTI Minifier] UX Improvement" for one test failure (#144475 ) Summary: This diff partially reverts D67299312 D67299312: [AoTI Minifier] UX Improvement by yushangdi causes the following test failure: Differential Revision: D67963019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144475 Approved by: https://github.com/zhxchen17, https://github.com/angelayi	2025-01-09 23:27:55 +00:00
Nikita Shulga	91cbeb7db9	[MPSInductor] Fix `masked`/`where` for inf values (#144500 ) Move the constant-to-value logic to the `value_to_metal` function (similar to `value_to_cpp`). Call it from the `constant` op as well as the `where` op (which is in turn called from the `masked` op). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144500 Approved by: https://github.com/dcci	2025-01-09 23:11:06 +00:00
Wanchao Liang	b1c2c3967a	[dtensor] deprecate _shard_tensor to use src_data_rank=None (#144171 ) As titled: we can achieve no-comm sharding for the inference case with src_data_rank=None, so deprecate the private API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144171 Approved by: https://github.com/awgu	2025-01-09 22:26:45 +00:00
Shangdi Yu	379b54603a	[Inductor] [bc-breaking] Node Level provenance tracking (#144277 ) Summary: - use GraphTransformObserver + replace_node hooks to track node sources when they are replaced - add pre_grad_graph tracking to tlparse - add the node provenance information to post_grad_graph tlparse. This is for the frontend to create a mapping between pre_grad and post_grad graph. See an example frontend (this is just a prototype) here: https://drive.google.com/file/d/1cMHH_0y4FJUSS9tATwGQvA72O0Lth8eh/view?usp=sharing - change "action" of NodeSource from a single action to a list of actions. - It's BC-Breaking because we removed `GraphTransformObserver`'s class methods `on_node_erase` and `on_node_erase` . https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?tab=t.0 The front-end code that takes in the tlparse result is in https://github.com/yushangdi/compiler_explorer. ghstack-source-id: 260390519 Test Plan: ``` buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r node_source buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance ``` Front-end example screenshots on a real model, 93% coverage rate between pre_grad_graph and post_grad_graph {F1973584210}{F1973584209} ``` buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark MODEL_ENTITY_ID=644688112 SNAPSHOT_ID=32 MODULE=merge TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=7 TORCH_LOGS="+inductor,+schedule,output_code,graph_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/ec86b05dd59e84db/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --local-model 
/home/bahuang/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize ``` Differential Revision: D65006709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144277 Approved by: https://github.com/desertfire	2025-01-09 22:06:51 +00:00
Eddie Yan	28b1960d49	[CUDA] parse arch-conditional compute-capability when building extensions (#144446 ) don't choke on arch-conditional compute capabilities e.g., `sm_90a`: #144037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144446 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-01-09 22:05:18 +00:00
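To illustrate what tolerating an arch-conditional suffix can look like, here is a small regex sketch. It is not the code from the PR: `parse_arch` and the accepted string forms are assumptions made for this example.

```python
import re

# Optional "sm_" / "compute_" prefix, digits for major and minor version,
# and an optional single-letter suffix such as the "a" in sm_90a.
_ARCH_RE = re.compile(r"^(?:sm_|compute_)?(\d+)(\d)([a-z]?)$")


def parse_arch(arch):
    """Split an arch string into (major, minor, suffix). Hypothetical helper."""
    match = _ARCH_RE.match(arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    major, minor, suffix = match.groups()
    return int(major), int(minor), suffix


print(parse_arch("sm_90a"))
print(parse_arch("sm_86"))
```

The last digit is treated as the minor version and everything before it as the major version, so a three-digit arch is split into a two-digit major and a one-digit minor.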
drisspg	206a932f23	[Submodule] Upgrade to Cutlass 3.6 (#144180 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-01-09 21:56:53 +00:00
Richard Barnes	3e7e435bb1	[codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/144371 Approved by: https://github.com/Skylion007	2025-01-09 21:49:17 +00:00
PyTorch MergeBot	f71688f30d	Revert "[Submodule] Upgrade to Cutlass 3.6 (#144180 )" This reverts commit f2c103317814eecf2b622e322e4d0877c16af943. Reverted https://github.com/pytorch/pytorch/pull/144180 on behalf of https://github.com/huydhn due to Ops, this fails some slow tests. Please help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/144180#issuecomment-2581302233))	2025-01-09 21:45:39 +00:00
Aleksei Nikiforov	127f836881	S390x cancelled jobs cleanup (#144149 ) Sometimes job is cancelled during nested docker container creation. This leads to nested docker container not being stopped and worker hanging forever in the job. Improve nested docker containers cleanup for these cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144149 Approved by: https://github.com/seemethere	2025-01-09 20:45:19 +00:00
Leo Yang	40305dd37e	[onnx] Fix bug for exporting torch.cdist into onnx and support 'compute_mode' (#144213 ) ### Fix bug for exporting torch.cdist and support 'compute_mode' In [cdist](https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset9.py#L6181), the 'compute_mode' was ignored, which leads to a big difference in computation flow between the original torch.cdist and the exported onnx file when computing Euclidean distance (p=2). For computing Euclidean distance, running the exported onnx model will be 10x slower than running torch.cdist directly, and is also very likely to cause CUDA OOM unnecessarily for larger matrices. This change exports the same onnx computation flow as the forward of torch.cdist, defined in the forward implementation (`9225f149eb/aten/src/ATen/native/Distance.cpp (L66-L149)`), under every compute_mode. Fixes #144212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144213 Approved by: https://github.com/justinchuby	2025-01-09 20:07:20 +00:00
atalman	2b241a8206	Amazon Linux 2023: Preload cusparseLt.so (#144477 ) Fixes https://github.com/pytorch/pytorch/issues/144433 Test with some debug statements added: ``` >>> import torch trying to load libcublas.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12'] trying to load libcublas.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12 trying to load libcudnn.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9'] trying to load libcudnn.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9 trying to load libnvrtc.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12'] trying to load libnvrtc.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12 trying to load libcudart.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12'] trying to load libcudart.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12 trying to load libcupti.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12'] trying to load libcupti.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12 trying to load libcufft.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11'] trying to load libcufft.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11 trying to load libcurand.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10'] trying to load libcurand.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10 trying to load libnvJitLink.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12'] trying to load libnvJitLink.so.[0-9] from 
/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 trying to load libcusparse.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12'] trying to load libcusparse.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12 trying to load libcusparseLt.so.[0-9] from [] trying to load libcusparseLt.so.[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0 trying to load libcusolver.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11'] trying to load libcusolver.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11 trying to load libnccl.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2'] trying to load libnccl.so.[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 trying to load libnvToolsExt.so.[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1'] trying to load libnvToolsExt.so.[0-9] from /usr/local/lib/python3.9/site- packages/nvidia/nvtx/lib/libnvToolsExt.so.1 /usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) cpu = _conversion_method_template(device=torch.device("cpu")) >>> exit() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia	2025-01-09 20:04:11 +00:00
Guilherme Leobas	6bc17b0725	Update #graph breaks for moco benchmark (#144266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144266 Approved by: https://github.com/zou3519	2025-01-09 18:51:13 +00:00
Aaron Gokaslan	0e02e6f95f	[BE]: Remove redundant contiguous copy in torch/_decomp/decompositions (#144472 ) Removes a redundant extra copy by calling contiguous. Instead, just add a memory_format flag to the dtype cast. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144472 Approved by: https://github.com/awgu, https://github.com/cyyever, https://github.com/malfet	2025-01-09 18:50:00 +00:00
Aaron Gokaslan	307ca094c9	[BE]: Remove redundant contiguous copy in flex attention (#144467 ) Removes a redundant potential copy, instead use memory_format kwarg to fuse both operations into a single copy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144467 Approved by: https://github.com/awgu	2025-01-09 18:30:09 +00:00
Aaron Gokaslan	bbec35f028	[BE]: Replace clone detach with detach clone to be more efficient (#144469 ) Follow up to #144270 and fix some vulkan code Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469 Approved by: https://github.com/awgu	2025-01-09 18:28:39 +00:00
Colin L. Rice	73278e6a5d	easy: sort dictionary keys for inductor config when publishing (#143307 ) This means we should get consistent logging strings for the same config on different ranks Pull Request resolved: https://github.com/pytorch/pytorch/pull/143307 Approved by: https://github.com/xmfan	2025-01-09 18:01:20 +00:00
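A short sketch of the idea, assuming JSON is used for serialization (the commit message does not say which format is used). `config_log_string` is a hypothetical helper and the `debug` key is invented for the example.

```python
import json


def config_log_string(config):
    """Serialize a config dict with sorted keys so insertion order does not matter."""
    return json.dumps(config, sort_keys=True)


# Same settings, inserted in a different order on two ranks.
rank0_config = {"max_autotune": True, "debug": False}
rank1_config = {"debug": False, "max_autotune": True}

print(config_log_string(rank0_config))
print(config_log_string(rank0_config) == config_log_string(rank1_config))
```

Without `sort_keys=True` the two strings would follow each dict's insertion order and could differ even though the configs are equal.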
Colin L. Rice	84443bd61a	feature_use: Remove JK from naming for feature use. (#143529 ) See discussion in https://github.com/pytorch/pytorch/pull/142819 but TL;DR, since we're logging use but not direct JK reads, it's less confusing to use the logging Pull Request resolved: https://github.com/pytorch/pytorch/pull/143529 Approved by: https://github.com/ezyang	2025-01-09 17:58:22 +00:00
Mikayla Gawarecki	b8f383107e	Link to transformer tutorial in transformer docs (#144425 ) <img width="1045" alt="Screenshot 2025-01-08 at 4 50 20 PM" src="https://github.com/user-attachments/assets/05adfecb-8a23-4c48-9a2c-50c5b3f886b0" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144425 Approved by: https://github.com/albanD	2025-01-09 17:42:09 +00:00
drisspg	f2c1033178	[Submodule] Upgrade to Cutlass 3.6 (#144180 ) Differential Revision: [D67866269](https://our.internmc.facebook.com/intern/diff/D67866269) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-01-09 17:29:58 +00:00
Jithun Nair	1365ae859c	[ROCm][CI] upgrade CI to ROCm 6.3 (#142152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142152 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-09 17:14:16 +00:00
cyy	b0be30dd79	[19/N] Fix extra warnings brought by clang-tidy-17 (#144448 ) Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448 Approved by: https://github.com/albanD	2025-01-09 15:58:05 +00:00
Davide Italiano	1353f3beb4	[mps/inductor] Add support for fmod(). (#144449 ) 397 -> 395 tests failing. The `static_cast<>` is needed because `fmod()` has several overloads and the call is otherwise ambiguous. I wonder if we should take NaN propagation into account (maybe it's not tested). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144449 Approved by: https://github.com/malfet	2025-01-09 15:47:41 +00:00
Howard Huang	9631d1a021	[pipelining] throw error with ZB and compile (#143599 ) Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so this raises an error while I am still investigating the cause / design for a fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599 Approved by: https://github.com/wconstab	2025-01-09 06:53:25 +00:00
PyTorch MergeBot	3797143e06	Revert "[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224 )" This reverts commit fabf2ea12e18bad3297e2810b77417d71c2a360b. Reverted https://github.com/pytorch/pytorch/pull/144224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that some ARM tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/144224#issuecomment-2579260377))	2025-01-09 06:20:31 +00:00
Davide Italiano	6f28e466f3	[mps/inductor] Add support for tanh(). (#144443 ) Fixes test_tanh() in the inductor testsuite. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144443 Approved by: https://github.com/malfet	2025-01-09 06:14:03 +00:00
Simon Fan	7f1946aa9b	[aot] don't dce aten rng nodes (#144319 ) FIXES https://github.com/pytorch/pytorch/issues/143431 For the aot_eager backend, we dce twice in aot. The first dce errs on the side of caution and provides a restrictive dce function: `2e1ea8598f/torch/fx/experimental/proxy_tensor.py (L1173)` The second one is more aggressive: `2e1ea8598f/torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py (L185)` But this deviates from eager accuracy when rand ops are dce'd. The repro doesn't work for inductor, but that's a separate issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144319 Approved by: https://github.com/jansel	2025-01-09 05:27:49 +00:00
Dmitry Nikolaev	d4871750d9	[ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673 )
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\_gather_dim_\ (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\_scatter_dim_\ (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py: declare a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony Co-authored-by: saienduri <saimanas.enduri@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-09 05:18:57 +00:00
Ding, Yi1	0d08084f1a	[Inductor] Add convolution output size checking to the meta function (#144225 ) Fixes #144013 Adds a size check to the meta function, similar to the one in the CUDA/CPU aten op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144225 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-01-09 04:20:06 +00:00
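Illustrative note on #144225: a shape-only (meta) function can reject a convolution whose computed output size is not positive before any kernel runs. The sketch below uses the standard output-size formula for one spatial dimension. The function name and error text are hypothetical, and the exact condition checked in PyTorch is an assumption here, not taken from the PR.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    # Standard convolution output-size formula for one spatial dimension.
    out = (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
    # Assumed check: a non-positive output size means the kernel does not fit.
    if out <= 0:
        raise ValueError(
            f"calculated output size {out} is too small "
            f"(input={in_size}, kernel={kernel})"
        )
    return out


fits = conv_output_size(5, 3)  # a 3-wide kernel fits a 5-wide input

try:
    conv_output_size(2, 5)     # a 5-wide kernel does not fit a 2-wide input
    rejected = False
except ValueError:
    rejected = True
```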
Xia, Weiwen	fabf2ea12e	[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224 )
Summary
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is one of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves binary post op fusion of qlinear out of the lowering pass. This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.
Test plan
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224 Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168 ghstack dependencies: #143903	2025-01-09 03:27:09 +00:00
Xinya Zhang	bc576355a2	Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443 ) We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch. This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch outside a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var. Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-09 00:00:02 +00:00
Andrew Gu	8ac005ddb8	[DTensor] Add `aten.view.dtype` op support (#144404 ) Fixes https://github.com/pytorch/pytorch/issues/144286 Viewing a tensor to a different dtype does not require any redistribution and can use the default strategy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144404 Approved by: https://github.com/wanchaol	2025-01-08 23:11:22 +00:00
Xuehai Pan	dcc3cf7066	[BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415 ) The fixes are generated by:
```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-08 21:55:00 +00:00
titaiwangms	a742859fc2	[ONNX] Update images and APIs to onnx_dynamo.rst (#144358 ) Update the result image of exporting, and delete the functions/class that belongs to `torch.onnx.dynamo_export` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144358 Approved by: https://github.com/justinchuby, https://github.com/malfet	2025-01-08 21:44:43 +00:00
Brian Muse	a5164a2b18	[BE] Clean up ExecuTorch Export Docstring (#141490 ) Summary: I noticed when looking at the docs for [`torch.export.load`](https://pytorch.org/docs/stable/_modules/torch/export.html#load) that it looked like there was a copy and paste error from the save command docstring since ep is not an actual parameter for load and it says "The exported program to save." This diff removes it from the docstring. Test Plan: Automated Testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/141490 Approved by: https://github.com/JacobSzwejbka	2025-01-08 21:28:58 +00:00
Will Constable	8c5d992772	[Pipelining] Refactor pp composability test to use faster MPCT (#144345 )
* Using the MultiProcessContinuousTest (MPCT) base class is faster (60s vs 279s for the full run of `test_manual_with_data_parallel` and all its parametrizations)
* Have to move to a new file to use MPCT since it requires a different launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since `test/_composable/test_composability/test_pp_composability` is an annoyingly long path
* Rename `test_manual_with_data_parallel` to `test_pp_dp` for simplicity/consistency with newer test names. (manual refers to not using tracer frontend, but that's not so important to call out in the test name)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345 Approved by: https://github.com/H-Huang, https://github.com/mori360	2025-01-08 20:50:12 +00:00
LlamaFarm	c194e5c986	Remove extra copy torch/_prims (#144407 ) updated _reshape_aten Pull Request resolved: https://github.com/pytorch/pytorch/pull/144407 Approved by: https://github.com/awgu	2025-01-08 20:14:48 +00:00
Randolf Scholz	628acc4ace	`Dirichlet.mode`: use `dim=` instead of `axis=` (#144402 ) `axis=` is undocumented and will raise typing errors when #144197 is merged. See: https://github.com/pytorch/pytorch/pull/144197#pullrequestreview-2537398866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144402 Approved by: https://github.com/Skylion007	2025-01-08 20:14:01 +00:00
Natalia Gimelshein	ab1f627aa4	fix randint distribution for large max (#143787 ) Fixes #ISSUE_NUMBER Similar to #143682, for large maximum values we were sampling integers via %, which doesn't provide a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`). This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it. `torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better. `__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787 Approved by: https://github.com/eqy	2025-01-08 18:51:48 +00:00
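Illustrative note on #143787: reducing a uniformly random `bits`-wide integer with `% n` over-represents the smallest residues whenever `n` does not divide `2**bits`. The sketch below computes that skew exactly in plain Python. `modulo_skew` is a hypothetical helper for illustration and is not part of PyTorch.

```python
def modulo_skew(n: int, bits: int = 32) -> float:
    """Relative excess probability of the most likely value over the least
    likely one when sampling `uniform_random_bits % n`."""
    if not 0 < n <= 2 ** bits:
        raise ValueError("n must be in (0, 2**bits]")
    # Residues below r are hit q + 1 times, all others q times.
    q, r = divmod(2 ** bits, n)
    if r == 0:
        return 0.0  # n divides 2**bits: perfectly uniform
    return 1.0 / q  # (q + 1) / q - 1


at_threshold = modulo_skew(2 ** 25)         # 2**32 / 128 divides evenly
just_below = modulo_skew(2 ** 25 - 1)       # 1/128, about 0.78%
large_range = modulo_skew(3 * 2 ** 30)      # some values are twice as likely
```

For ranges up to `2**32 / 128` the skew of a 32-bit draw stays at or below 1/128, which matches the "approx 1%" bound quoted in the commit message; larger ranges need more random bits to stay within it.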
Shangdi Yu	0e1675a89b	Relax aten.to restriction (#142420 ) Summary: if we have a.to(b), and b has a different dtype from a, then it must be a copy. In this case, we do not need to freeze the tensor. Instead, we use torch.ops.aten._assert_tensor_metadata.default to ensure that a must not have the same dtype as b. Fixes https://github.com/pytorch/pytorch/issues/139718 Update executorch pin to include https://github.com/pytorch/executorch/pull/7277. Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_float_conversion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_device_to_mutation_float
```
Differential Revision: D66988295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142420 Approved by: https://github.com/bdhirsh	2025-01-08 18:11:31 +00:00
Randolf Scholz	768d73f692	use `torch.special.xlogy` to implement `x_log_x` (#144220 ) Fixes #144279 Using `x* x.log()` does not produce the correct value when `x=0`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144220 Approved by: https://github.com/Skylion007	2025-01-08 17:41:55 +00:00
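Illustrative note on #144220: under IEEE 754 arithmetic `0 * log(0)` is `0 * -inf`, which is NaN, while xlogy-style functions define the result as 0 when `x == 0`. The sketch below reproduces both behaviours with scalar floats in plain Python. The helper names are hypothetical stand-ins and this is not the `torch.special.xlogy` implementation.

```python
import math


def naive_x_log_x(x: float) -> float:
    # For x >= 0. float("-inf") stands in for log(0), as tensor libraries
    # return it; 0.0 * -inf evaluates to NaN.
    log_x = math.log(x) if x > 0 else float("-inf")
    return x * log_x


def xlogy(x: float, y: float) -> float:
    # xlogy convention: the result is 0 when x == 0 (unless y is NaN).
    if x == 0 and not math.isnan(y):
        return 0.0
    return x * math.log(y)


naive_at_zero = naive_x_log_x(0.0)  # NaN
fixed_at_zero = xlogy(0.0, 0.0)     # 0.0
```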
cyy	d0070ca07e	[18/N] Fix extra warnings brought by clang-tidy-17 (#144014 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-01-08 17:21:55 +00:00
Aaron Gokaslan	373541fbf4	[BE]: Remove unnecessary copy of gradients in util (#144329 ) No need to copy gradients to CPU too Pull Request resolved: https://github.com/pytorch/pytorch/pull/144329 Approved by: https://github.com/awgu, https://github.com/cyyever	2025-01-08 16:52:15 +00:00
atalman	e14c36d3f4	Set maximum supported version of Python as 3.13 (#144396 ) Same as https://github.com/pytorch/pytorch/pull/119743 Required for Release 2.6.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144396 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet	2025-01-08 16:16:10 +00:00
Xinya Zhang	3068ce0337	ROCm SDPA: Ensure attn_mask has the same dtype with q (#143242 ) This is required by current AOTriton's backend. Fixes NaN when calling SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch Corresponding CUDA check seems to be here: `708ce3c008/aten/src/ATen/native/transformers/cuda/attention.cu (L1331-L1336)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143242 Approved by: https://github.com/jeffdaily	2025-01-08 15:20:26 +00:00
Nikita Shulga	708ce3c008	Add `is_dtype_supported` predicate to DeviceInterface (#144355 ) By default it returns true unless the dtype is bf16. For the MPS device it returns false if the dtype is double. Checked by refactoring `test_inf`, which should expect a TypeError to be raised if invoked with an unsupported dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144355 Approved by: https://github.com/jansel, https://github.com/dcci	2025-01-08 13:59:46 +00:00
Davide Italiano	8fc0ffe54b	[mps/inductor] Add support for rsqrt(). (#144374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144374 Approved by: https://github.com/malfet	2025-01-08 13:58:05 +00:00
William Wen	f700035090	[3.13t] use sysconfig to check for Python nogil builds (#144361 ) `sys._is_gil_enabled()` wasn't working in certain cases, according to @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/144361 Approved by: https://github.com/atalman	2025-01-08 13:00:32 +00:00
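Illustrative note on #144361: a free-threaded (nogil) CPython build can be detected at build-configuration level through `sysconfig`, independently of whether the GIL is enabled at runtime. A minimal sketch follows. `is_free_threaded_build` is a hypothetical helper name and the exact check used in the PR is not shown in this log.

```python
import sysconfig


def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise
    # (the variable is absent on interpreters older than 3.13).
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


free_threaded = is_free_threaded_build()
```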
George Wigley	a5051a9521	Update torch.masked.mean to upcast dtype for bool tensors (#139999 ) When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1), leading to an incorrect mean being calculated. The below example shows how the incorrect result occurs:
```
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64))  # 2
total = torch.sum(a, dtype=torch.bool)  # True (1)
mean = total / count  # 0.5
```
This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999 Approved by: https://github.com/cpuhrsch	2025-01-08 10:35:19 +00:00
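Illustrative note on #139999: the effect of accumulating in bool versus in an integer type can be mimicked with plain Python values. This is a sketch only: `bool(sum(...))` emulates a sum clamped to True, and none of the names below come from PyTorch.

```python
values = [True, True]
count = len(values)

# Emulates a sum accumulated in bool: any non-zero total collapses to True (1).
clamped_total = bool(sum(values))
wrong_mean = clamped_total / count

# Upcasting each element to int before summing keeps the real total.
int_total = sum(int(v) for v in values)
right_mean = int_total / count
```

Here `wrong_mean` is 0.5 and `right_mean` is 1.0, which is the mean of two True values.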
Xiaodong Wang	60a505022f	[AMD] SDPA internal changes (#144320 ) Summary: All the internal changes needed to enable flash attention w/ SDPA in fbcode. Test Plan:
```
TORCH_ROCM_FA_PREFER_CK=1 buck run -m rocm621 mode/opt-amd-gpu scripts/xdwang/example:sdpa

| Batch Size | Sequence Length | Heads | Head Dim | Flash Time (µs) | Math Time (µs) | xformers Time (µs) | Flash TFlops | Math TFlops | xformers TFlops | Speedup (Flash/Math) | Speedup (xformers/Math) | xformers trace_url | Flash trace_url |
|------------|-----------------|-------|----------|-----------------|----------------|--------------------|--------------|-------------|-----------------|----------------------|-------------------------|--------------------|-----------------|
| 1 | 4096 | 32 | 64 | 455.552 | 7748.76 | 513.449 | 301.698 | 17.7369 | 267.678 | 17.0096 | 15.0916 | | |
| 1 | 4096 | 16 | 128 | 329.971 | 4741.11 | 386.049 | 416.519 | 28.9888 | 356.014 | 14.3683 | 12.2811 | | |
| 1 | 8192 | 32 | 64 | 1455.76 | 31869.6 | 1665.49 | 377.642 | 17.2501 | 330.087 | 21.8921 | 19.1353 | | |
| 1 | 8192 | 16 | 128 | 1265.77 | 18972.8 | 1479.48 | 434.325 | 28.976 | 371.588 | 14.9891 | 12.824 | | |
| 1 | 16384 | 32 | 64 | 5732.99 | 121861 | 6816.77 | 383.573 | 18.0453 | 322.59 | 21.2562 | 17.8767 | | |
| 1 | 16384 | 16 | 128 | 4749.69 | 73776.4 | 5404.03 | 462.982 | 29.8066 | 406.923 | 15.5329 | 13.6521 | | |

| Batch Size | Sequence Length | Heads | Head Dim | Flash Time (µs) | Math Time (µs) | xformers Time (µs) | Flash TFlops | Math TFlops | xformers TFlops | Speedup (Flash/Math) | Speedup (xformers/Math) | xformers trace_url | Flash trace_url |
|------------|-----------------|-------|----------|-----------------|----------------|--------------------|--------------|-------------|-----------------|----------------------|-------------------------|--------------------|-----------------|
| 1 | 4096 | 32 | 64 | 1615.41 | 8342.67 | 1822.72 | 212.7 | 41.1855 | 188.508 | 5.16443 | 4.57705 | | |
| 1 | 4096 | 16 | 128 | 1357.97 | 5943.53 | 1432.34 | 253.022 | 57.8104 | 239.886 | 4.37676 | 4.14953 | | |
| 1 | 8192 | 32 | 64 | 5556.5 | 31726.7 | 6502.17 | 247.348 | 43.3197 | 211.374 | 5.70984 | 4.8794 | | |
| 1 | 8192 | 16 | 128 | 5186 | 22529.4 | 5590.36 | 265.019 | 61.0044 | 245.85 | 4.34427 | 4.03004 | | |
| 1 | 16384 | 32 | 64 | 22527.7 | 130413 | 26527.6 | 244.035 | 42.155 | 207.239 | 5.789 | 4.91613 | | |
| 1 | 16384 | 16 | 128 | 18347.9 | 87553.2 | 20358 | 299.628 | 62.791 | 270.044 | 4.77184 | 4.30068 | | |
```
Reviewed By: leitian, feikou, yoyoyocmu, sijiac Differential Revision: D67262726 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144320 Approved by: https://github.com/jianyuh, https://github.com/eqy, https://github.com/leitian	2025-01-08 09:29:28 +00:00
PyTorch MergeBot	7d9f26de05	Revert "Unskipped multiple inductor tests for ROCm (#143581 )" This reverts commit e05d67790ee4a53c310322829631c000f0ac2985. Reverted https://github.com/pytorch/pytorch/pull/143581 on behalf of https://github.com/huydhn due to There is some tests failing on ROCm jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143581#issuecomment-2577163274))	2025-01-08 09:15:14 +00:00
Davide Italiano	aaf56152ea	[cpu/sorting] Throw an error when trying to sort complex numbers. (#144113 ) It doesn't really make sense to sort complex numbers as they are not comparable. Fixes #129296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144113 Approved by: https://github.com/malfet	2025-01-08 05:15:36 +00:00
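Illustrative note on #144113: plain Python has the same restriction. Complex numbers define no ordering, so `sorted` raises a `TypeError` unless an explicit real-valued key is given. A short sketch follows, with `try_sort` as a hypothetical helper.

```python
def try_sort(values):
    # Complex numbers do not support "<", so sorted() raises TypeError.
    try:
        return sorted(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


outcome = try_sort([3 + 4j, 1 + 0j])

# Sorting is well defined once a real-valued key is chosen, e.g. magnitude.
by_magnitude = sorted([3 + 4j, 1 + 0j], key=abs)
```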
titaiwangms	78eded8e00	[ONNX] Use torch.export.Dim.AUTO in dynamo_export (#144356 ) Align to the changes in https://github.com/pytorch/pytorch/pull/143158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144356 Approved by: https://github.com/justinchuby	2025-01-08 05:00:16 +00:00
bobrenjc93	90e81a157a	Migrate from Tuple -> tuple in torch/utils/data (#144255 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144255 Approved by: https://github.com/andrewkho	2025-01-08 04:09:45 +00:00
Animesh Jain	8ccf3f6f3f	[dynamo][easy] Move dict tests to test_dicts.py (#144165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144165 Approved by: https://github.com/jansel ghstack dependencies: #143997	2025-01-08 03:56:33 +00:00
Animesh Jain	2ac41404a8	[dynamo][dicts] Guarding lazily on dict keys (#143997 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997 Approved by: https://github.com/jansel	2025-01-08 03:56:33 +00:00
iupaikov-amd	e05d67790e	Unskipped multiple inductor tests for ROCm (#143581 ) All of them should be fine to run now after the triton fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581 Approved by: https://github.com/jataylo, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-08 03:55:33 +00:00
CaoE	28b4992e7a	Set prop_kind to forward_inference when grad is not needed for mkldnn_convolution_pointwise (#142855 ) `prop_kind` of MKLDNN convolution is always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether grad is needed. Setting `prop_kind` to `dnnl_forward_inference` for mkldnn_convolution_pointwise could have better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142855 Approved by: https://github.com/jgong5	2025-01-08 02:22:06 +00:00
Xia, Weiwen	f8fcb9e7d3	[Quant][Inductor][X86] Separate unary post op fusion and lowering for qlinear (#143903 )
Summary
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is the first of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves unary post op fusion of qlinear out of the lowering pass. This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.
Test plan
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143903 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2025-01-08 01:55:53 +00:00
zeshengzong	094ca3154d	Fix torch._refs.tensor error with empty list (#143461 ) Fixes #143216
Test Result
Before
```python
>>> import torch
>>> torch._refs.tensor([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6614, in tensor
    new_tensor = _internal_new_from_data(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6596, in _internal_new_from_data
    tensor = _recursive_build(inferred_scalar_type, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6545, in _recursive_build
    return torch.stack([_recursive_build(scalarType, item) for item in seq])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList
```
After
```python
>>> torch._refs.tensor([])
tensor([])
>>> torch._refs.tensor([], device='cuda')
tensor([], device='cuda:0')
```
```bash
$ pytest test/test_tensor_creation_ops.py -k test_refs_tensor
```
![image](https://github.com/user-attachments/assets/5be4c17a-bea6-4b7b-bec1-b4fcb417a8cd)
```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/e8f88f41-78ac-4337-b53f-2e524de2bec0)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143461 Approved by: https://github.com/ezyang, https://github.com/soulitzer	2025-01-08 01:29:00 +00:00
Eddie Yan	9e6a6389ce	[functorch] clean up asserts in `test_dims.py` (#144276 ) For better debuggability of issues encountered in e.g., #141730 when trying to migrate to python 3.12/3.13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144276 Approved by: https://github.com/Skylion007	2025-01-08 01:21:40 +00:00
Lu Fang	013c796b1e	Eliminate c10::optional usage in PyTorch (#144346 ) Differential Revision: D67907427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144346 Approved by: https://github.com/hl475	2025-01-08 01:14:04 +00:00
Randolf Scholz	f002825e1e	added `__add__` and `__mul__` hints to torch.Size (#144322 ) Fixes #144218 `Size` returns `Size`, whereas `tuple` returns `tuple`: `9f28171658/stdlib/builtins.pyi (L985-L988)`
- Use `SupportsIndex` instead of `int` in `__getitem__` (supported at runtime)
- `Size.__add__` overrides `tuple.__add__`; the latter supports adding tuples of non-integral types.
- Added typing unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144322 Approved by: https://github.com/Skylion007	2025-01-08 01:02:11 +00:00
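Illustrative note on #144322: the point of the hints is that adding or repeating a `Size` yields a `Size` rather than a plain `tuple`. The sketch below shows the idea with a hypothetical pure-Python stand-in class. The real `torch.Size` is implemented in C++ and the annotations in the PR may differ from these.

```python
from typing import Iterable, SupportsIndex


class Size(tuple):
    """Hypothetical stand-in for torch.Size, for illustration only."""

    def __add__(self, other: Iterable[int]) -> "Size":
        # Concatenation keeps the subclass instead of decaying to tuple.
        return Size(tuple(self) + tuple(other))

    def __mul__(self, n: SupportsIndex) -> "Size":
        # Repetition accepts any object that implements __index__.
        return Size(tuple(self) * n)


joined = Size((2, 3)) + Size((4,))
repeated = Size((2, 3)) * 2
```

Both results are `Size` instances, whereas a plain `tuple.__add__` or `tuple.__mul__` would return `tuple`.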
xinan.lin	06ea81336f	[Inductor UT] Remove excepted failure for aoti test_fft_c2c (#144238 ) Since #143223 enabled runtime dispatch for fft_c2c in AOTI mod, for XPU, we can fallback fft_c2c which has no XPU implementation to CPU and pass the case now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144238 Approved by: https://github.com/jansel	2025-01-08 00:49:32 +00:00
Wanchao Liang	96f4abba17	[dtensor] move all tests to distribute/tensor folder (#144166 ) as titled, mainly moving files Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166 Approved by: https://github.com/Skylion007	2025-01-08 00:32:33 +00:00
Justin Chu	7c9cf287c2	[ONNX] Handle list values as 0d inputs (#144343 ) Handle list values as 0d inputs instead of 1d, as the `SymInt`s are expected to be 0d tensors in ONNX. This PR reshapes int64 values into 1D tensors in a list, assuming they are 0D tensors initially. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144343 Approved by: https://github.com/gramalingam, https://github.com/titaiwangms	2025-01-08 00:15:50 +00:00
Oguz Ulgen	9ee242213b	[RFC] Introduce cache hot loading APIs (a.k.a. "Mega-cache") (#143341 ) This PR essentially introduces two new APIs * torch.compiler.save_cache_artifacts * torch.compiler.load_cache_artifacts which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function. This bundling approach reduces the need to rely on porting individual files one by one, or relying on many network requests. Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341 Approved by: https://github.com/jansel	2025-01-07 23:13:24 +00:00
Stacie-Herda	c2c50d5f00	Fixed doc where more than one device specified since only one device is used (#17553 ) (#144043 ) Fixes #17553 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144043 Approved by: https://github.com/soulitzer	2025-01-07 23:06:52 +00:00
Yanbo Liang	430d54ee20	[Dynamo] Add functorch C++ bindings as in graph functions (#144309 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306, #144307, #144308	2025-01-07 22:25:01 +00:00
Yanbo Liang	d146763f6f	[Dynamo] Inline functions in torch._ops (#144308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306, #144307	2025-01-07 22:25:01 +00:00
Yanbo Liang	242a4a3f83	[Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307 Approved by: https://github.com/williamwen42 ghstack dependencies: #144306	2025-01-07 22:24:53 +00:00
Yanbo Liang	4417be65e5	[Dynamo] Inline functions in torch._functorch.autograd_function (#144306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306 Approved by: https://github.com/williamwen42	2025-01-07 22:24:46 +00:00
Richard Barnes	3beb7006dd	c10::optional -> std::optional in a few places (#144340 ) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/144340 Approved by: https://github.com/malfet	2025-01-07 21:09:39 +00:00
Simon Fan	f4969c8235	fix torch.compile + ddp + non-reentrant AC pack hook firing count (#144271 ) FIXES https://github.com/pytorch/pytorch/issues/144035 In order to preserve hook firing semantics, we disabled pack/unpack hooks for torch.compile: https://github.com/pytorch/pytorch/pull/123196. In DDP under torch.compile, there's this other callsite that we need to disable hooks for Pull Request resolved: https://github.com/pytorch/pytorch/pull/144271 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer	2025-01-07 21:08:52 +00:00
zeshengzong	861b65fe74	[Easy] Fix linalg.norm hint message typo (#144323 ) Fixes #136454 Test Result Before ```python >>> import torch >>> from torch import linalg >>> >>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]]) >>> # ↓ ↓ ↓ ↓ ↓ >>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2] >>> # ↓ ↓ ↓ ↓ ↓ >>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2] ``` After ```python >>> import torch >>> from torch import linalg >>> >>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]]) >>> # ↓ ↓ ↓ ↓ ↓ >>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2] >>> # ↓ ↓ ↓ ↓ ↓ >>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144323 Approved by: https://github.com/Skylion007, https://github.com/soulitzer	2025-01-07 20:34:16 +00:00
Simon Fan	d38af6e8bc	[ca] dedup node names when AOT bwd graph is reused multiple times (#144202 ) This error started popping up in HUD CA benchmarks: ```python File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce self.fx_tracer.graph.eliminate_dead_code(is_impure) File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code self.lint() File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint raise RuntimeError(f"Node redefined name {node.name}!") RuntimeError: Node redefined name aot0_expand! ``` We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202 Approved by: https://github.com/bdhirsh, https://github.com/jansel	2025-01-07 20:23:09 +00:00
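The postfix-counter idea described in the commit above can be sketched in plain Python. This is only an illustration of the technique, not the code from the PR: `dedup_names` and the sample node names are hypothetical, and the real change operates on FX graph nodes rather than on strings.

```python
from collections import defaultdict


def dedup_names(names):
    """Return the names made unique by appending a per-name counter to repeats."""
    seen = defaultdict(int)  # last counter value tried for each base name
    used = set()             # every name emitted so far
    result = []
    for name in names:
        candidate = name
        # Keep bumping the counter until the candidate is unused.
        while candidate in used:
            seen[name] += 1
            candidate = f"{name}_{seen[name]}"
        used.add(candidate)
        result.append(candidate)
    return result


print(dedup_names(["aot0_expand", "aot0_mul", "aot0_expand", "aot0_expand"]))
# ['aot0_expand', 'aot0_mul', 'aot0_expand_1', 'aot0_expand_2']
```

Checking the candidate against every emitted name, rather than only against the base name, also covers the case where a suffixed name such as `aot0_expand_1` already exists in the graph.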
Shangdi Yu	72e8f34715	[AoTI Minifier] UX Improvement (#143330 ) Summary: - When a user specifies the `TORCHINDUCTOR_MAX_AUTOTUNE=1` env variable, we add `config.max_autotune=True` to the generated minifier_launcher - We should do this for other inductor configs as well in a follow-up diff. Currently in the dynamo and aoti minifiers, if a config is overwritten by an env variable, the config will not show up in the config list in the minifier_launcher.py file. As a result, when running the minifier_launcher, users need to re-apply the same env variable. This is: 1) not convenient for the users 2) if they copy-paste the minifier_launcher.py to us without including the env variable, we could be confused and not able to reproduce the error. Underlying implementation change: - Add `env_default` parameter to `codegen_config()`. If set, configs overridden by the env are not considered default. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config ``` Differential Revision: D67299312 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143330 Approved by: https://github.com/jansel, https://github.com/eellison	2025-01-07 20:04:19 +00:00
bobrenjc93	096cb874d3	remove allow-untyped-defs from torch/_prims/executor.py (#144233 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144233 Approved by: https://github.com/Skylion007	2025-01-07 19:40:40 +00:00
Sampsa	0aa74d0ab9	Skip L1 cache for single-use buffers (#143115 ) ### 1. Synopsis Adds the optional argument `cache_modifier='.cg'` to `tl.load` instructions in the inductor-generated triton code for selected buffers. It makes the `tl.load` instruction skip the L1 cache for short-lived / non-reused data. ### 2. Using the feature This feature is experimental and disabled by default. It can be enabled by setting the environment variable `TORCHINDUCTOR_SKIP_L1` to `1`. ### 3. Results For a simple pointwise addition kernel: ```python @torch.compile def add_dummy(x: torch.Tensor, y: torch.Tensor): return x+y ``` we get (bandwidth performance is in GB/s): (a) feature DISABLED: ![image](https://github.com/user-attachments/assets/6caaf775-f083-4943-a61f-8a1bcb154387) (b) feature ENABLED: ![image](https://github.com/user-attachments/assets/9286be7d-c6ff-4a33-a023-77cb5cc87ff6) ### 4. Caveats The performance boost is only available when using ```python torch._dynamo.config.cache_size_limit = 64 # or any other sufficiently big number.. torch._dynamo.config.automatic_dynamic_shapes = False # use static shapes ``` When using (the default) dynamic shapes, only 1-2 triton kernels are generated with non-optimal block sizes for all the cases (vector sizes), hiding any perf benefit from skipping the L1 cache. In the static case, as an optimal block size is generated for each vector size, the perf benefit of skipping the L1 cache becomes visible. This block-size optimization issue is a larger problem in pytorch inductor and is outside the scope of this feature. ### 5. References - [tl.load](https://triton-lang.org/main/python-api/generated/triton.language.load.html#triton.language.load) - [cache operators](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143115 Approved by: https://github.com/jansel	2025-01-07 19:35:40 +00:00
Randolf Scholz	355b0bc7e3	[typing] Add type hints to `@property` and `@lazy_property` in `torch.distributions`. (#144110 ) Fixes #76772, #144196 Extends #144106 - added type annotations to `lazy_property`. - added type annotation to all `@property` and `@lazy_property` inside `torch.distributions` module. - added simply type-check unit test to ensure type inference is working. - replaced deprecated annotations like `typing.List` with the corresponding counterpart. - simplified `torch.Tensor` hints with plain `Tensor`, otherwise signatures can become very verbose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144110 Approved by: https://github.com/Skylion007	2025-01-07 19:27:36 +00:00
hongxyan	aa69d73e6b	[ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007 ) Fixes #136291 This PR is to fix the `invalid configuration argument` problem happened on ROCm when input is a large tensor when calling `torch.layer_norm`. ``` File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm return torch.layer_norm RuntimeError: HIP error: invalid configuration argument ``` After investigation, I found that the reason why this error happened is: The amd compute language runtime checks whether `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()` or not. If yes, it will error out with the "invalid configuration argument" message. The fix is to split the whole task to several chunks so that each chunk will not trigger the failure condition. This will ensure the correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`. Also added a largeTensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by the pytorch issue #136291 so that the unit test can check the expected output value to ensure correctness. The future work may include performance optimization of layer_norm and CK layer_norm integration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007 Approved by: https://github.com/eqy	2025-01-07 19:17:02 +00:00
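The chunking arithmetic behind the fix described above can be sketched as follows. This is a rough illustration under stated assumptions, not the kernel-launch code from the PR: it assumes one block per normalized row (so `gridDim.x` equals the number of rows in a launch), and both `split_rows` and the block width of 64 are hypothetical.

```python
UINT32_MAX = 2**32 - 1  # the bound the runtime checks gridDim.x * blockDim.x against


def split_rows(num_rows, block_dim_x, limit=UINT32_MAX):
    """Yield (start_row, rows_in_chunk) with rows_in_chunk * block_dim_x <= limit."""
    max_rows = limit // block_dim_x  # largest grid size that passes the check
    if max_rows == 0:
        raise ValueError("block_dim_x alone exceeds the limit")
    start = 0
    while start < num_rows:
        rows = min(max_rows, num_rows - start)
        yield start, rows
        start += rows


# Shape [16, 3000, 3000, 16] normalized over the last dimension gives
# 16 * 3000 * 3000 rows; 64 threads per block is an assumed value.
for start, rows in split_rows(16 * 3000 * 3000, block_dim_x=64):
    print(f"launch rows [{start}, {start + rows})")
```

Each chunk would then be launched separately with a row offset, so every row is still processed exactly once.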
PyTorch MergeBot	6c54963f75	Revert "[dtensor] move all tests to distribute/tensor folder (#144166 )" This reverts commit 2e1ea8598f477322965c28fb52e6e5f53876d8dd. Reverted https://github.com/pytorch/pytorch/pull/144166 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but inductor/test_compiled_autograd needs to be updated ([comment](https://github.com/pytorch/pytorch/pull/144166#issuecomment-2575969871))	2025-01-07 18:31:36 +00:00
Aaron Gokaslan	e4a05dec0f	[BE][Ez]: Fix docs recommending inefficient tensor op order (#144270 ) `detach().clone()` is faster than `.clone().detach()` since the gradients are not cloned. Let's update all the documentation and tests so that users do not use the inefficient op ordering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144270 Approved by: https://github.com/awgu, https://github.com/XuehaiPan	2025-01-07 17:31:32 +00:00
atalman	8d35333498	[CD] Aarch64 builds should not override `OVERRIDE_PACKAGE_VERSION` envvar (#144285 ) Currently our nightly aarch64 binaries have the correct suffixes +cpu or +cu126, but release binaries are missing these suffixes. To correct this and make sure our nightly and release binaries are consistent, I propose this change. I see that the override is already set correctly in the release workflow: https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200 For CPU: ``` OVERRIDE_PACKAGE_VERSION="2.6.0+cpu" ``` For CUDA: ``` OVERRIDE_PACKAGE_VERSION="2.6.0+cu126" ``` The removed code would set OVERRIDE_PACKAGE_VERSION="2.6.0" for both cuda and cpu builds for release binaries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285 Approved by: https://github.com/malfet, https://github.com/tinglvv	2025-01-07 12:50:54 +00:00
Avik Chaudhuri	12fdb93ebd	fix non-strict placeholder naming with kwargs (#144278 ) Fixes https://github.com/pytorch/pytorch/issues/143732 Differential Revision: [D67872055](https://our.internmc.facebook.com/intern/diff/D67872055/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144278 Approved by: https://github.com/yushangdi, https://github.com/pianpwk	2025-01-07 11:22:09 +00:00
Evgeny Fiksman	c3b28491c8	[caffe2] Add AVX512 support for box_cox operator (#143627 ) Summary: Reuse templatized implementation of box_cox caffe2 operator. * Duplicate .cc file of AVX2 * change intrinsics functions to use AVX512 instructions * override templates * extend the caller to use new methods * guard AVX512 with a gflag to allow smooth transition Differential Revision: D67433457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627 Approved by: https://github.com/hl475	2025-01-07 09:54:39 +00:00
RAHUL SINGH	bf7747e935	Tests Generalization for multiple accelerator devices (#139184 ) Motivation: Generalize unit tests so that they can be executed for cuda and non-cuda devices. Dependency: #133209, merged now. There was a #135242 for these changes, which was closed due to incorrect commits. I have incorporated the changes as suggested in comments. @kwen2501 @zeshengzong Please review the changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184 Approved by: https://github.com/kwen2501 Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2025-01-07 09:04:38 +00:00
Wanchao Liang	2e1ea8598f	[dtensor] move all tests to distribute/tensor folder (#144166 ) as titled, mainly moving files Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166 Approved by: https://github.com/Skylion007	2025-01-07 06:45:14 +00:00
Simon Fan	d0f5df83a5	[ca] add test_dtensor_compile.py to compiled autograd tests (#144107 ) more than half the tests use autograd, pass rate 19/26 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107 Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel	2025-01-07 05:16:14 +00:00
bobrenjc93	fcf9dc3b11	Migrate from Tuple -> tuple in benchmarks (#144259 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259 Approved by: https://github.com/yanboliang	2025-01-07 04:09:52 +00:00
Natalia Gimelshein	2e42be0595	Use random64 in Fisher-Yates algorithm for large N (#143682 ) Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet	2025-01-07 03:48:56 +00:00
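For context, a Fisher-Yates shuffle that switches to wider random draws for large N might look like the sketch below. It is not the `randperm` implementation: `randperm_sketch`, the modulo reduction, and the `threshold` value are assumptions chosen for illustration. The general point is that reducing a k-bit draw modulo the remaining range introduces a bias on the order of range / 2^k, so wider draws shrink that bias when the range is large.

```python
import random


def randperm_sketch(n, seed=None, threshold=2**20):
    """Fisher-Yates shuffle of range(n) using 32- or 64-bit draws.

    `threshold` is a hypothetical cut-over point, not the value used in PyTorch.
    """
    rng = random.Random(seed)
    bits = 64 if n >= threshold else 32  # wider draws for large n
    perm = list(range(n))
    for i in range(n - 1):
        # Pick j from [i, n) by reducing a `bits`-wide draw modulo the
        # remaining range; the modulo bias is roughly (n - i) / 2**bits.
        j = i + rng.getrandbits(bits) % (n - i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


print(randperm_sketch(10, seed=0))
```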
Davide Italiano	551f104153	[mps/inductor] Add support for sign(). (#144298 ) Drive-by fix of a test name while I was at it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144298 Approved by: https://github.com/malfet	2025-01-07 03:33:26 +00:00
bobrenjc93	a3ab27b8e0	Migrate from Tuple -> tuple in torch/_inductor (#144264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264 Approved by: https://github.com/eellison	2025-01-07 03:27:27 +00:00
PyTorch MergeBot	778d953951	Revert "[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011 )" This reverts commit 24ac87392bc4e0060a90483643f7df5611988ae5. Reverted https://github.com/pytorch/pytorch/pull/144011 on behalf of https://github.com/malfet due to Not sure what is going on, but lots of builds are failing ([comment](https://github.com/pytorch/pytorch/pull/144011#issuecomment-2574317669))	2025-01-07 03:24:01 +00:00
PyTorch MergeBot	f4e9aebbcc	Revert "Update torch.masked.mean to upcast dtype for bool tensors (#139999 )" This reverts commit 0742b2366e7ba65e0437a17b09a3bb0804ae51ea. Reverted https://github.com/pytorch/pytorch/pull/139999 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a landrace and fails a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/139999#issuecomment-2574283986))	2025-01-07 02:42:55 +00:00
bobrenjc93	168c2cb3f3	remove allow-untyped-defs from torch/nn/utils/_deprecation_utils.py (#144231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144231 Approved by: https://github.com/albanD	2025-01-07 02:22:22 +00:00
Yifu Wang	24ac87392b	[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-01-07 02:15:42 +00:00
leslie-fang-intel	73a6a40346	[Inductor][CPP] Fix outer loop fusion buffer removed (#144243 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw some nodes with `LoopNest` - `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)` - `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)` Although these 2 `LoopNest` have the same `range` and `var`, they have different `steps` (1 and 16), so they fail to be merged with outer loops. Because localizing the buffer removes the global buffers, we need to restore the status of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243 Approved by: https://github.com/jgong5	2025-01-07 01:17:46 +00:00
Jane Xu	2f6f13562f	[BE] Actually suppress vmap warning from gradcheck (#144287 ) This is the much safer change compared to https://github.com/pytorch/pytorch/pull/144283 Before: ``` PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd /data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap. result = vmap(vjp)(torch.stack(grad_outputs)) /data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap. result = vmap(vjp)(torch.stack(grad_outputs)) . ---------------------------------------------------------------------- Ran 1 test in 0.028s ``` (the env vars aren't necessary) After: ``` PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd . ---------------------------------------------------------------------- Ran 1 test in 0.028s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144287 Approved by: https://github.com/cyyever, https://github.com/soulitzer	2025-01-07 01:11:41 +00:00
Nikita Shulga	61c0a3d1cb	Fix lint in `test_provenance_tracing.py` (#144296 ) Regression introduced by https://github.com/pytorch/pytorch/pull/143684/ that somehow did not surface on PR CI IMO this also makes two branches of the test(compile vs aoti) more readable Pull Request resolved: https://github.com/pytorch/pytorch/pull/144296 Approved by: https://github.com/xw285cornell, https://github.com/huydhn	2025-01-07 01:11:38 +00:00
Xu Han	48153c72b2	[Intel XPU] enable kineto for XPU Windows. (#144034 ) This PR turns on `kineto` for the Windows XPU wheel build. Build-time dependencies for `kineto` on Windows XPU: 1. Intel PTI, which is included in oneAPI 2025+. 2. Level Zero SDK: https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip Note: The Level Zero SDK has to be set up manually at build time, so the kineto build stays off by default on Windows XPU to avoid build issues for developers. After adding the Level Zero SDK include path to the `INCLUDE` env var, set the env var `XPU_ENABLE_KINETO` to turn it on. Runtime dependency: 1. The intel-pti PyPI package. @chuanqi129 will follow up in a further PR. Tested the nightly binary locally: <img width="1909" alt="image" src="https://github.com/user-attachments/assets/7dfaa7bc-e8ed-40b8-bc71-f91a3df3b95f" /> TODO: @chuanqi129 will submit a follow-up PR to add `intel-pti` as a dependency and turn on the env var `XPU_ENABLE_KINETO` for the nightly build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144034 Approved by: https://github.com/chuanqi129, https://github.com/zejun-chen, https://github.com/EikanWang, https://github.com/sraikund16	2025-01-07 01:11:25 +00:00
George Wigley	0742b2366e	Update torch.masked.mean to upcast dtype for bool tensors (#139999 ) When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1), leading to an incorrect mean being calculated. The example below shows how the incorrect result occurs: ``` a = torch.tensor([True, True]) count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2 total = torch.sum(a, dtype=torch.bool) # True (1) mean = total / count # 0.5 ``` This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999 Approved by: https://github.com/cpuhrsch	2025-01-07 00:26:59 +00:00
Henry Hu	f013cfee38	[TreeSpec] Support enum in defaultdict (#144235 ) Summary: Followup from D66269157, add support for enum in defaultdict. Test Plan: Added unit test Differential Revision: D67832100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144235 Approved by: https://github.com/henrylhtsang, https://github.com/houseroad	2025-01-07 00:10:46 +00:00
Tugsbayasgalan Manlaibaatar	c68c38c673	Support getattr for tensor subclasses in pre-dispatch export via patching tensor.getattr (#143946 ) Previous discussion: https://github.com/pytorch/pytorch/pull/143671#issuecomment-2560112499 and https://github.com/pytorch/pytorch/pull/143671 Differential Revision: [D67693609](https://our.internmc.facebook.com/intern/diff/D67693609) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143946 Approved by: https://github.com/bdhirsh	2025-01-06 23:55:50 +00:00
bobrenjc93	66059f80d2	Migrate from Tuple -> tuple in torch/profiler (#144257 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144257 Approved by: https://github.com/sraikund16	2025-01-06 23:34:14 +00:00
Laith Sakka	5ccbfffd11	update expected results (#144274 ) this PR `f6488d85a0` made it +1.3% < 1.5%. once we have the API from dev infra and change the test this wont be happening. <img width="364" alt="Screenshot 2025-01-06 at 11 01 15 AM" src="https://github.com/user-attachments/assets/401b2d11-e400-49d6-b6f9-8e10ca141cb0" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144274 Approved by: https://github.com/oulgen, https://github.com/anijain2305	2025-01-06 23:18:21 +00:00
Rachel Guo	f879a6982d	Enhance provenance tracing unit test to cover `torch.compile()` (#143684 ) Summary: Follow up as title. Test Plan: ``` buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing ``` Differential Revision: D67543556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143684 Approved by: https://github.com/yushangdi	2025-01-06 22:58:04 +00:00
Isuru Fernando	301b9c8a90	Fix PythonMod printing (#144078 ) Fixes #144075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144078 Approved by: https://github.com/anijain2305	2025-01-06 22:52:34 +00:00
bobrenjc93	edbda2fad8	remove allow-untyped-defs from torch/export/_remove_auto_functionalized_pass.py (#144230 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144230 Approved by: https://github.com/Skylion007	2025-01-06 22:23:19 +00:00
bobrenjc93	d75ffccd0a	Migrate from Tuple -> tuple in torch/_export (#144262 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144262 Approved by: https://github.com/avikchaudhuri	2025-01-06 22:20:26 +00:00
Andrew Gu	00c18c8882	Make all-reduce input contiguous in `distributed.nn.all_reduce` (#144267 ) Fixes https://github.com/pytorch/pytorch/issues/144060 I confirmed that the unit test fails without the `.contiguous()` fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144267 Approved by: https://github.com/wz337, https://github.com/Skylion007, https://github.com/fduwjj	2025-01-06 22:20:04 +00:00
Nikita Shulga	16c1b1048b	[MPSInductor] Add `nan` constant generation (#144281 ) If val is not equal to self, it's a nan (which is spelled as `NAN` in Metal) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144281 Approved by: https://github.com/atalman, https://github.com/dcci	2025-01-06 22:13:23 +00:00
Nikita Shulga	7d5249dbc2	[EZ][BE] Fix E226 flake8 violation (#144282 ) Not sure why CI did not complain about it, but in my local runs it clearly says ``` Advice (FLAKE8) E226 missing whitespace around arithmetic operator See https://www.flake8rules.com/rules/E226.html 268 \| with code.indent(): 269 \| if len(idx_var_names) > 1: 270 \| for idx, name in enumerate(idx_var_names): >>> 271 \| code.writeline(f"auto {name} = thread_pos.{chr(120+idx)};") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144282 Approved by: https://github.com/Skylion007	2025-01-06 22:12:21 +00:00
Ryan Guo	5d88002af6	[inductor] Avoid specializing over symbolic value during constant folding (#144176 ) Fixes #143667. See more context in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144176 Approved by: https://github.com/jansel, https://github.com/eellison	2025-01-06 21:50:17 +00:00
Faran Ahmad	729b7c0a84	[TGIF][Easy] Slightly improve the logging for tgif split pass (#143771 ) Summary: 1. Added more details for some of the assert statements. 2. Moved assert statements to use tgif_assert Test Plan: all unit tests should pass Reviewed By: jingsh Differential Revision: D67608251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143771 Approved by: https://github.com/jingsh	2025-01-06 21:00:15 +00:00
Aaron Gokaslan	b5cf8e2460	[BE]: Remove redundant copy in torch chunk shard (#144269 ) Fixes an issue noticed in recent all_gather PR. Some parts of the codebase have a double copy with `clone().contiguous()` which could be fused into a single copy op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144269 Approved by: https://github.com/awgu	2025-01-06 20:52:49 +00:00
bobrenjc93	1b8a943011	remove allow-untyped-defs from ao/nn/sparse/quantized/utils.py (#144232 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144232 Approved by: https://github.com/Skylion007	2025-01-06 19:54:27 +00:00
Doru Bercea	6d445bef0c	[ROCm][NFC] Fix condition for small tensor tuning (#144087 ) Fix condition for small tensor tuning to not impact non-ROCm compilation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144087 Approved by: https://github.com/jeffdaily	2025-01-06 19:40:22 +00:00
Marc Horowitz	c62873a09a	Fix incorrect python expression (#143675 ) Summary: This expression would return True always, causing the input to be deleted on error, even for non-write modes: ``` >>> bool("w" or "+" or "a" in "rb") True ``` Test Plan: new test in test_fsspec.py Differential Revision: D67537234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143675 Approved by: https://github.com/mayankgarg1990, https://github.com/huydhn	2025-01-06 19:04:26 +00:00
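The pitfall in the commit above comes from operator precedence: `in` binds tighter than `or`, so the expression parses as `"w" or "+" or ("a" in mode)`, and the non-empty string `"w"` is always truthy. The sketch below contrasts the buggy form with one possible rewrite; the function names are hypothetical and the rewrite is not necessarily the one used in the PR.

```python
def is_write_mode_buggy(mode):
    # Parsed as: "w" or "+" or ("a" in mode). "w" is truthy, so this is always True.
    return bool("w" or "+" or "a" in mode)


def is_write_mode(mode):
    # Test each flag against the mode string explicitly.
    return any(flag in mode for flag in ("w", "+", "a"))


for mode in ("rb", "wb", "r+", "ab"):
    print(mode, is_write_mode_buggy(mode), is_write_mode(mode))
# rb True False
# wb True True
# r+ True True
# ab True True
```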
Shangdi Yu	e3aac7f8a0	detect fake mode in proxy_tensor creation in make_fx (#144168 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/143742 A FakeTensorMode may already exist when we are setting the "val" meta of a proxy tensor. We should detect existing FakeTensorMode before creating a new one. Otherwise, we could cause an error when using `detect_fake_mode` later, because there are now multiple FakeTensorModes existing. Test Plan: The error in https://github.com/pytorch/pytorch/issues/143742 Differential Revision: D67813111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144168 Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan	2025-01-06 19:02:08 +00:00
Nikita Shulga	e56768f030	[MPS] Fix bitwise shifts for uint8 (#144251 ) Previously all bitwise operations were aliased to the same type, but this is wrong for shift ops. Rather than building overly complex logic, let's just instantiate using the shared `scalarToMetalTypeString` helper function. Fixes https://github.com/pytorch/pytorch/issues/144190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251 Approved by: https://github.com/Skylion007 ghstack dependencies: #144249, #144250	2025-01-06 18:27:16 +00:00
PyTorch MergeBot	aa14fcd96c	Revert "export AOTI_TORCH_EXPORT on Windows. (#140030 )" This reverts commit e141cb9c34e5e96ca47ea69b565bc4fd9c8f34c1. Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))	2025-01-06 18:15:52 +00:00
Nikita Shulga	ebeb433e73	[BE] Fix + parametrize `test_min_max_nan_propagation` (#144250 ) - `dtype` was not passed as argument to `torch.rand` before - Condition bfloat16 testing on MacOS14+ Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250 Approved by: https://github.com/Skylion007 ghstack dependencies: #144249	2025-01-06 17:49:41 +00:00
Nikita Shulga	11a0663eeb	[BE] Parametrize `test_min_max` (#144249 ) It's better to have one unit test per dtype rather a combined one Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249 Approved by: https://github.com/Skylion007	2025-01-06 17:49:41 +00:00
Tugsbayasgalan Manlaibaatar	d65a50ef34	Fix subclass unwrapping bug (#143945 ) I noticed a small bug in tensor subclass unwrapping logic. cc @IvanKobzarev It seems easier if we just implement it recursively so that it is easier to track the inner attrs to corresponding plain tensors and both aot_autograd and fake_tensor implement subclass unwrapping recursively. Differential Revision: [D67693610](https://our.internmc.facebook.com/intern/diff/D67693610) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143945 Approved by: https://github.com/IvanKobzarev	2025-01-06 17:38:19 +00:00
Aaron Gokaslan	5c783bf410	[BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 (#144200 ) * Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib fix, so it should be pretty straightforward. * Nicest feature is that it now logs / prints warnings when the CUDNN compiled version does not match the dynamically loaded one * Fixes corrupted / truncated log lines from being printed by CUDNN Frontend Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-01-06 17:33:38 +00:00
Jane Xu	c8713e659a	fix memleak, detach instead of clone to not drag around graph (#144154 ) Thanks @clee2000 for bringing the memleak to my attention: https://github.com/pytorch/pytorch/actions/runs/12549765082/job/34996244798. This memleak in the test was caused by the differentiable flavors. Because we had param.clone() and param persisted outside the for loop, the autograd graph would continue growing for each optimizer.step instead of being deleted after the optim input was used up. To clarify, I had still expected (and still do expect) the test to fully clean everything up once the test is over, but I didn't get the chance to look into why that's not the case. This change would preliminarily unblock this particular test from failing the memleak CI. Use detach instead of clone, which is...cheaper anyway :D since a detach I've learned from @soulitzer is a view with requires_grad=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/144154 Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD	2025-01-06 17:09:00 +00:00
Guilherme Leobas	e222dd5d25	Rewrite _reparametrize_module to use `contextmanager` (#138203 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203 Approved by: https://github.com/zou3519 ghstack dependencies: #136033, #140604	2025-01-06 16:56:22 +00:00
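
The commit above names the technique (a `contextmanager`-based rewrite) without showing code. As a rough illustration of the general pattern only, and not the actual `_reparametrize_module` implementation, the sketch below temporarily swaps attributes on an object and restores them on exit. The helper name `temporarily_set_attrs` and the `Layer` class are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_set_attrs(obj, **overrides):
    """Hypothetical helper: swap attributes on obj, restoring them on exit."""
    missing = object()  # sentinel for attributes that did not exist before
    saved = {name: getattr(obj, name, missing) for name in overrides}
    for name, value in overrides.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        # The restore step runs even if the body of the with-block raises.
        for name, old in saved.items():
            if old is missing:
                delattr(obj, name)
            else:
                setattr(obj, name, old)


class Layer:
    """Stand-in for a module with one parameter-like attribute."""

    def __init__(self):
        self.weight = 1.0


layer = Layer()
with temporarily_set_attrs(layer, weight=2.0):
    inside = layer.weight  # the override is visible inside the block
after = layer.weight  # the original value is back after the block
```

Putting the restore logic in `finally` is what makes a generator-based context manager a reasonable replacement for a hand-written `__enter__`/`__exit__` pair.
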
Guilherme Leobas	4c8d661348	Set `enable_trace_contextlib_contextmanager` flag to True (#140604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604 Approved by: https://github.com/zou3519 ghstack dependencies: #136033	2025-01-06 16:56:22 +00:00
Luca Wehrstedt	defbf0d339	[DTensor] Add strategy for _scaled_mm (#143760 ) This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which are the only options we currently support. Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally! Because of these specificities, the test is rather complex, as it specifically tests all these behaviors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760 Approved by: https://github.com/tianyu-l	2025-01-06 16:35:47 +00:00
yijun-lee	d4609af1ca	Propagate callable parameter types using ParamSpec (#142306 ) (#144047 ) Fixes #142306 This PR includes typing improvements and refactoring for the following files: - __init__.py - decorators.py - _ops.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/144047 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2025-01-06 16:16:18 +00:00
cyy	9225f149eb	Enable clang-analyzer checks of Clang-tidy (#144222 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144222 Approved by: https://github.com/Skylion007	2025-01-06 15:44:45 +00:00
Pian Pawakapan	bba672e117	[docs/export] update dynamic_shapes docs (#142510 ) https://pytorch.org/docs/stable/export.html dynamic_shapes section formatting is messed up, fix & update documentation to be more user-friendly. Happy accepting nits :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142510 Approved by: https://github.com/yushangdi	2025-01-06 14:12:34 +00:00
PyTorch UpdateBot	d85ae4be73	Update slow tests (#144236 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144236 Approved by: https://github.com/pytorchbot	2025-01-06 11:19:09 +00:00
Sun, Jiayi	a8e97d9d4d	fix torch.acos and torch.asin for torch.complex datatypes on CPU (#134838 ) Fix https://github.com/pytorch/pytorch/issues/134487, https://github.com/pytorch/pytorch/issues/138327. These two issues are caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `asin`. For correctness, I temporarily fall back the implementation of `asin` to the scalar implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134838 Approved by: https://github.com/mingfeima, https://github.com/Skylion007	2025-01-06 06:17:39 +00:00
eellison	e1622dca7a	Fix duplicate pattern error (#139321 ) vllm had an error where we incorrectly stated that two patterns are duplicates. See comment inline: For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish if two equivalent searches are actually different. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321 Approved by: https://github.com/shunting314	2025-01-06 05:04:59 +00:00
PyTorch MergeBot	cb5fa17e44	Revert "[ca] add test_dtensor_compile.py to compiled autograd tests (#144107 )" This reverts commit 67f85ccdcf56894d653b4d37cd7651eefa0ddf8c. Reverted https://github.com/pytorch/pytorch/pull/144107 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/144107#issuecomment-2572209717))	2025-01-06 03:30:22 +00:00
Davide Italiano	c9ef98478a	[mps/BE] Enable a test that now passes. (#144198 ) After the implementation of floordiv in `464b50dbd7` landed, this now passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144198 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-06 03:14:21 +00:00
Davide Italiano	23e2953cd3	[mps/inductor] Add support for floor(). (#144195 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144195 Approved by: https://github.com/jansel	2025-01-06 02:07:17 +00:00
Ding, Yi1	d71f111109	[Inductor][CPP] Fix Inductor integer avg pool (#144059 ) Fixes #143738. Currently the scale factor for averaging is rounded to 0 if dtype is an integer, resulting in all-zero output. This fix uses `truediv` instead for integer cases. ## Test ```bash pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool1d_cpu_int64 pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool2d_cpu_int64 pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool3d_cpu_int64 pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_local_response_norm_cpu_int64 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144059 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2025-01-06 01:26:53 +00:00
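
The failure mode described in the commit above can be reproduced in plain Python. This is a toy sketch, assuming the bug arises from casting the reciprocal scale `1 / kernel_size` to an integer before multiplying; the function `avg_pool1d_int` is hypothetical and is not Inductor's generated code, and the truncation back to an integer is also an assumption.

```python
def avg_pool1d_int(values, kernel_size, use_truediv):
    """Toy 1-D average pool over non-overlapping windows of Python ints."""
    out = []
    for start in range(0, len(values) - kernel_size + 1, kernel_size):
        window_sum = sum(values[start:start + kernel_size])
        if use_truediv:
            # Divide the window sum directly, then truncate to an integer.
            out.append(int(window_sum / kernel_size))
        else:
            # Buggy variant: the reciprocal is cast to int first and becomes 0
            # for any kernel_size greater than 1.
            scale = int(1 / kernel_size)
            out.append(window_sum * scale)
    return out


data = [3, 6, 9, 12, 15, 18]
buggy = avg_pool1d_int(data, 3, use_truediv=False)
fixed = avg_pool1d_int(data, 3, use_truediv=True)
```

With the integer-cast scale every output is zero regardless of the input, which matches the all-zero symptom reported in the issue.
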
Xiaodong Wang	3d3a07963f	[reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145 ) Summary: https://github.com/pytorch/pytorch/pull/143549 was reverted due to some internal/oss tooling issue. Relanding. hipblaslt supports TF32, so adding the support. Original PR https://github.com/pytorch/pytorch/pull/139869 Test Plan: CI Differential Revision: D67785496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145 Approved by: https://github.com/jianyuh	2025-01-06 00:37:01 +00:00
Jack Morris	9f94710e48	Update core.py to fix typo (#144201 ) dype -> dtype Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/144201 Approved by: https://github.com/Skylion007	2025-01-05 18:20:52 +00:00
Mitchell, Frost	51a37a42e0	[inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141 ) Fixes #143102 Addresses 2 problems relating to dynamic batch size in BMM autotuner: 1. With dynamic batch size, the input can be a sympy `Mul` expression such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to solve for any expressions. This is safe since this section of the code is only called to generate inputs for benchmarking. 2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has dynamic batch size in the stride of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in the `kernel.args`, it will create a new sizevar with a name. It is possible that subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the calls to `def_kernel` for the GEMM functions, so no variable renaming happens later in the BMM definition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141 Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5	2025-01-05 18:02:37 +00:00
Animesh Jain	f6488d85a0	[dynamo][user-defined] Remove __getattribute__ checks and add getsetdescriptor (#144173 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144173 Approved by: https://github.com/jansel	2025-01-05 13:48:15 +00:00
PyTorch MergeBot	b01556bd8a	Revert "[dynamo][dicts] Guarding lazily on dict keys (#143997 )" This reverts commit f5df082fabfe81639e25b8e01dae107548389c5e. Reverted https://github.com/pytorch/pytorch/pull/143997 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal ci redness in some tests, D67828366 ([comment](https://github.com/pytorch/pytorch/pull/143997#issuecomment-2571587599))	2025-01-05 11:09:45 +00:00
Yutao Xu	1e881ceecf	Update torch-xpu-ops commit pin (#143984 ) Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](`28cfac20ec`), includes: - Disable batch_norm vectorization path to fix accuracy issues. - Fix the LSTM/RNN implementation error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984 Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel	2025-01-05 09:01:36 +00:00
Jason Ansel	157c185afe	[inductor] Add types to compile_tasks.py and runtime_utils.py (#144004 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144004 Approved by: https://github.com/yanboliang	2025-01-05 08:47:49 +00:00
Simon Fan	67f85ccdcf	[ca] add test_dtensor_compile.py to compiled autograd tests (#144107 ) more than half the tests use autograd, pass rate 19/26 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107 Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel	2025-01-05 02:11:48 +00:00
James Wu	f2d6cfa677	Introduce CompileEventLogger, replace usages of metrics_context and chromium_event with it (#143420 ) Problem statement: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there are various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed". CompileEventLogger is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything. CompileEventLogger is intended to be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span. CompileEventLogger has three log levels: - CHROMIUM: Log only to chromium events, visible via tlparse. - PT2_COMPILE: Log to chromium_events + pt2_compile_events - COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event. In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out the metadata are all keys in CompilationMetric and therefore loggable there). The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible. Not included in this diff: - V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup. - We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events don't make sense in that context. So I might leave that separate for now. Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420 Approved by: https://github.com/aorenste	2025-01-04 22:40:34 +00:00
Jackson Tsang	68d30c6a25	Add check for unsupported sparse layout to resolve false INTERNAL ASSERT FAILED (#139198 ) Fixes #131319. Implemented the check on layout as described in the original issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139198 Approved by: https://github.com/pearu, https://github.com/amjames, https://github.com/cpuhrsch Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Pearu Peterson <pearu.peterson@gmail.com>	2025-01-04 21:40:36 +00:00
Nikita Shulga	b1bc880f26	[EZ][BE] Cleanup `test_mps_basic` (#144194 ) - Sort imported tests alphabetically - Run `add` tests with `check_lowp=False` as it is tested explicitly by parametrization - Do not hardcode device, but rather use `self.device` property Pull Request resolved: https://github.com/pytorch/pytorch/pull/144194 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-01-04 21:36:40 +00:00
Davide Italiano	0dc1e6be19	[mps/BE] Fix linter warning/advice. (#144199 ) Two spaces before an inline comment according to E261. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199 Approved by: https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-04 20:15:41 +00:00
Richard Barnes	e458b39fc4	c10::string_view -> std::string_view in Device.cpp (#144178 ) Test Plan: Sandcastle Differential Revision: D67817163 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144178 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-01-04 18:51:33 +00:00
Joona Havukainen	811c714911	Fix nan propagation for minimum() and maximum() in MPS (#144086 ) Fixes #143976 - Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one. - Adds test for the NaN propagating case to `test_mps.py`. - Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation. Additions by @malfet: - Introduce MPSGraph+PyTorchFixups interface following [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:` as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <nshulga@meta.com>	2025-01-04 18:48:24 +00:00
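
To make "NaN propagation" in the commit above concrete, the sketch below contrasts Python's built-in `min`, whose result depends on argument order when a NaN is present, with a minimum that returns NaN whenever either input is NaN. The function `nan_propagating_min` is a hypothetical pure-Python illustration of the semantics only; it is not the MPSGraph call or the Metal kernel.

```python
import math


def nan_propagating_min(a, b):
    """Return NaN if either input is NaN, otherwise the smaller value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b


nan = float("nan")

# Built-in min keeps its first argument unless a later one compares smaller,
# and every comparison against NaN is False, so the result is order-dependent.
builtin_results = (min(nan, 1.0), min(1.0, nan))

# The propagating version gives NaN in both argument orders.
propagating_results = (nan_propagating_min(nan, 1.0), nan_propagating_min(1.0, nan))
```

A maximum with the same property follows by flipping the comparison; the NaN check stays identical.
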
Andrey Talman	60de73c3c7	Update nightly PyTorch version to 2.7.0 Same as https://github.com/pytorch/pytorch/pull/135916	2025-01-04 13:24:48 -05:00
Animesh Jain	f5df082fab	[dynamo][dicts] Guarding lazily on dict keys (#143997 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997 Approved by: https://github.com/jansel ghstack dependencies: #144129, #144130, #144141, #144158, #144163, #144160	2025-01-04 18:13:00 +00:00
drisspg	005a4b9537	[Submodule] Bump Cutlass to 3.5.1 OSS PR (#144000 ) ## Summary Follow up PR to https://github.com/pytorch/pytorch/pull/143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the cutlass pin to 3.5.1. I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leap frog our way to the top :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet	2025-01-04 18:04:03 +00:00
Michal Gallus	93633d0e80	[ROCm][Windows] Fix export macros (#144098 ) For correct import and export of functions when the dynamic linkage is used for HIP libraries on windows, the appropriate export/import macros need to be put in place. This Pull Request utilizes existing CUDA import/export macros by converting them to corresponding HIP macros during the hipification process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098 Approved by: https://github.com/jeffdaily	2025-01-04 17:12:46 +00:00
Aaron Orenstein	45ef3309e3	[BE] typing for decorators (#144161 ) Summary: Untyped decorators strip annotations from the decorated items. - _compile - _inductor/fx_passes/post_grad - _inductor/lowering - _library/custom_ops - _meta_registrations - _ops - _refs/nn/functional - ao/quantization/quantizer/xnnpack_quantizer_utils - distributed/_composable/contract - fx/experimental/graph_gradual_typechecker - fx/experimental/migrate_gradual_types/constraint_generator - optim/optimizer - signal/windows/windows - testing/_internal/common_device_type - torch/_inductor/decomposition - utils/flop_counter Test Plan: unit tests Differential Revision: D62302684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-01-04 16:40:09 +00:00
Nichols A. Romero	79cbda3ab0	[ROCm] Get rid of extra rpath-link that was needed for libtinfo. (#143348 ) Fixes #137858 Due to the extra rpath-link line inserted by these CMake lines, it is possible to unintentionally pick up other libraries that are incompatible with the version of ROCm in ${ROCM_PATH}. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143348 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/pruthvistony	2025-01-04 15:42:30 +00:00
Steven Zeltmann	6f2451c2e9	[MPS] Add `aten::angle` (#143449 ) This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449 Approved by: https://github.com/malfet	2025-01-04 15:38:40 +00:00
Nikita Shulga	301c457032	[MPS] Fix `nllnd_loss_backward` crash with different dtypes (#144170 ) Otherwise, invoking with torch.half inputs, but float weights will result in ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<xf32> 2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification' /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification' ``` Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170 Approved by: https://github.com/kit1980, https://github.com/Skylion007 ghstack dependencies: #144167, #144162, #144083, #144084	2025-01-04 15:24:55 +00:00
PyTorch MergeBot	99f2491af9	Revert "Use absolute path `path.resolve()` -> `path.absolute()` (#129409 )" This reverts commit 45411d1fc9a2b6d2f891b6ab0ae16409719e09fc. Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))	2025-01-04 14:17:20 +00:00
cyy	df458be4e5	[4/N] Apply py39 ruff and pyupgrade fixes (#143257 ) ```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257 Approved by: https://github.com/justinchuby, https://github.com/albanD	2025-01-04 10:47:51 +00:00
Dingming Wu	a881954b0c	[PTD] Dump rcclexp proxy trace in pytorch (#143678 ) Summary: Dump the active proxyOp status per rank and per communicator when the WatchDog times out or aborts. Added `#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in the console. These are the PTD-side changes. Test Plan: Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3 {F1971449692} Reviewed By: c-p-i-o Differential Revision: D67036093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678 Approved by: https://github.com/c-p-i-o	2025-01-04 10:20:47 +00:00
Huy Do	aa7d01ea22	Use sccache 0.9.0 on ROCm build job (#144125 ) TSIA, sccache 0.9.0 seems to work fine with ROCm build job Pull Request resolved: https://github.com/pytorch/pytorch/pull/144125 Approved by: https://github.com/jithunnair-amd, https://github.com/wdvr, https://github.com/jeffdaily	2025-01-04 08:56:48 +00:00
Valentine233	636a2c7e0f	[Inductor][lowering] support out_dtype for dequant lowering (#143845 ) In lowering, support the parameter `out_dtype` for `dequant_per_tensor` and `dequant_per_channel`. Fix the following runtime error issue found in https://github.com/pytorch/ao/pull/1372: ``` File "/home/liaoxuan/pytorch_ao/torch/_inductor/lowering.py", line 452, in wrapped out = decomp_fn(*args, **kwargs) torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised: LoweringException: TypeError: quantized_decomposed_dequantize_per_tensor_default() got an unexpected keyword argument 'out_dtype' target: quantized_decomposed.dequantize_per_tensor.default args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[1, 7, 7, 9], stride=[441, 63, 9, 1])) )) args[1]: 0.01 args[2]: 100 args[3]: 0 args[4]: 255 args[5]: torch.uint8 kwargs: {'out_dtype': torch.bfloat16} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143845 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-01-04 08:48:41 +00:00
Xinran / Allan Rui	417d9c3522	[Inductor/Triton] Upcast FP16/BF16 math reductions to FP32 (#141052 ) Summary: The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This results in significant accuracy issues. This diff upcasts the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]` Test Plan: CI ``` python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction ``` Differential Revision: D65965032 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052 Approved by: https://github.com/blaine-rister	2025-01-04 07:57:10 +00:00
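
The accuracy problem behind the commit above can be shown without a GPU. The sketch below emulates half-precision rounding with the standard library's `struct` format `"e"` and compares a sum whose accumulator is rounded to fp16 after every step against one kept in a wider type. The helper names are hypothetical, and a Python float (fp64) stands in for the fp32 accumulator, so this illustrates the effect rather than Triton's generated code.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_fp16_accumulator(values):
    """Sum with the running total rounded to fp16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_upcast_accumulator(values):
    """Sum fp16 inputs in a wider accumulator, rounding only the final result."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)


ones = [1.0] * 4096
low_precision_total = sum_fp16_accumulator(ones)
upcast_total = sum_upcast_accumulator(ones)
```

Once the running total reaches 2048, adjacent fp16 values are 2 apart, so adding 1 no longer changes the low-precision accumulator, while the upcast version still reaches the exact total.
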
Animesh Jain	816328fa51	[dynamo][lazy] LazyVT utils to get original value/source and is_hashable (#144160 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144160 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #144129, #144130, #144141, #144158, #144163	2025-01-04 06:23:05 +00:00
Nikita Shulga	b5b1e9456a	[MPSInductor] Add `masked` implementation (#144084 ) More or less borrowed from `22580f160e/torch/_inductor/codegen/halide.py (L549-L563)` `pytest test/inductor/test_torchinductor.py -k _mps` score is 408 failed, 347 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/144084 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #144167, #144162, #144083	2025-01-04 04:30:07 +00:00
Shangdi Yu	f15af077fb	Fix get_source_partitions when weights are tied (#142446 ) Summary: Fix https://github.com/pytorch/pytorch/issues/142035 and https://github.com/pytorch/pytorch/issues/143621 When Linear module params are tied to another parameter, like this: ``` class SimpleLinearModel(nn.Module): def __init__(self, input_size, output_size): super(SimpleLinearModel, self).__init__() # Define a linear layer self.linear = nn.Linear(input_size, output_size) self.tied_weight = self.linear.weight def forward(self, x): # Forward pass through the linear layer b = self.tied_weight + 1 return self.linear(x), b ``` We get a graph like below: ``` graph(): %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight] %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight] %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias] %x : [num_users=1] = placeholder[target=x] %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {}) %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {}) return (linear, add) ``` Notice that ` %p_linear_weight : [num_users=2]`. When we get source partitions, we should exclude attributes nodes like `p_linear_weight` from outputs. A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied ``` Differential Revision: D66998592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446 Approved by: https://github.com/angelayi	2025-01-04 04:28:20 +00:00
cyy	f9bf9057ef	Fix ruff warnings in caffe2 and functorch (#144182 ) In preparation for upgrading ruff config to py3.9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144182 Approved by: https://github.com/malfet	2025-01-04 04:15:01 +00:00
Sam Ginzburg	ec1f56fdcf	[user triton] add support for prune_configs_by in @triton.autotune (#142207 ) This PR adds support for prune_configs_by in the @triton.autotune decorator [docs](https://triton-lang.org/main/python-api/generated/triton.autotune.html#triton.autotune). Supporting this lets users reduce autotuning time by running user-supplied code (early_config_prune, perf_model) to prune the provided list of configs. We implement this by realizing args/kwargs in call_triton_kernel(...), and then calling kernel.prune_configs(...). Pull Request resolved: https://github.com/pytorch/pytorch/pull/142207 Approved by: https://github.com/zou3519, https://github.com/aakhundov	2025-01-04 03:50:28 +00:00
Davide Italiano	479d6f2199	[mps/inductor] Add support for log(). (#144169 ) Tested via: ``` % pytest test/inductor/test_mps_basic.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144169 Approved by: https://github.com/jansel, https://github.com/malfet	2025-01-04 03:07:56 +00:00
Animesh Jain	087c625261	[dynamo] Trace torch.typename (#144163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144163 Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #144129, #144130, #144141, #144158	2025-01-04 02:52:58 +00:00
Animesh Jain	3292220c43	[dynamo][easy] Move symnode helpers to utils (#144158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144158 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #144129, #144130, #144141	2025-01-04 02:52:58 +00:00
PHLens	98949df7a4	Fix torch.distributed._functional_collectives.AsyncCollectiveTensor for aten.to. (#134661 ) Fixes #133421 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134661 Approved by: https://github.com/bdhirsh	2025-01-04 02:33:38 +00:00
eqy	7e3cd0e488	[CUDA] Check `size` calculation in `ilpReduce` for `softmax` (#144009 ) For #143644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009 Approved by: https://github.com/Skylion007	2025-01-04 02:31:15 +00:00
eqy	dbdda654af	[64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923 ) #141831 Block/thread math requires a cast... Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923 Approved by: https://github.com/ngimel	2025-01-04 02:30:38 +00:00
xinan.lin	1d091e47d6	[Inductor UT] Generalize device-bias code in test_torchinductor.py introduced by #143884 . (#144057 ) Fix #144056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144057 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-01-04 02:24:33 +00:00
isalia20	22580f160e	Multinomial sampling fix on mps for non contiguous tensors (#141515 ) Fixes #141457 As for the tests, I looked in `test/test_mps.py` but saw that the `test_multinomial` function is disabled. Glad to add a test where needed if there is some place where the multinomial function is tested on metal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-04 01:21:37 +00:00
Nikita Shulga	464b50dbd7	[MPSInductor] Add `floor_div` and `index_expr` implementation (#144083 ) Simply copy-n-pasted from CPPInductor `pytest test/inductor/test_torchinductor.py -k _mps` score is 418 failed, 337 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/144083 Approved by: https://github.com/jansel ghstack dependencies: #144167, #144162	2025-01-04 01:10:01 +00:00
Nikita Shulga	6d25938540	[MPSInductor] Add `remainder` op (#144162 ) For it to return correct result for half precision type it must be upcast to float Pull Request resolved: https://github.com/pytorch/pytorch/pull/144162 Approved by: https://github.com/jansel ghstack dependencies: #144167	2025-01-04 00:47:40 +00:00
Nikita Shulga	f8e1eacf2f	[MPSInductor] Extend `constant` to bool type (#144167 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144167 Approved by: https://github.com/jansel	2025-01-04 00:47:40 +00:00
Yuanhao Ji	d41134f7e5	[Inductor] Fix `torch.polygamma()` when n == 0 (#144058 ) Fixes #143648 aten: `dec1a6d0f0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L436-L447)` compiled kernel code: ``` cpp_fused_polygamma_0 = async_compile.cpp_pybinding(['const float', 'float'], ''' #include "/tmp/torchinductor_devuser/tmpi1d9ksww/db/cdb7hyptwxpzukwd42x4ajfjlgrpum4a4htdd6lhb65apclsmno4.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { { { auto tmp0 = in_ptr0[static_cast<int64_t>(0L)]; auto tmp1 = static_cast<float>(0.0); auto tmp2 = tmp1 == 0 ? calc_digamma(tmp0) : calc_polygamma(tmp0, tmp1); out_ptr0[static_cast<int64_t>(0L)] = tmp2; } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144058 Approved by: https://github.com/jansel	2025-01-04 00:22:10 +00:00
bobrenjc93	52742b07c5	remove allow-untyped-defs from nn/utils/_deprecation_utils.py (#144136 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144136 Approved by: https://github.com/aorenste	2025-01-03 23:44:14 +00:00
Xiaodong Wang	0a94bb432e	[ROCm] CK Flash Attention Backend (#143695 ) Replace https://github.com/pytorch/pytorch/pull/138947 for re-import. Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc.). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695 Approved by: https://github.com/malfet Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-03 22:01:36 +00:00
Huy Do	3251171ae8	Make whl metadata public readable (#144164 ) After https://github.com/pytorch/pytorch/pull/143677/files#r1902138480 landed, the new nightly wheel metadata is not publicly readable, causing pip install to fail, for example https://github.com/pytorch/pytorch/actions/runs/12603415308/job/35128414909. FBGEMM folks also noticed this failure on their end (cc @q10) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144164 Approved by: https://github.com/clee2000	2025-01-03 21:08:49 +00:00
drisspg	9bf2a9a616	[ScaledMM] Fix NaNs in test for garbage input data (#144042 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144042 Approved by: https://github.com/janeyx99	2025-01-03 21:02:25 +00:00
Jay Zhang	b75f32b848	Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139 ) Address related comments earlier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-01-03 20:41:36 +00:00
bobrenjc93	64bffb3124	remove allow-untyped-defs onnx/_internal/exporter/_fx_passes.py (#144134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144134 Approved by: https://github.com/Skylion007	2025-01-03 20:18:40 +00:00
bobrenjc93	64b197b603	remove allow-untyped-defs from export/_remove_auto_functionalized_pass.py (#144135 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144135 Approved by: https://github.com/Skylion007	2025-01-03 20:08:11 +00:00
bobrenjc93	9b8a4e7141	remove allow-untyped-defs from torch/onnx/operators.py (#144133 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144133 Approved by: https://github.com/Skylion007	2025-01-03 20:07:56 +00:00
bobrenjc93	6e09d32c00	remove allow-untyped-defs from torch/jit/_passes/_property_propagation.py (#144132 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144132 Approved by: https://github.com/Skylion007	2025-01-03 20:07:37 +00:00
Wanchao Liang	eb7a303d21	[dtensor] expose the __create_chunk_list__ in the doc (#144100 ) as titled, this PR exposes this dunder method as a public API in the doc, so that different checkpoint implementations can leverage this protocol, instead of exposing a separate API Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100 Approved by: https://github.com/awgu ghstack dependencies: #144099	2025-01-03 20:06:23 +00:00
Xuehai Pan	45411d1fc9	Use absolute path `path.resolve()` -> `path.absolute()` (#129409 ) Changes: 1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()` 2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409 Approved by: https://github.com/albanD	2025-01-03 20:03:40 +00:00
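The difference between the two `pathlib` calls named in this commit can be illustrated with a minimal sketch. The temporary directory layout and the names `real`, `link`, and `file.py` are assumptions made up for the example and are not taken from the PR; the sketch also assumes a platform where creating symlinks is permitted.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical layout: a real directory and a symlink pointing at it.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    real_dir = base / "real"
    real_dir.mkdir()
    (real_dir / "file.py").touch()

    link = base / "link"
    os.symlink(real_dir, link, target_is_directory=True)

    via_link = link / "file.py"

    # absolute() only anchors the path; the symlink component is kept.
    absolute_path = via_link.absolute()

    # resolve() follows symlinks, so the path now points into "real".
    resolved_path = via_link.resolve()

print(absolute_path.parent.name)
print(resolved_path.parent.name)
```

When a checkout is reached through a symlink, `absolute()` keeps the location the user actually invoked, which appears to be the motivation for using it when locating the repo root.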
bobrenjc93	e9e18a9617	remove allow-untyped-defs from _export/db/logging.py (#144093 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144093 Approved by: https://github.com/Skylion007	2025-01-03 19:36:14 +00:00
Nikita Shulga	ad09395674	[MPSInductor] Fix multi rangevar kernel invocation (#144050 ) By changing `thread_position_in_grid` type to uint{n} and passing dimensions during the kernel call `pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050 Approved by: https://github.com/jansel ghstack dependencies: #144055, #144051, #144122, #144105, #144156	2025-01-03 19:32:43 +00:00
Nikita Shulga	52e107a7ca	[MPSInductor] Add `constant`, `isinf` and `isnan` ops (#144156 ) Per Table 6.5 of [Metal Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) infinity is `HUGE_VALF` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #144055, #144051, #144122, #144105	2025-01-03 19:32:43 +00:00
Catherine Lee	383ff4011c	[ez] Use strip for arg sanitization in upload_metadata_file to improve readability (#144155 ) Minor thing that improves readability. I didn't realize you could specify characters for strip when I wrote this Pull Request resolved: https://github.com/pytorch/pytorch/pull/144155 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-01-03 19:25:30 +00:00
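The `str.strip` behaviour mentioned in this commit, passing a set of characters to remove from both ends, can be sketched as follows. The helper name `sanitize` and the chosen character set are assumptions for illustration; the commit message does not say which characters `upload_metadata_file` strips.

```python
def sanitize(arg: str) -> str:
    """Hypothetical helper: trim quotes and spaces from both ends of arg.

    str.strip(chars) treats chars as a set of characters, not as a
    prefix or suffix, and never touches characters in the middle.
    """
    return arg.strip("\"' ")


print(sanitize('"torch-2.6.0.whl"'))
print(sanitize("  'a b'  "))
print(sanitize('"a"b"'))
```

A single `strip` call like this can replace a chain of `replace` or slicing operations, which is the readability gain the commit describes.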
bobrenjc93	8b3479e361	remove allow-untyped-defs from torch/distributed/fsdp/_dynamo_utils.py (#144131 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144131 Approved by: https://github.com/Skylion007	2025-01-03 19:07:21 +00:00
Jane Xu	7b69f7b449	Clarify what we mean by decoupled weight decay in the *AdamWs (#144101 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144101 Approved by: https://github.com/albanD	2025-01-03 19:06:00 +00:00
Yidi Wu	c36f94b373	[while_loop][dynamo] auto-unspecialize int input and output to unbacked symints (#143106 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143106 Approved by: https://github.com/zou3519 ghstack dependencies: #143105, #143545	2025-01-03 19:01:07 +00:00
Yidi Wu	5660709856	[hop][BE] unify meta checking with check_meta_consistency (#143545 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545 Approved by: https://github.com/zou3519 ghstack dependencies: #143105	2025-01-03 19:01:07 +00:00
Yidi Wu	6e8dca9ff3	[while_loop][aot] auto-unspecialize int input and output to unbacked symints (#143105 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143105 Approved by: https://github.com/zou3519	2025-01-03 19:01:07 +00:00
Davide Italiano	56f6289f6a	[mps/inductor] Add support for atanh(). (#144121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121 Approved by: https://github.com/jansel, https://github.com/malfet	2025-01-03 18:55:05 +00:00
Nikita Shulga	a7b61c5b49	[MPSInductor] Add signbit op support (#144105 ) By mapping it to `metal::signbit` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105 Approved by: https://github.com/jansel, https://github.com/Skylion007 ghstack dependencies: #144055, #144051, #144122	2025-01-03 18:34:46 +00:00
PyTorch MergeBot	8d63a4a409	Revert "Set `enable_trace_contextlib_contextmanager` flag to True (#140604 )" This reverts commit 1c817fe6714cec510ccc6022b2c3e66146c3ad59. Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))	2025-01-03 18:23:53 +00:00
Animesh Jain	c5c897c3a1	[dynamo][easy] Miscellaneous fixes (#144141 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144141 Approved by: https://github.com/williamwen42 ghstack dependencies: #144129, #144130	2025-01-03 18:22:56 +00:00
Animesh Jain	732359c633	[dynamo][easy] Minor fixes in guards.cpp (#144130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130 Approved by: https://github.com/williamwen42 ghstack dependencies: #144129	2025-01-03 18:22:56 +00:00
Animesh Jain	a450e177fd	[dynamo] remove inline inbuilt tests as flag is enabled by default (#144129 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144129 Approved by: https://github.com/williamwen42	2025-01-03 18:22:56 +00:00
PyTorch MergeBot	2409b49a33	Revert "Rewrite _reparametrize_module to use `contextmanager` (#138203 )" This reverts commit 7bf3b7cdc5631f9991eebcdd8ec09095339a9973. Reverted https://github.com/pytorch/pytorch/pull/138203 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/138203#issuecomment-2569634001))	2025-01-03 18:17:32 +00:00
Blaine Burton Rister	60fe8a65af	[Inductor] Generalize tiling algorithm to handle fused reductions (#144041 ) # Issue This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges. # Fix Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel. Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`. Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND. # Test plan This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041 Approved by: https://github.com/jansel	2025-01-03 18:16:27 +00:00
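The "solve for the missing numel" step described in this commit can be sketched in plain Python. This is not the actual `SIMDKernel.complete_partial_tiling` code: the function signature, the `missing_name` parameter, and the dictionary representation of a tiling are assumptions made for the illustration.

```python
import math


def complete_partial_tiling(partial: dict, global_numel: int, missing_name: str) -> dict:
    """Hypothetical sketch: fill in one missing tiling dimension.

    The missing size is the global numel divided by the product of the
    sizes already present in the partial tiling.
    """
    known = math.prod(partial.values())
    if global_numel % known != 0:
        raise ValueError("partial tiling does not evenly divide the global numel")
    return {missing_name: global_numel // known, **partial}


# A partial tiling covering 8 of 64 elements needs one more dim of size 8.
tiling = complete_partial_tiling({"x": 8}, global_numel=64, missing_name="y")
print(tiling)
```

Dividing by the partial tiling's numel guarantees that the product of all tiling dimensions equals the global numel, which is the invariant the commit message says the fix enforces.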
Colin Peppler	e93f625d00	[AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990 ) `autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS). This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI. ``` // in AOTI codegen kernels.cuda_fused_0( (const half)arg0_1.data_ptr(), (const half)arg1_1.data_ptr(), (half)buf0.data_ptr(), (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216, (size_t)nullptr, (uint8_t*)workspace_0.data_ptr(), stream); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990 Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire	2025-01-03 18:01:12 +00:00
Huy Do	f3968373c1	Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118 ) CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of CI jobs to 12.4. I also clean up some redundant CI jobs on periodic and inductor-periodic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118 Approved by: https://github.com/atalman	2025-01-03 17:45:41 +00:00
Huy Do	cbdc70ae07	Use the build environment as sccache prefix instead of workflow name (#144112 ) This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor as we are seeing build timeout there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804. The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself. Logically, the same build should use the same cache regardless of the workflows. We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches. I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112 Approved by: https://github.com/malfet	2025-01-03 17:33:03 +00:00
Benjamin Glass	b9fbd65dfd	AOTI fallback ops: remove ops that were never codegen'ed (#143421 ) Removes 4 fallback ops that are currently not possible to codegen, which does not break ABI-compatibility. 1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design. 2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support. 3. `zeros.names` requires a `Dimname` input, which we can't currently codegen. Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421 Approved by: https://github.com/desertfire ghstack dependencies: #141371, #143223	2025-01-03 16:05:38 +00:00
Benjamin Glass	b5b419d627	cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223 ) When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic. This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223 Approved by: https://github.com/desertfire ghstack dependencies: #141371	2025-01-03 16:05:38 +00:00
Benjamin Glass	e88d06f54e	ir.ExternKernel: correctly handle kwarg default arguments (#141371 ) Additionally, enable torchinductor opinfo tests exercising all previously fixed bugs in this stack. Note: I've manually sharded the cpp_wrapper CI checks into 2 shards. Once all OpInfo tests are enabled we should switch back to automatic sharding, but until then the pipeline doesn't have appropriate timing stats. More shards would be helpful given the compilation slowdown associated with cpp_wrapper, but 2 will do for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371 Approved by: https://github.com/desertfire	2025-01-03 16:05:31 +00:00
Nikita Shulga	f7644efa79	[MPSInductor][EZ] Fix logical_[or\|end] ops (#144122 ) For boolean operands it does not really matter whether `&` or `&&` is used, but if one were ever to rely on operator precedence, then bitwise ops should have higher precedence than logical ones Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122 Approved by: https://github.com/huydhn ghstack dependencies: #144055, #144051	2025-01-03 15:28:07 +00:00
Nikita Shulga	b336d72dae	[MPSInductor] Preserve dtype during load (#144051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051 Approved by: https://github.com/Skylion007 ghstack dependencies: #144055	2025-01-03 15:17:33 +00:00
Valentine233	a1ae8fadc7	[cpu][vec] support reduce ops for add and max (#144065 ) ### Description While adding support for INT8 SDPA https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` would go into the slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions. ### Details - Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions; - Implement the scalar version for the fallback path in vec base; - Add the operator `reduce` in vec base, in order to simplify the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065 Approved by: https://github.com/mingfeima	2025-01-03 13:01:52 +00:00
Michael Diggin	55dc61dd52	Dataloader distribute tasks to workers when in_order is False (#142324 ) Fixes #105203 and is a follow-up PR to #141833 When `in_order` is True (the default), tasks are given out to workers in a round robin fashion. When `in_order` is False this is no longer needed, as we give up guarantees of reproducibility, and instead tasks should be given to workers that are able to perform work. In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to the worker's queue if it has fewer than `_prefetch_factor` tasks outstanding. The current default behaviour is left as is. Tests are also updated to assert on the worker IDs for each sample of data returned. I've run the following to confirm they aren't flaky ```bash for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324 Approved by: https://github.com/andrewkho	2025-01-03 12:57:04 +00:00
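The worker-selection rule described in this commit can be sketched as a small pure-Python function. It is a simplified illustration, not the DataLoader implementation: the function name `pick_worker`, its parameters, and the list-of-counts representation are assumptions, and the real code also has to account for worker status and queue handling.

```python
from typing import List, Optional


def pick_worker(outstanding: List[int], start: int,
                prefetch_factor: int, in_order: bool) -> Optional[int]:
    """Hypothetical sketch of choosing the next worker for a task.

    outstanding[i] is the number of tasks worker i has not yet returned.
    With in_order=True the next worker in round-robin order is used.
    With in_order=False, busy workers (at or above prefetch_factor
    outstanding tasks) are skipped.
    """
    n = len(outstanding)
    for offset in range(n):
        worker_id = (start + offset) % n
        if in_order or outstanding[worker_id] < prefetch_factor:
            return worker_id
    return None  # every worker is saturated; the caller would wait


# Worker 0 is saturated, so out-of-order dispatch moves on to worker 1.
print(pick_worker([2, 0, 1], start=0, prefetch_factor=2, in_order=True))
print(pick_worker([2, 0, 1], start=0, prefetch_factor=2, in_order=False))
```

The trade-off matches the commit message: skipping saturated workers keeps slow workers from stalling the pipeline, at the cost of a reproducible sample-to-worker assignment.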
blzheng	c09bf71bd6	[Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848 ) Fix https://github.com/pytorch/pytorch/issues/143568 Before: ![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003) After: ![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2025-01-03 09:00:43 +00:00
Xuehai Pan	d9507548d8	[dynamo][BE] move `zip_longest` polyfill to submodule `polyfills.itertools` (#144067 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144067 Approved by: https://github.com/yanboliang ghstack dependencies: #144066	2025-01-03 08:08:31 +00:00
Xuehai Pan	fb1beb31d2	[dynamo][BE] move `dropwhile` polyfill to submodule `polyfills.itertools` (#144066 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144066 Approved by: https://github.com/jansel	2025-01-03 08:08:31 +00:00
hongxyan	00df63f09f	[ROCm] Fix for ld failed to convert GOTPCREL relocation in PyTorch build (#143986 ) I experienced an error while doing a DEBUG build of pytorch on rocm: ``` additional relocation overflows omitted from the output /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax ``` Based on discussions on similar issue #138427, I fixed it after adding the `--offload-compress` to the HIP_HIPCC_FLAGS to successfully build DEBUG mode on my node. Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986 Approved by: https://github.com/jeffdaily	2025-01-03 06:53:08 +00:00
Xu Han	e141cb9c34	export AOTI_TORCH_EXPORT on Windows. (#140030 ) Fixes #139954 reproduce UT: ```cmd pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu ``` Issue: <img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe"> After fixing: ![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a) Reland: 1. Declare export on Windows explicitly. 2. Support cpu, cuda and xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030 Approved by: https://github.com/jgong5, https://github.com/desertfire	2025-01-03 05:41:06 +00:00
Wanchao Liang	48a05ee773	[dtensor] improve doc of the DTensor class (#144099 ) as titled: explicitly list all public members to make sure the public API stays consistent, also use groupwise as the member order to make doc look better Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099 Approved by: https://github.com/awgu	2025-01-03 05:35:44 +00:00
Davide Italiano	41b5c600df	[ReduceOps] Add dimension checking for cummin()/cummax(). (#143920 ) Summary: cum{min,max} didn't guard against 0-d vector and allowed an arbitrary dimension to be passed. Test Plan: torch_test.py Reviewers: Subscribers: Tasks: Tags: Fixes #71477 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143920 Approved by: https://github.com/malfet	2025-01-03 04:14:33 +00:00
Bin Bao	c5b75f8db1	[AOTI] Remove more AOTI_TORCH_EXPORT (#144081 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a windows build issue. Differential Revision: D67762069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081 Approved by: https://github.com/yushangdi	2025-01-03 02:17:38 +00:00
Jithun Nair	c31912666e	[ROCm] Print amdgpu info on bare metal for CI runners (#144038 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144038 Approved by: https://github.com/jeffdaily	2025-01-03 02:00:40 +00:00
Michal Gallus	37e9da0687	[ROCm][Windows] Disable roctracer-related code (#143329 ) Currently, the roctracer for Windows is not available. This PR disables any mentions of its usage for Windows, and creates dummy functions for Windows to keep compatibility with existing code, but which warn the user about the lack of Windows' availability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329 Approved by: https://github.com/sraikund16	2025-01-03 01:51:01 +00:00
bobrenjc93	891a86d1ad	remove allow-untyped-defs from ao/quantization/experimental/fake_quantize.py (#144091 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144091 Approved by: https://github.com/aorenste	2025-01-03 01:26:36 +00:00
bobrenjc93	377e29745f	remove allow-untyped-defs from distributed/elastic/utils/data/cycling_iterator.py (#144090 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144090 Approved by: https://github.com/aorenste	2025-01-03 01:22:50 +00:00
bobrenjc93	0d6db839a7	remove allow-untyped-defs from utils/data/datapipes/iter/streamreader.py (#144088 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144088 Approved by: https://github.com/aorenste	2025-01-03 01:21:44 +00:00
bobrenjc93	bdfb40ed29	remove allow-untyped-defs from utils/_import_utils.py (#144089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144089 Approved by: https://github.com/aorenste	2025-01-03 01:21:13 +00:00
bobrenjc93	28a74fe3aa	remove allow-untyped-defs from torch/mps/event.py (#144092 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144092 Approved by: https://github.com/aorenste	2025-01-03 01:20:17 +00:00
Catherine Lee	496fc90965	[CI] Multigpu 1 -> 2 shards (#143992 ) Fixes #ISSUE_NUMBER It's been timing out https://github.com/pytorch/pytorch/actions/runs/12544191739/job/34977636276 They're still somewhat uneven but they're both under the limit now. It would probably be better to use run_test.py's sharding to do this, maybe in another PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/143992 Approved by: https://github.com/huydhn	2025-01-03 00:33:16 +00:00
Catherine Lee	3eb3f4ed55	Upload METADATA file with whl binaries (#143677 ) Upload the metadata file for wheels for pep658 https://peps.python.org/pep-0658/ Using a python script but using bash might be easier... -- Testing Example run https://github.com/pytorch/pytorch/actions/runs/12550595201/job/34994883276 without actual upload, just dry run Lightly tested the script to make sure it uploads to s3, but integration with the bash script + workflow is untested Pull Request resolved: https://github.com/pytorch/pytorch/pull/143677 Approved by: https://github.com/seemethere	2025-01-03 00:32:05 +00:00
Catherine Lee	bb5e439f2d	Add networkx as bazel dep to fix CI failure (#143995 ) Add networkx as a dependency for test_bazel Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301 ``` INFO: From Testing //:test_bazel: ==================== Test output for //:test_bazel: Traceback (most recent call last): File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module> test_simple_compile_eager() File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager opt_foo1 = torch.compile(foo, backend="eager") File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile backend = _TorchCompileWrapper(backend, mode, options, dynamic) File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__ self.compiler_fn = lookup_backend(backend) File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend _lazy_import() File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in 
_lazy_import import_submodule(backends) File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule importlib.import_module(f"{mod.__name__}.{filename[:-3]}") File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1050, in _gcd_import File "<frozen importlib._bootstrap>", line 1027, in _find_and_load File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 688, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 883, in exec_module File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module> from torch._functorch.aot_autograd import ( File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module> from .partitioners import default_partition File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module> from ._activation_checkpointing.graph_info_provider import GraphInfoProvider File 
"/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module> import networkx as nx ModuleNotFoundError: No module named 'networkx' ``` No periodic runs on this PR or its main branch commit, but I'm pretty sure its started on https://togithub.com/pytorch/pytorch/pull/143539 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995 Approved by: https://github.com/huydhn	2025-01-02 19:42:18 +00:00
Driss Guessous	a8c98ce175	[cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515 ) # Summary: This also makes updates to different repositories throughout FB code to roll any updates needed for this new release. I was not able to get AsyncMM.cu to build (still trying) Yfiu suggested that I just skip it for now Test Plan: Have run various build commands to try and expose errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-01-02 18:45:11 +00:00
bobrenjc93	8506a2af9a	remove allow-untyped-defs from _export/pass_infra/proxy_value.py (#143944 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143944 Approved by: https://github.com/aorenste ghstack dependencies: #143943	2025-01-02 18:17:03 +00:00
Jagadish Krishnamoorthy	8f3eb84373	ROCm: Enable 4 gpu tests for distributed config (#140319 ) Change the label to make sure the jobs land on a node which has >= 4 GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501	2025-01-02 17:22:11 +00:00
Chris Sidebottom	916b510ff5	Enable mkldnn pattern matcher tests for BF16 on AArch64 (#144030 ) Fixes #143146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144030 Approved by: https://github.com/malfet	2025-01-02 17:13:38 +00:00
Nikita Shulga	a93e75d1e2	[MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055 ) Follow-up to https://github.com/pytorch/pytorch/pull/143934: this check is no longer necessary, and the change fixes a subset of inductor tests. Before: `pytest test/inductor/test_torchinductor.py -k _mps` reports 463 failed, 291 passed, 32 skipped. After: 456 failed, 298 passed, 32 skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055 Approved by: https://github.com/Skylion007	2025-01-02 17:12:39 +00:00
Wanchao Liang	0431d47eaa	[tp] propagate src_data_rank kwarg in TP API (#144005 ) As titled, this PR propagates src_data_rank in the TP API, so that module-level APIs can use the flexibility to choose src_data_rank and avoid the communication when it is not needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005 Approved by: https://github.com/tianyu-l ghstack dependencies: #143883	2025-01-02 05:35:52 +00:00
Wanchao Liang	f242dbb76f	[dtensor] add src_data_rank to distribute_tensor API (#143883 ) As titled, this PR adds a kwarg src_data_rank to the distribute_tensor API, to allow the user to specify a specific rank as the full tensor source data. Previously we specified group_rank=0 by default as the source of truth for single device semantics. This new option: * gives advanced users the flexibility to choose the source data rank * allows users to specify None explicitly, which means we will skip the communications needed (scatter/broadcast) for cases that do not care about single device semantics (i.e. loading from a checkpoint) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883 Approved by: https://github.com/XilunWu, https://github.com/tianyu-l	2025-01-02 05:35:52 +00:00
Animesh Jain	dec1a6d0f0	[dynamo] Separate out GetItemSource and DictGetItemSource (#143926 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143926 Approved by: https://github.com/jansel	2025-01-01 02:39:41 +00:00
Wenqin Yang	8d9ff9c8a4	Fix a bug for wrong stride in fake tensor (#141427 ) Fixes #141426 Please see details in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141427 Approved by: https://github.com/jansel	2024-12-31 23:45:32 +00:00
Jason Ansel	e7ed660233	[inductor] Add missing py312 xfail (#144006 ) See #144006 ```py __________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________ RuntimeError: First class dim doesn't work with python 3.12 The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper method(*args, **kwargs) File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load from functorch.einops import rearrange File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module> from .rearrange import rearrange File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module> from functorch._C import dim as _C ImportError: initialization failed ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006 Approved by: https://github.com/Skylion007	2024-12-31 23:37:05 +00:00
PyTorch MergeBot	a174ee2255	Revert "Fix duplicate pattern error (#139321 )" This reverts commit 9e8d84f8631317ce61de4f0f9731fc1b1ccc3d2b. Reverted https://github.com/pytorch/pytorch/pull/139321 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/139321#issuecomment-2566620095))	2024-12-31 17:44:02 +00:00
PyTorch MergeBot	d8a2796fb6	Revert "[Inductor UT] Generalize newly introduced device-bias hard code in (#143975 )" This reverts commit 7c1c0730beed9bb05a16ba678a8f32b29fdd0a29. Reverted https://github.com/pytorch/pytorch/pull/143975 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/139321 feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/143975#issuecomment-2566619312))	2024-12-31 17:41:06 +00:00
PyTorch MergeBot	eec30916e7	Revert "Update low prec codegen for div/mod (#142350 )" This reverts commit 135a2d44830b2de1ed6714f52cc6a612406adb6d. Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2566615835))	2024-12-31 17:35:32 +00:00
Nikita Shulga	5ef0de7615	[MPSInductor] Fix multiple kernel generation (#143998 ) At the moment by generating multiple MetalLibraries `pytest test/inductor/test_torchinductor.py -k _mps` score is 434 failed, 317 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/143998 Approved by: https://github.com/jansel, https://github.com/ruidazeng ghstack dependencies: #143948, #143949, #143973, #143977	2024-12-31 13:51:50 +00:00
Nikita Shulga	f0f09bb3c2	[MPSInductor] Implement minimum and maximum ops (#143977 ) By calling `metal::min` and `metal::max` respectively with argument typecast to a common type to avoid ambiguous calls errors TODO: Implement NaN propagation for both eager and compile, see https://github.com/pytorch/pytorch/issues/143976 `pytest test/inductor/test_torchinductor.py -k _mps` score is 460 failed, 291 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/143977 Approved by: https://github.com/jansel ghstack dependencies: #143948, #143949, #143973	2024-12-31 13:51:50 +00:00
Yu, Guangye	09e47ab7ab	Refine CUDA Stream priority (#143849 ) # Motivation As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we properly handle the priority value if it is outside of the priority range. # Additional Context If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest). Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119, #141123, #143799	2024-12-31 11:15:59 +00:00
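The clamping described in "Refine CUDA Stream priority" above can be sketched in plain Python. This is only an illustration under stated assumptions, not PyTorch's implementation: `clamp_priority` and the example range are hypothetical, and the sketch assumes the CUDA convention that a numerically lower value means a higher priority.

```python
def clamp_priority(priority: int, lowest: int, highest: int) -> int:
    """Map an out-of-range priority to the nearest valid one.

    `lowest` and `highest` are the priority values of the least and most
    urgent streams. With the CUDA convention assumed here, `highest` is
    numerically smaller than `lowest`, so we order the bounds first.
    (Hypothetical helper for illustration only.)
    """
    lower_bound = min(lowest, highest)
    upper_bound = max(lowest, highest)
    return max(lower_bound, min(priority, upper_bound))


# Example with an assumed valid range of [-5, 0]:
too_urgent = clamp_priority(-10, lowest=0, highest=-5)  # below the range
too_lazy = clamp_priority(3, lowest=0, highest=-5)      # above the range
in_range = clamp_priority(-2, lowest=0, highest=-5)     # already valid
```

Values already inside the range pass through unchanged, so callers never see an error for an out-of-range request.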
Yu, Guangye	3848de55ed	Add get_stream_from_external API for CUDA backend (#143799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119, #141123	2024-12-31 11:15:59 +00:00
Yu, Guangye	8f6c4d1732	Add get_stream_from_external API for XPU backend (#141123 ) # Motivation This PR aims to introduce `torch.xpu.ExternalStream` to be used to wrap SYCL queue created in other libraries to PyTorch. # Additional Context Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119	2024-12-31 11:15:52 +00:00
Yu, Guangye	a68c0ca497	Add low priority XPU Stream (#141119 ) # Motivation Due to the potential for the external SYCL queue to have a low priority, we need to support the low-priority SYCL queue for native XPU Streams to maintain consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119 Approved by: https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #142347	2024-12-31 11:15:45 +00:00
Yu, Guangye	39450ae655	Refine XPU external Stream (#142347 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347 Approved by: https://github.com/gujinghui, https://github.com/albanD	2024-12-31 11:15:38 +00:00
Vinayak Pandey	16a57e232c	removed dead code for dynamo flag dead_code_elimination (#140938 ) Fixes #136862 1. removed dead code from torch/_dynamo/convert_frame.py 2. ran `lintrunner -a` and all the tests passed. 3. ran the unit tests and everything seems to be in order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140938 Approved by: https://github.com/zou3519	2024-12-31 09:27:43 +00:00
xinan.lin	01034e963c	[AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970 ) Fix #143967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-31 06:28:32 +00:00
Blaine Burton Rister	a2753e376b	[Inductor] Support tiling reduction dimensions (#137243 ) Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317. Sub-PRs containing refactors from this one: - https://github.com/pytorch/pytorch/pull/141733 - https://github.com/pytorch/pytorch/pull/141738 - https://github.com/pytorch/pytorch/pull/141751 (based off the former) - https://github.com/pytorch/pytorch/pull/142249 - https://github.com/pytorch/pytorch/pull/142020 - https://github.com/pytorch/pytorch/pull/143135 These refactor PRs should land before the main one. # Feature Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False. Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D. 
Here's an example of a 2D persistent reduction kernel: ``` @triton.jit def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr): xnumel = 1 r0_numel = 15 R0_BLOCK: tl.constexpr = 16 r1_numel = 15 R1_BLOCK: tl.constexpr = 16 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1) r0_index = tl.arange(0, R0_BLOCK)[None, :, None] r0_offset = 0 r0_mask = r0_index < r0_numel r1_index = tl.arange(0, R1_BLOCK)[None, None, :] r1_offset = 0 r1_mask = r1_index < r1_numel rnumel = r0_numel * r1_numel RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK roffset = r1_offset + (r0_offset*r1_numel) rindex = r1_index + (r0_index*r1_numel) r0_0 = r0_index r1_1 = r1_index tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :] tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK]) tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0) tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK]) tmp5 = tl.sum(tmp4, 1)[:, None, None] tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None) ''', device_str='cuda') ``` There are a few main differences between this kernel and what Inductor would generate without this PR. - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`. - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. 
Although this makes the code more verbose, it shouldn't have a perf impact because the triton compiler eliminates dead code. - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood. Here's an example of a looped reduction: ``` @triton.jit def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr): xnumel = 3 r0_numel = 43 r1_numel = 129 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = xindex < xnumel r0_base = tl.arange(0, R0_BLOCK)[None, :, None] r1_base = tl.arange(0, R1_BLOCK)[None, None, :] rnumel = r0_numel * r1_numel RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK rbase = r1_base + (r0_base*r1_numel) x0 = xindex block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0]) _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32) for r0_offset in range(0, r0_numel, R0_BLOCK): r0_index = r0_offset + r0_base r0_mask = r0_index < r0_numel for r1_offset in range(0, r1_numel, R1_BLOCK): r1_index = r1_offset + r1_base r1_mask = r1_index < r1_numel roffset = r1_offset + (r0_offset*r1_numel) rindex = r1_index + (r0_index*r1_numel) r0_1 = r0_index r1_2 = r1_index tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first') tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK]) tmp3 = _tmp2 + tmp1 _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2) block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK]) block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)]) tmp4 = tl.reshape(_tmp2, [XBLOCK, 
RBLOCK]) tmp2 = tl.sum(tmp4, 1)[:, None, None] tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0]) ''', device_str='cuda') ``` In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code: - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`. - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension. The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.) The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. 
For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files. Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing. To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which would simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible. # Test plan This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions. In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions: - `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions. 
- `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension. - `test_2d_welford_reduction`: test 2D welford reductions with block pointers. - `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails. - `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D. - `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel. - `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243 Approved by: https://github.com/jansel Co-authored-by: Yueming Hao <yhao@meta.com> Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-31 05:06:46 +00:00
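The way the kernels in the commit above collapse 2D reduction indices into a 1D index (`rindex = r1_index + r0_index*r1_numel`) can be illustrated with a small pure-Python sketch. The helper names `linearize` and `delinearize` are hypothetical and are not part of Inductor; the sketch only shows the row-major arithmetic, not the generated Triton code.

```python
def linearize(r0_index: int, r1_index: int, r1_numel: int) -> int:
    """Collapse a 2D reduction index into a 1D index (row-major order)."""
    return r1_index + r0_index * r1_numel


def delinearize(rindex: int, r1_numel: int) -> tuple:
    """Recover (r0_index, r1_index) from the collapsed 1D index."""
    return divmod(rindex, r1_numel)


# Sizes taken from the persistent-reduction example: a 15 x 15 reduction.
r0_numel, r1_numel = 15, 15
flat = [
    linearize(r0, r1, r1_numel)
    for r0 in range(r0_numel)
    for r1 in range(r1_numel)
]
```

Every (r0, r1) pair maps to a distinct position in `range(r0_numel * r1_numel)`, which is why code that expects a single linear reduction index can keep working unchanged.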
Boyuan Feng	f3e5078c27	[Inductor] Relax size constraints for re-inplacing (#143884 ) Currently, reinplacing requires that the input buffer and output buffer have exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing. This PR changes the size constraints to be: - input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str). - it's statically known that 0.99 * input_size <= output_size <= input_size ### Apply on llm.c See the reuse of `buf1`. Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078)) ![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9) After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053)) ![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884 Approved by: https://github.com/eellison	2024-12-31 03:52:47 +00:00
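The relaxed size constraint from the commit above can be sketched with concrete integer sizes. This is a simplified illustration, not Inductor's code: `can_reinplace` is a hypothetical name, the real check works on symbolic sizes, and the sketch assumes the two listed conditions are alternatives (the commit message does not state this explicitly).

```python
def can_reinplace(input_size: int, output_size: int) -> bool:
    """Return True if the output may reuse the input buffer.

    Condition 1: the sizes are identical.
    Condition 2: 0.99 * input_size <= output_size <= input_size,
    written with integers to avoid floating-point rounding.
    (Hypothetical helper; assumes the conditions are alternatives.)
    """
    if input_size == output_size:
        return True
    return 99 * input_size <= 100 * output_size <= 100 * input_size


# A padded input of 1000 elements whose output needs 995 elements
# is within the 1% slack; an output of 980 elements is not.
within_slack = can_reinplace(1000, 995)
too_small = can_reinplace(1000, 980)
too_large = can_reinplace(1000, 1001)
```

The upper bound matters as much as the lower one: an output larger than the input can never be written in place.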
cyy	8df99b6a6e	Remove unneeded std::make_optional (#143575 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143575 Approved by: https://github.com/Skylion007	2024-12-31 03:08:47 +00:00
Nikita Shulga	11bb94b7ea	[MPSInductor] Fix index generation for transpose (#143973 ) Alas, PythonPrinter would not work here, nor would CppPrinter, so start building MetalPrinter. `pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped Before this change: `pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973 Approved by: https://github.com/jansel ghstack dependencies: #143948, #143949	2024-12-31 02:04:50 +00:00
Kai Londenberg	cb24013b5b	Fix assertion failure in pytorch profiler (#143940 ) Summary: Attempt to fix the following exception, which occurred when profiling a PyTorch model (Meta-internal LLM) that also involved a ThreadPoolExecutor in the background: ``` Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty. ``` The root cause of this issue seems to be that a thread call stack can be empty, while the code asserts that it is not. I fixed this with some minimal changes to profiler_python.cpp Approach: * Ensuring that the stack in question is not empty before trying to pop from it. Test Plan: * Tested manually on a reproducible scenario where the assertion failure was otherwise triggered (repro too large to include here). The assertion failure disappears. * CI Differential Revision: D67691558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940 Approved by: https://github.com/Skylion007, https://github.com/sraikund16	2024-12-31 01:43:04 +00:00
cyy	af629a8146	Enable readability-redundant-declaration (#143982 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982 Approved by: https://github.com/Skylion007	2024-12-31 00:20:10 +00:00
xinan.lin	934eaa503f	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR adds max-autotune support for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-30 23:51:17 +00:00
Andrew Gu	d9a6ffb63c	[FSDP] Add workaround to fix `buffer_dtype` without root parameters (#143989 ) Fixes https://github.com/pytorch/pytorch/issues/143900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143989 Approved by: https://github.com/H-Huang	2024-12-30 23:42:24 +00:00
Jason Ansel	2da7fb5320	[inductor] Make generated kernels deterministic (#143951 ) `"compile_id"` had slipped into our generated Triton code (in the metadata), which will defeat caching because the same kernels generated in a different order would not cache hit with each other. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951 Approved by: https://github.com/oulgen	2024-12-30 23:35:11 +00:00
Benjamin Glass	d88a8c41d5	Fix flaky "Upload test stats" job (#143991 ) Test stat uploading was intermittently failing due to certain XML strings being opportunistically converted to numbers, when string output was expected. This PR makes the conversion behavior optional, which should fix the stat uploads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143991 Approved by: https://github.com/clee2000, https://github.com/huydhn	2024-12-30 21:40:01 +00:00
Benjamin Glass	d260bc4476	cpp_wrapper: minimize pybind11 dependency (#143772 ) Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772 Approved by: https://github.com/Skylion007	2024-12-30 20:41:02 +00:00
Aaron Gokaslan	baee623691	[BE][Ez]: Update fmtlib submodule to 1.11.1 (#143937 ) * Exactly the same as previous fmtlib except it fixes an edge case that could affect ABI compatibility between fmtlib versions. * Seems safe to update Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937 Approved by: https://github.com/albanD	2024-12-30 19:46:27 +00:00
Wouter Devriendt	9d026000de	change import relative paths due to internal build failures (#143968 ) Internal builds failing due to #143355, changing imports to be relative, similar to other imports Pull Request resolved: https://github.com/pytorch/pytorch/pull/143968 Approved by: https://github.com/albanD	2024-12-30 17:19:49 +00:00
Nikita Shulga	c27c788e35	[MPS] Fix `torch.add(x,y, alpha=2)` crash (#143949 ) TODO: as followup PR replace this weird logic with shaders Fixes https://github.com/pytorch/pytorch/issues/143932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949 Approved by: https://github.com/Skylion007 ghstack dependencies: #143948	2024-12-30 17:16:29 +00:00
Nikita Shulga	beb6c2dea5	[MPS] Fix crash when mm is invoked with mixed dtypes (#143948 ) Simply by copy-n-pasting check from `a7915c56f6/aten/src/ATen/native/cuda/Blas.cpp (L254-L257)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143948 Approved by: https://github.com/Skylion007	2024-12-30 17:13:34 +00:00
xinan.lin	7c1c0730be	[Inductor UT] Generalize newly introduced device-bias hard code in (#143975 ) test_pattern_matcher.py Fix #143974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143975 Approved by: https://github.com/malfet	2024-12-30 16:47:19 +00:00
cyy	dca443835e	Enable more readability-redundant checks (#143963 ) They help simplify code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963 Approved by: https://github.com/albanD	2024-12-30 14:49:33 +00:00
chuanqiw	438698b20b	[CD] Remove redundant triton dependency for xpu wheels (#143839 ) Since XPU CD wheels enabled PyPI dependencies in https://github.com/pytorch/pytorch/pull/141135, PYTORCH_EXTRA_INSTALL_REQUIREMENTS now has a value for the XPU CD wheel build. Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850 Fixes #143838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143839 Approved by: https://github.com/huydhn	2024-12-30 13:39:06 +00:00
PyTorch UpdateBot	2fa09853cb	Update slow tests (#143745 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143745 Approved by: https://github.com/pytorchbot	2024-12-30 11:51:49 +00:00
Yutao Xu	2ed4d65af0	Update torch-xpu-ops commit pin (#143853 ) Update the torch-xpu-ops commit to [214f33](`214f33b9d9`), includes: - Fix building issue for transformer related operators - Improve XPU operator coverage Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853 Approved by: https://github.com/EikanWang	2024-12-30 02:38:16 +00:00
PyTorch MergeBot	1b0d19a2cb	Revert "[inductor] Make generated kernels deterministic (#143951 )" This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433. Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))	2024-12-30 02:06:38 +00:00
Henry Hu	cf89127137	[Torch.package] Add support for UntypedStorage tensors (#143930 ) Summary: fp8 uses untyped storage. Add support for torch.package by using the same logic as in serialization.py Differential Revision: D67684033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143930 Approved by: https://github.com/henrylhtsang	2024-12-30 02:03:52 +00:00
emmettbicker	92d8965082	Adding support for differentiable lr, weight_decay, and betas in Adam/AdamW (#143726 ) Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn) This is also going to probably be my last PR before the holidays! Note: This is a branch of #143710 -- I've never worked on a branch of a branch before so I wasn't sure about the protocol so I thought I'd just make the PR and wait until that one gets merged. This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py) I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this ```py if differentiable and isinstance(beta2, Tensor): if beta2.requires_grad: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2)) else: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) else: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` That I could definitely simplify to just ```py if differentiable and isinstance(beta2, Tensor): exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2)) else: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` It would definitely be a little slower in the case that it's differentiable but doesn't need a grad for beta2, but the code would also be a lot more clear and I'm debating speed vs future code usability. Also the line in the above example: ```py exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2)) ``` was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it. 
Further work on #141832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726 Approved by: https://github.com/janeyx99	2024-12-30 01:11:57 +00:00
Kasperi Apell	a7915c56f6	Propagate callable parameter types using ParamSpec (#142306 ) (#143797 ) The codebase has a few locations where callable parameter type information is lost when the unpackings `*args` and `**kwargs` are typed as Any. Refactor these instances to retain type information using typing_extensions.ParamSpec. Also, in these functions, enforce return type with TypeVar. Addresses #142306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143797 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2024-12-29 23:03:14 +00:00
Jason Ansel	79b354ee37	[inductor] Make generated kernels deterministic (#143951 ) `"compile_id"` had slipped into our generated Triton code (in the metadata), which will defeat caching because the same kernels generated in a different order would not cache hit with each other. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951 Approved by: https://github.com/oulgen	2024-12-29 19:53:33 +00:00
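The ParamSpec pattern named in the commit above can be sketched as follows. This is a generic illustration rather than code from the PyTorch repository: the decorator `passthrough` and the function `scale` are hypothetical, and the sketch falls back to `typing_extensions` only when `typing.ParamSpec` is unavailable (Python older than 3.10).

```python
import functools
from typing import Callable, TypeVar

try:
    from typing import ParamSpec  # Python 3.10+
except ImportError:  # older interpreters
    from typing_extensions import ParamSpec

P = ParamSpec("P")  # captures the wrapped callable's parameter types
R = TypeVar("R")    # captures the wrapped callable's return type


def passthrough(fn: Callable[P, R]) -> Callable[P, R]:
    """Hypothetical decorator that preserves the signature for type checkers."""

    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # With `*args: Any, **kwargs: Any` a type checker would accept any
        # call here; with P.args / P.kwargs it checks calls against `fn`.
        return fn(*args, **kwargs)

    return wrapper


@passthrough
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor


doubled = scale(3.0)
quadrupled = scale(1.5, factor=4.0)
```

At runtime the decorator behaves like any other wrapper. The benefit is static: a call such as `scale("a")` can be flagged by a type checker instead of being silently accepted.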
Xuehai Pan	b6bdb67f82	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-12-29 17:23:13 +00:00
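The rewrites listed in the commit above can be checked on a small example. This sketch uses `PurePosixPath` and `posixpath` so the result does not depend on the host platform; the path itself is made up for illustration.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical file location used only for this illustration.
path = PurePosixPath("/repo/tools/scripts/helpers/build.py")

# Step 2 of the commit: nested dirname calls vs. chained .parent
old_style = posixpath.dirname(posixpath.dirname(str(path)))
new_style = str(path.parent.parent)

# Step 4 of the commit: N chained .parent accesses vs. .parents[N - 1]
chained = path.parent.parent.parent.parent
indexed = path.parents[3]
```

Both pairs name the same directory, so the rewrite is a readability change rather than a behavioral one.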
bobrenjc93	7101b8ca35	remove allow-untyped-defs from onnx/_internal/_lazy_import.py (#143943 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143943 Approved by: https://github.com/justinchuby	2024-12-29 10:29:43 +00:00
bobrenjc93	cf0b72c4ab	remove allow-untyped-defs from _inductor/compile_worker/watchdog.py (#143941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143941 Approved by: https://github.com/Skylion007	2024-12-29 01:05:09 +00:00
bobrenjc93	3ba6fcd3ff	remove allow-untyped-defs from torch/_size_docs.py (#143942 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143942 Approved by: https://github.com/Skylion007	2024-12-29 01:00:46 +00:00
Yanan Cao (PyTorch)	85f348578b	[Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#143929 ) Differential Revision: D67682313 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143929 Approved by: https://github.com/hl475	2024-12-28 23:39:21 +00:00
bobrenjc93	e1abbe155e	remove allow-untyped-defs from ao/nn/qat/dynamic/modules/linear.py (#143919 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143919 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-12-28 20:50:48 +00:00
Nikita Shulga	3054aae493	[MPS] Fix fmin/fmax for scalar argument (#143934 ) CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash) Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934 Approved by: https://github.com/Skylion007	2024-12-28 17:07:19 +00:00
PyTorch MergeBot	45a709d9ec	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit cbc4cf3043a7316c1f6e86b1e22d96042a59489c. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see `d8c3900d80/1` ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))	2024-12-28 16:49:38 +00:00
PyTorch MergeBot	8cccc46e33	Revert "Add AOT inductor support for _scaled_mm for CPU (#141961 )" This reverts commit 3fabd10c40c632104e420ae8e3721f33176e8640. Reverted https://github.com/pytorch/pytorch/pull/141961 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see `d8c3900d80/1` ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))	2024-12-28 16:49:38 +00:00
Nikita Shulga	d8c3900d80	[Inductor] Implement primitive Metal compiler (#143893 ) Still work in progress, only works for element wise operations. Current implementation could be used to turn something like ```python def f(x): return x[:,::2].sin() + x[:, 1::2].cos() ``` into the following shader ```python # Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add] # Source node to ATen node mapping: # add => add # cos => cos # sin => sin # Graph fragment: # %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {}) # %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {}) # %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {}) mps_lib = torch.mps._compile_shader(""" kernel void kernel_0( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { int x0 = xindex; auto tmp0 = in_ptr0[2*x0]; auto tmp1 = metal::precise::sin(tmp0); auto tmp2 = in_ptr0[2*x0 + 1]; auto tmp3 = metal::precise::cos(tmp2); auto tmp4 = tmp1 + tmp3; out_ptr0[x0] = static_cast<float>(tmp4); } """) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893 Approved by: https://github.com/jansel ghstack dependencies: #143891, #143892	2024-12-28 06:58:32 +00:00
leslie-fang-intel	74028cfd0c	[Inductor][CPP] Fix Data Type issue of frexp (#143746 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensors with different data types; the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly because the output index is missing. In this PR, we set the data type of the cse var in the codegen of `frexp` and avoid it being overridden in the following flow. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746 Approved by: https://github.com/jgong5	2024-12-28 06:00:13 +00:00
Animesh Jain	01980cac38	[dynamo] Make ConstDictKeySource a subclass of ChainedSource (#143924 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143924 Approved by: https://github.com/jansel	2024-12-28 05:59:45 +00:00
Jiang, Yanbing	3fabd10c40	Add AOT inductor support for _scaled_mm for CPU (#141961 ) This PR is to add AOT inductor support for _scaled_mm for CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961 Approved by: https://github.com/malfet ghstack dependencies: #139975	2024-12-28 05:57:35 +00:00
Jiang, Yanbing	cbc4cf3043	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2024-12-28 05:49:06 +00:00
eellison	d3e9133ab2	Fix separate in process bisector cache, cleanup on exit (#143661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143661 Approved by: https://github.com/ezyang ghstack dependencies: #143657	2024-12-28 03:20:37 +00:00
Eddie Yan	1e246ef05b	[CUDA][CUDA graphs][RNG] Skip replay prologue if `wholegraph_increment` is 0 (#143777 ) #143572 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143777 Approved by: https://github.com/ngimel, https://github.com/eellison	2024-12-28 02:31:26 +00:00
Nikita Shulga	4a7cf0dbff	[Inductor] Add MPS device op overrides (#143892 ) Mostly dummy interface as MPS backend currently limited to a single device Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892 Approved by: https://github.com/jansel ghstack dependencies: #143891	2024-12-28 02:11:45 +00:00
Jerry Zhang	ad78edee8e	Add support for list, tuple and dict in numeric debugger (#143882 ) Summary: Previously numeric debugger only supports torch.Tensor, this PR adds support for list, tuple and dict as well Test Plan: python test/test_quantization.py -k test_extract_results_from_loggers_list_output Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D67660049](https://our.internmc.facebook.com/intern/diff/D67660049) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143882 Approved by: https://github.com/dulinriley	2024-12-28 02:10:31 +00:00
Animesh Jain	c3c27aef34	[dynamo] Remove HFPretrained config hack (#143698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143698 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #143888	2024-12-28 02:03:13 +00:00
eellison	7c343a9d68	Fix emulate low precision bool inp (#143657 ) Fix for https://github.com/pytorch/pytorch/issues/143502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143657 Approved by: https://github.com/BoyuanFeng	2024-12-28 01:51:28 +00:00
bobrenjc93	88ccf2fa5e	remove allow-untyped-defs from distributed/elastic/multiprocessing/subprocess_handler/handlers.py (#143917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143917 Approved by: https://github.com/Skylion007	2024-12-28 00:13:05 +00:00
Colin Peppler	e3fefdfbf0	[CUTLASS] fix addmm (#143537 ) We would get a CUDA IMA before because we pass Bias in for X. So, we need to re-order the inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143537 Approved by: https://github.com/chenyang78 ghstack dependencies: #143528	2024-12-27 23:47:55 +00:00
Colin Peppler	b54620f40f	[CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528 ) A few small things in this PR: - fixed a bug where `workspace.data_ptr().data_ptr()` showed up - for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created - for addmm kernels, the ldc bias symbol never showed up Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528 Approved by: https://github.com/henrylhtsang	2024-12-27 23:38:18 +00:00
bobrenjc93	c17d767686	remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870 Approved by: https://github.com/aorenste, https://github.com/Skylion007	2024-12-27 23:28:51 +00:00
bobrenjc93	63d6e1f743	remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916 Approved by: https://github.com/Skylion007	2024-12-27 23:25:37 +00:00
Dmitry Nikolaev	928e01545c	restore 'unused' variable to fix test_cuda_device_memory_allocated (#143885 ) This PR fixes `test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_device_memory_allocated` by restoring a deleted 'unused' variable from commit `d8c8ba2440` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143885 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-12-27 23:18:13 +00:00
Emmett Bicker	0de661dc27	Add support for differentiable weight decay (#143679 ) (Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay. And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read Continuation of progress on #141832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-12-27 23:14:43 +00:00
bobrenjc93	c0c7f881da	remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915 Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet	2024-12-27 22:21:28 +00:00
bobrenjc93	af823bd526	remove allow-untyped-defs from utils/tensorboard/_convert_np.py (#143918 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143918 Approved by: https://github.com/Skylion007	2024-12-27 22:19:33 +00:00
Nikita Shulga	fe398de769	[EZ] Update sympy to 1.13.3 (#143908 ) And remove python>=3.9 check as it currently covers all supported python versions Fixes https://github.com/pytorch/pytorch/issues/143907 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143908 Approved by: https://github.com/Skylion007, https://github.com/huydhn	2024-12-27 21:32:55 +00:00
PyTorch MergeBot	b5042cfa58	Revert "remove allow-untyped-defs from torch/ao/__init__.py (#143604 )" This reverts commit 1598d458797e69376a9a148bd37fb6e8167d22e3. Reverted https://github.com/pytorch/pytorch/pull/143604 on behalf of https://github.com/wdvr due to failing typing checks in torchao ([comment](https://github.com/pytorch/pytorch/pull/143604#issuecomment-2564043233))	2024-12-27 21:30:02 +00:00
Nikita Shulga	7a13bfa1ad	[EZ] Update jinja2 to 3.1.5 (#143923 ) To make Dependabot happy about https://cwe.mitre.org/data/definitions/150.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/143923 Approved by: https://github.com/Skylion007	2024-12-27 21:10:21 +00:00
Joel Schlosser	228b228449	Fix batch-specific attention mod for NJT + Flex (#143866 ) Fixes #143788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866 Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch	2024-12-27 20:51:41 +00:00
Nikita Shulga	1e65dec2b9	[Dynamo] Add MPSDevice interface (#143891 ) That simply checks if device is available and whether or not it supports bf16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143891 Approved by: https://github.com/jansel	2024-12-27 20:31:44 +00:00
Xuehai Pan	d2f769476f	[Easy] add quotes to shell activation commands (#143902 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143902 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-12-27 19:17:46 +00:00
Animesh Jain	a87cd5283b	[dynamo] Trace through overridden __getattribute__ method (#143888 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143888 Approved by: https://github.com/jansel	2024-12-27 18:10:00 +00:00
bobrenjc93	fda9048ca8	remove allow-untyped-defs from distributed/elastic/multiprocessing/errors/handlers.py (#143869 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143869 Approved by: https://github.com/Skylion007	2024-12-27 15:49:19 +00:00
YangQun1	a20765a9c1	subgraph rewriter supports matched pattern with no users (#143842 ) Fixes #143841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143842 Approved by: https://github.com/yushangdi	2024-12-27 12:45:39 +00:00
eellison	9e8d84f863	Fix duplicate pattern error (#139321 ) vllm had an error when we were incorrectly stating that two patterns are duplicates. See comment inline: For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish if two equivalent searches are actually different. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321 Approved by: https://github.com/shunting314	2024-12-27 11:10:46 +00:00
PyTorch MergeBot	3571476739	Revert "fix randint distribution for large max (#143787 )" This reverts commit 8059d56ec364feb554f3fb90012a0fc2d2104e7f. Reverted https://github.com/pytorch/pytorch/pull/143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first ([comment](https://github.com/pytorch/pytorch/pull/143787#issuecomment-2563493323))	2024-12-27 09:16:36 +00:00
PyTorch MergeBot	f6801ba4b3	Revert "Use random64 in Fischer-Yates algorithm for large N (#143682 )" This reverts commit 7013be0094e8d3ded2ba2f948082f98d63e622bb. Reverted https://github.com/pytorch/pytorch/pull/143682 on behalf of https://github.com/wdvr due to failing Meta internal tests that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/143682#issuecomment-2563487675))	2024-12-27 09:09:33 +00:00
Yanan Cao (PyTorch)	ba5cacbc17	[Codemod][AddExplicitStrictExportArg] caffe2/test (#143688 ) Reviewed By: avikchaudhuri Differential Revision: D67530154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143688 Approved by: https://github.com/tugsbayasgalan	2024-12-27 07:58:44 +00:00
Animesh Jain	969415885d	[inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373 Approved by: https://github.com/eellison	2024-12-27 06:46:09 +00:00
cyy	379bbef23c	Enable more C++ warnings (#143355 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355 Approved by: https://github.com/albanD	2024-12-27 05:46:57 +00:00
PyTorch MergeBot	fca457b5db	Revert "Add torch._scaled_mm for CPU (#139975 )" This reverts commit 3f80632c802f1d9fafd0c303d45ba2376b9c1e11. Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))	2024-12-27 05:25:17 +00:00
Animesh Jain	0f474a960b	[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699 Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #143722	2024-12-27 04:51:35 +00:00
Animesh Jain	e296bab614	[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722 ) In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722 Approved by: https://github.com/jansel	2024-12-27 04:51:35 +00:00
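The difference between `d.keys()` and `dict.keys(d)` that the commit above relies on can be seen in plain Python. This is only a sketch of the language behavior, not Dynamo code; the class `RenamingDict` is a hypothetical example of a dict subclass with an overridden `keys` method.

```python
class RenamingDict(dict):
    """Hypothetical dict subclass whose keys() hides the stored keys."""

    def keys(self):
        # User-defined override: returns something unrelated to storage.
        return ["overridden"]


d = RenamingDict(a=1, b=2)

# Normal attribute lookup dispatches to the subclass override.
via_method = list(d.keys())

# Calling the base-class method directly bypasses the override and
# reads the keys actually stored in the underlying dict.
via_base = list(dict.keys(d))
```

`via_method` is `["overridden"]`, while `via_base` is `["a", "b"]`, which matches what iterating the underlying dict storage yields.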
bobrenjc93	d60282c177	remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881 Approved by: https://github.com/aorenste	2024-12-27 04:10:47 +00:00
Huamin Li	43853691bc	[Quantization] add an option keep_original_weights in _lower_to_native_backend (#141049 ) Differential Revision: D66153809 This diff adds an option to keep_original_weights so we can track back the original weight and bias after performing prepare_fx and convert_fx Pull Request resolved: https://github.com/pytorch/pytorch/pull/141049 Approved by: https://github.com/jerryzh168	2024-12-27 04:02:07 +00:00
Chirag Pandya	809106a93f	[fr][c10d] fix flaky test (#143878 ) Summary: Test erroneously assumed that input/output sizes are same and that all states are matchable. Fixes issue #143798 Test Plan: Test passes Reviewers Test passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/143878 Approved by: https://github.com/fduwjj ghstack dependencies: #143865	2024-12-27 03:13:15 +00:00
Chirag Pandya	1cd70e7e23	[fr][c10d] log trace capture enabled or not in flight recorder (#143865 ) Summary: Refactor logging for flight recorder so we can log if the capture was with or without stack trace capture enabled. We introduce a new column ('trace_enabled') in the logger. Test Plan: Tested on local job and noted that correct output was produced. Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865 Approved by: https://github.com/fduwjj	2024-12-27 03:07:55 +00:00
Jason Ansel	6bdf2addc5	[inductor] Simplify get_launch_args_* handling (#143835 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143835 Approved by: https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: #143813, #143814, #143815, #143817, #143818	2024-12-27 02:02:11 +00:00
Jason Ansel	138efb3002	[inductor] Move GPUTarget backwards compat to triton_compat.py (#143818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143818 Approved by: https://github.com/eellison ghstack dependencies: #143813, #143814, #143815, #143817	2024-12-27 02:02:11 +00:00
Jason Ansel	be1936804b	[inductor] Drop support for pre-ASTSource Triton (#143817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143817 Approved by: https://github.com/eellison ghstack dependencies: #143813, #143814, #143815	2024-12-27 02:02:11 +00:00
Jason Ansel	f3d0f67039	[inductor] Minor refactor of hip compile_meta (#143815 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143815 Approved by: https://github.com/eellison ghstack dependencies: #143813, #143814	2024-12-27 02:02:11 +00:00
bobrenjc93	29841b9414	remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871 Approved by: https://github.com/Skylion007	2024-12-27 01:20:26 +00:00
bobrenjc93	373dba35f9	remove allow-untyped-defs from fx/experimental/refinement_types.py (#143868 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143868 Approved by: https://github.com/Skylion007	2024-12-27 01:00:45 +00:00
Xuehai Pan	c4bff71854	[Easy] Add ROCm support to nightly pull tool (#141282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141282 Approved by: https://github.com/malfet ghstack dependencies: #143263	2024-12-27 00:07:38 +00:00
Natalia Gimelshein	8059d56ec3	fix randint distribution for large max (#143787 ) Fixes #ISSUE_NUMBER Similar to #143682, for large maximum values we were sampling integers via % and it doesn't provide a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`). This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it. `torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better. `__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787 Approved by: https://github.com/eqy	2024-12-26 23:54:03 +00:00
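The skew from sampling via `%` that the commit above refers to can be worked out with simple arithmetic. The sketch below is not the PR's kernel code; `modulo_skew` is a hypothetical helper that assumes a uniformly distributed `word_bits`-bit word reduced with `% n`, and reports how much more likely the most frequent output is than the least frequent one.

```python
from collections import Counter


def modulo_skew(word_bits: int, n: int) -> float:
    """Relative excess probability of the likeliest value of `word % n`.

    Assumes `word` is uniform over [0, 2**word_bits) and 1 <= n <= 2**word_bits.
    """
    total = 1 << word_bits
    low = total // n                        # preimages of the rarest outputs
    high = low + (1 if total % n else 0)    # preimages of the commonest outputs
    return high / low - 1.0


# Tiny exhaustive check: a 3-bit word reduced modulo 3.
counts = Counter(word % 3 for word in range(8))

# Just below the 2**32 / 128 threshold from the commit message, each output
# has 128 or 129 preimages, so the skew stays under 1%.
skew_at_threshold = modulo_skew(32, 2**25 - 1)

# For a very large range, some values become twice as likely as others.
skew_large = modulo_skew(32, 3 * 2**30)
```

With these inputs the 3-bit example gives counts of 3, 3, and 2, `skew_at_threshold` is 1/128 (about 0.78%), and `skew_large` is 1.0, which is consistent with the "approx 1%" bound quoted in the commit message.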
bobrenjc93	1598d45879	remove allow-untyped-defs from torch/ao/__init__.py (#143604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143604 Approved by: https://github.com/aorenste	2024-12-26 23:27:16 +00:00
Jiang, Yanbing	3f80632c80	Add torch._scaled_mm for CPU (#139975 ) This PR adds `torch._scaled_mm` for the CPU backend. `_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #139974	2024-12-26 22:22:42 +00:00
PyTorch MergeBot	26364428f5	Revert "[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722 )" This reverts commit fe95cbe018218d159ba0a0269045b31ff72f1a20. Reverted https://github.com/pytorch/pytorch/pull/143722 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))	2024-12-26 22:04:36 +00:00
PyTorch MergeBot	ee25daef5a	Revert "[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699 )" This reverts commit 7d1c6661397f9bff93c1ea389506c8a163d7a2ab. Reverted https://github.com/pytorch/pytorch/pull/143699 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))	2024-12-26 22:04:35 +00:00
Darshan Sanghani	2966fb3708	[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143775 ) The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET. This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc. We also added a few ways to enable ET and ET Resources through the OS environment variables. Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in Pytorch. Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels. Differential Revision: [D67610542](https://our.internmc.facebook.com/intern/diff/D67610542/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67610542/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/143775 Approved by: https://github.com/shengfukevin, https://github.com/wdvr	2024-12-26 21:15:39 +00:00
chuanqiw	96e9a5aeec	[CI] Disable sccache for xpu test (#143851 ) WA for https://github.com/pytorch/pytorch/issues/143585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143851 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-26 19:45:04 +00:00
Aaron Orenstein	3df12d38cf	dynamo tracing perf: cache cleaned_instructions: 33.7 -> 30.0 (#143070 ) See #143056 for overall docs. This PR: Cache the interesting/expensive bits of `cleaned_instructions()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143070 Approved by: https://github.com/jansel	2024-12-26 19:02:08 +00:00
Xuehai Pan	51a7ecde80	[Easy] Bump CUDA nightly version to 11.8 / 12.4 / 12.6 in nightly pull tool (#143263 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143263 Approved by: https://github.com/malfet	2024-12-26 19:01:38 +00:00
lzhang2	78502a58ba	Enable FSDP2 on XPU device (#143737 ) Motivation: Enabling FSDP2 on XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143737 Approved by: https://github.com/awgu	2024-12-26 18:34:11 +00:00
PyTorch MergeBot	475656fd9c	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit 2293fe1024812d6349f6e2b3b7de82c6b73f11e4. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))	2024-12-26 17:32:23 +00:00
PyTorch MergeBot	cc4e70b7c3	Revert "Use absolute path `path.resolve()` -> `path.absolute()` (#129409 )" This reverts commit 135c7db99d646b8bd9603bf969d47d3dec5987b1. Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))	2024-12-26 17:26:06 +00:00
PyTorch MergeBot	9255ffc841	Revert "Enable more C++ warnings (#143355 )" This reverts commit daa3ffe0ebff58577b8db964447b6abc6de53a25. Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet due to It fails internal build system as it kind of breaks separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))	2024-12-26 17:13:10 +00:00
Jason Ansel	cf76c05b4d	[inductor] Refactor conditional triton imports into triton_compat.py (#143814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143814 Approved by: https://github.com/Skylion007 ghstack dependencies: #143813	2024-12-26 09:14:06 +00:00
Jason Ansel	efac5ed81b	[inductor] Reorder imports in codecache.py (#143813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143813 Approved by: https://github.com/Skylion007	2024-12-26 09:14:06 +00:00
dependabot[bot]	bf8da4c145	Bump jinja2 from 3.1.4 to 3.1.5 in /.ci/docker (#143844 ) Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/pallets/jinja/releases">jinja2's releases</a>.</em></p> <blockquote> <h2>3.1.5</h2> <p>This is the Jinja 3.1.5 security fix release, which fixes security issues and bugs but does not otherwise change behavior and should not result in breaking changes compared to the latest feature release.</p> <p>PyPI: <a href="https://pypi.org/project/Jinja2/3.1.5/">https://pypi.org/project/Jinja2/3.1.5/</a> Changes: <a href="https://jinja.palletsprojects.com/changes/#version-3-1-5">https://jinja.palletsprojects.com/changes/#version-3-1-5</a> Milestone: <a href="https://github.com/pallets/jinja/milestone/16?closed=1">https://github.com/pallets/jinja/milestone/16?closed=1</a></p> <ul> <li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. <a href="https://github.com/pallets/jinja/security/advisories/GHSA-q2x7-8rv6-6q7h">GHSA-q2x7-8rv6-6q7h</a></li> <li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. <a href="https://redirect.github.com/pallets/jinja/issues/1792">#1792</a>, <a href="https://github.com/pallets/jinja/security/advisories/GHSA-gmj6-6f8f-6699">GHSA-gmj6-6f8f-6699</a></li> <li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. <a href="https://redirect.github.com/pallets/jinja/issues/2032">#2032</a></li> <li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1952">#1952</a></li> <li>Avoid unclosed <code>auto_aiter</code> warnings. 
<a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li> <li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li> <li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li> <li>Avoid leaving async generators unclosed in blocks, includes and extends. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li> <li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. <a href="https://redirect.github.com/pallets/jinja/issues/1701">#1701</a></li> <li>Make <code>\|unique</code> async-aware, allowing it to be used after another async-aware filter. <a href="https://redirect.github.com/pallets/jinja/issues/1781">#1781</a></li> <li><code>\|int</code> filter handles <code>OverflowError</code> from scientific notation. <a href="https://redirect.github.com/pallets/jinja/issues/1921">#1921</a></li> <li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. <a href="https://redirect.github.com/pallets/jinja/issues/2021">#2021</a></li> <li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. <a href="https://redirect.github.com/pallets/jinja/issues/2025">#2025</a></li> <li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. <a href="https://redirect.github.com/pallets/jinja/issues/2027">#2027</a></li> <li><code>Environment.overlay(enable_async)</code> is applied correctly. <a href="https://redirect.github.com/pallets/jinja/issues/2061">#2061</a></li> <li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. 
<a href="https://redirect.github.com/pallets/jinja/issues/1661">#1661</a></li> <li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. <a href="https://redirect.github.com/pallets/jinja/issues/1705">#1705</a></li> <li>Improve annotations for methods returning copies. <a href="https://redirect.github.com/pallets/jinja/issues/1880">#1880</a></li> <li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1870">#1870</a></li> <li>Tests decorated with <code>@pass_context</code> can be used with the <code>\|select</code> filter. <a href="https://redirect.github.com/pallets/jinja/issues/1624">#1624</a></li> <li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. <a href="https://redirect.github.com/pallets/jinja/issues/1413">#1413</a></li> <li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. <a href="https://redirect.github.com/pallets/jinja/issues/1253">#1253</a></li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/pallets/jinja/blob/main/CHANGES.rst">jinja2's changelog</a>.</em></p> <blockquote> <h2>Version 3.1.5</h2> <p>Released 2024-12-21</p> <ul> <li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. :ghsa:<code>q2x7-8rv6-6q7h</code></li> <li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. :issue:<code>1792</code>, :ghsa:<code>gmj6-6f8f-6699</code></li> <li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. 
:issue:<code>2032</code></li> <li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. :pr:<code>1952</code></li> <li>Avoid unclosed <code>auto_aiter</code> warnings. :pr:<code>1960</code></li> <li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. :pr:<code>1960</code></li> <li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. :pr:<code>1960</code></li> <li>Avoid leaving async generators unclosed in blocks, includes and extends. :pr:<code>1960</code></li> <li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. :issue:<code>1701</code></li> <li>Make <code>\|unique</code> async-aware, allowing it to be used after another async-aware filter. :issue:<code>1781</code></li> <li><code>\|int</code> filter handles <code>OverflowError</code> from scientific notation. :issue:<code>1921</code></li> <li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. :issue:<code>2021</code></li> <li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. :issue:<code>2025</code></li> <li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. :issue:<code>2027</code></li> <li><code>Environment.overlay(enable_async)</code> is applied correctly. :pr:<code>2061</code></li> <li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. :issue:<code>1661</code></li> <li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. :issue:<code>1705</code></li> <li>Improve annotations for methods returning copies. :pr:<code>1880</code></li> <li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. 
:pr:<code>1870</code></li> <li>Tests decorated with <code>@pass_context</code> can be used with the <code>\|select</code> filter. :issue:<code>1624</code></li> <li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. :issue:<code>1413</code></li> <li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. :issue:<code>1253</code></li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="`877f6e51be`"><code>877f6e5</code></a> release version 3.1.5</li> <li><a href="`8d58859265`"><code>8d58859</code></a> remove test pypi</li> <li><a href="`eda8fe86fd`"><code>eda8fe8</code></a> update dev dependencies</li> <li><a href="`c8fdce1e03`"><code>c8fdce1</code></a> Fix bug involving calling set on a template parameter within all branches of ...</li> <li><a href="`66587ce989`"><code>66587ce</code></a> Fix bug where set would sometimes fail within if</li> <li><a href="`fbc3a696c7`"><code>fbc3a69</code></a> Add support for namespaces in tuple parsing (<a href="https://redirect.github.com/pallets/jinja/issues/1664">#1664</a>)</li> <li><a href="`b8f4831d41`"><code>b8f4831</code></a> more comments about nsref assignment</li> <li><a href="`ee832194cd`"><code>ee83219</code></a> Add support for namespaces in tuple assignment</li> <li><a href="`1d55cddbb2`"><code>1d55cdd</code></a> Triple quotes in docs (<a href="https://redirect.github.com/pallets/jinja/issues/2064">#2064</a>)</li> <li><a href="`8a8eafc6b9`"><code>8a8eafc</code></a> edit block assignment section</li> <li>Additional commits viewable in <a href="https://github.com/pallets/jinja/compare/3.1.4...3.1.5">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=jinja2&package-manager=pip&previous-version=3.1.4&new-version=3.1.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts). </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143844 Approved by: https://github.com/Skylion007 Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-12-26 05:20:06 +00:00
cyy	e05bfb8ee3	[Submodule] Bump libfmt to 11.1.0 (#143843 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843 Approved by: https://github.com/Skylion007	2024-12-26 04:49:11 +00:00
Raymond Li	4bacfd6e11	Sort requirements.txt (#143778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143778 Approved by: https://github.com/albanD	2024-12-26 00:51:52 +00:00
cyy	f42cff4e29	[17/N] Fix extra warnings brought by clang-tidy-17 (#143804 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143804 Approved by: https://github.com/Skylion007	2024-12-25 19:54:42 +00:00
shaoyuyoung	a8ac3a6b20	[inductor] fix the `adaptive_avg_pool` on processing int64 (#143802 ) Fixes #143801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143802 Approved by: https://github.com/jansel	2024-12-25 09:08:43 +00:00
Tal Ben-Nun	c0d710634f	Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292 ) Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-12-25 02:37:11 +00:00
Natalia Gimelshein	7013be0094	Use random64 in Fisher-Yates algorithm for large N (#143682 ) Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682 Approved by: https://github.com/eqy, https://github.com/albanD	2024-12-25 01:19:19 +00:00
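For readers unfamiliar with the shuffle named in the commit above, here is a minimal pure-Python sketch of a Fisher-Yates permutation that draws 64 random bits per step. It is not the PyTorch `randperm` implementation; the helper name `fisher_yates_permutation` and the use of `random.Random.getrandbits` are assumptions made for illustration. The general motivation for a wider draw is that reducing a k-bit random value modulo a range of size m carries a bias on the order of m / 2^k, which stops being negligible when m is large relative to 2^32; presumably that is the kind of issue the commit addresses.

```python
import random


def fisher_yates_permutation(n, rng=None):
    """Return a shuffled list of 0..n-1 (hypothetical helper, not PyTorch code)."""
    rng = rng or random.Random()
    perm = list(range(n))
    for i in range(n - 1):
        # Draw 64 bits so the raw value range is far larger than the
        # n - i candidate positions, keeping the modulo bias tiny.
        r = rng.getrandbits(64)
        j = i + r % (n - i)  # pick a position in [i, n)
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Calling `fisher_yates_permutation(10, random.Random(0))` returns some ordering of the integers 0 through 9; passing a seeded `random.Random` makes the result reproducible.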
Jack Taylor	27b0d41f0a	[ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#143569 ) Currently the upstream example for AOTI usage breaks on ROCm (https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html) ``` File "/root/upstream/torch/_dynamo/exc.py", line 317, in unimplemented raise Unsupported(msg, case_name=case_name) torch._dynamo.exc.Unsupported: unsupported operator: aten.miopen_batch_norm.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix) from user code: File "/root/vision/torchvision/models/resnet.py", line 285, in forward return self._forward_impl(x) File "/root/vision/torchvision/models/resnet.py", line 269, in _forward_impl x = self.bn1(x) ``` This PR adds a meta_registration for miopen_batch_norm to resolve this issue Pull Request resolved: https://github.com/pytorch/pytorch/pull/143569 Approved by: https://github.com/jeffdaily	2024-12-24 23:43:11 +00:00
Jason Ansel	9035fb5a7b	[dynamo] Add types to exc.py (#143626 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143626 Approved by: https://github.com/yanboliang ghstack dependencies: #143552, #143610	2024-12-24 21:48:32 +00:00
Jason Ansel	3e7f9e2cc4	[inductor] Shorten tracebacks for errors inside inductor (by skipping AOTAutograd frames) (#143610 ) Before #143552 ```py Traceback (most recent call last): File "/home/jansel/pytorch/repro.py", line 51, in <module> fp32_compiled = optimized_model(low_input) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn return fn(args, *kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1381, in __call__ return self._torchdynamo_orig_callable( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1165, in __call__ result = self._inner_convert( ^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 547, in __call__ return _compile( ^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 987, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 715, in compile_inner return _compile_inner(code, one_graph, hooks, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function return function(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 750, in _compile_inner out_code = transform_code_object(code, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object transformations(instructions, code_options) File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 231, in _fn return fn(args, kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 662, in transform tracer.run() File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run super().run() File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run while self.step(): ^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step self.dispatch_table[inst.opcode](self, inst) File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE self._return(inst) File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return self.output.compile_subgraph( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1101, in compile_subgraph self.compile_and_call_fx_graph( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph compiled_fn = self.call_user_compiler(gm) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler return self._call_user_compiler(gm) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler raise BackendCompilerFailed(self.compiler_fn, e).with_traceback( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__ compiled_gm = 
compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx return aot_autograd( ^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__ cg = aot_module_simplified(gm, example_inputs, self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified compiled_fn = AOTAutogradCache.load( ^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load compiled_fn = dispatch_and_compile() ^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile compiled_fn, _ = create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function return _create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function compiled_fn, fw_metadata = compiler_fn( ^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__ return self.compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base return inner_compile( ^^^^^^^^^^^^^^ File 
"/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile compiled_fn = graph.compile_to_module().call ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen() ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen self.scheduler = Scheduler(self.operations) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__ self._init(nodes) File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init self.nodes = self.fuse_nodes(self.nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes nodes = self.fuse_nodes_once(nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once assert False, "a fake error during fusion" ^^^^^ 
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: a fake error during fusion Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information ``` Before this PR ```py Traceback (most recent call last): File "/home/jansel/pytorch/repro.py", line 51, in <module> fp32_compiled = optimized_model(low_input) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1484, in _call_user_compiler raise BackendCompilerFailed( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__ compiled_gm = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx return aot_autograd( ^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__ cg = aot_module_simplified(gm, example_inputs, self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified compiled_fn = AOTAutogradCache.load( ^^^^^^^^^^^^^^^^^^^^^^ 
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load compiled_fn = dispatch_and_compile() ^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile compiled_fn, _ = create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function return _create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function compiled_fn, fw_metadata = compiler_fn( ^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__ return self.compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base return inner_compile( ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile compiled_fn = graph.compile_to_module().call ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen() ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen self.scheduler = Scheduler(self.operations) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__ self._init(nodes) File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init self.nodes = self.fuse_nodes(self.nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes nodes = self.fuse_nodes_once(nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once assert False, "a fake error during fusion" ^^^^^ torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: a fake error during fusion Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information ``` After this PR ```py Traceback (most recent call last): File "/home/jansel/pytorch/repro.py", line 51, in <module> fp32_compiled = optimized_model(low_input) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise 
e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner raise InductorError(e, currentframe()).with_traceback( File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1138, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1053, in codegen_and_compile compiled_fn = graph.compile_to_module().call ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen() ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen self.scheduler = Scheduler(self.operations) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__ self._init(nodes) File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init self.nodes = self.fuse_nodes(self.nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes nodes = self.fuse_nodes_once(nodes) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once assert False, "a fake error during fusion" ^^^^^ torch._inductor.exc.InductorError: AssertionError: a fake error during fusion Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information ``` A large number of 
frames are removed between: ```py File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner raise InductorError(e, currentframe()).with_traceback( ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143610 Approved by: https://github.com/eellison ghstack dependencies: #143552	2024-12-24 21:48:32 +00:00
Jason Ansel	9e5f3fdfc7	[dynamo] Shorten tracebacks for backend compiler errors (#143552 ) Fixes #143406 After this PR the error for missing Triton is: ```py Traceback (most recent call last): File "/home/jansel/pytorch/repro.py", line 51, in <module> fp32_compiled = optimized_model(low_input) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend raise TritonMissing(inspect.currentframe()) torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. 
More information on installing Triton can be found at: https://github.com/triton-lang/triton Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True ``` Setting `TORCHDYNAMO_VERBOSE=1` yields something like the old error: ```py Traceback (most recent call last): File "/home/jansel/pytorch/repro.py", line 51, in <module> fp32_compiled = optimized_model(low_input) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn return fn(args, *kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl return self._call_impl(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl return forward_call(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1383, in __call__ return self._torchdynamo_orig_callable( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1167, in __call__ result = self._inner_convert( ^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 548, in __call__ return _compile( ^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 988, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 716, in compile_inner return _compile_inner(code, one_graph, hooks, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function return function(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 751, in _compile_inner out_code = transform_code_object(code, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object transformations(instructions, code_options) File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 232, in _fn return fn(args, kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 663, in transform tracer.run() File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run super().run() File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run while self.step(): ^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step self.dispatch_table[inst.opcode](self, inst) File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE self._return(inst) File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return self.output.compile_subgraph( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1102, in compile_subgraph self.compile_and_call_fx_graph( File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1383, in compile_and_call_fx_graph compiled_fn = self.call_user_compiler(gm) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1433, in call_user_compiler return self._call_user_compiler(gm) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 
1463, in _call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__ compiled_gm = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx return aot_autograd( ^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__ cg = aot_module_simplified(gm, example_inputs, self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified compiled_fn = AOTAutogradCache.load( ^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load compiled_fn = dispatch_and_compile() ^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile compiled_fn, _ = create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function return _create_aot_dispatcher_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function compiled_fn, fw_metadata = compiler_fn( ^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__ return self.compiler_fn(gm, 
example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base return inner_compile( ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile compiled_fn = graph.compile_to_module().call ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen() ^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1916, in codegen self.scheduler.codegen() File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3667, in codegen return self._codegen() ^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3761, in _codegen if device is not None and self.get_backend(device).ready_to_flush(): ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3631, in get_backend self.backends[device] = 
self.create_backend(device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend raise TritonMissing(inspect.currentframe()) torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True ``` This PR also strips dynamo stack frames from other types of backend compile errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143552 Approved by: https://github.com/yanboliang	2024-12-24 21:48:23 +00:00
PyTorch MergeBot	844e6108f6	Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 )" This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff. Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))	2024-12-24 17:22:57 +00:00
atalman	6c32ef4c5b	Remove builder repo from workflows and scripts (#143776 ) Part of https://github.com/pytorch/builder/issues/2054 The builder repo is no longer used, so remove any references to it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143776 Approved by: https://github.com/huydhn	2024-12-24 14:11:51 +00:00
Luca Wehrstedt	aec3b46274	[DTensor] Add aten.amin/amax to linear_reduction_strategy (#143747 ) In the same vein as https://github.com/pytorch/pytorch/pull/134206, these two ops still seemed missing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143747 Approved by: https://github.com/kwen2501	2024-12-24 13:36:40 +00:00
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
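The two code-style changes listed in #143753 can be illustrated with a small sketch. The function `scale` and the sample strings are hypothetical and are not taken from the PyTorch sources; they only show what a `%`-format to f-string rewrite and a `/` positional-only signature look like.

```python
def scale(value, factor, /):
    """Hypothetical function: parameters before `/` are positional-only."""
    return value * factor


name, count = "ruff", 3

# Old style: %-formatting. New style: an equivalent f-string.
old_style = "%s ran %d checks" % (name, count)
new_style = f"{name} ran {count} checks"

# Passing positional-only parameters by keyword raises TypeError.
try:
    scale(value=2, factor=5)
    keyword_call_rejected = False
except TypeError:
    keyword_call_rejected = True
```

The `/` separator gives the same "callers cannot name these arguments" guarantee that the `__` prefix convention only hinted at.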
Iurii Paikov	dbbc81cb34	Enabled force_shape_pad for test_pad_mm and test_slice_mm_bandwidth_computation (#141768 ) Some tests fail for ROCm build on navi arch because of this check: `f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)` There is no need to determine if mm is compute bound for most of the padding tests since they don't specifically test compute bound behavior. We don't have enough empirical data to fine tune this check for AMD gpus yet. I propose to force the shape padding for the tests that we had trouble with to avoid this unnecessary logic path. Please correct me if I didn't add other tests that can potentially fail with this issue or if I added a test that is dependent on logic below the `force_shape_pad` check here: `f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768 Approved by: https://github.com/jeffdaily	2024-12-24 11:03:39 +00:00
Jiang, Yanbing	783065637e	Add FP8 support for eye (#139974 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-12-24 10:00:23 +00:00
Jason Ansel	060ee14753	[inductor] Make adaptive_max_pool2d error on int64 (#143762 ) Fixes #143752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143762 Approved by: https://github.com/yanboliang	2024-12-24 08:33:59 +00:00
Xuehai Pan	135c7db99d	Use absolute path `path.resolve()` -> `path.absolute()` (#129409 ) Changes: 1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()` 2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409 Approved by: https://github.com/albanD	2024-12-24 08:33:08 +00:00
Jithun Nair	362ecad9bb	[ROCm] Use `linux.rocm.gpu.2` for 2-GPU and `linux.rocm.gpu.4` for 4-GPU runners (#143769 ) * Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4` * Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future: see next point) * Continue to use `linux.rocm.gpu` label for any job that doesn't need more than 1-GPU eg. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769 Approved by: https://github.com/jeffdaily	2024-12-24 08:04:00 +00:00
Yifu Wang	1963fc83a1	[micro_pipeline_tp] don't pass return_A to fused_all_gather_scaled_matmul (#143782 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143782 Approved by: https://github.com/tianyu-l	2024-12-24 07:25:38 +00:00
xinan.lin	ad750ae320	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-24 05:42:36 +00:00
Jason Ansel	b0c3f48a40	[inductor] Improve error message for assert_size_stride (#143765 ) ``` >>> torch._C._dynamo.guards.assert_size_stride(torch.randn(10), (10,), (2,)) Traceback (most recent call last): File "<stdin>", line 1, in <module> AssertionError: expected size 10==10, stride 1==2 at dim=0 This error most often comes from an incorrect meta function for a custom op. See https://pytorch.org/docs/stable/library.html#torch.library.opcheck >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143765 Approved by: https://github.com/zou3519	2024-12-24 05:26:05 +00:00
Jerry Zhang	ace645a017	Add support for prototype affine quantization in pt2e flow (#141421 ) Summary: duplicated affine quantization functionality including observer (https://github.com/pytorch/ao/blob/main/torchao/quantization/observer.py) and some quant_primitive ops (`7c3c51fd0d/torchao/quantization/quant_primitives.py (L26-L30)`) to allow for per group quantization min max observer in pt2e flow Next: We can follow up to add moving average min max observer Test Plan: python test/test_quantization.py -k test_channel_group_quantization Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/141421 Approved by: https://github.com/cccclai	2024-12-24 04:22:18 +00:00
Jason Ansel	60a0d53c13	[dynamo] Add test for #143697 (#143764 ) The issue from #143697 seems to already be fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143764 Approved by: https://github.com/Skylion007	2024-12-24 03:50:15 +00:00
zeshengzong	01d60bcf32	[Easy] Fix todo by enable tests for cuda (#143637 ) Fix TODO in `test_tensor_creation_ops.py` file: ```python # TODO: update to work on CUDA, too ``` Test Result ```bash $ pytest test/test_tensor_creation_ops.py ``` ![image](https://github.com/user-attachments/assets/ef829541-668e-446d-a9ab-b26b9d73085f) ```bash $ lintrunner ``` ![image](https://github.com/user-attachments/assets/d6a46eee-1f60-48e6-898a-a8d9620eb54a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143637 Approved by: https://github.com/albanD	2024-12-24 03:47:43 +00:00
Eddie Yan	b90a3b7281	[cumsum][CUDA][64-bit indexing] Add 64-bit indexing path for `cumsum` (#143696 ) For #143486 Interestingly enough changing the indexing type seems to degrade performance when a larger width is not needed, even on small sizes, so making this a template param rather than forcing all cases to 64-bit Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-24 03:45:28 +00:00
Jason Ansel	dec4286b2d	[inductor] Fix for extract_target with dots (#143766 ) Fixes #143650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143766 Approved by: https://github.com/yanboliang	2024-12-24 03:42:15 +00:00
cyy	1feae27ed6	[16/N] Fix extra warnings brought by clang-tidy-17 (#143714 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-12-24 03:29:38 +00:00
PyTorch MergeBot	49fdc52fd2	Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261 )" This reverts commit bc78b6ea4f88d673426d6de17671b82facf50beb. Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint, plz help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2560583332))	2024-12-24 03:15:38 +00:00
cyy	d6a066ead6	Simplify host_softmax (#143251 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143251 Approved by: https://github.com/albanD	2024-12-24 02:27:51 +00:00
Nikita Shulga	da21fabf34	[BE] Only print MKL version on x86 platforms (#143763 ) As it will obviously be missing on ARM/S390, etc Test plan: run `python3 -c "import torch;print(torch.__config__.parallel_info())"` on both x86 and non-x86 system Pull Request resolved: https://github.com/pytorch/pytorch/pull/143763 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-12-24 02:04:26 +00:00
Animesh Jain	7d1c666139	[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699 Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #143722	2024-12-24 02:00:18 +00:00
Animesh Jain	fe95cbe018	[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722 ) In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722 Approved by: https://github.com/jansel	2024-12-24 02:00:18 +00:00
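The difference between `d.keys()` and `dict.keys(d)` described in #143722 can be seen in plain Python. This is a minimal sketch with a hypothetical subclass `FilteredDict`; it does not reproduce Dynamo's guard code.

```python
class FilteredDict(dict):
    """Hypothetical dict subclass whose keys() hides one entry."""

    def keys(self):
        # The override can return anything; here it drops the "hidden" key.
        return [k for k in dict.keys(self) if k != "hidden"]


d = FilteredDict(visible=1, hidden=2)

overridden = list(d.keys())       # dispatches to FilteredDict.keys
underlying = list(dict.keys(d))   # base-class method: the real stored keys
```

Calling the base-class method directly reads the keys actually stored in the dict, which is the same view that C-level iteration over the dict storage would see.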
zeshengzong	67355a1289	[Easy] Add torch.range, torch.arange params optional description (#143731 ) Fixes #129333 Test Result Before ![image](https://github.com/user-attachments/assets/c5873690-7de7-4a14-9423-a150d17d137e) ![image](https://github.com/user-attachments/assets/ff4ee545-f27a-403b-bf92-51f9571022a3) After ![image](https://github.com/user-attachments/assets/34e2c41f-8b54-417d-bb10-7ca6f679206a) ![image](https://github.com/user-attachments/assets/b54bcebd-70e9-4a1a-8a22-1ab815e17827) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143731 Approved by: https://github.com/janeyx99	2024-12-24 01:29:24 +00:00
Jithun Nair	0ca6a47872	Update tag_regex in filter_test_configs.py for workflows such as `inductor-rocm` (#143768 ) This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs. Without this, the code here doesn't enter the `if` condition: `6ccb8ed186/.github/scripts/filter_test_configs.py (L577)` Tested via [this PR](https://github.com/pytorch/pytorch/pull/140989): Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144 With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143768 Approved by: https://github.com/huydhn	2024-12-24 00:50:14 +00:00
Joshua Hamilton	bc78b6ea4f	Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261 ) Fixes #143071 Operations performed on tensors with `requires_grad=True` such as ```python import torch x = torch.tensor(2.0, requires_grad=True) y = x ** 3 ``` and ```python x = torch.tensor(2.0, requires_grad=True) y = torch.pow(x,3) ``` are valid operations. While an operation using `numpy` like ```python import numpy as np x = torch.tensor(2.0, requires_grad=True) y = np.pow(x,3) # > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead. ``` leads to an error. However, an operation that uses `math` like ```python import math x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) ``` does not cause an error, and `y` is no longer a tensor with a gradient! This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models. To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with: ```python x = torch.tensor(2.0, requires_grad=True) y = math.pow(x,3) # > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior. # Consider using tensor.detach() first. ``` Please let me know if you have any questions 👍 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261 Approved by: https://github.com/albanD	2024-12-24 00:22:18 +00:00
emmettbicker	6ccb8ed186	Refactor AdamW into Adam (heavily inspired by tfsingh) (#143710 ) Fixes #104899 Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW. Heavily heavily inspired by #118857 by @tfsingh Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710 Approved by: https://github.com/janeyx99	2024-12-23 23:27:28 +00:00
Sam Larsen	4271a95590	[logging] A few fixes/updates to record_compilation_metrics (#143332 ) Summary: Mostly cosmetic, but one bug fix: * Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]" * Sort collections in `collection_to_str` * Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings) * Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics Test Plan: ``` python test/dynamo/test_structured_trace.py python test/dynamo/test_utils.py ``` Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332 Approved by: https://github.com/ppanchalia	2024-12-23 23:10:11 +00:00
Natalia Gimelshein	2ab698e708	allow profiling on all threads via experimentalConfig (#143659 ) In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659 Approved by: https://github.com/sraikund16	2024-12-23 20:41:27 +00:00
Aaron Gokaslan	00831f9b22	[BE]: Properly forward raise pickle exception with from (#143761 ) Properly raises the pickle exception with from. Provides a more informative stack trace and forwards information about the exception that led to the current exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143761 Approved by: https://github.com/XuehaiPan, https://github.com/albanD	2024-12-23 20:21:30 +00:00
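Exception chaining with `from`, as mentioned in #143761, can be sketched as follows. `SerializationError` and `dump_or_raise` are hypothetical names used only for illustration, not the code touched by the PR.

```python
import pickle


class SerializationError(RuntimeError):
    """Hypothetical wrapper exception for this illustration."""


def dump_or_raise(obj):
    try:
        return pickle.dumps(obj)
    except Exception as exc:
        # `from exc` records the original error on __cause__, so the
        # traceback shows both the wrapper and the underlying failure.
        raise SerializationError(f"could not pickle {type(obj).__name__}") from exc


try:
    dump_or_raise(lambda: None)  # lambdas cannot be pickled
    caught = None
except SerializationError as err:
    caught = err
```

With the chain in place, the printed traceback includes "The above exception was the direct cause of the following exception", which is what makes the stack trace more informative.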
Jithun Nair	75e1f8a227	[ROCm] upgrade nightly wheels to rocm6.3 - 2 of 2 (binaries) (#143613 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143613 Approved by: https://github.com/jeffdaily	2024-12-23 19:47:30 +00:00
PyTorch MergeBot	0ebc6388cf	Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143218 )" This reverts commit 3bfdf6f0633e6feb067e032009256c740a2a2665. Reverted https://github.com/pytorch/pytorch/pull/143218 on behalf of https://github.com/atalman due to this constrain is ignored see https://github.com/pytorch/pytorch/issues/143654 ([comment](https://github.com/pytorch/pytorch/pull/143218#issuecomment-2560208992))	2024-12-23 19:37:35 +00:00
Sergii Dymchenko	727ee853b4	Apply TorchFix TOR203 fixes (#143691 ) Codemodded via `torchfix . --select=TOR203 --fix`. This is a step to unblock https://github.com/pytorch/pytorch/pull/141076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143691 Approved by: https://github.com/malfet	2024-12-23 18:21:03 +00:00
Sergii Dymchenko	c042c8a475	Use default_collate from public API (#143616 ) Codemodded via `torchfix . --select=TOR104 --fix`. This is a step to unblock https://github.com/pytorch/pytorch/pull/141076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143616 Approved by: https://github.com/malfet	2024-12-23 17:38:43 +00:00
zeshengzong	a70191da41	Add torch.topk indices vary description (#143736 ) Fixes #133542 Test Result Before ![image](https://github.com/user-attachments/assets/65227efb-02af-45e7-804c-35588dff360d) After ![image](https://github.com/user-attachments/assets/91f1f53f-008c-4784-82fe-013404e273cb) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143736 Approved by: https://github.com/zou3519	2024-12-23 17:16:31 +00:00
PyTorch MergeBot	1519a9e30b	Revert "Add FP8 support for eye (#139974 )" This reverts commit 01890526b9068ae20b38b2a33e8f11a6331d7d4b. Reverted https://github.com/pytorch/pytorch/pull/139974 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some slow tests ([comment](https://github.com/pytorch/pytorch/pull/139974#issuecomment-2560046399))	2024-12-23 17:12:39 +00:00
Nikita Shulga	12662901aa	[BE] Move Mac BB test to its own step (#143513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143513 Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere ghstack dependencies: #143395, #143511, #143512	2024-12-23 14:05:10 +00:00
Xuehai Pan	5c4545f857	[BE][Easy] enable PYFMT for `torch/[a-s]*/` (#138447 ) Reproduce command: ```bash ghstack checkout https://github.com/pytorch/pytorch/pull/138447 git checkout HEAD~1 torch/ lintrunner -a --take "PYFMT" --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138447 Approved by: https://github.com/ezyang	2024-12-23 14:04:00 +00:00
Dmitry Rogozhkin	7314cf44ae	torch/accelerator: fix device type comparison (#143541 ) This was failing without the fix: ``` python -c 'import torch; d=torch.device("xpu:0"); torch.accelerator.current_stream(d)' ``` with: ``` ValueError: xpu doesn't match the current accelerator xpu. ``` CC: @guangyey, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/143541 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-12-23 10:54:53 +00:00
Kai Londenberg	434e0c2104	Inductor Cutlass backend: Eliminate unused code. (#143723 ) Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase. Test Plan: CI Differential Revision: D67579837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723 Approved by: https://github.com/ColinPeppler	2024-12-23 09:35:03 +00:00
Jiang, Yanbing	01890526b9	Add FP8 support for eye (#139974 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-12-23 06:47:49 +00:00
PyTorch MergeBot	448c16ac87	Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549 )" This reverts commit 41cdc7f73552cc8a0dbf2d3cb55440c0d6b548ea. Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see `06b4b96b34/1` ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))	2024-12-23 06:47:36 +00:00
Aaron Orenstein	06b4b96b34	dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069 ) See #143056 for overall docs. This PR: Avoid use of python re and move valid varname check in `GuardBuilder.arg_ref()` into C++ Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069 Approved by: https://github.com/jansel	2024-12-23 05:32:09 +00:00
Yu, Guangye	07fa6e2c8b	Fix torch.accelerator api abort when passing invaild device (#143550 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> 
torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD	2024-12-23 03:44:22 +00:00
Jason Ansel	eebc93d41e	Better fix for f-strings in set_linter for py3.12 (#143725 ) #143628 didn't handle a few cases right for example: ```py $ python3 tools/linter/adapters/set_linter.py torch/_inductor/scheduler.py torch/_inductor/scheduler.py:261:24: Builtin `set` is deprecated 259 \| multiline=False, 260 \| ) 261 \| return f"{self}{data_str}" ^ 262 \| 263 \| def log_details(self) -> None: torch/_inductor/scheduler.py:261:33: Builtin `set` is deprecated 259 \| multiline=False, 260 \| ) 261 \| return f"{self}{data_str}" ^ 262 \| 263 \| def log_details(self) -> None: ``` also multi-line fstrings Pull Request resolved: https://github.com/pytorch/pytorch/pull/143725 Approved by: https://github.com/yanboliang	2024-12-22 22:51:27 +00:00
Xiaodong Wang	41cdc7f735	[reland][AMD] Turn on TF32 for aten::mm (#143549 ) Summary: hipblaslt supports TF32, so adding the support. Original PR https://github.com/pytorch/pytorch/pull/139869 Test Plan: CI Differential Revision: D67431681 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549 Approved by: https://github.com/eqy	2024-12-22 21:05:05 +00:00
Nikita Shulga	6425f0779d	[BE] Update triton repo link (#143429 ) It should be https://github.com/triton-lang/triton rather than https://github.com/openai/triton shouldn't it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/143429 Approved by: https://github.com/jansel	2024-12-22 18:38:35 +00:00
Nikita Shulga	a316a4581d	Add mps to GPU_TYPES (#143634 ) Because it is a GPU, but it does not require Triton. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143634 Approved by: https://github.com/jansel	2024-12-22 18:37:35 +00:00
cyy	09c950cc87	Remove unused <ATen/core/Array.h> inclusion (#143701 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701 Approved by: https://github.com/albanD	2024-12-22 14:30:11 +00:00
Oguz Ulgen	dc55704b48	Rename cache limit to recompile limit in configs (#143709 ) This PR renames every cache_limit to recompile_limit via sed. Old config options are maintained via Config(alias='xyz') Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709 Approved by: https://github.com/jansel	2024-12-22 10:03:57 +00:00
Aaron Orenstein	9bf4b1c2e9	dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063 ) See #143056 for overall docs. This PR: Convert `strip_function_call()` into C++ Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063 Approved by: https://github.com/jansel ghstack dependencies: #143057, #143062	2024-12-22 06:38:46 +00:00
Aaron Orenstein	3ec04d30d5	dynamo tracing perf: kill import: 50.36 -> 49.12 (#143062 ) See #143056 for overall docs. This PR: Stop importing in the body of `BuiltinVariable.call_getattr()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143062 Approved by: https://github.com/jansel ghstack dependencies: #143057	2024-12-22 06:38:46 +00:00
Aaron Orenstein	f2b744b9ca	dynamo tracing perf: import_module: 59.92 -> 52.9 (#143057 ) See #143056 for overall docs. This PR: Using `importlib.import_module()` within the hot path of symbolic_convert is slow. Memoize it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143057 Approved by: https://github.com/jansel	2024-12-22 06:38:38 +00:00
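One way to memoize `importlib.import_module()`, as #143057 describes, is a cached wrapper. This is a sketch under the assumption that `functools.lru_cache` is an acceptable cache; the commit message does not say which mechanism the PR uses, and `import_module_cached` is a hypothetical name.

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def import_module_cached(name: str):
    """Hypothetical helper: resolve a module once, then reuse the result."""
    return importlib.import_module(name)


first = import_module_cached("json")
second = import_module_cached("json")   # served from the cache
info = import_module_cached.cache_info()
```

The second call skips the import machinery entirely and returns the cached module object, which is the kind of saving that matters on a hot path.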
Tom Ritchford	f1cbf4b1b5	Enable ruff's unused variable checking everywhere in pytorch (#136965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-12-22 02:33:11 +00:00
Xuehai Pan	2293fe1024	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-12-21 22:08:01 +00:00
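Step 4 of #129374 relies on `.parents[N - 1]` being equivalent to chaining `.parent` N times. A short check with a hypothetical path:

```python
from pathlib import PurePosixPath

# Hypothetical file location, used only to compare the two spellings.
path = PurePosixPath("/repo/torch/_inductor/codegen/common.py")

chained = path.parent.parent.parent.parent   # four hops up
indexed = path.parents[3]                    # same directory, one lookup
```

Both expressions name `/repo`; `parents[0]` is the immediate parent, so four `.parent` hops correspond to index 3.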
PyTorch MergeBot	197954e14b	Revert "Handle meta tensors in FX quantization (#142262 )" This reverts commit e97b97af56204230f1030bd297dda9bc6b053a4c. Reverted https://github.com/pytorch/pytorch/pull/142262 on behalf of https://github.com/janeyx99 due to this PR broke lint ([comment](https://github.com/pytorch/pytorch/pull/142262#issuecomment-2558233022))	2024-12-21 20:34:09 +00:00
Yanan Cao (PyTorch)	0666347fc4	[Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686 ) Reviewed By: avikchaudhuri Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686 Approved by: https://github.com/tugsbayasgalan	2024-12-21 19:56:56 +00:00
Kaustubh Vartak	e97b97af56	Handle meta tensors in FX quantization (#142262 ) Summary: If the module being quantized contains some meta tensors and some tensors on an actual device, we should not fail quantization. Quantization should also not fail if the new quantized module is created on a meta device. Differential Revision: D66895899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262 Approved by: https://github.com/iamzainhuda	2024-12-21 13:19:30 +00:00
cyy	daa3ffe0eb	Enable more C++ warnings (#143355 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355 Approved by: https://github.com/albanD	2024-12-21 09:19:02 +00:00
PyTorch MergeBot	e15442a9b2	Revert "export AOTI_TORCH_EXPORT on Windows. (#140030 )" This reverts commit 6733045a4aaef7a8d9fb1f9f8b80f4f5f4ef1f4f. Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))	2024-12-21 08:06:19 +00:00
Avik Chaudhuri	51eacea8c4	graph module retracing without preserving MCS (#143676 ) Retracing while preserving module call signatures used to be a problem because graph modules don't have submodules at given paths. This led to a number of failing retracebility tests. By not trying to wrap modules with export tracepoints we can pass most of these tests; the only exception is where you do module swapping on retraced programs, which is still not possible. Differential Revision: [D67539304](https://our.internmc.facebook.com/intern/diff/D67539304/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143676 Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan ghstack dependencies: #143664	2024-12-21 07:57:43 +00:00
cyy	d7e59c2f85	Fix cppcoreguidelines-pro-type-member-init (#141787 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141787 Approved by: https://github.com/albanD	2024-12-21 07:51:30 +00:00
Basil Wong	7b2af25f80	[1/n] Support Dynamic Memory Budget in Auto AC (#143539 ) # Summary: Full Context: https://docs.google.com/document/d/1-j5KSbfGFJQcH4sYh7BIeJXso3zYzl5G5yFQqXdKx_o/edit?usp=sharing tl;dr This change introduces classes which help determine a dynamic memory budget. This will mostly be helpful for models with many implicit graph breaks. --- New Classes: GraphInfoProvider * Takes the joint_graph as well as the input memories and runtimes and parses the graph + values into usable forms for the SolverEvaluator. KnapsackEvaluator * Provides a function: Given all of the four inputs (solver function as a callable, max_dynamic_memory_budget, min_dynamic_memory_budget, dynamic_memory_budget_pareto_granularity) it returns an approximation of the knee point of the pareto distribution. # Test Plan: ### LintRunner LintRunner Output: P1700445547 ### Unit Tests ``` $ buck test @mode/opt //caffe2/test/functorch:test_ac_knapsack `@mode/opt` was specified, but not found. Using file at `//mode/opt`. This behavior is being deprecated. Please use `"@//mode/opt"` instead File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpB6PmDS File changed: fbsource//xplat/caffe2/test/functorch/test_ac_knapsack.py File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpyjCiPn 20 additional file change events Buck UI: https://www.internalfb.com/buck2/414ead46-9ede-4192-8e1a-5d3c52bdb9cc Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924710342830 Network: Up: 0B Down: 0B (reSessionID-159794b9-9d61-477e-8e63-9bdeaa537dca) Analyzing targets. Remaining 0/214 Executing actions. Remaining 0/6933 0.1s exec time total Command: test. Finished 1 local Time elapsed: 18.5s Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. 
Build failure 0 ``` ### Test Run Updated the config: ``` activation_memory_budget_solver: DYNAMIC_MEMORY_BUDGET_DP ``` Confirming proper execution via: [aps-fb_fm_v4_768_01_dynamic-2a792ba8af](https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-fb_fm_v4_768_01_dynamic-2a792ba8af?job_attempt=0&version=0&env=PRODUCTION) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143539 Approved by: https://github.com/jansel	2024-12-21 07:38:52 +00:00
PyTorch MergeBot	bee47b0663	Revert "[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430 )" This reverts commit 33dd4f187dd3b54d65182d56998feae235ee48c7. Reverted https://github.com/pytorch/pytorch/pull/143430 on behalf of https://github.com/huydhn due to The internal diff D58707846 has been backed out ([comment](https://github.com/pytorch/pytorch/pull/143430#issuecomment-2558033930))	2024-12-21 07:26:34 +00:00
PyTorch UpdateBot	47c4e01e71	[audio hash update] update the pinned audio hash (#143694 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143694 Approved by: https://github.com/pytorchbot	2024-12-21 05:42:34 +00:00
Richard Barnes	9f3c291bc3	Fix issue with setAttribute and int8_t vs int32_t variables (#143693 ) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693 Approved by: https://github.com/huydhn	2024-12-21 05:31:56 +00:00
Richard Barnes	518b5050c0	Fix unused-variable issues in caffe2 (#143639 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/cyyever	2024-12-21 05:27:38 +00:00
eellison	f44310097c	Reuse partial reductions (#143600 ) Reuse partial reductions for complete reductions. We could expand this to cover more types of reductions, although we'd have to be a bit more careful about keeping the intermediary, partial reduction in higher precision. Just doing the ops which do not depend on a higher compute_dtype_precision for now to cover the relevant use case initially. Fix for https://github.com/pytorch/pytorch/issues/136267. Longer term, we should make sure cooperative reductions fuse partial and complete reductions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143600 Approved by: https://github.com/vkuzo	2024-12-21 04:44:07 +00:00
PyTorch MergeBot	97990f476d	Revert "Fix unused-variable issues in caffe2 (#143639 )" This reverts commit 23ca7c2515dd1f601926c4fd0e65513308c135a9. Reverted https://github.com/pytorch/pytorch/pull/143639 on behalf of https://github.com/huydhn due to This is failing OSS tests ([comment](https://github.com/pytorch/pytorch/pull/143639#issuecomment-2557991297))	2024-12-21 04:30:48 +00:00
PyTorch MergeBot	b89bfe0bac	Revert "Fix issue with setAttribute and int8_t vs int32_t variables (#143693 )" This reverts commit ae3d385fcba0f91f35b2848b852d4c75f88cbd62. Reverted https://github.com/pytorch/pytorch/pull/143693 on behalf of https://github.com/huydhn due to Sorry for reverting this change but it has a conflict with https://github.com/pytorch/pytorch/pull/143639 that is breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/143693#issuecomment-2557990508))	2024-12-21 04:27:18 +00:00
Simon Fan	a8953c36f5	[compiled autograd] log compilation time to perfetto (#140964 ) https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmprli4iy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100 ``` [ { "args": { "compile_id": "0/-/-", "graph_id": 0 }, "cat": "dynamo_timed", "name": "compiled_autograd", "ph": "B", "pid": 0, "tid": 0, "ts": 1733886868992655.8 }, { "args": { "compile_id": "0/-/-", "graph_id": 0 }, "cat": "dynamo_timed", "name": "compiled_autograd", "ph": "E", "pid": 0, "tid": 0, "ts": 1733886869130681.0 }, { "args": { "compile_id": "0/0/0" }, "cat": "dynamo_timed", "name": "dynamo", "ph": "B", "pid": 0, "tid": 0, "ts": 1733886869134350.5 }, { ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140964 Approved by: https://github.com/masnesral ghstack dependencies: #141907, #143175	2024-12-21 04:23:25 +00:00
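The trace excerpt in #140964 follows the Chrome Trace Event layout, where a `"ph": "B"` record opens a span and a matching `"ph": "E"` record closes it. A minimal sketch of recovering span durations from such records is shown below. The helper `durations_us` is hypothetical and not part of PyTorch or tlparse, and it assumes that `ts` is in microseconds and that same-named spans on one thread do not nest.

```python
import json

# The three complete records copied from the commit message above.
events = json.loads("""
[
  {"args": {"compile_id": "0/-/-", "graph_id": 0}, "cat": "dynamo_timed",
   "name": "compiled_autograd", "ph": "B", "pid": 0, "tid": 0,
   "ts": 1733886868992655.8},
  {"args": {"compile_id": "0/-/-", "graph_id": 0}, "cat": "dynamo_timed",
   "name": "compiled_autograd", "ph": "E", "pid": 0, "tid": 0,
   "ts": 1733886869130681.0},
  {"args": {"compile_id": "0/0/0"}, "cat": "dynamo_timed",
   "name": "dynamo", "ph": "B", "pid": 0, "tid": 0,
   "ts": 1733886869134350.5}
]
""")


def durations_us(trace_events):
    """Pair each "B" record with the next "E" record of the same name."""
    open_spans = {}  # (pid, tid, name) -> start timestamp
    finished = {}    # name -> duration, in the unit of `ts`
    for ev in trace_events:
        key = (ev["pid"], ev["tid"], ev["name"])
        if ev["ph"] == "B":
            open_spans[key] = ev["ts"]
        elif ev["ph"] == "E" and key in open_spans:
            finished[ev["name"]] = ev["ts"] - open_spans.pop(key)
    return finished


spans = durations_us(events)
for name, duration in spans.items():
    print(f"{name}: {duration / 1e6:.3f} s")
```

The `dynamo` span has no closing record in the excerpt, so it is left open and not reported.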
PyTorch MergeBot	c7d7eff798	Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 )" This reverts commit efe21ee59dfdd6642cc693e69e07aa9d8be13eb9. Reverted https://github.com/pytorch/pytorch/pull/143347 on behalf of https://github.com/huydhn due to D67118173 has been backed out internally ([comment](https://github.com/pytorch/pytorch/pull/143347#issuecomment-2557983266))	2024-12-21 04:04:16 +00:00
PyTorch MergeBot	dabc9566c4	Revert "(MTIA) Move "empty_cache" API (#143402 )" This reverts commit c7d9f298072a3f59b39517e367c7d3d2ea30e6d9. Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))	2024-12-21 04:01:23 +00:00
Bin Bao	fecf03fa3f	[AOTI][reland] Emit a CMakeLists.txt when package_cpp_only (#143680 ) Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping AOTI generated .pt2 package file, user can manually build the generated model code in their local environment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143680 Approved by: https://github.com/huydhn	2024-12-21 03:48:40 +00:00
xinan.lin	b5e159270a	[AOTI XPU] Replace intel compiler with g++ to build inductor CPP wrapper in runtime. (#142322 ) This PR aims to remove the dependency on the Intel Compiler at Inductor runtime. Now we only need SYCL_HOME at runtime to find the SYCL headers and libs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142322 Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/albanD ghstack dependencies: #143491	2024-12-21 02:27:04 +00:00
xinan.lin	af0e159740	[Inductor XPU] Add XPU check for `is_big_gpu()`. (#143491 ) Fix #143472 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143491 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang	2024-12-21 02:27:04 +00:00
Animesh Jain	0da004f3dd	[dynamo] Remove transformers ModelOutput hack (#143567 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143567 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #143548	2024-12-21 01:46:14 +00:00
Animesh Jain	4627cfd1f9	[dynamo] Support user defined dicts (#143548 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143548 Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/williamwen42	2024-12-21 01:46:14 +00:00
James Wu	9cb743d1f9	[easy] Set feature use for aot autograd remote cache (#143674 ) Use set_feature_use for logging aot autograd cache so that dynamo_compile has this data as well as PT2 Compile Events. Differential Revision: [D67536293](https://our.internmc.facebook.com/intern/diff/D67536293/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143674 Approved by: https://github.com/bobrenjc93	2024-12-21 01:40:18 +00:00
Simon Fan	ffd1b53f26	[aot] refactor dynamo source and cudagraphs static idx logic (#141748 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141748 Approved by: https://github.com/ezyang	2024-12-21 01:20:53 +00:00
Richard Barnes	ae3d385fcb	Fix issue with setAttribute and int8_t vs int32_t variables (#143693 ) Test Plan: Sandcastle Differential Revision: D67549758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693 Approved by: https://github.com/huydhn	2024-12-21 01:19:29 +00:00
Avik Chaudhuri	bdeee82822	unflatten isinstance (#143664 ) When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries. Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664 Approved by: https://github.com/tugsbayasgalan	2024-12-21 01:07:10 +00:00
Simon Fan	d88ebbf822	cleanup chromium event log on dynamo exit rather than on entry (#143175 ) clearing at dynamo start is an issue because it throws away events from compiled autograd Pull Request resolved: https://github.com/pytorch/pytorch/pull/143175 Approved by: https://github.com/Skylion007, https://github.com/jamesjwu ghstack dependencies: #141907	2024-12-21 00:41:24 +00:00
Simon Fan	4ee166b82f	[ca] add compiled autograd to CompileId (#141907 ) tlparse PR: https://github.com/ezyang/tlparse/pull/83 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141907 Approved by: https://github.com/ezyang	2024-12-21 00:41:24 +00:00
Tugsbayasgalan Manlaibaatar	0ce233b8ca	Support tensor subclass unwrapping (#141941 ) This PR adds support for export to unwrap/wrap subclasses AOT so that we can trace through subclass parameters. This will resolve the UX issue in torchao where users had to manually unwrap their subclasses before calling export. Differential Revision: [D67531057](https://our.internmc.facebook.com/intern/diff/D67531057) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141941 Approved by: https://github.com/bdhirsh	2024-12-21 00:29:31 +00:00
Nikita Shulga	553031fb9a	[BE] Remove gcc-5 workaround for unused args (#143685 ) ditto Pull Request resolved: https://github.com/pytorch/pytorch/pull/143685 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/atalman	2024-12-21 00:18:15 +00:00
PyTorch MergeBot	ad7ab5ef84	Revert "[logging] A few fixes/updates to record_compilation_metrics (#143332 )" This reverts commit a9c753bbc88bfdc0e77f66956b3a11e405235d0f. Reverted https://github.com/pytorch/pytorch/pull/143332 on behalf of https://github.com/malfet due to Surprisingly failure is caused by this PR ([comment](https://github.com/pytorch/pytorch/pull/143332#issuecomment-2557899120))	2024-12-21 00:06:44 +00:00
Will Feng	bf7009d839	[rpc] Fix unit test after c10::nullopt removal (#143690 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690 Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu	2024-12-20 23:36:07 +00:00
eqy	912d6a2867	[CUDA] Bump tolerances in `test_svd_lowrank_cuda_float64` (#143049 ) pre-emptive bump for apparent noisy failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/143049 Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/nikitaved	2024-12-20 23:25:21 +00:00
Michael Lazos	8960cb5809	Add support for bfloat16 atomic adds in fbcode (#143629 ) Reland https://github.com/pytorch/pytorch/pull/141857 and fallback on A100 which doesn't have bfloat16 atomic add instrs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629 Approved by: https://github.com/eellison	2024-12-20 23:05:13 +00:00
amdfaa	a3b04d473e	[ROCm] Update setup-rocm for almalinux-based images (#143590 ) Needed for https://github.com/pytorch/test-infra/pull/6003 and https://github.com/pytorch/ao/pull/999 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143590 Approved by: https://github.com/atalman Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2024-12-20 22:48:54 +00:00
Richard Barnes	23ca7c2515	Fix unused-variable issues in caffe2 (#143639 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-12-20 22:30:58 +00:00
Tristan Rice	6e58c37542	c10d: no call_guard in init (#143598 ) `py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe. For more details see #143593 https://github.com/pybind/pybind11/issues/5473 Test plan: ``` python setup.py develop ``` CI ```py import time from concurrent.futures import ThreadPoolExecutor from torch import distributed as dist def run(): store = dist.TCPStore( host_name="localhost", port=0, is_master=True, wait_for_workers=False, ) # this sleep is required to trigger the crash time.sleep(0.1) del store futures = [] with ThreadPoolExecutor( max_workers=100, ) as executor: for i in range(100000): print(i) futures.append(executor.submit(run)) if len(futures) > 100: futures.pop(0).result() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598 Approved by: https://github.com/c-p-i-o	2024-12-20 22:23:36 +00:00
Sam Larsen	a9c753bbc8	[logging] A few fixes/updates to record_compilation_metrics (#143332 ) Summary: Mostly cosmetic, but one bug fix: * Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]" * Sort collections in `collection_to_str` * Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings) * Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics Test Plan: ``` python test/dynamo/test_structured_trace.py python test/dynamo/test_utils.py ``` Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332 Approved by: https://github.com/ppanchalia	2024-12-20 21:42:32 +00:00
Mikayla Gawarecki	372b023eb1	Fix test_serialization_zipfile_actually_jit when weights_only is not default (#143668 ) Fails in fbcode where weights_only isn't default Pull Request resolved: https://github.com/pytorch/pytorch/pull/143668 Approved by: https://github.com/awgu ghstack dependencies: #143326, #143403	2024-12-20 21:25:10 +00:00
Darshan Sanghani	33dd4f187d	[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430 ) The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET. This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc. We also added a few ways to enable ET and ET Resources through the OS environment variables. Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in Pytorch. Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels. Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430 Approved by: https://github.com/shengfukevin, https://github.com/sraikund16	2024-12-20 21:20:32 +00:00
zeshengzong	cee06e74ee	Apply clang-format for ATen/core/dispatch headers (#143620 ) Code change via add path config in `.lintrunner.toml` file and running ```bash $ lintrunner -a --take CLANGFORMAT --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143620 Approved by: https://github.com/malfet	2024-12-20 21:16:23 +00:00
Mikayla Gawarecki	8e483654cb	Add config.save.use_pinned_memory_for_d2h to serialization config (#143342 ) This was benchmarked with two separate scripts on my A100 (A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` (B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` Timings are an average of 5 runs and benchmark scripts + results are attached Under both scenarios, we see ~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``) compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False`` (A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)] \| \| use_pinned_memory_for_d2h=False (Default) \| use_pinned_memory_for_d2h=True \| \|-\|-\|-\| \| `compute_crc_32= True` (Default)\| 28.54s \| 20.76s \| \| `compute_crc_32 = False` \| 22.57s \| 14.51s \| (B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)] \| \| use_pinned_memory_for_d2h=False (Default) \| use_pinned_memory_for_d2h=True \| \|-\|-\|-\| \| `compute_crc_32= True` (Default)\| 8.38s \| 5.53s \| \| `compute_crc_32 = False` \| 6.94s \| 3.99s \| Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False` <img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" /> Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True` <img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" 
src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342 Approved by: https://github.com/albanD ghstack dependencies: #143324	2024-12-20 21:01:18 +00:00
Mikayla Gawarecki	3f63b742e6	Refactor serialization getter/setters into torch.utils.serialization.config (#143324 ) Consolidate - get/set_default_load_endianness - get/set_default_mmap_options - get/set_crc32_options into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC) In #143459 I add the local (argument style) config Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324 Approved by: https://github.com/albanD	2024-12-20 21:01:17 +00:00
Scott Wolchok	629de988df	Fix old-compiler-unfriendly zero init of bfloat16_t array (#143504 ) clang versions before 17 don't like to assign 0 to a bfloat16_t. gcc versions before 13 also won't assign 0.0 to a bfloat16_t. (Citation: https://godbolt.org/z/Gzs5ebdej) Differential Revision: [D67396740](https://our.internmc.facebook.com/intern/diff/D67396740/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143504 Approved by: https://github.com/malfet	2024-12-20 20:49:51 +00:00
Chirag Pandya	485497e727	[c10d][fr] flight recorder improvements (#143446 ) Summary: 1. Flight recorder dumps are now automatically dumped by default upon timeout or exception. Users don't need to opt-in. 2. Change default dump location to running user's home directory `.cache` folder. Test Plan: 1. Tested locally by running the crash program from flight recorder tutorial page. https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example 2. Noted that flight recorder files were correctly created. ❯ pwd /home/cpio/.cache/fr_trace ❯ ls nccl_trace_rank_0 nccl_trace_rank_1 Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446 Approved by: https://github.com/d4l3k	2024-12-20 20:41:30 +00:00
Colin L. Rice	a94f259a69	pgo: Log feature use (#142819 ) This will cause dynamo_compile to populate the feature column if we have a hit for PGO. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142819 Approved by: https://github.com/ezyang	2024-12-20 20:22:20 +00:00
Aaron Orenstein	8ce0bc282a	dynamo tracing perf: bytecode_transform improvements: 34.86 -> 33.9 (#143068 ) See #143056 for overall docs. This PR: Use slots on InstructionExnTabEntry and Instruction. Stop doing python version checks in the middle of `convert_instruction()` and `inst_has_op_bits()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143068 Approved by: https://github.com/jansel ghstack dependencies: #143065, #143067	2024-12-20 20:06:42 +00:00
Aaron Orenstein	5feb2d7b41	dynamo tracing perf: don't call expensive _set_guard_export_info if it's a duplicate guard: 37.66 -> 34.86 (#143067 ) See #143056 for overall docs. This PR: Move the call to `_set_guard_export_info()` after the duplicate guard check in `GuardBuilder.DUPLICATE_INPUT()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143067 Approved by: https://github.com/jansel ghstack dependencies: #143065	2024-12-20 20:06:42 +00:00
Aaron Orenstein	7d4e7fbfc1	dynamo tracing perf: no import on hot path: 47.62 -> 47.26 (#143065 ) See #143056 for overall docs. This PR: Removed another `import` in the body of the hot path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143065 Approved by: https://github.com/jansel	2024-12-20 20:06:42 +00:00
Yanbo Liang	792e6184c5	[GPT-fast] Support run specific model or micro-benchmark (#143607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607 Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn	2024-12-20 19:58:07 +00:00
Nikhil Gupta	94737e8a2a	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-20 19:32:03 +00:00
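Step 1 of #134124 (symmetric 4-bit quantization, two values per `uint8`) can be illustrated in plain Python. This is only a sketch of the general idea: the helper names, the `+8` offset, the low-nibble-first order, and the tiny group of four weights are all assumptions made for illustration, and they are not the layout produced by `torch.ops.aten._dyn_quant_pack_4bit_weight` (the PR requires group sizes that are multiples of 32).

```python
def quantize_group(weights):
    """Symmetric 4-bit quantization of one group: returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so the range is symmetric around zero.
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_nibbles(q):
    """Store two signed 4-bit values per byte (assumed order: low nibble first)."""
    if len(q) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Shift [-8, 7] to [0, 15] so each value fits an unsigned nibble.
        packed.append(((hi + 8) << 4) | (lo + 8))
    return bytes(packed)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    q = []
    for byte in packed:
        q.append((byte & 0x0F) - 8)
        q.append((byte >> 4) - 8)
    return q


weights = [7.0, -3.0, 0.0, 1.0]  # hypothetical group of weights
q, scale = quantize_group(weights)
packed = pack_nibbles(q)
restored = [v * scale for v in unpack_nibbles(packed)]
print(q, scale, packed.hex(), restored)
```

Packing halves the storage for the integer weights; the per-group `scale` is what the PR calls `scales` and has to be kept alongside the packed bytes to dequantize.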
Tom Ritchford	b5475d334e	[inductor] Fix an unused variable in cpu_vec_isa.py (#138473 ) ---- * Extracted from https://github.com/pytorch/pytorch/pull/133492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138473 Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/xuhancn	2024-12-20 18:50:19 +00:00
Nikita Shulga	5a69c2a649	[BE][Sparse] Get rid of gcc-5 workaround (#143653 ) Discovered those comments while looking at https://github.com/pytorch/pytorch/pull/143620 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143653 Approved by: https://github.com/albanD	2024-12-20 18:40:45 +00:00
Joy Dong	a5ed499f6a	FlexAttention Benchmark (#139665 ) 1. Add alibi, sliding window, tanh softcap, prefixLM, and document_mask from attn_gym to benchmark. 2. Add comparison to different SDPA backends & FAv2, FAv3, FAKV. Dependent on https://github.com/pytorch/pytorch/pull/139639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139665 Approved by: https://github.com/drisspg	2024-12-20 17:52:24 +00:00
Hyunho Yeo	c7d9f29807	(MTIA) Move "empty_cache" API (#143402 ) Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api ``` https://www.internalfb.com/intern/testinfra/testrun/13510798943184259 Reviewed By: nautsimon Differential Revision: D67148738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402 Approved by: https://github.com/nautsimon	2024-12-20 17:39:06 +00:00
Colin L. Rice	d79fbf6b6d	test/dynamo/test_utils: logging - Stop testing for impossible things. (#143535 ) We don't support assigning to objects or numeric constants at the top level in config modules, no need to test for them. (This specifically breaks later sorting refactoring, since it requires < to be implemented). Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535 Approved by: https://github.com/ppanchalia	2024-12-20 17:21:49 +00:00
Huamin Li	f5af87c23c	Make Inductor cpp backend enable_floating_point_contract_flag to take string (#143450 ) Differential Revision: D66269001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143450 Approved by: https://github.com/desertfire	2024-12-20 16:28:54 +00:00
William Wen	7ab880bc5e	fix typo in autocast header (#143625 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625 Approved by: https://github.com/mlazos ghstack dependencies: #143592	2024-12-20 16:17:15 +00:00
bobrenjc93	4f8b7c4272	Revert "refactor tensorify restart logic to use sources (#141517 )" (#143623 ) This reverts commit 30d8b30db7eaaa254d97077ac6515cdc4568fd6d. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143623 Approved by: https://github.com/mlazos	2024-12-20 15:38:34 +00:00
leslie-fang-intel	607884c9af	[Inductor][CPP] Fix bitwise shift with corner inputs (#143635 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566, we can align the implementation with Eager: `29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)` at these corner inputs. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635 Approved by: https://github.com/jgong5	2024-12-20 13:47:40 +00:00
Guilherme Leobas	7bf3b7cdc5	Rewrite _reparametrize_module to use `contextmanager` (#138203 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203 Approved by: https://github.com/zou3519 ghstack dependencies: #136033, #140604	2024-12-20 12:02:27 +00:00
Guilherme Leobas	1c817fe671	Set `enable_trace_contextlib_contextmanager` flag to True (#140604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604 Approved by: https://github.com/zou3519 ghstack dependencies: #136033	2024-12-20 12:02:27 +00:00
Guilherme Leobas	673cc88fd6	Add support for `contextmanager` in Dynamo (#136033 ) Fixes #130559 * Intro This PR adds support for `@contextmanager` in Dynamo. We chose to limit the scope of this work to only `@contextmanager` and plan to handle generators fully in #141055 (still in draft). * Motivation Dynamo lacks support for generator functions. When it encounters one, it traces it as if it were a regular function. This is problematic because it can lead to incorrect behavior. To illustrate, consider the test case below: ```python import torch import contextlib @contextlib.contextmanager def set_default_dtype(dtype): old_dtype = torch.get_default_dtype() try: torch.set_default_dtype(dtype) yield finally: torch.set_default_dtype(old_dtype) @torch.compile(backend="eager", fullgraph=True) def fn(): with set_default_dtype(torch.float64): x = torch.tensor([3.0, 3.0 + 5.0j]) return x ``` Before this work, Dynamo would not stop at the `yield`, and the graph produced would contain both calls to `set_default_dtype` executed one after the other. This is incorrect because the context manager should execute code before and after the `yield`. * List of changes `YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE` behaves differently in a generator function. Unlike regular functions, where `RETURN_VALUE` indicates the final result, in generators it signifies that the generator is exhausted and implicitly raises `StopIteration`. A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable` was introduced to handle `@contextmanager`. This variable tracker acts not just as a wrapper for the original function but also maintains an internal `tx` (InstructionTranslator) object to suspend and return control flow to the parent tracer when a `yield` is encountered. * Corner cases Returning a context manager from a compiled function is not supported. 
This would require PyTorch to synchronize the generator state between Dynamo and the interpreter. Any attempt to return it will result in an `IncorrectUsage` exception. Graph breaks require special handling as well. In the event of a graph break, the frame associated with the context manager is skipped, and the context manager runs in eager mode. * This PR is breaking my code There is a configuration flag (`enable_trace_contextlib`) that can be set to `False` to disable tracing context managers. If this still causes crashes, please revert this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033 Approved by: https://github.com/zou3519	2024-12-20 12:02:20 +00:00
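The generator semantics described in 673cc88fd6 can be sketched in plain Python, without Dynamo. This is only an illustration of why a tracer has to suspend at `yield`: code before the `yield` runs on entry, code after it runs on exit, and the generator then signals exhaustion with `StopIteration`. The names `tracked`, `run_with_block`, and `drive_manually` are hypothetical and are not part of PyTorch.

```python
import contextlib


@contextlib.contextmanager
def tracked(log, name):
    # Everything before the yield runs when the with-block is entered.
    log.append(f"enter {name}")
    try:
        yield name
    finally:
        # Everything after the yield runs when the with-block exits.
        log.append(f"exit {name}")


def run_with_block():
    log = []
    with tracked(log, "ctx") as value:
        log.append(f"body {value}")
    return log


def drive_manually():
    # Drive a bare generator by hand to show the two suspension points.
    log = []

    def gen():
        log.append("before yield")
        yield
        log.append("after yield")

    g = gen()
    next(g)  # runs up to the yield, then suspends
    log.append("body")
    try:
        next(g)  # resumes after the yield; returning raises StopIteration
    except StopIteration:
        log.append("exhausted")
    return log


if __name__ == "__main__":
    print(run_with_block())
    print(drive_manually())
```

If a tracer inlined `gen` as an ordinary function, "before yield" and "after yield" would run back to back with nothing in between, which is the incorrect behavior the commit message describes.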
Jason Ansel	04b26ee1e8	Fix false positive from f-strings in set_linter (#143628 ) This linter was going crazy in python 3.12, example: ```py $ python3 tools/linter/adapters/set_linter.py torch/_inductor/runtime/triton_heuristics.py torch/_inductor/runtime/triton_heuristics.py:192:25: Builtin `set` is deprecated 190 \| args_str += ", ".join(call_args) 191 \| for k, v in call_kwargs.items(): 192 \| args_str += f", {k}={v}" ^ 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) torch/_inductor/runtime/triton_heuristics.py:192:27: Builtin `set` is deprecated 190 \| args_str += ", ".join(call_args) 191 \| for k, v in call_kwargs.items(): 192 \| args_str += f", {k}={v}" ^ 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) torch/_inductor/runtime/triton_heuristics.py:192:29: Builtin `set` is deprecated 190 \| args_str += ", ".join(call_args) 191 \| for k, v in call_kwargs.items(): 192 \| args_str += f", {k}={v}" ^ 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) torch/_inductor/runtime/triton_heuristics.py:192:31: Builtin `set` is deprecated 190 \| args_str += ", ".join(call_args) 191 \| for k, v in call_kwargs.items(): 192 \| args_str += f", {k}={v}" ^ 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) torch/_inductor/runtime/triton_heuristics.py:195:17: Builtin `set` is deprecated 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: ^ 196 \| f.write(f"{kernel_name} \| {args_str}\n") 197 \| torch/_inductor/runtime/triton_heuristics.py:195:26: Builtin `set` is deprecated 193 \| 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: ^ 196 \| f.write(f"{kernel_name} \| {args_str}\n") 197 \| torch/_inductor/runtime/triton_heuristics.py:196:19: Builtin `set` is deprecated 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: 196 \| f.write(f"{kernel_name} \| {args_str}\n") ^ 197 \| 198 \| 
torch/_inductor/runtime/triton_heuristics.py:196:31: Builtin `set` is deprecated 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: 196 \| f.write(f"{kernel_name} \| {args_str}\n") ^ 197 \| 198 \| torch/_inductor/runtime/triton_heuristics.py:196:35: Builtin `set` is deprecated 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: 196 \| f.write(f"{kernel_name} \| {args_str}\n") ^ 197 \| 198 \| torch/_inductor/runtime/triton_heuristics.py:196:44: Builtin `set` is deprecated 194 \| abs_path = os.path.abspath(sys.argv[0]) 195 \| with open(f"{abs_path}.launch_params", "a") as f: 196 \| f.write(f"{kernel_name} \| {args_str}\n") ^ 197 \| 198 \| torch/_inductor/runtime/triton_heuristics.py:729:26: Builtin `set` is deprecated 727 \| exec( 728 \| f""" 729 \| def launcher({', '.join(def_args)}, grid, stream): ^ 730 \| if callable(grid): 731 \| grid_0, grid_1, grid_2 = grid(grid_meta) torch/_inductor/runtime/triton_heuristics.py:729:46: Builtin `set` is deprecated 727 \| exec( 728 \| f""" 729 \| def launcher({', '.join(def_args)}, grid, stream): ^ 730 \| if callable(grid): 731 \| grid_0, grid_1, grid_2 = grid(grid_meta) torch/_inductor/runtime/triton_heuristics.py:735:24: Builtin `set` is deprecated 733 \| grid_0, grid_1, grid_2 = grid 734 \| 735 \| args = {', '.join(call_args)}, ^ 736 \| launch_args = get_launch_args( 737 \| grid, grid_0, grid_1, grid_2, stream, function, torch/_inductor/runtime/triton_heuristics.py:735:45: Builtin `set` is deprecated 733 \| grid_0, grid_1, grid_2 = grid 734 \| 735 \| args = {', '.join(call_args)}, ^ 736 \| launch_args = get_launch_args( 737 \| grid, grid_0, grid_1, grid_2, stream, function, torch/_inductor/runtime/triton_heuristics.py:1144:20: Builtin `set` is deprecated 1142 \| cur_file = inspect.stack()[1].filename 1143 \| summary_str = ( 1144 \| f"SUMMARY ({cur_file})\n" ^ 1145 \| f"{overall_time:.2f}ms \t {overall_gb:.2f} GB\t 
{overall_gb / (overall_time / 1e3):.2f}GB/s" 1146 \| ) torch/_inductor/runtime/triton_heuristics.py:1144:29: Builtin `set` is deprecated 1142 \| cur_file = inspect.stack()[1].filename 1143 \| summary_str = ( 1144 \| f"SUMMARY ({cur_file})\n" ^ 1145 \| f"{overall_time:.2f}ms \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s" 1146 \| ) torch/_inductor/runtime/triton_heuristics.py:1162:61: Builtin `set` is deprecated 1160 \| ) 1161 \| file.write("====================\n") 1162 \| file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n") ^ 1163 \| for ms, num_gb, gb_per_s, kernel_name in sorted_calls: 1164 \| # also display the runtime percentage for each kernel torch/_inductor/runtime/triton_heuristics.py:1162:70: Builtin `set` is deprecated 1160 \| ) 1161 \| file.write("====================\n") 1162 \| file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n") ^ 1163 \| for ms, num_gb, gb_per_s, kernel_name in sorted_calls: 1164 \| # also display the runtime percentage for each kernel torch/_inductor/runtime/triton_heuristics.py:1166:36: Builtin `set` is deprecated 1164 \| # also display the runtime percentage for each kernel 1165 \| percentage = f"{ms / overall_time * 100:.2f}%" 1166 \| suffix = f" \t {percentage} \t {kernel_name}" ^ 1167 \| bw_info_str = create_bandwidth_info_str( 1168 \| ms, torch/_inductor/runtime/triton_heuristics.py:1166:47: Builtin `set` is deprecated 1164 \| # also display the runtime percentage for each kernel 1165 \| percentage = f"{ms / overall_time * 100:.2f}%" 1166 \| suffix = f" \t {percentage} \t {kernel_name}" ^ 1167 \| bw_info_str = create_bandwidth_info_str( 1168 \| ms, torch/_inductor/runtime/triton_heuristics.py:1166:52: Builtin `set` is deprecated 1164 \| # also display the runtime percentage for each kernel 1165 \| percentage = f"{ms / overall_time * 100:.2f}%" 1166 \| suffix = f" \t {percentage} \t {kernel_name}" ^ 1167 \| bw_info_str = create_bandwidth_info_str( 1168 \| ms, 
torch/_inductor/runtime/triton_heuristics.py:1166:64: Builtin `set` is deprecated 1164 \| # also display the runtime percentage for each kernel 1165 \| percentage = f"{ms / overall_time * 100:.2f}%" 1166 \| suffix = f" \t {percentage} \t {kernel_name}" ^ 1167 \| bw_info_str = create_bandwidth_info_str( 1168 \| ms, torch/_inductor/runtime/triton_heuristics.py:1175:30: Builtin `set` is deprecated 1173 \| ) 1174 \| file.write(bw_info_str + "\n") 1175 \| file.write(f"{summary_str}\n\n") ^ 1176 \| except Exception as e: 1177 \| log.warning( torch/_inductor/runtime/triton_heuristics.py:1175:42: Builtin `set` is deprecated 1173 \| ) 1174 \| file.write(bw_info_str + "\n") 1175 \| file.write(f"{summary_str}\n\n") ^ 1176 \| except Exception as e: 1177 \| log.warning( torch/_inductor/runtime/triton_heuristics.py:1205:29: Builtin `set` is deprecated 1203 \| else: 1204 \| possible_names = _find_names(self) 1205 \| kernel_name = f"{max(possible_names, key=len)}" ^ 1206 \| if not re.match(self.regex_filter, kernel_name): 1207 \| return torch/_inductor/runtime/triton_heuristics.py:1205:58: Builtin `set` is deprecated 1203 \| else: 1204 \| possible_names = _find_names(self) 1205 \| kernel_name = f"{max(possible_names, key=len)}" ^ 1206 \| if not re.match(self.regex_filter, kernel_name): 1207 \| return torch/_inductor/runtime/triton_heuristics.py:1241:60: Builtin `set` is deprecated 1239 \| "%s", 1240 \| create_bandwidth_info_str( 1241 \| ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}" ^ 1242 \| ), 1243 \| ) torch/_inductor/runtime/triton_heuristics.py:1241:72: Builtin `set` is deprecated 1239 \| "%s", 1240 \| create_bandwidth_info_str( 1241 \| ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}" ^ 1242 \| ), 1243 \| ) torch/_inductor/runtime/triton_heuristics.py:1256:15: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() 
torch/_inductor/runtime/triton_heuristics.py:1256:42: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() torch/_inductor/runtime/triton_heuristics.py:1256:44: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() torch/_inductor/runtime/triton_heuristics.py:1256:58: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() torch/_inductor/runtime/triton_heuristics.py:1256:60: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() torch/_inductor/runtime/triton_heuristics.py:1256:75: Builtin `set` is deprecated 1254 \| for cfg in configs: 1255 \| hasher.update( 1256 \| f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode() ^ 1257 \| ) 1258 \| return hasher.hexdigest() torch/_inductor/runtime/triton_heuristics.py:1377:23: Builtin `set` is deprecated 1375 \| if numel is None: 1376 \| continue 1377 \| block = cfg[f"{label}BLOCK"] ^ 1378 \| if numel == 1: 1379 \| assert block == 1, ( torch/_inductor/runtime/triton_heuristics.py:1377:29: Builtin `set` is deprecated 1375 \| if numel is None: 1376 \| continue 1377 \| block = cfg[f"{label}BLOCK"] ^ 1378 \| if numel == 1: 1379 \| assert block == 1, ( torch/_inductor/runtime/triton_heuristics.py:1381:24: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} 
(cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:38: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:46: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:52: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:58: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:64: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:71: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." 
^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:77: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:84: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1381:88: Builtin `set` is deprecated 1379 \| assert block == 1, ( 1380 \| f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1" 1381 \| f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})." ^ 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] torch/_inductor/runtime/triton_heuristics.py:1384:52: Builtin `set` is deprecated 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] 1384 \| max_block_str = f'config.triton.max_block["{label}"]' ^ 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" torch/_inductor/runtime/triton_heuristics.py:1384:58: Builtin `set` is deprecated 1382 \| ) 1383 \| max_block = TRITON_MAX_BLOCK[label] 1384 \| max_block_str = f'config.triton.max_block["{label}"]' ^ 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" torch/_inductor/runtime/triton_heuristics.py:1386:45: Builtin `set` is deprecated 1384 \| max_block_str = f'config.triton.max_block["{label}"]' 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" ^ 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 
1388 \| ) torch/_inductor/runtime/triton_heuristics.py:1386:51: Builtin `set` is deprecated 1384 \| max_block_str = f'config.triton.max_block["{label}"]' 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" ^ 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 1388 \| ) torch/_inductor/runtime/triton_heuristics.py:1386:66: Builtin `set` is deprecated 1384 \| max_block_str = f'config.triton.max_block["{label}"]' 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" ^ 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 1388 \| ) torch/_inductor/runtime/triton_heuristics.py:1386:80: Builtin `set` is deprecated 1384 \| max_block_str = f'config.triton.max_block["{label}"]' 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" ^ 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 1388 \| ) torch/_inductor/runtime/triton_heuristics.py:1387:20: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:26: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:33: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 
^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:39: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:45: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:59: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:61: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:71: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:78: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." 
^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1387:82: Builtin `set` is deprecated 1385 \| assert max_block % block == 0, ( 1386 \| f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}" 1387 \| f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})." ^ 1388 \| ) 1389 \| torch/_inductor/runtime/triton_heuristics.py:1402:19: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." ^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1402:23: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." ^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1402:46: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." ^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1402:56: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." ^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1402:67: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." ^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1402:71: Builtin `set` is deprecated 1400 \| assert ( 1401 \| val <= max_block 1402 \| ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}." 
^ 1403 \| 1404 \| torch/_inductor/runtime/triton_heuristics.py:1551:21: Builtin `set` is deprecated 1549 \| rnumels = {} 1550 \| for idx in range(num_reduction_dims - 1, -1, -1): 1551 \| prefix = f"r{idx}_" ^ 1552 \| max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()]) 1553 \| dim = min(max_size, remaining) torch/_inductor/runtime/triton_heuristics.py:1551:25: Builtin `set` is deprecated 1549 \| rnumels = {} 1550 \| for idx in range(num_reduction_dims - 1, -1, -1): 1551 \| prefix = f"r{idx}_" ^ 1552 \| max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()]) 1553 \| dim = min(max_size, remaining) torch/_inductor/runtime/triton_heuristics.py:1556:34: Builtin `set` is deprecated 1554 \| assert ( 1555 \| remaining % dim == 0 1556 \| ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'" ^ 1557 \| rnumels[prefix] = dim 1558 \| remaining //= dim torch/_inductor/runtime/triton_heuristics.py:1556:38: Builtin `set` is deprecated 1554 \| assert ( 1555 \| remaining % dim == 0 1556 \| ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'" ^ 1557 \| rnumels[prefix] = dim 1558 \| remaining //= dim torch/_inductor/runtime/triton_heuristics.py:1556:67: Builtin `set` is deprecated 1554 \| assert ( 1555 \| remaining % dim == 0 1556 \| ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'" ^ 1557 \| rnumels[prefix] = dim 1558 \| remaining //= dim torch/_inductor/runtime/triton_heuristics.py:1556:77: Builtin `set` is deprecated 1554 \| assert ( 1555 \| remaining % dim == 0 1556 \| ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'" ^ 1557 \| rnumels[prefix] = dim 1558 \| remaining //= dim torch/_inductor/runtime/triton_heuristics.py:1564:38: Builtin `set` is deprecated 1562 \| assert ( 1563 \| r == final_numel 1564 \| ), f"Expected ND reduction size ({rnumels}) to have {r} elements." 
^ 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels torch/_inductor/runtime/triton_heuristics.py:1564:46: Builtin `set` is deprecated 1562 \| assert ( 1563 \| r == final_numel 1564 \| ), f"Expected ND reduction size ({rnumels}) to have {r} elements." ^ 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels torch/_inductor/runtime/triton_heuristics.py:1564:57: Builtin `set` is deprecated 1562 \| assert ( 1563 \| r == final_numel 1564 \| ), f"Expected ND reduction size ({rnumels}) to have {r} elements." ^ 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels torch/_inductor/runtime/triton_heuristics.py:1564:59: Builtin `set` is deprecated 1562 \| assert ( 1563 \| r == final_numel 1564 \| ), f"Expected ND reduction size ({rnumels}) to have {r} elements." ^ 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels torch/_inductor/runtime/triton_heuristics.py:1567:37: Builtin `set` is deprecated 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels 1567 \| ), f"rnumels exceed size_hints. {rnumels} > {size_hints}" ^ 1568 \| 1569 \| return rnumels torch/_inductor/runtime/triton_heuristics.py:1567:45: Builtin `set` is deprecated 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels 1567 \| ), f"rnumels exceed size_hints. {rnumels} > {size_hints}" ^ 1568 \| 1569 \| return rnumels torch/_inductor/runtime/triton_heuristics.py:1567:49: Builtin `set` is deprecated 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels 1567 \| ), f"rnumels exceed size_hints. {rnumels} > {size_hints}" ^ 1568 \| 1569 \| return rnumels torch/_inductor/runtime/triton_heuristics.py:1567:60: Builtin `set` is deprecated 1565 \| assert all( 1566 \| rnumels[prefix] <= size_hints[prefix] for prefix in rnumels 1567 \| ), f"rnumels exceed size_hints. 
{rnumels} > {size_hints}" ^ 1568 \| 1569 \| return rnumels torch/_inductor/runtime/triton_heuristics.py:1746:49: Builtin `set` is deprecated 1744 \| 1745 \| if not configs: 1746 \| raise NotImplementedError(f"size_hints: {size_hints}") ^ 1747 \| return cached_autotune( 1748 \| size_hints, torch/_inductor/runtime/triton_heuristics.py:1746:60: Builtin `set` is deprecated 1744 \| 1745 \| if not configs: 1746 \| raise NotImplementedError(f"size_hints: {size_hints}") ^ 1747 \| return cached_autotune( 1748 \| size_hints, torch/_inductor/runtime/triton_heuristics.py:1928:32: Builtin `set` is deprecated 1926 \| for prefix in size_hints: 1927 \| if prefix_is_reduction(prefix): 1928 \| c.kwargs.pop(f"{prefix.upper()}BLOCK") ^ 1929 \| 1930 \| if disable_pointwise_autotuning(inductor_meta): torch/_inductor/runtime/triton_heuristics.py:1928:47: Builtin `set` is deprecated 1926 \| for prefix in size_hints: 1927 \| if prefix_is_reduction(prefix): 1928 \| c.kwargs.pop(f"{prefix.upper()}BLOCK") ^ 1929 \| 1930 \| if disable_pointwise_autotuning(inductor_meta): torch/_inductor/runtime/triton_heuristics.py:1975:49: Builtin `set` is deprecated 1973 \| assert triton_meta is not None 1974 \| if len(size_hints) != 2: 1975 \| raise NotImplementedError(f"size_hints: {size_hints}") ^ 1976 \| 1977 \| configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta) torch/_inductor/runtime/triton_heuristics.py:1975:60: Builtin `set` is deprecated 1973 \| assert triton_meta is not None 1974 \| if len(size_hints) != 2: 1975 \| raise NotImplementedError(f"size_hints: {size_hints}") ^ 1976 \| 1977 \| configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta) torch/_inductor/runtime/triton_heuristics.py:2082:56: Builtin `set` is deprecated 2080 \| xnumel, ynumel, znumel = numels[2], numels[1], numels[0] 2081 \| else: 2082 \| raise AssertionError(f"invalid size for numels {len(numels)}") ^ 2083 \| 2084 \| def get_grid_dim(numel, block): 
torch/_inductor/runtime/triton_heuristics.py:2082:68: Builtin `set` is deprecated 2080 \| xnumel, ynumel, znumel = numels[2], numels[1], numels[0] 2081 \| else: 2082 \| raise AssertionError(f"invalid size for numels {len(numels)}") ^ 2083 \| 2084 \| def get_grid_dim(numel, block): torch/_inductor/runtime/triton_heuristics.py:2104:57: Builtin `set` is deprecated 2102 \| torch._check( 2103 \| y_grid <= max_y_grid, 2104 \| lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue", ^ 2105 \| ) 2106 \| torch/_inductor/runtime/triton_heuristics.py:2104:64: Builtin `set` is deprecated 2102 \| torch._check( 2103 \| y_grid <= max_y_grid, 2104 \| lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue", ^ 2105 \| ) 2106 \| torch/_inductor/runtime/triton_heuristics.py:2113:43: Builtin `set` is deprecated 2111 \| ) 2112 \| 2113 \| setattr(grid_fn, "grid_fn_str", f"grid{numels}") # noqa: B010 ^ 2114 \| 2115 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2113:50: Builtin `set` is deprecated 2111 \| ) 2112 \| 2113 \| setattr(grid_fn, "grid_fn_str", f"grid{numels}") # noqa: B010 ^ 2114 \| 2115 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2122:48: Builtin `set` is deprecated 2120 \| return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1) 2121 \| 2122 \| grid_fn_str = f"cooperative_reduction_grid({xnumel})" ^ 2123 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2124 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2122:55: Builtin `set` is deprecated 2120 \| return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1) 2121 \| 2122 \| grid_fn_str = f"cooperative_reduction_grid({xnumel})" ^ 2123 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2124 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2135:54: Builtin `set` is deprecated 2133 \| coop_grid = cooperative_reduction_grid(xnumel) 2134 \| 
normal_grid = grid(xnumel) 2135 \| grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})" ^ 2136 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2137 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2135:61: Builtin `set` is deprecated 2133 \| coop_grid = cooperative_reduction_grid(xnumel) 2134 \| normal_grid = grid(xnumel) 2135 \| grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})" ^ 2136 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2137 \| return grid_fn torch/_inductor/runtime/triton_heuristics.py:2145:37: Builtin `set` is deprecated 2143 \| return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1) 2144 \| 2145 \| grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})" ^ 2146 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2147 \| torch/_inductor/runtime/triton_heuristics.py:2145:44: Builtin `set` is deprecated 2143 \| return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1) 2144 \| 2145 \| grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})" ^ 2146 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2147 \| torch/_inductor/runtime/triton_heuristics.py:2145:47: Builtin `set` is deprecated 2143 \| return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1) 2144 \| 2145 \| grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})" ^ 2146 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2147 \| torch/_inductor/runtime/triton_heuristics.py:2145:54: Builtin `set` is deprecated 2143 \| return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1) 2144 \| 2145 \| grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})" ^ 2146 \| setattr(grid_fn, "grid_fn_str", grid_fn_str) # noqa: B010 2147 \| torch/_inductor/runtime/triton_heuristics.py:2173:42: Builtin `set` is deprecated 2171 \| assert ( 2172 \| min_blocks_d is None or min_blocks == min_blocks_d 2173 \| ), f"inconsistent min_blocks {min_blocks} vs x grid {numels[-1]}" ^ 2174 \| else: 2175 \| # sequential dispatch 
torch/_inductor/runtime/triton_heuristics.py:2173:53: Builtin `set` is deprecated 2171 \| assert ( 2172 \| min_blocks_d is None or min_blocks == min_blocks_d 2173 \| ), f"inconsistent min_blocks {min_blocks} vs x grid {numels[-1]}" ^ 2174 \| else: 2175 \| # sequential dispatch torch/_inductor/runtime/triton_heuristics.py:2173:66: Builtin `set` is deprecated 2171 \| assert ( 2172 \| min_blocks_d is None or min_blocks == min_blocks_d 2173 \| ), f"inconsistent min_blocks {min_blocks} vs x grid {numels[-1]}" ^ 2174 \| else: 2175 \| # sequential dispatch torch/_inductor/runtime/triton_heuristics.py:2173:77: Builtin `set` is deprecated 2171 \| assert ( 2172 \| min_blocks_d is None or min_blocks == min_blocks_d 2173 \| ), f"inconsistent min_blocks {min_blocks} vs x grid {numels[-1]}" ^ 2174 \| else: 2175 \| # sequential dispatch ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143628 Approved by: https://github.com/yanboliang, https://github.com/rec	2024-12-20 11:45:26 +00:00
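One plausible reason for the Python 3.12 false positives in 04b26ee1e8 is that, since PEP 701, the standard `tokenize` module emits the braces of f-string replacement fields as ordinary `OP` tokens, whereas earlier versions return the whole f-string as a single `STRING` token. The sketch below only demonstrates that tokenizer difference; it is not the logic of `set_linter.py`, and `brace_ops` is a hypothetical helper.

```python
import io
import sys
import tokenize


def brace_ops(source):
    """Return (position, text) for every `{` or `}` emitted as an OP token."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        (tok.start, tok.string)
        for tok in tokens
        if tok.type == tokenize.OP and tok.string in ("{", "}")
    ]


if __name__ == "__main__":
    fstring_line = 'args_str += f", {k}={v}"\n'
    set_line = "s = {1, 2}\n"
    print(sys.version_info[:2])
    # On 3.12+ the f-string braces show up as OP tokens; before that, none do.
    print(brace_ops(fstring_line))
    # A real set literal yields brace OP tokens on every version.
    print(brace_ops(set_line))
```

A check that treats every `{` token as the start of a set literal would therefore need to track whether it is currently inside an f-string (between `FSTRING_START` and `FSTRING_END`) when running on 3.12 or later.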
Xu Han	6733045a4a	export AOTI_TORCH_EXPORT on Windows. (#140030 ) Fixes #139954 reproduce UT: ```cmd pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu ``` Issue: <img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe"> After fixing: ![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a) Reland: 1. Declare export on Windows explicitly. 2. Support cpu, cuda and xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-12-20 11:42:09 +00:00
Michael Lazos	b539c61631	[Hierarchical Compile] Update NoneAsConstantBuffer to support graph d… (#143531 ) Fixes issues I hit while running graph deduplication with torch tune. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143531 Approved by: https://github.com/eellison	2024-12-20 09:23:12 +00:00
Pian Pawakapan	f9f82ca48f	[ts converter] use Dim.AUTO for ts -> export converter (#138273 ) Switches TS converter to use `Dim.AUTO` by default, exporting models with max dynamism. Adds runtime input tests to `test_converter.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138273 Approved by: https://github.com/avikchaudhuri	2024-12-20 07:48:24 +00:00
Michael Lazos	270ad513c8	[Dynamo] only import einops if version is lower than 0.7.0 (#142847 ) Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847 Approved by: https://github.com/zou3519	2024-12-20 07:46:49 +00:00
Avik Chaudhuri	29b586bbad	fix formatting in programming model doc (#143587 ) Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken. Differential Revision: D67458972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587 Approved by: https://github.com/yushangdi	2024-12-20 07:09:19 +00:00
Huy Do	fe0f20615c	[DynamoBench] Handle accuracy results in benchmark records (#143611 ) I discovered this issue when trying to search for the accuracy results in the database and couldn't find any. It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because benchmark values is a list of strings here while the expectation is that it's a list of numbers. ClickHouse doesn't support mixed types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves. So, the remaining option is to store this in the `extra_info` field. This field is a dictionary, so it can go there. ### Testing https://github.com/pytorch/pytorch/actions/runs/12421747715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611 Approved by: https://github.com/kit1980	2024-12-20 06:43:38 +00:00
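The workaround described in the commit above (keep `benchmark_values` numeric, move string results into `extra_info`) can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper names and the placement of `extra_info` inside the metric dictionary are assumptions; only the `metric`, `benchmark_values`, and `extra_info` keys come from the commit message.

```python
from numbers import Number


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def split_accuracy_values(record):
    """Return a copy of `record` whose `benchmark_values` holds only numbers.

    Hypothetical helper: non-numeric entries such as "pass_due_to_skip"
    are moved into an `extra_info` dictionary (assumed here to live
    inside the metric) so the values list stays homogeneous.
    """
    metric = dict(record["metric"])
    values = metric.get("benchmark_values", [])
    non_numeric = [v for v in values if not _is_number(v)]

    metric["benchmark_values"] = [v for v in values if _is_number(v)]
    extra_info = dict(metric.get("extra_info", {}))
    if non_numeric:
        extra_info["benchmark_values"] = non_numeric
    metric["extra_info"] = extra_info
    return {**record, "metric": metric}


record = {"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}}
cleaned = split_accuracy_values(record)
print(cleaned["metric"])
```

The input record is left unmodified, so the original JSON can still be written out unchanged if needed.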
Sam Ginzburg	132fcf4e0d	[user triton] Raise an exception when encountering nested @triton.autotune decorators or @triton.heuristics (#143519 ) We support running a single Autotuner for each Triton kernel. Currently, if there are multiple autotuning decorators, the subsequent ones will be silently ignored. Instead, we should raise an error here to avoid silent incorrectness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519 Approved by: https://github.com/aakhundov	2024-12-20 06:38:45 +00:00
PyTorch MergeBot	71479a9b9c	Revert "[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352 )" This reverts commit 429f4cd1408b11a7b0dd10634b46b3265dc31af1. Reverted https://github.com/pytorch/pytorch/pull/143352 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/143352#issuecomment-2556365140))	2024-12-20 06:21:31 +00:00
Jane Xu	4e29e4aa63	[BE] Add a test to ensure grads are never inplaced into accidentally (#143612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143612 Approved by: https://github.com/soulitzer	2024-12-20 06:15:08 +00:00
Xu Han	2daa666591	update kineto to XPU Windows fixed PR. [submodule kineto] (#143445 ) Include XPU Windows Fixed PR: https://github.com/pytorch/kineto/pull/1012 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445 Approved by: https://github.com/sraikund16	2024-12-20 05:57:30 +00:00
zeshengzong	217a4ddb04	Add range check embedding_bag on input index >= 0 of cuda device (#140791 ) Fixes #89362 Test Result Before ``` >>> import torch >>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda() >>> weight = torch.rand([2, 3], dtype=torch.float32).cuda() >>> print(torch.nn.functional.embedding_bag(input, weight)) tensor([[0., 0., 0.]], device='cuda:0') ``` After ```python >>> import torch >>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda() >>> weight = torch.rand([2, 3], dtype=torch.float32).cuda() >>> print(torch.nn.functional.embedding_bag(input, weight)) /home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed. /home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [1,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed. /home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [2,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed. 
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/zong/code/pytorch/torch/_tensor.py", line 568, in __repr__ return torch._tensor_str._str(self, tensor_contents=tensor_contents) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/_tensor_str.py", line 708, in _str return _str_intern(self, tensor_contents=tensor_contents) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/_tensor_str.py", line 625, in _str_intern tensor_str = _tensor_str(self, indent) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/_tensor_str.py", line 357, in _tensor_str formatter = _Formatter(get_summarized_data(self) if summarize else self) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/zong/code/pytorch/torch/_tensor_str.py", line 146, in __init__ tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ``` ```bash $ pytest test/nn/test_embedding.py ``` ![image](https://github.com/user-attachments/assets/6a5ec759-a3dc-4d51-9e5e-ec79c0aac526) ```bash $ lintrunner ``` ![image](https://github.com/user-attachments/assets/2ce4ac24-74fb-4181-9510-18b96a2c2acb) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140791 Approved by: https://github.com/eqy	2024-12-20 05:47:26 +00:00
bobrenjc93	9713a6eeca	remove allow-untyped-defs from torch/fx/experimental/refinement_types.py (#143602 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143602 Approved by: https://github.com/aorenste	2024-12-20 05:40:52 +00:00
bobrenjc93	78d294379a	remove allow-untyped-defs from torch/_lazy/config.py (#143603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143603 Approved by: https://github.com/aorenste	2024-12-20 05:34:19 +00:00
bobrenjc93	cb4e9888df	remove allow-untyped-defs from torch/ao/quantization/experimental/APoT_tensor.py (#143601 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143601 Approved by: https://github.com/aorenste	2024-12-20 05:26:09 +00:00
bobrenjc93	dd346dbeab	remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605 Approved by: https://github.com/aorenste	2024-12-20 05:25:01 +00:00
Michael Lazos	fd23cf5848	[Dynamo] check node class first for graph dedup (#143609 ) as title Pull Request resolved: https://github.com/pytorch/pytorch/pull/143609 Approved by: https://github.com/williamwen42	2024-12-20 04:09:46 +00:00
William Wen	1c2593f035	[dynamo] guard global autocast state (#143592 ) Fixes https://github.com/pytorch/pytorch/issues/112260. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592 Approved by: https://github.com/jansel	2024-12-20 03:30:54 +00:00
drisspg	d339f1506b	Add cutlass version guard in prep for upgrade (#143551 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143551 Approved by: https://github.com/eqy	2024-12-20 02:40:02 +00:00
Mayank Mishra	75661f2036	try root fix for FP8 tensor (#143248 ) Fixes #143194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143248 Approved by: https://github.com/fegin	2024-12-20 01:57:17 +00:00
PyTorch MergeBot	4462cc6375	Revert "[Inductor] inplace padding (#140249 )" This reverts commit 297ce776363cc4802fa74d210fced2b4128960d5. Reverted https://github.com/pytorch/pytorch/pull/140249 on behalf of https://github.com/huydhn due to This break an internal test https://fburl.com/test/ppl2we5l ([comment](https://github.com/pytorch/pytorch/pull/140249#issuecomment-2556079406))	2024-12-20 01:30:27 +00:00
bobrenjc93	e1b4635504	remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606 Approved by: https://github.com/aorenste	2024-12-20 01:26:51 +00:00
Jane Xu	a0cff096bc	Improve cond error messaging (#143595 ) Discovered by @drisspg and I trying out a simple toy example and being way too confused :') Pull Request resolved: https://github.com/pytorch/pytorch/pull/143595 Approved by: https://github.com/zou3519, https://github.com/ydwu4	2024-12-20 01:19:20 +00:00
Yanan Cao (PyTorch)	d547fae5b0	[Codemod][AddExplicitStrictExportArg] caffe2/torch/onnx/_internal/exporter (#143542 ) Reviewed By: avikchaudhuri Differential Revision: D67381244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143542 Approved by: https://github.com/ydwu4, https://github.com/titaiwangms	2024-12-20 00:54:52 +00:00
Sun, Jiayi	544de4008e	[Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759 ) Fix https://github.com/pytorch/pytorch/issues/141671. Summary: The performance regression of these two timm_models is caused by Conv/Linear + broadcast add fusion run into oneDNN ref path. This PR constrains the shape of other tensor for Conv/Linear + broadcast add fusion to fix this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-12-20 00:35:58 +00:00
PyTorch MergeBot	8136daff5a	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit 4b82251011f85f9d1395b451d61e976af844d9b1. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))	2024-12-19 23:33:17 +00:00
PyTorch MergeBot	145fd5bad0	Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847 )" This reverts commit a96387a481633389a6b5a5ac7b8406e9216f320e. Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/huydhn due to This has been reverted internally D67436053 ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2555942351))	2024-12-19 23:22:44 +00:00
Sun, Jiayi	d2b83aa122	add grad_output shape check for fractional_max_pool2d_backward (#141666 ) Fix https://github.com/pytorch/pytorch/issues/141102. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141666 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-19 22:47:02 +00:00
Evgeny Fiksman	2def1f6f74	[caffe2] Move vectorized templates into a separate file for box_cox operator (#143556 ) Summary: No functional changes in this diff, the code is moved into a separate file to be reused by avx512 version in the follow up diff. Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels Differential Revision: D67433115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556 Approved by: https://github.com/hl475	2024-12-19 22:02:23 +00:00
Bin Bao	429f4cd140	[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352 ) Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping AOTI generated .pt2 package file, user can manually build the generated model code in their local environment. Differential Revision: [D67458526](https://our.internmc.facebook.com/intern/diff/D67458526) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143352 Approved by: https://github.com/malfet	2024-12-19 22:01:05 +00:00
PyTorch MergeBot	e9bd74d763	Revert "[export] don't decompose custom triton op when exporting (#142426 )" This reverts commit 10b9c5944e8d6ff0685e1ef25277a1d3c4c9c5aa. Reverted https://github.com/pytorch/pytorch/pull/142426 on behalf of https://github.com/huydhn due to This fails one internal MTIA test, checking with the author that we need to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/142426#issuecomment-2555793496))	2024-12-19 21:21:38 +00:00
Joel Schlosser	fc03c62c56	Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (#142062 ) Related: #125914 (specifically see [comment](https://github.com/pytorch/pytorch/issues/125914#issuecomment-2513044125)) This PR addresses two broken things involving the usage of unbacked SymInts for calls to `slice()` with data-dependent bounds. These issues are encountered in practice for `narrow()` operating on the batch dim with an NJT input, but apply to other subclasses as well. The test in this PR uses a purpose-built subclass. There are two different issues here, depending on whether `torch.compile()` is called with `dynamic=True`. In practice, these only occur when the unbacked SymInts are created within the `__torch_dispatch__` implementation of a subclass, because the unbacked symbols are considered "freshly created" when the output subclass instance is handled in Dynamo. Error 1 (dynamic=False): ``` LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(-Min(22, Max(0, u0)) + Min(22, Max(u0 + u1, Max(0, u0))), 0) (unhinted: Eq(-Min(s0, Max(0, u0)) + Min(s0, Max(u0 + u1, Max(0, u0))), 0)). (Size-like symbols: u1, u0) ``` The expression comes from the use of `clamp()` logic for `SliceView` in Inductor: `41e59754b4/torch/_inductor/ir.py (L3014)` If the (start, end) bounds for the `slice()` are statically known to be in range for the given dim (e.g. provided via `torch._check()` calls), we can avoid this `clamp()` logic and the error. This PR implements this fix. 
Error 2 (dynamic=True): ``` torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {u0} not in returned outputs NestedTensor(size=(2, s16, s1), offsets=FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int64), grad_fn=<NarrowBackwardAutogradNestedTensor0 object at 0x7f1f8603cfd0>, contiguous=True) ((s1*s16, s1, 1), s1*u0) ``` The storage offset of the values component of the returned NJT is `s1*u0` where `s1` is known to be an integer. This PR expands the special logic handling the `constant * u0` case to handle SymInts as well: `314e08eb52/torch/fx/experimental/symbolic_shapes.py (L1013-L1031)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142062 Approved by: https://github.com/ezyang ghstack dependencies: #143526	2024-12-19 21:08:04 +00:00
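To make the Error 1 expression in the commit above easier to read, here is a small sketch of the clamping it encodes, written with plain integers instead of symbols. It follows only the shape of the printed expression (`Min(size, Max(0, start))` and `Min(size, Max(end, Max(0, start)))`); the function name is hypothetical and this is not Inductor's implementation.

```python
def clamped_slice_bounds(start, end, size):
    """Clamp (start, end) into [0, size], mirroring the Min/Max expression.

    Simplified illustration: negative indices are clamped to 0 rather
    than wrapped, which is all the printed expression shows.
    """
    lower = max(0, start)
    new_start = min(size, lower)
    new_end = min(size, max(end, lower))
    return new_start, new_end


# Bounds already in range: clamping is a no-op, so the length is simply
# end - start and no data-dependent comparison is needed.
print(clamped_slice_bounds(3, 10, 22))

# Out-of-range bounds are pulled back into [0, size].
print(clamped_slice_bounds(-5, 30, 22))
```

With concrete integers the `min`/`max` calls are harmless. With unbacked symbols such as `u0` and `u1`, the compiler cannot decide them, which is why knowing the bounds are in range (for example via `torch._check()`, as the message says) lets the clamp be skipped.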
emmettbicker	0b2c47962c	Add support for differentiable LR in SGD + test v2.0 (#143510 ) Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end so this is the second PR on that subject. See #143122 for the development up until this point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510 Approved by: https://github.com/janeyx99	2024-12-19 21:04:44 +00:00
Ryan Guo	629de4da60	[dynamo] Add a lint rule to restrict what 3P library one can import (#143312 ) As title, this patch prevents developers from importing third party libraries to patch things in Dynamo, unless there's no other easy workaround (in which case one would add the library to the allowlist in `import_linter.py`, as instructed by the lint error). For instance, if we remove `einops` from the allowlist, we'd get this ```verbatim >>> Lint for torch/_dynamo/decorators.py: Error (IMPORT) Disallowed import importing from einops is not allowed, if you believe there's a valid reason, please add it to import_linter.py 608 \|# Note: this carefully avoids eagerly import einops. 609 \|# TODO: we should delete this whole _allow_in_graph_einops logic by approximately 2024 Q2 610 \|def _allow_in_graph_einops(): >>> 611 \| import einops 612 \| 613 \| try: 614 \| # requires einops > 0.6.1, torch >= 2.0 Error (IMPORT) Disallowed import importing from einops is not allowed, if you believe there's a valid reason, please add it to import_linter.py 612 \| 613 \| try: 614 \| # requires einops > 0.6.1, torch >= 2.0 >>> 615 \| from einops._torch_specific import ( # type: ignore[attr-defined] # noqa: F401 616 \| _ops_were_registered_in_torchdynamo, 617 \| ) 618 \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143312 Approved by: https://github.com/zou3519	2024-12-19 20:59:16 +00:00
bobrenjc93	8e78345d69	remove allow-untyped-defs from distributed/tensor/experimental/__init__.py (#143583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143583 Approved by: https://github.com/awgu	2024-12-19 20:25:28 +00:00
Thomas Bohnstingl	0a7dba4978	[cond] Change Autograd for cond (#142518 ) Instead of returning None for unused variables, a tensor with all-zeros is returned. Fixes [141301](https://github.com/pytorch/pytorch/issues/141301) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142518 Approved by: https://github.com/ydwu4	2024-12-19 20:09:42 +00:00
bobrenjc93	8850a7b62c	add some logging for tensorify (#143391 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143391 Approved by: https://github.com/jamesjwu	2024-12-19 20:06:26 +00:00
bobrenjc93	25172dc075	remove allow-untyped-defs from torch/ao/quantization/experimental/fake_quantize_function.py (#143582 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143582 Approved by: https://github.com/XuehaiPan, https://github.com/laithsakka	2024-12-19 20:06:22 +00:00
Nichols A. Romero	2d150ad29f	[ROCm] Fix unit test: matmul_offline_mgpu_tunableop (#143507 ) Fixes #141652 This PR contains: - Fix for `matmul_offline_mgpu_tunableop` - Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507 Approved by: https://github.com/jeffdaily	2024-12-19 19:48:20 +00:00
Jack Taylor	66172578f9	[ROCm] Guard triton backend call around cuda.is_available (#143570 ) To resolve: https://github.com/pytorch/test-infra/issues/6082 Calling into Triton's get_backend_options will initialise CUDA and break CPU-only environments that may have hip installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143570 Approved by: https://github.com/atalman, https://github.com/jeffdaily	2024-12-19 19:46:13 +00:00
Yanbo Liang	c46cfc245f	[Dynamo] Support dict_keys from nested dict object (#143557 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143557 Approved by: https://github.com/williamwen42 ghstack dependencies: #143374, #143547	2024-12-19 19:02:55 +00:00
Yanbo Liang	5fa287aa82	[Dynamo] Rename Dict{View/Keys/Values} to Dict{View/Keys/Values}Variable (#143547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143547 Approved by: https://github.com/williamwen42 ghstack dependencies: #143374	2024-12-19 19:02:55 +00:00
Nikhil Gupta	4b82251011	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-19 18:51:26 +00:00
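Step 1 of the commit above (symmetric 4-bit quantization, two 4-bit weights packed into one uint8) can be illustrated in plain Python. This is a sketch under assumptions, not the layout used by `torch.ops.aten._dyn_quant_pack_4bit_weight`: the function names are hypothetical, a single scale is used for the whole list instead of per-channel or per-group scales, and the low-nibble-first byte order is an arbitrary choice.

```python
def quantize_symmetric_4bit(weights):
    """Quantize floats to signed 4-bit integers in [-8, 7] with one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def pack_nibbles(quantized):
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    if len(quantized) % 2:
        raise ValueError("need an even number of 4-bit values")
    packed = bytearray()
    for low, high in zip(quantized[0::2], quantized[1::2]):
        # Masking with 0xF keeps the two's-complement 4-bit pattern.
        packed.append(((high & 0xF) << 4) | (low & 0xF))
    return bytes(packed)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


q, scale = quantize_symmetric_4bit([0.7, -0.35, 0.0, 0.1])
packed = pack_nibbles(q)
print(q, scale, packed.hex())
```

Dequantization is `value * scale` per element. A group-wise scheme would repeat the same steps per group of weights (the message requires the group size to be a multiple of 32).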
Joel Schlosser	c5ddf5dd90	Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (non-dynamic) (#143526 ) Lifted non-controversial (non-dynamic) fixes from #142062. See description there for context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143526 Approved by: https://github.com/ezyang	2024-12-19 18:46:36 +00:00
Laith Sakka	2a11472f46	update expected results (#143586 ) Update results based on a small regression added by `17b71e5d6a`; the maximum regression was 1.25%, for sum_floor_div. <img width="842" alt="Screenshot 2024-12-19 at 9 04 30 AM" src="https://github.com/user-attachments/assets/6ce913cd-110d-4837-af59-08fb6a0dd12d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143586 Approved by: https://github.com/bobrenjc93	2024-12-19 18:43:27 +00:00
William Wen	e1e83015d2	[dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143404 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143404 Approved by: https://github.com/colesbury, https://github.com/atalman	2024-12-19 18:10:01 +00:00
Joona Havukainen	33c27be017	Workaround for gather_out in MPS backend (#135543 ) Avoids an underlying issue in reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to get fixed eventually. Fixes https://github.com/pytorch/pytorch/issues/135240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-19 18:01:01 +00:00
Avik Chaudhuri	1433bad0e4	torch export programming model (#143546 ) Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546 Approved by: https://github.com/ydwu4	2024-12-19 16:56:13 +00:00
Tony-Y	61a835ec53	Corrected description of AMSGrad algorithm (#142351 ) Fixes #142323 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142351 Approved by: https://github.com/janeyx99	2024-12-19 16:24:19 +00:00
bobrenjc93	171e6a934f	Don't 1 specialize if stride is contiguous (#143365 ) Fixes: https://github.com/pytorch/pytorch/issues/142024 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143365 Approved by: https://github.com/ezyang	2024-12-19 15:22:47 +00:00
Animesh Jain	465f282a24	[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085 ) Reland - https://github.com/pytorch/pytorch/pull/139560 As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions. Unfortunately, there is no easy way to trigger this segfault, so I can't write a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085 Approved by: https://github.com/jansel Co-authored-by: William Wen <williamwen@meta.com>	2024-12-19 15:16:10 +00:00
blzheng	288aa87383	[Inductor][CPU] disable bernoulli_p decomposition (#143460 ) Fix https://github.com/pytorch/pytorch/issues/142853 `fallback_random=True` should cause RNG to match between compile/eager (by having compile fall back to eager for RNG ops), but the `bernoulli_p` decompose function is not fully consistent with the eager CPU implementation. We remove the decomp and keep the version for `fallback_random=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143460 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-12-19 11:21:35 +00:00
Edward Z. Yang	fd8b217fcd	Pass allow_rhs_unbacked to the stride test in metadata test too (#143040 ) Fixes https://github.com/pytorch/pytorch/issues/142410 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143040 Approved by: https://github.com/bobrenjc93	2024-12-19 09:37:50 +00:00
Joe Wang	451c233936	leaking c++ singleton specifically (#143509 ) Summary: fix forward for S477887 leaking c++ singleton specifically. When C++ shuts down, it tries to destruct the singleton and acquire the GIL; at this moment the Python runtime has already exited, causing undefined behavior. Leaking here specifically so that we won't try to destroy the singleton at the shutdown phase. Test Plan: n/a Differential Revision: D67400633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509 Approved by: https://github.com/c-p-i-o	2024-12-19 09:27:07 +00:00
Aaron Orenstein	da06d47bdb	dynamo tracing perf: slight improvement on __instancecheck__: 47.77 -> 47.62 (#143064 ) See #143056 for overall docs. This PR: Switch out an `isinstance()` for an `is` in the very hot `VariableTrackerMeta.__instancecheck__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143064 Approved by: https://github.com/ezyang, https://github.com/jansel	2024-12-19 09:19:35 +00:00
Aditya Tewari	a97c6a78a8	Upgrade submodule ideep for bf16f32 matmul changes (#143508 ) This change will enable this PR #140159 to pick proper kernels in bf16 mode for SDPA layer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143508 Approved by: https://github.com/yanbing-j, https://github.com/jgong5	2024-12-19 06:49:16 +00:00
Yanbo Liang	2ffdcab04c	[Dynamo] Add DictKeySetVariable to capture dict_keys passed outside of compiled region (#143374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143374 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-12-19 06:39:27 +00:00
Sun, Jiayi	fa1a4a91e9	add batch_size check for max_pool2d_backward (#141657 ) Fix https://github.com/pytorch/pytorch/issues/140923. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141657 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-19 06:01:41 +00:00
mori360	a7ba562ec8	[state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845 ) For distributed state dict api [migration](https://github.com/pytorch/torchtune/pull/2138), make the changes here: 1. `load_from_full_model_state_dict` at TorchTune calls `set_model_state_dict` with the options on whether to have cpu_offload. Add cpu_offload at _load_model_state_dict to process to cpu if config is True 2. Change the device check as lora_finetune might have 2 device types, accept that to be valid. 3. Some changes to optimize the memory performance: 3.1 use `.detach().clone()` instead of view directly 3.2 if local_state is not meta, copy `full_tensor[slices]` to `ret.to_local()` 4. add related unit tests Memory performance calling from TorchTune with llama2/7B_full: 1. cpu_offload = True <img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" /> 2. cpu_offload = False <img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845 Approved by: https://github.com/fegin	2024-12-19 05:06:41 +00:00
Sean Xiao	e4301aeaa5	[ODML] Make the ML feature provider thread safe (#143418 ) Summary: This PR is generated from a meta internal Diff, aiming to resolve a crash from a race condition on the dictionary. Test Plan: Build and run Print out the count/name/value of the dictionary and see if the values are get/set/removed correctly. Observe the print statement on app start within IG @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418 Approved by: https://github.com/shoumikhin	2024-12-19 04:47:56 +00:00
Valentine233	bf44d5bfb5	[Inductor] move custom pre pass (#143458 ) Fixes #143363. Move `joint_custom_pre` pass after `remove_noop_ops`/`constant_folding`, in order to get the same behavior as `pattern_matcher`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143458 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-12-19 04:41:20 +00:00
Michael Lazos	deb1da15cc	[foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454 ) Adds a foreach_map backed Adam to compiled optimizer tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-12-19 03:16:47 +00:00
Sergii Dymchenko	19d8bbafb2	Update release matrix for 2.6 (#143538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143538 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2024-12-19 02:02:04 +00:00
PyTorch MergeBot	14fe1f7190	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit d3ff2d42c28a2c187cbedfd8f60b84a4dfa2d6bf. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))	2024-12-19 01:05:11 +00:00
Eddie Yan	2c48af568a	[CUDA][64-bit indexing] Fix some existing problematic `int64_t _ = blockIdx.* * blockDim.*` code (#142010 ) `grep` didn't surface any `blockIdx.z * blockDim.z` cases ``` git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" \| xargs sed -i 's/int64_t \(.*\) = blockIdx.x \* blockDim.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g' git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" \| xargs sed -i 's/int64_t \(.*\) = threadIdx.x + blockIdx.x \* blockDim.x;.*/int64_t \1 = threadIdx.x + ((int64_t) blockIdx.x) * blockDim.x;/g' git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" \| xargs sed -i 's/int64_t \(.*\) = blockIdx.y \* blockDim.y + threadIdx.y;.*/int64_t \1 = ((int64_t) blockIdx.y) * blockDim.y + threadIdx.y;/g' git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" \| xargs sed -i 's/int64_t \(.*\) = threadIdx.y + blockIdx.y \* blockDim.y;.*/int64_t \1 = threadIdx.y + ((int64_t) blockIdx.y) * blockDim.y;/g' git grep -l "int64_t.*=.*blockDim.x \* blockIdx.x.*" \| xargs sed -i 's/int64_t \(.*\) = blockDim.x \* blockIdx.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g' ``` See also https://github.com/pytorch/pytorch/pull/141922/files#r1868262823 in #141999 141922 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142010 Approved by: https://github.com/ngimel	2024-12-19 00:55:11 +00:00
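The reason the cast in the commit above matters: `blockIdx`, `blockDim`, and `threadIdx` components are 32-bit unsigned integers in CUDA, so the expression is evaluated in 32 bits and wraps before it is widened to `int64_t`. The following Python sketch emulates both variants with explicit masking. The function names are made up for illustration, and this models only the integer arithmetic, not a kernel.

```python
UINT32_MASK = 0xFFFFFFFF


def index_32bit(block_idx, block_dim, thread_idx):
    """Emulate `blockIdx.x * blockDim.x + threadIdx.x` in unsigned 32-bit math.

    The whole expression wraps modulo 2**32 before being stored in a
    64-bit variable, so the widening comes too late.
    """
    return (block_idx * block_dim + thread_idx) & UINT32_MASK


def index_64bit(block_idx, block_dim, thread_idx):
    """Emulate `((int64_t) blockIdx.x) * blockDim.x + threadIdx.x`.

    Casting the first operand promotes the whole expression to 64 bits,
    so no wraparound occurs for realistic grid sizes.
    """
    return block_idx * block_dim + thread_idx


# 2**23 blocks of 512 threads: the product is exactly 2**32.
block_idx, block_dim, thread_idx = 2**23, 512, 5
print(index_32bit(block_idx, block_dim, thread_idx))
print(index_64bit(block_idx, block_dim, thread_idx))
```

In the 32-bit variant a thread in block 2**23 aliases the index of a thread in block 0, which is the kind of silent out-of-range or duplicate access the rewrite avoids.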
Michael Lazos	b4e0e3bfa3	Backout D66648013 (#143433 ) Summary: backing out https://www.internalfb.com/diff/D66648013 (see comments there for justification) I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression. Test Plan: This is a revert Differential Revision: D67357485 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433 Approved by: https://github.com/davidberard98	2024-12-19 00:53:49 +00:00
Michael Lazos	5c3996cab2	[Dynamo] topologically sort duplicated graph regions (#143523 ) Ensure regions are topologically sorted Pull Request resolved: https://github.com/pytorch/pytorch/pull/143523 Approved by: https://github.com/williamwen42	2024-12-19 00:43:48 +00:00
Nikita Shulga	55092e1ec5	[BE] Delete `install sccache` step from MacBB (#143512 ) To the best of my knowledge, this step never executed and there have been no MacOS binary builds running on trunk for a while Pull Request resolved: https://github.com/pytorch/pytorch/pull/143512 Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere ghstack dependencies: #143395, #143511	2024-12-19 00:41:28 +00:00
Nikita Shulga	5e172ea004	[BE] Get rid of `malfet/checkout@silent-checkout` (#143516 ) Instead use `actions/checkout@v4` with `show-progress: false`. It's more verbose than the quiet option, but our logs are long anyway... Partially addresses https://github.com/pytorch/pytorch/issues/143079 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143516 Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/huydhn	2024-12-19 00:36:36 +00:00
Richard Barnes	f9da639950	[codemod] Fix a few unused-variable issues in pytorch (#143517 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517 Approved by: https://github.com/mhorowitz	2024-12-19 00:18:08 +00:00
titaiwangms	b23f11c529	[ONNX] Automatically convert dynamic_axes to dynamic_shapes with torch.export.Dim.AUTO (#143158 ) With https://github.com/pytorch/pytorch/pull/133620 introducing Dim.AUTO, we can now automatically convert dynamic_axes to dynamic_shapes without specifying min and max. However, export can still crash when the same specs are shared between inputs, and there is no guarantee that the axes will be dynamic (see PR description). ~~Therefore, a~~ follow-up PR should create a post-processing ONNX side pass to ~~enable the missed dynamic axes~~ rename the dynamic shapes (s0, s1, ...) to dynamic_axes (user setting names). This PR does: (1) Apply torch.export.Dim.AUTO to dynamic_axes when dynamic_shapes is not provided. (2) Convert args/kwargs to tuple inputs, which follows the generated dynamic_shapes format to avoid errors during torch.export.export. (3) Avoid KeyError in the _rename_dynamic_shapes_with_model_inputs function. (4) Add a real-world case of a HF model with kv_cache to test the ONNX exporter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143158 Approved by: https://github.com/xadupre, https://github.com/shubhambhokare1	2024-12-18 23:49:01 +00:00
Shangdi Yu	15a7a0c37e	Remove deprecated branch after capture_pre_autograd_graph fully migrate to training IR (#143228 ) Summary: as title #buildall Test Plan: CI Differential Revision: D67222286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143228 Approved by: https://github.com/andrewor14	2024-12-18 23:30:45 +00:00
Nikita Shulga	58627fb6bf	[BE] Integrate 5 line build script into template (#143511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143511 Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere ghstack dependencies: #143395	2024-12-18 23:27:09 +00:00
Michael Lazos	4eafbe5288	[Dynamo] Flatten slices during graph deduplication (#143522 ) I encountered this issue while debugging torchtune - overall we need to make sure to not miss nodes that are slice arguments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143522 Approved by: https://github.com/williamwen42	2024-12-18 23:12:34 +00:00
Ryan Guo	5380407af5	[dynamo] Properly model root frame globals during inlining (#143447 ) This patch updates `InliningInstructionTranslator.STORE_GLOBAL` to properly check whether `self.f_globals` is the same as root frame `f_globals`. See added comments for why this is important. Fixes #143425. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143447 Approved by: https://github.com/zou3519	2024-12-18 23:04:02 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
William Wen	d298bd840f	[dynamo] add two-point iter test (#143500 ) Implements the last checkbox for https://github.com/pytorch/pytorch/issues/112532. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143500 Approved by: https://github.com/StrongerXi	2024-12-18 22:55:46 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. 
API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
Huy Do	4717cd1ce9	Skip test_conv2d_linear_add_broadcast_shapes_cpu on fbcode (#143530 ) Summary: The test is added by D67376995 and it is failing on fbcode Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'` Differential Revision: D67413687 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530 Approved by: https://github.com/jansel	2024-12-18 22:08:08 +00:00
James	d4ed5941db	Fix floating point literals in IRPrinter (#142119 ) Fixes #114035 This is a recreation of #140002 with approval from its author. Original description: >When v is larger than 1e16, the output format is wrong. For example, when v is 1.2e17 the output is 1.2e17.f, which contains two '.' characters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-12-18 21:59:48 +00:00
Yidi Wu	10b9c5944e	[export] don't decompose custom triton op when exporting (#142426 ) For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable. #### The alternative: If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jited python functions and triton dtypes. This is undesirable because: - it can be tedious to maintain a layer that serializes the jited function (e.g. with a string) and dtypes. - changes to triton or the serialization logic for triton arguments can be BC breaking - the exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction. #### Future plans: After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file on the same machine where users call export, which does autotuning, removes the triton dependency, and serves the model with the Cubin. This guarantees that triton changes won't break BC. In the long term, we may export multiple cubins for the triton op directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426 Approved by: https://github.com/zou3519 ghstack dependencies: #142425	2024-12-18 21:36:28 +00:00
Yidi Wu	1e201422ed	[export] add is_exporting flag (#142425 ) We added an is_exporting flag, available as torch.compiler.is_exporting. This comes in handy when we need special logic at the user level and the system level (e.g. higher up in the stack). In order of increasing scope: - `_is_fx_tracing` is set to True when running under symbolic_trace or make_fx. - `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and sets _is_fx_tracing to True. - `is_compiling` is set to True when we're doing strict export, non-strict export, or torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425 Approved by: https://github.com/avikchaudhuri	2024-12-18 21:36:28 +00:00
Nichols A. Romero	894d47b91b	[ROCm] Fix unit test: matmul_offline_tunableop (#143322 ) Fixes #137936 The PR contains: * Fix for `matmul_offline_tunableop` * Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`) * Avoid the use of environment variables in `minimum_tuning_iteration_tunableop` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322 Approved by: https://github.com/jeffdaily	2024-12-18 20:14:44 +00:00
cyy	255a977494	[1/N] Avoid const_cast (#143169 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143169 Approved by: https://github.com/albanD	2024-12-18 19:48:01 +00:00
Nikita Shulga	f129bcb5a5	[BE] Refactor argument parsing into its own function (#143395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143395 Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere	2024-12-18 19:42:49 +00:00
Tom Ritchford	8d4926e30a	Fix unused variables in test/torch.py (#143399 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143399 Approved by: https://github.com/albanD	2024-12-18 17:57:24 +00:00
Sun, Jiayi	863e6e4567	Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670 ) Fix https://github.com/pytorch/pytorch/issues/141447. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-18 17:46:26 +00:00
Sun, Jiayi	b588a78ca3	add grad_output shape check for adaptive_max_pool2d_backward and adaptive_max_pool3d_backward (#141663 ) Fix https://github.com/pytorch/pytorch/issues/141099, https://github.com/pytorch/pytorch/issues/141100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141663 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-18 17:44:27 +00:00
Mark Saroufim	93e8e32708	Remove iOS folder (#143398 ) This folder is a tutorial that is not packaged in PyTorch; it is an example of how to use the now deprecated Lite Interpreter. People should be using ExecuTorch instead, and there's already good documentation on it all over our tutorials and main homepage. Testing to see what breaks in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398 Approved by: https://github.com/albanD	2024-12-18 17:25:52 +00:00
Joy Dong	ed9931e6ee	Add tests for non divisible inputs for flex decoding (#143214 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143214 Approved by: https://github.com/drisspg	2024-12-18 16:32:45 +00:00
Bin Bao	0e8013fc1c	[AOTI] Fix a typo in cpp_builder.py (#143351 ) Summary: passthough -> passthrough Pull Request resolved: https://github.com/pytorch/pytorch/pull/143351 Approved by: https://github.com/yushangdi, https://github.com/chenyang78 ghstack dependencies: #143350	2024-12-18 16:28:37 +00:00
Bin Bao	a2092665a9	[AOTI] Refactor path operations in AotCodeCompiler (#143350 ) Summary: Use safer pathlib operation instead of direct string manipulation; Update some path naming to make them more meaningful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143350 Approved by: https://github.com/yushangdi, https://github.com/chenyang78	2024-12-18 16:28:37 +00:00
Nikita Shulga	24a18d76c8	[MPS] Use metal shaders for all view ops (#143375 ) Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors. This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected. The generalized gather shader looks as follows ```metal kernel void gather_kernel_n(uint linear_index [[thread_position_in_grid]], constant void * src_ [[buffer(0)]], device void * dst_ [[buffer(1)]], constant uint32_t * size [[buffer(2)]], constant uint32_t * stride [[buffer(3)]], constant uint32_t & numel [[buffer(4)]], constant int32_t & ndim [[buffer(5)]]) {{ if (linear_index >= numel) return; constant {0} * src = (constant {0} *)src_; device {1} * dst = (device {1} *)dst_; uint64_t src_offs = 0; auto src_idx = linear_index; for(int dim = ndim - 1; dim >= 0; --dim) {{ src_offs += stride[dim] * (src_idx % size[dim]); src_idx /= size[dim]; }} dst[linear_index] = cast<{1}>(src[src_offs]); }} ``` Which, according to the following benchmark ```python from timeit import default_timer import torch import torch.utils.cpp_extension from torch.utils.benchmark import Measurement, Timer t = Timer( stmt=f"y.copy_(x);torch.mps.synchronize()", setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)", language="python", timer=default_timer ) print(t.blocked_autorange()) ``` Is almost twice as fast as the previous implementation (i.e.
on Mac Book M2 Pro it returns 2.9ms for MPS version vs 1.5ms for shader one On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below ```swift import Metal import MetalPerformanceShadersGraph func gatherComplexMPS(device: MTLDevice, inp_buf: MTLBuffer, idx_buf: MTLBuffer, out_buf: MTLBuffer, inp_elem: Int, upd_elem: Int) { let graph = MPSGraph() let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil) let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil) let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil) let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64) let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer, indicesPlaceholder: mpsIndicesBuffer ], targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer]) } func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer { guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let buf_data = buf.contents().assumingMemoryBound(to: T.self) for i in 0..<values.count { buf_data[i] = values[i] } return buf } guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } print("Using device 
\(device.name)") let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]) let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3]) guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4) ``` Fixes https://github.com/pytorch/pytorch/issues/143140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375 Approved by: https://github.com/albanD	2024-12-18 16:15:46 +00:00
FFFrog	f47aac6bc2	Make Context to be Device-agnostic Step by Step (3/N) (#137578 ) Detailed Descriptions: - Using unified Device-agnostic API to create new generator for accelerator. - Add deprecated info for GeneratorForPrivateuseone Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578 Approved by: https://github.com/cyyever, https://github.com/ezyang	2024-12-18 15:12:19 +00:00
albanD	80a42399bb	Various fix for memory leak in test autograd and dataloader (#143323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323 Approved by: https://github.com/andrewkho, https://github.com/soulitzer ghstack dependencies: #143225	2024-12-18 13:56:59 +00:00
bobrenjc93	84b91ce4a1	remove allow-untyped-defs for torch/_inductor/test_operators.py (#143436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143436 Approved by: https://github.com/aorenste	2024-12-18 12:54:25 +00:00
Shangdi Yu	d8ea4ce631	[reland] Kill capture_pre_autograd_graph API (#143426 ) Summary: Delete the following API: - capture_pre_autograd_graph() - capture_pre_autograd_graph_using_training_ir() - gm_using_training_ir() Update XLA pin to include https://github.com/pytorch/xla/pull/8398 There's no more call sites to `capture_pre_autograd_graph`. Except 1) two test cases in coreml, guarded by version guard, PR to remove: https://github.com/apple/coremltools/pull/2400 2) a few call sites guarded by version guard (< 2.5.0) Test Plan: CI Differential Revision: D67354440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143426 Approved by: https://github.com/gmagogsfm	2024-12-18 12:07:09 +00:00
Zizeng Meng	eb67dd3e2d	[3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149 ) Design Doc: https://fburl.com/gdoc/47zpuweb Prototyping: D66469341 In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot. In next diff, we will integrate the mtia backend with profiler python api Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149 Approved by: https://github.com/nautsimon	2024-12-18 11:58:23 +00:00
Tom Ritchford	993b2f0ee0	Fix unused variables in test/test_transformers.py (#143407 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407 Approved by: https://github.com/drisspg	2024-12-18 09:59:24 +00:00
bobrenjc93	8dd380803c	remove allow-untyped-defs for torch/_functorch/batch_norm_replacement.py (#143438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143438 Approved by: https://github.com/oulgen	2024-12-18 09:01:06 +00:00
bobrenjc93	75fe5a3ef7	remove allow-untyped-defs for torch/fx/experimental/debug.py (#143439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143439 Approved by: https://github.com/oulgen	2024-12-18 08:55:46 +00:00
bobrenjc93	03991798ca	remove allow-untyped-defs for torch/nn/parallel/__init__.py (#143437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143437 Approved by: https://github.com/oulgen	2024-12-18 08:50:37 +00:00
Aidyn-A	a99536480d	[ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955 ) Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to prematurely return NaN if it is known that at this value it will be NaN. According to my short test ```Python import torch device = "cuda" dtype = torch.float32 x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype) for n in range(1024): if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]: print(f"hermite_polynomial_h: all outputs are nans! n = {n}") break for n in range(1024): if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]: print(f"hermite_polynomial_he: all outputs are nans! n = {n}") break ``` The output values become NaNs at these orders: ``` hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32 hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32 hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64 hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64 ``` Surely, it makes sense to increase the limit as a safety margin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955 Approved by: https://github.com/malfet, https://github.com/eqy	2024-12-18 08:30:08 +00:00
Sheng Fu	2ea4b56ec8	Record min/max of integral tensor in ET (#143088 ) Summary: In et-replay, random data is used to run the operators. However, this does not work well for ops that use indices to access a tensor. For example, embedding ops use the indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issues. To fix this, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE"; if it is set, ET will capture the min/max value of the flattened integral tensor. Then in et_replay, the min/max is used to generate the random tensor within that range. This fixes the invalid memory access issue. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda Differential Revision: D66666931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088 Approved by: https://github.com/sanrise	2024-12-18 08:20:35 +00:00
Avik Chaudhuri	bceedeec2b	fix checking non-trivial input constraints (#143442 ) A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env. ``` ... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ... ``` I've also observed this bug internally. This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we 1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol 2. check directly if the placeholder node dim is concrete 3. otherwise defer to run time. Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442 Approved by: https://github.com/tugsbayasgalan	2024-12-18 07:29:08 +00:00
qiurc	90cc43f270	Support garbage collection after pt2 compilation (#143364 ) Summary: Support garbage collection after pt2 compilation. Add a jk to control the global rollout / rollback of this functionality. Add an env var to control an individual job's rollout. Test Plan: Test the model training job with / without these changes. Reviewers: @yuxihu @ezyang , @Yuzhen11 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364 Approved by: https://github.com/ezyang	2024-12-18 07:25:11 +00:00
Rachel Guo	9275091d6e	[provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055 ) Summary: This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code: `{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.` Example paste: P1695235000 verified on the test model. See "Test Plan": We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool: https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab) https://pxl.cl/66BzK Note: Currently only supports mapping for inductor's`TritonKernel` type. TODO for enhancing more support for `ExternKernel` and other inductor generated kernel type, etc. Test Plan: test_model_coverage.sh: ``` #!/bin/sh MODEL_ENTITY_ID=644688112 SNAPSHOT_ID=32 MODULE=merge # buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 \| tee output.txt ``` {F1973765026} ``` buck2 test 'fbcode//mode/opt' 
fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)' ``` ``` TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor ``` Differential Revision: D66967510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055 Approved by: https://github.com/chenyang78	2024-12-18 06:51:50 +00:00
Digant Desai	6829897682	Remove assert from partitioner.py (#143376 ) Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert. Tested locally with a failing ExecuTorch Arm test using ``` $ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376 Approved by: https://github.com/tarun292	2024-12-18 06:08:19 +00:00
Bert Maher	6715a8858a	Triton bump for 3.2 cherry-picks (device context) (#143409 ) Summary: * https://github.com/triton-lang/triton/pull/3731 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143409 Approved by: https://github.com/atalman	2024-12-18 05:17:29 +00:00
Shangdi Yu	c17a07ade3	Add float8 support in serde schema (#143343 ) Summary: Fix https://github.com/pytorch/pytorch/issues/141316 Bump up schema minor version. as title, add float8 support in serde schema Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_serialize_float8 ``` Differential Revision: D67307670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343 Approved by: https://github.com/yiming0416	2024-12-18 05:07:21 +00:00
emmettbicker	576789197a	Add support for CPU scalar in addcmul (#143264 ) Step required for performance in #143122 Adds support for CPU scalar for tensor_2 in addcmul. For example: ``` import torch a = torch.rand(2, 2, device="cuda") b = torch.tensor(1e-3) torch.add(a, b) torch.addcmul(a, a, b) # used to fail, now works ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-12-18 04:43:29 +00:00
Natalia Gimelshein	859be14c4e	fix a few int64_t index computations, fix complex128 scan that had too few threads (#143401 ) per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/143401 Approved by: https://github.com/eqy	2024-12-18 04:27:27 +00:00
Tom Ritchford	c947a7d38e	Fix unused Python variables in test/nn (#143396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396 Approved by: https://github.com/mikaylagawarecki	2024-12-18 03:30:54 +00:00
bobrenjc93	17a6d4b882	remove allow-untyped-defs for torch/_export/passes/remove_runtime_assertions.py (#143435 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143435 Approved by: https://github.com/oulgen	2024-12-18 03:05:20 +00:00
Nikita Shulga	a9de6a68f4	[CD] Test that all PyTorch wheels support OpenMP (#143394 ) Together with https://github.com/pytorch/pytorch/pull/143393 fixes https://github.com/pytorch/pytorch/issues/123225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143394 Approved by: https://github.com/atalman ghstack dependencies: #143393	2024-12-18 02:27:55 +00:00
atalman	2400db115c	Use Manylinux 2.28 for nightly build and cxx11-abi (#143423 ) As per: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581 Linux Builds: CPU, CUDA 11.8, CUDA 12.4 switched to Manylinux 2.28 and D_GLIBCXX_USE_CXX11_ABI=1 on the week of Dec 16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143423 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere	2024-12-18 02:02:58 +00:00
eellison	e890d67543	Use process pool for precompilation of triton templates (#142450 ) Perf results: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2003%20Dec%202024%2022%3A57%3A51%20GMT&stopTime=Tue%2C%2010%20Dec%202024%2022%3A57%3A51%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/740/head&lCommit=b925256c29ec43e1933e4ede94b16d1f404b595f&rBranch=gh/eellison/740/base&rCommit=a161d6362f7d9db773322d2ce2a3a70aabbecf4b Training: <img width="793" alt="image" src="https://github.com/user-attachments/assets/75f5bc0d-8005-4213-ae88-0b94fb187dfc" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/142450 Approved by: https://github.com/jansel	2024-12-18 01:48:04 +00:00
Sun, Jiayi	c06b5048ba	[Inductor] Fix _can_be_inplace function (#143279 ) Summary: Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`. Fix https://github.com/pytorch/pytorch/issues/143280. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2024-12-18 00:26:05 +00:00
Mikayla Gawarecki	6cd96f069b	Add warning to torch.jit.load (#143403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143403 Approved by: https://github.com/albanD ghstack dependencies: #143326	2024-12-18 00:17:41 +00:00
Mikayla Gawarecki	ac8342f881	Prevent torch.jit.load path in torch.load when weights_only=True (#143326 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326 Approved by: https://github.com/albanD	2024-12-18 00:17:41 +00:00
soulitzer	13a5c15ef5	Fix sample inputs leaked from subtest (#143415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143415 Approved by: https://github.com/jbschlosser ghstack dependencies: #143333	2024-12-18 00:15:18 +00:00
soulitzer	3f99682fbd	NJT linear_backward should not return inner tensor as-is (#143333 ) Fixes debug=1 use-count checks https://github.com/pytorch/pytorch/actions/runs/12187808902/job/34002323481#step:22:2521 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143333 Approved by: https://github.com/jbschlosser	2024-12-18 00:15:18 +00:00
Felix Su	feb4818bc9	[SJD] adding kill logic for current process when killing a worker (#141060 ) Summary: We have seen cases where some workers don't receive stop signals, meaning the watchdog isn't stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid. Something to note: there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay, since the watchdog loop will just attempt to kill the worker pid on the next iteration, but I wanted to point this out. Test Plan: experiment in next diff shows this works Differential Revision: D65837085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060 Approved by: https://github.com/gag1jain	2024-12-18 00:13:02 +00:00
Hyunho Yeo	efe21ee59d	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 ) Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage Test Plan: Passed the local unit test ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/8444249544807192 Reviewed By: yuhc, egienvalue Differential Revision: D67118173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347 Approved by: https://github.com/nautsimon	2024-12-17 23:37:03 +00:00
Aleksei Nikiforov	a040006da7	Force symlink creation when building python on s390x (#143195 ) Sometimes it exists already when building on s390x This change should fix docker image build on s390x. Example of error can be found here: https://github.com/pytorch/pytorch/actions/runs/12282230596/job/34365267303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143195 Approved by: https://github.com/ezyang	2024-12-17 23:01:47 +00:00
Nikita Shulga	2642bbc6dc	[CD] Run smoke tests on MacOS wheel (#143393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143393 Approved by: https://github.com/atalman, https://github.com/seemethere	2024-12-17 22:47:07 +00:00
Eli Uriegas	b247f87845	tools: Add a tool to build wheels for multiple python versions (#143361 ) Adds a tool to build bdist_wheels sequentially for multiple different python versions (if specified). The goal of this tool is to eventually be able to utilize this in our binary build runs to significantly reduce the amount of time we take to build packages by utilizing a local ccache from the first build. Tested locally using the following: ``` $ ccache -C # clear cache # -p could actually reference any python interpreter $ python tools/packaging/build_wheel.py \ -p /home/eliuriegas/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12 \ -p /home/eliuriegas/.local/share/uv/python/cpython-3.13.0-linux-x86_64-gnu/bin/python3.13 \ -d dist-multi/ ... 2024-12-17 10:48:11,365 - INFO - Build time (3.12.7): 571.440689s 2024-12-17 10:48:11,365 - INFO - Build time (3.13.0): 191.147503s ``` Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143361 Approved by: https://github.com/malfet, https://github.com/atalman	2024-12-17 21:56:06 +00:00
Tristan Rice	1e058a8f38	FileTimerClient: add retry logic on connect (#143318 ) Fixes #143188 The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side. Test plan: ``` pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318 Approved by: https://github.com/fegin	2024-12-17 21:48:30 +00:00
Manav Avlani	aabe285aaf	Add 2 more APIs to the exposed public torch python APIs (#143380 ) These two APIs are being used internally for some projects and need to be exposed as the build for this is done using OSS toolchain. `af8789c056` - this change hid most apis in torch python barring the ones explicitly specified breaking the build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143380 Approved by: https://github.com/suo	2024-12-17 21:16:51 +00:00
Chirag Pandya	0bdc173ab6	[fr] recognize all_reduce_barrier as a valid op (#143354 ) Summary: D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops. Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu ``` fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16 Traceback (most recent call last): File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code ``` Test Plan: Test manually. Differential Revision: D67305997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354 Approved by: https://github.com/wconstab	2024-12-17 21:09:18 +00:00
Michael Lazos	a96387a481	[Dynamo] only import einops if version is lower than 0.7.0 (#142847 ) Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847 Approved by: https://github.com/zou3519	2024-12-17 20:50:25 +00:00
Richard Barnes	9283c40ba8	[codemod] Decorate unused variables with `[[maybe_unused]]` (#143381 ) Summary: LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381 Approved by: https://github.com/malfet	2024-12-17 20:36:03 +00:00
bobrenjc93	7c25a55c65	clean up type nits on torch/jit/_ir_utils.py (#143371 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143371 Approved by: https://github.com/laithsakka	2024-12-17 20:28:07 +00:00
Catherine Lee	de4a555c82	Run inductor-rocm workflow on ciflow/inductor (#143205 ) The paths are almost the same as ciflow/inductor. The only differences I could spot were that ciflow/inductor also has `test/dynamo/` and `torch/csrc/dynamo/`. This is to prevent failures like https://github.com/pytorch/pytorch/actions/runs/12304985383/job/34345585535 which fails due to running on a fork, which cannot set the id token. The other option to prevent this is to stop the job from running when on a fork. If someone adds both labels, one will be cancelled because they have the same concurrency group Pull Request resolved: https://github.com/pytorch/pytorch/pull/143205 Approved by: https://github.com/huydhn	2024-12-17 20:09:48 +00:00
Joy Dong	b16f020edd	Add flex attention kernel parameter tuning options (#139639 ) 1. Add `num_warps` and `num_stages` to kernel parameters of `flex_attention`. This allows performance tuning when the default parameters of `flex_attention` are suboptimal, for example for `document_masks`. 2. Update how flex decoding splits are assigned to threadblocks. The first split of full blocks is assigned to the first threadblock, and the first split of partial blocks is assigned to the last threadblock. 3. Update `get_split_k` to assign 2 splits per SM before we have runtime workload balancing based on BlockMask. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139639 Approved by: https://github.com/drisspg	2024-12-17 19:31:40 +00:00
Catherine Lee	e3c53fb1bc	Increase sharding for debug build (#143327 ) It started timing out consistently and takes 3+ hours per shard. I assume it's just that we slowly increase tests over time, since I cannot find a dramatic jump recently Pull Request resolved: https://github.com/pytorch/pytorch/pull/143327 Approved by: https://github.com/wdvr, https://github.com/huydhn	2024-12-17 19:27:51 +00:00
Chong Gu	5b5d7016c8	Remove stable_partition for ARM AOTI Runtimes (#142394 ) Summary: This function call will cause OOM issues on ARM machines with multi-threaded predictors (reason behind this is still being investigated), we replace it with the standard partition instead. Differential Revision: D66904296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394 Approved by: https://github.com/frank-wei	2024-12-17 19:19:04 +00:00
Aaron Orenstein	e7704f41ca	Simplify _compute_symbolic_stride() (#138844 ) Rewrite _compute_symbolic_stride() to make it simpler and faster. The existing code involves several inner loops in an attempt to process the common case faster - but in reality this effort is actually slower than the simpler code. Testing: The initial version of this PR (which passed all tests) ran both the old algorithm and new algorithm and compared the results to make sure that results were substantially the same (they weren't the same simply because the algorithm allocates new dynamic symbols as part of it). I also measured the timing of both methods and from the cases I checked the simpler algorithm was generally about 30% faster (which was usually the "fast path" of the old algorithm). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138844 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #138843	2024-12-17 19:16:53 +00:00
Aaron Orenstein	63cb5e4ade	Move inner loop of _create_symbolic_sizes_strides_storage_offset into its own method (#138843 ) Making the next PR easier to review: - move the inner loop of _create_symbolic_sizes_strides_storage_offset() into a separate function - fix lintrunner lints Pull Request resolved: https://github.com/pytorch/pytorch/pull/138843 Approved by: https://github.com/ezyang	2024-12-17 19:16:53 +00:00
eellison	f3ec59d44c	Fix non-dense inductor effn attn bias (#141905 ) Didn't have any luck making local repro, partially because https://github.com/pytorch/pytorch/issues/141888 which will be fixed when we update to triton 3.2. but verified locally it fixes https://github.com/pytorch/pytorch/issues/139424 with the triton pin update that is landing soon Pull Request resolved: https://github.com/pytorch/pytorch/pull/141905 Approved by: https://github.com/drisspg ghstack dependencies: #143315	2024-12-17 18:55:50 +00:00
Tom Ritchford	1e9ec51431	Fix unused variables in test_serialize_sym_float (#143389 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143389 Approved by: https://github.com/Skylion007	2024-12-17 18:55:14 +00:00
William Wen	18261e9f39	[dynamo] implement framelocals mapping as c++ object (#140063 ) Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++. Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible. Some implementation notes: - `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells. - `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily. - We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed. - We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up) dynamo_guard_eval.py microbenchmark update: - 713.2us -> 630.0us (3.10) - 598.8us -> 530.7us (3.12) Other followups: - Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds). - Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063 Approved by: https://github.com/jansel ghstack dependencies: #142117, #142430	2024-12-17 18:54:27 +00:00
William Wen	c04f0bb7b9	[dynamo] add benchmark for guard eval (#142430 ) Benchmarks: - 713.2us (3.10) - 598.8us (3.12) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430 Approved by: https://github.com/jansel ghstack dependencies: #142117	2024-12-17 18:54:27 +00:00
William Wen	97ca09f692	[dynamo] format eval_frame.c (#142117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142117 Approved by: https://github.com/jansel	2024-12-17 18:54:27 +00:00
bobrenjc93	53e4d7b6a2	remove allow-untyped-defs for torch/_lazy/device_context.py (#143367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143367 Approved by: https://github.com/aorenste ghstack dependencies: #143366	2024-12-17 18:54:03 +00:00
eellison	bcc93a1e8e	remove nonowninglayout special case in require strides (#143315 ) NonOwningLayout is always constructed to a FixedLayout. We should handle it the same way as FixedLayout. Note - this case is very rare, I added an assertion here and no test/model failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143315 Approved by: https://github.com/zou3519	2024-12-17 18:47:38 +00:00
Bin Bao	a3688ead4b	[AOTI][doc] Update tutorial (#143390 ) Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390 Approved by: https://github.com/yushangdi	2024-12-17 18:35:40 +00:00
chuanqiw	fa4db62968	[CI] Unify the XPU Windows CICD installation scripts (#143185 ) Follow https://github.com/pytorch/pytorch/pull/142156 Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143185 Approved by: https://github.com/atalman	2024-12-17 18:26:19 +00:00
bobrenjc93	74e66a21b4	remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143369 Approved by: https://github.com/aorenste	2024-12-17 18:09:28 +00:00
Benjamin Glass	37a1b9efcc	[export] Serialize all dataclass fields (#142286 ) Reverts a change in #121337. All dataclass members must be serialized, even default-valued members, because downstream code often implicitly assumes their presence. This PR fixes a segfault when running `test_custom_op_all_inputs` from `test/inductor/test_aot_inductor_custom_ops.py`. This segfault was caused by querying for an "index" field for the `Device` type (see `torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp:136`), which was previously skipped when serializing if the device index was unspecified. A number of other structs which are deserialized in this file also contain optional fields, and presumably could experience the same bug. Fixes #138955 Fixes #134793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142286 Approved by: https://github.com/zhxchen17 ghstack dependencies: #142175	2024-12-17 17:21:27 +00:00
Benjamin Glass	bb06fc79fb	cpp_builder: handle CUDA lib paths involving "stubs" in more circumstances (#142175 ) conda packages for `cuda-driver-dev=12.4.127` use a "stubs" subdirectory to contain `libcuda.so`. This was previously only handled by cpp_builder in some cases, but now needs to be potentially handled more generally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142175 Approved by: https://github.com/desertfire	2024-12-17 17:21:27 +00:00
PyTorch MergeBot	e3d754419f	Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085 )" This reverts commit 1bf983077f9f9c19e20dac178aa764b4620d78e7. Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/huydhn due to The diff D66211131 has been commandeered internally and it is not part of the train anymore. If codev is needed, pls reland this accordingly ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2549092225))	2024-12-17 17:21:14 +00:00
bobrenjc93	ec02ae4345	remove allow-untyped-defs for torch/utils/benchmark/examples/simple_timeit.py (#143368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143368 Approved by: https://github.com/aorenste	2024-12-17 17:19:11 +00:00
bobrenjc93	313b9964ae	remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143370 Approved by: https://github.com/aorenste, https://github.com/desertfire ghstack dependencies: #143366	2024-12-17 17:18:10 +00:00
Guilherme Leobas	487343346e	Prevent users from seeing hardcoded print stmt when hypothesis is not installed (#142398 ) Fixes: #142357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142398 Approved by: https://github.com/zou3519	2024-12-17 16:59:05 +00:00
PyTorch MergeBot	969b07b96f	Revert "[ROCm] CK Flash Attention Backend (#138947 )" This reverts commit 500d02921bcf1619e268196866ddf099a4b94080. Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))	2024-12-17 16:46:57 +00:00
bobrenjc93	cd7de1f4fa	remove allow-untyped-defs for torch/masked/maskedtensor/creation.py (#143321 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143321 Approved by: https://github.com/laithsakka	2024-12-17 16:44:50 +00:00
Bin Bao	4d90c487d8	[AOTI] Add is_big_gpu checking to test_conv3d (#143339 ) Summary: test_conv3d tests max-autotune, which is only supported for big_gpu. Differential Revision: D67306331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143339 Approved by: https://github.com/BoyuanFeng	2024-12-17 16:18:45 +00:00
albanD	792f1c47e9	No actual change, just remove variable contain Tensors from global scope (#143225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225 Approved by: https://github.com/ezyang	2024-12-17 16:14:25 +00:00
Joona Havukainen	afa313e669	Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095 ) The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250). Fixes #141909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095 Approved by: https://github.com/kulinseth	2024-12-17 16:03:46 +00:00
Jackson	340f02c49b	make it clearer (in docs) one can double decorate with torch.library.impl_* APIs (#137608 ) Fixes #120503. Fix originally attempted by @soxand16 with PR: https://github.com/pytorch/pytorch/pull/121469. PR was almost ready to merge, but then went stale (over 6 months old). This PR implements original fix with refactoring for clarity. CC: @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137608 Approved by: https://github.com/zou3519	2024-12-17 15:13:58 +00:00
Yuanhao Ji	6bbbb08458	[Dynamo] Replace `torch._dynamo.optimize()` with `torch.compile()` [10/N] (#142451 ) > This is the last one related commits: - #139706 - #140238 - #140247 - #140253 - #140663 - #140688 - #140922 - #140924 - #140933 - #142451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451 Approved by: https://github.com/bdhirsh	2024-12-17 12:18:29 +00:00
Shunting Zhang	34a0d8b62e	[inductor] invalidate pointwise dep cache for LOAF (#141160 ) Fixes https://github.com/pytorch/pytorch/issues/141134 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141160 Approved by: https://github.com/vkuzo	2024-12-17 09:51:29 +00:00
drisspg	5160a725c8	[FlexAttention] Fix broken eager tracing (#143344 ) Fixes #143331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143344 Approved by: https://github.com/Chillee ghstack dependencies: #143299	2024-12-17 09:42:36 +00:00
Jason Ansel	cf46eb3bf5	[inductor] Include types and size hints in MultiKernel cache key (#142349 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142349 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-12-17 09:26:38 +00:00
Richard Barnes	e2d47a133b	Disable c10::optional macros (#138912 ) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-12-17 09:22:47 +00:00
Laith Sakka	c3f3a6e4d2	Back out "Fix undesired specialization on slice after split. (#142372 )" (#143356 ) Summary: Original commit changeset: e54ffcc9fd48 Original Phabricator Diff: D67113058 Reviewed By: ezyang Differential Revision: D67311579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143356 Approved by: https://github.com/oulgen	2024-12-17 09:17:18 +00:00
Adnan Akhundov	2531543c5f	[user triton cache] Dedup user-defined Triton kernels by config in codecache (#143353 ) Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to a wrong cache hit and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`. Test Plan: ``` python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs ... ---------------------------------------------------------------------- Ran 2 tests in 3.590s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353 Approved by: https://github.com/oulgen	2024-12-17 08:41:22 +00:00
Avik Chaudhuri	6056efc5ff	non strict sequential slicing (#143298 ) Differential Revision: [D67284841](https://our.internmc.facebook.com/intern/diff/D67284841/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143298 Approved by: https://github.com/zhxchen17	2024-12-17 08:35:20 +00:00
Shunting Zhang	297ce77636	[Inductor] inplace padding (#140249 ) https://github.com/pytorch/pytorch/issues/139865 This PR may change the semantic of constant_pad_nd from 'clone' to 'view'. I tried a few tests to do inplace update. Looks like thanks to functionalization, this works fine. Perf for `test_linear_and_cel`: ``` # TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=False ms=83.311 # TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel inductor_config.inplace_padding=True ms=79.827 ``` The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c. - Without the feature: 182.151ms per batch, 180.9K tokens/s - With the feature: 178.278ms per batch, 183.9K tokens/s. That is a 3K tokens/s increase. Perf test shows compilation time regression. I'm not sure if that's real. Will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7). UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249 Approved by: https://github.com/jansel	2024-12-17 06:15:48 +00:00
bobrenjc93	a42ca5a45b	remove allow-untyped-defs for _inductor/codegen/rocm/rocm_template_buffer.py (#143272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143272 Approved by: https://github.com/aorenste	2024-12-17 05:34:22 +00:00
drisspg	d2ec7f0756	[FlexAttention] Allow num_warps 8 since when block size >=128 (#143299 ) # Summary Fixes #143290 We already strip bad configs here: `e0e763e331/torch/_inductor/kernel/flex_attention.py (L2299)` So this shouldn't be needed. Confirming that the 64 x 128 case is valid otherwise we can just change the default config Pull Request resolved: https://github.com/pytorch/pytorch/pull/143299 Approved by: https://github.com/yanboliang	2024-12-17 05:32:41 +00:00
bobrenjc93	e7ec92331e	remove allow-untyped-defs for torch/jit/_ir_utils.py (#143366 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143366 Approved by: https://github.com/aorenste	2024-12-17 05:15:15 +00:00
Shuqi Yang	bcd3692132	[Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142474 ) Summary: Re-land the pr. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`. ``` _____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________ Unexpected success ``` ------ (Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.) ----- Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference. The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0 ------- The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`. Before the change: `shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold` After the change: `shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused` ---- It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again. Test Plan: ``` buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering ``` ----- Ran a float8 dynamic scaling training script to verify it e2e Differential Revision: D67012816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142474 Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314	2024-12-17 04:14:28 +00:00
Andy Lugo	500d02921b	[ROCm] CK Flash Attention Backend (#138947 ) Replaces https://github.com/ROCm/pytorch/pull/1592 This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics. Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947 Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian Co-authored-by: Xiaodong Wang <xw285@cornell.edu>	2024-12-17 02:18:07 +00:00
Huy Do	c15638d803	Enable swap on all Linux jobs (#143316 ) A swapfile on Linux runner has been prepared by https://github.com/pytorch/test-infra/pull/6058. So this PR does 2 things: * Start using the swapfile on all Linux build and test jobs * Testing the rollout https://github.com/pytorch-labs/pytorch-gha-infra/pull/582 ### Testing Run `swapon` inside the container and the swapfile shows up correctly: ``` jenkins@259dfb0a314c:~/workspace$ swapon NAME TYPE SIZE USED PRIO /swapfile file 3G 256K -2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143316 Approved by: https://github.com/ZainRizvi, https://github.com/atalman	2024-12-17 02:12:24 +00:00
Michael Lazos	cb4c614ed6	[foreach-map] Add tests for backward (#143282 ) Adds tests for unary and binary foreach_map w/ backwards Pull Request resolved: https://github.com/pytorch/pytorch/pull/143282 Approved by: https://github.com/eellison	2024-12-17 02:08:12 +00:00
PyTorch MergeBot	533d63f83b	Revert "FileTimerClient: add retry logic on connect (#143318 )" This reverts commit b3fb8f8a3a2fe07ca61852b09271382c988629fc. Reverted https://github.com/pytorch/pytorch/pull/143318 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143318#issuecomment-2547342910))	2024-12-17 02:06:52 +00:00
cyy	201cb8834f	Enable more C++ warnings (#143099 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099 Approved by: https://github.com/albanD	2024-12-17 02:03:39 +00:00
Yifu Wang	af190479c8	[fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160 ) ## Benchmark M=2048, N=3584, K=8192 baseline (nccl + cublas): 301us decomp-based async-tp: 354us comm-aware async-tp: 295us multimem_all_gather matmul: 277us As M further decreases, the multimem_all_gather approach consistently outperforms the baseline and other approaches (omitted other approaches in the chart as they start to be slower than the baseline): ![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160 Approved by: https://github.com/weifengpy ghstack dependencies: #142283, #142810, #143159	2024-12-17 01:07:27 +00:00
Yifu Wang	286921b39e	[fused_all_gather_matmul] introduce an argument to specify whether the all-gather result needs to be returned (#143159 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143159 Approved by: https://github.com/weifengpy ghstack dependencies: #142283, #142810	2024-12-17 01:07:27 +00:00
Yifu Wang	6fae60a34a	[SymmetricMemory] introduce multimem_all_gather (#142810 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142810 Approved by: https://github.com/weifengpy ghstack dependencies: #142283	2024-12-17 01:07:27 +00:00
PyTorch MergeBot	519d858c31	Revert "Kill capture_pre_autograd_graph API (#143224 )" This reverts commit 4c62275325afe21052f3fd49ed4135e3db3c47eb. Reverted https://github.com/pytorch/pytorch/pull/143224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure is legit ([comment](https://github.com/pytorch/pytorch/pull/143224#issuecomment-2547264675))	2024-12-17 00:47:24 +00:00
Will Constable	9d57a39541	[C10D] Update docs for wait() (#143305 ) Clarify that currently active stream, not default stream, is the one that will be blocked by a call to wait(), and also point out that the CPU is not blocked by the call for CUDA/nccl collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305 Approved by: https://github.com/LucasLLC, https://github.com/ngimel	2024-12-17 00:41:11 +00:00
Tristan Rice	b3fb8f8a3a	FileTimerClient: add retry logic on connect (#143318 ) Fixes #143188 The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side. Test plan: ``` pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318 Approved by: https://github.com/fegin	2024-12-17 00:36:10 +00:00
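The retry described in the FileTimerClient commit above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `retry_until` and `open_fifo_for_write` are hypothetical, and the choice of which `errno` values count as retryable is an assumption. It relies on the POSIX behaviour that opening a FIFO write-only with `O_NONBLOCK` fails with `ENXIO` while no reader has the FIFO open.

```python
import errno
import os
import time
from typing import Callable


def retry_until(opener: Callable[[], int], timeout: float = 1.0, interval: float = 0.01) -> int:
    """Call opener() until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper: only "server not ready yet" errors are retried.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return opener()
        except OSError as exc:
            # ENXIO: the FIFO has no reader yet; ENOENT: the FIFO does not exist yet.
            if exc.errno not in (errno.ENXIO, errno.ENOENT):
                raise
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def open_fifo_for_write(path: str, timeout: float = 1.0) -> int:
    """Open a FIFO for writing without blocking, waiting up to `timeout` for a reader."""
    # O_NONBLOCK makes open() fail immediately (ENXIO) instead of hanging
    # when the server thread has not opened the read end yet.
    return retry_until(lambda: os.open(path, os.O_WRONLY | os.O_NONBLOCK), timeout)
```

A bounded wait of about one second keeps error messages fast when the server is truly absent while tolerating a slow server thread start, which is the trade-off the commit message describes.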
Andrew Gu	90fb7c36ab	[FSDP2] Clamp `reduce_dtype` in lazy init (#143297 ) fixes https://github.com/pytorch/pytorch/issues/143277 by moving the clamp of `reduce_dtype` to `None` to lazy init (same place as where `param_dtype` can be clamped to `None`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143297 Approved by: https://github.com/weifengpy	2024-12-17 00:25:08 +00:00
atalman	dd2cd4279e	Create build_directory if it does not exist when generating ninja build file (#143328 ) Fixes: https://github.com/pytorch/vision/issues/8816 I am observing this failure on Windows, Python 3.13 vision builds: ``` Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja... error: [Errno 2] No such file or directory: 'C:\\actions-runner\\_work\\vision\\vision\\pytorch\\vision\\build\\temp.win-amd64-cpython-313\\Release\\build.ninja' ERROR conda.cli.main_run:execute(49): `conda run packaging/windows/internal/vc_env_helper.bat python setup.py bdist_wheel` failed. (See above for error) ``` Adding the code above fixes it, confirmed by running `` python setup.py bdist_wheel`` : ``` building 'torchvision._C' extension Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja... Creating build directory C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release Compiling objects... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/26] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -Dtorchvision_EXPORTS -IC:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\csrc -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\TH -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\THC -IC:\actions-runner\_work\_temp\conda_environment_12361066769\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Include "-IC:\Pr ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143328 Approved by: https://github.com/kit1980, https://github.com/albanD	2024-12-17 00:20:43 +00:00
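The fix described in the commit above amounts to creating the build directory before the ninja file is written. A minimal sketch, assuming a hypothetical helper `write_build_file` (not the actual function in `torch.utils.cpp_extension`):

```python
import os
import tempfile


def write_build_file(path: str, content: str) -> None:
    """Write `content` to `path`, creating the parent directory first if needed."""
    build_directory = os.path.dirname(path)
    if build_directory:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(build_directory, exist_ok=True)
    with open(path, "w") as f:
        f.write(content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Mirrors the nested, not-yet-existing directory from the error message.
        target = os.path.join(root, "build", "temp", "Release", "build.ninja")
        write_build_file(target, "ninja_required_version = 1.3\n")
        print(os.path.exists(target))
```

Without the `os.makedirs` call, `open(path, "w")` raises `FileNotFoundError` ([Errno 2]) when any parent directory is missing, which matches the failure quoted in the commit message.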
Bin Bao	467970d683	[AOTI] Relax input alignment assertion (#143236 ) Summary: https://github.com/pytorch/pytorch/pull/142136 added a runtime alignment assertion. But the assumption is probably too strict for more flexible use cases of AOTI, e.g. python deployment, see a recent error torchchat ran into for more details, https://github.com/pytorch/torchchat/actions/runs/12322072267/job/34394851280 . This PR relaxes the runtime check and implements copy_misaligned_inputs in cpp instead. Differential Revision: [D67287922](https://our.internmc.facebook.com/intern/diff/D67287922) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143236 Approved by: https://github.com/malfet, https://github.com/chenyang78	2024-12-17 00:17:39 +00:00
bobrenjc93	c4ab3e6ceb	remove allow-untyped-defs for torch/__config__.py (#143320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143320 Approved by: https://github.com/aorenste ghstack dependencies: #143319	2024-12-17 00:16:09 +00:00
bobrenjc93	0178e43949	remove allow-untyped-defs for torch/utils/_stats.py (#143319 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143319 Approved by: https://github.com/aorenste	2024-12-17 00:16:09 +00:00
Shivam Raikundalia	ff373171d0	[Profiler] Add Optional Flag to turn off external correlations v2 (#143314 ) Summary: The original diff got reverted because its base commit was on a broken version of pytorch that was failing rocm tests. There is no indication that this diff had any effect on rocm. I had trouble rebasing the GH PR after the revert and accidentally closed it, so I am submitting it again. Test Plan: See original PR with same name Differential Revision: D67293040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143314 Approved by: https://github.com/leitian, https://github.com/aaronenyeshi	2024-12-16 23:49:13 +00:00
rzou	10df370a77	Add missing IValue overloads for SymInt lists (#143167 ) We should be able to convert Int lists into SymInt lists. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143167 Approved by: https://github.com/ezyang ghstack dependencies: #143166	2024-12-16 23:18:55 +00:00
rzou	557da8014d	[gen_autograd_functions] rename some variables (#143166 ) This is a follow-up from https://github.com/pytorch/pytorch/pull/141278. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143166 Approved by: https://github.com/soulitzer	2024-12-16 23:18:55 +00:00
Shangdi Yu	4c62275325	Kill capture_pre_autograd_graph API (#143224 ) Summary: Delete the following API: - capture_pre_autograd_graph() - capture_pre_autograd_graph_using_training_ir() - gm_using_training_ir() There are no more call sites to `capture_pre_autograd_graph`, except: 1) two test cases in coreml, PR to remove: https://github.com/apple/coremltools/pull/2400 2) XLA: one test case in pytorch/xla, PR to remove: https://github.com/pytorch/xla/pull/8398 3) a few call sites guarded by version guard (< 2.5.0) Test Plan: CI Reviewed By: tugsbayasgalan Differential Revision: D64056353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143224 Approved by: https://github.com/tugsbayasgalan	2024-12-16 23:06:22 +00:00
PyTorch MergeBot	6356690b3d	Revert "[BE] Revert "Add conda to Manylinux Docker images (#139903 )" (#143300 )" This reverts commit c86383f956ee86f34d0ffb94bc229c51c6f11dd9. Reverted https://github.com/pytorch/pytorch/pull/143300 on behalf of https://github.com/atalman due to failing nova workflows with conda: command not found ([comment](https://github.com/pytorch/pytorch/pull/143300#issuecomment-2547030664))	2024-12-16 22:50:08 +00:00
eellison	135a2d4483	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2024-12-16 21:46:08 +00:00
Bradley Davis	15aee8e090	update aten bmm CK heuristic (#143294 ) Summary: updates heuristic to use new instances based on ck profiling of LLM shapes Differential Revision: D67280269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143294 Approved by: https://github.com/mxz297, https://github.com/xw285cornell	2024-12-16 21:44:59 +00:00
atalman	c86383f956	[BE] Revert "Add conda to Manylinux Docker images (#139903 )" (#143300 ) This reverts commit 56a40d4ebb0bcf733f1ea5f6efde805326a7a565. Having conda in manylinux builder images is not required. This was added to have manylinux-builder images as the only images for CD builds after conda-builder is deprecated. However we decided to start using ``almalinux-builder``. We are using almalinux-builder for linux_job_v2 which contains conda: https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml#L114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143300 Approved by: https://github.com/seemethere	2024-12-16 21:40:08 +00:00
Bert Maher	4e594f4d12	Triton bump for 3.2 cherry-picks (mmav3 segfault fix, gfx950 support) (#143302 ) * https://github.com/triton-lang/triton/pull/5277 * https://github.com/triton-lang/triton/pull/5084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143302 Approved by: https://github.com/atalman, https://github.com/pruthvistony	2024-12-16 21:22:29 +00:00
Aaron Orenstein	401b1498d2	[BE] typing for decorators - distributed/_tensor/ops/utils (#142139 ) Test Plan: unit tests Differential Revision: D62302679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142139 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2024-12-16 21:19:33 +00:00
Aaron Orenstein	159b7ad8aa	Improve async workers to handle forking for async compile (#142072 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142072 Approved by: https://github.com/masnesral	2024-12-16 21:16:42 +00:00
xadupre	678f74988d	Fix a misspelling [ONNX] (#143301 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143301 Approved by: https://github.com/titaiwangms	2024-12-16 20:19:41 +00:00
bobrenjc93	8ad842cda4	remove allow-untyped-defs for utils/data/datapipes/dataframe/structures.py (#143273 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143273 Approved by: https://github.com/aorenste ghstack dependencies: #143271	2024-12-16 20:07:36 +00:00
PyTorch MergeBot	54ed13cdce	Revert "Update low prec codegen for div/mod (#142350 )" This reverts commit ca973069ed9a08782695d9407605e219008821e2. Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks an internal test ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2546615951))	2024-12-16 20:05:14 +00:00
Adnan Akhundov	e885225eda	Add persistent+TMA version of Triton mm and addmm (#142101 ) This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the `tuned_mm` and `tuned_addmm` lowerings. The persistent+TMA choices are added to the GEMM autotuning if (checked by the `use_triton_tma_template` helper): 1. The min. hardware and Triton version requirements are met for the TMA support. 2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous). 3. The `config.triton.enable_persistent_tma_matmul` is set to `True`. Additional notes: 1. As added in this PR, the TMA uses are not compatible with prolog / epilogue fusion. To this end, in the new Triton template we currently support: TMA-based loads of A/B, but no prologue fusion; epilogue fusion, but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up. 2. The current Triton TMA API (`experimental_device_tensormap_create2d`) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous. 3. The transposed layouts of A and / or B are supported by passing the constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on the kernel perf, as decided at the Triton compilation time). 4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in https://github.com/triton-lang/triton/pull/5290) in the new Triton template, which should allow lifting 2 and 3 above. 5. The configs for the new Triton template in `persistent_mm_kernel_configs` are preliminary. We should do more perf exploration and possibly augment the config in a follow-up. 6. 
This PR is rebased onto and unifies with two related PRs landed previously: https://github.com/pytorch/pytorch/pull/142045 (some infra unification with the persistent+TMA template for _scaled_mm) and https://github.com/pytorch/pytorch/pull/134532 (add possibility to disable prolog fusion for selected choices). 7. The current Triton TMA API only supports 1D and 2D descriptors (even after https://github.com/triton-lang/triton/pull/5290, see [here](`9829ce87cc/python/triton/language/core.py (L1957)`)). For now, this blocks adding persistent+TMA template for `torch.bmm`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142101 Approved by: https://github.com/drisspg, https://github.com/eellison	2024-12-16 19:12:12 +00:00
Oguz Ulgen	17b71e5d6a	Add config alias (#142088 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142088 Approved by: https://github.com/c00w	2024-12-16 18:51:17 +00:00
William Wen	1b6b86fad7	[dynamo] disable eval frame callback around most of _TorchDynamoContext wrapper function (#143211 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1559636954674510/ If the `_fn` returned by `_TorchDynamoContext.__call__` makes an external function call, dynamo is recursively invoked. This can cause issues if there are added calls that are not skipped by Dynamo. So we should disable the eval frame callback as much as possible. Differential Revision: [D67211749](https://our.internmc.facebook.com/intern/diff/D67211749) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143211 Approved by: https://github.com/jansel	2024-12-16 18:38:58 +00:00
Animesh Jain	1bf983077f	[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085 ) Reland - https://github.com/pytorch/pytorch/pull/139560 As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions. Unfortunately, there is no easy way to trigger this segfault, so I can't write a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085 Approved by: https://github.com/jansel Co-authored-by: William Wen <williamwen@meta.com>	2024-12-16 18:38:32 +00:00
Jeeja	338835d0d2	Add support for other backends in get_preferred_device (#132118 ) Currently get_preferred_device supports only cuda and cpu. Add support for other backends using backend config. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118 Approved by: https://github.com/kwen2501	2024-12-16 18:30:41 +00:00
leslie-fang-intel	ccf35af142	[Inductor] Fix the Index Put lowering with same input of self and values (#139366 ) Summary Fix the issue: https://github.com/pytorch/pytorch/issues/138908, the root-cause is in https://github.com/pytorch/pytorch/issues/138908#issuecomment-2449192447 Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_put python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_add ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139366 Approved by: https://github.com/jgong5, https://github.com/eellison	2024-12-16 17:07:14 +00:00
PyTorch MergeBot	7ab3177776	Revert "[AMD] Turn on TF32 for aten::mm (#139869 )" This reverts commit e0bdae7884aed09d9e3f1a3f7a53c095e74a9aff. Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))	2024-12-16 16:46:48 +00:00
chuanqiw	a8cc19bb51	[CD] Fix XPU linux CD whl test failure (#143268 ) Follows https://github.com/pytorch/pytorch/pull/142482; refer to the original fix PR https://github.com/pytorch/pytorch/pull/130742 and the new issue in https://github.com/pytorch/pytorch/actions/runs/12323126436/job/34403681230 Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143268 Approved by: https://github.com/atalman	2024-12-16 15:00:03 +00:00
PyTorch UpdateBot	e4d2e81086	Update slow tests (#143278 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143278 Approved by: https://github.com/pytorchbot	2024-12-16 12:40:40 +00:00
bobrenjc93	d745b2b516	remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143271 Approved by: https://github.com/aorenste	2024-12-16 02:35:37 +00:00
Yu, Guangye	9706ada369	[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677 ) # Motivation This PR intends to add C++ accelerator device-agnostic APIs. # Additional Context This PR is relanded. It was reverted because `torch.Event` didn't support the mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is `f84e533a2c` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #143171, #133572	2024-12-16 02:18:41 +00:00
Yu, Guangye	45ac4ebf15	[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572 ) # Motivation This PR intends to add UTs for accelerator device-agnostic APIs. # Additional Context This PR is relanded. It was reverted because `torch.Event` didn't support the mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is `952514f0c8` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #143171	2024-12-16 02:18:41 +00:00
Yu, Guangye	c1d4d9d3cf	[MPS] Support torch.accelerator.synchronize() on mps (#143171 ) # Motivation Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization. So we must check if the current accelerator supports device lazy initialization rather than early return. # Additional Context Add a mps UT to test code change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171 Approved by: https://github.com/albanD	2024-12-16 02:18:32 +00:00
cyy	af8789c056	Hide torch_python symbols (#142214 ) Change symbols in torch_python to invisible by default on platforms other than Apple. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214 Approved by: https://github.com/ezyang	2024-12-16 00:59:26 +00:00
drisspg	744a303dee	[FlexAttention] Optimizing learned bias perf to dq calc (#142281 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142281 Approved by: https://github.com/Chillee	2024-12-15 21:44:32 +00:00
Xiaodong Wang	e0bdae7884	[AMD] Turn on TF32 for aten::mm (#139869 ) Summary: hipblaslt supports TF32, so adding the support. Test Plan: CI Differential Revision: D65435392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869 Approved by: https://github.com/leitian	2024-12-15 10:02:29 +00:00
PyTorch UpdateBot	5273d8fd2a	[audio hash update] update the pinned audio hash (#143265 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143265 Approved by: https://github.com/pytorchbot	2024-12-15 03:41:14 +00:00
PyTorch MergeBot	9ed045eae9	Revert "[Profiler] Add Optional Flag to turn off external correlations (#142516 )" This reverts commit b29fc52f827cc4b4336ecd24cc0a019ec9cf24b6. Reverted https://github.com/pytorch/pytorch/pull/142516 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/142516#issuecomment-2543431758))	2024-12-15 03:34:37 +00:00
Simon Fan	dd2d360b7d	[ca] re-enable disabled tests (#143247 ) FIXES https://github.com/pytorch/pytorch/issues/133197 The unspecified floats PR landed while this test was disabled, and it added an analysis restart which counts towards the backend call counter the test is using Pull Request resolved: https://github.com/pytorch/pytorch/pull/143247 Approved by: https://github.com/zou3519	2024-12-15 02:11:39 +00:00
cyy	4273e1a059	[5/N] Apply bugprone-unchecked-optional-access (#143111 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143111 Approved by: https://github.com/Skylion007	2024-12-15 01:07:28 +00:00
Tom Ritchford	91bf2e16de	[distributed] Remove unused variable in test_composability/test_pp_composability.py (#143191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143191 Approved by: https://github.com/mori360	2024-12-14 12:23:44 +00:00
Avik Chaudhuri	de484134e4	support slicing with symints in non-strict (#143217 ) Differential Revision: [D67215745](https://our.internmc.facebook.com/intern/diff/D67215745/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143217 Approved by: https://github.com/tugsbayasgalan	2024-12-14 10:27:45 +00:00
Michael Suo	9933e59c2b	[torch][cuda] fix race condition in cuda initialization (#143238 ) The access to lazy init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock. This exposes us to the following race: 1. start `_lazy_init` 2. take `_initialization_lock` 3. flush `_queued_calls` and run them all 4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`) 5. original thread finishes initializing, but never runs that call Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238 Approved by: https://github.com/ngimel	2024-12-14 07:41:24 +00:00
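The race in the commit above, and one way to close it, can be sketched in plain Python. This is an illustrative model, not the code in `torch/cuda/__init__.py`: the class `LazyRuntime` and its method names are hypothetical. The point is that enqueueing a deferred call and flushing the queue take the same lock, so a call can never land on the queue after the flush has finished.

```python
import threading
from typing import Callable, List


class LazyRuntime:
    """Hypothetical stand-in for a lazily initialized runtime with deferred calls."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._initialized = False
        self._queued_calls: List[Callable[[], None]] = []

    def lazy_call(self, fn: Callable[[], None]) -> None:
        # Check-and-enqueue happens under the initialization lock, so it cannot
        # interleave with the flush in lazy_init().
        with self._lock:
            if not self._initialized:
                self._queued_calls.append(fn)
                return
        # Already initialized: run immediately, outside the lock.
        fn()

    def lazy_init(self) -> None:
        with self._lock:
            if self._initialized:
                return
            for fn in self._queued_calls:
                fn()
            self._queued_calls.clear()
            self._initialized = True
```

One simplification to be aware of: because this sketch uses a non-reentrant `Lock`, a queued function that itself calls `lazy_call` would deadlock; a real implementation needs to handle that case separately.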
Oguz Ulgen	28d8297712	Migrate compiler config to Config (#143152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152 Approved by: https://github.com/ezyang ghstack dependencies: #143229	2024-12-14 07:38:25 +00:00
Oguz Ulgen	7c4d29485e	Add typechecking indirection for Config (#143229 ) When we create a Config[T], we actually dynamically unbox this in the module, so lets have type checker believe that Config[T] creates a T. This enables proper typechecking support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143229 Approved by: https://github.com/aorenste	2024-12-14 07:38:25 +00:00
Will Feng	be5b342332	[Inductor] Move peak memory pass and overlap pass to be run at the right place (#142822 ) This PR moves `decide_global_ordering_of_comms` to run first before all other Inductor scheduler passes, so that downstream passes have the correct dependency tracking info. It also moves peak memory pass and overlap pass to the end of all passes, because they need to be the final decision maker on the node order to achieve the desired peak memory and overlap. This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822 Approved by: https://github.com/eellison	2024-12-14 06:53:02 +00:00
Heiner	3cc617b6a7	`__cuda_array_interface__`: Use "<V2" for bfloat16. (#143042 ) Rationale: While Numpy doesn't support `bfloat16` and therefore there's no official typestr for `bfloat16` in `__array_interface__` (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#__array_interface__), JAX/ml_dtypes uses "<V2": ``` >>> from jax import numpy as jnp >>> jnp.bfloat16.dtype.str '<V2' ``` Using the same in PyTorch has the upside of making the typestrs returned by `__cuda_array_interface__` identify the torch dtype uniquely. ### Misc notes (1) JAX itself just refuses to do `__cuda_array_interface__` for `bfloat16`: ``` >>> from jax import numpy as jnp >>> jnp.arange(10, dtype=jnp.bfloat16).__cuda_array_interface__ Traceback (most recent call last): File "<stdin>", line 1, in <module> jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: __cuda_array_interface__ is not supported for bfloat16 buffers. ``` (2) The "official" description of `__cuda_array_interface__` doesn't mention bfloat16, it just references `__array_interface__`: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html (3) Ongoing issue for numpy to support bfloat16: https://github.com/numpy/numpy/issues/19808 (4) Tweet that triggered this: https://x.com/HeinrichKuttler/status/1866761979349844211, with @ezyang responding. (5) "<V2" is kinda weird, as it's a "little-endian void" type. When given to Numpy, it gets turned into endian-agnostic: ``` >>> import numpy as np >>> import ml_dtypes >>> np.dtype("bfloat16").str '<V2' >>> np.dtype("<V2").str '\|V2' ``` Still, it makes sense to have a unique string for `bfloat16` and since Google chose "<V2" we might as well use that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143042 Approved by: https://github.com/ezyang	2024-12-14 06:27:52 +00:00
Nichols A. Romero	c0a39ad35a	[ROCm] Fix TunableOp UTs: Rotating Buffer (#143172 ) TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value. Additional items in this PR: * UT for rotating buffer API * Clean up UTs that were setting the rotating buffer via the environment variable * Align behavior of environment variable and Python API when a negative value (< 0) is set. * Update documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172 Approved by: https://github.com/jeffdaily	2024-12-14 06:18:11 +00:00
Peter Bell	96c3b2c388	Expose remaining sharedMem cudaDeviceProps to python (#143226 ) Was a bit too fast with my earlier PR, `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`. I also expose `sharedMemPerBlock` for completeness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226 Approved by: https://github.com/ezyang	2024-12-14 06:13:28 +00:00
cyy	4764303cc6	Use static initialization to avoid once_flag in getCUDAHooks (#143198 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143198 Approved by: https://github.com/albanD	2024-12-14 06:05:41 +00:00
Edward Z. Yang	23379e8933	Add torch._compile to uninteresting files (#143209 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143209 Approved by: https://github.com/albanD	2024-12-14 05:40:21 +00:00
eellison	ca973069ed	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2024-12-14 03:53:28 +00:00
Edward Z. Yang	24f24eebde	Get rid of _lazy_import hack (#143213 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143213 Approved by: https://github.com/aorenste, https://github.com/albanD	2024-12-14 03:46:21 +00:00
PyTorch UpdateBot	698eefaddd	[audio hash update] update the pinned audio hash (#143245 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143245 Approved by: https://github.com/pytorchbot	2024-12-14 03:37:56 +00:00
cyy	e9f6045e80	[15/N] Fix extra warnings brought by clang-tidy-17 (#143100 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143100 Approved by: https://github.com/Skylion007	2024-12-14 03:24:10 +00:00
Eric Hanson	33dee721ae	Reraise worker errors as runtime errors in more cases when the original exception can't be constructed (#140911 ) related to https://github.com/pytorch/pytorch/issues/34130 when pytorch attempts to re-raise an exception from a worker process (e.g. multiprocessing dataloader), if it can't reconstruct the original exception message due to a type error, it instead raises it as a runtime error. However, if it can't reconstruct the exception for some other reason, it throws an error with a stacktrace pointing to the `ExceptionWrapper` code rather than the original underlying issue. One case in which I run into this is with boto3's [HTTPClientError](`66dc1f8d52/botocore/exceptions.py (L94)`)s. They must be constructed with a keyword argument `error`, but if `error` isn't passed, a `KeyError` is thrown instead of a `TypeError`, due to the particular way it is implemented: * [HTTPClientError](`66dc1f8d52/botocore/exceptions.py (L94)`)'s constructor accepts variable keyword arguments it passes to `super` (BotoCoreError) * [it also defines a field `fmt` with `error`](`66dc1f8d52/botocore/exceptions.py (L95)`) * BotoCoreError [expects to be able to format that string with the kwargs](`66dc1f8d52/botocore/exceptions.py (L41)`) So in this case, if a HTTPClientError occurs on a worker process, you simply get a `KeyError: error` with a stacktrace pointing to [this line](`3e2f276a14/torch/_utils.py (L710)`) which is unhelpful. Instead, I propose to reraise the error as a `RuntimeError` unconditionally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140911 Approved by: https://github.com/vmoens	2024-12-14 03:11:36 +00:00
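The failure mode in the commit above can be reproduced without boto3. The sketch below is an assumption-laden model: `HypotheticalClientError` only imitates the constructor shape described in the message, and `reraise` is a simplified stand-in for the `ExceptionWrapper` logic, not the code in `torch/_utils.py`. It shows why catching only `TypeError` is not enough and how falling back to `RuntimeError` for any construction failure keeps the original worker message visible.

```python
class HypotheticalClientError(Exception):
    """Imitates an exception whose message template requires an `error` keyword."""

    fmt = "request failed: {error}"

    def __init__(self, request=None, **kwargs):
        # str.format raises KeyError (not TypeError) when `error` is missing.
        super().__init__(self.fmt.format(**kwargs))


def reraise(exc_type, msg):
    """Re-raise a worker error in the parent process (simplified sketch)."""
    try:
        exception = exc_type(msg)
    except Exception:
        # The original type could not be rebuilt from a single message string,
        # so surface the message as a RuntimeError instead of a confusing
        # secondary error from the constructor.
        raise RuntimeError(msg) from None
    raise exception
```

With this fallback, `reraise(HypotheticalClientError, "boom in worker")` raises `RuntimeError("boom in worker")` rather than `KeyError: 'error'`, while ordinary types such as `ValueError` are still re-raised as themselves.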
Simon Fan	cdc03f99b7	[ca] add graph id (#141906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141906 Approved by: https://github.com/jansel ghstack dependencies: #141919	2024-12-14 03:02:06 +00:00
Nikita Shulga	19f3570000	[EZ] Remove `--pre` from numpy installation command (#143237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143237 Approved by: https://github.com/janeyx99, https://github.com/kit1980	2024-12-14 02:55:21 +00:00
xinan.lin	bf8d4f5b7a	[Inductor UT] Generalize device-bias code in test_triton_syntax.py. (#143178 ) Fix #143177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143178 Approved by: https://github.com/eellison	2024-12-14 02:08:32 +00:00
Arash Pakbin	86c3370bc3	operator benchmark: write output to a JSON (#142809 ) This pull request adds the functionality of writing the output of operator benchmark to an optional JSON file specified. The output is still printed in the terminal like before, but the user has the option of saving it in a JSON file as well. Main part of the functionality is implemented using the function _perf_result_to_dict which outputs a dictionary to be put inside a JSON file. Each dictionary corresponds to a single test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809 Approved by: https://github.com/albanD	2024-12-14 01:42:00 +00:00
zeshengzong	12098ad242	Add torch.cat tensors type promotion description (#141339 ) Fixes #126964 Add note description about type promotion of `torch.cat` Test Result Before ![image](https://github.com/user-attachments/assets/2449f11b-48ed-406e-b73e-6d00f8eadb00) After ![image](https://github.com/user-attachments/assets/cba99572-e8b1-4b9c-ba95-a963b54859ba) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141339 Approved by: https://github.com/albanD	2024-12-14 01:36:41 +00:00
Scott Wolchok	13233e062d	Fix Apple Clang ICE when building with -march=armv8.6a (#142879 ) When investigating #142703, I found that the build with -march=armv8.6 on my M1 mac was hitting a clang ICE. When looking at the blame code, I finally noticed that this constructor was nonsense, apparently in a way that the compiler frontend accepted but the backend choked on. example ICE error message: ``` fatal error: error in backend: Cannot select: 0x12689c260: bf16 = uint_to_fp 0x1258324a0 0x1258324a0: i32 = AssertZext 0x125822d90, ValueType:ch:i16 0x125822d90: i32,ch = CopyFromReg 0x1238dddc0, Register:i32 %22 0x12689c6c0: i32 = Register %22 In function: _ZN2at6native7DEFAULTL12logit_kernelERNS_18TensorIteratorBaseERKN3c106ScalarE c++: error: clang frontend command failed with exit code 70 (use -v to see invocation) Apple clang version 16.0.0 (clang-1600.0.26.3) Target: arm64-apple-darwin24.1.0 Thread model: posix ``` Unbreaks `env CFLAGS=-march=armv8.6-a CXXFLAGS=-march=armv8.6-a python setup.py develop --cmake` on M1 Mac. Differential Revision: [D67102953](https://our.internmc.facebook.com/intern/diff/D67102953/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142879 Approved by: https://github.com/malfet	2024-12-14 01:07:01 +00:00
Bradley Davis	063194aa32	add additional CK BMM Instances (2) (#142874 ) Summary: stacked changes to keep new codegen-ed instances below 2000 LOC Reviewed By: zjing14 Differential Revision: D66985408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142874 Approved by: https://github.com/mxz297	2024-12-14 01:04:34 +00:00
leslie-fang-intel	00b0210139	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x^2))` to calculate the result of `asinh`. The issue appears with inputs such as `-10000.1`, which make `x + sqrt(1 + x^2)` close to 0, and `log(0)` is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-14 00:27:55 +00:00
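The cancellation behind #142345 can be reproduced in plain Python. This is only a sketch: it uses float64 and a larger magnitude (`-1e9`) than the `-10000.1` quoted in the message so that the sum cancels to exactly zero, and `asinh_stable` is an illustrative odd-symmetry rewrite, not the sleef algorithm.

```python
import math


def asinh_naive(x):
    # Textbook identity; for large negative x the sum cancels toward 0.
    return math.log(x + math.sqrt(1.0 + x * x))


def asinh_stable(x):
    # asinh is odd, so evaluate on |x| (no cancellation) and restore the sign.
    return math.copysign(math.log(abs(x) + math.sqrt(1.0 + x * x)), x)


x = -1e9  # in float64, 1 + x*x rounds to x*x, so x + sqrt(...) is exactly 0.0
try:
    asinh_naive(x)
except ValueError as err:
    print("naive formula failed:", err)
print(asinh_stable(x), math.asinh(x))
```

The rewrite only removes the cancellation for large negative inputs. A production implementation also has to handle very small and very large magnitudes carefully.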
eellison	d53164880f	dont attempt to fuse in unaligned accesses to mm (#142435 ) This isn't profitable - we were trying to fuse in a padding of unaligned mm, which defeats padding's purpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435 Approved by: https://github.com/jansel ghstack dependencies: #142401, #142402	2024-12-14 00:22:31 +00:00
albanD	70be7900bb	Fix Tensor clear to properly clear slots (#143203 ) Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267 While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive. Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203 Approved by: https://github.com/ezyang, https://github.com/colesbury ghstack dependencies: #143202	2024-12-14 00:17:07 +00:00
albanD	8741d72e3c	move function before modifying it (#143202 ) This is a no-op. Just to make the diff in the next PR easier to read Pull Request resolved: https://github.com/pytorch/pytorch/pull/143202 Approved by: https://github.com/ezyang, https://github.com/janeyx99	2024-12-14 00:17:07 +00:00
atalman	3bfdf6f063	Exclude py 3.13t triton package from PyTorch 3.13t wheel (#143218 ) Follow up after https://github.com/pytorch/pytorch/pull/143162 Include triton only for 3.13 packages, not 3.13t Pull Request resolved: https://github.com/pytorch/pytorch/pull/143218 Approved by: https://github.com/kit1980	2024-12-14 00:12:45 +00:00
Nikita Shulga	515abb7744	[CI] Add Triton 3.13t build (#143212 ) By just extending the matrix and invoking script with appropriate cpython runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212 Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere	2024-12-13 23:45:47 +00:00
eellison	8621b9ff0c	Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402 ) For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics. Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box. We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely). Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402 Approved by: https://github.com/jansel ghstack dependencies: #142401	2024-12-13 23:25:15 +00:00
PyTorch MergeBot	4e0de50eb5	Revert "[CI] Add Triton 3.13t build (#143212 )" This reverts commit 571cd92d7c4c7bd2d5f068b5a285e0e70b8d0a40. Reverted https://github.com/pytorch/pytorch/pull/143212 on behalf of https://github.com/janeyx99 due to lint is failing, the other failures don't seem relevant but ci has turned red after this change haha ([comment](https://github.com/pytorch/pytorch/pull/143212#issuecomment-2542521875))	2024-12-13 23:03:45 +00:00
PyTorch MergeBot	f406207af2	Revert "[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827 )" This reverts commit 1e2b841675e50a6abd8dab9a95b33fda64b12e2b. Reverted https://github.com/pytorch/pytorch/pull/142827 on behalf of https://github.com/jeffdaily due to prematurely dropped support for gfx900/gfx906 ([comment](https://github.com/pytorch/pytorch/pull/142827#issuecomment-2542507857))	2024-12-13 22:48:44 +00:00
eellison	ad2faec8bb	Add a pass which analyzes whether a prologue preserves zero mask (#142401 ) We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask: ``` tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last') tmp1 = tmp0.to(tl.float32) a = tl.where(a_mask, tmp1, 0.0) ``` now we do not need to -> ``` tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last') tmp1 = tmp0.to(tl.float32) a = tmp1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401 Approved by: https://github.com/jansel	2024-12-13 22:37:33 +00:00
Shivam Raikundalia	b29fc52f82	[Profiler] Add Optional Flag to turn off external correlations (#142516 ) Summary: External Correlations are super spammy and oftentimes not even useful. Add flag during init to remove them entirely Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Dec_10_12_33_31.531106.pt.trace.json.gz&bucket=gpu_traces Differential Revision: D67048206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142516 Approved by: https://github.com/ngimel	2024-12-13 22:32:09 +00:00
Shangdi Yu	bb574abe73	[BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505 ) Summary: As title. This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" can no longer be input to quantization. But this is ok, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it. We remove the deprecated API references in code, docs, and tests. We also removed two tests that are specific to the capture_pre_autograd_graph API. Test Plan: CI Differential Revision: D65351887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505 Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168	2024-12-13 22:26:22 +00:00
Tom Ritchford	d25e6e623f	Fix unused Python variables in test/[a-d]* (#134665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665 Approved by: https://github.com/albanD	2024-12-13 22:13:12 +00:00
Brian Hirsh	e19f493f02	add private config to temporarily preserve old FSDP guard behavior (#142871 ) Summary: https://github.com/pytorch/pytorch/pull/138819 wobbled dynamo guards in a way that caused some performance regression, so this PR temporarily adds a config to get the old behavior back while we investigate. Test Plan: CI Differential Revision: D67096751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142871 Approved by: https://github.com/yf225	2024-12-13 22:06:48 +00:00
Shangdi Yu	8fae4397b4	Add "inductor_pre_grad_graph" logging (#142717 ) (#143126 ) Summary: Add new structured logging "inductor_pre_grad_graph" This is for inductor provenance tracking front-end to load this graph from tlparse. ghstack-source-id: 257581974 exported-using-ghexport Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest ``` Differential Revision: D67150288 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126 Approved by: https://github.com/desertfire	2024-12-13 21:48:25 +00:00
Nikita Shulga	8a04018329	[MPS] Fix conv backward for channels last (cont) (#143196 ) This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input. It looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea. Fixes https://github.com/pytorch/pytorch/issues/142344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196 Approved by: https://github.com/manuelcandales	2024-12-13 21:32:42 +00:00
Nikita Shulga	571cd92d7c	[CI] Add Triton 3.13t build (#143212 ) By just extending the matrix and invoking script with appropriate cpython runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212 Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere	2024-12-13 21:28:52 +00:00
Sam Larsen	60c54467db	[logging] Log runtime autotuning timing to scuba (#141919 ) See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919 Approved by: https://github.com/jamesjwu, https://github.com/ezyang	2024-12-13 21:22:13 +00:00
Eddie Yan	0d6d29af38	[CUDA] Follow up to clean up some `set_per_process_memory_fraction` usage in tests (#142811 ) follow-up to #140852 now that #140620 has landed Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811 Approved by: https://github.com/Skylion007	2024-12-13 21:09:05 +00:00
Yidi Wu	65d0a25289	[associative_scan] patch inductor tests to always run with static shape (#143161 ) fixes #143053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143161 Approved by: https://github.com/eellison	2024-12-13 21:06:12 +00:00
Aaron Orenstein	52f31cc238	dynamo tracing perf: Guard slots: 51.76 -> 51.34 (#143060 ) See #143056 for overall docs. This PR: Add slots to Guard Pull Request resolved: https://github.com/pytorch/pytorch/pull/143060 Approved by: https://github.com/jansel ghstack dependencies: #143066, #143056, #143058, #143059	2024-12-13 21:02:50 +00:00
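For readers unfamiliar with the technique named in #143060, here is a minimal sketch of what `__slots__` changes on a Python class. The two classes are hypothetical and are not Dynamo's `Guard` definition.

```python
class GuardWithDict:
    """Hypothetical record type; each instance carries a per-object __dict__."""

    def __init__(self, name, source):
        self.name = name
        self.source = source


class GuardWithSlots:
    """Same fields, but stored in fixed slots instead of a __dict__."""

    __slots__ = ("name", "source")

    def __init__(self, name, source):
        self.name = name
        self.source = source


a = GuardWithDict("x", "local")
b = GuardWithSlots("x", "local")
print(hasattr(a, "__dict__"), hasattr(b, "__dict__"))
```

Slotted instances skip the per-object dictionary, which typically makes creating many small objects cheaper. The trade-off is that attributes not listed in `__slots__` cannot be added later.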
PyTorch MergeBot	e87f07d3b8	Revert "Migrate compiler config to Config (#143152 )" This reverts commit 1ebdfd56053dafa8880a0dedf535fff70aa92e09. Reverted https://github.com/pytorch/pytorch/pull/143152 on behalf of https://github.com/oulgen due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/143152#issuecomment-2542342073))	2024-12-13 20:55:14 +00:00
Nikita Shulga	625b4edb97	[CD] Test torch.compile on 3.13 (#143207 ) Follow up after https://github.com/pytorch/pytorch/pull/143162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207 Approved by: https://github.com/atalman, https://github.com/ZainRizvi	2024-12-13 20:01:36 +00:00
atalman	fe9365f3f5	Add check_binary workflow to pytorch/pytorch (#143201 ) Migrated from pytorch/builder Related to: https://github.com/pytorch/builder/issues/2054 Copying from : `3468139e81` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143201 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-12-13 19:30:10 +00:00
Edward Z. Yang	8f40446770	Fix precedence of bitwise and/or printing (#143197 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197 Approved by: https://github.com/albanD, https://github.com/williamwen42	2024-12-13 19:29:42 +00:00
Oguz Ulgen	1ebdfd5605	Migrate compiler config to Config (#143152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152 Approved by: https://github.com/ezyang ghstack dependencies: #143150, #143151	2024-12-13 19:29:07 +00:00
Oguz Ulgen	f1ff8bc1c5	Add type to Config (#143151 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143151 Approved by: https://github.com/ezyang ghstack dependencies: #143150	2024-12-13 19:29:07 +00:00
Oguz Ulgen	9d05c8110d	Require Config to have a default (#143150 ) With aliases coming soon, we want to reject alias + default combo, so we need defaults to be passed in. On top of this, this simplifies statically type checking config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143150 Approved by: https://github.com/ezyang	2024-12-13 19:28:59 +00:00
Doru Bercea	bf711a9cce	[ROCm] Improve performance of reduce sum for 3D shapes (#143137 ) Improve performance of reduce sum for 3D shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143137 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2024-12-13 19:02:00 +00:00
Aaron Orenstein	6178be822d	dynamo tracing perf: direct Guard: 52.58 -> 51.76 (#143059 ) See #143056 for overall docs. This PR: Remove explicit constant check from `VariableBuilder.install_guards()` the args calling convention. Also remove a lambda binding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143059 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #143066, #143056, #143058	2024-12-13 18:20:48 +00:00
Aaron Orenstein	6bcda3a21a	dynamo tracing perf: cache on import_source: 52.9 -> 52.58 (#143058 ) See #143056 for overall docs. This PR: add cache to `InstructionTranslatorBase.import_source()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143058 Approved by: https://github.com/jansel ghstack dependencies: #143066, #143056	2024-12-13 18:20:48 +00:00
Aaron Orenstein	b472d82c96	dynamo tracing perf: import in build: 60.48 -> 59.92 (#143056 ) A series of directed perf improvements to drive down the dynamo tracing cost of the given test. Before this PR stack the compile took about 60s, and after takes 30s. Individual improvements are listed below along with the approximate improvement of that change. Tested with this model: ``` @torch.compile(backend="eager") def model_add(x, y): out = x for i in range(5000): out = torch.add(out, y) return out ``` This PR: Stop importing builder in the inner loop of `VariableTracker.build()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056 Approved by: https://github.com/jansel ghstack dependencies: #143066	2024-12-13 18:20:48 +00:00
Aaron Orenstein	63e1f97f4b	dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066 ) See #143056 for overall docs. This PR: Stop using `getframeinfo()` when we only care about the function name and throw the rest away. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066 Approved by: https://github.com/jansel	2024-12-13 18:20:48 +00:00
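The idea in #143066 can be illustrated with the standard library alone. The two helper functions are hypothetical names for this sketch. It assumes CPython, where `inspect.currentframe()` returns a real frame object.

```python
import inspect


def name_via_getframeinfo():
    frame = inspect.currentframe()
    # Builds a full Traceback record (filename, line number, source context),
    # most of which is discarded when only the function name is wanted.
    return inspect.getframeinfo(frame).function


def name_via_code_object():
    frame = inspect.currentframe()
    # Reads the name directly off the frame's code object.
    return frame.f_code.co_name


print(name_via_getframeinfo(), name_via_code_object())
```

Both helpers return the name of the function they run in. The second avoids the source lookup work that `getframeinfo()` performs.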
George Wigley	e0c8abda76	Fix potentially undefined behaviour in index_put sample input (#143116 ) From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_: > If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements. Currently the sample inputs for `index_put` generates 2 indices. Because they are generated randomly, they could be the same leading to undefined behaviour if `accumulate=False`. This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116 Approved by: https://github.com/albanD	2024-12-13 17:59:01 +00:00
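Why duplicate indices matter in #143116 can be shown with a pure-Python model. The `index_put` function below is a hypothetical 1-D stand-in that applies writes in list order. It is not the PyTorch kernel, which makes no ordering promise.

```python
def index_put(dest, indices, values, accumulate=False):
    """Toy 1-D model of index_put; writes are applied in list order."""
    out = list(dest)
    for i, v in zip(indices, values):
        out[i] = out[i] + v if accumulate else v
    return out


dest = [0, 0, 0]
# Duplicate index 1 without accumulation: the result depends on write order.
first = index_put(dest, [1, 1], [10, 20])
second = index_put(dest, [1, 1], [20, 10])
# With accumulation the contributions are summed, so order does not matter.
acc_a = index_put(dest, [1, 1], [10, 20], accumulate=True)
acc_b = index_put(dest, [1, 1], [20, 10], accumulate=True)
print(first, second, acc_a, acc_b)
```

In the model, swapping the order of the duplicate writes changes the non-accumulating result but not the accumulating one. That is why a sample input with a single index is safe for `accumulate=False`.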
Jeremy Hadidjojo	23b8ea3094	Allow disabling int specialization on nn.Modules (#142829 ) Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior). Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr` Differential Revision: D66837042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829 Approved by: https://github.com/ezyang, https://github.com/yanboliang	2024-12-13 17:26:11 +00:00
Peter Bell	82a45d19b4	Expose sharedMemPerMultiprocessor device property to python (#143119 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119 Approved by: https://github.com/ezyang	2024-12-13 16:53:57 +00:00
Jithun Nair	3f62054de1	[ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142151 Approved by: https://github.com/jeffdaily	2024-12-13 16:21:17 +00:00
eellison	7968732f5b	Fix int8 mm V.ops.mul dispatching (#143127 ) This is sort of subtle - because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and then later run into the assertion checking in the PR above. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127 Approved by: https://github.com/drisspg ghstack dependencies: #143048	2024-12-13 16:17:23 +00:00
Tom Ritchford	da67a6a7bb	[inductor] Replace set by OrderedSet (#138466 ) Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454 and considerable manual editing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466 Approved by: https://github.com/eellison	2024-12-13 16:08:45 +00:00
Zhengxu Chen	fbfc530442	[export][ez] Fix forward D67044185 (#143193 ) Summary: Fixing forward D67044185 and T210459833 by adding the missing build file. Test Plan: buck2 build --flagfile fbcode//mode/opt fbcode//admarket/training_data/augmentation/processors/tests:model_manager_test Differential Revision: D67200056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143193 Approved by: https://github.com/tugsbayasgalan	2024-12-13 16:06:42 +00:00
Andrey Talman	04bb82f097	Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162 ) We do build pytorch-triton package for python 3.13 : https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271 Hence constraint is no longer needed. This stack enabled torch.compile for Python 3.13 : https://github.com/pytorch/pytorch/pull/141264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162 Approved by: https://github.com/kit1980	2024-12-13 15:08:44 +00:00
Yifu Wang	810808d97d	Enable cutlass-based all-gather matmul when TORCH_SYMM_MEM_ENABLE_NATIVE_ASYNC_TP is set (#142283 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142283 Approved by: https://github.com/weifengpy, https://github.com/Chillee	2024-12-13 10:29:14 +00:00
Bin Bao	3e1f587514	[AOTI] Fix an autotune block grid computation issue (#143098 ) Summary: There is a grid computation issue after switching to one-pass codegen in https://github.com/pytorch/pytorch/pull/141980. When max-autotune is turned on, there is an incorrect grid codegen in some cases. Reviewed By: henrylhtsang Differential Revision: D67120987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143098 Approved by: https://github.com/henrylhtsang	2024-12-13 07:52:30 +00:00
Nikita Shulga	9f90583ca2	[CI] Run aarch64 tests on Graviton3 (#143129 ) Which is armv8.6 that has SVE and BF16 capability mkldnn_pattern_matcher skips are tracked in https://github.com/pytorch/pytorch/issues/143146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143129 Approved by: https://github.com/digantdesai	2024-12-13 07:39:22 +00:00
Nikita Shulga	c37185c76a	[BE] Stop using deprecated APIs in mkldnn_pattern_matcher (#143156 ) This should fix ``` /var/lib/jenkins/workspace/test/inductor/test_mkldnn_pattern_matcher.py:157: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143156 Approved by: https://github.com/kit1980	2024-12-13 06:37:20 +00:00
cyy	075905b7bd	[14/N] Fix extra warnings brought by clang-tidy-17 (#141644 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644 Approved by: https://github.com/ezyang Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>	2024-12-13 06:22:13 +00:00
Simon Fan	72fd7abb35	[ca] fix flex attention backward HOP capture in initial graph (#143155 ) FIXES https://github.com/pytorch/pytorch/issues/142313 So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that. This is safe because other than the reenter_make_fx call, there were no other make_fx internals usage in the HOP. And compiled autograd specializes on the AOT backward's saved symints which should cover any changes in shapes to the inputs of the HOP. However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward is running in eager as of this PR. The tlparse looks really scuffed after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155 Approved by: https://github.com/drisspg	2024-12-13 06:04:39 +00:00
Ryan Guo	b4f4c75e19	[dynamo] Support multiple inheritance for custom dict construction (#142416 ) This patch applies a local and practical workaround for custom dict construction when multiple inheritance is involved. Handling multiple inheritance in general could be a lot more involved, so I created #142414 to track that. Fixes #141118. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416 Approved by: https://github.com/jansel	2024-12-13 05:13:05 +00:00
bobrenjc93	b5d8d2444a	add README.md for compile time benchmarks (#143145 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145 Approved by: https://github.com/laithsakka ghstack dependencies: #141517, #143143	2024-12-13 05:12:26 +00:00
lzhang2	b7ad52abb0	Use new group instead of split group on non-CUDA device (#141469 ) Motivation: Currently, `split_group` only works for the NCCL backend. https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745. So we need to use `new_group` on other non-CUDA devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469 Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD	2024-12-13 05:11:33 +00:00
sanchitintel	57c46af47a	[Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt (#142110 ) ### Summary Extends #142036 for Inductor pattern-matching pattern covered for torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) - - int8 quantized (symmetrically) activation (per token quantized). - Statically (so, scales are also constant. But then they would have been constant even in case of dynamic quantization due to constant weights, anyway) per-channel int8 quantized (symmetrically) weights (which are also constant because freezing is enabled). The pattern that's matched is `torch._intmm` -> convert to FP32/BF16 -> [optional expand for activation scale] ->`mul` -> `mul`. We don't check if the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side-effects even if that wouldn't be true). In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D). ### More details oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused. The fusion pattern used in this PR is `torch._intmm` -> convert to FP32/BF16 ->`mul`, which will be replaced by oneDNN qlinear op. The speedup over eager-mode is due to 2 reasons - 1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided). 2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time. But, in the future, the whole pattern (including application of activation scale, which would be a mul post-op) + bias could be fused if corresponding support would be enabled in ATen. 
### Verification Added UT in this PR ``` python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm ``` #### Corresponding torchao UTs 1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`. The difference from #139595 is that there are no reshapes of the linear output in this pattern. 2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights - ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #142036	2024-12-13 04:59:03 +00:00
eellison	b731ced91f	Prologue Fusion (#134532 ) This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion. Similar to the store_output api: `{{store_output(("idx_m", "idx_n"), "acc", "mask")}}` And the modification api: ``` {{ modification( subgraph_number=0, output_name="post_mod_scores", score="qk", out="qk" ) \| indent_except_first(1) }} ``` We have: ```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}``` Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](`bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)`) on every iteration and instead on each iteration compute indices from the k_idx of each loop. This did not have any perf difference. There are a couple main use cases for prologue fusion: - Fusing dequants into a matmul, particularly for more bandwidth bound scenarios. - Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details. Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In a future PR we could potentially have an api for being more aggressive if we know we are in a bandwidth bound regime. 
See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066 Other notes: By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls. With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532 Approved by: https://github.com/jansel	2024-12-13 04:18:25 +00:00
bobrenjc93	ceb664aca6	add float_args benchmark (#143143 ) 71% improvement with automatic dynamic float arguments with specialize_float=False ``` float_args,compile_time_instruction_count,346293869 ``` with specialize_float=True ``` float_args,compile_time_instruction_count,1198546486 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143 Approved by: https://github.com/laithsakka ghstack dependencies: #141517	2024-12-13 03:35:59 +00:00
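The 71% figure in #143143 follows directly from the two instruction counts quoted in the message. A quick check (the variable names are chosen for this sketch):

```python
# compile_time_instruction_count values quoted in the commit message
specialized = 1_198_546_486  # specialize_float=True
dynamic = 346_293_869        # specialize_float=False

# Fraction of instructions saved relative to the specialized run
reduction = 1 - dynamic / specialized
print(f"{reduction:.1%}")  # prints 71.1%
```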
Simon Fan	ab04f3aee1	[ca] set autograd graph task state (#143108 ) GraphTask holds metadata needed for a single execution of backward(), it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs. In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: `a5fb07af27/torch/utils/checkpoint.py (L1128)` If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task `a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)` The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks. FIXES https://github.com/pytorch/pytorch/issues/142862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108 Approved by: https://github.com/jansel, https://github.com/albanD	2024-12-13 03:10:48 +00:00
Blaine Burton Rister	dbe4b69df0	[Inductor] Fix cooperative reduction tests broken in recent refactor (#143135 ) These tests were broken by https://github.com/pytorch/pytorch/pull/142020. This PR updates the fixed configs accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143135 Approved by: https://github.com/jansel, https://github.com/huydhn	2024-12-13 02:03:43 +00:00
cyy	9f5ebf3fc6	Clang-format aten/src/ATen/native/Tensor*{cpp,h} (#143089 ) These files are relatively stable, so it should be safe to format them without incurring conflicts Pull Request resolved: https://github.com/pytorch/pytorch/pull/143089 Approved by: https://github.com/albanD	2024-12-13 00:06:48 +00:00
Wouter Devriendt	2533a5a843	upgrade sccache to 0.9.0 (#142854 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142854 Approved by: https://github.com/malfet, https://github.com/ZainRizvi	2024-12-12 22:49:50 +00:00
Xia, Weiwen	fb93462904	[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#142036 ) Reopen of https://github.com/pytorch/pytorch/pull/139595 About the PR In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns: - (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape` - (with bias) `pattern_no_bias -> add -> reshape -> reshape` The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants. Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`. Validation results Accuracy/perplexity is not changed with or without this fusion pass. Latency is improved by >10% with the fusion pass. 
Test method: - Model: EleutherAI/gpt-j-6b - Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores - Using Intel OMP and Tcmalloc - Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile` Test plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm ``` Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036 Approved by: https://github.com/jerryzh168, https://github.com/jgong5 Co-authored-by: sanchitintel <sanchit.jain@intel.com>	2024-12-12 21:18:03 +00:00
Chien-Chin Huang	602c86a420	[DSD] Fix strict=False case for DDP (#143038 ) Summary: As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/143038 Approved by: https://github.com/mori360	2024-12-12 21:15:21 +00:00
Adrien Aguila--Multner	a7509e98c5	[pipelining] fix backward_one_chunk when the output of the model is a… (#142237 ) fixes #142229 if any of ``stage_output`` is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause a side effect. The same code is used in ``_backward.py`` (`b64a537993/torch/distributed/pipelining/_backward.py (L215)`) but does not seem to cause any issue in my case. Maybe needs some investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237 Approved by: https://github.com/H-Huang	2024-12-12 20:59:35 +00:00
Huy Do	39cacc1d81	Fix missing tests on test tool lint job (#143052 ) A follow-up from https://github.com/pytorch/pytorch/pull/142476#discussion_r1878888558 where some tests are not discovered correctly by pytest ### Testing https://github.com/pytorch/pytorch/actions/runs/12287448581/job/34289531307?pr=143052#step:14:162 shows the correct number of tests now Pull Request resolved: https://github.com/pytorch/pytorch/pull/143052 Approved by: https://github.com/ZainRizvi	2024-12-12 20:29:32 +00:00
Richard Barnes	82ce888273	c10::string_view -> std::string_view in more places (#142517 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517 Approved by: https://github.com/malfet	2024-12-12 19:45:59 +00:00
eellison	0b75b7ff2b	[Easy] factor out inductor ophandler decompositions (#142400 ) Factor out inductor operator decompositions Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-12-12 19:03:26 +00:00
Shivam Raikundalia	c170248b78	[Profiler] Enable Iterative Step without profiler in fbcode (#142077 ) Summary: Adds post optimizer hook for fbcode so that we can run iterative on demand without having to use a frontend profiler interface. Since this is being used more frequently, it would be convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window. Test Plan: Ran iterative tracing without profiler.profile Differential Revision: D66734119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077 Approved by: https://github.com/briancoutinho	2024-12-12 19:00:13 +00:00
atalman	e3fe5f62b6	Remove Checkout pytorch/builder for Linux Binary Builds (#143125 ) Follow Up after: https://github.com/pytorch/pytorch/pull/142282 Remove Checkout pytorch/builder for Linux Binary Builds I believe we were not using builder already. Hence remove this checkout. We should be using scripts from this folder: ``` /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh ``` TODO: Will followup with removing BUILDER_ROOT everywhere from PyTorch repo Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125 Approved by: https://github.com/kit1980	2024-12-12 18:55:00 +00:00
PyTorch MergeBot	d48b16a725	Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847 )" This reverts commit 357e261b1eded933d98de18ddcef2b083f87259d. Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/atalman due to Breaks binary builds, see the comment above ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2539759580))	2024-12-12 18:44:35 +00:00
Howard Huang	b0c3d39e0d	[pipelining] Update tutorials and documentation (#143045 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-12-12 18:42:17 +00:00
Zhengxu Chen	ee5bceaee6	[sigmoid] Write the new export schema format to archive without breaking compatibility. (#142511 ) Summary: This diff makes it possible to migrate to PyTorch's OSS export schema from sigmoid. Basically, we add a new field called "methods" to ExportedProgram in Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double writing a new format in equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here and it shouldn't break on-wire compatibility. As long as every sigmoid user is using sigmoid.frontend.serialization.serialize, we always guarantee to have the new format saved as well. Eventually we will use json deserialization from OSS so we will only keep this double writing for a couple of months. Eventually, we will migrate every serialization path to the OSS workflow. Test Plan: buck test mode/opt sigmoid/frontend:serialization_test buck test mode/opt sigmoid/frontend/test_gpu:serializer_test Differential Revision: D67044185 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511 Approved by: https://github.com/desertfire	2024-12-12 18:41:10 +00:00
Joel Schlosser	5dabe2d464	Fix NJT backward tests (#143072 ) This PR fixes some issues with NJT backward / compile backward tests: 1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph. 2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within. * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py` 3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix! Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer	2024-12-12 18:06:23 +00:00
Xuehai Pan	d47a80246a	[dynamo][pytree][3/N] make CXX pytree traceable: `tree_map` / `tree_map_` (#137399 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137399 Approved by: https://github.com/jansel ghstack dependencies: #137398	2024-12-12 18:05:25 +00:00
Xuehai Pan	7edeb1005a	[dynamo][pytree][2/N] make CXX pytree traceable: `tree_flatten` / `tree_unflatten` / `tree_structure` (#137398 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137398 Approved by: https://github.com/jansel	2024-12-12 18:05:25 +00:00
PyTorch MergeBot	c85323c5e8	Revert "Tests Generelization for multiple accelerator devices (#139184 )" This reverts commit b576a8c318201b63269f7ff25ec5830d00662a7a. Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))	2024-12-12 17:48:30 +00:00
PyTorch MergeBot	2f0fe82f6d	Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644 )" This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402. Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))	2024-12-12 17:43:36 +00:00
Tom Ritchford	dc23f1944a	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-12 17:39:14 +00:00
Richard Barnes	7667235a23	c10::optional -> std::optional (#142514 ) Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-12 17:23:46 +00:00
Blaine Burton Rister	520ba556cd	[Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land. The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR. These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf. # Test plan The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.` to `r0_.`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-12 17:22:20 +00:00
PyTorch MergeBot	cf538efd0c	Revert "Hide torch_python symbols (#142214 )" This reverts commit da76e912a4c58c649061fc84b29a42714897a0ca. Reverted https://github.com/pytorch/pytorch/pull/142214 on behalf of https://github.com/huydhn due to The MacOS failure looks legit as it shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/142214#issuecomment-2539543504))	2024-12-12 17:15:51 +00:00
Simon Fan	15ee2960e1	[aot] Functionalize aot backward prologue and epilogue wrappers (#142415 ) For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaking and imposing too many restrictions, we allow_in_graph the prologue and epilogue. This adds 2 restrictions: - code must be available in the global context - inputs other than tensors/symnodes must be const foldable Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415 Approved by: https://github.com/bdhirsh	2024-12-12 17:14:29 +00:00
Sam Larsen	30b61e521c	[logging] Populate compile_time_autotune_time_us (#143104 ) See testing in attached diff Differential Revision: [D67128210](https://our.internmc.facebook.com/intern/diff/D67128210) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143104 Approved by: https://github.com/ezyang	2024-12-12 17:08:43 +00:00
Yasyf Mohamedali	e3ddc0ca33	Support remote caching requiring redis auth (#141679 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141679 Approved by: https://github.com/masnesral	2024-12-12 17:07:50 +00:00
Svetlana Karslioglu	0f78be5573	Fix search icon (#142808 ) Removing: .pytorch-left-menu-search input[type=text] { background-image: none; } so that the search icon correctly appears in the sphinx searchbox Also, fixing scrolling Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808 Approved by: https://github.com/albanD	2024-12-12 16:09:30 +00:00
eellison	725526abc5	Fix scan dtypes (#143048 ) Fix for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped; see https://github.com/pytorch/pytorch/issues/143053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048 Approved by: https://github.com/arui-meta, https://github.com/blaine-rister	2024-12-12 15:57:00 +00:00
Nikita Shulga	d83a049232	[EZ] Update lintrunner in CI to 0.12.7 (#143073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143073 Approved by: https://github.com/wdvr	2024-12-12 15:35:37 +00:00
drisspg	7cc3a591c2	[FlexAttention] Fix a few more symbolic shape issues (#142816 ) # Summary See https://github.com/pytorch/pytorch/issues/139064 for more details. This fixes a number of issues with dynamic shapes. Thanks to @alexdremov for finding most of these Pull Request resolved: https://github.com/pytorch/pytorch/pull/142816 Approved by: https://github.com/yanboliang, https://github.com/ezyang	2024-12-12 15:29:21 +00:00
atalman	84f791381a	Python 3.13 CI add crossref test to existing linux-focal-py3_13-clang10-build (#143074 ) Add linux-jammy-py3_13-gcc11-build and test - similar to Py 3.9 Add crossref test to existing linux-focal-py3_13-clang10-build Pull Request resolved: https://github.com/pytorch/pytorch/pull/143074 Approved by: https://github.com/malfet	2024-12-12 14:45:56 +00:00
PyTorch MergeBot	cd1b5924d5	Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 )" This reverts commit 79cf8fa75176a8f6bb79d426c6d0f9369d03ff98. Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))	2024-12-12 14:42:55 +00:00
Edward Z. Yang	30e2b322a1	Add <string> to uninteresting_files (#142984 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/142984 Approved by: https://github.com/albanD, https://github.com/IvanKobzarev	2024-12-12 14:35:30 +00:00
gasoonjia	91261107e0	debug handler maintain through decomposition (#141612 ) Add checks in the ao numeric debugger to guard the debug handle consistency across aten op decomposition Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612 Approved by: https://github.com/jerryzh168	2024-12-12 12:26:45 +00:00
Xuehai Pan	18785c1af9	[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-12-12 10:53:48 +00:00
Saiteja Samudrala	a5fb07af27	[Torch][#142396 ]Resolve Failure When Uploading To Remote Storage (#143046 ) Summary: Catch io.UnsupportedOperation exception so that stream's without fileno support don't cause failure Test Plan: UT Differential Revision: D67108487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143046 Approved by: https://github.com/saumishr	2024-12-12 08:17:15 +00:00
Avik Chaudhuri	497f89ff83	fix dynamo nn module stack fqn (#142823 ) Dynamo can produce sources that have funny patterns in their `.name()` that break `nn_module_stack` fqns. Added a test that used to have `._modules` inside nn_module_stack fqns, now doesn't. (Unfortunately couldn't repro a case mentioned in the GH issue where `.slice(...)` is claimed to appear as well.) Fixes https://github.com/pytorch/pytorch/issues/141939 Differential Revision: [D67064189](https://our.internmc.facebook.com/intern/diff/D67064189/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142823 Approved by: https://github.com/pianpwk, https://github.com/zhxchen17	2024-12-12 07:02:13 +00:00
cyyever	da76e912a4	Hide torch_python symbols (#142214 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214 Approved by: https://github.com/ezyang	2024-12-12 07:00:54 +00:00
Nichols A. Romero	dcb128d495	[ROCm] TunableOp use thread-safe getenv functions (#142274 ) Fixes #142403 ~~PR fixes breakage due to this commit `8cd7ad8b48`~~ PR is a partial reland of this https://github.com/pytorch/pytorch/pull/140594 with a unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142274 Approved by: https://github.com/jeffdaily, https://github.com/eqy	2024-12-12 06:49:26 +00:00
Xilun Wu	5ad7d5304c	[DTensor][random] add HSDP+TP model init test (#143077 ) Summary 1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest` 2. Added a HSDP+TP meta init test to show correct model init result in this use case. Note that this test requires 8 GPUs to run and our CI doesn't have that capacity so this test will be skipped on CI testing. A local run shows that the test passes on a 8-GPU host. Test `pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init` <details> <summary> Test Result </summary> <img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" /> </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077 Approved by: https://github.com/weifengpy	2024-12-12 06:46:16 +00:00
Michael Lazos	357e261b1e	[Dynamo] only import einops if version is lower than 0.7.0 (#142847 ) Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847 Approved by: https://github.com/zou3519	2024-12-12 06:38:22 +00:00
Michael Lazos	9701c50bdc	[Dynamo] Add missing tensor builtins to allowed functions (#142841 ) Fixes https://github.com/pytorch/pytorch/issues/141232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841 Approved by: https://github.com/yanboliang	2024-12-12 06:38:19 +00:00
YeJiaxi	b25f64b613	Add -o pipefail for all bash scripts (#143050 ) Fixes #142380 I have added -o pipefail in all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR. Thanks for the correction! Please contact me again if there are any problems with this fix^^. (Actually contributing to the open source community is an assignment for one of my courses and today is the deadline so I rushed to revise it when I saw an email early in the morning. Haha.) @ezyang @malfet @huydhn @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050 Approved by: https://github.com/ezyang, https://github.com/huydhn Co-authored-by: Edward Z. Yang <ezyang@mit.edu>	2024-12-12 06:18:41 +00:00
leslie-fang-intel	79cf8fa751	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x^2))` to calculate the result of `asinh`. The issue occurs for inputs such as `-10000.1`, which make `x + sqrt(1 + x^2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-12 05:40:48 +00:00
Jithun Nair	1e2b841675	[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827 ) Remove gfx900 and gfx906 archs as they're long-in-the-tooth. Should help reduce the increasing size of ROCm binaries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142827 Approved by: https://github.com/jeffdaily	2024-12-12 05:33:40 +00:00
cyy	fda43c98d1	Improve implementation of quantized_batch_norm (#141570 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141570 Approved by: https://github.com/albanD	2024-12-12 04:35:00 +00:00
cyy	20df80a669	Remove unneeded optional dereference (#141578 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141578 Approved by: https://github.com/swolchok	2024-12-12 04:34:43 +00:00
cyy	f7b9533c3f	[4/N] Apply bugprone-unchecked-optional-access (#142832 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142832 Approved by: https://github.com/albanD	2024-12-12 04:33:32 +00:00
James Wu	fbbafd0320	Turn on AOTAutogradCache by default on open source (#141981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981 Approved by: https://github.com/bdhirsh, https://github.com/oulgen	2024-12-12 04:21:11 +00:00
mori360	4d0775462e	E2E composability testing (#141398 ) Add 3D(pp+tp+fsdp) test `test_3d_with_tp_dp_pp` at test_pp_compodability Currently provide @parametrize on "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble] "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32] Future work: 1. add fp8 2. add cp(context parallelism) to enable 4D test Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-12-12 04:19:29 +00:00
cyy	2903cf0ad8	Re-enable some C++ warnings (#142332 ) It enables some C++ warnings since the code base is fairly clean. Meanwhile, Wextra-semi is disabled on CUDA generated code since there is no way to fix them without the cooperation of CUDA team. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332 Approved by: https://github.com/albanD, https://github.com/eqy	2024-12-12 04:02:12 +00:00
Carlo Bertolli	f892f9862a	[ROCM] Enable *_load_dwordx4 ISA for BFloat16 and Half. (#141397 ) Remove input_vec_size constexpr and move it to template parameter. This enables generation of vectorized loads in ROCm AMDGPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141397 Approved by: https://github.com/jeffdaily Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>	2024-12-12 03:27:49 +00:00
Nikita Shulga	4d8357e912	[CD] Use Anaconda cmake for Mac builds (#143054 ) To find Anaconda-env-installed OpenMP (as OpenMP from PyPI is looking for it in a different place) For posterity: our build script names are very confusing: - [`.ci/wheel/build_wheel.sh`](`6cb6e8d790/.ci/wheel/build_wheel.sh`) is only used for MacOS wheel/libtorch builds - [`.ci/manywheel/build.sh`](`6cb6e8d790/.ci/manywheel/build.sh`) is used for Linux wheel/libtorch builds - [`.ci/pytorch/windows/build_pytorch.bat`](`6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat`) is used for Windows wheel builds Fixes https://github.com/pytorch/pytorch/issues/142873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054 Approved by: https://github.com/Jack-Khuu, https://github.com/atalman	2024-12-12 03:05:46 +00:00
Ke Wen	cb354f8b47	[PGNCCL] Move NCCLComm impl to cpp (#142826 ) BE as titled. No behavior change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826 Approved by: https://github.com/wconstab, https://github.com/c-p-i-o	2024-12-12 02:45:52 +00:00
leslie-fang-intel	06075d3d18	[Inductor][CPP] Fix Mask Dtype mismatch (#142103 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type isn't aligned when doing `bitwise_and`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103 Approved by: https://github.com/jgong5	2024-12-12 01:21:32 +00:00
Colin L. Rice	d68403df3b	filelock: Make waitcounter variant to use (#139816 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816 Approved by: https://github.com/ezyang	2024-12-12 01:18:34 +00:00
atalman	6cb6e8d790	Python 3.11, 3.12 Remove tests covered by 3.13 (#143078 ) We do have linux-focal-py3_13-clang10-build and test. Hence removing linux-focal-py3_11-clang10-build/test and linux-focal-py3_12-clang10-build/test Pull Request resolved: https://github.com/pytorch/pytorch/pull/143078 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-12-12 01:12:00 +00:00
atalman	1dd6f21029	Cuda 12.1 - Remove from trunk tests (#143076 ) Remove cuda 12.1 from trunk tests. This is covered by 12.4 tests. Move ``libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build`` -> ``libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143076 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-12-12 01:10:09 +00:00
atalman	bd7d81db9e	Use validate-docker-images workflow from test-infra (#143081 ) After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra. Related to: https://github.com/pytorch/builder/issues/2054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081 Approved by: https://github.com/huydhn	2024-12-12 00:24:27 +00:00
cyy	db81a3f31c	[TorchGen] remove remove_non_owning_ref_types from valuetype_type (#142449 ) It is not used Pull Request resolved: https://github.com/pytorch/pytorch/pull/142449 Approved by: https://github.com/ezyang	2024-12-12 00:15:44 +00:00
PyTorch MergeBot	1b3f8b7589	Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572 )" This reverts commit 209119424922b135fef39aba1f25da3b67f5879a. Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))	2024-12-11 21:47:18 +00:00
PyTorch MergeBot	dfe5669076	Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677 )" This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a. Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))	2024-12-11 21:47:17 +00:00
PyTorch MergeBot	cd50bd8477	Revert "[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 )" This reverts commit fb02b40d27737213e0547dec0e30977dfc50f2f3. Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))	2024-12-11 21:44:23 +00:00
Michael Lazos	de313f1155	[foreach_map] Initial foreach map HOP impl for inference (#142098 ) This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops. This PR utilizes PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (ie iterates over lists of operands applying the op elementwise). The higher order op is passed through the stack down to inductor where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then run all of the existing foreach tests on foreach_map. TODO before landing: * Add tests for general functions * Test warning if unsupported op will block fusion Followups: * I need to add tests for backwards (this will be a followup PR because backwards will require other work as well) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098 Approved by: https://github.com/eellison	2024-12-11 21:32:11 +00:00
Nikita Shulga	bd199bc754	[EZ] Move slow job from CU12.1 to CU12.4 (#142856 ) I thought it was migrated a while back Pull Request resolved: https://github.com/pytorch/pytorch/pull/142856 Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi	2024-12-11 21:12:35 +00:00
Tristan Rice	688f44824b	DistributedDataParallel: add init_sync option to control collectives during initialization (#142824 ) This controls whether or not we run collectives during the DDP init function. This makes it easier to use fault tolerant ProcessGroup implementations that may not be starting at the same time. torchft uses a dummy process group and a comm hook to get around these checks. With this change torchft can use the normal ProcessGroup API via the stock comm hook. https://github.com/pytorch-labs/torchft/blob/main/torchft/ddp.py#L50-L59 Test plan: ``` pytest test/distributed/test_c10d_pypg.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142824 Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/H-Huang	2024-12-11 20:28:38 +00:00
Jane Xu	fd65bd755d	[BE] replace incorrect .. note:: invocations (#142868 ) Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868 Approved by: https://github.com/albanD	2024-12-11 19:58:18 +00:00
Edward Z. Yang	0b96413dbf	Upgrade expecttest to 0.3.0 (#142869 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/142869 Approved by: https://github.com/albanD, https://github.com/malfet	2024-12-11 19:04:16 +00:00
cyy	e5f08c0cbf	[TorchGen] Remove cpp_type_registration_declarations (#142452 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142452 Approved by: https://github.com/ezyang	2024-12-11 19:01:36 +00:00
cyy	e228381846	[TorchGen] Simplify argument_type_str (#142491 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142491 Approved by: https://github.com/ezyang	2024-12-11 19:01:20 +00:00
Nikita Shulga	42d4eec5f3	Don't install lintrunner on S390 (#142876 ) Not sure if there are many users of this platform, but hopefully this will fix https://github.com/pytorch/pytorch/issues/142872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142876 Approved by: https://github.com/jeanschmidt	2024-12-11 18:54:12 +00:00
Yukio Siraichi	e647b6d590	Fix undesired specialization on slice after split. (#142372 ) Fix: #141251 This PR adds a few static guard checks when decomposing and lowering the `slice` operation, so that we avoid adding unnecessary guards. Specifically, when clamping the end values. In summary, the changes are: - `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If we don't know that, the following guard ensures that we (don't) need clamping. - `evaluate_min` inductor `sizevar` function: checks whether we can solve it statically or not, before actually creating a new guard. The latter had to be changed because `evaluate_min` (called by `ir.SliceView` constructor) would always try to create a guard based on the hints operation result. However, if both `left` and `right` hints were true, it would default to `left <= right` guard. By checking the guards statically before, we can avoid that. ```python N = 16 @torch.compile(backend="inductor", dynamic=False, fullgraph=True) def fn(x): splits = torch.ops.aten.split.Tensor(x, N) first = splits[0] return torch.ops.aten.slice.Tensor(first, 0, 0, N) x = torch.arange(N) torch._dynamo.mark_dynamic(x, 0) fn(x) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372 Approved by: https://github.com/ezyang	2024-12-11 18:52:17 +00:00
titaiwangms	0ddb33ba22	[ONNX] Avoid overwriting overlapped decomposed functions (#142831 ) Fixes #141770 The decomposed function in `torch.export.default_decompositions().items()` is overwritten by `torch._decomp.decomposition_table`. From the perspective of `torch.onnx.export()`, we should respect the table of decompositions in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831 Approved by: https://github.com/justinchuby	2024-12-11 18:47:40 +00:00
Yidi Wu	c632e29774	[hop][dynamo] support torch.SymInt inputs (#141524 ) Fixes https://github.com/pytorch/pytorch/issues/141305. ```python class M(torch.nn.Module): def forward(self, x, y, z): a = y.shape[0] b = z.shape[0] def true_fn(x): return x + a def false_fn(x): return x + b * z # When exporting with non-strict: a and b are symints, # so torch.compile need to wrap and trace symint inputs. return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,)) ``` In non-strict export, when inputs are annotated with dynamic shape, the a, and b in above example are torch.SymInt type. true_fn and false_fn will have closure that're of torch.SymInt types. The error is triggered because we didn't handle SymInt inputs in dynamo and ends up using a UserDefinedObjectVariable for it, which doesn't have a proxy. We added support by following how we handle SymBool input previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524 Approved by: https://github.com/zou3519 ghstack dependencies: #142185	2024-12-11 18:46:58 +00:00
Yidi Wu	a8fa98ccef	skip test dynamo for aot_dispatch tests on ci (#142185 ) A lot of tests in test_aotdispatch.py are not meaningful (from the user's perspective) when we run with dynamo, so we skip them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185 Approved by: https://github.com/zou3519	2024-12-11 18:46:58 +00:00
cyy	24a5a2ef25	[14/N] Fix extra warnings brought by clang-tidy-17 (#141644 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644 Approved by: https://github.com/ezyang	2024-12-11 18:40:42 +00:00
Jane Xu	be27dbf2b8	Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088 ) Getting tested with ao, but now there is a real test i added. ## What does this PR do? We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build 1 wheel that will suffice across multiple Python versions. Sounds lovely! Why don't people do that already with PyTorch? Well 2 things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers) so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work because we always link torch_python, which does not abide by py_limited_api rules. So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described. ## How do I know this PR works? I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that - torch_python doesn't show up in the ldd tree - no Py- symbols show up It may be a little confusing that our test case is actually python-free (more clean than python-agnostic) but it is sufficient (and not necessary) towards showing that this change works. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-12-11 18:22:55 +00:00
Xuehai Pan	fb02b40d27	[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-12-11 17:57:56 +00:00
cyy	82aaf64422	[3/N] Apply py39 ruff fixes (#142115 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115 Approved by: https://github.com/ezyang	2024-12-11 17:50:10 +00:00

3805 changed files with 149722 additions and 68219 deletions

									
										19

.ci/aarch64_linux/aarch64_ci_build.sh
									
												
				@ -3,22 +3,15 @@ set -eux -o pipefail

				GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

				if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="9.0"

				elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"

				fi

				SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

				source $SCRIPTPATH/aarch64_ci_setup.sh

				tagged_version() {

				  GIT_DESCRIBE="git --git-dir /pytorch/.git describe --tags --match v[0-9]*.[0-9]*.[0-9]*"

				  if ${GIT_DESCRIBE} --exact >/dev/null; then

				    ${GIT_DESCRIBE}

				  else

				    return 1

				  fi

				}

				if tagged_version >/dev/null; then

				  export OVERRIDE_PACKAGE_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"

				fi

				###############################################################################

				# Run aarch64 builder python

				###############################################################################
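The `tagged_version` helper in the hunk above pipes the tag through `sed -e 's/^v//' -e 's/-.*$//'` to build `OVERRIDE_PACKAGE_VERSION`. A minimal Python sketch of that same normalization follows; the function name `normalize_tag` and the sample tags are illustrative assumptions, not part of the repository.

```python
import re


def normalize_tag(tag: str) -> str:
    """Mimic `sed -e 's/^v//' -e 's/-.*$//'` on a single tag string.

    `normalize_tag` is a hypothetical helper used only for illustration.
    """
    # First expression: drop one leading "v", if present.
    without_prefix = re.sub(r"^v", "", tag.strip())
    # Second expression: drop everything from the first "-" to the end.
    return re.sub(r"-.*$", "", without_prefix)


if __name__ == "__main__":
    for sample in ("v2.6.0", "v2.6.0-rc3", "2.6.0"):
        print(sample, "->", normalize_tag(sample))
```

Under these assumptions a release-candidate tag and its final tag map to the same package version string.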

									
										6

.ci/aarch64_linux/aarch64_ci_setup.sh
									
												
				@ -5,16 +5,14 @@ set -eux -o pipefail

				# By creating symlinks from desired /opt/python to /usr/local/bin/

				NUMPY_VERSION=2.0.2

				PYGIT2_VERSION=1.15.1

				if [[ "$DESIRED_PYTHON"  == "3.13" ]]; then

				if [[ "$DESIRED_PYTHON"  == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then

				    NUMPY_VERSION=2.1.2

				    PYGIT2_VERSION=1.16.0

				fi

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

				source $SCRIPTPATH/../manywheel/set_desired_python.sh

				pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2 pygit2==${PYGIT2_VERSION}

				pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2

				for tool in python python3 pip pip3 ninja scons patchelf; do

				    ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;

									
										29

.ci/aarch64_linux/aarch64_wheel_ci_build.py
									
												
				@ -4,12 +4,9 @@

				import os

				import shutil

				from subprocess import check_call, check_output

				from typing import List

				from pygit2 import Repository

				def list_dir(path: str) -> List[str]:

				def list_dir(path: str) -> list[str]:

				    """'

				    Helper for getting paths for Python

				    """

				@ -58,7 +55,7 @@ def build_ArmComputeLibrary() -> None:

				        shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")

				def update_wheel(wheel_path) -> None:

				def update_wheel(wheel_path, desired_cuda) -> None:

				    """

				    Update the cuda wheel libraries

				    """

				@ -80,7 +77,6 @@ def update_wheel(wheel_path) -> None:

				        "/usr/local/cuda/lib64/libnvToolsExt.so.1",

				        "/usr/local/cuda/lib64/libnvJitLink.so.12",

				        "/usr/local/cuda/lib64/libnvrtc.so.12",

				        "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",

				        "/usr/local/cuda/lib64/libcudnn_adv.so.9",

				        "/usr/local/cuda/lib64/libcudnn_cnn.so.9",

				        "/usr/local/cuda/lib64/libcudnn_graph.so.9",

				@ -100,6 +96,14 @@ def update_wheel(wheel_path) -> None:

				            "/usr/local/lib/libnvpl_lapack_core.so.0",

				            "/usr/local/lib/libnvpl_blas_core.so.0",

				        ]

				        if "126" in desired_cuda:

				            libs_to_copy += [

				                "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",

				            ]

				        elif "128" in desired_cuda:

				            libs_to_copy += [

				                "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",

				            ]

				    else:

				        libs_to_copy += [

				            "/opt/OpenBLAS/lib/libopenblas.so.0",

				@ -171,22 +175,22 @@ if __name__ == "__main__":

				    args = parse_arguments()

				    enable_mkldnn = args.enable_mkldnn

				    enable_cuda = args.enable_cuda

				    repo = Repository("/pytorch")

				    branch = repo.head.name

				    if branch == "HEAD":

				        branch = "master"

				    branch = check_output(

				        ["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"

				    ).decode()

				    print("Building PyTorch wheel")

				    build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "

				    os.system("cd /pytorch; python setup.py clean")

				    override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")

				    desired_cuda = os.getenv("DESIRED_CUDA")

				    if override_package_version is not None:

				        version = override_package_version

				        build_vars += (

				            f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "

				        )

				    elif branch in ["nightly", "master"]:

				    elif branch in ["nightly", "main"]:

				        build_date = (

				            check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")

				            .decode()

				@ -196,7 +200,6 @@ if __name__ == "__main__":

				            check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]

				        )

				        if enable_cuda:

				            desired_cuda = os.getenv("DESIRED_CUDA")

				            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "

				        else:

				            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "

				@ -225,6 +228,6 @@ if __name__ == "__main__":

				        print("Updating Cuda Dependency")

				        filename = os.listdir("/pytorch/dist/")

				        wheel_path = f"/pytorch/dist/{filename[0]}"

				        update_wheel(wheel_path)

				        update_wheel(wheel_path, desired_cuda)

				    pytorch_wheel_name = complete_wheel("/pytorch/")

				    print(f"Build Complete. Created {pytorch_wheel_name}..")
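Two details of the diff above may be worth a closer look. First, `check_output([...]).decode()` keeps the trailing newline that `git rev-parse --abbrev-ref HEAD` prints, so a later membership test such as `branch in ["nightly", "main"]` would not match unless the value is stripped. Second, the `libnvrtc-builtins` library is now chosen from `desired_cuda`. The sketch below illustrates both points without calling git; `normalize_branch`, `nvrtc_builtins_libs`, and the `"cu128"` value format are assumptions made for this example.

```python
from typing import Optional

CUDA_LIB_DIR = "/usr/local/cuda/lib64"


def normalize_branch(raw: bytes) -> str:
    """Decode subprocess output and strip the trailing newline (hypothetical helper)."""
    return raw.decode().strip()


def nvrtc_builtins_libs(desired_cuda: Optional[str]) -> list[str]:
    """Pick the versioned nvrtc-builtins library, mirroring the substring checks in the diff."""
    tag = desired_cuda or ""  # assumption: treat an unset DESIRED_CUDA as "no extra library"
    if "126" in tag:
        return [f"{CUDA_LIB_DIR}/libnvrtc-builtins.so.12.6"]
    if "128" in tag:
        return [f"{CUDA_LIB_DIR}/libnvrtc-builtins.so.12.8"]
    return []


if __name__ == "__main__":
    raw = b"main\n"  # stand-in for what check_output() returns
    print(raw.decode() in ["nightly", "main"])          # False: newline is still attached
    print(normalize_branch(raw) in ["nightly", "main"])  # True
    print(nvrtc_builtins_libs("cu128"))
```

If the unstripped value is used as in the hunk, the nightly/main branch of the version logic would likely be skipped; adding `.strip()` after `.decode()` is one way to avoid that.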

									
										48

.ci/aarch64_linux/build_aarch64_wheel.py
									
												
				@ -12,7 +12,7 @@ import os

				import subprocess

				import sys

				import time

				from typing import Dict, List, Optional, Tuple, Union

				from typing import Optional, Union

				import boto3

				@ -24,10 +24,12 @@ os_amis = {

				    "ubuntu22_04": "ami-0c6c29c5125214c77",  # login_name: ubuntu

				    "redhat8": "ami-0698b90665a2ddcf1",  # login_name: ec2-user

				}

				ubuntu18_04_ami = os_amis["ubuntu18_04"]

				ubuntu20_04_ami = os_amis["ubuntu20_04"]

				def compute_keyfile_path(key_name: Optional[str] = None) -> Tuple[str, str]:

				def compute_keyfile_path(key_name: Optional[str] = None) -> tuple[str, str]:

				    if key_name is None:

				        key_name = os.getenv("AWS_KEY_NAME")

				        if key_name is None:

				@ -57,7 +59,7 @@ def ec2_instances_by_id(instance_id):

				def start_instance(

				    key_name, ami=ubuntu18_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50

				    key_name, ami=ubuntu20_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50

				):

				    inst = ec2.create_instances(

				        ImageId=ami,

				@ -96,7 +98,7 @@ class RemoteHost:

				        self.keyfile_path = keyfile_path

				        self.login_name = login_name

				    def _gen_ssh_prefix(self) -> List[str]:

				    def _gen_ssh_prefix(self) -> list[str]:

				        return [

				            "ssh",

				            "-o",

				@ -108,13 +110,13 @@ class RemoteHost:

				        ]

				    @staticmethod

				    def _split_cmd(args: Union[str, List[str]]) -> List[str]:

				    def _split_cmd(args: Union[str, list[str]]) -> list[str]:

				        return args.split() if isinstance(args, str) else args

				    def run_ssh_cmd(self, args: Union[str, List[str]]) -> None:

				    def run_ssh_cmd(self, args: Union[str, list[str]]) -> None:

				        subprocess.check_call(self._gen_ssh_prefix() + self._split_cmd(args))

				    def check_ssh_output(self, args: Union[str, List[str]]) -> str:

				    def check_ssh_output(self, args: Union[str, list[str]]) -> str:

				        return subprocess.check_output(

				            self._gen_ssh_prefix() + self._split_cmd(args)

				        ).decode("utf-8")

				@ -157,7 +159,7 @@ class RemoteHost:

				    def using_docker(self) -> bool:

				        return self.container_id is not None

				    def run_cmd(self, args: Union[str, List[str]]) -> None:

				    def run_cmd(self, args: Union[str, list[str]]) -> None:

				        if not self.using_docker():

				            return self.run_ssh_cmd(args)

				        assert self.container_id is not None

				@ -178,7 +180,7 @@ class RemoteHost:

				        if rc != 0:

				            raise subprocess.CalledProcessError(rc, docker_cmd)

				    def check_output(self, args: Union[str, List[str]]) -> str:

				    def check_output(self, args: Union[str, list[str]]) -> str:

				        if not self.using_docker():

				            return self.check_ssh_output(args)

				        assert self.container_id is not None

				@ -230,7 +232,7 @@ class RemoteHost:

				            )

				        self.download_file(remote_file, local_file)

				    def list_dir(self, path: str) -> List[str]:

				    def list_dir(self, path: str) -> list[str]:

				        return self.check_output(["ls", "-1", path]).split("\n")

				@ -358,7 +360,7 @@ def checkout_repo(

				    branch: str = "main",

				    url: str,

				    git_clone_flags: str,

				    mapping: Dict[str, Tuple[str, str]],

				    mapping: dict[str, tuple[str, str]],

				) -> Optional[str]:

				    for prefix in mapping:

				        if not branch.startswith(prefix):

				@ -619,9 +621,11 @@ def build_torchaudio(

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				    host.run_cmd(f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \

				    host.run_cmd(

				        f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \

				        && ./packaging/ffmpeg/build.sh \

				        && {build_vars} python3 setup.py bdist_wheel")

				        && {build_vars} python3 setup.py bdist_wheel"

				    )

				    wheel_name = host.list_dir("audio/dist")[0]

				    embed_libgomp(host, use_conda, os.path.join("audio", "dist", wheel_name))

				@ -679,7 +683,7 @@ def build_domains(

				    branch: str = "main",

				    use_conda: bool = True,

				    git_clone_flags: str = "",

				) -> Tuple[str, str, str, str]:

				) -> tuple[str, str, str, str]:

				    vision_wheel_name = build_torchvision(

				        host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags

				    )

				@ -706,7 +710,7 @@ def start_build(

				    pytorch_build_number: Optional[str] = None,

				    shallow_clone: bool = True,

				    enable_mkldnn: bool = False,

				) -> Tuple[str, str, str, str, str]:

				) -> tuple[str, str, str, str, str]:

				    git_clone_flags = " --depth 1 --shallow-submodules" if shallow_clone else ""

				    if host.using_docker() and not use_conda:

				        print("Auto-selecting conda option for docker images")

				@ -930,9 +934,9 @@ def parse_arguments():

				    parser.add_argument("--debug", action="store_true")

				    parser.add_argument("--build-only", action="store_true")

				    parser.add_argument("--test-only", type=str)

				    parser.add_argument(

				        "--os", type=str, choices=list(os_amis.keys()), default="ubuntu20_04"

				    )

				    group = parser.add_mutually_exclusive_group()

				    group.add_argument("--os", type=str, choices=list(os_amis.keys()))

				    group.add_argument("--ami", type=str)

				    parser.add_argument(

				        "--python-version",

				        type=str,

				@ -962,7 +966,13 @@ def parse_arguments():

				if __name__ == "__main__":

				    args = parse_arguments()

				    ami = os_amis[args.os]

				    ami = (

				        args.ami

				        if args.ami is not None

				        else os_amis[args.os]

				        if args.os is not None

				        else ubuntu20_04_ami

				    )

				    keyfile_path, key_name = compute_keyfile_path(args.key_name)

				    if args.list_instances:
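The last two hunks above make `--os` and `--ami` mutually exclusive and resolve the AMI with a chained conditional expression. The following sketch restates that resolution order as a small function so the precedence is easier to read: an explicit `--ami` first, then `--os`, then the Ubuntu 20.04 default. The AMI IDs, `build_parser`, and `resolve_ami` are placeholders invented for this example.

```python
import argparse
from typing import Optional

# Placeholder IDs for illustration only; the real table lives in os_amis.
OS_AMIS = {
    "ubuntu20_04": "ami-placeholder-2004",
    "ubuntu22_04": "ami-placeholder-2204",
}
DEFAULT_AMI = OS_AMIS["ubuntu20_04"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # argparse rejects the command line if both options are supplied.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--os", type=str, choices=list(OS_AMIS.keys()))
    group.add_argument("--ami", type=str)
    return parser


def resolve_ami(ami: Optional[str], os_name: Optional[str]) -> str:
    """Explicit AMI wins, then the OS lookup, then the default."""
    if ami is not None:
        return ami
    if os_name is not None:
        return OS_AMIS[os_name]
    return DEFAULT_AMI


if __name__ == "__main__":
    args = build_parser().parse_args(["--os", "ubuntu22_04"])
    print(resolve_ami(args.ami, args.os))
```

Because neither option carries a default any more, the fallback to the 20.04 image has to be spelled out at resolution time, which is what the third branch does.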

5

.ci/docker/aotriton_version.txt


 @ -1,5 +0,0 @@
 .8b
 manylinux_2_28
 rocm6.2
 f8cbcac8a92775291bb1ba8f514d4beb350baf4
 e938def5d32869fe2e00aec0300f354c9f157867bebdf2e104d732b94cb238d8

									
										124

.ci/docker/build.sh
									
												
				@ -86,6 +86,10 @@ CMAKE_VERSION=3.18.5

				_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb

				_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				if [[ "$image" == *rocm* ]]; then

				  _UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6

				  _UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d

				fi

				# It's annoying to rename jobs every time you want to rewrite a

				# configuration, so we hardcode everything here rather than do it

				@ -105,20 +109,6 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				@ -134,36 +124,6 @@ case "$image" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				@ -208,48 +168,6 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-py3-clang10-onnx)

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				@ -292,18 +210,7 @@ case "$image" in

				    ;;

				  pytorch-linux-focal-rocm-n-1-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.1

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				@ -311,6 +218,25 @@ case "$image" in

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.3

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-xpu-2024.0-py3)

				    ANACONDA_PYTHON_VERSION=3.9

				@ -525,7 +451,7 @@ docker build \

				       --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \

				       --build-arg "KATEX=${KATEX:-}" \

				       --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \

				       --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a}" \

				       --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \

				       --build-arg "IMAGE_NAME=${IMAGE_NAME}" \

				       --build-arg "UCX_COMMIT=${UCX_COMMIT}" \

				       --build-arg "UCC_COMMIT=${UCC_COMMIT}" \

									
										7

.ci/docker/centos-rocm/Dockerfile
									
												
				@ -113,13 +113,6 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				# Install AOTriton (Early fail)

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

2

.ci/docker/ci_commit_pins/executorch.txt


 @ -1 +1 @@
 f638937d64e3396793956d75ee3e14802022745
 e4d6b6380d575e48e37e9d987fded4ec588e7bc

1

.ci/docker/ci_commit_pins/nccl-cu11.txt Normal file


 @ -0,0 +1 @@
 v2.21.5-1

1

.ci/docker/ci_commit_pins/nccl-cu12.txt Normal file


 @ -0,0 +1 @@
 v2.25.1-1

2

.ci/docker/ci_commit_pins/timm.txt


 @ -1 +1 @@
 ac3470188b914c5d7a5058a7e28b9eb685a62427
 d535d7a2d4b435b1b5c1177fd8f04a12b942b9a

2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 c6c7c6284582b3f41c71c150e11b517acf074a
 b3bb1f8da0ded6ccd572dd1358ef45af5a1befe

									
										2

.ci/docker/common/install_acl.sh
									
												
				@ -1,7 +1,7 @@

				set -euo pipefail

				readonly version=v24.04

				readonly src_host=https://review.mlplatform.org/ml

				readonly src_host=https://github.com/ARM-software

				readonly src_repo=ComputeLibrary

				# Clone ACL

									
										23

.ci/docker/common/install_aotriton.sh
									
											
				@ -1,23 +0,0 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				TARBALL='aotriton.tar.gz'

				# This read command alwasy returns with exit code 1

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

				curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"

				ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)

				if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then

				  echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"

				  echo " which does not match the expected value ${SHA256}."

				  exit

				fi

				tar xf "${TARBALL}" && rm -rf "${TARBALL}"
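The removed script above compared the tarball's SHA256 with the pinned value and then called a bare `exit`. In a shell, `exit` without an argument returns the status of the preceding command, here an `echo`, so a mismatch would most likely have ended the script with status 0. The sketch below shows the same check in Python with an explicit failure; `sha256_of` and `verify_sha256` are hypothetical names, and the temporary file merely stands in for the downloaded tarball.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: str, expected: str) -> None:
    """Raise instead of exiting silently when the digest does not match."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(
            f"SHA256 of {path} is {actual}, which does not match the expected value {expected}"
        )


if __name__ == "__main__":
    # Stand-in for the downloaded tarball.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example payload")
    try:
        verify_sha256(tmp.name, hashlib.sha256(b"example payload").hexdigest())
        print("checksum ok")
    finally:
        os.remove(tmp.name)
```

Raising an exception (or, in shell, `exit 1`) makes the mismatch visible to the caller, which matters for an "early fail" build step.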

									
										4

.ci/docker/common/install_base.sh
									
												
				@ -32,8 +32,12 @@ install_ubuntu() {

				  # HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes

				  # See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729

				  # TODO: Eliminate this hack, we should not relay on apt-get installation

				  # See https://github.com/pytorch/pytorch/issues/144768

				  if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then

				    maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"

				  elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then

				    maybe_libnccl_dev="libnccl2=2.25.1-1+cuda12.4 libnccl-dev=2.25.1-1+cuda12.4 --allow-downgrades --allow-change-held-packages"

				  else

				    maybe_libnccl_dev=""

				  fi

									
										8

.ci/docker/common/install_cache.sh
									
												
				@ -9,7 +9,7 @@ install_ubuntu() {

				  # Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``

				  apt-get install -y cargo

				  echo "Checking out sccache repo"

				  git clone https://github.com/mozilla/sccache -b v0.8.2

				  git clone https://github.com/mozilla/sccache -b v0.9.1

				  cd sccache

				  echo "Building sccache"

				  cargo build --release

				@ -36,11 +36,7 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment

				export PATH="/opt/cache/bin:$PATH"

				# Setup compiler cache

				if [ -n "$ROCM_VERSION" ]; then

				  curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache

				else

				  install_ubuntu

				fi

				install_ubuntu

				chmod a+x /opt/cache/bin/sccache

				function write_sccache_stub() {

									
										2

.ci/docker/common/install_cpython.sh
									
												
				@ -70,7 +70,7 @@ function do_cpython_build {

				    # install setuptools since python 3.12 is required to use distutils

				    ${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2

				    local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")

				    ln -s ${prefix} /opt/python/${abi_tag}

				    ln -sf ${prefix} /opt/python/${abi_tag}

				}

				function build_cpython {

									
										116

.ci/docker/common/install_cuda.sh
									
												
				@ -2,7 +2,7 @@

				set -ex

				NCCL_VERSION=v2.21.5-1

				NCCL_VERSION=v2.25.1-1

				CUDNN_VERSION=9.5.1.17

				function install_cusparselt_040 {

				@ -16,17 +16,6 @@ function install_cusparselt_040 {

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_052 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_062 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				@ -51,6 +40,7 @@ function install_cusparselt_063 {

				function install_118 {

				    CUDNN_VERSION=9.1.0.70

				    NCCL_VERSION=v2.21.5-1

				    echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"

				    rm -rf /usr/local/cuda-11.8 /usr/local/cuda

				    # install CUDA 11.8.0 in the same container

				@ -83,39 +73,6 @@ function install_118 {

				    ldconfig

				}

				function install_121 {

				    echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				    rm -rf /usr/local/cuda-12.1 /usr/local/cuda

				    # install CUDA 12.1.0 in the same container

				    wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run

				    chmod +x cuda_12.1.1_530.30.02_linux.run

				    ./cuda_12.1.1_530.30.02_linux.run --toolkit --silent

				    rm -f cuda_12.1.1_530.30.02_linux.run

				    rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn && cd tmp_cudnn

				    wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				    cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf tmp_cudnn

				    # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				    # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				    git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				    cd nccl && make -j src.build

				    cp -a build/include/* /usr/local/cuda/include/

				    cp -a build/lib/* /usr/local/cuda/lib64/

				    cd ..

				    rm -rf nccl

				    install_cusparselt_052

				    ldconfig

				}

				function install_124 {

				  CUDNN_VERSION=9.1.0.70

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"

				@ -214,37 +171,6 @@ function prune_118 {

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/

				}

				function prune_121 {

				  echo "Pruning CUDA 12.1"

				  #####################################################################################

				  # CUDA 12.1 prune static libs

				  #####################################################################################

				    export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"

				    export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"

				    export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				    if [[ -n "$OVERRIDE_GENCODE" ]]; then

				        export GENCODE=$OVERRIDE_GENCODE

				    fi

				    # all CUDA libs except CuDNN and CuBLAS

				    ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				    # prune CuDNN and CuBLAS

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				    $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				    #####################################################################################

				    # CUDA 12.1 prune visual tools

				    #####################################################################################

				    export CUDA_BASE="/usr/local/cuda-12.1/"

				    rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				@ -313,18 +239,52 @@ function prune_126 {

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				function install_128 {

				  CUDNN_VERSION=9.7.1.26

				  echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.8 /usr/local/cuda

				  # install CUDA 12.8.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run

				  chmod +x cuda_12.8.0_570.86.10_linux.run

				  ./cuda_12.8.0_570.86.10_linux.run --toolkit --silent

				  rm -f cuda_12.8.0_570.86.10_linux.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_063

				  ldconfig

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    11.8) install_118; prune_118

				        ;;

				    12.1) install_121; prune_121

				        ;;

				    12.4) install_124; prune_124

				        ;;

				    12.6) install_126; prune_126

				        ;;

				    12.8) install_128;

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac

									
										38

.ci/docker/common/install_cuda_aarch64.sh
									
												
				@ -57,7 +57,7 @@ function install_124 {

				  cd ..

				  rm -rf nccl

				  install_cusparselt_062

				  install_cusparselt_063

				  ldconfig

				}

				@ -160,6 +160,40 @@ function prune_126 {

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				function install_128 {

				  CUDNN_VERSION=9.7.1.26

				  echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"

				  rm -rf /usr/local/cuda-12.8 /usr/local/cuda

				  # install CUDA 12.8.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run

				  chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run

				  ./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.8.0_570.86.10_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  mkdir tmp_cudnn && cd tmp_cudnn

				  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz

				  cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/

				  cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf tmp_cudnn

				  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses

				  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build

				  git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git

				  cd nccl && make -j src.build

				  cp -a build/include/* /usr/local/cuda/include/

				  cp -a build/lib/* /usr/local/cuda/lib64/

				  cd ..

				  rm -rf nccl

				  install_cusparselt_063

				  ldconfig

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				@ -168,6 +202,8 @@ do

				        ;;

				    12.6) install_126; prune_126

				        ;;

				    12.8) install_128;

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac

									
										4

.ci/docker/common/install_cudnn.sh
									
												
				@ -4,7 +4,9 @@ if [[ -n "${CUDNN_VERSION}" ]]; then

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn

				    pushd tmp_cudnn

				    if [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then

				    if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.7.1.26_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"

									
										20

.ci/docker/common/install_cusparselt.sh
									
												
				@ -5,7 +5,15 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				@ -13,17 +21,11 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then

				    CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz

				else

				    echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"

				fi

				tar xf ${CUSPARSELT_NAME}.tar.xz
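Review note on the version selection above: the `if`/`elif` chain keys the cuSPARSELt archive on the first four characters of `CUDA_VERSION`, and its final `else` branch only prints a message, so an unrecognized version appears to reach `tar xf` with an empty `CUSPARSELT_NAME`. The sketch below isolates the mapping in a function that fails explicitly instead. It assumes the post-change branches are the `12.5` to `12.8`, `12.4`, and `11.8` cases visible in the hunk, and it swaps the substring-plus-regex test for `case` glob patterns. The name `cusparselt_version_for` is hypothetical and not part of the script.

```shell
# Hypothetical helper (not in install_cusparselt.sh): map a CUDA version
# string to the cuSPARSELt release selected in the hunk above.
cusparselt_version_for() {
    case "$1" in
        12.[5-8]|12.[5-8].*) echo "0.6.3.2" ;;
        12.4|12.4.*)         echo "0.6.2.3" ;;
        11.8|11.8.*)         echo "0.4.0.7" ;;
        *)
            # Fail here rather than continuing with an empty archive name.
            echo "no cuSPARSELt version known for CUDA $1" >&2
            return 1
            ;;
    esac
}
```

A caller could then write `version=$(cusparselt_version_for "$CUDA_VERSION") || exit 1` before building the archive name.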

									
										7

.ci/docker/common/install_executorch.sh
									
												
				@ -37,7 +37,12 @@ install_conda_dependencies() {

				install_pip_dependencies() {

				  pushd executorch

				  as_jenkins bash install_requirements.sh --pybind xnnpack

				  as_jenkins bash install_executorch.sh

				  # A workaround, ExecuTorch has moved to numpy 2.0 which is not compatible with the current

				  # numba and scipy version used in PyTorch CI

				  conda_run pip uninstall -y numba scipy

				  popd

				}

									
										6

.ci/docker/common/install_onnx.sh
									
												
				@ -31,15 +31,15 @@ pip_install \

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.18.1

				pip_install onnx==1.16.2

				pip_install onnxscript==0.1.0.dev20241124 --no-deps

				pip_install onnx==1.17.0

				pip_install onnxscript==0.1.0 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

				IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"

				as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"

				as_jenkins echo 'import transformers; transformers.GPTJForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gptj");' > "${IMPORT_SCRIPT_FILENAME}"

				# Need a PyTorch version for transformers to work

				pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

									
										16

.ci/docker/common/install_rocm.sh
									
												
				@ -62,6 +62,22 @@ install_ubuntu() {

				        sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"

				    done

				    # ROCm 6.3 had a regression where initializing static code objects had significant overhead

				    if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then

				        # clr build needs CppHeaderParser but can only find it using conda's python

				        /opt/conda/bin/python -m pip install CppHeaderParser

				        git clone https://github.com/ROCm/HIP -b rocm-6.3.x

				        HIP_COMMON_DIR=$(readlink -f HIP)

				        git clone https://github.com/jeffdaily/clr -b release/rocm-rel-6.3-statco-hotfix

				        mkdir -p clr/build

				        pushd clr/build

				        cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR

				        make -j

				        cp hipamd/lib/libamdhip64.so.6.3.* /opt/rocm/lib/libamdhip64.so.6.3.*

				        popd

				        rm -rf HIP clr

				    fi

				    # Cleanup

				    apt-get autoclean && apt-get clean

				    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
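Review note on the ROCm 6.3 check above: the `ver` helper is defined outside this hunk, so its encoding is not visible here. Depending on how it handles a patch component, the `-eq` comparison may match only a `ROCM_VERSION` that is exactly `6.3`. As a sketch under that caveat, the snippet below uses the `major*10000 + minor*100 + patch` encoding that the `ROCM_INT` thresholds (`60100`, `60300`) in `.ci/manywheel/build_rocm.sh` suggest, and tests a range so that any 6.3.x release matches. `version_to_int` and `is_rocm_6_3` are hypothetical names.

```shell
# Hypothetical helper: encode "major.minor[.patch]" as a single integer.
# Assumes plain decimal components without leading zeros.
version_to_int() {
    local IFS=.
    # Split the argument on dots into $1 $2 $3; missing parts default to 0.
    set -- $1
    echo "$(( ${1:-0} * 10000 + ${2:-0} * 100 + ${3:-0} ))"
}

# Hypothetical helper: succeed for any 6.3.x version, with or without a patch.
is_rocm_6_3() {
    local v
    v=$(version_to_int "$1")
    [ "$v" -ge 60300 ] && [ "$v" -lt 60400 ]
}
```

With these, `if is_rocm_6_3 "$ROCM_VERSION"; then ...; fi` would apply the hotfix build to `6.3`, `6.3.1`, and so on, but not to `6.2.4` or `6.4`.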

									
										26

.ci/docker/common/install_ucc.sh
									
												
				@ -8,6 +8,12 @@ else

				  with_cuda=no

				fi

				if [[ -d "/opt/rocm" ]]; then

				  with_rocm=/opt/rocm

				else

				  with_rocm=no

				fi

				function install_ucx() {

				  set -ex

				  git clone --recursive https://github.com/openucx/ucx.git

				@ -19,6 +25,7 @@ function install_ucx() {

				  ./configure --prefix=$UCX_HOME      \

				      --enable-mt                     \

				      --with-cuda=$with_cuda          \

				      --with-rocm=$with_rocm          \

				      --enable-profiling              \

				      --enable-stats

				  time make -j

				@ -36,12 +43,29 @@ function install_ucc() {

				  git submodule update --init --recursive

				  ./autogen.sh

				  # We only run distributed tests on Tesla M60 and A10G

				  NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"

				  if [[ -n "$ROCM_VERSION" ]]; then

				    if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

				      amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`

				    else

				      amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`

				    fi

				    for arch in $amdgpu_targets; do

				      HIP_OFFLOAD="$HIP_OFFLOAD --offload-arch=$arch"

				    done

				  else

				    HIP_OFFLOAD="all-arch-no-native"

				  fi

				  ./configure --prefix=$UCC_HOME          \

				    --with-ucx=$UCX_HOME                  \

				    --with-cuda=$with_cuda                \

				    --with-nvcc-gencode="${NVCC_GENCODE}"

				    --with-nvcc-gencode="${NVCC_GENCODE}" \

				    --with-rocm=$with_rocm                \

				    --with-rocm-arch="${HIP_OFFLOAD}"

				  time make -j

				  sudo make install
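Review note on the `HIP_OFFLOAD` loop above: in the lines shown, `HIP_OFFLOAD` is appended to without being initialized, so it would inherit any value already in the environment and always starts with a leading space. A minimal sketch of the same flag construction as a pure function follows; it assumes targets arrive as a semicolon-separated list in the style of `PYTORCH_ROCM_ARCH`, and `hip_offload_flags` is a hypothetical name.

```shell
# Hypothetical helper: turn "gfx90a;gfx942" into
# "--offload-arch=gfx90a --offload-arch=gfx942".
hip_offload_flags() {
    local flags="" arch
    # Replace semicolons with spaces and rely on word splitting per target.
    for arch in $(echo "$1" | tr ';' ' '); do
        # Prepend a separating space only when flags is already non-empty.
        flags="${flags:+$flags }--offload-arch=$arch"
    done
    echo "$flags"
}
```

Usage would look like `HIP_OFFLOAD=$(hip_offload_flags "$PYTORCH_ROCM_ARCH")`, which starts from a clean value on every run.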

									
										17

.ci/docker/libtorch/Dockerfile
									
												
				@ -56,11 +56,6 @@ RUN bash ./install_cuda.sh 11.8

				RUN bash ./install_magma.sh 11.8

				RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda

				FROM cuda as cuda12.1

				RUN bash ./install_cuda.sh 12.1

				RUN bash ./install_magma.sh 12.1

				RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda

				FROM cuda as cuda12.4

				RUN bash ./install_cuda.sh 12.4

				RUN bash ./install_magma.sh 12.4

				@ -71,6 +66,11 @@ RUN bash ./install_cuda.sh 12.6

				RUN bash ./install_magma.sh 12.6

				RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda

				FROM cuda as cuda12.8

				RUN bash ./install_cuda.sh 12.8

				RUN bash ./install_magma.sh 12.8

				RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda

				FROM cpu as rocm

				ARG PYTORCH_ROCM_ARCH

				ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				@ -92,13 +92,6 @@ RUN apt-get update -y && \

				RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				# Install AOTriton

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				FROM ${BASE_TARGET} as final

				COPY --from=openssl            /opt/openssl           /opt/openssl

				# Install patchelf

									
										7

.ci/docker/manywheel/Dockerfile
									
												
				@ -198,10 +198,3 @@ ADD ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

				# Install AOTriton

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

22

.ci/docker/requirements-ci.txt


 @ -30,10 +30,10 @@ dill==0.3.7
 #Pinned versions: 0.3.7
 #test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
 expecttest==0.2.1
 expecttest==0.3.0
 #Description: method for writing tests where test framework auto populates
 # the expected output based on previous runs
 #Pinned versions: 0.2.1
 #Pinned versions: 0.3.0
 #test that import:
 fbscribelogger==0.1.7
 @ -280,9 +280,9 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
 #test that import:
 #lintrunner is supported on aarch64-linux only from 0.12.4 version
 lintrunner==0.12.5
 lintrunner==0.12.7
 #Description: all about linters!
 #Pinned versions: 0.12.5
 #Pinned versions: 0.12.7
 #test that import:
 redis>=4.0.0
 @ -294,7 +294,7 @@ ghstack==0.8.0
 #Pinned versions: 0.8.0
 #test that import:
 jinja2==3.1.4
 jinja2==3.1.5
 #Description: jinja2 template engine
 #Pinned versions: 3.1.4
 #test that import:
 @ -304,7 +304,7 @@ pytest-cpp==2.3.0
 #Pinned versions: 2.3.0
 #test that import:
 z3-solver==4.12.2.0
 z3-solver==4.12.6.0
 #Description: The Z3 Theorem Prover Project
 #Pinned versions:
 #test that import:
 @ -329,7 +329,7 @@ lxml==5.3.0
 PyGithub==2.3.0
 sympy==1.13.1 ; python_version >= "3.9"
 sympy==1.13.3
 #Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
 #Pinned versions:
 #test that import:
 @ -339,7 +339,7 @@ onnx==1.17.0
 #Pinned versions:
 #test that import:
 onnxscript==0.1.0.dev20240817
 onnxscript==0.1.0
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 @ -362,6 +362,7 @@ pwlf==2.2.1 ; python_version >= "3.8"
 # To build PyTorch itself
 astunparse
 PyYAML
 pyzstd
 setuptools
 ninja==1.11.1 ; platform_machine == "aarch64"
 @ -371,3 +372,8 @@ pulp==2.9.0 ; python_version >= "3.8"
 #Description: required for testing ilp formulation under torch/distributed/_tools
 #Pinned versions: 2.9.0
 #test that import: test_sac_ilp.py
 dataclasses_json==0.6.7
 #Description: required for data pipeline and scripts under tools/stats
 #Pinned versions: 0.6.7
 #test that import:

									
										55

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -14,21 +14,20 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

				COPY ./common/install_base.sh install_base.sh

				RUN bash ./install_base.sh && rm install_base.sh

				# Install clang

				ARG LLVMDEV

				ARG CLANG_VERSION

				COPY ./common/install_clang.sh install_clang.sh

				RUN bash ./install_clang.sh && rm install_clang.sh

				# Install user

				COPY ./common/install_user.sh install_user.sh

				RUN bash ./install_user.sh && rm install_user.sh

				# Install katex

				ARG KATEX

				COPY ./common/install_docs_reqs.sh install_docs_reqs.sh

				RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh

				# Install conda and other packages (e.g., numpy, pytest)

				ARG ANACONDA_PYTHON_VERSION

				ARG CONDA_CMAKE

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH

				ARG CONDA_CMAKE

				COPY requirements-ci.txt /opt/conda/requirements-ci.txt

				COPY ./common/install_conda.sh install_conda.sh

				COPY ./common/common_utils.sh common_utils.sh

				@ -39,6 +38,11 @@ ARG GCC_VERSION

				COPY ./common/install_gcc.sh install_gcc.sh

				RUN bash ./install_gcc.sh && rm install_gcc.sh

				# Install clang

				ARG CLANG_VERSION

				COPY ./common/install_clang.sh install_clang.sh

				RUN bash ./install_clang.sh && rm install_clang.sh

				# (optional) Install protobuf for ONNX

				ARG PROTOBUF

				COPY ./common/install_protobuf.sh install_protobuf.sh

				@ -85,6 +89,32 @@ COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				# (optional) Install UCC

				ARG UCX_COMMIT

				ARG UCC_COMMIT

				ENV UCX_COMMIT $UCX_COMMIT

				ENV UCC_COMMIT $UCC_COMMIT

				ENV UCX_HOME /usr

				ENV UCC_HOME /usr

				ADD ./common/install_ucc.sh install_ucc.sh

				RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi

				RUN rm install_ucc.sh

				COPY ./common/install_openssl.sh install_openssl.sh

				ENV OPENSSL_ROOT_DIR /opt/openssl

				RUN bash ./install_openssl.sh

				ENV OPENSSL_DIR /opt/openssl

				ARG INDUCTOR_BENCHMARKS

				ARG ANACONDA_PYTHON_VERSION

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/timm.txt timm.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

				@ -107,18 +137,17 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				# Install AOTriton

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

				RUN bash ./install_cache.sh && rm install_cache.sh

				# Install Open MPI for ROCm

				COPY ./common/install_openmpi.sh install_openmpi.sh

				RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi

				RUN rm install_openmpi.sh

				# Include BUILD_ENVIRONMENT environment variable in image

				ARG BUILD_ENVIRONMENT

				ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

									
										13

.ci/magma/Makefile
									
												
				@ -16,9 +16,9 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \

					magma/build_magma.sh

				.PHONY: all

				all: magma-cuda128

				all: magma-cuda126

				all: magma-cuda124

				all: magma-cuda121

				all: magma-cuda118

				.PHONY:

				@ -26,6 +26,12 @@ clean:

					$(RM) -r magma-*

					$(RM) -r output

				.PHONY: magma-cuda128

				magma-cuda128: DESIRED_CUDA := 12.8

				magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120

				magma-cuda128:

					$(DOCKER_RUN)

				.PHONY: magma-cuda126

				magma-cuda126: DESIRED_CUDA := 12.6

				magma-cuda126:

				@ -36,11 +42,6 @@ magma-cuda124: DESIRED_CUDA := 12.4

				magma-cuda124:

					$(DOCKER_RUN)

				.PHONY: magma-cuda121

				magma-cuda121: DESIRED_CUDA := 12.1

				magma-cuda121:

					$(DOCKER_RUN)

				.PHONY: magma-cuda118

				magma-cuda118: DESIRED_CUDA := 11.8

				magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37

									
										54

.ci/manywheel/build_cuda.sh
									
												
				@ -14,6 +14,7 @@ export USE_CUDA_STATIC_LINK=1

				export INSTALL_TEST=0 # don't install test binaries into site-packages

				export USE_CUPTI_SO=0

				export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build

				export USE_CUFILE=${USE_CUFILE:-1}

				# Keep an array of cmake variables to add to

				if [[ -z "$CMAKE_ARGS" ]]; then

				@ -43,13 +44,6 @@ if [[ -n "$DESIRED_CUDA" ]]; then

				        fi

				    fi

				    echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"

				    # There really has to be a better way to do this - eli

				    # Possibly limiting builds to specific cuda versions be delimiting images would be a choice

				    if [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				        echo "Switching to CUDA version ${DESIRED_CUDA}"

				        /builder/conda/switch_cuda_version.sh "${DESIRED_CUDA}"

				    fi

				else

				    CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")

				    echo "CUDA $CUDA_VERSION Detected"

				@ -59,23 +53,15 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

				TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"

				case ${CUDA_VERSION} in

				    12.8)

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.6)

				        if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then

				            TORCH_CUDA_ARCH_LIST="9.0"

				        else

				            TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"

				        fi

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.4)

				        if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then

				            TORCH_CUDA_ARCH_LIST="9.0"

				        else

				            TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"

				        fi

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;

				    12.1)

				        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"

				        EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")

				        ;;
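Review note on the `TORCH_CUDA_ARCH_LIST` hunk above: because the +/- markers are missing in this view, the following is only one reading of the post-change side, with the `cuda-aarch64` special cases and the `12.1` branch removed. It models just the three branches that appear to remain in the hunk; other versions handled elsewhere in the script are not covered. Note that the `12.8` comment mentions dropping 5.0 and 6.0, while the expression as shown still prepends the full base list; the sketch mirrors the expression, not the comment. `torch_cuda_arch_list_for` is a hypothetical name.

```shell
# Hypothetical helper: reproduce the arch-list strings built in the hunk above.
torch_cuda_arch_list_for() {
    # Base list taken from the context line preceding the case statement.
    local base="5.0;6.0;7.0;7.5;8.0;8.6"
    case "$1" in
        12.8) echo "${base};9.0;10.0;12.0+PTX" ;;
        12.6) echo "${base};9.0+PTX" ;;
        12.4) echo "${base};9.0" ;;
        *)
            # Versions outside the visible hunk are not modeled in this sketch.
            echo "CUDA version not modeled here: $1" >&2
            return 1
            ;;
    esac
}
```

Printing `torch_cuda_arch_list_for 12.8` makes it easy to confirm whether the resulting list matches the intent stated in the comment.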

				@ -133,7 +119,16 @@ if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then

				        )

				fi

				if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are

				# not available in PYPI

				if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then

				    export USE_CUFILE=0

				fi

				# CUDA_VERSION 12.4, 12.6, 12.8

				if [[ $CUDA_VERSION == 12* ]]; then

				    export USE_STATIC_CUDNN=0

				    # Try parallelizing nvcc as well

				    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				@ -174,6 +169,16 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				            "libnvrtc.so.12"

				            "libnvrtc-builtins.so"

				        )

				        if [[ $USE_CUFILE == 1 ]]; then

				            DEPS_LIST+=(

				                "/usr/local/cuda/lib64/libcufile.so.0"

				                "/usr/local/cuda/lib64/libcufile_rdma.so.1"

				            )

				            DEPS_SONAME+=(

				                "libcufile.so.0"

				                "libcufile_rdma.so.1"

				            )

				        fi

				    else

				        echo "Using nvidia libs from pypi."

				        CUDA_RPATHS=(

				@ -190,6 +195,11 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/nvtx/lib'

				        )

				        if [[ $USE_CUFILE == 1 ]]; then

				            CUDA_RPATHS+=(

				                '$ORIGIN/../../nvidia/cufile/lib'

				            )

				        fi

				        CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")

				        export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'

				        export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'

				@ -275,7 +285,7 @@ else

				    exit 1

				fi

				# builder/test.sh requires DESIRED_CUDA to know what tests to exclude

				# run_tests.sh requires DESIRED_CUDA to know what tests to exclude

				export DESIRED_CUDA="$cuda_version_nodot"

				# Switch `/usr/local/cuda` to the desired CUDA version
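
Review note: the hunks above append the cufile entry to `CUDA_RPATHS` only when `USE_CUFILE` is 1, then join the array with colons via `IFS`. A minimal sketch of that pattern in isolation, assuming a shortened two-entry list and a local array name `RPATHS` that is not in the script:

```shell
#!/usr/bin/env bash
# Illustrative only: conditionally extend an array, then join it with ':'.
# RPATHS is a hypothetical stand-in for the CUDA_RPATHS array in the script.
RPATHS=(
    '$ORIGIN/../../nvidia/nccl/lib'
    '$ORIGIN/../../nvidia/nvtx/lib'
)

USE_CUFILE=1
if [[ $USE_CUFILE == 1 ]]; then
    RPATHS+=(
        '$ORIGIN/../../nvidia/cufile/lib'
    )
fi

# Setting IFS inside the command substitution keeps the change local to the
# subshell; "${RPATHS[*]}" joins elements with the first character of IFS.
joined=$(IFS=: ; echo "${RPATHS[*]}")
echo "$joined"
```

The single quotes keep `$ORIGIN` literal so that the dynamic linker, not the shell, expands it.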

									
										27

.ci/manywheel/build_rocm.sh
									
												
				@ -118,7 +118,7 @@ if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then

				    fi

				    LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"

				    LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"

				    if [[ $ROCM_INT -ge 60100 ]]; then

				    if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then

				        # Below libs are direct dependencies of libhipsolver

				        LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"

				        if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then

				@ -151,7 +151,7 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then

				    fi

				    LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"

				    LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"

				    if [[ $ROCM_INT -ge 60100 ]]; then

				    if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then

				        # Below libs are direct dependencies of libhipsolver

				        LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"

				        # Below libs are direct dependencies of libcholmod

				@ -186,15 +186,6 @@ do

				    OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array

				done

				# FIXME: Temporary until https://github.com/pytorch/pytorch/pull/137443 lands

				# Install AOTriton

				if [ -e ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt ]; then

				    cp -a ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt aotriton_version.txt

				    bash ${PYTORCH_ROOT}/.ci/docker/common/install_aotriton.sh ${ROCM_HOME} && rm aotriton_version.txt

				    export AOTRITON_INSTALLED_PREFIX=${ROCM_HOME}/aotriton

				    ROCM_SO_FILES+=("libaotriton_v2.so")

				fi

				# rocBLAS library files

				ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library

				ROCBLAS_LIB_DST=lib/rocblas/library

				@ -266,20 +257,6 @@ RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))

				DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})

				DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})

				# PyTorch 2.6+ (AOTriton 0.8b+)

				# AKS = "AOTriton Kernel Storage", a file format to store GPU kernels compactly

				if (( $(echo "${PYTORCH_VERSION} 2.6" | awk '{print ($1 >= $2)}') )); then

				    LIBAOTRITON_DIR=$(find "$ROCM_HOME/lib/" -name "libaotriton_v2.so" -printf '%h\n')

				    if [[ -z ${LIBAOTRITON_DIR} ]]; then

				        LIBAOTRITON_DIR=$(find "$ROCM_HOME/" -name "libaotriton_v2.so" -printf '%h\n')

				    fi

				    AKS_FILES=($(find "${LIBAOTRITON_DIR}/aotriton.images" -type f -name '*.aks?' -printf '%P\n'))

				    AKS_SRC="${LIBAOTRITON_DIR}/aotriton.images"

				    AKS_DST="lib/aotriton.images"

				    DEPS_AUX_SRCLIST+=(${AKS_FILES[@]/#/${AKS_SRC}/})

				    DEPS_AUX_DSTLIST+=(${AKS_FILES[@]/#/${AKS_DST}/})

				fi

				echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"

				SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

									
										6

.ci/pytorch/build.sh
									
												
				@ -228,7 +228,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then

				  export CMAKE_BUILD_TYPE=RelWithAssert

				fi

				# Do not change workspace permissions for ROCm CI jobs

				# Do not change workspace permissions for ROCm and s390x CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				@ -247,7 +247,7 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v

				fi

				if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then

				  set -e

				  set -e -o pipefail

				  get_bazel

				@ -278,7 +278,7 @@ else

				          "$BUILD_ENVIRONMENT" != *xla* ]]; then

				      if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then

				        # Install numpy-2.0.2 for builds which are backward compatible with 1.X

				        python -mpip install --pre numpy==2.0.2

				        python -mpip install numpy==2.0.2

				      fi

				      WERROR=1 python setup.py clean

									
										2

.ci/pytorch/check_binary.sh
									
												
				@ -387,7 +387,7 @@ fi

				###############################################################################

				# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries

				###############################################################################

				if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then

				if [[ "$(uname)" == 'Linux' &&  "$PACKAGE_TYPE" == 'manywheel' ]]; then

				  pushd /tmp

				  python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"

				  popd

									
										2

.ci/pytorch/common.sh
									
												
				@ -3,7 +3,7 @@

				# Common setup for all Jenkins scripts

				# shellcheck source=./common_utils.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				set -ex

				set -ex -o pipefail

				# Required environment variables:

				#   $BUILD_ENVIRONMENT (should be set by your Docker image)
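
Review note: the switch from `set -ex` to `set -ex -o pipefail` recurs in several scripts in this diff. A minimal sketch of what the option changes, using `false | cat` as a stand-in for a failing pipeline (this pipeline is not from the PyTorch scripts):

```shell
#!/usr/bin/env bash
# Illustrative only: compare a pipeline's exit status with and without pipefail.

# Without pipefail, a pipeline's status is that of its last command (cat).
set +o pipefail
without_pipefail=0
false | cat || without_pipefail=$?

# With pipefail, the pipeline reports the failure of the earlier command.
set -o pipefail
with_pipefail=0
false | cat || with_pipefail=$?
set +o pipefail

echo "without pipefail: $without_pipefail"
echo "with pipefail:    $with_pipefail"
```

Combined with `-e`, this means a failure in the first stage of a pipeline now aborts the script instead of being masked by a successful last stage.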

									
										43

.ci/pytorch/common_utils.sh
									
												
				@ -160,7 +160,7 @@ function install_torchvision() {

				}

				function install_tlparse() {

				  pip_install --user "tlparse==0.3.25"

				  pip_install --user "tlparse==0.3.30"

				  PATH="$(python -m site --user-base)/bin:$PATH"

				}

				@ -169,24 +169,34 @@ function install_torchrec_and_fbgemm() {

				  torchrec_commit=$(get_pinned_commit torchrec)

				  local fbgemm_commit

				  fbgemm_commit=$(get_pinned_commit fbgemm)

				  if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then

				    fbgemm_commit=$(get_pinned_commit fbgemm_rocm)

				  fi

				  pip_uninstall torchrec-nightly

				  pip_uninstall fbgemm-gpu-nightly

				  pip_install setuptools-git-versioning scikit-build pyre-extensions

				  # TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it

				  # seems to be an sccache-related issue

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    unset CMAKE_CUDA_COMPILER_LAUNCHER

				    sudo mv /opt/cache/bin /opt/cache/bin-backup

				  fi

				  if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then

				    # install torchrec first because it installs fbgemm nightly on top of rocm fbgemm

				    pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				    pip_uninstall fbgemm-gpu-nightly

				  # See https://github.com/pytorch/pytorch/issues/106971

				  CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"

				  pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				  if [[ "$IS_A100_RUNNER" == "1" ]]; then

				    export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache

				    sudo mv /opt/cache/bin-backup /opt/cache/bin

				    pip_install tabulate  # needed for newer fbgemm

				    pip_install patchelf  # needed for rocm fbgemm

				    git clone --recursive https://github.com/pytorch/fbgemm

				    pushd fbgemm/fbgemm_gpu

				    git checkout "${fbgemm_commit}"

				    python setup.py install \

				      --package_variant=rocm \

				      -DHIP_ROOT_DIR="${ROCM_PATH}" \

				      -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \

				      -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"

				    popd

				    rm -rf fbgemm

				  else

				    # See https://github.com/pytorch/pytorch/issues/106971

				    CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"

				    pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"

				  fi

				}

				@ -216,6 +226,11 @@ function checkout_install_torchbench() {

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  # TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488

				  # is regressing speedup metric. This needs to be investigated further

				  pip install transformers==4.38.1

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

									
										2

.ci/pytorch/cpp_doc_push_script.sh
									
												
				@ -40,7 +40,7 @@ echo "Building PyTorch C++ API docs..."

				rm -rf cppdocs

				git clone https://github.com/pytorch/cppdocs

				set -ex

				set -ex -o pipefail

				# Generate ATen files

				pushd "${pt_checkout}"

									
										2

.ci/pytorch/functorch_doc_push_script.sh
									
												
				@ -5,7 +5,7 @@ pt_checkout="/var/lib/jenkins/workspace"

				source "$pt_checkout/.ci/pytorch/common_utils.sh"

				echo "functorch_doc_push_script.sh: Invoked with $*"

				set -ex

				set -ex -o pipefail

				version=${DOCS_VERSION:-nightly}

				echo "version: $version"

									
										2

.ci/pytorch/install_cache_xla.sh
									
												
				@ -6,7 +6,7 @@

				# return the same thing, ex checks for for rocm, CUDA, and changing the path

				# where sccache is installed, and not changing /etc/environment.

				set -ex

				set -ex -o pipefail

				install_binary() {

				  echo "Downloading sccache binary from S3 repo"

									
										3

.ci/pytorch/macos-test.sh
									
												
				@ -18,6 +18,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(

				fi

				popd

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				setup_test_python() {

				  # The CircleCI worker hostname doesn't resolve to an address.

				  # This environment variable makes ProcessGroupGloo default to

									
										93

.ci/pytorch/multigpu-test.sh
									
												
				@ -8,55 +8,62 @@

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				echo "Testing pytorch"

				time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose

				# When adding more tests, please use HUD to see which shard is shorter

				if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then

				    # FSDP tests

				    for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done

				fi

				# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015

				# python tools/download_mnist.py --quiet -d test/cpp/api/mnist

				# OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api

				time python test/run_test.py --verbose -i distributed/test_c10d_common

				time python test/run_test.py --verbose -i distributed/test_c10d_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_nccl

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl

				time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering

				time python test/run_test.py --verbose -i distributed/test_store

				time python test/run_test.py --verbose -i distributed/test_symmetric_memory

				time python test/run_test.py --verbose -i distributed/test_pg_wrapper

				time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent

				# FSDP tests

				for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done

				# ShardedTensor tests

				time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint

				time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint

				time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec

				time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan

				time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor

				time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard

				if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then

				    time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose

				# functional collective tests

				time python test/run_test.py --verbose -i distributed/test_functional_api

				    # Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015

				    # python tools/download_mnist.py --quiet -d test/cpp/api/mnist

				    # OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api

				    time python test/run_test.py --verbose -i distributed/test_c10d_common

				    time python test/run_test.py --verbose -i distributed/test_c10d_gloo

				    time python test/run_test.py --verbose -i distributed/test_c10d_nccl

				    time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo

				    time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl

				    time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering

				    time python test/run_test.py --verbose -i distributed/test_store

				    time python test/run_test.py --verbose -i distributed/test_symmetric_memory

				    time python test/run_test.py --verbose -i distributed/test_pg_wrapper

				    time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent

				# DTensor tests

				time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops

				time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile

				    # ShardedTensor tests

				    time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint

				    time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint

				    time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec

				    time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan

				    time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor

				    time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard

				# DeviceMesh test

				time python test/run_test.py --verbose -i distributed/test_device_mesh

				    # functional collective tests

				    time python test/run_test.py --verbose -i distributed/test_functional_api

				# DTensor/TP tests

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state

				    # DTensor tests

				    time python test/run_test.py --verbose -i distributed/tensor/test_random_ops

				    time python test/run_test.py --verbose -i distributed/tensor/test_dtensor_compile

				# FSDP2 tests

				time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh

				    # DeviceMesh test

				    time python test/run_test.py --verbose -i distributed/test_device_mesh

				# ND composability tests

				time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability

				time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability

				    # DTensor/TP tests

				    time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples

				    time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state

				# Other tests

				time python test/run_test.py --verbose -i test_cuda_primary_ctx

				time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu

				time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype

				time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping

				    # FSDP2 tests

				    time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh

				    # ND composability tests

				    time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability

				    time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability

				    # Other tests

				    time python test/run_test.py --verbose -i test_cuda_primary_ctx

				    time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu

				    time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype

				    time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping

				fi

				assert_git_not_dirty
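
Review note: the two guards in this file use different defaults, `${SHARD_NUMBER:-1}` and `${SHARD_NUMBER:-2}`, so both blocks appear to run when `SHARD_NUMBER` is unset. A minimal sketch of that behavior, assuming a hypothetical helper `selected_blocks` that only records which guard matched:

```shell
#!/usr/bin/env bash
# Illustrative only: selected_blocks is a hypothetical helper, not part of
# multigpu-test.sh. It mirrors the two guards and reports which ones match.
selected_blocks() {
    local out=()
    if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then out+=("shard1"); fi
    if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then out+=("shard2"); fi
    echo "${out[*]}"
}

# Unset: each guard falls back to its own default, so both match.
unset SHARD_NUMBER
unset_result="$(selected_blocks)"

# Set explicitly: only the matching guard fires.
SHARD_NUMBER=1
one_result="$(selected_blocks)"
SHARD_NUMBER=2
two_result="$(selected_blocks)"

echo "unset: $unset_result"
echo "1:     $one_result"
echo "2:     $two_result"
```

Under this reading, an unsharded invocation still runs the full test list, while sharded jobs each run one half.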

									
										4

.ci/pytorch/python_doc_push_script.sh
									
												
				@ -7,7 +7,7 @@ source "$pt_checkout/.ci/pytorch/common_utils.sh"

				echo "python_doc_push_script.sh: Invoked with $*"

				set -ex

				set -ex -o pipefail

				# for statements like ${1:-${DOCS_INSTALL_PATH:-docs/}}

				# the order of operations goes:

				@ -63,7 +63,7 @@ build_docs () {

				    echo "(tried to echo the WARNINGS above the ==== line)"

				    echo =========================

				  fi

				  set -ex

				  set -ex -o pipefail

				  return $code

				}

									
										4

.ci/pytorch/run_tests.sh
									
												
				@ -13,7 +13,7 @@ set -eux -o pipefail

				# This script expects to be in the pytorch root folder

				if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then

				    echo "builder/test.sh expects to be run from the Pytorch root directory " \

				    echo "run_tests.sh expects to be run from the Pytorch root directory " \

				         "but I'm actually in $(pwd)"

				    exit 2

				fi

				@ -40,7 +40,7 @@ retry () {

				if [[ "$#" != 3 ]]; then

				  if [[ -z "${DESIRED_PYTHON:-}" || -z "${DESIRED_CUDA:-}" || -z "${PACKAGE_TYPE:-}" ]]; then

				    echo "USAGE: run_tests.sh  PACKAGE_TYPE  DESIRED_PYTHON  DESIRED_CUDA"

				    echo "The env variable PACKAGE_TYPE must be set to 'conda' or 'manywheel' or 'libtorch'"

				    echo "The env variable PACKAGE_TYPE must be set to 'manywheel' or 'libtorch'"

				    echo "The env variable DESIRED_PYTHON must be set like '2.7mu' or '3.6m' etc"

				    echo "The env variable DESIRED_CUDA must be set like 'cpu' or 'cu80' etc"

				    exit 1

									
										10

.ci/pytorch/smoke_test/check_binary_symbols.py
									
												
				@ -6,7 +6,7 @@ import itertools

				import os

				import re

				from pathlib import Path

				from typing import Any, List, Tuple

				from typing import Any

				# We also check that there are [not] cxx11 symbols in libtorch

				@ -46,17 +46,17 @@ LIBTORCH_PRE_CXX11_PATTERNS = _apply_libtorch_symbols(PRE_CXX11_SYMBOLS)

				@functools.lru_cache(100)

				def get_symbols(lib: str) -> List[Tuple[str, str, str]]:

				def get_symbols(lib: str) -> list[tuple[str, str, str]]:

				    from subprocess import check_output

				    lines = check_output(f'nm "{lib}"|c++filt', shell=True)

				    return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]

				def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:

				def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:

				    def _grep_symbols(

				        symbols: List[Tuple[str, str, str]], patterns: List[Any]

				    ) -> List[str]:

				        symbols: list[tuple[str, str, str]], patterns: list[Any]

				    ) -> list[str]:

				        rc = []

				        for _s_addr, _s_type, s_name in symbols:

				            for pattern in patterns:

									
										47

.ci/pytorch/smoke_test/smoke_test.py
									
												
				@ -6,6 +6,7 @@ import re

				import subprocess

				import sys

				from pathlib import Path

				from tempfile import NamedTemporaryFile

				import torch

				import torch._dynamo

				@ -109,8 +110,10 @@ def check_version(package: str) -> None:

				                            {release_matrix[module['name']]} for channel {channel}. But its {module_version}"

				                    )

				                else:

				                    print(f"{module['name']} version actual: {module_version} expected: \

				                        {release_matrix[module['name']]} for channel {channel}.")

				                    print(

				                        f"{module['name']} version actual: {module_version} expected: \

				                        {release_matrix[module['name']]} for channel {channel}."

				                    )

				    else:

				        print(f"Skip version check for channel {channel} as stable version is None")

				@ -159,6 +162,32 @@ def test_cuda_runtime_errors_captured() -> None:

				        raise RuntimeError("Expected CUDA RuntimeError but have not received!")

				def test_cuda_gds_errors_captured() -> None:

				    major_version = int(torch.version.cuda.split(".")[0])

				    minor_version = int(torch.version.cuda.split(".")[1])

				    if major_version < 12 or (major_version == 12 and minor_version < 6):

				        print("CUDA version is not supported for GDS smoke test")

				        return

				    cuda_exception_missed = True

				    try:

				        print("Testing test_cuda_gds_errors_captured")

				        with NamedTemporaryFile() as f:

				            torch.cuda.gds.GdsFile(f.name, os.O_CREAT | os.O_RDWR)

				    except RuntimeError as e:

				        expected_error = "cuFileHandleRegister failed"

				        if re.search(expected_error, f"{e}"):

				            print(f"Caught CUDA exception with success: {e}")

				            cuda_exception_missed = False

				        else:

				            raise e

				    if cuda_exception_missed:

				        raise RuntimeError(

				            "Expected cuFileHandleRegister failed RuntimeError but have not received!"

				        )

				def smoke_test_cuda(

				    package: str, runtime_error_check: str, torch_compile_check: str

				) -> None:

				@ -180,7 +209,7 @@ def smoke_test_cuda(

				    # torch.compile is available on macos-arm64 and Linux for python 3.8-3.13

				    if (

				        torch_compile_check == "enabled"

				        and sys.version_info < (3, 13, 0)

				        and sys.version_info < (3, 14, 0)

				        and target_os in ["linux", "linux-aarch64", "macos-arm64", "darwin"]

				    ):

				        smoke_test_compile("cuda" if torch.cuda.is_available() else "cpu")

				@ -339,7 +368,7 @@ def smoke_test_modules():

				                print(f"Output: \n{output}\n")

				def main() -> None:

				def parse_args():

				    parser = argparse.ArgumentParser()

				    parser.add_argument(

				        "--package",

				@ -362,9 +391,16 @@ def main() -> None:

				        choices=["enabled", "disabled"],

				        default="enabled",

				    )

				    options = parser.parse_args()

				    return parser.parse_args()

				def main() -> None:

				    options = parse_args()

				    print(f"torch: {torch.__version__}")

				    print(torch.__config__.parallel_info())

				    # All PyTorch binary builds should be built with OpenMP

				    if not torch.backends.openmp.is_available():

				        raise RuntimeError("PyTorch must be built with OpenMP support")

				    check_version(options.package)

				    smoke_test_conv2d()

				@ -372,6 +408,7 @@ def main() -> None:

				    test_numpy()

				    if is_cuda_system:

				        test_linalg("cuda")

				        test_cuda_gds_errors_captured()

				    if options.package == "all":

				        smoke_test_modules()

									
										89

.ci/pytorch/test.sh
									
												
				@ -4,7 +4,7 @@

				# (This is set by default in the Docker images we build, so you don't

				# need to set it yourself.

				set -ex

				set -ex -o pipefail

				# Suppress ANSI color escape sequences

				export TERM=vt100

				@ -12,9 +12,9 @@ export TERM=vt100

				# shellcheck source=./common.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				# Do not change workspace permissions for ROCm CI jobs

				# Do not change workspace permissions for ROCm and s390x CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && -d /var/lib/jenkins/workspace ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				@ -46,6 +46,9 @@ BUILD_BIN_DIR="$BUILD_DIR"/bin

				SHARD_NUMBER="${SHARD_NUMBER:=1}"

				NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				export VALGRIND=ON

				# export TORCH_INDUCTOR_INSTALL_GXX=ON

				if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				@ -86,6 +89,13 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				  export VALGRIND=OFF

				fi

				if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then

				  # There are additional warnings on s390x, maybe due to newer gcc.

				  # Skip this check for now

				  export VALGRIND=OFF

				fi

				if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]] || [[ "${CONTINUE_THROUGH_ERROR}" == "1" ]]; then

				  # When rerunning disable tests, do not generate core dumps as it could consume

				  # the runner disk space when crashed tests are run multiple times. Running out

				@ -129,7 +139,7 @@ if [[ "$TEST_CONFIG" == 'default' ]]; then

				fi

				if [[ "$TEST_CONFIG" == 'distributed' ]] && [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then

				  export HIP_VISIBLE_DEVICES=0,1

				  export HIP_VISIBLE_DEVICES=0,1,2,3

				fi

				if [[ "$TEST_CONFIG" == 'slow' ]]; then

				@ -153,6 +163,8 @@ elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				  export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"

				  # setting PYTHON_TEST_EXTRA_OPTION

				  export PYTHON_TEST_EXTRA_OPTION="--xpu"

				  # Disable sccache for xpu test due to flaky issue https://github.com/pytorch/pytorch/issues/143585

				  sudo rm -rf /opt/cache

				fi

				if [[ "$TEST_CONFIG" == *crossref* ]]; then

				@ -165,6 +177,9 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then

				  # Print GPU info

				  rocminfo

				  rocminfo | grep -E 'Name:.*\sgfx|Marketing'

				  # for benchmarks/dynamo/check_accuracy.py, we need to put results in a rocm specific directory to avoid clashes with cuda

				  MAYBE_ROCM="rocm/"

				fi

				if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				@ -313,6 +328,7 @@ test_dynamo_wrapped_shard() {

				    --exclude-jit-executor \

				    --exclude-distributed-tests \

				    --exclude-torch-export-tests \

				    --exclude-aot-dispatch-tests \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose \

				    --upload-artifacts-while-running

				@ -326,7 +342,7 @@ test_inductor_distributed() {

				  python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose

				  python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose

				  python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose

				  python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose

				  python test/run_test.py -i distributed/tensor/test_dtensor_compile.py --verbose

				  python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose

				  python test/run_test.py -i distributed/_composable/test_replicate_with_compiler.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose

				@ -379,15 +395,32 @@ test_inductor_aoti() {

				  CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				}

				test_inductor_cpp_wrapper() {

				test_inductor_cpp_wrapper_shard() {

				  if [[ -z "$NUM_TEST_SHARDS" ]]; then

				    echo "NUM_TEST_SHARDS must be defined to run a Python test shard"

				    exit 1

				  fi

				  export TORCHINDUCTOR_CPP_WRAPPER=1

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  # Run certain inductor unit tests with cpp wrapper. In the end state, we should be able to run all the inductor

				  # unit tests with cpp wrapper.

				  python test/run_test.py --include inductor/test_torchinductor.py --verbose

				  if [[ "$1" -eq "2" ]]; then

				    # For now, manually put the opinfo tests in shard 2, and all other tests in

				    # shard 1.  Test specific things triggering past bugs, for now.

				    python test/run_test.py \

				      --include inductor/test_torchinductor_opinfo \

				      -k 'linalg or to_sparse' \

				      --verbose

				    exit

				  fi

				  # Run certain inductor unit tests with cpp wrapper. In the end state, we

				  # should be able to run all the inductor unit tests with cpp_wrapper.

				  python test/run_test.py \

				    --include inductor/test_torchinductor inductor/test_max_autotune inductor/test_cpu_repro \

				    --verbose

				  python test/run_test.py --inductor --include test_torch -k 'take' --verbose

				  # Run inductor benchmark tests with cpp wrapper.

				  # Skip benchmark tests if it's in rerun-disabled-mode.

				@ -400,7 +433,7 @@ test_inductor_cpp_wrapper() {

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_timm_training.csv"

				    python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				      --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				@ -410,7 +443,7 @@ test_inductor_cpp_wrapper() {

				      --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_torchbench_inference.csv"

				  fi

				}

				@ -485,6 +518,8 @@ test_perf_for_dashboard() {

				    test_inductor_set_cpu_affinity

				  elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then

				    device=cuda_a10g

				  elif [[ "${TEST_CONFIG}" == *rocm* ]]; then

				    device=rocm

				  fi

				  for mode in "${modes[@]}"; do

				@ -517,7 +552,7 @@ test_perf_for_dashboard() {

				            --dynamic-batch-only "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then

				      if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]]; then

				        TORCHINDUCTOR_CPP_WRAPPER=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				@ -601,16 +636,16 @@ test_single_dynamo_benchmark() {

				      TEST_CONFIG=${TEST_CONFIG//_avx512/}

				    fi

				    python "benchmarks/dynamo/$suite.py" \

				      --ci --accuracy --timing --explain \

				      --ci --accuracy --timing --explain --print-compilation-time \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" \

				      "$@" "${partition_flags[@]}" \

				      --output "$TEST_REPORTS_DIR/${name}_${suite}.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"

				    python benchmarks/dynamo/check_graph_breaks.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"

				  fi

				}

				@ -633,7 +668,7 @@ test_inductor_halide() {

				}

				test_inductor_triton_cpu() {

				  python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose

				  python test/run_test.py --include inductor/test_triton_cpu_backend.py inductor/test_torchinductor_strided_blocks.py --verbose

				  assert_git_not_dirty

				}

				@ -697,7 +732,7 @@ test_inductor_torchbench_smoketest_perf() {

				      --only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_huggingface_training.csv"

				  done

				}

				@ -893,10 +928,20 @@ test_libtorch_api() {

				  else

				    # Exclude IMethodTest that relies on torch::deploy, which will instead be run in test_deploy

				    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"

				    python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr

				    # On s390x, pytorch is built without llvm.

				    # Even if it would be built with llvm, llvm currently doesn't support used features on s390x and

				    # test fails with errors like:

				    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer

				    # unknown file: Failure

				    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }

				    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then

				      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr

				    fi

				  fi

				  if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then

				  # quantization is not fully supported on s390x yet

				  if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* && "${BUILD_ENVIRONMENT}" != *s390x* ]]; then

				    # NB: This test is not under TORCH_BIN_DIR but under BUILD_BIN_DIR

				    export CPP_TESTS_DIR="${BUILD_BIN_DIR}"

				    python test/run_test.py --cpp --verbose -i cpp/static_runtime_test

				@ -1243,7 +1288,7 @@ EOF

				}

				test_bazel() {

				  set -e

				  set -e -o pipefail

				  # bazel test needs sccache setup.

				  # shellcheck source=./common-build.sh

				@ -1394,7 +1439,7 @@ test_executorch() {

				test_linux_aarch64() {

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				        test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \

				        test_foreach test_reductions test_unary_ufuncs \

				        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops \

				        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Dynamo tests

				@ -1497,7 +1542,7 @@ elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then

				  install_torchaudio cuda

				  install_torchvision

				  checkout_install_torchbench hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor_shard "${SHARD_NUMBER}"

									
										41

.ci/pytorch/test_example_code/cnn_smoke_win_arm64.py
									
										Normal file
									
												
				@ -0,0 +1,41 @@

				r"""

				It's used to check basic cnn features with cpu-only.

				For example, it would throw exception if some components are missing

				"""

				import torch

				import torch.nn as nn

				import torch.nn.functional as F

				import torch.optim as optim

				class SimpleCNN(nn.Module):

				    def __init__(self):

				        super().__init__()

				        self.conv = nn.Conv2d(1, 1, 3)

				        self.pool = nn.MaxPool2d(2, 2)

				    def forward(self, inputs):

				        output = self.pool(F.relu(self.conv(inputs)))

				        output = output.view(1)

				        return output

				try:

				    # Mock one infer

				    net = SimpleCNN()

				    net_inputs = torch.rand((1, 1, 5, 5))

				    outputs = net(net_inputs)

				    print(outputs)

				    criterion = nn.MSELoss()

				    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.1)

				    # Mock one step training

				    label = torch.full((1,), 1.0, dtype=torch.float)

				    loss = criterion(outputs, label)

				    loss.backward()

				    optimizer.step()

				except Exception as e:

				    print(f"An error occurred: {e}")
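A note on the shapes in SimpleCNN above: `output.view(1)` only succeeds if the pooled tensor holds exactly one element. The following is a minimal sketch in plain Python (no torch required), assuming the usual output-size formula for unpadded, undilated convolution and pooling in floor mode. The helper name `out_size` is hypothetical and not part of the diff.

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv/pool layer (floor mode, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


# Input is torch.rand((1, 1, 5, 5)): batch 1, channel 1, 5x5 spatial.
side = 5
after_conv = out_size(side, kernel=3)                   # nn.Conv2d(1, 1, 3)
after_pool = out_size(after_conv, kernel=2, stride=2)   # nn.MaxPool2d(2, 2)

# batch * channels * height * width
numel = 1 * 1 * after_pool * after_pool
print(after_conv, after_pool, numel)
```

Under these assumptions the pooled tensor has a single element, which is why the reshape to a 1-element vector and the comparison against a 1-element label both work; a different input size would likely make `view(1)` fail.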

									
										13

.ci/pytorch/test_example_code/rnn_smoke_win_arm64.py
									
										Normal file
									
												
				@ -0,0 +1,13 @@

				r"""

				It's used to check basic rnn features with cpu-only.

				For example, it would throw exception if some components are missing

				"""

				import torch

				import torch.nn as nn

				rnn = nn.RNN(10, 20, 2)

				inputs = torch.randn(5, 3, 10)

				h0 = torch.randn(2, 3, 20)

				output, hn = rnn(inputs, h0)

									
										2

.ci/pytorch/win-build.sh
									
												
				@ -38,7 +38,7 @@ if [[ $PYLONG_API_CHECK == 0 ]]; then

				  echo "PyLong_AsUnsignedLong -> THPUtils_unpackUInt32 / THPUtils_unpackUInt64"

				  exit 1

				fi

				set -ex

				set -ex -o pipefail

				"$SCRIPT_HELPERS_DIR"/build_pytorch.bat

									
										3

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -26,7 +26,8 @@ if not errorlevel 0 goto fail

				if "%USE_XPU%"=="1" (

				  :: Install xpu support packages

				  call %INSTALLER_DIR%\install_xpu.bat

				  set CUDA_VERSION=xpu

				  call %SCRIPT_HELPERS_DIR%\..\windows\internal\xpu_install.bat

				  if errorlevel 1 exit /b 1

				)

									
										114

.ci/pytorch/win-test-helpers/installation-helpers/install_xpu.bat
									
											
				@ -1,114 +0,0 @@

				@echo on

				REM Description: Install Intel Support Packages on Windows

				REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html

				set XPU_INSTALL_MODE=%~1

				if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start

				if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start

				if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start

				if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start

				:arg_error

				echo Illegal XPU installation mode. The value can be "bundle"/"driver"/"all"

				echo If keep the value as space, will use default "bundle" mode

				exit /b 1

				:xpu_driver_install_start

				:: TODO Need more testing for driver installation

				set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe

				curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%

				echo "XPU Driver installing..."

				start /wait "Intel XPU Driver Installer" "xpu_driver.exe"

				if errorlevel 1 exit /b 1

				del xpu_driver.exe

				if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end

				:xpu_bundle_install_start

				set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI

				set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe

				set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product

				set XPU_BUNDLE_VERSION=0.5.3+31

				set XPU_BUNDLE_INSTALLED=0

				set XPU_BUNDLE_UNINSTALL=0

				set XPU_EXTRA_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe

				set XPU_EXTRA_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product

				set XPU_EXTRA_VERSION=0.9.0+36

				set XPU_EXTRA_INSTALLED=0

				set XPU_EXTRA_UNINSTALL=0

				if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.0] (

				    set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/efc86abd-cb77-452e-a03f-a741895b8ece/intel-deep-learning-essentials-2025.0.0.336_offline.exe

				    set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product

				    set XPU_BUNDLE_VERSION=2025.0.0+335

				    set XPU_BUNDLE_INSTALLED=0

				    set XPU_BUNDLE_UNINSTALL=0

				    set XPU_EXTRA_URL=NULL

				    set XPU_EXTRA_PRODUCT_NAME=intel.oneapi.win.compiler.product

				    set XPU_EXTRA_VERSION=2025.0.1+1226

				    set XPU_EXTRA_INSTALLED=0

				    set XPU_EXTRA_UNINSTALL=0

				)

				:: Check if XPU bundle is target version or already installed

				if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check

				goto xpu_bundle_install

				:xpu_bundle_ver_check

				"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log

				for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (

				    if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (

				        echo %%a Installed Version: %%b

				        set XPU_BUNDLE_INSTALLED=1

				        if not "%XPU_BUNDLE_VERSION%"=="%%b" (

				            start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle

				            set XPU_BUNDLE_UNINSTALL=1

				        )

				    )

				    if "%%a"=="%XPU_EXTRA_PRODUCT_NAME%" (

				        echo %%a Installed Version: %%b

				        set XPU_EXTRA_INSTALLED=1

				        if not "%XPU_EXTRA_VERSION%"=="%%b" (

				            start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle

				            set XPU_EXTRA_UNINSTALL=1

				        )

				    )

				    if not "%%b" == "Version" if not [%%b]==[] if not "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" if not "%%a"=="%XPU_EXTRA_PRODUCT_NAME%" (

				        echo "Uninstalling...."

				        start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle

				    )

				)

				if errorlevel 1 exit /b 1

				if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log

				if exist uninstall_bundle rmdir /s /q uninstall_bundle

				if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install

				if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install

				:xpu_extra_check

				if "%XPU_EXTRA_URL%"=="NULL" goto xpu_install_end

				if "%XPU_EXTRA_INSTALLED%"=="0" goto xpu_extra_install

				if "%XPU_EXTRA_UNINSTALL%"=="1" goto xpu_extra_install

				goto xpu_install_end

				:xpu_bundle_install

				curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%

				echo "XPU Bundle installing..."

				start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle

				if errorlevel 1 exit /b 1

				del xpu_bundle.exe

				goto xpu_extra_check

				:xpu_extra_install

				curl -o xpu_extra.exe --retry 3 --retry-all-errors -k %XPU_EXTRA_URL%

				echo "Intel XPU EXTRA installing..."

				start /wait "Intel XPU EXTRA Installer" "xpu_extra.exe" --action=install --eula=accept --silent --log-dir install_bundle

				if errorlevel 1 exit /b 1

				del xpu_extra.exe

				:xpu_install_end

									
										7

.ci/pytorch/win-test.sh
									
												
				@ -1,5 +1,5 @@

				#!/bin/bash

				set -ex

				set -ex -o pipefail

				SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )

				# shellcheck source=./common.sh

				@ -18,6 +18,9 @@ export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/w/build-result

				PYTORCH_FINAL_PACKAGE_DIR_WIN=$(cygpath -w "${PYTORCH_FINAL_PACKAGE_DIR}")

				export PYTORCH_FINAL_PACKAGE_DIR_WIN

				# enable debug asserts in serialization

				export TORCH_SERIALIZATION_DEBUG=1

				mkdir -p "$TMP_DIR"/build/torch

				export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers

				@ -41,7 +44,7 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==

				python -m pip install z3-solver==4.12.2.0

				# Install tlparse for test\dynamo\test_structured_trace.py UTs.

				python -m pip install tlparse==0.3.25

				python -m pip install tlparse==0.3.30

				# Install parameterized

				python -m pip install parameterized==0.8.1

									
										31

.ci/pytorch/windows/arm64/bootstrap_apl.bat
									
										Normal file
									
												
				@ -0,0 +1,31 @@

				@echo off

				echo Dependency ARM Performance Libraries (APL) installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the ARM Performance Libraries (APL)

				set DOWNLOAD_URL="https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_Windows.msi"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\arm-performance-libraries.msi

				:: Download installer

				echo Downloading ARM Performance Libraries (APL)...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install ARM Performance Libraries (APL)

				echo Installing ARM Performance Libraries (APL)...

				msiexec /i "%INSTALLER_FILE%" /qn /norestart ACCEPT_EULA=1 INSTALLFOLDER="%DEPENDENCIES_DIR%"

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install ARM Performance Libraries (APL) components. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to environment

				echo ARMPL_DIR=%DEPENDENCIES_DIR%\armpl_24.10\>> %GITHUB_ENV%

				echo %DEPENDENCIES_DIR%\armpl_24.10\bin\>> %GITHUB_PATH%

				echo Dependency ARM Performance Libraries (APL) installation finished.

									
										49

.ci/pytorch/windows/arm64/bootstrap_buildtools.bat
									
										Normal file
									
												
				@ -0,0 +1,49 @@

				@echo off

				echo Dependency MSVC Build Tools with C++ with ARM64/ARM64EC components installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir "%DOWNLOADS_DIR%"

				if not exist "%DEPENDENCIES_DIR%" mkdir "%DEPENDENCIES_DIR%"

				:: Set download URL for the Visual Studio Installer

				set DOWNLOAD_URL=https://aka.ms/vs/17/release/vs_BuildTools.exe

				set INSTALLER_FILE=%DOWNLOADS_DIR%\vs_BuildTools.exe

				:: Download installer

				echo Downloading Visual Studio Build Tools with C++ installer...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install the Visual Studio Build Tools with C++ components

				echo Installing Visual Studio Build Tools with C++ components...

				echo Installing MSVC %MSVC_VERSION%

				if "%MSVC_VERSION%" == "latest" (

				    "%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^

				        --add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^

				        --add Microsoft.VisualStudio.Component.VC.ASAN ^

				        --add Microsoft.VisualStudio.Component.VC.CMake.Project ^

				        --add Microsoft.VisualStudio.Component.VC.Tools.ARM64 ^

				        --add Microsoft.VisualStudio.Component.VC.Tools.x86.x64

				) else if "%MSVC_VERSION%" == "14.40" (

				    "%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^

				        --add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^

				        --add Microsoft.VisualStudio.Component.VC.ASAN ^

				        --add Microsoft.VisualStudio.Component.VC.CMake.Project ^

				        --add Microsoft.VisualStudio.Component.VC.14.40.17.10.ARM64 ^

				        --add Microsoft.VisualStudio.Component.VC.14.40.17.10.x86.x64

				) else if "%MSVC_VERSION%" == "14.36" (

				    "%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^

				        --add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^

				        --add Microsoft.VisualStudio.Component.VC.ASAN ^

				        --add Microsoft.VisualStudio.Component.VC.CMake.Project ^

				        --add Microsoft.VisualStudio.Component.VC.14.36.17.6.ARM64 ^

				        --add Microsoft.VisualStudio.Component.VC.14.36.17.6.x86.x64

				)

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Visual Studio Build Tools with C++ components. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				echo Dependency Visual Studio Build Tools with C++ installation finished.

									
										37

.ci/pytorch/windows/arm64/bootstrap_git.bat
									
										Normal file
									
												
				@ -0,0 +1,37 @@

				:: We need to install a newer version of Git manually because the "-submodules" function is not supported by the runner's default Git version.

				@echo off

				echo Dependency Git installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the Git

				set DOWNLOAD_URL="https://github.com/git-for-windows/git/releases/download/v2.46.0.windows.1/Git-2.46.0-64-bit.exe"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\Git-2.46.0-64-bit.exe

				:: Download installer

				echo Downloading Git...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install Git

				echo Installing Git...

				"%INSTALLER_FILE%" /VERYSILENT /DIR="%DEPENDENCIES_DIR%\git"

				dir %DEPENDENCIES_DIR%\git

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Git. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Enable long paths

				call "%DEPENDENCIES_DIR%\git\cmd\git.exe" config --system core.longpaths true

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\git\cmd\;%DEPENDENCIES_DIR%\git\bin\>> %GITHUB_PATH%

				echo Dependency Git installation finished.

									
										33

.ci/pytorch/windows/arm64/bootstrap_libuv.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency libuv installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				cd %DEPENDENCIES_DIR%

				git clone https://github.com/libuv/libuv.git -b v1.39.0

				echo Configuring libuv...

				mkdir libuv\build

				cd libuv\build

				cmake .. -DBUILD_TESTING=OFF

				echo Building libuv...

				cmake --build . --config Release

				echo Installing libuv...

				cmake --install . --prefix ../install

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install libuv. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				echo Dependency libuv installation finished.

									
										46

.ci/pytorch/windows/arm64/bootstrap_openblas.bat
									
										Normal file
									
												
				@ -0,0 +1,46 @@

				@echo off

				echo Dependency OpenBLAS installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: Clone OpenBLAS

				cd %DEPENDENCIES_DIR%

				git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29

				echo Configuring OpenBLAS...

				mkdir OpenBLAS\build

				cd OpenBLAS\build

				cmake .. -G Ninja ^

				  -DBUILD_TESTING=0 ^

				  -DBUILD_BENCHMARKS=0 ^

				  -DC_LAPACK=1 ^

				  -DNOFORTRAN=1 ^

				  -DDYNAMIC_ARCH=0 ^

				  -DARCH=arm64 ^

				  -DBINARY=64 ^

				  -DTARGET=GENERIC ^

				  -DUSE_OPENMP=1 ^

				  -DCMAKE_SYSTEM_PROCESSOR=ARM64 ^

				  -DCMAKE_SYSTEM_NAME=Windows ^

				  -DCMAKE_BUILD_TYPE=Release

				echo Building OpenBLAS...

				cmake --build . --config Release

				echo Installing OpenBLAS...

				cmake --install . --prefix ../install

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install OpenBLAS. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				echo Dependency OpenBLAS installation finished.

									
										41

.ci/pytorch/windows/arm64/bootstrap_python.bat
									
										Normal file
									
												
				@ -0,0 +1,41 @@

				@echo off

				echo Dependency Python installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				if "%PYTHON_VERSION%"=="Python312" (

				    echo Python version is set to Python312

				    set DOWNLOAD_URL="https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe"

				) else if "%PYTHON_VERSION%"=="Python311" (

				    echo Python version is set to Python311

				    set DOWNLOAD_URL="https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe"

				) else (

				    echo PYTHON_VERSION not defined, Python version is set to Python312

				    set DOWNLOAD_URL="https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe"

				)

				set INSTALLER_FILE=%DOWNLOADS_DIR%\python-installer.exe

				:: Download installer

				echo Downloading Python...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install Python

				echo Installing Python...

				"%INSTALLER_FILE%" /quiet Include_debug=1 TargetDir="%DEPENDENCIES_DIR%\Python"

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Python. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\Python\>> %GITHUB_PATH%

				echo %DEPENDENCIES_DIR%\Python\scripts\>> %GITHUB_PATH%

				echo %DEPENDENCIES_DIR%\Python\libs\>> %GITHUB_PATH%

				echo Dependency Python installation finished.

									
										33

.ci/pytorch/windows/arm64/bootstrap_rust.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency Rust installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				set DOWNLOAD_URL="https://static.rust-lang.org/rustup/dist/x86_64-pc-windows-msvc/rustup-init.exe"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\rustup-init.exe

				set RUSTUP_HOME=%DEPENDENCIES_DIR%\rust

				set CARGO_HOME=%DEPENDENCIES_DIR%\cargo

				:: Download installer

				echo Downloading Rust...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install Rust

				echo Installing Rust...

				"%INSTALLER_FILE%" -q -y --default-host aarch64-pc-windows-msvc --default-toolchain stable --profile default

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install Rust. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\cargo\bin\>> %GITHUB_PATH%

				echo RUSTUP_HOME=%DEPENDENCIES_DIR%\rust>> %GITHUB_ENV%

				echo CARGO_HOME=%DEPENDENCIES_DIR%\cargo>> %GITHUB_ENV%

				echo Dependency Rust installation finished.

									
										33

.ci/pytorch/windows/arm64/bootstrap_sccache.bat
									
										Normal file
									
												
				@ -0,0 +1,33 @@

				@echo off

				echo Dependency sccache installation started.

				:: Pre-check for downloads and dependencies folders

				if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%

				if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%

				:: Set download URL for the sccache

				set DOWNLOAD_URL="https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-pc-windows-msvc.zip"

				set INSTALLER_FILE=%DOWNLOADS_DIR%\sccache.zip

				:: Download installer

				echo Downloading sccache.zip...

				curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%

				:: Install sccache

				echo Extracting sccache.zip...

				tar -xf "%INSTALLER_FILE%" -C %DEPENDENCIES_DIR%

				cd %DEPENDENCIES_DIR%

				ren sccache-v0.8.1-x86_64-pc-windows-msvc sccache

				cd ..

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed to install sccache. (exitcode = %errorlevel%)"

				    exit /b 1

				)

				:: Add to PATH

				echo %DEPENDENCIES_DIR%\sccache\>> %GITHUB_PATH%

				echo Dependency sccache installation finished.

									
										22

.ci/pytorch/windows/arm64/bootstrap_tests.bat
									
										Normal file
									
												
				@ -0,0 +1,22 @@

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				pip install pytest numpy

				:: find file name for pytorch wheel

				for /f "delims=" %%f in ('dir /b "%PYTORCH_FINAL_PACKAGE_DIR%" ^| findstr "torch-"') do set "TORCH_WHEEL_FILENAME=%PYTORCH_FINAL_PACKAGE_DIR%\%%f"

				pip install %TORCH_WHEEL_FILENAME%
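The wheel lookup at the end of bootstrap_tests.bat overwrites `TORCH_WHEEL_FILENAME` on every iteration, so the last entry whose name contains `torch-` wins, and the variable is never set if nothing matches. A rough Python sketch of the same selection follows. The function name `find_torch_wheel`, the sample file names, and the sorted ordering (standing in for whatever order `dir /b` returns) are assumptions for illustration only.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_torch_wheel(package_dir: Path) -> Optional[Path]:
    """Return the last entry whose name contains 'torch-', or None if absent."""
    # Sorting is an assumption that stands in for the order of `dir /b`.
    matches = sorted(p for p in package_dir.iterdir() if "torch-" in p.name)
    return matches[-1] if matches else None


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp)
    # Hypothetical build outputs; real wheel names depend on the build.
    (package_dir / "torch-2.7.0-cp312-cp312-win_arm64.whl").touch()
    (package_dir / "notes.txt").touch()

    wheel = find_torch_wheel(package_dir)
    wheel_name = wheel.name if wheel else None
    print(wheel_name)
```

Returning `None` makes the "no wheel found" case explicit, whereas the batch version would go on to call `pip install` without an argument.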

									
										101

.ci/pytorch/windows/arm64/build_libtorch.bat
									
										Normal file
									
												
				@ -0,0 +1,101 @@

				@echo on

				:: environment variables

				set CMAKE_BUILD_TYPE=%BUILD_TYPE%

				set CMAKE_C_COMPILER_LAUNCHER=sccache

				set CMAKE_CXX_COMPILER_LAUNCHER=sccache

				set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install

				set MSSdk=1

				if defined PYTORCH_BUILD_VERSION (

				  set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%

				  set PYTORCH_BUILD_NUMBER=1

				)

				:: Set BLAS type

				if %ENABLE_APL% == 1 (

				    set BLAS=APL

				    set USE_LAPACK=1

				) else if %ENABLE_OPENBLAS% == 1 (

				    set BLAS=OpenBLAS

				    set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install

				)

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: copy libuv.dll

				copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: python install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				:: DISTUTILS_USE_SDK should be set after psutil dependency

				set DISTUTILS_USE_SDK=1

				:: start sccache server and reset sccache stats

				sccache --start-server

				sccache --zero-stats

				sccache --show-stats

				:: Prepare the environment

				mkdir libtorch

				mkdir libtorch\bin

				mkdir libtorch\cmake

				mkdir libtorch\include

				mkdir libtorch\lib

				mkdir libtorch\share

				mkdir libtorch\test

				:: Call LibTorch build script

				python ./tools/build_libtorch.py

				:: Check if there is an error

				IF ERRORLEVEL 1 exit /b 1

				IF NOT ERRORLEVEL 0 exit /b 1

				:: Move the files to the correct location

				move /Y torch\bin\*.* libtorch\bin\

				move /Y torch\cmake\*.* libtorch\cmake\

				robocopy /move /e torch\include\ libtorch\include\

				move /Y torch\lib\*.* libtorch\lib\

				robocopy /move /e torch\share\ libtorch\share\

				move /Y torch\test\*.* libtorch\test\

				move /Y libtorch\bin\*.dll libtorch\lib\

				:: Set version

				echo %PYTORCH_BUILD_VERSION% > libtorch\build-version

				git rev-parse HEAD > libtorch\build-hash

				:: Set LIBTORCH_PREFIX

				IF "%DEBUG%" == "" (

				    set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps

				) ELSE (

				    set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps-debug

				)

				:: Create output

				C:\Windows\System32\tar.exe -cvaf %LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip -C libtorch *

				:: Copy output to target directory

				if not exist ..\output mkdir ..\output

				copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\"

				copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\%LIBTORCH_PREFIX%-latest.zip"

				:: Cleanup raw data to save space

				rmdir /s /q libtorch

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed on build_libtorch. (exitcode = %errorlevel%)"

				    exit /b 1

				)

									
										60

.ci/pytorch/windows/arm64/build_pytorch.bat
									
										Normal file
									
												
				@ -0,0 +1,60 @@

				@echo on

				:: environment variables

				set CMAKE_BUILD_TYPE=%BUILD_TYPE%

				set CMAKE_C_COMPILER_LAUNCHER=sccache

				set CMAKE_CXX_COMPILER_LAUNCHER=sccache

				set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install

				set MSSdk=1

				if defined PYTORCH_BUILD_VERSION (

				  set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%

				  set PYTORCH_BUILD_NUMBER=1

				)

				:: Set BLAS type

				if %ENABLE_APL% == 1 (

				    set BLAS=APL

				    set USE_LAPACK=1

				) else if %ENABLE_OPENBLAS% == 1 (

				    set BLAS=OpenBLAS

				    set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install

				)

				:: activate visual studio

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				where cl.exe

				:: change to source directory

				cd %PYTORCH_ROOT%

				:: copy libuv.dll

				copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll

				:: create virtual environment

				python -m venv .venv

				echo * > .venv\.gitignore

				call .\.venv\Scripts\activate

				where python

				:: python install dependencies

				python -m pip install --upgrade pip

				pip install -r requirements.txt

				:: DISTUTILS_USE_SDK should be set after psutil dependency

				set DISTUTILS_USE_SDK=1

				:: start sccache server and reset sccache stats

				sccache --start-server

				sccache --zero-stats

				sccache --show-stats

				:: Call PyTorch build script

				python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"

				:: show sccache stats

				sccache --show-stats

				:: Check if installation was successful

				if %errorlevel% neq 0 (

				    echo "Failed on build_pytorch. (exitcode = %errorlevel%)"

				    exit /b 1

				)
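Note that `%errorlevel%` in the final check reflects the most recent command, which here appears to be `sccache --show-stats` rather than `python setup.py bdist_wheel`. The Python sketch below uses stand-in commands instead of the real build and sccache calls and shows the alternative of saving the build's exit code before running the stats step. It is an illustration under those assumptions, not the script's actual behavior; `run_build_then_stats` is a hypothetical helper.

```python
import subprocess
import sys


def run_build_then_stats(build_cmd, stats_cmd):
    """Run the build, remember its exit code, then run the stats command."""
    build_rc = subprocess.run(build_cmd).returncode
    subprocess.run(stats_cmd)  # result intentionally ignored
    return build_rc


# Stand-ins: a "build" that fails with code 3 and a "stats" step that succeeds.
failing_build = [sys.executable, "-c", "import sys; sys.exit(3)"]
ok_stats = [sys.executable, "-c", "pass"]

rc = run_build_then_stats(failing_build, ok_stats)
print(rc)
```

Checking `rc` rather than the exit status of the last command keeps a successful stats call from hiding a failed build.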

									
										65

.ci/pytorch/windows/arm64/smoke_test.bat
									
										Normal file
									
				@ -0,0 +1,65 @@

				@echo off

				setlocal

				set "ORIG_PATH=%PATH%"

				if "%PACKAGE_TYPE%" == "wheel" goto wheel

				if "%PACKAGE_TYPE%" == "libtorch" goto libtorch

				echo "unknown package type"

				exit /b 1

				:wheel

				echo "install wheel package"

				echo Running pip install...

				pip install -q --pre numpy protobuf

				echo Error level after pip install: %ERRORLEVEL%

				if errorlevel 1 exit /b 1

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"

				if errorlevel 1 exit /b 1

				goto smoke_test

				:smoke_test

				python -c "import torch"

				if ERRORLEVEL 1 exit /b 1

				echo Running python rnn_smoke.py...

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke_win_arm64.py

				if errorlevel 1 exit /b 1

				echo Checking that basic CNN works...

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke_win_arm64.py

				if errorlevel 1 exit /b 1

				goto end

				:libtorch

				echo "install and test libtorch"

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do tar -xf "%%i" -C tmp

				if ERRORLEVEL 1 exit /b 1

				pushd tmp\libtorch

				set VC_VERSION_LOWER=14

				set VC_VERSION_UPPER=36

				call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64

				set install_root=%CD%

				set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include

				set LIB=%LIB%;%install_root%\lib

				set PATH=%PATH%;%install_root%\lib

				cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\simple-torch-test.cpp c10.lib torch_cpu.lib /EHsc /std:c++17

				if ERRORLEVEL 1 exit /b 1

				.\simple-torch-test.exe

				if ERRORLEVEL 1 exit /b 1

				:end

				set "PATH=%ORIG_PATH%"

				popd

									
										13

.ci/pytorch/windows/condaenv.bat
									
				@ -9,12 +9,13 @@ FOR %%v IN (%DESIRED_PYTHON%) DO (

				    set PYTHON_VERSION_STR=%%v

				    set PYTHON_VERSION_STR=!PYTHON_VERSION_STR:.=!

				    conda remove -n py!PYTHON_VERSION_STR! --all -y || rmdir %CONDA_HOME%\envs\py!PYTHON_VERSION_STR! /s

				    if "%%v" == "3.8" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=1.11 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.1.2 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2  boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v

				    if "%%v" == "3.13t" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2 boto3 cmake ninja typing_extensions setuptools=72.1.0 python-freethreading python=3.13

				    call conda run -n py!PYTHON_VERSION_STR! pip install pyyaml

				    call conda run -n py!PYTHON_VERSION_STR! pip install mkl-include

				    call conda run -n py!PYTHON_VERSION_STR! pip install mkl-static

				)
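The loop above names each conda environment by removing the dots from the requested Python version (`!PYTHON_VERSION_STR:.=!`) and prefixing `py`. A short Python sketch of that mapping, with a hypothetical function name:

```python
def conda_env_name(desired_python: str) -> str:
    """Drop the dots from a version string and prefix it with 'py'."""
    return "py" + desired_python.replace(".", "")


for version in ("3.9", "3.13", "3.13t"):
    print(version, conda_env_name(version))
```

Under this rule the free-threaded build gets its own environment name, distinct from the regular 3.13 environment.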

									
										59

.ci/pytorch/windows/cuda128.bat
									
										Normal file
									
				@ -0,0 +1,59 @@

				@echo off

				set MODULE_NAME=pytorch

				IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (

				    call internal\clone.bat

				    cd %~dp0

				) ELSE (

				    call internal\clean.bat

				)

				IF ERRORLEVEL 1 goto :eof

				call internal\check_deps.bat

				IF ERRORLEVEL 1 goto :eof

				REM Check for optional components

				set USE_CUDA=

				set CMAKE_GENERATOR=Visual Studio 15 2017 Win64

				IF "%NVTOOLSEXT_PATH%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib"  (

				        set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				    ) ELSE (

				        echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing

				        exit /b 1

				    )

				)

				IF "%CUDA_PATH_V128%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe" (

				        set "CUDA_PATH_V128=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"

				    ) ELSE (

				        echo CUDA 12.8 not found, failing

				        exit /b 1

				    )

				)

				IF "%BUILD_VISION%" == "" (

				    set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0

				    set TORCH_NVCC_FLAGS=-Xfatbin -compress-all

				) ELSE (

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120

				)

				set "CUDA_PATH=%CUDA_PATH_V128%"

				set "PATH=%CUDA_PATH_V128%\bin;%PATH%"

				:optcheck

				call internal\check_opts.bat

				IF ERRORLEVEL 1 goto :eof

				if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..

				call  %~dp0\internal\copy.bat

				IF ERRORLEVEL 1 goto :eof

				call  %~dp0\internal\setup.bat

				IF ERRORLEVEL 1 goto :eof

									
										32

.ci/pytorch/windows/internal/cuda_install.bat
									
				@ -9,7 +9,8 @@ if "%CUDA_VERSION%" == "xpu" (

				    exit /b 0

				)

				set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%

				set SRC_DIR=%~dp0\..

				if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"

				set /a CUDA_VER=%CUDA_VERSION%

				@ -23,9 +24,9 @@ set CUDNN_LIB_FOLDER="lib\x64"

				if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" goto set_cuda_env_vars

				if %CUDA_VER% EQU 118 goto cuda118

				if %CUDA_VER% EQU 121 goto cuda121

				if %CUDA_VER% EQU 124 goto cuda124

				if %CUDA_VER% EQU 126 goto cuda126

				if %CUDA_VER% EQU 128 goto cuda128

				echo CUDA %CUDA_VERSION_STR% is not supported

				exit /b 1

				@ -111,6 +112,33 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda128

				set CUDA_INSTALL_EXE=cuda_12.8.0_571.96_windows.exe

				if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (

				    curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"

				    if errorlevel 1 exit /b 1

				    set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"

				    set "ARGS=cuda_profiler_api_12.8 thrust_12.8 nvcc_12.8 cuobjdump_12.8 nvprune_12.8 nvprof_12.8 cupti_12.8 cublas_12.8 cublas_dev_12.8 cudart_12.8 cufft_12.8 cufft_dev_12.8 curand_12.8 curand_dev_12.8 cusolver_12.8 cusolver_dev_12.8 cusparse_12.8 cusparse_dev_12.8 npp_12.8 npp_dev_12.8 nvrtc_12.8 nvrtc_dev_12.8 nvml_dev_12.8 nvjitlink_12.8 nvtx_12.8"

				)

				set CUDNN_FOLDER=cudnn-windows-x86_64-9.7.0.66_cuda12-archive

				set CUDNN_LIB_FOLDER="lib"

				set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"

				if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (

				    curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"

				    if errorlevel 1 exit /b 1

				    set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"

				)

				@REM cuDNN 8.3+ requires zlib to be installed on the path

				echo Installing ZLIB dlls

				curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"

				7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"

				xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda_common

				:: NOTE: We only install CUDA if we don't have it installed already.

				:: With GHA runners these should be pre-installed as part of our AMI process

									
										99

.ci/pytorch/windows/internal/smoke_test.bat
									
				@ -27,7 +27,6 @@ for /F "delims=" %%i in ('wmic path win32_VideoController get name') do (

				endlocal & set NVIDIA_GPU_EXISTS=%NVIDIA_GPU_EXISTS%

				if "%PACKAGE_TYPE%" == "wheel" goto wheel

				if "%PACKAGE_TYPE%" == "conda" goto conda

				if "%PACKAGE_TYPE%" == "libtorch" goto libtorch

				echo "unknown package type"

				@ -37,6 +36,7 @@ exit /b 1

				echo "install wheel package"

				set PYTHON_INSTALLER_URL=

				if "%DESIRED_PYTHON%" == "3.13t" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.13" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.12" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.12.0/python-3.12.0-amd64.exe"

				if "%DESIRED_PYTHON%" == "3.11" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.11.0/python-3.11.0-amd64.exe"

				@ -47,6 +47,13 @@ if "%PYTHON_INSTALLER_URL%" == "" (

				    echo Python %DESIRED_PYTHON% not supported yet

				)

				set ADDITIONAL_OPTIONS=""

				set PYTHON_EXEC="python"

				if "%DESIRED_PYTHON%" == "3.13t" (

				    set ADDITIONAL_OPTIONS="Include_freethreaded=1"

				    set PYTHON_EXEC="python3.13t"

				)

				del python-amd64.exe

				curl --retry 3 -kL "%PYTHON_INSTALLER_URL%" --output python-amd64.exe

				if errorlevel 1 exit /b 1

				@ -55,85 +62,30 @@ if errorlevel 1 exit /b 1

				:: the installed Python to PATH system-wide. Even calling set PATH=%ORIG_PATH% later on won't make

				:: a change. As the builder directory will be removed after the smoke test, all subsequent non-binary

				:: jobs will fail to find any Python executable there

				start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 TargetDir=%CD%\Python

				start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 %ADDITIONAL_OPTIONS% TargetDir=%CD%\Python

				if errorlevel 1 exit /b 1

				set "PATH=%CD%\Python%PYTHON_VERSION%\Scripts;%CD%\Python;%PATH%"

				if "%DESIRED_PYTHON%" == "3.13" pip install -q --pre numpy==2.1.0 protobuf

				if "%DESIRED_PYTHON%" == "3.12" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.11" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.10" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.9" pip install -q --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.8" pip install -q numpy protobuf

				if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install --pre numpy==2.2.1 protobuf

				if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install --pre numpy==2.1.2 protobuf

				if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.11" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.10" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf

				if errorlevel 1 exit /b 1

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do %PYTHON_EXEC% -m pip install "%%i"

				if errorlevel 1 exit /b 1

				goto smoke_test

				:conda

				echo "install conda package"

				:: Install Miniconda3

				set "CONDA_HOME=%CD%\conda"

				set "tmp_conda=%CONDA_HOME%"

				set "miniconda_exe=%CD%\miniconda.exe"

				set "CONDA_EXTRA_ARGS=cpuonly -c pytorch-nightly"

				if "%CUDA_VERSION%" == "118" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=11.8 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "121" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.1 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "124" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.4 -c nvidia -c pytorch-nightly"

				)

				if "%CUDA_VERSION%" == "126" (

				    set "CONDA_EXTRA_ARGS=pytorch-cuda=12.6 -c nvidia -c pytorch-nightly"

				)

				rmdir /s /q conda

				del miniconda.exe

				curl -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o "%miniconda_exe%"

				start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%

				if ERRORLEVEL 1 exit /b 1

				set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"

				conda create -qyn testenv python=%DESIRED_PYTHON%

				if errorlevel 1 exit /b 1

				call conda install -yq conda-build

				if errorlevel 1 exit /b 1

				call %CONDA_HOME%\condabin\activate.bat testenv

				if errorlevel 1 exit /b 1

				set "NO_ARCH_PATH=%PYTORCH_FINAL_PACKAGE_DIR:/=\%\noarch"

				mkdir %NO_ARCH_PATH%

				for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *') do xcopy "%%i" %NO_ARCH_PATH% /Y

				if ERRORLEVEL 1 exit /b 1

				call conda index %PYTORCH_FINAL_PACKAGE_DIR%

				if errorlevel 1 exit /b 1

				call conda install -yq -c "file:///%PYTORCH_FINAL_PACKAGE_DIR%" pytorch==%PYTORCH_BUILD_VERSION% -c pytorch -c numba/label/dev -c nvidia

				if ERRORLEVEL 1 exit /b 1

				call conda install -yq numpy

				if ERRORLEVEL 1 exit /b 1

				set /a CUDA_VER=%CUDA_VERSION%

				set CUDA_VER_MAJOR=%CUDA_VERSION:~0,-1%

				set CUDA_VER_MINOR=%CUDA_VERSION:~-1,1%

				set CUDA_VERSION_STR=%CUDA_VER_MAJOR%.%CUDA_VER_MINOR%

				:: Install package we just build

				:smoke_test

				python -c "import torch"

				%PYTHON_EXEC% -c "import torch"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that MKL is available

				python -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"

				%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"

				if ERRORLEVEL 1 exit /b 1

				if "%NVIDIA_GPU_EXISTS%" == "0" (

				@ -142,24 +94,24 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (

				)

				echo Checking that CUDA archs are setup correctly

				python -c "import torch; torch.randn([3,5]).cuda()"

				%PYTHON_EXEC% -c "import torch; torch.randn([3,5]).cuda()"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that magma is available

				python -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"

				%PYTHON_EXEC% -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that CuDNN is available

				python -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"

				%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"

				if ERRORLEVEL 1 exit /b 1

				echo Checking that basic RNN works

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py

				%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py

				if ERRORLEVEL 1 exit /b 1

				echo Checking that basic CNN works

				python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py

				%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py

				if ERRORLEVEL 1 exit /b 1

				goto end

				@ -167,7 +119,6 @@ goto end

				:libtorch

				echo "install and test libtorch"

				if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1

				if ERRORLEVEL 1 exit /b 1

				@ -179,10 +130,6 @@ pushd tmp\libtorch

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				IF "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

									
										5

.ci/pytorch/windows/internal/static_lib_test.bat
									
				@ -70,7 +70,6 @@ echo "install and test libtorch"

				pip install cmake

				echo "installing cmake"

				if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1

				if ERRORLEVEL 1 exit /b 1

				@ -83,10 +82,6 @@ pushd tmp\libtorch

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				IF "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

									
										6

.ci/pytorch/windows/internal/vc_install_helper.bat
									
				@ -1,12 +1,8 @@

				if "%VC_YEAR%" == "2019" powershell windows/internal/vs2019_install.ps1

				if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1

				set VC_VERSION_LOWER=17

				set VC_VERSION_UPPER=18

				if "%VC_YEAR%" == "2019" (

				    set VC_VERSION_LOWER=16

				    set VC_VERSION_UPPER=17

				)

				for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe"  -products Microsoft.VisualStudio.Product.BuildTools -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (

				    if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

									
										48

.ci/pytorch/windows/internal/vs2019_install.ps1
									
				@ -1,48 +0,0 @@

				# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479

				# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers

				# 16.8.6 BuildTools

				$VS_DOWNLOAD_LINK = "https://ossci-windows.s3.us-east-1.amazonaws.com/vs16.8.6_BuildTools.exe"

				$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"

				$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",

				                                                     "--add Microsoft.Component.MSBuild",

				                                                     "--add Microsoft.VisualStudio.Component.Roslyn.Compiler",

				                                                     "--add Microsoft.VisualStudio.Component.TextTemplating",

				                                                     "--add Microsoft.VisualStudio.Component.VC.CoreIde",

				                                                     "--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",

				                                                     "--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",

				                                                     "--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",

				                                                     "--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")

				curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe

				if ($LASTEXITCODE -ne 0) {

				    echo "Download of the VS 2019 Version 16.8.6 installer failed"

				    exit 1

				}

				if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {

				    $existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath

				    if ($existingPath -ne $null) {

				        if (!${env:CIRCLECI}) {

				            echo "Found correctly versioned existing BuildTools installation in $existingPath"

				            exit 0

				        }

				        echo "Found existing BuildTools installation in $existingPath, keeping it"

				    }

				}

				$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru

				Remove-Item -Path vs_installer.exe -Force

				$exitCode = $process.ExitCode

				if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {

				    echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."

				    curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe

				    if ($LASTEXITCODE -ne 0) {

				        echo "Download of the VS Collect tool failed."

				        exit 1

				    }

				    Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru

				    New-Item -Path "C:\w\build-results" -ItemType "directory" -Force

				    Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"

				    exit 1

				}

									
										14

.ci/pytorch/windows/internal/xpu_install.bat
									
				@ -7,6 +7,9 @@ if not "%CUDA_VERSION%" == "xpu" (

				    exit /b 0

				)

				set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%

				if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"

				set XPU_INSTALL_MODE=%~1

				if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start

				if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start

				@ -117,3 +120,14 @@ if errorlevel 1 exit /b 1

				del xpu_extra.exe

				:xpu_install_end

				if not "%XPU_ENABLE_KINETO%"=="1" goto install_end

				:: Install Level Zero SDK

				set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

				curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				echo "Installing level zero SDK..."

				7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"

				set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"

				del "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				:install_end

									
										5

.ci/pytorch/windows/xpu.bat
									
				@ -28,11 +28,6 @@ call "%XPU_BUNDLE_ROOT%\compiler\latest\env\vars.bat"

				call "%XPU_BUNDLE_ROOT%\ocloc\latest\env\vars.bat"

				IF ERRORLEVEL 1 goto :eof

				:: Workaround for https://github.com/pytorch/pytorch/issues/134989

				set CMAKE_SHARED_LINKER_FLAGS=/FORCE:MULTIPLE

				set CMAKE_MODULE_LINKER_FLAGS=/FORCE:MULTIPLE

				set CMAKE_EXE_LINKER_FLAGS=/FORCE:MULTIPLE

				if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..

				call %~dp0\internal\copy_cpu.bat

				IF ERRORLEVEL 1 goto :eof

									
										48

.ci/wheel/build_wheel.sh
									
				@ -130,7 +130,19 @@ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

				SETUPTOOLS_PINNED_VERSION="=46.0.0"

				PYYAML_PINNED_VERSION="=5.3"

				EXTRA_CONDA_INSTALL_FLAGS=""

				CONDA_ENV_CREATE_FLAGS=""

				RENAME_WHEEL=true

				case $desired_python in

				    3.13t)

				        echo "Using 3.13 deps"

				        SETUPTOOLS_PINNED_VERSION=">=68.0.0"

				        PYYAML_PINNED_VERSION=">=6.0.1"

				        NUMPY_PINNED_VERSION="=2.1.0"

				        CONDA_ENV_CREATE_FLAGS="python-freethreading"

				        EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"

				        desired_python="3.13"

				        RENAME_WHEEL=false

				        ;;

				    3.13)

				        echo "Using 3.13 deps"

				        SETUPTOOLS_PINNED_VERSION=">=68.0.0"

				@ -169,16 +181,15 @@ esac

				# Install into a fresh env

				tmp_env_name="wheel_py$python_nodot"

				conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python"

				conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}

				source activate "$tmp_env_name"

				pip install -q "numpy=${NUMPY_PINNED_VERSION}"  "pyyaml${PYYAML_PINNED_VERSION}" requests

				retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq  llvm-openmp=14.0.6 cmake ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions

				retry pip install -qr "${pytorch_rootdir}/requirements.txt" || true

				pip install "numpy=${NUMPY_PINNED_VERSION}"  "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions

				retry pip install -r "${pytorch_rootdir}/requirements.txt" || true

				retry brew install libomp

				# For USE_DISTRIBUTED=1 on macOS, need libuv and pkg-config to find libuv.

				# For USE_DISTRIBUTED=1 on macOS, need libuv, which is built as part of the tensorpipe submodule

				export USE_DISTRIBUTED=1

				retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq libuv pkg-config

				if [[ -n "$CROSS_COMPILE_ARM64" ]]; then

				    export CMAKE_OSX_ARCHITECTURES=arm64

				@ -220,30 +231,13 @@ echo "The wheel is in $(find $whl_tmp_dir -name '*.whl')"

				wheel_filename_gen=$(find $whl_tmp_dir -name '*.whl' | head -n1 | xargs -I {} basename {})

				popd

				if [[ -z "$BUILD_PYTHONLESS" ]]; then

				if [[ -z "$BUILD_PYTHONLESS" && $RENAME_WHEEL == true  ]]; then

				    # Copy the whl to a final destination before tests are run

				    echo "Renaming Wheel file: $wheel_filename_gen to $wheel_filename_new"

				    cp "$whl_tmp_dir/$wheel_filename_gen" "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_new"

				    ##########################

				    # now test the binary, unless it's cross compiled arm64

				    if [[ -z "$CROSS_COMPILE_ARM64" ]]; then

				        pip uninstall -y "$TORCH_PACKAGE_NAME" || true

				        pip uninstall -y "$TORCH_PACKAGE_NAME" || true

				        # Create new "clean" conda environment for testing

				        conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "test_conda_env" python="$desired_python"

				        conda activate test_conda_env

				        pip install "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_new" -v

				        echo "$(date) :: Running tests"

				        # TODO: Add real tests, as run_test.sh from builder is a glorified no-op

				        # pushd "$pytorch_rootdir"

				        # "${SOURCE_DIR}/../run_tests.sh" 'wheel' "$desired_python" 'cpu'

				        # popd

				        echo "$(date) :: Finished tests"

				    fi

				elif [[ $RENAME_WHEEL == false ]]; then

				    echo "Copying Wheel file: $wheel_filename_gen to $PYTORCH_FINAL_PACKAGE_DIR"

				    cp "$whl_tmp_dir/$wheel_filename_gen" "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_gen"

				else

				    pushd "$pytorch_rootdir"
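The branch structure around `RENAME_WHEEL` can be summarized as a small decision function. This is a sketch in Python with hypothetical names. It assumes the `&& $RENAME_WHEEL == true` condition and the `elif` branch are the versions in effect, and it only labels the final `else` branch, whose body is not fully shown in this hunk.

```python
def wheel_action(build_pythonless: str, rename_wheel: bool) -> str:
    """Pick the post-build step the way the if/elif/else chain does."""
    if build_pythonless == "" and rename_wheel:
        return "rename"      # copy under $wheel_filename_new
    if not rename_wheel:
        return "copy"        # copy under the generated filename
    return "pythonless"      # else branch, body not shown in the hunk


# The 3.13t case sets RENAME_WHEEL=false, so it takes the "copy" path.
print(wheel_action("", False))
```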

									
										2

.circleci/codegen_validation/normalize_yaml_fragment.py
									
				@ -7,7 +7,7 @@ import yaml

				# Need to import modules that lie on an upward-relative path

				sys.path.append(os.path.join(sys.path[0], ".."))

				sys.path.append(os.path.dirname(sys.path[0]))

				import cimodel.lib.miniyaml as miniyaml
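The two `sys.path.append` variants above point at the same directory (absent symlinks) but differ in the string that ends up on `sys.path`. A short sketch of the difference, using a hypothetical script directory in place of `sys.path[0]`:

```python
import os.path

# Hypothetical stand-in for sys.path[0], the directory of the running script.
script_dir = os.path.join("repo", ".circleci", "codegen_validation")

joined = os.path.join(script_dir, "..")   # keeps a literal ".." component
parent = os.path.dirname(script_dir)      # names the parent directly

print(joined)
print(parent)
print(os.path.normpath(joined) == parent)
```

The `os.path.dirname` form yields an already-normalized entry, which is easier to read when `sys.path` is printed for debugging.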

									
										2

.circleci/scripts/binary_linux_test.sh
									
				@ -94,6 +94,8 @@ if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_

				  python /pytorch/.ci/pytorch/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled

				fi

				# Clean temp files

				cd /pytorch/.ci/pytorch/ && git clean -ffdx

				# =================== The above code will be executed inside Docker container ===================

				EOL

									
										11

.circleci/scripts/binary_macos_build.sh
									
				@ -1,11 +0,0 @@

				#!/bin/bash

				set -eux -o pipefail

				source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"

				mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"

				# Build

				export USE_PYTORCH_METAL_EXPORT=1

				export USE_COREML_DELEGATE=1

				export TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"

				"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"

									
										13

.circleci/scripts/binary_populate_env.sh
									
				@ -30,9 +30,7 @@ fi

				# Pick docker image

				export DOCKER_IMAGE=${DOCKER_IMAGE:-}

				if [[ -z "$DOCKER_IMAGE" ]]; then

				  if [[ "$PACKAGE_TYPE" == conda ]]; then

				    export DOCKER_IMAGE="pytorch/conda-cuda"

				  elif [[ "$DESIRED_CUDA" == cpu ]]; then

				  if [[ "$DESIRED_CUDA" == cpu ]]; then

				    export DOCKER_IMAGE="pytorch/manylinux:cpu"

				  else

				    export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"

				@ -63,7 +61,7 @@ if tagged_version >/dev/null; then

				  # Turns tag v1.6.0-rc1 -> 1.6.0

				  BASE_BUILD_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"

				fi

				if [[ "$(uname)" == 'Darwin' ]] || [[ "$PACKAGE_TYPE" == conda ]]; then

				if [[ "$(uname)" == 'Darwin' ]]; then

				  export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}"

				else

				  export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"

				@ -75,9 +73,8 @@ export PYTORCH_BUILD_NUMBER=1

				TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)

				# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT

				-TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

				-if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				-  # Only linux Python < 3.13 are supported wheels for triton

				+TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"

				+if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then

				  TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				  if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				      TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

				@ -150,8 +147,6 @@ export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:

				# TODO: We don't need this anymore IIUC

				export TORCH_PACKAGE_NAME='torch'

				-export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'

				-export ANACONDA_USER='pytorch'

				export USE_FBGEMM=1

				export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"
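
Reviewer note: two version-string steps in the `binary_populate_env.sh` hunks above are easy to misread, so here is a minimal sketch of them in isolation. The helper names `normalize_tag` and `build_version` are hypothetical and do not appear in the script; only the `sed` expressions and the Darwin check are taken from the diff. Note that the `sed` pipeline also strips the leading `v`, so the tag `v1.6.0-rc1` yields `1.6.0`.

```shell
#!/bin/sh

# Hypothetical helper: strip a leading "v" and any "-suffix" from a tag,
# using the same two sed expressions as the BASE_BUILD_VERSION line.
normalize_tag() {
  printf '%s\n' "$1" | sed -e 's/^v//' -e 's/-.*$//'
}

# Hypothetical helper: mirror the post-change branch, where only Darwin
# builds omit the "+$DESIRED_CUDA" local version suffix.
build_version() {
  local base="$1" os_name="$2" desired_cuda="$3"
  if [ "${os_name}" = "Darwin" ]; then
    printf '%s\n' "${base}"
  else
    printf '%s\n' "${base}+${desired_cuda}"
  fi
}

normalize_tag "v1.6.0-rc1"              # prints 1.6.0
build_version "1.6.0" "Linux" "cu124"   # prints 1.6.0+cu124
build_version "1.6.0" "Darwin" "cpu"    # prints 1.6.0
```

Passing the OS name as an argument instead of calling `uname` inside the helper is a choice made here only so the branch can be exercised on any machine.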

									
										43

.circleci/scripts/binary_upload.sh
									
												
				@ -2,7 +2,7 @@

				set -euo pipefail

				-PACKAGE_TYPE=${PACKAGE_TYPE:-conda}

				+PACKAGE_TYPE=${PACKAGE_TYPE:-wheel}

				PKG_DIR=${PKG_DIR:-/tmp/workspace/final_pkgs}

				@ -18,10 +18,8 @@ BUILD_NAME=${BUILD_NAME:-}

				DRY_RUN=${DRY_RUN:-enabled}

				# Don't actually do work unless explicit

				-ANACONDA="true anaconda"

				AWS_S3_CP="aws s3 cp --dryrun"

				if [[ "${DRY_RUN}" = "disabled" ]]; then

				-  ANACONDA="anaconda"

				  AWS_S3_CP="aws s3 cp"

				fi

				@ -34,10 +32,6 @@ if [[ ${BUILD_NAME} == *-full* ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"

				fi

				-# Sleep 2 minutes between retries for conda upload

				-retry () {

				-  "$@"  || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")

				-}

				do_backup() {

				  local backup_dir

				@ -49,20 +43,6 @@ do_backup() {

				  )

				}

				-conda_upload() {

				-  (

				-    set -x

				-    retry \

				-    ${ANACONDA} \

				-    upload  \

				-    ${PKG_DIR}/*.tar.bz2 \

				-    -u "pytorch-${UPLOAD_CHANNEL}" \

				-    --label main \

				-    --no-progress \

				-    --force

				-  )

				-}

				s3_upload() {

				  local extension

				  local pkg_type

				@ -78,31 +58,18 @@ s3_upload() {

				    for pkg in ${PKG_DIR}/*.${extension}; do

				      (

				        set -x

				-        ${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}"

				+        shm_id=$(sha256sum "${pkg}" | awk '{print $1}')

				+        ${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}" \

				+          --metadata "checksum-sha256=${shm_id}"

				      )

				    done

				  )

				}

				# Install dependencies (should be a no-op if previously installed)

				-conda install -yq anaconda-client

				-pip install -q awscli

				+pip install -q awscli uv

				case "${PACKAGE_TYPE}" in

				-  conda)

				-    conda_upload

				-    for conda_archive in ${PKG_DIR}/*.tar.bz2; do

				-      # Fetch  platform (eg. win-64, linux-64, etc.) from index file because

				-      # there's no actual conda command to read this

				-      subdir=$(\

				-        tar -xOf "${conda_archive}" info/index.json \

				-          | grep subdir  \

				-          | cut -d ':' -f2 \

				-          | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \

				-      )

				-      BACKUP_DIR="conda/${subdir}"

				-    done

				-    ;;

				  libtorch)

				    s3_upload "zip" "libtorch"

				    BACKUP_DIR="libtorch/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"
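
Reviewer note: the `s3_upload` hunk above now attaches a SHA-256 digest as object metadata, and the dry-run default is kept. The sketch below shows that pattern without calling AWS: it only prints the command that would run. The helper names `file_sha256` and `upload_command` and the bucket `s3://example-bucket/whl/` are hypothetical; `sha256sum` is assumed to be available (GNU coreutils), as it is in the diff.

```shell
#!/bin/sh

# Dry-run by default, as in the script: work happens only when DRY_RUN=disabled.
DRY_RUN="${DRY_RUN:-enabled}"

# Hypothetical helper: print the SHA-256 hex digest of a file.
file_sha256() {
  sha256sum "$1" | awk '{print $1}'
}

# Hypothetical helper: print (not execute) the upload command for one package.
upload_command() {
  local pkg="$1" dest="$2" cp_cmd="aws s3 cp --dryrun"
  if [ "${DRY_RUN}" = "disabled" ]; then
    cp_cmd="aws s3 cp"
  fi
  printf '%s\n' "${cp_cmd} --no-progress --acl public-read ${pkg} ${dest} --metadata checksum-sha256=$(file_sha256 "${pkg}")"
}

# Demo on a throwaway file so nothing real is read or uploaded.
demo_pkg="$(mktemp)"
printf 'example' > "${demo_pkg}"
upload_command "${demo_pkg}" "s3://example-bucket/whl/"
rm -f "${demo_pkg}"
```

Printing the command rather than running it keeps the sketch free of credentials and network access; the real script executes `${AWS_S3_CP}` directly.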

									
										3

.circleci/scripts/binary_windows_build.sh
									
												
				@ -8,10 +8,9 @@ export CUDA_VERSION="${DESIRED_CUDA/cu/}"

				export USE_SCCACHE=1

				export SCCACHE_BUCKET=ossci-compiler-cache

				export SCCACHE_IGNORE_SERVER_IO_ERROR=1

				-export VC_YEAR=2019

				+export VC_YEAR=2022

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				-    export VC_YEAR=2022

				    export USE_SCCACHE=0

				    export XPU_VERSION=2025.0

				fi

									
										3

.circleci/scripts/binary_windows_test.sh
									
												
				@ -4,10 +4,9 @@ set -eux -o pipefail

				source "${BINARY_ENV_FILE:-/c/w/env}"

				export CUDA_VERSION="${DESIRED_CUDA/cu/}"

				-export VC_YEAR=2019

				+export VC_YEAR=2022

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				-    export VC_YEAR=2022

				    export XPU_VERSION=2025.0

				fi

2

.clang-format


 @ -106,6 +106,8 @@ StatementMacros:
   - C10_DEFINE_int32
   - C10_DEFINE_int64
   - C10_DEFINE_string
+  - C10_DEFINE_REGISTRY_WITHOUT_WARNING
+  - C10_REGISTER_CREATOR
   - DEFINE_BINARY
   - PyObject_HEAD
   - PyObject_VAR_HEAD

15

.clang-tidy


 @ -1,8 +1,9 @@
 ---
 # NOTE there must be no spaces before the '-', so put the comma last.
-# The check bugprone-unchecked-optional-access is also turned off atm
-# because it causes clang-tidy to hang randomly. The tracking issue
+# The check bugprone-unchecked-optional-access is also turned on.
+# Note that it can cause clang-tidy to hang randomly. The tracking issue
 # can be found at https://github.com/llvm/llvm-project/issues/69369.
+# When that happens, we can disable it on the problematic code by NOLINT.
 InheritParentConfig: true
 Checks: '
 bugprone-*,
 @ -12,7 +13,10 @@ bugprone-*,
 -bugprone-lambda-function-name,
 -bugprone-reserved-identifier,
 -bugprone-swapped-arguments,
--bugprone-unchecked-optional-access,
+clang-analyzer-core.*,
+clang-analyzer-cplusplus.*,
+clang-analyzer-nullability.*,
+clang-analyzer-deadcode.*,
 clang-diagnostic-missing-prototypes,
 cppcoreguidelines-*,
 -cppcoreguidelines-avoid-do-while,
 @ -55,10 +59,11 @@ readability-container-size-empty,
 readability-delete-null-pointer,
 readability-duplicate-include
 readability-misplaced-array-index,
-readability-redundant-function-ptr-dereference,
-readability-redundant-smartptr-get,
+readability-redundant*
 readability-simplify-subscript-expr,
 readability-string-compare,
+-readability-redundant-access-specifiers,
+-readability-redundant-control-flow,
 '
 HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
 WarningsAsErrors: '*'

58

.git-blame-ignore-revs


 @ -24,6 +24,10 @@ e3900d2ba5c9f91a24a9ce34520794c8366d5c54
 e26976ad3b06ce95dd6afccfdbe124802edf28f
 # 2021-06-07 Strictly typed everything in `.github` and `tools`
 d920b21db9b4292d056ee1329945990656304
+# 2021-08-12 [codemod][lint][fbcode/c*] Enable BLACK by default
+b0043072529b81276a69df29e00555333117646c
+# 2021-08-25 Reformat run_test.py
+d8e7b659b19e1ee68208b28bfa7dba73375dbc
 # 2022-06-09 Apply clang-format to ATen headers
 b15c266baaf989ef7b6bbd7c23a2d90bacf687
 # 2022-06-11 [lint] autoformat test/cpp and torch/csrc
 @ -44,3 +48,57 @@ a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
 d80939e5e9337e8078f11489afefec59fd42f93b
 # 2024-06-28 enable UFMT in `torch.utils.data`
 cf0b90e49689d45be91aa539fdf54cf2ea8a9a3
+# 2024-07-03 Enable UFMT on test/test_public_bindings.py (#128389)
+fe5424d0f8604f6e66d827ae9f94b05cb7119d55
+# 2024-07-03 Enable UFMT on test/test_public_bindings.py (#128389)
+c686304277f7cd72331f685605325498cff94a0b
+# 2024-07-15 Enable UFMT on all of torch/sparse (#130545)
+535016967ae65a6027f83d6b935a985996223d49
+# 2024-07-15 [BE][Easy][1/19] enforce style for empty lines in import segments (#129752)
+a3abfa5cb57203b6a8ba7dff763f4057db8282a8
+# 2024-07-15 [BE][Easy][2/19] enforce style for empty lines in import segments in `.ci/` and `.github/` (#129753)
+ba48cf653541e9160dfdefa7bfea885c22e48608
+# 2024-07-16 [BE][Easy][5/19] enforce style for empty lines in import segments in `tools/` and `torchgen/` (#129756)
+f6838d521a243dbedc50ae96575720bf2cc8a8ad
+# 2024-07-17 [BE][Easy][9/19] enforce style for empty lines in import segments in `test/[e-h]*/` (#129760)
+cf69184bd462b9add40f893f57675f8a057
+# 2024-07-16 [BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754)
+c0ed38e644aed812d76b0ec85fae2f6019bf462b
+# 2024-07-16 [BE][Easy][4/19] enforce style for empty lines in import segments in `functorch/` (#129755)
+fb229660f388feddc288c127ab12c82e67d36
+# 2024-07-17 [BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)
+aecc746fccc4495313167e3a7f94210daf457e1d
+# 2024-07-18 Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)"
+b732b52f1e4378f8486ceb5e7026be3321c2651c
+# 2024-07-18 [BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)
+bc4fc34bb02795aa694e66b132dcea5dde1e1
+# 2024-07-26 [BE][Easy][8/19] enforce style for empty lines in import segments in `test/[k-p]*/` (#129759)
+fbe6f42dcf1834213e0baa87b87529161df3c4d7
+# 2024-07-31 [BE][Easy][14/19] enforce style for empty lines in import segments in `torch/_[a-c]*/` and `torch/_[e-h]*/` and `torch/_[j-z]*/` (#129765)
+e7eeee473c6cb45942e87de5a616b0eb635513d6
+# 2024-07-31 Fix lint after PR #130572 (#132316)
+d72e863b3ecd3de4c8ea00518e110da964583f4f
+# 2024-07-31 [BE][Easy][15/19] enforce style for empty lines in import segments in `torch/_d*/` (#129767)
+e74ba1b34a476b46e76b4e32afe2d481f97e9a47
+# 2024-07-31 [BE][Easy][18/19] enforce style for empty lines in import segments in `torch/d*/` (#129770)
+b25ef91bf158ce459d8654e33c50c8e6ed8db716
+# 2024-07-20 [BE][Easy][13/19] enforce style for empty lines in import segments in `test/j*/` (#129764)
+ff1e43a416c43cd82b210e22ac47384494c172e
+# 2024-11-01 [Lint] Clang-format all metal kernels (#139530)
+b3ad45733bd908b7358959ca1e1f8d026f4507eb
+# 2024-11-17 [BE][MPS] Apply clang-format to mps headers (#140906)
+a297c179862af38ee86bac2051434d3db41
+# 2024-11-27 Apply clang-format for ATen/core/boxing headers (#141105)
+d01a1ef0c0d65768eb0a5c97a25328eec57fbd
+# 2024-12-05 fix the lint from D66795414 (#142122)
+c2086d452ae6966ce9d7fb3cb2eef2fd0d2add
+# 2024-12-20 Apply clang-format for ATen/core/dispatch headers (#143620)
+cee06e74eeb54994b97000a02b715a4e63a97951
+# 2024-12-22 Better fix for f-strings in set_linter for py3.12 (#143725)
+eebc93d41eeffb936cbf20c9052e1e813d0cc052
+# 2025-01-04 [mps/BE] Fix linter warning/advice. (#144199)
+dc1e6be192b260f1c072d70e1b06a3ac8e109fa
+# 2025-01-07 Fix lint in `test_provenance_tracing.py` (#144296)
+c0a3d1cbaf6420e40ab0f9c9019daa21145e69
+# 2025-01-09 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
+dcc3cf7066b4d8cab63ecb73daf1e36b01220a4e

									
										2

.github/ISSUE_TEMPLATE/bug-report.yml
									
										vendored
									
												
				@ -5,7 +5,7 @@ body:

				- type: markdown

				  attributes:

				    value: >

				-      #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/pytorch/pytorch/issues?q=is%3Aissue+sort%3Acreated-desc+).

				+      #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/pytorch/pytorch/issues?q=is%3Aissue+sort%3Acreated-desc+). Note: Please write your bug report in English to ensure it can be understood and addressed by the development team. If you are filing a bug for torch.compile, please use the [torch.compile issue template](https://github.com/pytorch/pytorch/issues/new?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen&template=pt2-bug-report.yml).

				- type: textarea

				  attributes:

				    label: 🐛 Describe the bug

									
										2

.github/ISSUE_TEMPLATE/disable-ci-jobs.md
									
										vendored
									
												
				@ -5,7 +5,7 @@ title: "DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]"

				labels: "module: ci"

				---

				-> For example, DISABLED pull / win-vs2019-cpu-py3 / test (default). Once

				+> For example, DISABLED pull / win-vs2022-cpu-py3 / test (default). Once

				> created, the job will be disabled within 15 minutes. You can check the

				> list of disabled jobs at https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json

									
										4

.github/ISSUE_TEMPLATE/documentation.yml
									
										vendored
									
												
				@ -2,6 +2,10 @@ name: 📚 Documentation

				description: Report an issue related to https://pytorch.org/docs/stable/index.html

				body:

				+- type: markdown

				+  attributes:

				+    value: >

				+      #### Note: Please report your documentation issue in English to ensure it can be understood and addressed by the development team.

				- type: textarea

				  attributes:

				    label: 📚 The doc issue

									
										4

.github/ISSUE_TEMPLATE/feature-request.yml
									
										vendored
									
												
				@ -2,6 +2,10 @@ name: 🚀 Feature request

				description: Submit a proposal/request for a new PyTorch feature

				body:

				+- type: markdown

				+  attributes:

				+    value: >

				+      #### Note: Please write your feature request in English to ensure it can be understood and addressed by the development team.

				- type: textarea

				  attributes:

				    label: 🚀 The feature, motivation and pitch

									
										6

.github/ISSUE_TEMPLATE/pt2-bug-report.yml
									
										vendored
									
												
				@ -3,6 +3,10 @@ description: Create a report to help us reproduce and fix the bug

				labels: ["oncall: pt2"]

				body:

				+  - type: markdown

				+    attributes:

				+      value: >

				+        #### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.

				  - type: markdown

				    attributes:

				      value: >

				@ -18,6 +22,8 @@ body:

				        - If comparing eager and torch.compile at fp16/bf16, you should use fp32 as baseline

				        - Ensure rng state used to compare results is equivalent. Use `torch._inductor.config.fallback_random=True` and reset the torch rng seed between comparisons

				+        If the above requirements are met, add the label "topic: fuzzer" to your issue.

				  - type: textarea

									
										4

.github/actionlint.yaml
									
										vendored
									
												
				@ -42,8 +42,10 @@ self-hosted-runner:

				    - windows.8xlarge.nvidia.gpu

				    - windows.8xlarge.nvidia.gpu.nonephemeral

				    - windows.g5.4xlarge.nvidia.gpu

				-    # Organization-wide AMD hosted MI300 runners

				+    # Organization-wide AMD hosted runners

				    - linux.rocm.gpu

				+    - linux.rocm.gpu.2

				+    - linux.rocm.gpu.4

				    # Repo-specific Apple hosted  runners

				    - macos-m1-ultra

				    - macos-m2-14

									
										4

.github/actions/checkout-pytorch/action.yml
									
										vendored
									
												
				@ -41,10 +41,10 @@ runs:

				        mkdir "${GITHUB_WORKSPACE}"

				    - name: Checkout PyTorch

				-      uses: malfet/checkout@silent-checkout

				+      uses: actions/checkout@v4

				      with:

				        ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

				        # --depth=1 for speed, manually fetch history and other refs as necessary

				        fetch-depth: ${{ inputs.fetch-depth }}

				        submodules: ${{ inputs.submodules }}

				-        quiet-checkout: true

				+        show-progress: false

									
										4

.github/actions/diskspace-cleanup/action.yml
									
										vendored
									
												
				@ -17,6 +17,10 @@ runs:

				        set -ex

				        diskspace_cutoff=${{ inputs.diskspace-cutoff }}

				        docker_root_dir=$(docker info -f '{{.DockerRootDir}}')

				+        if [ ! -d "$docker_root_dir" ]; then

				+            echo "Docker root directory ($docker_root_dir) does not exist. Skipping disk space check."

				+            exit 0

				+        fi

				        diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')

				        msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"

				        if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then
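
Reviewer note: the disk-space check above extracts a usage percentage from `df` output and compares it against a cutoff. A minimal sketch of just the parsing and comparison follows, so it can be tried without Docker or GNU `df`. The helper names `usage_percent` and `over_cutoff` are hypothetical. One deliberate difference from the action: this sketch uses `tr -d ' %'` to drop every space, whereas the two `sed` calls in the action remove only the `%` and the first space.

```shell
#!/bin/sh

# Hypothetical helper: turn a df percentage field such as "  5%" into "5".
usage_percent() {
  printf '%s\n' "$1" | tr -d ' %'
}

# Hypothetical helper: succeed when usage is at or above the cutoff.
over_cutoff() {
  [ "$(usage_percent "$1")" -ge "$2" ]
}

usage_percent "  5%"   # prints 5

if over_cutoff " 91%" 90; then
  echo "disk usage is at or above the cutoff"
fi
```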

									
										58

.github/actions/setup-rocm/action.yml
									
										vendored
									
												
				@ -5,20 +5,6 @@ description: Set up ROCm host for CI

				runs:

				  using: composite

				  steps:

				-    - name: Set DOCKER_HOST

				-      shell: bash

				-      run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}"

				-    - name: Remove leftover Docker config file

				-      shell: bash

				-      continue-on-error: true

				-      run: |

				-        set -ex

				-        cat ~/.docker/config.json || true

				-        # https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not

				-        rm -f ~/.docker/config.json

				    - name: Stop all running docker containers

				      if: always()

				      shell: bash

				@ -38,6 +24,12 @@ runs:

				        cat /opt/rocm/.info/version || true

				        whoami

				+    - name: Runner health check amdgpu info

				+      if: always()

				+      shell: bash

				+      run: |

				+        dpkg -l | grep -E "  amdgpu"

				    - name: Runner health check rocm-smi

				      if: always()

				      shell: bash

				@ -68,7 +60,7 @@ runs:

				        fi

				    - name: Runner diskspace health check

				-      uses: ./.github/actions/diskspace-cleanup

				+      uses: pytorch/pytorch/.github/actions/diskspace-cleanup@main

				      if: always()

				    - name: Runner health check disconnect on failure

				@ -77,14 +69,44 @@ runs:

				      run: |

				        killall runsvc.sh

				+    - name: Setup useful environment variables

				+      shell: bash

				+      run: |

				+        RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts"

				+        rm -rf "${RUNNER_ARTIFACT_DIR}"

				+        mkdir -p "${RUNNER_ARTIFACT_DIR}"

				+        echo "RUNNER_ARTIFACT_DIR=${RUNNER_ARTIFACT_DIR}" >> "${GITHUB_ENV}"

				+        RUNNER_TEST_RESULTS_DIR="${RUNNER_TEMP}/test-results"

				+        rm -rf "${RUNNER_TEST_RESULTS_DIR}"

				+        mkdir -p "${RUNNER_TEST_RESULTS_DIR}"

				+        echo "RUNNER_TEST_RESULTS_DIR=${RUNNER_TEST_RESULTS_DIR}" >> "${GITHUB_ENV}"

				+        RUNNER_DOCS_DIR="${RUNNER_TEMP}/docs"

				+        rm -rf "${RUNNER_DOCS_DIR}"

				+        mkdir -p "${RUNNER_DOCS_DIR}"

				+        echo "RUNNER_DOCS_DIR=${RUNNER_DOCS_DIR}" >> "${GITHUB_ENV}"

				    - name: Preserve github env variables for use in docker

				      shell: bash

				      run: |

				-        env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				-        env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				+        env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}"

				+        env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}"

				    - name: ROCm set GPU_FLAG

				      shell: bash

				      run: |

				        # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py.

				-        echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"

				+        # Add render group for container creation.

				+        render_gid=`cat /etc/group | grep render | cut -d: -f3`

				+        # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.

				+        if [ -f "/etc/podinfo/gha-render-devices" ]; then

				+          DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)

				+        else

				+          DEVICE_FLAG="--device /dev/dri"

				+        fi

				+        # The --group-add daemon and --group-add bin are needed in the Ubuntu 24.04 and Almalinux OSs respectively.

				+        # This is due to the device files (/dev/kfd & /dev/dri) being owned by video group on bare metal.

				+        # This video group ID maps to subgid 1 inside the docker image due to the /etc/subgid entries.

				+        # The group name corresponding to group ID 1 can change depending on the OS, so both are necessary.

				+        echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}"
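
Reviewer note: the new `GPU_FLAG` step looks up the numeric ID of the `render` group by grepping `/etc/group`. The sketch below shows one way to do that lookup with an exact match on the group name; it is a variation for illustration, not what the action runs. The helper name `group_gid` and the sample group file are hypothetical. A plain `grep render` would also match a group named, say, `renderers`, which is why the sketch compares the first field exactly.

```shell
#!/bin/sh

# Hypothetical helper: print the GID of a group by exact name.
# The second argument (group file path) defaults to /etc/group.
group_gid() {
  local name="$1" file="${2:-/etc/group}"
  awk -F: -v n="${name}" '$1 == n { print $3; exit }' "${file}"
}

# Demo against a throwaway file in /etc/group format.
sample_groups="$(mktemp)"
printf 'video:x:44:\nrender:x:109:\nrenderers:x:200:\n' > "${sample_groups}"
group_gid render "${sample_groups}"   # prints 109
rm -f "${sample_groups}"
```

Taking the group file as a parameter is only there so the lookup can be exercised on machines without a `render` group.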

									
										1

.github/actions/test-pytorch-binary/action.yml
									
										vendored
									
												
				@ -13,7 +13,6 @@ runs:

				        container_name=$(docker run \

				          ${GPU_FLAG:-} \

				          -e BINARY_ENV_FILE \

				-          -e BUILDER_ROOT \

				          -e BUILD_ENVIRONMENT \

				          -e DESIRED_CUDA \

				          -e DESIRED_DEVTOOLSET \

									
										56

.github/actions/upload-utilization-stats/action.yml
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,56 @@

				name: upload-utilization-stats

				description: Upload utilization stats to artifacts

				inputs:

				    workflow_run_id:

				      type: string

				      description: 'workflow (run) id of the workflow the test is running'

				      required: True

				    workflow_attempt:

				      type: string

				      description: 'the workflow (run) attempt'

				      required: True

				    workflow_name:

				      description: 'name of the workflow'

				      type: string

				      required: True

				    job_id:

				      type: string

				      description: 'the job (run) id for the test'

				      required: True

				    job_name:

				      type: string

				      description: 'the job name of the test'

				      required: True

				runs:

				  using: composite

				  steps:

				    - name: Print Inputs

				      shell: bash

				      run: |

				        echo "workflow_id: ${{inputs.workflow_run_id}}"

				        echo "workflow_attempt: ${{inputs.workflow_attempt}}"

				        echo "workflow_Name: ${{inputs.workflow_name}}"

				        echo "job_id: ${{inputs.job_id}}"

				        echo "job_name:  ${{inputs.job_name}}"

				    - uses: nick-fields/retry@v3.0.0

				      name: Setup dependencies

				      with:

				        shell: bash

				        timeout_minutes: 5

				        max_attempts: 5

				        retry_wait_seconds: 30

				        command: |

				          set -eu

				          python3 -m pip install python-dateutil==2.8.2 boto3==1.35.42 pandas==2.1.3

				    - name: Upload utilizatoin stats to s3

				      shell: bash

				      run: |

				        python3 -m tools.stats.upload_utilization_stats.upload_utilization_stats \

				          --workflow-run-id "${{inputs.workflow_run_id}}" \

				          --workflow-name "${{inputs.workflow_name}}" \

				          --workflow-run-attempt "${{inputs.workflow_attempt}}" \

				          --job-id "${{inputs.job_id}}" \

				          --job-name "${{inputs.job_name}}"

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
-d4b300f00a0d862e3cfe1495db3b1a14f9
+f084f34bbb743fada85f66b0ed8041387565e69c

1

.github/ci_commit_pins/fbgemm_rocm.txt vendored Normal file


 @ -0,0 +1 @@
+5fb5024118e9bb9decf96c2b0b1a8f0010bf56be

2

.github/ci_commit_pins/torchbench.txt vendored


 @ -1 +1 @@
-a5e3a189384659fd35a68c3b17b88c761aaac
+ffb19dc470f4423a3176a4133f8f4b3cdb5bd

2

.github/ci_commit_pins/xla.txt vendored


 @ -1 +1 @@
-f54ba5bd7fb83d7ba81fe6f5e05fb6ee815d6f
+b2b890e962f5fb6f481e5da2eb4a43bb990d0f1b

									
										9

.github/labeler.yml
									
										vendored
									
												
				@ -30,9 +30,9 @@

				- torch/fx/experimental/sym_node.py

				- torch/fx/experimental/validator.py

				- torch/fx/experimental/proxy_tensor.py

				-- test/distributed/_tensor/test_dtensor_compile.py

				+- test/distributed/tensor/test_dtensor_compile.py

				- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py

				-- torch/distributed/_tensor/**

				+- torch/distributed/tensor/**

				- torch/distributed/fsdp/**

				- torch/csrc/inductor/**

				- torch/csrc/dynamo/**

				@ -107,3 +107,8 @@

				- torch/csrc/dynamo/compiled_autograd.h

				- torch/_dynamo/compiled_autograd.py

				- torch/inductor/test_compiled_autograd.py

				+"ciflow/xpu":

				+- torch/csrc/inductor/aoti_include/xpu.h

				+- torch/csrc/inductor/cpp_wrapper/device_internal/xpu.h

				+- torch/csrc/inductor/cpp_wrapper/xpu.h

Compare commits

2398 Commits v2.6.0-rc9 ... yguo/patch

19 .ci/aarch64_linux/aarch64_ci_build.sh
6 .ci/aarch64_linux/aarch64_ci_setup.sh
29 .ci/aarch64_linux/aarch64_wheel_ci_build.py
48 .ci/aarch64_linux/build_aarch64_wheel.py
5 .ci/docker/aotriton_version.txt
124 .ci/docker/build.sh
7 .ci/docker/centos-rocm/Dockerfile
2 .ci/docker/ci_commit_pins/executorch.txt
1 .ci/docker/ci_commit_pins/nccl-cu11.txt Normal file
1 .ci/docker/ci_commit_pins/nccl-cu12.txt Normal file
2 .ci/docker/ci_commit_pins/timm.txt
2 .ci/docker/ci_commit_pins/triton.txt
2 .ci/docker/common/install_acl.sh
23 .ci/docker/common/install_aotriton.sh
4 .ci/docker/common/install_base.sh
8 .ci/docker/common/install_cache.sh
2 .ci/docker/common/install_cpython.sh
116 .ci/docker/common/install_cuda.sh
38 .ci/docker/common/install_cuda_aarch64.sh
4 .ci/docker/common/install_cudnn.sh
20 .ci/docker/common/install_cusparselt.sh
7 .ci/docker/common/install_executorch.sh
6 .ci/docker/common/install_onnx.sh
16 .ci/docker/common/install_rocm.sh
26 .ci/docker/common/install_ucc.sh
17 .ci/docker/libtorch/Dockerfile
7 .ci/docker/manywheel/Dockerfile
22 .ci/docker/requirements-ci.txt
55 .ci/docker/ubuntu-rocm/Dockerfile
13 .ci/magma/Makefile
54 .ci/manywheel/build_cuda.sh
27 .ci/manywheel/build_rocm.sh
6 .ci/pytorch/build.sh
2 .ci/pytorch/check_binary.sh
2 .ci/pytorch/common.sh
43 .ci/pytorch/common_utils.sh
2 .ci/pytorch/cpp_doc_push_script.sh
2 .ci/pytorch/functorch_doc_push_script.sh
2 .ci/pytorch/install_cache_xla.sh
3 .ci/pytorch/macos-test.sh
93 .ci/pytorch/multigpu-test.sh
4 .ci/pytorch/python_doc_push_script.sh
4 .ci/pytorch/run_tests.sh
10 .ci/pytorch/smoke_test/check_binary_symbols.py
47 .ci/pytorch/smoke_test/smoke_test.py
89 .ci/pytorch/test.sh
41 .ci/pytorch/test_example_code/cnn_smoke_win_arm64.py Normal file
13 .ci/pytorch/test_example_code/rnn_smoke_win_arm64.py Normal file
2 .ci/pytorch/win-build.sh
3 .ci/pytorch/win-test-helpers/build_pytorch.bat
114 .ci/pytorch/win-test-helpers/installation-helpers/install_xpu.bat
7 .ci/pytorch/win-test.sh
31 .ci/pytorch/windows/arm64/bootstrap_apl.bat Normal file
49 .ci/pytorch/windows/arm64/bootstrap_buildtools.bat Normal file
37 .ci/pytorch/windows/arm64/bootstrap_git.bat Normal file
33 .ci/pytorch/windows/arm64/bootstrap_libuv.bat Normal file
46 .ci/pytorch/windows/arm64/bootstrap_openblas.bat Normal file
41 .ci/pytorch/windows/arm64/bootstrap_python.bat Normal file
33 .ci/pytorch/windows/arm64/bootstrap_rust.bat Normal file
33 .ci/pytorch/windows/arm64/bootstrap_sccache.bat Normal file
22 .ci/pytorch/windows/arm64/bootstrap_tests.bat Normal file
101 .ci/pytorch/windows/arm64/build_libtorch.bat Normal file
60 .ci/pytorch/windows/arm64/build_pytorch.bat Normal file
65 .ci/pytorch/windows/arm64/smoke_test.bat Normal file
13 .ci/pytorch/windows/condaenv.bat
59 .ci/pytorch/windows/cuda128.bat Normal file
32 .ci/pytorch/windows/internal/cuda_install.bat
99 .ci/pytorch/windows/internal/smoke_test.bat
5 .ci/pytorch/windows/internal/static_lib_test.bat
6 .ci/pytorch/windows/internal/vc_install_helper.bat
48 .ci/pytorch/windows/internal/vs2019_install.ps1
14 .ci/pytorch/windows/internal/xpu_install.bat
5 .ci/pytorch/windows/xpu.bat
48 .ci/wheel/build_wheel.sh
2 .circleci/codegen_validation/normalize_yaml_fragment.py
2 .circleci/scripts/binary_linux_test.sh
11 .circleci/scripts/binary_macos_build.sh
13 .circleci/scripts/binary_populate_env.sh


View File

32

.ci/pytorch/windows/internal/cuda_install.bat

View File

99

.ci/pytorch/windows/internal/smoke_test.bat

View File

5

.ci/pytorch/windows/internal/static_lib_test.bat

View File

6

.ci/pytorch/windows/internal/vc_install_helper.bat

View File

48

.ci/pytorch/windows/internal/vs2019_install.ps1

View File

14

.ci/pytorch/windows/internal/xpu_install.bat

View File

5

.ci/pytorch/windows/xpu.bat

View File

48

.ci/wheel/build_wheel.sh

View File

2

.circleci/codegen_validation/normalize_yaml_fragment.py

View File

2

.circleci/scripts/binary_linux_test.sh

View File

11

.circleci/scripts/binary_macos_build.sh

View File

13

.circleci/scripts/binary_populate_env.sh

View File

43

.circleci/scripts/binary_upload.sh

View File