240c13394e
Revert "[inductor] require shape in TritonCSEVariable ( #162275 )"
...
This reverts commit 3af2f0c12accc6bd10ef2b76fb5c51aa0f6b73a3.
Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/clee2000 due to still failing due to the above D84932446 ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3423153819 ))
2025-10-20 17:55:54 +00:00
150682ba7f
Revert "Remove workaround to old CUDA bug ( #164354 )"
...
This reverts commit 26f38034332a99f2bdcc67ce1f4ba9403d420e52.
Reverted https://github.com/pytorch/pytorch/pull/164354 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 ))
2025-10-20 17:48:08 +00:00
ca7360e996
Revert "Move toString(ScalarType) and ScalarType ostream operator to headeronly ( #164405 )"
...
This reverts commit ca8bd5dbedb5b46f78026e0378b0f47500ddba38.
Reverted https://github.com/pytorch/pytorch/pull/164405 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 ))
2025-10-20 17:48:08 +00:00
0bf604320f
Revert "[dynamo][user_defined] Replace UserFunctionVariable with VariableTracker build ( #165706 )"
...
This reverts commit 1dc9a05d0323ee3c7a20945c62463959d40f1a51.
Reverted https://github.com/pytorch/pytorch/pull/165706 on behalf of https://github.com/clee2000 due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 ))
2025-10-20 17:28:58 +00:00
9875e70da8
Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"
...
This reverts commit 630520b346b8883db7821562e589ccde7d12687a.
Reverted https://github.com/pytorch/pytorch/pull/165707 on behalf of https://github.com/clee2000 due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 ))
2025-10-20 17:28:58 +00:00
69a4bfe8bb
Revert "Refactor out headeronly ArrayRef ( #164991 )"
...
This reverts commit 3806e9767b03d06edc317cb90a3a996abdf192a0.
Reverted https://github.com/pytorch/pytorch/pull/164991 on behalf of https://github.com/clee2000 due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 ))
2025-10-20 17:26:42 +00:00
62a263b8d4
Revert "Widen ops support to take in IntHOArrayRef vs only std::vec ( #165152 )"
...
This reverts commit e4454947e2c692db1a249591121f8583fefe7df1.
Reverted https://github.com/pytorch/pytorch/pull/165152 on behalf of https://github.com/clee2000 due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 ))
2025-10-20 17:26:42 +00:00
0da1f911dc
Revert "[Submodule] Bump FBGEMM to latest ( #165544 )"
...
This reverts commit 23417ae50f5d9bc02e988d916c103ff3a03c5903.
Reverted https://github.com/pytorch/pytorch/pull/165544 on behalf of https://github.com/clee2000 due to failing in internal D84996252, probably needs some sort of update to fbgemm internally? ([comment](https://github.com/pytorch/pytorch/pull/165544#issuecomment-3422993703 ))
2025-10-20 17:06:07 +00:00
ab82456c16
Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )"
...
This reverts commit e1e8491b316df810388d9fa24f135cdba27ab40e.
Reverted https://github.com/pytorch/pytorch/pull/165750 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890 ))
2025-10-20 14:51:58 +00:00
b23f4687fd
[Inductor][CuTeDSL] Move load_template up two directories ( #165868 )
...
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/165347
Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more general Inductor templates in the future.
Test Plan: test/inductor/test_flex_flash
Differential Revision: D85013024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165868
Approved by: https://github.com/jananisriram
2025-10-20 12:14:38 +00:00
722b2b86c9
[dynamo] Remove duplicated guards ( #165806 )
...
This was found by looking at a tlparse of an internal job. We will need a deeper audit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165806
Approved by: https://github.com/jansel
2025-10-20 05:50:33 +00:00
e1e8491b31
[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )
...
This series of changes converts C-style casts to their C++ alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
767199fd9b
[flex_attention] replace sliced BlockMask noop with helpful error ( #164702 )
...
Fixes part of #163314
After slicing BlockMask with `[]`, mask_mod was silently replaced with noop_mask. This caused silent incorrect results when users applied transformations to `sliced_mask.mask_mod`.
Replace the noop with `_sliced_mask_mod_error`, which raises a RuntimeError with guidance to use `base_mask.mask_mod` instead.
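A minimal sketch of the change, assuming the public `flex_attention` API (`create_block_mask` and `BlockMask` indexing):
```
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

base_mask = create_block_mask(causal, B=4, H=1, Q_LEN=256, KV_LEN=256)
sliced = base_mask[0]  # slice out batch element 0

try:
    sliced.mask_mod(0, 0, 0, 0)  # previously a silent noop
except RuntimeError as err:
    print(err)  # now directs users to base_mask.mask_mod

mod = base_mask.mask_mod  # the supported way to reuse the mask function
```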
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164702
Approved by: https://github.com/drisspg , https://github.com/BoyuanFeng
2025-10-20 03:46:16 +00:00
e8cb34dd52
[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )
...
**Summary:**
Support masked vectorization for the tail_loop for fp8 datatype.
**Example:**
```
import torch

def fn(
    x,
    scale,
    zero_point,
    quant_min,
    quant_max,
    dtype,
):
    x = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x,
        scale,
        zero_point,
        quant_min,
        quant_max,
        dtype,
    )
    x = torch.relu(x)
    x = torch.ops.quantized_decomposed.quantize_per_tensor(
        x, scale, zero_point, quant_min, quant_max, dtype
    )
    return x

quant_min = -128
quant_max = 127
dtype = torch.float8_e4m3fn
x = torch.clamp(torch.randn((1, 7, 7, 9), dtype=torch.float32) * 100, quant_min, quant_max).to(dtype)
zero_point = 100
scale = 0.01

with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x, scale, zero_point, quant_min, quant_max, dtype)
```
**Generated code:**
- Before
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(432L); x0_tail < static_cast<int64_t>(441L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = c10::convert<float>(tmp0);
                        auto tmp2 = static_cast<float>(100.0);
                        auto tmp3 = float(tmp1 - tmp2);
                        auto tmp4 = static_cast<float>(0.01);
                        auto tmp5 = float(tmp3 * tmp4);
                        auto tmp6 = c10::convert<float>(tmp5);
                        auto tmp7 = std::max(tmp6, decltype(tmp6)(0));
                        auto tmp8 = float(tmp7 * tmp2);
                        auto tmp9 = std::nearbyint(tmp8);
                        auto tmp10 = float(tmp9 + tmp2);
                        auto tmp11 = static_cast<float>(-128.0);
                        auto tmp12 = max_propagate_nan(tmp10, tmp11);
                        auto tmp13 = static_cast<float>(127.0);
                        auto tmp14 = min_propagate_nan(tmp12, tmp13);
                        auto tmp15 = c10::convert<at::Float8_e4m3fn>(tmp14);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp15;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163324
Approved by: https://github.com/Xia-Weiwen , https://github.com/mingfeima , https://github.com/jansel
ghstack dependencies: #163316
2025-10-20 01:56:00 +00:00
e9d8973427
[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )
...
**Summary:**
Support masked vectorization for the tail_loop for float64 datatype.
**Example:**
```
import torch

def fn(x):
    return x * x

x = torch.randn((22, 22), dtype=torch.double)
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x)
```
**Generated code:**
- Before
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(480L); x0_tail < static_cast<int64_t>(484L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = double(tmp0 * tmp0);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316
Approved by: https://github.com/mingfeima , https://github.com/jansel
2025-10-20 01:41:38 +00:00
6b80c94901
[FlexAttention] Fix dynamic shaped heads flex_flash check ( #165866 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165866
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #165729
2025-10-19 23:10:16 +00:00
8139f33fa5
[dynamo] Add recompile reason for set_stance fail_on_recompile ( #165445 )
...
Fixes #163500
### Summary:
With `set_stance("fail_on_recompile")`, failures will now report the reason the recompilation occurred.
### Impacts:
module: dynamo
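A minimal sketch, assuming `torch.compiler.set_stance` and default dynamic-shape behavior (the shape change on the second call forces a recompile attempt):
```
import torch

@torch.compile
def f(x):
    return x + 1

f(torch.randn(4))  # initial compile
torch.compiler.set_stance("fail_on_recompile")
try:
    f(torch.randn(8))  # different shape -> recompile attempt
except RuntimeError as err:
    print(err)  # failure now includes the reason the recompile was triggered
```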
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165445
Approved by: https://github.com/williamwen42
2025-10-19 21:12:19 +00:00
a88587348b
[dynamo] Clean up assert in dynamo [1/N] ( #165430 )
...
Fixes parts of #162852 and #164878; the two issues are related.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165430
Approved by: https://github.com/Lucaskabela , https://github.com/williamwen42
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com >
2025-10-19 21:00:05 +00:00
633a3b7f67
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit fa0db212e717b6cb225159cb32ea3d83baa52381.
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217 ))
2025-10-19 19:20:45 +00:00
fa0db212e7
shrink_group implementation to expose ncclCommShrink API ( #164518 )
...
Closes #164529
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
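A hedged sketch of how such an API might be used; `shrink_group` comes from the commit title, but the argument names and return type below are assumptions rather than the actual binding:
```
import torch.distributed as dist

dist.init_process_group("nccl")

# Hypothetical: derive a smaller NCCL communicator that excludes failed
# ranks instead of tearing down and re-initializing the whole world.
new_pg = dist.shrink_group(ranks_to_exclude=[3])  # assumed name/signature

# Collectives would then run on the shrunk group, e.g.:
# dist.all_reduce(tensor, group=new_pg)
```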
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-19 18:00:08 +00:00
22ae059d32
AOTI util deprecated flow using the new tracer ( #165582 )
...
Reapply of https://github.com/pytorch/pytorch/pull/163260
AOTI utils sometimes expect a free function, so the export API is adjusted to handle that; we haven't seen any methods getting exported. Some AOTI flows also require that we populate dynamo_flat_name_to_original_fqn, so I just copy how it is done in eval_frame.py. I also cleaned up how we get rid of export_root and fixed some overcomplicated nn_module_stack handling in export code. The logic is simpler now thanks to @anijain2305 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165582
Approved by: https://github.com/anijain2305
2025-10-19 15:52:16 +00:00
b2f5c25b27
Introduce a generic API torch._C._accelerator_setAllocatorSettings ( #165291 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165291
Approved by: https://github.com/albanD
ghstack dependencies: #165288 , #165289
2025-10-19 15:34:36 +00:00
57ba575242
[BE][Ez]: Update torch.is_tensor documentation ( #165841 )
...
TypeIs propagates the isinstance check through the typing system. They are now equivalent.
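A small illustration of the equivalence for type checkers:
```
import torch

def describe(x: object) -> str:
    if torch.is_tensor(x):
        # With the TypeIs annotation, checkers narrow x to torch.Tensor
        # here, just as they do for isinstance(x, torch.Tensor).
        return str(x.shape)
    return type(x).__name__
```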
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165841
Approved by: https://github.com/albanD
2025-10-19 09:24:11 +00:00
3255e7872b
Enable all flake8-logging-format rules ( #164655 )
...
These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99 , https://github.com/mlazos
2025-10-19 00:59:28 +00:00
c4f6619330
Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )
...
- During op dispatch, local tensor is supposed to collect the rng state from the CPU and CUDA devices so that it can be reset before execution of the op for each rank, such that ops with randomness produce the same result for all ranks (note that we are planning a separate change to add support for per-rank rng state). Previously we relied on op input arguments to deduce which devices to get the rng state from, which doesn't work for factory functions such as torch.randn. Hence this change switches to unconditionally collecting the rng state from all devices.
- Fixed per-rank-specific computations in _MaskedPartial and Shard placements discovered during test enablement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716
Approved by: https://github.com/ezyang
2025-10-18 23:33:24 +00:00
f18041cca8
Fix missing closing quote in __init__.py documentation ( #165827 )
...
Title says it all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165827
Approved by: https://github.com/Skylion007
2025-10-18 22:09:18 +00:00
1f43d17ce6
Fix self assignment ( #165816 )
...
This PR removes assignments of the form `var=var`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165816
Approved by: https://github.com/jansel
2025-10-18 18:51:52 +00:00
032bed95cd
Various C++ code fixes in LSAN integration ( #165818 )
...
This PR extracts the C++ code fixes from #154584 , which are fixes needed to enable LSAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165818
Approved by: https://github.com/ezyang
2025-10-18 17:59:23 +00:00
f510d0dbc0
Clarifying input/output angle units in the docs for trigonometric functions ( #161248 )
...
Fixes #[160995](https://github.com/pytorch/pytorch/issues/160995 )
Modified the docs to clarify that input tensor values for torch.sin, torch.cos, and torch.tan should be in radians, and that output tensor values for torch.acos, torch.asin, and torch.atan are in radians.
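A quick numeric illustration of the radian convention:
```
import math
import torch

torch.sin(torch.tensor(math.pi / 2))  # input interpreted as radians -> tensor(1.)
torch.asin(torch.tensor(1.0))         # output in radians -> tensor(1.5708)
```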
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161248
Approved by: https://github.com/isuruf
Co-authored-by: Isuru Fernando <isuruf@gmail.com >
2025-10-18 11:53:48 +00:00
beb6b62e8c
Revert "Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )"
...
This reverts commit 1b397420f22b22f90a1093233ecd9167656e50cb.
Reverted https://github.com/pytorch/pytorch/pull/165716 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165716#issuecomment-3418083391 ))
2025-10-18 09:15:49 +00:00
4740ce7787
[CP] Fix load balancer incorrectly assuming batch dimension exists ( #165792 )
...
https://github.com/pytorch/pytorch/pull/163617 removed the if/else statement that checks whether the input buffers have the batch dimension.
This PR fixes the issue and also adds a test.
In the future, we should explicitly ask users to unsqueeze the batch dimension. That breaks the existing contract, but implicitly inferring the existence of the batch dimension is not safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165792
Approved by: https://github.com/XilunWu
2025-10-18 09:11:16 +00:00
fdab48a7c1
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
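A quick illustration of what the two new rules flag:
```
from enum import Enum

class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796: duplicate value (CRIMSON becomes an alias of RED)

for i in range(0, 10):  # PIE808: unnecessary start argument; range(10) suffices
    pass
```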
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
a0948d4d23
[ROCm][inductor] autotune support for persistent reduction kernels ( #163908 )
...
After the removal of want_no_x_dim for persistent reduction kernels, we can improve the autotuning setup for persistent reduction kernels.
Currently, even with tuning enabled, filtering will only try a single config in many cases. Avoid filtering in autotune mode and override the MAX_BLOCK limit. We also always include tiny_config when autotuning is enabled.
Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163908
Approved by: https://github.com/jansel , https://github.com/PaulZhang12
2025-10-18 07:33:24 +00:00
0bbdd6b8db
[ROCm][inductor] heuristic improvements for pointwise kernels ( #163197 )
...
Heuristic improvements for pointwise kernels for MI350.
Contributions from several members of the AMD Inductor and Triton teams:
@jataylo @AmdSampsa @iupaikov-amd @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163197
Approved by: https://github.com/PaulZhang12 , https://github.com/eellison , https://github.com/jansel
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com >
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com >
2025-10-18 07:23:41 +00:00
24520b8386
Revert "Enable all PIE rules on ruff ( #165814 )"
...
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.
Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863 ))
2025-10-18 07:21:08 +00:00
c79dfdc655
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
e595136187
Enable PLC1802 on ruff ( #165813 )
...
This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context.
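An example of the pattern PLC1802 flags:
```
items = [1, 2, 3]

if len(items):  # flagged by PLC1802
    print("non-empty")

if items:  # preferred: an empty sequence is already falsy
    print("non-empty")
```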
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813
Approved by: https://github.com/ezyang
2025-10-18 05:44:14 +00:00
aaac8cb0f5
[1/N] Add strict parameter to Python zip calls ( #165531 )
...
Add `strict=True/False` to zip calls in test utils. `strict=True` is passed when possible.
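A short demonstration of the difference (Python 3.10+):
```
names = ["a", "b", "c"]
vals = [1, 2]

list(zip(names, vals))  # silently drops "c"

try:
    list(zip(names, vals, strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1
```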
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165531
Approved by: https://github.com/Skylion007
2025-10-18 05:26:33 +00:00
0f0b4bf029
[1/N] Remove unused header inclusion ( #165763 )
...
This PR removes unused header inclusion in C++ files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165763
Approved by: https://github.com/Skylion007
2025-10-18 05:23:11 +00:00
b8194268a6
Remove unnecessary noqa suppressions ( #164106 )
...
This PR removes unused `noqa` suppressions in Python code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164106
Approved by: https://github.com/albanD
2025-10-18 04:52:41 +00:00
f02e3947f6
Expand type checking to mypy strict files ( #165697 )
...
Expands Pyrefly type checking to cover the files outlined in the mypy-strict.ini configuration file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
d9f94e0d7d
[dynamo] Support fx.traceback.annotate as decorator ( #165805 )
...
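A hedged sketch of the two usages; the metadata dict passed to `annotate` is an assumption:
```
import torch.fx.traceback as fx_traceback

# Context-manager usage (pre-existing):
with fx_traceback.annotate({"pp_stage": 0}):
    pass

# Decorator usage enabled by this change:
@fx_traceback.annotate({"pp_stage": 0})
def forward_chunk(x):
    return x
```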
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165805
Approved by: https://github.com/Lucaskabela , https://github.com/SherlockNoMad , https://github.com/yushangdi
2025-10-18 03:58:11 +00:00
23417ae50f
[Submodule] Bump FBGEMM to latest ( #165544 )
...
Summary:
* FBGEMM submodule updated to main
* CMake updated to reflect necessary changes
* Notably pulls in NVFP4 grouped gemm kernels
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165544
Approved by: https://github.com/cyyever , https://github.com/jeffdaily
2025-10-18 03:58:08 +00:00
e4d6c56ffb
Improve dynamo graph capture stack trace for custom ops ( #165693 )
...
For a custom op
```
@torch.library.custom_op("my_lib::foo", mutates_args={})
def foo(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y
```
People could call `torch.ops.my_lib.foo()` or call `foo()` directly in the `forward` of an `nn.Module`.
These two calling conventions will lead to the same node in the output graph, but different stack traces.
When directly calling `foo()`, the displayed stack_trace in the graph will be
```
# File: .../pytorch/torch/_library/custom_ops.py:687 in __call__, code: return self._opoverload(*args, **kwargs)
```
This is not useful so we filter it out.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_custom_op_stack_trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165693
Approved by: https://github.com/SherlockNoMad , https://github.com/williamwen42
2025-10-18 03:48:18 +00:00
017d2985f3
set unbacked bindings in reinplace pass for newly created nodes during generalize_scatter decomp ( #164948 )
...
Two fixes:
1. In the reinplace pass, set unbacked bindings for newly created nodes.
2. In Inductor, ComputeBuffer used to miss detecting some used symbols; fixed that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164948
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #164341
2025-10-18 03:20:30 +00:00
c6a8db0b9a
Fix issues with generalized_scatter and setitem allocated unbacked symbols. ( #164341 )
...
Three fixes:
1. When doing `t[u0] += 1`, if u0 is unbacked we could allocate a new unbacked symbol during the indexing of `t[u0]` (when we fake-trace setitem), namely because meta_select allocates a new unbacked symbol for the storage offset when we do not know whether u0>=0 or u0<0. But the output size/stride of setitem() does not depend on that new symbol; it is consumed within setitem, so we shall ignore it.
2. When we trace through generalized_scatter, the applications of the views could allocate unbacked symints, but those do not affect the final output, so we shall ignore them as well.
3. Before accessing strides in lowering, we shall materialize.
Addresses https://github.com/pytorch/pytorch/issues/114293 and https://github.com/pytorch/pytorch/issues/131911
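A repro-style sketch of the pattern in fix 1; this assumes `idx.item()` produces the unbacked symbol described and is not guaranteed to exercise the exact failure:
```
import torch

@torch.compile(fullgraph=True)
def f(t, idx):
    u0 = idx.item()  # u0 becomes an unbacked SymInt
    # Indexing with u0 used to allocate a fresh unbacked symbol for the
    # storage offset (sign of u0 unknown); setitem's output never depends
    # on it, so it is now ignored instead of tripping the pending-symbol check.
    t[u0] += 1
    return t

f(torch.zeros(8), torch.tensor(3))
```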
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164341
Approved by: https://github.com/bobrenjc93
2025-10-18 03:20:30 +00:00
cf3a787bbc
[annotate] Annotate bw nodes before eliminate dead code ( #165782 )
...
Fixes https://github.com/pytorch/torchtitan/pull/1907
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165782
Approved by: https://github.com/SherlockNoMad
2025-10-18 01:54:31 +00:00
de3da77cf7
Thread deterministic config vars to subproc compilation ( #165729 )
...
# Summary
TIL (AFTER WAYYYY TOO MUCH INSANITY) that we do not serialize the full set of configs for subprocess compilation.
I found this while working on Flex-attention determinism: https://github.com/meta-pytorch/attention-gym/pull/168
It might be good to audit whether we need to thread through any more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165729
Approved by: https://github.com/shunting314 , https://github.com/eellison
2025-10-18 01:25:50 +00:00
543ddbf44c
[ONNX] Support renaming in dynamic axes to shapes conversion ( #165769 )
...
Discovered in #165748.
This PR also deprecates the conversion; the ONNX exporter team does not intend to maintain it long term.
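Roughly, the conversion maps the legacy `dynamic_axes` dict to the `dynamic_shapes` form used by the dynamo-based exporter, preserving custom axis names; the exact output below is an assumption:
```
import torch

# Legacy TorchScript-exporter style:
dynamic_axes = {"x": {0: "batch"}}

# Approximate dynamo-exporter equivalent with the name carried over:
dynamic_shapes = {"x": {0: torch.export.Dim("batch")}}
```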
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165769
Approved by: https://github.com/justinchuby
2025-10-18 01:11:20 +00:00
e9f4999985
[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165305 )
...
Fixes part of #148114
Including:
- torch/csrc/distributed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165305
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-18 01:08:44 +00:00