240c13394e
Revert "[inductor] require shape in TritonCSEVariable ( #162275 )"
...
This reverts commit 3af2f0c12accc6bd10ef2b76fb5c51aa0f6b73a3.
Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/clee2000 due to still failing due to the above D84932446 ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3423153819 ))
2025-10-20 17:55:54 +00:00
150682ba7f
Revert "Remove workaround to old CUDA bug ( #164354 )"
...
This reverts commit 26f38034332a99f2bdcc67ce1f4ba9403d420e52.
Reverted https://github.com/pytorch/pytorch/pull/164354 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 ))
2025-10-20 17:48:08 +00:00
ca7360e996
Revert "Move toString(ScalarType) and ScalarType ostream operator to headeronly ( #164405 )"
...
This reverts commit ca8bd5dbedb5b46f78026e0378b0f47500ddba38.
Reverted https://github.com/pytorch/pytorch/pull/164405 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 ))
2025-10-20 17:48:08 +00:00
0bf604320f
Revert "[dynamo][user_defined] Replace UserFunctionVariable with VariableTracker build ( #165706 )"
...
This reverts commit 1dc9a05d0323ee3c7a20945c62463959d40f1a51.
Reverted https://github.com/pytorch/pytorch/pull/165706 on behalf of https://github.com/clee2000 due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 ))
2025-10-20 17:28:58 +00:00
9875e70da8
Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"
...
This reverts commit 630520b346b8883db7821562e589ccde7d12687a.
Reverted https://github.com/pytorch/pytorch/pull/165707 on behalf of https://github.com/clee2000 due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 ))
2025-10-20 17:28:58 +00:00
69a4bfe8bb
Revert "Refactor out headeronly ArrayRef ( #164991 )"
...
This reverts commit 3806e9767b03d06edc317cb90a3a996abdf192a0.
Reverted https://github.com/pytorch/pytorch/pull/164991 on behalf of https://github.com/clee2000 due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 ))
2025-10-20 17:26:42 +00:00
62a263b8d4
Revert "Widen ops support to take in IntHOArrayRef vs only std::vec ( #165152 )"
...
This reverts commit e4454947e2c692db1a249591121f8583fefe7df1.
Reverted https://github.com/pytorch/pytorch/pull/165152 on behalf of https://github.com/clee2000 due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 ))
2025-10-20 17:26:42 +00:00
0da1f911dc
Revert "[Submodule] Bump FBGEMM to latest ( #165544 )"
...
This reverts commit 23417ae50f5d9bc02e988d916c103ff3a03c5903.
Reverted https://github.com/pytorch/pytorch/pull/165544 on behalf of https://github.com/clee2000 due to failing in internal D84996252, probably needs some sort of update to fbgemm internally? ([comment](https://github.com/pytorch/pytorch/pull/165544#issuecomment-3422993703 ))
2025-10-20 17:06:07 +00:00
ab82456c16
Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )"
...
This reverts commit e1e8491b316df810388d9fa24f135cdba27ab40e.
Reverted https://github.com/pytorch/pytorch/pull/165750 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890 ))
2025-10-20 14:51:58 +00:00
b23f4687fd
[Inductor][CuTeDSL] Move load_template up two directories ( #165868 )
...
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/165347
Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more general Inductor templates in the future.
Test Plan: test/inductor/test_flex_flash
Differential Revision: D85013024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165868
Approved by: https://github.com/jananisriram
2025-10-20 12:14:38 +00:00
722b2b86c9
[dynamo] Remove duplicated guards ( #165806 )
...
This was found by looking at a tlparse of an internal job. We will need a deeper audit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165806
Approved by: https://github.com/jansel
2025-10-20 05:50:33 +00:00
e1e8491b31
[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )
...
This series of changes converts C-style casts to their C++ alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
767199fd9b
[flex_attention] replace sliced BlockMask noop with helpful error ( #164702 )
...
Fixes part of #163314
After slicing BlockMask with `[]`, mask_mod was silently replaced with noop_mask. This caused silent incorrect results when users applied transformations to `sliced_mask.mask_mod`.
Replace the noop with `_sliced_mask_mod_error`, which raises a RuntimeError with guidance to use `base_mask.mask_mod` instead.
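A minimal sketch of the change, assuming the public `flex_attention` API (`create_block_mask` and `BlockMask` indexing):
```
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

base_mask = create_block_mask(causal, B=4, H=1, Q_LEN=256, KV_LEN=256)
sliced = base_mask[0]  # slice out batch element 0

try:
    sliced.mask_mod(0, 0, 0, 0)  # previously a silent noop
except RuntimeError as err:
    print(err)  # now directs users to base_mask.mask_mod

mod = base_mask.mask_mod  # the supported way to reuse the mask function
```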
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164702
Approved by: https://github.com/drisspg , https://github.com/BoyuanFeng
2025-10-20 03:46:16 +00:00
e8cb34dd52
[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )
...
**Summary:**
Support masked vectorization for the tail_loop for fp8 datatype.
**Example:**
```
import torch

def fn(
    x,
    scale,
    zero_point,
    quant_min,
    quant_max,
    dtype,
):
    x = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x,
        scale,
        zero_point,
        quant_min,
        quant_max,
        dtype,
    )
    x = torch.relu(x)
    x = torch.ops.quantized_decomposed.quantize_per_tensor(
        x, scale, zero_point, quant_min, quant_max, dtype
    )
    return x

quant_min = -128
quant_max = 127
dtype = torch.float8_e4m3fn
x = torch.clamp(torch.randn((1, 7, 7, 9), dtype=torch.float32) * 100, quant_min, quant_max).to(dtype)
zero_point = 100
scale = 0.01

with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x, scale, zero_point, quant_min, quant_max, dtype)
```
**Generated code:**
- Before
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(432L); x0_tail < static_cast<int64_t>(441L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = c10::convert<float>(tmp0);
                        auto tmp2 = static_cast<float>(100.0);
                        auto tmp3 = float(tmp1 - tmp2);
                        auto tmp4 = static_cast<float>(0.01);
                        auto tmp5 = float(tmp3 * tmp4);
                        auto tmp6 = c10::convert<float>(tmp5);
                        auto tmp7 = std::max(tmp6, decltype(tmp6)(0));
                        auto tmp8 = float(tmp7 * tmp2);
                        auto tmp9 = std::nearbyint(tmp8);
                        auto tmp10 = float(tmp9 + tmp2);
                        auto tmp11 = static_cast<float>(-128.0);
                        auto tmp12 = max_propagate_nan(tmp10, tmp11);
                        auto tmp13 = static_cast<float>(127.0);
                        auto tmp14 = min_propagate_nan(tmp12, tmp13);
                        auto tmp15 = c10::convert<at::Float8_e4m3fn>(tmp14);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp15;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163324
Approved by: https://github.com/Xia-Weiwen , https://github.com/mingfeima , https://github.com/jansel
ghstack dependencies: #163316
2025-10-20 01:56:00 +00:00
e9d8973427
[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )
...
**Summary:**
Support masked vectorization for the tail_loop for float64 datatype.
**Example:**
```
import torch

def fn(x):
    return x * x

x = torch.randn((22, 22), dtype=torch.double)
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x)
```
**Generated code:**
- Before
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(480L); x0_tail < static_cast<int64_t>(484L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = double(tmp0 * tmp0);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316
Approved by: https://github.com/mingfeima , https://github.com/jansel
2025-10-20 01:41:38 +00:00
6b80c94901
[FlexAttention] Fix dynamic shaped heads flex_flash check ( #165866 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165866
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #165729
2025-10-19 23:10:16 +00:00
8139f33fa5
[dynamo] Add recompile reason for set_stance fail_on_recompile ( #165445 )
...
Fixes #163500
### Summary:
With `set_stance("fail_on_recompile")`, failures will now report the reason the recompilation occurred.
### Impacts:
module: dynamo
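A minimal sketch, assuming `torch.compiler.set_stance` and default dynamic-shape behavior (the shape change on the second call forces a recompile attempt):
```
import torch

@torch.compile
def f(x):
    return x + 1

f(torch.randn(4))  # initial compile
torch.compiler.set_stance("fail_on_recompile")
try:
    f(torch.randn(8))  # different shape -> recompile attempt
except RuntimeError as err:
    print(err)  # failure now includes the reason the recompile was triggered
```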
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165445
Approved by: https://github.com/williamwen42
2025-10-19 21:12:19 +00:00
a88587348b
[dynamo] Clean up assert in dynamo [1/N] ( #165430 )
...
Fixes parts of #162852 and #164878; the two issues are related.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165430
Approved by: https://github.com/Lucaskabela , https://github.com/williamwen42
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com >
2025-10-19 21:00:05 +00:00
633a3b7f67
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit fa0db212e717b6cb225159cb32ea3d83baa52381.
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217 ))
2025-10-19 19:20:45 +00:00
fa0db212e7
shrink_group implementation to expose ncclCommShrink API ( #164518 )
...
Closes #164529
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
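A hedged sketch of how such an API might be used; `shrink_group` comes from the commit title, but the argument names and return type below are assumptions rather than the actual binding:
```
import torch.distributed as dist

dist.init_process_group("nccl")

# Hypothetical: derive a smaller NCCL communicator that excludes failed
# ranks instead of tearing down and re-initializing the whole world.
new_pg = dist.shrink_group(ranks_to_exclude=[3])  # assumed name/signature

# Collectives would then run on the shrunk group, e.g.:
# dist.all_reduce(tensor, group=new_pg)
```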
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-19 18:00:08 +00:00
22ae059d32
AOTI util deprecated flow using the new tracer ( #165582 )
...
Reapply of https://github.com/pytorch/pytorch/pull/163260
AOTI utils sometimes expect a free function, so the export API is adjusted to handle that; we haven't seen any methods getting exported. Some AOTI flows also require that we populate dynamo_flat_name_to_original_fqn, so I just copy how it is done in eval_frame.py. I also cleaned up how we get rid of export_root and fixed some overcomplicated nn_module_stack handling in export code. The logic is simpler now thanks to @anijain2305 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165582
Approved by: https://github.com/anijain2305
2025-10-19 15:52:16 +00:00
b2f5c25b27
Introduce a generic API torch._C._accelerator_setAllocatorSettings ( #165291 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165291
Approved by: https://github.com/albanD
ghstack dependencies: #165288 , #165289
2025-10-19 15:34:36 +00:00
57ba575242
[BE][Ez]: Update torch.is_tensor documentation ( #165841 )
...
TypeIs propagates the isinstance check through the typing system. They are now equivalent.
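A small illustration of the equivalence for type checkers:
```
import torch

def describe(x: object) -> str:
    if torch.is_tensor(x):
        # With the TypeIs annotation, checkers narrow x to torch.Tensor
        # here, just as they do for isinstance(x, torch.Tensor).
        return str(x.shape)
    return type(x).__name__
```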
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165841
Approved by: https://github.com/albanD
2025-10-19 09:24:11 +00:00
3255e7872b
Enable all flake8-logging-format rules ( #164655 )
...
These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99 , https://github.com/mlazos
2025-10-19 00:59:28 +00:00
c4f6619330
Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )
...
- During op dispatch, local tensor is supposed to collect the rng state from the CPU and CUDA devices so that it can be reset before execution of the op for each rank, such that ops with randomness produce the same result for all ranks (note that we are planning a separate change to add support for per-rank rng state). Previously we relied on op input arguments to deduce which devices to get the rng state from, which doesn't work for factory functions such as torch.randn. Hence this change switches to unconditionally collecting the rng state from all devices.
- Fixed per-rank-specific computations in _MaskedPartial and Shard placements discovered during test enablement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716
Approved by: https://github.com/ezyang
2025-10-18 23:33:24 +00:00
f18041cca8
Fix missing closing quote in __init__.py documentation ( #165827 )
...
Title says it all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165827
Approved by: https://github.com/Skylion007
2025-10-18 22:09:18 +00:00
1f43d17ce6
Fix self assignment ( #165816 )
...
This PR removes assignments of the form `var=var`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165816
Approved by: https://github.com/jansel
2025-10-18 18:51:52 +00:00
032bed95cd
Various C++ code fixes in LSAN integration ( #165818 )
...
This PR extracts the C++ code fixes from #154584 , which are fixes needed to enable LSAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165818
Approved by: https://github.com/ezyang
2025-10-18 17:59:23 +00:00
f510d0dbc0
Clarifying input/output angle units in the docs for trigonometric functions ( #161248 )
...
Fixes #[160995](https://github.com/pytorch/pytorch/issues/160995 )
Modified the docs to clarify that input tensor values for torch.sin, torch.cos, and torch.tan should be in radians, and that output tensor values for torch.acos, torch.asin, and torch.atan are in radians.
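A quick numeric illustration of the radian convention:
```
import math
import torch

torch.sin(torch.tensor(math.pi / 2))  # input interpreted as radians -> tensor(1.)
torch.asin(torch.tensor(1.0))         # output in radians -> tensor(1.5708)
```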
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161248
Approved by: https://github.com/isuruf
Co-authored-by: Isuru Fernando <isuruf@gmail.com >
2025-10-18 11:53:48 +00:00
beb6b62e8c
Revert "Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )"
...
This reverts commit 1b397420f22b22f90a1093233ecd9167656e50cb.
Reverted https://github.com/pytorch/pytorch/pull/165716 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165716#issuecomment-3418083391 ))
2025-10-18 09:15:49 +00:00
4740ce7787
[CP] Fix load balancer incorrectly assuming batch dimension exists ( #165792 )
...
https://github.com/pytorch/pytorch/pull/163617 removed the if/else statement that checks whether the input buffers have the batch dimension.
This PR fixes the issue and also adds a test.
In the future, we should explicitly ask users to unsqueeze the batch dimension. That breaks the existing contract, but implicitly inferring the existence of the batch dimension is not safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165792
Approved by: https://github.com/XilunWu
2025-10-18 09:11:16 +00:00
fdab48a7c1
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
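A quick illustration of what the two new rules flag:
```
from enum import Enum

class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796: duplicate value (CRIMSON becomes an alias of RED)

for i in range(0, 10):  # PIE808: unnecessary start argument; range(10) suffices
    pass
```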
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
a0948d4d23
[ROCm][inductor] autotune support for persistent reduction kernels ( #163908 )
...
After the removal of want_no_x_dim for persistent reduction kernels, we can improve the autotuning setup for persistent reduction kernels.
Currently, even with tuning enabled, filtering will only try a single config in many cases. Avoid filtering in autotune mode and override the MAX_BLOCK limit. We also always include tiny_config when autotuning is enabled.
Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163908
Approved by: https://github.com/jansel , https://github.com/PaulZhang12
2025-10-18 07:33:24 +00:00
0bbdd6b8db
[ROCm][inductor] heuristic improvements for pointwise kernels ( #163197 )
...
Heuristic improvements for pointwise kernels for MI350.
Contributions from several members of the AMD Inductor and Triton teams:
@jataylo @AmdSampsa @iupaikov-amd @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163197
Approved by: https://github.com/PaulZhang12 , https://github.com/eellison , https://github.com/jansel
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com >
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com >
2025-10-18 07:23:41 +00:00
24520b8386
Revert "Enable all PIE rules on ruff ( #165814 )"
...
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.
Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863 ))
2025-10-18 07:21:08 +00:00
c79dfdc655
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
e595136187
Enable PLC1802 on ruff ( #165813 )
...
This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context.
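An example of the pattern PLC1802 flags:
```
items = [1, 2, 3]

if len(items):  # flagged by PLC1802
    print("non-empty")

if items:  # preferred: an empty sequence is already falsy
    print("non-empty")
```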
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813
Approved by: https://github.com/ezyang
2025-10-18 05:44:14 +00:00
aaac8cb0f5
[1/N] Add strict parameter to Python zip calls ( #165531 )
...
Add `strict=True/False` to zip calls in test utils. `strict=True` is passed when possible.
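A short demonstration of the difference (Python 3.10+):
```
names = ["a", "b", "c"]
vals = [1, 2]

list(zip(names, vals))  # silently drops "c"

try:
    list(zip(names, vals, strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1
```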
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165531
Approved by: https://github.com/Skylion007
2025-10-18 05:26:33 +00:00
0f0b4bf029
[1/N] Remove unused header inclusion ( #165763 )
...
This PR removes unused header inclusion in C++ files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165763
Approved by: https://github.com/Skylion007
2025-10-18 05:23:11 +00:00
b8194268a6
Remove unnecessary noqa suppressions ( #164106 )
...
This PR removes unused `noqa` suppressions in Python code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164106
Approved by: https://github.com/albanD
2025-10-18 04:52:41 +00:00
f02e3947f6
Expand type checking to mypy strict files ( #165697 )
...
Expands Pyrefly type checking to cover the files outlined in the mypy-strict.ini configuration file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
d9f94e0d7d
[dynamo] Support fx.traceback.annotate as decorator ( #165805 )
...
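A hedged sketch of the two usages; the metadata dict passed to `annotate` is an assumption:
```
import torch.fx.traceback as fx_traceback

# Context-manager usage (pre-existing):
with fx_traceback.annotate({"pp_stage": 0}):
    pass

# Decorator usage enabled by this change:
@fx_traceback.annotate({"pp_stage": 0})
def forward_chunk(x):
    return x
```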
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165805
Approved by: https://github.com/Lucaskabela , https://github.com/SherlockNoMad , https://github.com/yushangdi
2025-10-18 03:58:11 +00:00
23417ae50f
[Submodule] Bump FBGEMM to latest ( #165544 )
...
Summary:
* FBGEMM submodule updated to main
* CMake updated to reflect necessary changes
* Notably pulls in NVFP4 grouped gemm kernels
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165544
Approved by: https://github.com/cyyever , https://github.com/jeffdaily
2025-10-18 03:58:08 +00:00
e4d6c56ffb
Improve dynamo graph capture stack trace for custom ops ( #165693 )
...
For a custom op
```
@torch.library.custom_op("my_lib::foo", mutates_args={})
def foo(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y
```
People could call `torch.ops.my_lib.foo()` or call `foo()` directly in the `forward` of an `nn.Module`.
These two calling conventions will lead to the same node in the output graph, but different stack traces.
When directly calling `foo()`, the displayed stack_trace in the graph will be
```
# File: .../pytorch/torch/_library/custom_ops.py:687 in __call__, code: return self._opoverload(*args, **kwargs)
```
This is not useful so we filter it out.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_custom_op_stack_trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165693
Approved by: https://github.com/SherlockNoMad , https://github.com/williamwen42
2025-10-18 03:48:18 +00:00
017d2985f3
set unbacked bindings in reinplace pass for newly created nodes during generalize_scatter decomp ( #164948 )
...
Two fixes:
1. In the reinplace pass, set unbacked bindings for newly created nodes.
2. In Inductor, ComputeBuffer used to miss detecting some used symbols; fixed that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164948
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #164341
2025-10-18 03:20:30 +00:00
c6a8db0b9a
Fix issues with generalized_scatter and setitem allocated unbacked symbols. ( #164341 )
...
Three fixes:
1. When doing `t[u0] += 1`, if u0 is unbacked we could allocate a new unbacked symbol during the indexing of `t[u0]` (when we fake-trace setitem), namely because meta_select allocates a new unbacked symbol for the storage offset when we do not know whether u0>=0 or u0<0. But the output size/stride of setitem() does not depend on that new symbol; it is consumed within setitem, so we shall ignore it.
2. When we trace through generalized_scatter, the applications of the views could allocate unbacked symints, but those do not affect the final output, so we shall ignore them as well.
3. Before accessing strides in lowering, we shall materialize.
Addresses https://github.com/pytorch/pytorch/issues/114293 and https://github.com/pytorch/pytorch/issues/131911
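A repro-style sketch of the pattern in fix 1; this assumes `idx.item()` produces the unbacked symbol described and is not guaranteed to exercise the exact failure:
```
import torch

@torch.compile(fullgraph=True)
def f(t, idx):
    u0 = idx.item()  # u0 becomes an unbacked SymInt
    # Indexing with u0 used to allocate a fresh unbacked symbol for the
    # storage offset (sign of u0 unknown); setitem's output never depends
    # on it, so it is now ignored instead of tripping the pending-symbol check.
    t[u0] += 1
    return t

f(torch.zeros(8), torch.tensor(3))
```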
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164341
Approved by: https://github.com/bobrenjc93
2025-10-18 03:20:30 +00:00
cf3a787bbc
[annotate] Annotate bw nodes before eliminate dead code ( #165782 )
...
Fixes https://github.com/pytorch/torchtitan/pull/1907
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165782
Approved by: https://github.com/SherlockNoMad
2025-10-18 01:54:31 +00:00
de3da77cf7
Thread deterministic config vars to subproc compilation ( #165729 )
...
# Summary
TIL (AFTER WAYYYY TOO MUCH INSANITY) that we do not serialize the full set of configs for subprocess compilation.
I found this while working on Flex-attention determinism: https://github.com/meta-pytorch/attention-gym/pull/168
It might be good to audit whether we need to thread through any more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165729
Approved by: https://github.com/shunting314 , https://github.com/eellison
2025-10-18 01:25:50 +00:00
543ddbf44c
[ONNX] Support renaming in dynamic axes to shapes conversion ( #165769 )
...
Discovered in #165748.
This PR also deprecates the conversion; the ONNX exporter team does not intend to maintain it long term.
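Roughly, the conversion maps the legacy `dynamic_axes` dict to the `dynamic_shapes` form used by the dynamo-based exporter, preserving custom axis names; the exact output below is an assumption:
```
import torch

# Legacy TorchScript-exporter style:
dynamic_axes = {"x": {0: "batch"}}

# Approximate dynamo-exporter equivalent with the name carried over:
dynamic_shapes = {"x": {0: torch.export.Dim("batch")}}
```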
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165769
Approved by: https://github.com/justinchuby
2025-10-18 01:11:20 +00:00
e9f4999985
[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165305 )
...
Fixes part of #148114
Including:
- torch/csrc/distributed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165305
Approved by: https://github.com/FFFrog , https://github.com/albanD
2025-10-18 01:08:44 +00:00