1459 Commits

Author SHA1 Message Date
e1e8491b31 [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes try to cover C style casts into C++ alternatives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
1b121d636e Fix AllocatorConfig parse roundup division bug (#165304)
* #165288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165304
Approved by: https://github.com/albanD
ghstack dependencies: #165288, #165289, #165291, #165298
2025-10-19 15:34:44 +00:00
4888ed440e Refine Allocator Config error message friendly (#165288)
* __->__ #165288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165288
Approved by: https://github.com/albanD
2025-10-19 15:34:17 +00:00
0f0b4bf029 [1/N] Remove unused header inclusion (#165763)
This PR removes unused header inclusion in C++ files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165763
Approved by: https://github.com/Skylion007
2025-10-18 05:23:11 +00:00
7ee45f7503 Restore AcceleratorAllocatorConfig to avoid potential regression (#165129)
# Motivation
This PR aims to restore `AcceleratorAllocatorConfig` to avoid the potential regression mentioned in https://github.com/pytorch/pytorch/pull/160666#issue-3323270375
These code change would be reverted in the following PR https://github.com/pytorch/pytorch/pull/165304
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165129
Approved by: https://github.com/albanD
2025-10-16 15:26:17 +00:00
ca8bd5dbed Move toString(ScalarType) and ScalarType ostream operator to headeronly (#164405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164405
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #164350, #164354
2025-10-16 00:55:43 +00:00
48064acf37 Move AT_FORALL_... macros and ScalarTypeToCPPTypeT to headeronly (#164350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164350
Approved by: https://github.com/janeyx99
2025-10-16 00:55:42 +00:00
37d57ac9cb Use sym_eq in _check_rms_norm_inputs_symint (#165112)
Summary:
### Problem
ArrayRef's `equals()`does elementwise quality using `==` operator. This can cause a DDE for unbacked symints since `==`  operator calls `guard_bool`.
```
// SymInt.h
bool operator==(const SymInt& o) const {
  return sym_eq(o).guard_bool(__FILE__, __LINE__);
}
```

### Solution
Adds `sym_equals()` to do elementwise equality for `SymIntArrayRef`. Use this instead of `equals()` for `SymIntArrayRef`.

Reviewed By: guangy10, pianpwk, muchulee8

Differential Revision: D84168401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165112
Approved by: https://github.com/Skylion007
2025-10-14 00:06:24 +00:00
de8d81275a Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-11 01:03:55 +00:00
ef50c9b557 Remove unnecessary "static" for definitions in anonymous namespace (#165035)
This PR removes unnecessary "static" for C++ functions and variables in anonymous namespace as detected by clang-tidy. This enhances code readability. The related rules are planed to be enabled in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165035
Approved by: https://github.com/Skylion007
2025-10-11 00:04:23 +00:00
5c3fe9fb30 Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)"
This reverts commit a6fa4f9c283971c0fb6f60a89674a1f35370ac79.

Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/izaitsevfb due to introduces numeric issues internally, see [D84326613](https://www.internalfb.com/diff/D84326613) ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3392203314))
2025-10-10 20:21:12 +00:00
7f2a902ea2 more sizelike deprecation (#164889)
remove expext_size c++ bindings and usages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164889
Approved by: https://github.com/mlazos
ghstack dependencies: #164884, #164885, #164886, #164887, #164888
2025-10-10 03:45:06 +00:00
a6fa4f9c28 Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-10 00:15:00 +00:00
0fd976b65c Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 20:49:46 +00:00
688efd9741 Revert "Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)"
This reverts commit 87eccf10e8484c9e59ef81ae7bdee68d3db4f605.

Reverted https://github.com/pytorch/pytorch/pull/164741 on behalf of https://github.com/malfet due to But it breaks MacOS builds, see https://github.com/pytorch/pytorch/actions/runs/18382886648/job/52373781138 ([comment](https://github.com/pytorch/pytorch/pull/164741#issuecomment-3386859778))
2025-10-09 17:30:25 +00:00
87eccf10e8 Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 16:45:31 +00:00
06d86e58d0 Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)"
This reverts commit d40a9bfb8da0dc1ac1e6e56b33a25979112874de.

Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3385056722))
2025-10-09 09:50:59 +00:00
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR adds unused parameter name comments in C++ declarations to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
d40a9bfb8d Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
2025-10-09 04:49:44 +00:00
43fc859625 Don't return values in void functions (#164809)
This PR fixes returning values in void C++ functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164809
Approved by: https://github.com/janeyx99
2025-10-08 01:04:14 +00:00
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
331191ce4b Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 29cbcbac4215e0d9070a1b7a07ddaec9a36bbd08.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/izaitsevfb due to reverted internally, see [D83214133](https://www.internalfb.com/diff/D83214133) ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3369348172))
2025-10-05 21:39:57 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
b5c4f46bb9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.

Changes done in this PR:

1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.

This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-10-01 21:32:59 +00:00
7f3dc45300 Migrate DeviceType to torch/headeronly (#163999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999
Approved by: https://github.com/mikaylagawarecki
2025-09-30 23:13:27 +00:00
cc5d74c366 Revert "[BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)"
This reverts commit 94195a37ae4eae9c486a81b0f67725c8970f74d6.

Reverted https://github.com/pytorch/pytorch/pull/163464 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163464#issuecomment-3353307034))
2025-09-30 18:20:20 +00:00
410ed3006b Revert "Add functions to setup PrivateUse1 as a python backend device. (#157859)"
This reverts commit 1310d6a1f9194ddcf6753f7e12fb78f278451f8a.

Reverted https://github.com/pytorch/pytorch/pull/157859 on behalf of https://github.com/jeanschmidt due to introduce linting errors ([comment](https://github.com/pytorch/pytorch/pull/157859#issuecomment-3352140098))
2025-09-30 13:24:37 +00:00
1310d6a1f9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.

Changes done in this PR:

1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.

This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-09-30 08:39:36 +00:00
7bbde0c094 Remove unused argument from DEFINE_BINARY macro. (#163868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163868
Approved by: https://github.com/Skylion007
ghstack dependencies: #163822
2025-09-28 06:32:41 +00:00
94195a37ae [BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)
Removes HermeticPyObjectTLS as we no longer need since torch deploy is no longer supported. PythonOpRegistrationTrampoline is also drastically simplified as and being prepped for removal in a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163464
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 23:30:50 +00:00
29cbcbac42 [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This pr gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process

Gonna ask for review from @huydhn as there are some changes to CI.

Testing: imported internally and the failed android build seems to work now!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-25 08:53:19 +00:00
edafc902d7 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit d1993c27ae59842c887d549a3f8936fbcd769498.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/wdvr due to reverted internally, please see D82771705 @PaliC ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3317110247))
2025-09-22 06:22:37 +00:00
033b7d1e1a [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available (#163187)
Reland of #160532

Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode. User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call. This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
2025-09-18 04:46:26 +00:00
d1993c27ae [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This pr gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-17 16:40:55 +00:00
79fd497423 Revert "[Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)"
This reverts commit f1eb99e2e4363f20eb5896433e1eb7f7500aadea.

Reverted https://github.com/pytorch/pytorch/pull/163016 on behalf of https://github.com/jeffdaily due to broke rocm CI, see export/test_export_opinfo.py::TestExportOnFakeCudaCUDA::test_fake_export_nonzero_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17787208381/job/50564369696) [HUD commit link](f1eb99e2e4) ([comment](https://github.com/pytorch/pytorch/pull/163016#issuecomment-3303707552))
2025-09-17 16:17:53 +00:00
a63221a335 Fix TODO in make_tensor_for_subclass_helper (#162336)
The constructor does accept a DataPtr (had to fix the DataPtr variant not accepting a SymInt, though).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162336
Approved by: https://github.com/ezyang
ghstack dependencies: #162298
2025-09-17 06:46:34 +00:00
f1eb99e2e4 [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)
Reland of #160532

Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
2025-09-17 05:01:33 +00:00
4db203f875 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 05ee8114f818a95745c812c3cd7aa8e784e61a9a.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/jeanschmidt due to seems to have introduced errors in linting see https://github.com/pytorch/pytorch/actions/runs/17750689989/job/50444910643 ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3298626136))
2025-09-16 12:52:57 +00:00
05ee8114f8 [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This pr gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-16 00:37:09 +00:00
9c93dc8123 Revert "Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)"
This reverts commit a956c4ab1cb13079203a8f07eb26218724f54dc8.

Reverted https://github.com/pytorch/pytorch/pull/160532 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160532#issuecomment-3287745165))
2025-09-13 07:42:12 +00:00
a956c4ab1c Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)
Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode.
User commonly need to move sample inputs to the cuda device with .to("cuda:0") or .to("cuda") call.
This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Test Plan:
CI

Rollback Plan:

Differential Revision: D80181887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-13 01:50:51 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous()
but want to find those call sites to handle this properly by calling  is_contiguous_or_false() and not is_contiguous() explitly when appropriate.
I had to fix one issue after removing the implicit size oblivious reasoning. here is context

we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format);

This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning.
once we removed that implicit size oblivious reasoning, the right thing we want is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get DDE even if the caller is doing sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
79fcd5247a symbolic cpp channels_last_contiguous (#160402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160402
Approved by: https://github.com/aorenste
2025-09-05 21:40:32 +00:00
b0a3e58dd7 Add inline fast paths for SymInt operators (#161586)
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from #160580
and Linux perf.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
2025-09-03 06:54:47 +00:00
fa1514acf1 Outline SymInt::maybe_as_int_slow_path (#161466)
Keeps SymInt::maybe_as_int small enough to inline.

Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161466
Approved by: https://github.com/ezyang
2025-09-03 06:54:47 +00:00
c03d8d4082 Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)" (#161626)
This reverts commit 908c5cc4c0f22d141776bde47c296b5186691855.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626
Approved by: https://github.com/atalman
ghstack dependencies: #161625
2025-08-27 21:37:14 +00:00
2beffb3311 Refactoring TensorImpl by using constexpr and std::is_same_v (#161043)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161043
Approved by: https://github.com/Skylion007
2025-08-22 10:49:49 +00:00
b82aa3df20 Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)"
This reverts commit e444cd24d48b3a46f067974f2cc157f5ed27709f.

Reverted https://github.com/pytorch/pytorch/pull/159197 on behalf of https://github.com/laithsakka due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/159197#issuecomment-3195436668))
2025-08-18 07:22:13 +00:00
e444cd24d4 Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous()
but want to find those call sites to handle this properly by calling  is_contiguous_or_false() and not is_contiguous() explitly when appropriate.
I had to fix one issue after removing the implicit size oblivious reasoning. here is context

we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format);

This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning.
once we removed that implicit size oblivious reasoning, the right thing we want is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get DDE even if the caller is doing sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
2025-08-16 09:15:58 +00:00