16370 Commits

Author SHA1 Message Date
33aabdd8ac [Code Clean] Remove deadcodes about Python3.9 [3/N] (#163629)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163629
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627
2025-09-24 07:30:50 +00:00
11a231ef52 [c10d] P2P tensors must be dense (#163719)
Fixes #161324 by adding an `is_non_overlapping_and_dense` check.
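
For context, a minimal sketch of what "dense" means here, assuming the private helper `torch._prims_common.is_non_overlapping_and_dense` is the same predicate as the check referenced above:

```python
# A sketch of the density check (hedged: uses a private helper that may move).
import torch
from torch._prims_common import is_non_overlapping_and_dense

x = torch.randn(4, 8)
print(is_non_overlapping_and_dense(x))          # True: contiguous storage
print(is_non_overlapping_and_dense(x[:, ::2]))  # False: strided view leaves gaps
print(is_non_overlapping_and_dense(x.t()))      # True: permuted, but still dense
```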

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163719
Approved by: https://github.com/ngimel
2025-09-24 06:58:03 +00:00
4535254c28 [3/N] Use std::filesystem in inductor (#163632)
Continued work to use std::filesystem in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163632
Approved by: https://github.com/Skylion007
2025-09-24 00:23:34 +00:00
6b5ad5f211 [Kineto] Add list of string parsing for profiler (#163593)
Summary:
We add parsing for lists of strings. This is needed for AOTInductor
profiling of Triton kernels' input information.

Test Plan:
Included in commit: test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
2025-09-23 22:45:49 +00:00
fde929c8a8 [AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
This should fix the AOTI unit tests in `test_aot_inductor_package.py` that fail at `compile_so`.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor_package.py -v -k test_multiple_methods
```

The major fix is in `get_cpp_compile_command`. The code is aligned with the cpp_builder frontend code:
3ef1bef36c/torch/_inductor/cpp_builder.py (L1780-L1790)
3ef1bef36c/torch/_inductor/cpp_builder.py (L1959-L1976)

Fixed on Windows and also validated on Linux (screenshots in the PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163561
Approved by: https://github.com/jansel
2025-09-23 17:38:18 +00:00
bb5be56619 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility provides a more structured way to query device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in benchmarks and tests. The intent is to add more devices and more parameters (e.g., L2 cache bandwidth, NVLink) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-09-23 04:48:19 +00:00
447b8fc56d [2/N] Use filesystem in inductor (#163465)
Use std::filesystem in most inductor code. This is follow-up of https://github.com/pytorch/pytorch/pull/152288 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163465
Approved by: https://github.com/Skylion007
2025-09-23 03:56:16 +00:00
8f30a8dc47 [AOTInductor] Add grid information for Triton Kernels (#160131)
Summary:
Add grid information for Triton Kernels for profiling in Kineto.

Test Plan:
Before/after screenshots (in the PR) show the grid information added to the kernel events.

*Note: we can extract grid size etc. from the device-side trace, but this PR focuses specifically on the host side, mainly so we can add more host-side information needed for performance profiling in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160131
Approved by: https://github.com/desertfire
2025-09-23 02:15:24 +00:00
09cb34c1dc [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-22 21:12:18 +00:00
8abc2af9b9 [STABLE ABI] Add clone method to torch::stable::Tensor (#161896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161896
Approved by: https://github.com/janeyx99
2025-09-22 20:39:24 +00:00
e065d35fd3 [BE]: Add a few more missing move from return indices (#163456)
@ezyang A follow-up where I found a few more missing returns of this style in the codebase. Follow-up to #163416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163456
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-09-22 20:24:23 +00:00
d15048493c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 20:02:29 +00:00
eaa613bf66 Revert "[opaque_obj] Add set_payload + docs (#163276)"
This reverts commit dd30667f6c2204a15e91eaeb61c84f9080be7748.

Reverted https://github.com/pytorch/pytorch/pull/163276 on behalf of https://github.com/ZainRizvi due to Sorry but this fails lint on trunk: [GH job link](https://github.com/pytorch/pytorch/actions/runs/17924886989/job/50968430537) [HUD commit link](dd30667f6c) ([comment](https://github.com/pytorch/pytorch/pull/163276#issuecomment-3321054061))
2025-09-22 19:32:30 +00:00
dd30667f6c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 18:30:28 +00:00
3be9c86c74 [opaque obj] Initial OpaqueObject (#162660)
A big pain point people have with custom ops is that they do not accept arbitrary inputs/outputs. In this PR we create the concept of an "OpaqueObject", which allows users to pass arbitrary Python objects into custom operators.

Some parts of this implementation are still slightly annoying:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of the actual Python type
- `@torch.library.custom_op` doesn't work... yet?

UX:
```python
from torch._library.opaque_object import make_opaque, get_payload

# your custom python class
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)

# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)

# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertEqual(queue.size(), 1)
```

Authoring a custom op:

```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    f"_TestOpaqueObject::queue_push",
    "(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=lib,
)

@torch.library.impl("_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
    # We can get the payload directly by get_payload(q)
    queue = get_payload(q)
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
2025-09-22 18:30:28 +00:00
edafc902d7 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit d1993c27ae59842c887d549a3f8936fbcd769498.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/wdvr due to reverted internally, please see D82771705 @PaliC ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3317110247))
2025-09-22 06:22:37 +00:00
f0078941cf Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530))
2025-09-22 05:39:07 +00:00
8b14f43da9 [torch] DRY a couple of lines in unpickler (#163447)
Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D82660989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163447
Approved by: https://github.com/Skylion007
2025-09-21 20:29:33 +00:00
5599f487ef Fully native DTensor.__new__ (#162508)
Move the entirety of `__new__` into C++, saving a layer of disable_dynamo and making progress toward all-C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162508
Approved by: https://github.com/ezyang
ghstack dependencies: #161695
2025-09-21 18:36:05 +00:00
51152efa67 Remove autograd code for Python < 3.9 (#163313)
As PyTorch is moving to Python 3.10, it is safe to remove code for Python < 3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163313
Approved by: https://github.com/ezyang
2025-09-21 15:35:06 +00:00
97eb7a281d torchdim Python port (#160236)
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and see if a special FCD implementation needs to be dispatched to first; if there is none, we fall back to the standard level strategy.
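
A minimal sketch of that dispatch pattern (illustrative only, not torchdim's actual code; `FCD_IMPLS` is a hypothetical registry):

```python
import torch

FCD_IMPLS = {}  # hypothetical registry: torch op -> special FCD implementation

class FCDTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        impl = FCD_IMPLS.get(func)
        if impl is not None:
            return impl(*args, **kwargs)  # special FCD implementation first
        # nothing special registered: fall back to the standard strategy
        return super().__torch_function__(func, types, args, kwargs)
```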

Because there is no longer a C-binding equivalent of the classes, we've condensed _C.Dim and Dim together, and similarly for Tensor. This resulted in some bugs, as the Python API sometimes differs from the C API. I've attempted to disambiguate these but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful, as Dim must abide by Tensor equality semantics but has pointer equality in C (DimEntry doesn't have this problem). Another subtle difference between C and Python is that we no longer get implicit conversions from Dim to DimEntry; this also caused some bugs.

Much of the mechanical porting work was done by Claude Code. I have a separate PR that deletes functorch._C, but it was useful having dim.cpp to point Claude at, so I haven't done it in this PR. From a reviewing perspective, I need to re-verify that I didn't forget to port anything; some noticeably missing "small" things are patched_dim_method. I am still in the process of carefully doing a side-by-side review of the ports; "simplifications" from Claude Code were also a major source of bugs.

There are two major feature gaps in the implementation:

- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy, just need to do it.  However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
2025-09-21 03:01:04 +00:00
8e3fd3d4f9 [AI Codemod][DevmatePerfOptimizationVectorReallocation] fbcode/caffe2/torch/csrc/jit/serialization/unpickler.cpp (#163240)
Reviewed By: marksantaniello, yfeldblum

Differential Revision: D82140619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163240
Approved by: https://github.com/Skylion007
2025-09-20 23:26:24 +00:00
f9074c7332 [STABLE ABI] Add copy_ operation. (#161895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161895
Approved by: https://github.com/janeyx99
2025-09-20 10:30:33 +00:00
fab8455943 Don't use declarations in global namespace in stable headers (#163352)
Fixes https://github.com/pytorch/pytorch/issues/163338

Configured https://clang.llvm.org/extra/clang-tidy/checks/google/global-names-in-headers.html for torch/csrc/stable

Note that this doesn't error for the DeleterFnPtr case, but it will generate the following for `using torch::stable::Tensor;`:

```
>>> Lint for torch/csrc/stable/ops.h:

  Error (CLANGTIDY) [google-global-names-in-headers,-warnings-as-errors]
    using declarations in the global namespace in headers are prohibited

         10  |#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
         11  |#include <torch/headeronly/core/ScalarType.h>
         12  |
    >>>  13  |using torch::stable::Tensor;
         14  |
         15  |namespace torch::stable {
         16  |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163352
Approved by: https://github.com/janeyx99
2025-09-19 21:15:52 +00:00
9f8a311af0 [Inductor][Intel GPU] Save threads_per_warp from Triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-09-19 21:06:56 +00:00
df9a4824e6 Bugfix for doing negative padding (#161639)
Fixes #161014

This introduces a fix consistent with the existing exception handling. As outlined in issue #161014, there is an edge case where negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >= 0` to include the zero-dim case and letting the operator return an empty tensor.
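
A hedged sketch of the edge case (constant padding shrinking a dimension exactly to zero, which previously raised and with this fix returns an empty tensor):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4)
y = F.pad(x, (-2, -2))  # 4 + (-2) + (-2) == 0: the dim shrinks exactly to zero
print(y.shape)          # torch.Size([1, 0]) with the fix, instead of an error
```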

In the PR I have added this edge case, so the test now checks negative padding where the dimension gets reduced to zero. The sample is only for the `constant` type of padding; I would like feedback on whether it is necessary to add the same sample for the `reduce` type as well.

This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!

@malfet @manuelcandales @janeyx99 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
2025-09-19 20:57:05 +00:00
7130b174e0 [SymmMem] Fix memory allocation hold-up (#162680)
Problem:
Without a MemPool, it looks like the nvshmem backend never deallocates memory.

Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.

Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
2025-09-19 20:19:47 +00:00
f8fb437197 [SymmMem] Barrier on team instead of world (#163298)
As titled. This avoids a potential hang when running dispatch and combine in subgroups.

The rest is just a rearrangement of the tests to create a sub-group test class (no substantive change).
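
For intuition, an analogous pattern at the torch.distributed level (a hedged illustration only; the actual change is inside the SymmMem/nvshmem backend):

```python
# Assumes an already-initialized process group; the team below is hypothetical.
import torch.distributed as dist

team = dist.new_group(ranks=[0, 1])  # every rank must call new_group
if dist.get_rank() in (0, 1):
    # Barrier on the team, not the world: a world-level barrier hangs when
    # only a subset of ranks participates in the collective.
    dist.barrier(group=team)
```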

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
2025-09-19 20:19:47 +00:00
4a160dae3c [CUDA] revert PR 130472 (#162950)
This change may also resolve https://github.com/pytorch/pytorch/issues/161789, though verification is still needed.

PR #130472 introduced the problem of freeing the same address without cleaning up metadata; per the discussion below, it has been reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
2025-09-19 19:50:44 +00:00
ab5086a7ae [WOQ] Add XPU kernel for _weight_int8pack_mm (#160938)
Summary:
This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently supported only on CPU and CUDA.

Motivation:
Same as https://github.com/pytorch/pytorch/pull/159325.
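
A hedged usage sketch of the op being ported (shapes follow the existing CPU/CUDA kernels: activation [M, K], int8 weight [N, K], per-channel scales [N]; requires an XPU build and device):

```python
import torch

x = torch.randn(4, 16, device="xpu", dtype=torch.bfloat16)             # [M, K]
w = torch.randint(-128, 127, (8, 16), device="xpu", dtype=torch.int8)  # [N, K]
scales = torch.rand(8, device="xpu", dtype=torch.bfloat16)             # [N]
out = torch.ops.aten._weight_int8pack_mm(x, w, scales)                 # [4, 8]
```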

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160938
Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
2025-09-19 07:37:14 +00:00
32ad29b72a Revert "[dynamo][guards] Fail on an unknown framelocals to dict conversion (#162695)"
This reverts commit a8432bcaadd6dea52a94429dced1fb4550f2f560.

Reverted https://github.com/pytorch/pytorch/pull/162695 on behalf of https://github.com/anijain2305 due to internal failure at https://fburl.com/workplace/qiitdlp6 ([comment](https://github.com/pytorch/pytorch/pull/162695#issuecomment-3310757225))
2025-09-19 06:18:27 +00:00
1302637a23 Revert "[dynamo][guards] Do not construct entire framelocals dict for LAMBDA_GUARD (#162525)"
This reverts commit 5f630d28d7ff9fdd8bd6cdbe2438e5c821007845.

Reverted https://github.com/pytorch/pytorch/pull/162525 on behalf of https://github.com/anijain2305 due to internal tests fail ([comment](https://github.com/pytorch/pytorch/pull/162525#issuecomment-3310748980))
2025-09-19 06:15:28 +00:00
c91f59b1a0 Fix performance regression when indexing by Numpy arrays (#163280)
Benchmark script:

```python
import time
import numpy as np
import torch

def main() -> None:
    for i in range(10):
        block_indices = np.arange(16384, dtype=np.int32)
        block_indices = block_indices.reshape(-1).clip(max=255)
        batch_indices = np.zeros(16384, dtype=np.int64)
        virtual_batches = 32
        block_table = torch.randn(32, 256)
        start = time.perf_counter()
        block_table[batch_indices, block_indices].view(virtual_batches, -1)
        end = time.perf_counter()
        time_elapsed_ms = (end - start) * 1000
        print(f"Function execution time: {time_elapsed_ms:.1f}ms")

if __name__ == "__main__":
    main()
```

Before:

```
(a) [ezyang@devvm006.dkl0 ~/local/b/pytorch] python ben.py
Function execution time: 28.5ms
Function execution time: 12.9ms
Function execution time: 12.6ms
Function execution time: 13.5ms
Function execution time: 12.0ms
Function execution time: 13.4ms
Function execution time: 12.9ms
Function execution time: 12.9ms
Function execution time: 13.1ms
Function execution time: 13.0ms
```

After:

```
Function execution time: 17.8ms
Function execution time: 2.5ms
Function execution time: 1.3ms
Function execution time: 2.5ms
Function execution time: 2.3ms
Function execution time: 1.3ms
Function execution time: 2.4ms
Function execution time: 2.5ms
Function execution time: 2.5ms
Function execution time: 2.4ms
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163280
Approved by: https://github.com/SherlockNoMad, https://github.com/cyyever
2025-09-19 05:02:58 +00:00
76a841fd47 Port OpSchema.__post_init__ and OpSchema._recompute_comparison_key to C++ (#161695)
I initially didn't see good results porting this, but it was apparently because of pybind11 function calling overhead. (pybind11's object-handling primitives seem fine enough.) I'm interested in setting up nanobind, but this demonstrates it's not blocking.

Differential Revision: [D81530102](https://our.internmc.facebook.com/intern/diff/D81530102)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161695
Approved by: https://github.com/ezyang
2025-09-19 04:07:30 +00:00
4c007073e6 [dynamic shapes] DynamicInts prototype (#162194)
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.

Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be cast to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3))  # same as f(3)
```

Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.

Differential Revision: D81698719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162194
Approved by: https://github.com/bobrenjc93
2025-09-18 23:26:28 +00:00
4b7aed89d8 Revert "[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)"
This reverts commit 627482a7b7780752c0e7aea034a2eb2db5899fcc.

Reverted https://github.com/pytorch/pytorch/pull/162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](https://github.com/pytorch/pytorch/pull/162942#issuecomment-3308784448))
2025-09-18 17:49:16 +00:00
627482a7b7 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility provides a more structured way to query device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in benchmarks and tests. The intent is to add more devices and more parameters (e.g., L2 cache bandwidth, NVLink) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
2025-09-18 06:40:07 +00:00
033b7d1e1a [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available (#163187)
Reland of #160532

Summary:

To support exporting a CUDA model on a CPU-only machine under fake tensor mode, users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") or .to("cuda") call. This diff supports that.
I expect the following pattern to work:
```python
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
2025-09-18 04:46:26 +00:00
5236007806 [MPS] Add embedding_bag forward pass (#163012)
Part of #162270
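
A hedged usage sketch (standard EmbeddingBag API; this PR adds the forward kernel for the "mps" backend, so it assumes an MPS-capable machine):

```python
import torch

emb = torch.nn.EmbeddingBag(10, 3, mode="sum").to("mps")
inp = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9], device="mps")
offsets = torch.tensor([0, 4], device="mps")  # two bags of four indices each
print(emb(inp, offsets).shape)  # torch.Size([2, 3])
```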

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163012
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-09-17 19:00:47 +00:00
2291199e9b [AOTInductor] Use CudaCachingAllocator for memory allocation (#162893)
Summary:
Use c10::CudaCachingAllocator for AOTInductor's initial constant buffer
allocation.

Test Plan:
Activate test under test/cpp/aoti_inference/test.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162893
Approved by: https://github.com/desertfire
2025-09-17 17:08:20 +00:00
d1993c27ae [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This PR gets rid of the pyobj_interpreter_ variable from PyObjectSlot, saving a word in the process.

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-17 16:40:55 +00:00
79fd497423 Revert "[Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)"
This reverts commit f1eb99e2e4363f20eb5896433e1eb7f7500aadea.

Reverted https://github.com/pytorch/pytorch/pull/163016 on behalf of https://github.com/jeffdaily due to broke rocm CI, see export/test_export_opinfo.py::TestExportOnFakeCudaCUDA::test_fake_export_nonzero_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17787208381/job/50564369696) [HUD commit link](f1eb99e2e4) ([comment](https://github.com/pytorch/pytorch/pull/163016#issuecomment-3303707552))
2025-09-17 16:17:53 +00:00
a63221a335 Fix TODO in make_tensor_for_subclass_helper (#162336)
The constructor does accept a DataPtr (had to fix the DataPtr variant not accepting a SymInt, though).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162336
Approved by: https://github.com/ezyang
ghstack dependencies: #162298
2025-09-17 06:46:34 +00:00
f1eb99e2e4 [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)
Reland of #160532

Summary:

To support exporting a CUDA model on a CPU-only machine under fake tensor mode, users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") or .to("cuda") call. This diff supports that.
I expect the following pattern to work:
```python
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
2025-09-17 05:01:33 +00:00
0819de412d Add a new API torch.xpu.can_device_access_peer for Intel GPU (#162705)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.can_device_access_peer`, which is used in vllm distributed [scenarios](2048c4e379/vllm/distributed/device_communicators/custom_all_reduce.py (L37))
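
A hedged usage sketch, assuming the signature mirrors torch.cuda.can_device_access_peer(device, peer_device):

```python
import torch

if torch.xpu.is_available() and torch.xpu.device_count() >= 2:
    # True if device 0 can directly access memory on device 1
    print(torch.xpu.can_device_access_peer(0, 1))
```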

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162705
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2025-09-16 18:00:22 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
4db203f875 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 05ee8114f818a95745c812c3cd7aa8e784e61a9a.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/jeanschmidt due to seems to have introduced errors in linting see https://github.com/pytorch/pytorch/actions/runs/17750689989/job/50444910643 ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3298626136))
2025-09-16 12:52:57 +00:00
e3783a9575 Replace std::runtime_error with TORCH_CHECK (#159344)
Fixes part of #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159344
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-09-16 09:00:06 +00:00
05ee8114f8 [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This PR gets rid of the pyobj_interpreter_ variable from PyObjectSlot, saving a word in the process.

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-16 00:37:09 +00:00
dae5beae8e [RecordFunction] Add Scope for Record Function Fast (#162661)
Differential Revision: D82164587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162661
Approved by: https://github.com/davidberard98
2025-09-15 21:01:47 +00:00