Compare commits


260 Commits

Author SHA1 Message Date
cyy
67e094992e Revert changes 2025-09-21 09:20:22 +00:00
cyy
14622ff376 More fixes 2025-09-21 09:20:22 +00:00
cyy
35d5ec3d85 Fix clang-tidy warnings of performance 2025-09-21 09:20:22 +00:00
97eb7a281d torchdim Python port (#160236)
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and see if a special FCD implementation needs to be dispatched to first; if there is nothing, we fall back to the standard level strategy.

Because there is no longer a C binding equivalent of the classes, we've condensed _C.Dim and Dim together, and similarly for Tensor. This resulted in some bugs, as the Python API is sometimes different from the C API. I've attempted to disambiguate these but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful, as Dim must abide by Tensor equality semantics but uses pointer equality in C (DimEntry doesn't have this problem). Another subtle difference between C and Python is that we no longer get implicit conversions from Dim to DimEntry; this also caused some bugs.

Much of the mechanical porting work was done by claude code. I have a separate PR that deletes functorch._C, but it was useful having dim.cpp to point claude at, so I haven't done it in this PR. From a reviewing perspective, I need to re-review that I didn't forget to port anything; some noticeably missing "small" things, such as patched_dim_method, stand out. I am still in the process of carefully doing a side-by-side review of the ports; "simplifications" from claude code were also a major source of bugs.

There are two major feature gaps in the implementation:

- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy, just need to do it.  However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?
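
For illustration, here is a minimal sketch (not the actual torchdim code) of the dispatch strategy described above: land in `__torch_function__`, try a registered first-class-dim (FCD) implementation first, and otherwise fall back to the standard behavior. The registry name here is hypothetical.

```python
import torch

_FCD_IMPLS = {}  # hypothetical registry: torch API -> special FCD handler

class DimTensorSketch(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        special = _FCD_IMPLS.get(func)
        if special is not None:
            # a special FCD implementation wants to handle this call first
            return special(*args, **kwargs)
        # nothing registered: fall back to the standard level strategy
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(3).as_subclass(DimTensorSketch)
print(torch.add(t, 1))  # no FCD handler registered, so this falls through
```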

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
2025-09-21 03:01:04 +00:00
2887f3fde4 [BE] Slight improvements to documentation in python_dispatch (#162963)
I was briefly confused about which way I should iterate the stack; here are the
comments I wanted.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162963
Approved by: https://github.com/albanD, https://github.com/SherlockNoMad
2025-09-21 01:45:46 +00:00
eqy
e37b600007 [CUDA][cuBLAS][FP8] Forward-fix #162022 (#163354)
@ngimel is right, `ciflow/h100` doesn't actually appear to test the PR :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163354
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-09-21 00:55:12 +00:00
8e3fd3d4f9 [AI Codemod][DevmatePerfOptimizationVectorReallocation] fbcode/caffe2/torch/csrc/jit/serialization/unpickler.cpp (#163240)
Reviewed By: marksantaniello, yfeldblum

Differential Revision: D82140619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163240
Approved by: https://github.com/Skylion007
2025-09-20 23:26:24 +00:00
9e3725e8e5 make fullgraph_capture work on mod, args, kwargs (#162849)
Summary:
Today `fullgraph_capture` takes a frame, but clients usually start from a callable (`nn.Module`, function, or method) and example inputs (args and kwargs) and then explicitly set up the frame to pass. This is boilerplate, potentially tricky to get right, that can be hidden inside the API.

The original `fullgraph_capture` now becomes `_fullgraph_capture_frame`.
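
A hedged sketch of the new call shape described above; the import path and exact return value here are assumptions, not taken from this diff:

```python
import torch
# assumed import location; the real module may differ
from torch._dynamo.convert_frame import fullgraph_capture

class M(torch.nn.Module):
    def forward(self, x, *, scale: float = 1.0):
        return x * scale

mod = M()
args = (torch.randn(4),)
kwargs = {"scale": 2.0}

# previously callers had to build the frame by hand; now the callable and
# example inputs are passed directly (sketch)
capture_output = fullgraph_capture(mod, args, kwargs)
```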

Test Plan:
existing tests

Rollback Plan:

Differential Revision: D82339400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162849
Approved by: https://github.com/zhxchen17
2025-09-20 22:48:06 +00:00
3938175ec1 [1/n] Support cpu_tensor.to("cuda:0") in FakeTensorMode on cuda-less machine (#160431)
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") call.
This diff supports this.

Notice that .to("cuda") doesn't work yet, as it queries the current device index by calling the CUDA API.

I expect the following pattern to work

```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])

    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)

```

Test Plan:
buck2 run  fbcode//caffe2/test:fake_tensor -- --r test_fake_gpu_no_init

Rollback Plan:

Differential Revision: D80101283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160431
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-20 21:33:53 +00:00
d70c0babf5 minimize graph capture output (#162211)
Currently OutputGraphGuardsState is separated out as a serializable interface for OutputGraph, but some of the typing around it is incorrect in dynamo's guards.py and output_graph.py: more fields are used by code than claimed by OutputGraphGuardsState, and it works because either the full OutputGraph is passed in or the parts that use those fields are dead when OutputGraphGuardsState is passed in.
In this PR we try to further separate the necessary fields of OutputGraph that should be retained by a full graph capture mechanism, not just limited to dynamo (as it is currently) but also something like make_fx (in the future). Since these fields do not need to be serialized, the result is an intermediate "common" data structure that is between OutputGraphGuardsState and OutputGraph in the inheritance hierarchy.
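
A hedged skeleton of the resulting inheritance shape; the intermediate class name and its contents below are illustrative, not the actual names in the PR:

```python
class OutputGraphGuardsState:
    """Serializable guard-related state."""

class OutputGraphCommon(OutputGraphGuardsState):  # hypothetical name
    """Non-serialized fields that any full graph capture mechanism
    (dynamo today, possibly make_fx later) needs to retain."""

class OutputGraph(OutputGraphCommon):
    """Full dynamo output graph with everything else."""
```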

Differential Revision: D81718791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162211
Approved by: https://github.com/zhxchen17
2025-09-20 15:52:28 +00:00
f9074c7332 [STABLE ABI] Add copy_ operation. (#161895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161895
Approved by: https://github.com/janeyx99
2025-09-20 10:30:33 +00:00
eb11d172e3 [Caffe2] Improve SVE batch box cox by 2% (#163360)
Summary:
Improve bound checking on exp computation, decreasing the longest dependency chain by 1.

Box-cox benchmarks show about 2% of improved throughput.
Precision remains unaltered.

before:

NonZeroLambdaBatch                                        155.30us     6.44K

after:

NonZeroLambdaBatch                                        151.78us     6.59K
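
For context, a minimal Python sketch of the Box-Cox transform that the vectorized kernel computes; the non-zero-lambda branch is the one exercised by the NonZeroLambdaBatch benchmark and relies on exp/log, which is where the bound check sits:

```python
import math

def box_cox(x: float, lmbda: float) -> float:
    # y = (x**lmbda - 1) / lmbda for lmbda != 0 (computed via exp(lmbda * log(x))),
    # and y = log(x) for lmbda == 0
    if lmbda == 0.0:
        return math.log(x)
    return (math.exp(lmbda * math.log(x)) - 1.0) / lmbda

print(box_cox(2.0, 0.5))
```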

Test Plan:
Correctness:

buck2 test @//mode/opt //koski/functions_contrib/df4ai/tests:batch_box_cox_test

Performance:

buck2 run @//mode/opt //koski/functions_contrib/df4ai/benchmark:boxcox_benchmark

Differential Revision:
D82847111

Privacy Context Container: L1208939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163360
Approved by: https://github.com/Skylion007
2025-09-20 06:42:26 +00:00
5050cfa363 [Optimus] fix fp8 activation quantization for duplicated forward outputs (#163364)
Summary: We observed a case where the fwd graph has duplicated return nodes, which leads to errors due to fx renaming the nodes; thus we add poi info into the node name.
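
A hedged illustration of the failure mode: a forward function whose graph returns the same node twice, which FX then has to disambiguate when naming outputs:

```python
import torch
from torch.fx import symbolic_trace

def fwd(x):
    y = x + 1
    return y, y  # duplicated return node

gm = symbolic_trace(fwd)
print(gm.graph)  # the output tuple references the same 'add' node twice
```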

Test Plan:
### unit test

```
CUDA_VISIBLE_DEVICES=3 buck2 test mode/opt -m ovr_config//triton:beta -c fbcode.nvcc_arch=b200a -c fbcode.platform010_cuda_version=12.8 //caffe2/test/functorch:test_aotdispatch -- test_quantize_activation_duplicate_nodes
```

Buck UI: https://www.internalfb.com/buck2/de5eccc6-4064-4214-843d-70b8e3829afe
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4503599937670844
Network: Up: 217KiB  Down: 72KiB  (reSessionID-73e5c269-4f4d-4a54-896a-79c077eea326)
Executing actions. Remaining     0/2                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 45.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

before
f798417700

after

Differential Revision: D82844100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163364
Approved by: https://github.com/Yuzhen11
2025-09-20 06:33:20 +00:00
d55c9d52cd [CP] Fix cuDNN CP LSE dimension bug (#163231)
We should only unsqueeze if necessary.

Fix https://github.com/pytorch/pytorch/issues/162743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163231
Approved by: https://github.com/eqy
ghstack dependencies: #162539, #162540, #162541, #163115, #163131
2025-09-20 06:13:45 +00:00
0ee331b523 [inductor][choices] move extra kwargs out of get_template_configs (#163209)
# why

- extra kwargs are input/op dependent and not config dependent. We don't
  plan to serialize/deserialize them, and so they need to be fed in
  later beore making the KTC, rather than when getting the config values
  directly

# what

- move extra_kwargs into the KTC and get_ktc interface directly

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "_addmm"
```

Differential Revision: [D82871310](https://our.internmc.facebook.com/intern/diff/D82871310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163209
Approved by: https://github.com/nmacchioni
ghstack dependencies: #163305
2025-09-20 05:30:40 +00:00
df5d6d57c9 [inductor][triton heuristics] move allow tf32 out of config params (#163305)
# why

- this is not directly controlled by the config arg but rather by the
  input and by the inductor wide setting
- it's always the same for every choice
- we want the config kwargs to be *programable* and this is not
  programable in that sense but rather needs to use inductor config

# what

- move generating the ALLOW_TF32 kwarg in Triton templates into
  get_extra_kwargs

# testing

with some annotations, this is now the kwargs and extra_kwargs on addmm

```
{'EVEN_K': True, 'USE_FAST_ACCUM': False, 'ACC_TYPE': 'tl.float32', 'num_stages': 1, 'num_warps': 2, 'BLOCK_M': 32, 'BLOCK_N': 32, 'BLOCK_K': 16, 'hint_override': None, 'GROUP_M': 8} # choice/config kwargs
{'ALLOW_TF32': True, 'epilogue_fn': <function addmm_epilogue.<locals>.epilogue at 0x7f64d54ff600>, 'epilogue_fn_hash': "['addmm_epilogue', torch.float32, 1, 1]", 'prefix_args': 1} # extra kwargs
```

they're both passed onto the template

Differential Revision: [D82871312](https://our.internmc.facebook.com/intern/diff/D82871312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163305
Approved by: https://github.com/nmacchioni
2025-09-20 05:30:40 +00:00
0b5a99be88 remove duplicate import for defaultdict (#160519)
Fixes #160518

This PR aims to remove the duplicate import of defaultdict in the following file:

ecde76c764/functorch/op_analysis/gen_data.py (L36)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160519
Approved by: https://github.com/malfet
2025-09-20 04:06:39 +00:00
a87aea03f7 Update RandomSampler docstring. data_source must be Sized not Dataset (#158857)
Fixes #158631

The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor.
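
A small illustration of why `Sized` is the right bound: any object with `__len__` works, not just a `Dataset`:

```python
from torch.utils.data import RandomSampler

data = list(range(10))        # a plain list is Sized
sampler = RandomSampler(data)
print(list(sampler))          # indices 0..9 in a random order
```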

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
2025-09-20 04:05:25 +00:00
e56dd5d770 [Inductor-FX] Support torch.cond (#163234)
# Feature

Support `torch.cond` in the FX converter. The generated FX IR is conceptually identical to what would come from `torch.export`:
- Submodules are stored as attributes, and accessed via `getattr`.
- The conditional is represented as `torch.ops.higher_order.cond`, which takes in the subgraphs, a predicate and submodule inputs.
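
For reference, a minimal illustrative `torch.cond` program of the kind the converter now supports (not one of the tests added in this PR):

```python
import torch

def true_fn(x):
    return torch.sin(x)

def false_fn(x):
    return torch.cos(x)

@torch.compile
def f(pred, x):
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4)
print(f(torch.tensor(True), x))
```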

# Implementation overview

The FX backend generates code for subgraphs using the following steps:
1. When `codegen_conditional` is called in `WrapperFxCodegen`, we emit a `ConditionalLine`.
   a. We also codegen the true/false subgraphs at this time, storing their subgms for later.
2. At the beginning of FX conversion, generate `get_attr` nodes accessing each subgraph. It's important to do this at the start, before registering the node metadata hook. This also matches the convention followed by torch.export.
3. When we see the `ConditionalLine` in the FX converter, we generate a corresponding `torch.ops.higher_order.cond`.

# Implementation details
This ended up being a substantial change, as wrapper codegen has some special logic for subgraphs.

Certain methods of `PythonWrapperCodegen` are overridden by `SubgraphPythonWrapperCodegen`. To apply these overrides, we use multiple inheritance with the registered subclass of `WrapperFxCodegen`.

Unlike most other wrapper codegen methods, which map 1:1 to Wrapper IR lines, subgraph codegen generates a number of wrapper lines including `EnterSubgraphLine` and `ExitSubgraphLine`, along with Python or C++ code calling the subgraph as a function. These lines are used for some backends' memory planning.

In contrast, FX IR typically represents a subgraph call as a single HOP node, or a `call_module` op. To account for this difference, this PR introduces a new wrapper IR line called `ConditionalLine`, which is only used by the FX backend. We override the `codegen_conditional` method to emit this line. This sidesteps having to port the existing subgraph codegen and associated memory planning to Wrapper IR. (In principle, it seems possible to adapt the existing backends to `ConditionalLine`, but it could be a larger refactor, since we'd also have to update the memory planning.)

Some of the lower-level subgraph codegen methods are still shared between the FX and Python backends, such as `generate_subgraph_common`. Those were easier to port to Wrapper IR.

This also required generalizing the way the FX converter handles graph inputs and outputs. Previously, it assumed the IO signature was the same as `V.graph.module`, but this is only true for the parent graph, and not subgraphs. Instead, we need to call `get_graph_inputs` and `get_graph_outputs` to populate the inputs and outputs for subgraphs.

# Test plan
This PR adds a couple of tests using torch.cond. Here's an example graph generated by one of them:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
    %false_graph_0 : [num_users=1] = get_attr[target=false_graph_0]
    %cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (%arg0_1, %true_graph_0, %false_graph_0, (%arg1_1,)), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%cond, 0), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 6, constant_args_idx: 6, grid: [(1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 6, XBLOCK: 8}})
    return buf1
```

It also removes an existing negative test which checked that a certain error was raised when subgraphs were encountered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163234
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-09-20 03:52:31 +00:00
a31acf32bd Clean up obsoleted vLLM tests (#163383)
They have been removed in https://github.com/vllm-project/vllm/pull/25117 and https://github.com/vllm-project/vllm/pull/22772, thus failing in trunk at the moment after the latest pin commit update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163383
Approved by: https://github.com/wdvr, https://github.com/seemethere, https://github.com/malfet
2025-09-20 02:48:36 +00:00
a1df0b42ce Lazy import to avoid circular import issue for DebugMode (#163381)
as title.
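
A minimal sketch of the lazy-import pattern, assuming DebugMode lives under `torch.utils._debug_mode`; the exact deferred call site in this PR may differ:

```python
def _get_debug_mode_cls():
    # deferred import breaks the import cycle at module load time
    from torch.utils._debug_mode import DebugMode
    return DebugMode
```
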
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163381
Approved by: https://github.com/dolpm
2025-09-20 01:54:57 +00:00
bfe9e60ffb Simplify PrecompileContext to no longer be a CacheArtifactManager (#162886)
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.

This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.

The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch._dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically load only Precompile entries, using PrecompileContext.create_cache_entries().

Saving single entries and such with DynamoCache still works normally.
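
A hedged usage sketch based on the description above; the config flag and entry point are taken from the text, and `torch._dynamo.config` is assumed to be the concrete module:

```python
import torch

torch._dynamo.config.caching_precompile = True  # enable caching precompile

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))

# the regular Megacache entry point now also captures precompile entries
artifacts = torch.compiler.save_cache_artifacts()
```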

Test Plan:
All existing unit tests pass.

Rollback Plan:

Differential Revision: D82380307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
2025-09-20 01:24:37 +00:00
8225a26835 [dynamo] Fix issue with namedtuple slicing (#163351)
Fixes #163253
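
A hedged illustration of the kind of pattern affected (not necessarily the exact repro from the linked issue):

```python
import collections
import torch

Point = collections.namedtuple("Point", ["x", "y", "z"])

@torch.compile(fullgraph=True)
def f(p):
    return p[1:]  # slicing a namedtuple inside compiled code

out = f(Point(torch.ones(1), torch.ones(2), torch.ones(3)))
print(out)
```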

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163351
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2025-09-20 00:42:02 +00:00
093f0642aa [CP][BE] Correct an incorrect docstring (#163131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163131
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #162539, #162540, #162541, #163115
2025-09-19 23:55:03 +00:00
ee7bdd8f2f [graph partition] Add way to register custom rule (#163310)
This PR adds an experimental way to register a custom rule for if
inductor should partition the graph around an operator.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163310
Approved by: https://github.com/ProExpertProg, https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #162117, #162307, #162651
2025-09-19 23:28:03 +00:00
0098e5636d [CI] Move Windows build/tests to Python-3.10 (#162862)
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments, but later on ignores them, namely:
 - `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
 - With this change, the `setup-win` action will create an environment called `py_tmp` with the specific python version + intel-openmp (which is a hard runtime requirement, but for some reason is not packaged into the wheel nor marked as such)
 - Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
2025-09-19 22:51:38 +00:00
9b5ec0ff7c Use computed buffer sizes of torch for cusparseLt metadata (#163125)
Making sure buffer allocation matches what is computed by cusparseLt compression

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163125
Approved by: https://github.com/jcaip
2025-09-19 22:12:40 +00:00
e6a9db58d7 Add analytics ID to cpp docs (#163370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163370
Approved by: https://github.com/albanD
2025-09-19 21:45:19 +00:00
fab8455943 Don't use declarations in global namespace in stable headers (#163352)
Fixes https://github.com/pytorch/pytorch/issues/163338

Configured https://clang.llvm.org/extra/clang-tidy/checks/google/global-names-in-headers.html for torch/csrc/stable

Note that it doesn't error for the DeleterFnPtr case, but it will generate the following for the `using torch::stable::Tensor;` line:

```
>>> Lint for torch/csrc/stable/ops.h:

  Error (CLANGTIDY) [google-global-names-in-headers,-warnings-as-errors]
    using declarations in the global namespace in headers are prohibited

         10  |#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
         11  |#include <torch/headeronly/core/ScalarType.h>
         12  |
    >>>  13  |using torch::stable::Tensor;
         14  |
         15  |namespace torch::stable {
         16  |
   ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163352
Approved by: https://github.com/janeyx99
2025-09-19 21:15:52 +00:00
9f8a311af0 [Inductor][Intel GPU] Save threads_per_warp from triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the cpp wrapper can launch the kernel with the correct configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-09-19 21:06:56 +00:00
df9a4824e6 Bugfix for doing negative padding (#161639)
Fixes #161014

This PR introduces a fix that is consistent with the existing exception handling. As outlined in issue #161014, there is an edge case where the negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >= 0` to include the zero dim, letting the operator return an empty tensor.
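
A hedged illustration of the edge case (using `torch.nn.functional.pad` with constant mode): negative padding that shrinks a dimension exactly to zero should now return an empty tensor rather than raise:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 4)
y = F.pad(x, (-2, -2), mode="constant")  # last dim: 4 + (-2) + (-2) == 0
print(y.shape)  # torch.Size([1, 2, 0])
```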

In the PR I have added an edge case where the test now checks negative padding that reduces a dimension to zero. But the sample is only for the `constant` type of padding; I would like some feedback on whether it is necessary to add the same sample for the `reduce` type as well.

This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!

@malfet @manuelcandales @janeyx99 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
2025-09-19 20:57:05 +00:00
248156ed06 [Inductor] do loop reordering in a separate final round (#162355)
The previous LOAF-after-fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions:

a1f7639922/test/inductor/test_loop_ordering.py (L612-L641)

Move LOAF to a separate final round of fusion so that we are guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
2025-09-19 20:21:33 +00:00
e88460f453 [Inductor] don't call sympy_str when not needed (#162126)
I see torch.compile spend 2% of its time on sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering. Most of the time sympy_str is called when extracting read/write dependencies, but when we extract read/write deps, the result of sympy_str is just discarded (correct me if I'm wrong). To keep things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag to only call sympy_str when it's explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162101
2025-09-19 20:21:33 +00:00
466122b92c [inductor] avoid creating LoopBody twice (#162101)
Previously in merge_loops, we had to construct LoopBody twice to make sure we could use the same symbol prefix as before. This PR changes it to create LoopBody only once, by allowing the same symbol prefix to be reused for the new LoopBody.

It looks like it's OK to have duplicate symbols in a sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162101
Approved by: https://github.com/jansel, https://github.com/eellison
2025-09-19 20:21:33 +00:00
ba3c2c80ab SDP Backend function fix (#161169)
The issue cannot be reproduced using the original repro code provided in the issue description.

However, the underlying issue mentioned by the maintainer (missing functions in `builder.py` and `trace_rules.py`) was never addressed and can still be reproduced with this test case:

```python
import torch
from torch.nn.attention import _cur_sdpa_kernel_backends

@torch.compile(fullgraph=True)
def test_function_that_triggers_error():
    return _cur_sdpa_kernel_backends()

print("Calling torch.compile function...")
try:
    result = test_function_that_triggers_error()
    print(f"Success: {result}")
except Exception as e:
    print(f"ERROR: {e}")
    print(f"Error type: {type(e)}")
```

The original repro likely no longer triggers the issue due to code path changes in the SDPA implementation, while the direct call to `_cur_sdpa_kernel_backends()` exposes the underlying problem where certain torch._C functions returning non-Tensor values aren't properly handled by dynamo tracing.

I have implemented the changes by adding the missing functions to both `builder.py` and `trace_rules.py` to properly handle these cases during compilation.

@guilhermeleobas

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161169
Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi
2025-09-19 20:19:59 +00:00
7130b174e0 [SymmMem] Fix memory allocation hold-up (#162680)
Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.

Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.

Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
2025-09-19 20:19:47 +00:00
f8fb437197 [SymmMem] Barrier on team instead of world (#163298)
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.

The rest is just re-arrange of the tests to create a sub-group test class. (no substantial change)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
2025-09-19 20:19:47 +00:00
2a308c7dee Revert "Improve device info with new flops and bandwidth formula based on hardware libraries (#162245)"
This reverts commit 35d7b321597ed00245aad533a8fa6b7fdadd73ea.

Reverted https://github.com/pytorch/pytorch/pull/162245 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162245#issuecomment-3313669412))
2025-09-19 20:09:12 +00:00
4a160dae3c [CUDA] revert PR 130472 (#162950)
This change may also resolve https://github.com/pytorch/pytorch/issues/161789, though verification is still needed.

PR #130472 introduced the problem of freeing the same address without cleaning up the metadata; according to the discussion, it is reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
2025-09-19 19:50:44 +00:00
a273475b01 [BE] Introduce CONDA_ROOT_DIR (#163341)
Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341
Approved by: https://github.com/clee2000
ghstack dependencies: #163339
2025-09-19 19:45:32 +00:00
979e10f7d6 [Bugfix] Match eager stride semantics for cloned tensors with preserve_format in compile (#163017)
Fixes #161010 by making `clone_meta` match the semantics of strides for eager mode.

This is:
  * Case 1: Tensor is_non_overlapping_and_dense; in this case, stride should match input tensor stride
  * Case 2: Otherwise, stride should be contiguous computed from input tensor using `compute_elementwise_output_strides`
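
A small eager-mode illustration of Case 1 (a dense but non-contiguous tensor keeps its permuted strides under `preserve_format`), which `clone_meta` now matches under compile:

```python
import torch

x = torch.randn(2, 3, 4).permute(2, 0, 1)  # non-overlapping and dense
y = x.clone(memory_format=torch.preserve_format)
print(x.stride(), y.stride())  # strides match in eager mode
```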

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan

Co-authored-by: morrison-turnansky <mturnans@redhat.com>
2025-09-19 19:41:33 +00:00
bc7b17a36d Realize LazyVariableTracker before raising exception (#163350)
Improves error message reported on #163321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163350
Approved by: https://github.com/Skylion007, https://github.com/xmfan
2025-09-19 19:25:17 +00:00
03f34fd307 Add explicit typing to nn.Module.__init__() parameters (#157389)
Fixes #156740

Adds explicit `Any` typing to `*args` and `**kwargs` in `nn.Module.__init__()` to fix type checker errors in strict mode.
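
A minimal sketch of the annotation shape described above (illustrative, not the exact diff):

```python
from typing import Any

class Module:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__()
```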
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157389
Approved by: https://github.com/Skylion007, https://github.com/Raman-RH
2025-09-19 19:02:28 +00:00
52dd7a898c Move ROCM trunk wheel builds to 3.10 (#163339)
This code is delicious spaghetti: sometimes the python version is defined in a jinja template (see https://github.com/pytorch/pytorch/pull/162297), sometimes in a shell script (see https://github.com/pytorch/pytorch/pull/162877), but this time around it's in a python file (and there is another one called `generate_binary_build_matrix.py` that defines `FULL_PYTHON_VERSIONS`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163339
Approved by: https://github.com/clee2000
2025-09-19 18:52:00 +00:00
b8c5ec582f [CD] Simplify NVIDIA driver installation step (#163349)
Undo changes introduced in https://github.com/pytorch/pytorch/pull/160956 as driver has been updated to 580 for both fleets

Fixes https://github.com/pytorch/pytorch/issues/163342
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163349
Approved by: https://github.com/seemethere
2025-09-19 18:50:47 +00:00
a0d2d84846 Handling overflow for long int overflow for the product of kernel_height and kernel_width that overflows to be exactly 0 (#155989)

Fixes [#155981](https://github.com/pytorch/pytorch/issues/155981)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155989
Approved by: https://github.com/malfet
2025-09-19 18:15:01 +00:00
607469bdad Revert "[ROCm] Bump FBGEMM commit to avoid CK errors (#162590)"
This reverts commit c9b80c4d4b48deb1931e5f8641ab345d7cc7b639.

Reverted https://github.com/pytorch/pytorch/pull/162590 on behalf of https://github.com/malfet due to This breaks CUDA 13 builds ([comment](https://github.com/pytorch/pytorch/pull/162590#issuecomment-3313263772))
2025-09-19 18:13:00 +00:00
a3b68c7c57 Revert "Fix boxcox to return same result for same input in one batch (#162772)"
This reverts commit 49d30f9a234f0816a1ece278c8450d119e417714.

Reverted https://github.com/pytorch/pytorch/pull/162772 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162772#issuecomment-3313213011))
2025-09-19 17:58:29 +00:00
2984bfe3da [ez][CI] Run vllm workflow on vllm pin updates (#163353)
As in title

The auto pin update was merged without running vllm workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163353
Approved by: https://github.com/malfet, https://github.com/wdvr
2025-09-19 17:32:49 +00:00
3e663ce5da [Inductor][Triton][FP8] Add a Blackwell-specific scaled persistent + TMA template for GEMMs (#163147)
Summary:
X-link: https://github.com/meta-pytorch/tritonbench/pull/432

Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.

This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.

Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation.

In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.

Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.

Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```

Rollback Plan:

Differential Revision: D82597111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163147
Approved by: https://github.com/njriasan
2025-09-19 17:23:37 +00:00
4967ad8baa [Graph Partition] improve custom op output alias (#163227)
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage.

However, when there are mutating args, we don't see `del buf1` immediately.

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of the FallbackKernel op1) has inputs that alias its output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
2025-09-19 17:01:36 +00:00
e631d76002 [Flex] Changing how bwd configs are setup and updating default b200 config (#163318)
```Shell
Up to 4x perf boost

🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline):
shape: (5, 7)
┌───────────────┬────────────────┬───────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬───────────┐
│ attn_type     ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)         ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta │
│ ---           ┆ ---            ┆ ---                           ┆ ---               ┆ ---                         ┆ ---                             ┆ ---       │
│ str           ┆ str            ┆ str                           ┆ f64               ┆ f64                         ┆ f64                             ┆ f64       │
╞═══════════════╪════════════════╪═══════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪═══════════╡
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)  ┆ 267.502004        ┆ 250.728732                  ┆ 0.937297                        ┆ -6.270335 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 4, 8192, 128)   ┆ 248.510516        ┆ 235.210874                  ┆ 0.946483                        ┆ -5.351742 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, 16384, 128) ┆ 282.856295        ┆ 271.806926                  ┆ 0.960936                        ┆ -3.906354 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 64)   ┆ 282.212695        ┆ 280.519092                  ┆ 0.993999                        ┆ -0.600116 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 32768, 4, 32768, 128) ┆ 295.864073        ┆ 294.477894                  ┆ 0.995315                        ┆ -0.468519 │
└───────────────┴────────────────┴───────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴───────────┘

📊 Performance Summary:
============================================================
Baseline: base
Change:   better_configs
Geometric Mean Speedup (change over baseline): 1.9954x
Geometric Mean % Change: +99.54%
Median Speedup (change over baseline): 2.1590x
Speedup Std Dev: 0.9800
Valid Comparisons: 60/60

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163318
Approved by: https://github.com/BoyuanFeng
2025-09-19 16:57:21 +00:00
f8f230a801 [FP8][cuBLAS][H100] only test fp32 outputs for rowwise _scaled_mm on H100 (#162022)
Only cuBLAS supports float32 output, and cuBLAS only supports rowwise for SM 9.0.

Intended to land after #161305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162022
Approved by: https://github.com/ngimel
2025-09-19 15:18:13 +00:00
264e7f68a0 [ROCm] Fix mx fp8 and fp4 code after scaling refactor changes. (#163127)
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function in Blas.cpp along with the test_matmul_cuda changes triggered failures.
This patch corrects the is_blockwise_1x32_scaling function code.

2. Fixes the m, n, k dimensions for the ROCm mx case.

3. Modifies FP4E2M1FN_LARGEST_POW2 (the largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This resulted in a higher SQNR value for the mx fp4 test.

Testing result on gfx950 w/ ROCm7.0

PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 452 tests in 22.698s: OK, 111 passed.
This is the same as before (when PR #151360 was merged).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-19 12:29:52 +00:00
bee362c381 [ROCm][SymmMem] Fix skip condition for PLATFORM_SUPPORTS_SYMM_MEM (#163205)
It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. This changes the `if TEST_CUDA` check to an else condition to avoid running symmetric memory UTs on MI200. For other non-ROCm archs, it should return true, and the tests can be skipped using other skip decorators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163205
Approved by: https://github.com/ezyang

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-19 12:12:47 +00:00
33e6c5a93d [Dependabot] Update(deps): Bump transformers from 4.54.0 to 4.56.0 in /.ci/docker/ci_commit_pins (#162063)
* [Dependabot] Update(deps): Bump transformers

Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.54.0...v4.56.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Refresh results

Signed-off-by: Huy Do <huydhn@gmail.com>

* Another round of updates

Signed-off-by: Huy Do <huydhn@gmail.com>

* Another round of update

Signed-off-by: Huy Do <huydhn@gmail.com>

* Hopefully the last round of update

Signed-off-by: Huy Do <huydhn@gmail.com>

* Plz

Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-19 02:50:36 -07:00
ab5086a7ae [WOQ] Add XPU kernel for _weight_int8pack_mm (#160938)
Summary:
This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA.

Motivation:
Same as https://github.com/pytorch/pytorch/pull/159325.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160938
Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
2025-09-19 07:37:14 +00:00
0815091d86 [CP][BE] Cosmetic refactors for CP code base (#163115)
Summary:
This PR is extracted from https://github.com/pytorch/pytorch/pull/162542, to make the original PR
easier to review. This PR only contains cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163115
Approved by: https://github.com/tianyu-l
ghstack dependencies: #162539, #162540, #162541
2025-09-19 07:21:46 +00:00
32ad29b72a Revert "[dynamo][guards] Fail on an unknown framelocals to dict conversion (#162695)"
This reverts commit a8432bcaadd6dea52a94429dced1fb4550f2f560.

Reverted https://github.com/pytorch/pytorch/pull/162695 on behalf of https://github.com/anijain2305 due to internal failure at https://fburl.com/workplace/qiitdlp6 ([comment](https://github.com/pytorch/pytorch/pull/162695#issuecomment-3310757225))
2025-09-19 06:18:27 +00:00
1302637a23 Revert "[dynamo][guards] Do not construct entire framelocals dict for LAMBDA_GUARD (#162525)"
This reverts commit 5f630d28d7ff9fdd8bd6cdbe2438e5c821007845.

Reverted https://github.com/pytorch/pytorch/pull/162525 on behalf of https://github.com/anijain2305 due to internal tests fail ([comment](https://github.com/pytorch/pytorch/pull/162525#issuecomment-3310748980))
2025-09-19 06:15:28 +00:00
e0bcd58f57 [MTIA] Add MTIA dispatch for kernel foreach_maximum(Add D80022242 back) (#161571)
Summary: dispatch MTIA to function foreach_tensor_maximum_scalar_kernel_mtia_

Test Plan:
CI

Rollback Plan:

Differential Revision: D81086607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161571
Approved by: https://github.com/malfet
2025-09-19 05:57:09 +00:00
17081209e5 Revert "[CI] Move Windows build/tests to Python-3.10 (#162862)"
This reverts commit 2dcd153342d27b0981ff79eb2ccb8d8962e79c48.

Reverted https://github.com/pytorch/pytorch/pull/162862 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](https://github.com/pytorch/pytorch/pull/162862#issuecomment-3310606135))
2025-09-19 05:16:49 +00:00
578047838c Revert "[BE] Update Python min version to 3.10 (#162310)"
This reverts commit 3016616ccbba3dc9bb6a80eb4a81a846ddf49cc9.

Reverted https://github.com/pytorch/pytorch/pull/162310 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](https://github.com/pytorch/pytorch/pull/162862#issuecomment-3310606135))
2025-09-19 05:16:49 +00:00
ce5637be29 Fix invalid indices bug for max_unpool2d/3d on MPS (#163036)
Fixes #163035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163036
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-19 05:13:21 +00:00
c91f59b1a0 Fix performance regression when indexing by Numpy arrays (#163280)
Benchmark script:

```
import time
import numpy as np
import torch

def main() -> None:
    for i in range(10):
        block_indices = np.arange(16384, dtype=np.int32)
        block_indices = block_indices.reshape(-1).clip(max=255)
        batch_indices = np.zeros(16384, dtype=np.int64)
        virtual_batches = 32
        block_table = torch.randn(32, 256)
        start = time.perf_counter()
        block_table[batch_indices, block_indices].view(virtual_batches, -1)
        end = time.perf_counter()
        time_elapsed_ms = (end - start) * 1000
        print(f"Function execution time: {time_elapsed_ms:.1f}ms")

if __name__ == "__main__":
    main()
```

Before:

```
(a) [ezyang@devvm006.dkl0 ~/local/b/pytorch] python ben.py
Function execution time: 28.5ms
Function execution time: 12.9ms
Function execution time: 12.6ms
Function execution time: 13.5ms
Function execution time: 12.0ms
Function execution time: 13.4ms
Function execution time: 12.9ms
Function execution time: 12.9ms
Function execution time: 13.1ms
Function execution time: 13.0ms
```

After:

```
Function execution time: 17.8ms
Function execution time: 2.5ms
Function execution time: 1.3ms
Function execution time: 2.5ms
Function execution time: 2.3ms
Function execution time: 1.3ms
Function execution time: 2.4ms
Function execution time: 2.5ms
Function execution time: 2.5ms
Function execution time: 2.4ms
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163280
Approved by: https://github.com/SherlockNoMad, https://github.com/cyyever
2025-09-19 05:02:58 +00:00
3016616ccb [BE] Update Python min version to 3.10 (#162310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162862
2025-09-19 04:28:56 +00:00
46c647d1ee [vllm hash update] update the pinned vllm hash (#163304)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163304
Approved by: https://github.com/pytorchbot
2025-09-19 04:25:43 +00:00
76a841fd47 Port OpSchema.__post_init__ and OpSchema._recompute_comparison_key to C++ (#161695)
I initially didn't see good results porting this, but it was apparently because of pybind11 function calling overhead. (pybind11's object-handling primitives seem fine enough.) I'm interested in setting up nanobind, but this demonstrates it's not blocking.

Differential Revision: [D81530102](https://our.internmc.facebook.com/intern/diff/D81530102)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161695
Approved by: https://github.com/ezyang
2025-09-19 04:07:30 +00:00
bd964cbbfb [functionalize] Avoid one more call to custom get_device on FunctionalTensorWrapper (#163019)
Trying to reduce the number of `__torch_dispatch__` calls of FakeTensorMode in the AOT metadata collection pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163019
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #162987
2025-09-19 02:52:08 +00:00
5f25dbe7fd Rm pytorch deps platform args (#163086)
Summary: Platform args were a buck1 concept that we decided to port over to buck2 in order to make the migration easier. However, platform args existing in the repo block some buck modernization, like modefile-free efforts, so we're trying to get rid of the usage.

Test Plan:
CI

Rollback Plan:

Differential Revision: D82470032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163086
Approved by: https://github.com/malfet, https://github.com/8Keep
2025-09-19 02:13:03 +00:00
e134bb340a Update torch-xpu-ops commit pin (#163244)
Update the torch-xpu-ops commit to 24fab67b6e, which includes:

- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- support high priority stream for xccl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244
Approved by: https://github.com/EikanWang
2025-09-19 02:04:40 +00:00
6e680ae8de add more restriction to fusion with large accumulate reads (#163163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163163
Approved by: https://github.com/yf225
2025-09-19 01:20:30 +00:00
3c9e220f34 Refactor PrecompileContext to be considerably more debuggable (#162740)
Summary:
This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized.

Here's an example of what that JSON looks like:

```
{
  "artifacts": {
    "precompile_aot_autograd": [
      "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
    ],
    "precompile_dynamo": [
      {
        "backend_ids": [
          "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
        ],
        "fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
        "num_codes": "10",
        "python_version": "3.12.11+meta",
        "torch_version": "2.10.0a0+fb"
      }
    ]
  },
  "num_entries": 1
}
```

Test Plan:
Existing tests pass.

NanoGPT tlparse showing the new debug:

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.

Rollback Plan:

Differential Revision: D82232574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162740
Approved by: https://github.com/zhxchen17
2025-09-19 01:14:28 +00:00
c9b80c4d4b [ROCm] Bump FBGEMM commit to avoid CK errors (#162590)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162590
Approved by: https://github.com/jeffdaily
2025-09-19 01:14:20 +00:00
cd4303afc6 [CP][BE] Correct the names of some tests (#162541)
We are not doing ring attention but only using allgather to do CP for Flex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162541
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #162539, #162540
2025-09-19 00:38:04 +00:00
2dcd153342 [CI] Move Windows build/tests to Python-3.10 (#162862)
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments, but later on ignores them, namely:
 - `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
 - With this change, the `setup-win` action will create an environment called `py_tmp` with the specific python version + intel-openmp (which is a hard runtime requirement, but for some reason is not packaged into the wheel nor marked as such)
 - Introduced a `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
 - Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
2025-09-19 00:33:03 +00:00
04842ac2b0 Change DebugMode record_torchfunction default to False (#163293)
The result is too noisy with `record_torchfunction = True`. Change the default to False to keep it clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163293
Approved by: https://github.com/zpcore
2025-09-19 00:30:53 +00:00
17c16537e2 Deprecate Lite Interpreter (#163289)
Summary:
Point people lowering to lite interpreter to the existence of ExecuTorch.

Added the typing deprecation and a warnings deprecation
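
For reference, a generic sketch of the two mechanisms mentioned above (the decorator, message, and function name are illustrative assumptions, not the actual lite interpreter entry points):

```python
import warnings

from typing_extensions import deprecated


@deprecated("Lite Interpreter is deprecated; please migrate to ExecuTorch.")
def load_for_lite_interpreter_example(path: str) -> None:
    # Runtime deprecation: warn whenever the API is actually called.
    warnings.warn(
        "Lite Interpreter is deprecated; please migrate to ExecuTorch.",
        DeprecationWarning,
        stacklevel=2,
    )
```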

Test Plan: Try using it, see deprecation warning

Reviewed By: lucylq

Differential Revision: D82759566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163289
Approved by: https://github.com/larryliu0820
2025-09-18 23:56:21 +00:00
ddc56f6f92 [functional] Use the saved device on storage instead for device_custom (#162987)
Trying to reduce the number of __torch_dispatch__ calls of FakeTensorMode in the AOT metadata collection pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162987
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
2025-09-18 23:43:20 +00:00
096d35c44c [CP] Remove the need of recording cp_dim in the global var (#162540)
This information can be obtained during the dispatching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162540
Approved by: https://github.com/ezyang, https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #162539
2025-09-18 23:40:48 +00:00
4c007073e6 [dynamic shapes] DynamicInts prototype (#162194)
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.

Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be cast to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3))  # same as f(3)
```

Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.

Differential Revision: D81698719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162194
Approved by: https://github.com/bobrenjc93
2025-09-18 23:26:28 +00:00
f4eca0e3b3 Try updating ET pin in PT/PT (#159664)
Looking into resolving this: https://github.com/pytorch/pytorch/issues/159599

Test Plan: Wait for executorch CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159664
Approved by: https://github.com/malfet
2025-09-18 21:55:16 +00:00
ed3438ff13 Turn on capture_dynamic_output_shape_ops when fullgraph=True (#163123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163123
Approved by: https://github.com/laithsakka
ghstack dependencies: #163121
2025-09-18 21:24:15 +00:00
7dcb568c8f Turn on capture_scalar_outputs when fullgraph=True (#163121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163121
Approved by: https://github.com/laithsakka
2025-09-18 21:24:15 +00:00
bb7c9a2d41 [DTensor] Fix DTensor.mean with uneven sharding (#163241)
Fixes #162692

When the input is unevenly sharded, redistribute it as Replicate.
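
A minimal repro sketch of the case this targets (my own illustration, assumed to be launched with `torchrun --nproc-per-node 2`; the fix makes the result match plain `x.mean()`):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cpu", (2,))          # gloo process group set up by torchrun
x = torch.arange(6.0).reshape(3, 2)           # 3 rows over 2 ranks -> uneven shards
dx = distribute_tensor(x, mesh, [Shard(0)])
print(dx.mean().full_tensor())                # should equal x.mean() with this fix
```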

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163241
Approved by: https://github.com/dcci
2025-09-18 19:53:51 +00:00
159c2140f7 test: ensure editable cached wrapper is respected (#160943)
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160943
Approved by: https://github.com/xmfan, https://github.com/albanD
2025-09-18 19:24:51 +00:00
62a746f62c [ROCm] update ci_expected_accuracy for dynamo benchmarks (#163256)
Some tests that were already failing changed status to skipped.  Some model entries were missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163256
Approved by: https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 19:05:19 +00:00
8627454c84 Add local file path to inductor_output_code trace metadata (#160920)
## Summary
- include local file path in `inductor_output_code` structured trace metadata
- adjust structured trace tests for new `file_path` field

## Testing
- `python test/dynamo/test_structured_trace.py StructuredTraceTest.test_compile_id_serialization_deserialization`
- `lintrunner -a torch/_inductor/codecache.py torch/_inductor/graph.py test/dynamo/test_structured_trace.py` *(fails: MYPY failure)*

------
https://chatgpt.com/codex/tasks/task_e_68a2b02b54ec8323ae820120605a9f1c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160920
Approved by: https://github.com/oulgen
2025-09-18 18:39:46 +00:00
93964ed6ab [unit test] correct wrong input shape in test_flop_fx (#163148)
The input tensor shape does not match the weight tensor shape, which was detected by the validation logic implemented in my other PR (https://github.com/pytorch/pytorch/pull/160408). The input tensor should have a shape of (2, 2, 3), since dimension 1 of the input (representing input channels) must match dimension 0 of the weight tensor (representing input channels). Ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html
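
For reference, a small sketch of the channel constraint described above (the weight shape here is illustrative, not the test's exact tensor):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 2, 3)       # (batch, in_channels, length)
weight = torch.randn(2, 4, 5)  # (in_channels, out_channels/groups, kernel_size)
out = F.conv_transpose1d(x, weight)
print(out.shape)               # torch.Size([2, 4, 7]); x dim 1 must equal weight dim 0
```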

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163148
Approved by: https://github.com/eellison
2025-09-18 18:38:01 +00:00
80f8be9840 [SymmMem] Fix put_signal + wait_until hang (#163194)
The test used a wrong ptr to refer to remote address:
```
            dst_ptr = out_hdl.buffer_ptrs[peer]
            src_ptr = inp_hdl.buffer_ptrs[rank]
            sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept a local address as input and perform the translation internally. Without the correct signal address, the peer would keep waiting, and thus hang.
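With the fix, all three lookups use the local rank (sketch of the corrected indexing, following the field names quoted above):
```
            dst_ptr = out_hdl.buffer_ptrs[rank]
            src_ptr = inp_hdl.buffer_ptrs[rank]
            sig_ptr = out_hdl.signal_pad_ptrs[rank]
```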

Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
2025-09-18 18:18:58 +00:00
e36a6fcf0f Massive hack to make autograd shut up about threaded PG mutations (#163238)
See the Note for explanation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163238
Approved by: https://github.com/albanD
2025-09-18 18:12:57 +00:00
23af32a078 [WOQ] Integrate CUDA support for concat linear int8pack_mm woq optimization pattern (#161848)
Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:

- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Differential Revision: D80884518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161848
Approved by: https://github.com/jerryzh168
2025-09-18 18:08:07 +00:00
a81a2e54ed [submodule] CUTLASS upgrade to 4.2.0 and change cutlass to cutlass_cppgen (#163092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163092
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-09-18 18:03:51 +00:00
4b7aed89d8 Revert "[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)"
This reverts commit 627482a7b7780752c0e7aea034a2eb2db5899fcc.

Reverted https://github.com/pytorch/pytorch/pull/162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](https://github.com/pytorch/pytorch/pull/162942#issuecomment-3308784448))
2025-09-18 17:49:16 +00:00
1aeac304b8 Move prioritized text linker optimization code from setup.py to cmake (#160078)
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems)

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note:

Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
2025-09-18 17:09:48 +00:00
56893ca1f6 Don't register wrong overload to prim decomp (#163138)
These decompositions take precedence over CIA decomps in fake tensor prop; as a result, we would hit this implementation for all `where` overloads, which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this didn't matter because in post-dispatch IR aten.where would already have decomposed, but when a user tries to preserve aten.where this issue surfaces because fake tensor starts seeing aten.where.

Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-18 17:01:19 +00:00
af8c232b75 [CI] reuse old whl: fix metadata file not getting version replaced (#163214)
In the .dist-info/METADATA file, the version was not being written with the new sha.

On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked matches both files and directories on py3.9 and py3.13.

There's probably also a bunch of mismatches in RECORD, but that's a problem for later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163214
Approved by: https://github.com/huydhn
2025-09-18 16:08:29 +00:00
4908fb53c3 [testing] Add test owner labels for some ao sparse tests (#163203)
I am trying to give some test files better owner labels than `module: unknown`.  I am not sure about them, but they seem pretty reasonable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163203
Approved by: https://github.com/jcaip
2025-09-18 16:08:13 +00:00
3c8b90542c support unbacked softmax / logsoftmax (#162216)
### DDE

```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*u0, 0) (unhinted: Eq(3*u0, 0)).  (Size-like symbols: u0)

Caused by: (_decomp/decompositions.py:1185 in _softmax)
```

```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, 0) (unhinted: Eq(u0, 0)).  (Size-like symbols: u0)

Caused by: logsoft = torch.nn.functional.log_softmax(nz, dim=0)  # test/inductor/test_unbacked_symints.py:573 in fn (_decomp/decompositions.py:1212 in _log_softmax)
```

```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(u0, 0) (unhinted: Ne(u0, 0)).  (Size-like symbols: u0)

Caused by: (_refs/__init__.py:2218 in _reduction)
```

### Cannot convert symbols to int
```
  File "torch/_inductor/lowering.py", line 7160, in prepare_softmax_online
    and V.graph.sizevars.size_hint(rnumel) >= config.unroll_reductions_threshold
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "orch/_inductor/sizevars.py", line 591, in size_hint
    return int(out)
           ^^^^^^^^
  File "sympy/core/expr.py", line 342, in __int__
    raise TypeError("Cannot convert symbols to int")
```
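
For context, a minimal sketch (my own, not the PR's test) of the kind of data-dependent log_softmax that used to hit the guards above:

```python
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def fn(x):
    nz = torch.nonzero(x).float()                      # unbacked size u0
    return torch.nn.functional.log_softmax(nz, dim=0)  # previously raised a DDE

fn(torch.randn(8))
```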

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162216
Approved by: https://github.com/laithsakka, https://github.com/eellison
2025-09-18 15:43:20 +00:00
1f21f8544c fixing graph break for namedtuple._replace (#160139)
Fixes #158772
`_replace` now works without a graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160139
Approved by: https://github.com/mlazos
2025-09-18 14:32:36 +00:00
1330c638be Update Microsoft C++ Redistributable to the latest version (#161430)
Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430
Approved by: https://github.com/malfet
2025-09-18 14:22:03 +00:00
8bc4a467a7 [ROCm] test_aot_inductor: Enable fp8 tests. (#163050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163050
Approved by: https://github.com/jeffdaily
2025-09-18 14:05:21 +00:00
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
14f8d86136 Reland #161649, vectorize stores in cat for all dtypes (#162440)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162440
Approved by: https://github.com/Skylion007
2025-09-18 13:50:44 +00:00
c43ccfbc2d [BE] Remove bottleneck (#163210)
Some cleanup related to this RFC: https://github.com/pytorch/pytorch/issues/68742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163210
Approved by: https://github.com/ezyang
2025-09-18 12:08:13 +00:00
cfb8aec1a4 [FSDP2] idempotent reset_sharded_param: no-op if _local_tensor is already padded (#163130)
resolves https://github.com/pytorch/torchtitan/issues/1136

torchtitan uses a cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are already padded

```
# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```

This PR makes `reset_sharded_param` idempotent by checking the storage data ptr and returning early.

unit test
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163130
Approved by: https://github.com/tianyu-l
2025-09-18 09:20:37 +00:00
98ce93db0b [DTensor] Add guide for what to do about mixed torch.Tensor and DTensor operations (#162651)
Also updates the error message to point to the guide.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162651
Approved by: https://github.com/ezyang
ghstack dependencies: #162117, #162307
2025-09-18 06:41:02 +00:00
627482a7b7 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
2025-09-18 06:40:07 +00:00
c5e7bb08b0 [inductor] pdl inductor option (disabled by default) (#160928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928
Approved by: https://github.com/eellison
2025-09-18 06:35:28 +00:00
6f9b4ccf8f Fix SEMI_STRUCTURED_SUPPORTED_BACKENDS selection on CUDA and ROCm (#163223)
It should work with the current CUDA/ROCm device_capability enumeration anyway, but it will help to avoid unexpected triggering in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163223
Approved by: https://github.com/jeffdaily
2025-09-18 06:29:29 +00:00
708dc6e3cd [CP][BE] Remove _AttentionContextParallel (#162539)
This is not an API we want to support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162539
Approved by: https://github.com/ezyang, https://github.com/tianyu-l
2025-09-18 06:20:18 +00:00
7803d2c244 [FSDP][Replicate] tests replicate synchronization after optimizer states (#162785)
**Summary:**  In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. Verify replicate correctly handles post-optimizer events.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
2025-09-18 04:47:09 +00:00
033b7d1e1a [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available (#163187)
Reland of #160532

Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode, users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call. This diff supports this.
I expect the following pattern to work
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163187
Approved by: https://github.com/angelayi
2025-09-18 04:46:26 +00:00
d734b26141 [vllm hash update] update the pinned vllm hash (#163218)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163218
Approved by: https://github.com/pytorchbot
2025-09-18 04:31:47 +00:00
a27c002186 [BE] [Triton] [Inductor] Add an assert for store_output val_shape to use a tuple (#162887)
Summary: Updates the remaining tests to all ensure val_shapes is always passed a tuple and not a list. Enables adding an assert consistent with the other function arguments.

Test Plan:
Depends on CI.

Differential Revision: D82383319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162887
Approved by: https://github.com/NikhilAPatel
2025-09-18 04:30:36 +00:00
0f462740a0 replace more // with FloorDiv in inductor code (#162969)
See https://github.com/pytorch/pytorch/pull/162869 for more context; the sympy div representation can make reasoning fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162969
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
2025-09-18 03:28:31 +00:00
d6aaf08344 [inductor][heuristics] add kernel template params (#162781)
# why

- enable a clear interface for kernel templates to declare all their
  instantiation parameters and any potential defaults
- simplify KernelTemplateChoice to just have a single params, and not kwargs and extra_kwargs

# what

- KernelTemplateParams interface
- placeholder implementation where we just pass through a dict

# testing

- existing ci tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162781
Approved by: https://github.com/jansel
2025-09-18 02:15:42 +00:00
13304401df Port 4 dynamo test files for the intel XPU (#160953)
# Description
Fixes #114850; we will port dynamo tests to Intel GPU.
We enable Intel GPU with the following methods and try our best to keep the original code style:

# Changes
1. Get device type from accelerator method.
2. Replace the requires cuda statement with requires_gpu.
3. Add HAS_XPU_AND_TRITON into the scope.
4. Add several wrapper methods in cuda module into the accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160953
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-09-18 01:54:45 +00:00
8e48d1ba25 Skip reuse PyTorch wheel when building vLLM (#163232)
This issue started surfacing in [trunk](b26d4c9a7a/1).  When building vLLM, uv doesn't like that we rename the CI wheel without changing its metadata to match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163232
Approved by: https://github.com/izaitsevfb
2025-09-18 01:42:32 +00:00
6189a5f731 [dynamo][ez] Initialize tracer_output to None by default. (#163169)
Summary:
In edge cases, tracer_output can be left unset if a double exception is raised, which causes the following issue:
```
UnboundLocalError: local variable 'tracer_output' referenced before assignment
```

Default-initialize this variable so that it's always present.
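
The gist of the fix, as a standalone illustration (not dynamo's actual code):

```python
def convert(fail: bool) -> None:
    tracer_output = None  # default-initialize up front
    try:
        if fail:
            raise RuntimeError("tracing failed")
        tracer_output = "graph"
    except RuntimeError:
        pass
    # Without the default initialization, this line raised UnboundLocalError
    # whenever the exception path was taken.
    print("tracer_output =", tracer_output)

convert(True)
```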

Test Plan:
CI

Differential Revision: D82652815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163169
Approved by: https://github.com/tugsbayasgalan
2025-09-18 01:30:23 +00:00
48a7e8cc70 [CPU][GEMM Template] Improve A16W8 performance (#162479)
**Summary**
Improve A16W8 performance by
1. supporting GQA concat linear
2. using smaller cache blocking size
3. improving code for dequantization of weight (reducing instructions and adding prefetch)

We saw > 5% E2E next token performance gain when running Llama3.1-8B-instruct.

**Test plan**
Already covered by UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162479
Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/jansel
2025-09-18 01:28:37 +00:00
f17e2ab1f9 [FSDP][Replicate] tests replicate with prefetching (#162658)
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
2025-09-18 01:05:16 +00:00
e14b290d1e [FSDP][Replicate] tests replicate module functionality when used multiple times in a forward pass (#162656)
**Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162656
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654
2025-09-18 01:02:08 +00:00
04ddea44fd Fix: ShapeEnv not propagated properly to inductor SizeVars (#162927)
Summary:
I am really skeptical about inductor sizevars creating an empty shape env when not provided with one; I think we should fail there if the graph has dynamic shapes and no shape env is provided.

However, I wonder if there are actually use cases that depend on the shape env not being there? Reasoning APIs depend on facts in the shape env and assume some stuff exists for specific symbols.

Test Plan:
Fixes the bug reported in https://www.internalfb.com/diff/D82337184 (creating a simple e2e unit test is not trivial).

Differential Revision: D82412384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162927
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
2025-09-18 00:56:22 +00:00
57a54a04b6 [SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152)
1. The dispatch signatures defined in the `core.extern_elementwise` call must match the C signatures of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMAs or hangs. When matched, most of the time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).

2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
2025-09-18 00:50:22 +00:00
6edfb3062c [FSDP][Replicate] tests replicate runs forward/backward for root and non-root module (#162654)
**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
2025-09-18 00:47:19 +00:00
72fedf0575 Move export_db to use new tracer, remove restriction on optional inputs (#162993)
Differential Revision: [D82478644](https://our.internmc.facebook.com/intern/diff/D82478644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162993
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162557, #162558, #162559, #162682, #162992
2025-09-18 00:43:32 +00:00
b26d4c9a7a Make dynamo preserving stack trace work with inlined nn modules (#162992)
Differential Revision: [D82478646](https://our.internmc.facebook.com/intern/diff/D82478646)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162992
Approved by: https://github.com/williamwen42
ghstack dependencies: #162557, #162558, #162559, #162682
2025-09-18 00:43:23 +00:00
bb25c60945 [FSDP][Replicate] tests replicate parity for single and multigroup (#162650)
**Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
2025-09-18 00:38:49 +00:00
fdcef1477c [pcache] Generalize testing + All Caches thread-safe (#163173)
Summary:
1. Generalized testing by auto-detecting Cache types and splitting testing by abstract base class
- Now checks that all Cache types are thread-safe
- Will fail tests if any new Cache is added and is untested (for example, any cache with non-str key or non-bytes value)
2. All Caches are thread-safe
- InMemoryCache was the only one not thread-safe, so added a lock for access
- Realized that to implement MultiCache we should just have this requirement.
* Also, OnDiskCache is now a functioning AsyncCache with a default base_dir using Python's tempfile.gettempdir, i.e. OnDiskCache is no longer an abstract cache class

Test Plan:
```
[nmacchioni@*** / ()]$ buck test fbcode//mode/opt caffe2/test/inductor:pcache
Tests finished: Pass 28. Fail 0. Fatal 0. Skip 0. Build failure 0
[nmacchioni@*** / ()|remote/fbcode/warm_gpu_od_stable...)]$
```

Differential Revision: D82660240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163173
Approved by: https://github.com/masnesral
2025-09-18 00:32:46 +00:00
e93706c2c8 [Intel GPU][pre_compile] Add XPU toolkit version and hardware info in compiled model check. (#162951)
Following #162438, this PR generalizes the original CUDA-only check and adds an XPU check.

Fixes #162939, Fixes #162938, Fixes #163032, Fixes #163045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162951
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-09-18 00:04:22 +00:00
26eefd5ae2 Fix windows path escape characters (#162761)
Fixes #135954
Torch Inductor Windows Path Escape Characters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162761
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-09-17 23:39:39 +00:00
28c42cc280 compile_kernel: Add DLPack test (#163166)
Note to self: I should probably start using ghstack

This is rebased on top of https://github.com/pytorch/pytorch/pull/163165 so you only need to review this commit 7387c1becf

This test doesn't add any new functionality it just ensures DLPack conversion is working well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163166
Approved by: https://github.com/janeyx99, https://github.com/albanD
2025-09-17 22:55:48 +00:00
0661ecdb38 add support for hint_override in mark_unbacked (#162652)
Very similar to https://github.com/pytorch/pytorch/pull/161007 except now for mark_unbacked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162652
Approved by: https://github.com/laithsakka
2025-09-17 22:29:54 +00:00
7a0f93344e [FSDP][Replicate] tests replicate casting module after init (#162636)
**Summary:**  In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. This test is important as it verifies we can cast a replicated module to a different type after initialization, an important feature for enabling mixed precision.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162636
Approved by: https://github.com/mori360
ghstack dependencies: #162631
2025-09-17 20:36:13 +00:00
63276edb7c [Inductor] support mixed dtype in the native_layer_norm_backward meta function (#159830)
Fixes #159829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159830
Approved by: https://github.com/albanD
2025-09-17 20:29:12 +00:00
dfda2dfd53 very small typo in fsdp2 comment (#163155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163155
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-09-17 20:19:41 +00:00
876824f174 Make inline tests to use new exporter and fix some issues around it (#162682)
inline_and_install_module export variant is our long term state so it is better to use the new tracer for this. It also uncovered bunch of minor bugs because with inline_and_install_module, the nn_module_stack generation is changed a bit.

Differential Revision: [D82478648](https://our.internmc.facebook.com/intern/diff/D82478648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162682
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162557, #162558, #162559
2025-09-17 20:09:28 +00:00
a89d5e97ec compile_kernel remove header_code arg (#163165)
We previously asked users to separate these because we didn't have any way of adding extern "C" declarations. Now we don't, and we don't need this confusing flag anymore.

This is BC-breaking but is fine for this API since it doesn't have major users yet. Please just put all your code in `kernel_source` moving forward.

## BC note
The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter.

Note this only affects torch.cuda._compile_kernel, which is a private API.

Example:

Before
```python
kernel = _compile_kernel(
    kernel_source="__global__ void my_kernel() { ... }",
    kernel_name="my_kernel",
    header_code="#define SCALE 2.0f\n__device__ float scale(float x) { return x * SCALE; }"
)
```

After
```python
kernel_source = """
#define SCALE 2.0f
__device__ float scale(float x) { return x * SCALE; }

__global__ void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
2025-09-17 19:47:32 +00:00
4660e38e5a write conv1d decomposition (#163080)
In Unified Runtime, we cannot have any fallback ops (for now). Not all conv1d ops can avoid fallbacks now, so we write a decomposition for it.

It's not registered in the default decomposition table, as currently only executorch/unified runtime needs it. But it might benefit inductor as well, because conv2d can generate triton kernels while there's no triton codegen for conv1d. I don't know if the conv2d triton kernel will have better perf compared to aten::conv1d, so it's not registered by default yet.

To register it, one just needs to do `import torch._decomp as decomp;decomp.register_decomposition(torch.ops.aten.conv1d.default, conv1d_to_conv2d)`
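
For intuition, a minimal sketch of the conv1d -> conv2d rewrite (my own illustration; the registered `conv1d_to_conv2d` decomposition may differ in details):

```python
import torch
import torch.nn.functional as F

def conv1d_via_conv2d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    x2 = x.unsqueeze(2)       # (N, C_in, 1, L)
    w2 = weight.unsqueeze(2)  # (C_out, C_in/groups, 1, K)
    out = F.conv2d(x2, w2, bias, (1, stride), (0, padding), (1, dilation), groups)
    return out.squeeze(2)     # (N, C_out, L_out)

x, w = torch.randn(2, 3, 16), torch.randn(4, 3, 5)
torch.testing.assert_close(conv1d_via_conv2d(x, w), F.conv1d(x, w))
```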

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163080
Approved by: https://github.com/angelayi
2025-09-17 19:22:38 +00:00
5236007806 [MPS] Add embedding_bag forward pass (#163012)
Part of #162270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163012
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-09-17 19:00:47 +00:00
167ad09be5 [optim] override SWALR.state_dict and load_state_dict (#163122)
Fixes #163105

Note that the new `SWALR.load_state_dict` is **not backwards compatible**:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
  """Load the scheduler's state.

  Args:
      state_dict (dict): scheduler state. Should be an object returned
          from a call to :meth:`state_dict`.
  """
  self.__dict__.update(state_dict)
  self._set_anneal_func(self._anneal_strategy)
```

If we'd like to maintain compatibility with old state_dicts (loaded with `weights_only=False`), we could use something along these lines:
```python
@override
def load_state_dict(self, state_dict: dict[str, Any]) -> None:
    """Load the scheduler's state.

    Args:
        state_dict (dict): scheduler state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    anneal_func = state_dict.pop("anneal_func", None)
    strategy = state_dict.get("_anneal_strategy")
    self.__dict__.update(state_dict)

    if anneal_func is not None:
        state_dict["anneal_func"] = anneal_func
        if strategy is None:
            if anneal_func == self._linear_anneal:
                strategy = "linear"
            elif anneal_func == self._cosine_anneal:
                strategy = "cos"

    if strategy is None:
        strategy = getattr(self, "_anneal_strategy", "cos")

    self._set_anneal_func(strategy)
```

But given the fact that loading an `SWALR` state_dict before this PR would have caused an error, this seems okay. A GitHub/Google search for `SWALR.load_state_dict` had no results. Happy to change if not, or add a warning just in case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163122
Approved by: https://github.com/janeyx99
2025-09-17 18:17:26 +00:00
bcbb45b746 remove tolerance override for dynamo test_mixed_device_dtype in SGD (#163088)
In reaction to https://github.com/pytorch/pytorch/issues/116202#issuecomment-3145929113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163088
Approved by: https://github.com/albanD
2025-09-17 18:17:23 +00:00
3cad2403cb [inductor][choices] pass through annotations from KTC to ChoiceCaller (#163117)
# why

- KTC might regenerate a choicecaller e.g. through FlexibleLayout
  optimization. This in turn would delete any annotations

# what

- provide an annotations dict inside KTC
- forward that dict towards the ChoiceCaller's annotations

- ChoiceCaller users e.g. in selectalgorithm now have access to the KTC
  and can register handlers to record/make decisions based on the KTC

# testing

n/a

Differential Revision: [D82587631](https://our.internmc.facebook.com/intern/diff/D82587631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163117
Approved by: https://github.com/nmacchioni
2025-09-17 18:06:50 +00:00
e63476b236 [MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)
Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass.

Test Plan:
CIs created in D81274963 and D81286593 pass.

Added some logs in [aten_mtia_ops.py](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676) to all the foreach_div ops. We can see that the correct MTIA kernels are being invoked in the tests.
https://www.internalfb.com/intern/testinfra/testrun/15481123829281588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162732
Approved by: https://github.com/danielhou0515
2025-09-17 17:44:03 +00:00
4f641aa1a2 capturing exit codes after sigterm/sigkill from torch elastic. (#160908)
Summary:
**Background**
Torch Elastic sends SIGKILL/SIGTERM signals if any process fails while others are still running. However, processes terminated by these signals do not generate termination logs, causing confusion.

**Solution**
Capture exit codes after SIGTERM signals to ensure complete and accurate termination logging.

Test Plan:
unit tests

https://www.internalfb.com/mlhub/pipelines/runs/mast/f773486907-TrainingApplication__13_D79584569?job_attempt=1&version=0&tab=summary&env=PRODUCTION

Differential Revision: D79584569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160908
Approved by: https://github.com/d4l3k
2025-09-17 17:41:35 +00:00
8dbac62edb [CI] Update NVIDIA driver to 580.82.07 (#163111)
To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with 6e08c9d08e

This fix was suggested in https://github.com/pytorch/pytorch/issues/162878#issuecomment-3288635745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163111
Approved by: https://github.com/huydhn
2025-09-17 17:37:06 +00:00
7a1e267d4a Fix set_grad_enabled HOP in strict mode with new tracer (#162559)
The previous graph seems wrong, probably because running the dynamo bytecode might be changing the grad state unintentionally.

Differential Revision: [D82478643](https://our.internmc.facebook.com/intern/diff/D82478643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162559
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
ghstack dependencies: #162557, #162558
2025-09-17 17:13:03 +00:00
2291199e9b [AOTInductor] Use CudaCachingAllocator for memory allocation (#162893)
Summary:
Use c10::CudaCachingAllocator for AOTInductor's initial constant buffer
allocation.

Test Plan:
Activate test under test/cpp/aoti_inference/test.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162893
Approved by: https://github.com/desertfire
2025-09-17 17:08:20 +00:00
0e9f9c3a61 Fix inconsistent test and add new tracer as config (#162558)
It is better to have the new tracer as a global config that can be manipulated easily. Also, I believe dynamo-like config infra is useful instead of relying on a custom way of patching stuff.

Differential Revision: [D82478649](https://our.internmc.facebook.com/intern/diff/D82478649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162558
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162557
2025-09-17 17:01:48 +00:00
0e9e3cf996 Don't skip register_dataclass unflatten in dynamo (#162557)
We changed how we are tracing; as a result, we need to trace into register_dataclass now.

Differential Revision: [D82478651](https://our.internmc.facebook.com/intern/diff/D82478651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162557
Approved by: https://github.com/zhxchen17
2025-09-17 16:53:02 +00:00
c5c9e20f11 [dtensor][compile] Disable proxy mode in sharding prop rules (#163126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163126
Approved by: https://github.com/bdhirsh
2025-09-17 16:49:35 +00:00
d1993c27ae [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This PR gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process.

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-17 16:40:55 +00:00
928ac57c2a Upgrades dlpack to v1.1 to include fp8/fp4 (#162195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162195
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/Skylion007, https://github.com/rgommers
2025-09-17 16:39:11 +00:00
f2206b1ed8 fix wait() missing in redistribute tensor (#162749)
We noticed that the wait() op is missing after the collective op call: https://github.com/pytorch/pytorch/pull/162665#discussion_r2338460562

The issue is that `_maybe_wrap_tensor` calls AsyncCollectiveTensor in 3ad3bfe11d/torch/distributed/_functional_collectives.py (L829). We need to check whether wait() is required after the collective op call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162749
Approved by: https://github.com/ezyang, https://github.com/SherlockNoMad, https://github.com/wconstab
2025-09-17 16:24:26 +00:00
4ca3f435fb Revert "[CI] Update NVIDIA driver to 580.82.07 (#163111)"
This reverts commit 16475a829f7fe3b1dc3c74573740df09ffaec650.

Reverted https://github.com/pytorch/pytorch/pull/163111 on behalf of https://github.com/malfet due to It started to fail now, but worked just fine in PR CI ([comment](https://github.com/pytorch/pytorch/pull/163111#issuecomment-3303707671))
2025-09-17 16:20:31 +00:00
79fd497423 Revert "[Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)"
This reverts commit f1eb99e2e4363f20eb5896433e1eb7f7500aadea.

Reverted https://github.com/pytorch/pytorch/pull/163016 on behalf of https://github.com/jeffdaily due to broke rocm CI, see export/test_export_opinfo.py::TestExportOnFakeCudaCUDA::test_fake_export_nonzero_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17787208381/job/50564369696) [HUD commit link](f1eb99e2e4) ([comment](https://github.com/pytorch/pytorch/pull/163016#issuecomment-3303707552))
2025-09-17 16:17:53 +00:00
9b7a8c4d05 [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104)
For https://github.com/pytorch/torchtitan/issues/1713

Also note that we will need to roll back the cuDNN frontend upgrade in 2.9 as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time: 1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104
Approved by: https://github.com/drisspg
2025-09-17 15:48:54 +00:00
16475a829f [CI] Update NVIDIA driver to 580.82.07 (#163111)
To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with 6e08c9d08e

This fix was suggested in https://github.com/pytorch/pytorch/issues/162878#issuecomment-3288635745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163111
Approved by: https://github.com/huydhn
2025-09-17 14:44:06 +00:00
6cfb080d84 [CD] Do not enable GenAI on Windows (#163116)
Follow up after https://github.com/pytorch/pytorch/pull/162209 as looks
like it causes some of the Windows builds to fail with
```
C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include\fbgemm_gpu/quantize/utils.h(19): error C3861: '__builtin_clz': identifier not found
```

May be fixes https://github.com/pytorch/pytorch/issues/162881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163116
Approved by: https://github.com/wdvr, https://github.com/danielvegamyhre
2025-09-17 14:09:10 +00:00
bc38c5baa1 [optim] prevent problematic tensor aliasing in lr_scheduler (#163098)
Prevents edge cases in SequentialLR and ReduceLROnPlateau which could corrupt learning rates or trigger recompilation.

Supersedes #162360
Fixes #162359
Fixes #163093

While putting #162360 together, I noticed the class of issue I was fixing (i.e. unintended aliasing in lr_schedulers when using Tensor lrs) appeared in several other places. @janeyx99 suggested I put together a follow-up PR.

There are several bugs resembling the one fixed in #162360. I added a helper to fix these:
```python
def _update_param_group_val(param_group: dict[str, Any], key: str, val: float | Tensor):
    """Set param_group[key] to val without aliasing or assignment when they're both tensors.
    Raises a KeyError if param_group[key] does not exist.
    """
    if isinstance(param_group[key], Tensor):
        param_group[key].fill_(_to_scalar(val))
    else:
        param_group[key] = val
```

And applied it to fix bugs in `SequentialLR.__init__` and `LRScheduler._update_lr`. I also added it to `CyclicLR.__init__` which was using an equivalent pattern, and `CosineAnnealingWarmRestarts.step` which *should* have had a similar issue:
```python
for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
    param_group["lr"] = lr
```

But did not, because `get_lr()` actually returns tensors when using a tensor lr (despite its `list[float]` return type annotation). Relying on this propagation seems fragile, so I conservatively added the method here as well. I'll be fixing the type annotations and several related issues in followup PRs built off of this one.
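
As a standalone illustration of the aliasing hazard the helper guards against (hypothetical tensors, not the schedulers' actual internals):

```python
import torch

optimizer_lr = torch.tensor(0.01)    # the optimizer's param_group["lr"]
scheduler_lr = torch.tensor(0.001)   # value a scheduler wants to set
param_group = {"lr": optimizer_lr}

# Plain assignment makes param_group["lr"] alias scheduler_lr, so later
# in-place updates to scheduler_lr silently change the group's lr too.
param_group["lr"] = scheduler_lr
scheduler_lr.fill_(42.0)
print(param_group["lr"])                  # tensor(42.)

# The fill_-based helper keeps the original tensor object and only copies
# the value, so no aliasing is introduced.
param_group["lr"] = optimizer_lr
param_group["lr"].fill_(0.001)
print(param_group["lr"] is optimizer_lr)  # True
```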
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163098
Approved by: https://github.com/janeyx99
2025-09-17 13:40:23 +00:00
607489f3d0 logging exit code for failures to ease debugging (#160907)
Summary:
**Problem**
Some processes are terminated by other processes using signals. These signal terminations often lack stack traces, causing confusion during debugging.

**Solution**
Log exit codes to simplify and improve debugging of process failures.

Test Plan:
unit tests

https://www.internalfb.com/mlhub/pipelines/runs/mast/f773486907-TrainingApplication__13_D79777290?version=0&env=PRODUCTION

Differential Revision: D79777290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160907
Approved by: https://github.com/d4l3k
2025-09-17 12:52:48 +00:00
89a6dbe73a Filter out local timer tests which are unimplemented in Python on AArch64 (#158342)
This stems from using a conda build of Python, which incorrectly detects this as unimplemented:
https://github.com/conda-forge/python-feedstock/issues/804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158342
Approved by: https://github.com/malfet
2025-09-17 11:31:57 +00:00
c6392fcc06 [2/N] Port 3 fsdp distributed test cases to Intel GPU (#160940)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This is the second PR for fsdp distributed test cases; the first is https://github.com/pytorch/pytorch/pull/160158.
We enable Intel GPU with the following methods and try our best to keep the original code style:
- Use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- Enabled XPU for some test path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160940
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-17 10:45:28 +00:00
c52c4052d8 [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (#161680)
Summary:
What: Enables CUDA support for int8_mm woq optimization pattern by:

- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Reviewed By: jerryzh168

Differential Revision: D80882442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161680
Approved by: https://github.com/jerryzh168
2025-09-17 10:24:13 +00:00
175299416b [mypy] add some import ignores to onnx (#163133)
These keep appearing when I run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163133
Approved by: https://github.com/justinchuby
ghstack dependencies: #161458, #162702
2025-09-17 09:32:38 +00:00
a97cefac15 [dtensor] do not mutate specs when doing sharding prop (#162702)
These specs are cached by reference, so by reusing and mutating them we're overwriting the cached specs of another op. I'm just fixing these two; there are more instances, and we'll need to do an audit separately.

This fixes a few opinfo tests, but side note that `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_multi_head_attention_forward_cpu_float32` fails for me locally even on the base commit, but it is not marked as xfail

NOTE: I am renaming `_wrap_output_spec_tensor_meta` so that external libraries will loudly fail. You should migrate to the functional `_create_output_spec_with_new_tensor_meta` or create your own mutation wrapper and take responsibility for the cache! This should be improved in https://github.com/pytorch/pytorch/issues/162731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162702
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161458
2025-09-17 09:32:38 +00:00
821458d97a [dynamo][hop] Introduce Local Map HOP (#161458)
Can't actually deploy it because of: https://github.com/pytorch/pytorch/issues/161456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161458
Approved by: https://github.com/ydwu4
2025-09-17 09:32:38 +00:00
c9f16f201a [inductor] Fix convolution autotune check when groups != 1 (#163094)
When generating the triton template for convolution, we check `V.graph.sizevars.statically_known_equals(in_chan * groups, x.get_size()[1])`. Note that this check should take the groups into account.

This check verifies, at compile time, that the total number of input channels expected by the convolution weights (in_chan * groups) exactly matches the number of channels in the input tensor (x.get_size()[1]).
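
A quick sanity check of that channel relationship (illustrative shapes, not the PR's test):

```python
import torch
import torch.nn.functional as F

groups, in_chan = 2, 3                        # in_chan = channels per group
x = torch.randn(4, in_chan * groups, 16, 16)  # input channels = in_chan * groups
w = torch.randn(8, in_chan, 3, 3)             # weight dim 1 is the per-group channel count
print(F.conv2d(x, w, groups=groups).shape)    # torch.Size([4, 8, 14, 14])
```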

This fix is good in general as it allows the conv triton template to be generated when `groups > 1`. It's also required for unified runtime to use AOTI as a backend delegate: unified runtime is libtorch-free, so we cannot use the ATen fallback for conv2d.

```
python test/inductor/test_select_algorithm.py -k test_convolution2_group
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163094
Approved by: https://github.com/SherlockNoMad
2025-09-17 09:09:32 +00:00
b229455ddd Update placement utils and weights to handle meta device (#162842)
Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device to avoid this error:
```
what():  Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```

2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with a real device from the xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far because non-TBE xl weights have not been thoroughly tested until now.

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what():  Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change #2, the run is successful.
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0"  --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162842
Approved by: https://github.com/henryoier
2025-09-17 08:12:32 +00:00
a5419743c6 Revert "remove unnecessary sync point in AveragedModel update (#158017)"
This reverts commit cb7f45fd34b890fa7665837573ebb25744889568.

Reverted https://github.com/pytorch/pytorch/pull/158017 on behalf of https://github.com/wdvr due to discussed with author - expecting this to break checkpointing ([comment](https://github.com/pytorch/pytorch/pull/158017#issuecomment-3301790645))
2025-09-17 08:02:02 +00:00
a63221a335 Fix TODO in make_tensor_for_subclass_helper (#162336)
The constructor does accept a DataPtr (had to fix the DataPtr variant not accepting a SymInt, though).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162336
Approved by: https://github.com/ezyang
ghstack dependencies: #162298
2025-09-17 06:46:34 +00:00
c9485f8ff3 [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods, trying our best to keep the original code style (a minimal sketch of the device-generic pattern follows the list):

- instantiate_device_type_tests()
- use `torch.accelerator.current_accelerator()` to determine the accelerator backend
- use `requires_accelerator_dist_backend` to allow both nccl and xccl tests
- enable XPU for some test paths
- change the hardcoded world_size according to device_count
- unify some common code under torch/testing/_internal for multiple backends, for example:
  adding xpu to Backend.backend_capability and dist.Backend.register_backend()
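
A rough sketch of the device-generic selection pattern described above (illustrative; assumes a recent PyTorch with `torch.accelerator`, and the backend mapping shown here is an assumption rather than the exact helper used in the tests):

```python
import torch

# Pick the accelerator backend generically instead of hardcoding CUDA/NCCL.
device_type = (
    torch.accelerator.current_accelerator().type
    if torch.accelerator.is_available()
    else "cpu"
)
backend = {"cuda": "nccl", "xpu": "xccl"}.get(device_type, "gloo")
world_size = torch.accelerator.device_count() if device_type != "cpu" else 2
print(f"device={device_type} backend={backend} world_size={world_size}")
```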

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-17 06:42:27 +00:00
71b272e4a3 [BE] Use init_device_mesh over DeviceMesh (#162960)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162960
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/dcci
2025-09-17 06:12:19 +00:00
39450e7b00 [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#162933)
Fixes #162937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162933
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-09-17 05:35:06 +00:00
f1eb99e2e4 [Reland] Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#163016)
Reland of #160532

Summary:

To support exporting a cuda model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call.
This diff supports that; I expect the following pattern to work:
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163016
Approved by: https://github.com/huydhn
2025-09-17 05:01:33 +00:00
bb635a11f8 [vllm hash update] update the pinned vllm hash (#163128)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163128
Approved by: https://github.com/pytorchbot
2025-09-17 04:26:07 +00:00
3bfa35d62e [AOTI-FX] Solve for undefined symbols in dynamic input shapes (#163044)
# Problem
When dynamic shapes are passed to AOTInductor, they usually have a very basic form like `(s0, 5, 27)`. In these cases it's straightforward to generate code defining the symbol `s0` as a specific dimension of the input tensor. However, AOTI can handle slightly more generic expressions than this, such as `(2 * s0, 5, 27)`. In these cases, we don't immediately know the value of `s0`, but we need to solve for it, since it may be referenced in other parts of the program such as kernel call arguments, launch grids, etc.

# Feature
This PR adds support for more generic dynamic input expressions in the FX backend, following the implementation already present in AOTI's C++ backend:
 1. Check if the expression contains *one* undefined symbol, as multiple variables would make the equation underdetermined. Let's call this `s0`. (We could potentially generalize this, but this PR focuses on cases AOTI can already handle.)
 2. Generate a new symbol for the relevant size or stride of the input tensor. Let's call this `size`. This is computed with FX nodes just as a normal symbol would be.
 3. Use sympy to solve for `s0` in terms of `size`. Let's call the resulting expression `solution` (a minimal sketch of this step follows the list).
 4. Since we know `s0` is an integer, `solution == floor(solution)`. Take the floor and then convert division to `FloorDiv`. This is required to trace through the expression, since the return value of regular division is not guaranteed to be an integer.
 5. Generate FX for the modified `solution`, which defines the value `s0`.
 6. Override the relevant method of `PythonWrapperCodegen` to a no-op, since the FX converter handles the above on its own.
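
A minimal sketch of steps 3-4 above, using sympy directly on the `2*s0 + 1` example shown below (illustrative; the FloorDiv conversion and FX generation of steps 4-5 are Inductor-specific and omitted here):

```python
import sympy
from sympy import floor

s0, size = sympy.symbols("s0 size", integer=True, positive=True)

# The input dim expression 2*s0 + 1 must equal the runtime size of that dimension.
solution = sympy.solve(sympy.Eq(2 * s0 + 1, size), s0)[0]   # (size - 1)/2
# s0 is an integer, so taking the floor is exact; Inductor then rewrites the
# division as FloorDiv so the expression stays traceable.
solution = floor(solution)
print(solution)   # floor(size/2 - 1/2)
```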

# Test plan
In addition to the existing dynamic shapes tests, this PR adds new test cases where the input shape contains a non-trivial expression. This dynamic input dimension is then multiplied by other dimensions to form the argument to a `reshape`.

Here's an example graph from one of the CI tests. In this case, the input expression was `2*x + 1`, and the solution is `x = (sym_size_int - 1) / 2`:
```
graph():
    %arg0_1 : [num_users=2] = placeholder[target=arg0_1]
    %sym_size_int : [num_users=1] = call_function[target=torch.ops.aten.sym_size.int](args = (%arg0_1, 0), kwargs = {})
    %sym_sum : [num_users=1] = call_function[target=torch.sym_sum](args = ([-1, %sym_size_int],), kwargs = {})
    %floordiv : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum, 2), kwargs = {})
    %mul : [num_users=2] = call_function[target=operator.mul](args = (8, %floordiv), kwargs = {})
    %sym_sum_1 : [num_users=2] = call_function[target=torch.sym_sum](args = ([4, %mul],), kwargs = {})
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([%sym_sum_1], [1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %sym_sum_2 : [num_users=1] = call_function[target=torch.sym_sum](args = ([35, %mul],), kwargs = {})
    %floordiv_1 : [num_users=1] = call_function[target=operator.floordiv](args = (%sym_sum_2, 32), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(%floordiv_1, 1, 1)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, out_ptr0: %buf0, xnumel: %sym_sum_1, XBLOCK: 32}})
    return buf0
```

The `sym_size_int` node returns the first dimension of the input tensor. Next, `floordiv` computes the input symbol in terms of the input size. Then, the launch grid is computed by `floordiv_1`, the kernel argument by `sym_sum_1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163044
Approved by: https://github.com/jansel
2025-09-17 04:12:03 +00:00
7a3791c5d0 Make torch.cuda.rng_set_state() and torch.cuda.rng_get_state() work during stream capture. (#162505)
Note that this works only in a limited case, where you *don't* change the seed, but change only the offset of the philox generator. This captures the main use case we're interested in: Rewinding the RNG to a previous state. This is done by torch.utils.checkpoint.checkpoint in particular.

Calls to increase() change only the offset, not the seed. Thus, we allow for "no-op" calls to set_seed where the new seed is the same as the old seed. If a user does happen to try to change the seed during stream capture, they will receive an error.

Fixes #162504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162505
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/eellison, https://github.com/eee4017, https://github.com/cyyever
2025-09-17 03:57:34 +00:00
e28983be76 Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008)
Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to make sure to guard for device specialization. This aten op didn't work with vmap because we didn't register an explicit decomp rule saying we should just skip the BatchedTensor and call it on the underlying tensor.

Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn
2025-09-17 03:49:41 +00:00
794b48c9f4 [PyTorch] Compile SVE's box-cox only when building targeting SVE (#163078)
Summary:
Internally, we are building PyTorch on the compat layer.
We need to avoid compiling SVE's box-cox, as SVE is not marked as a build target.

Rollback Plan:

Reviewed By: rraometa, YifanYuan3

Differential Revision: D82544412

Privacy Context Container: L1208939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163078
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-09-17 03:35:11 +00:00
65845d7291 Update Gloo submodule (#163112)
This makes PyTorch buildable with gcc-15, tested by running the build inside a `fedora:44` Docker container:
```
docker run --rm -it fedora:44 bash -c "yum install -y g++ python3-devel git; git clone https://github.com/pytorch/pytorch; cd pytorch; git checkout 8f710acce8332979c9a7bf97e72666dfd35c43e6; python3 -mpip install -r requirements.txt; python3 setup.py bdist_wheel"
```

Fixes https://github.com/pytorch/pytorch/issues/156595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163112
Approved by: https://github.com/huydhn
2025-09-17 03:04:09 +00:00
3009b6959a [FSDP][Replicate] tests replicate parameter registration (#162631)
**Summary**
Tests parameter state management after forward and backward passes for single and multiple replicate groups

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_forward
2. pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162631
Approved by: https://github.com/mori360
2025-09-17 02:46:30 +00:00
df4ebddbe0 DisableTorchFunction in debug_string (#163096)
debug_string() invokes some torch functions under the hood.
Use DisableTorchFunction() to avoid re-invoking __torch_function__ when calling debug_string() inside DebugMode().
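
A minimal sketch of the pattern (illustrative; the real DebugMode implementation differs):

```python
import torch

def build_debug_string(entries) -> str:
    # Formatting the recorded entries may itself call torch functions; disable
    # __torch_function__ so those calls are not re-intercepted by DebugMode.
    with torch._C.DisableTorchFunction():
        return "\n".join(str(e) for e in entries)
```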

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163096
Approved by: https://github.com/zpcore
2025-09-17 00:19:49 +00:00
e13cf68d03 Revert "[Triton] [Inductor] Restrict subprocess autotuning to just Triton (#162688)"
This reverts commit 082d3dd9d53a60deb022e203892f0c492cf2cce7.

Reverted https://github.com/pytorch/pytorch/pull/162688 on behalf of https://github.com/mlazos due to H100 tests didn't run internally for some reason, rerun with ciflow/h100 ([comment](https://github.com/pytorch/pytorch/pull/162688#issuecomment-3300634763))
2025-09-16 23:17:14 +00:00
814338826e Set the credential to upload vLLM nightly wheels on schedule and workflow_dispatch (#163018)
The build is ok, but uploading is failing at the moment https://github.com/pytorch/pytorch/actions/runs/17734972779/job/50416387786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163018
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-09-16 22:26:22 +00:00
c527292c43 [CI] Remove functorch doc build jobs (#163101)
As repo has been archived, there couldn't be any doc updates

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163101
Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/ZainRizvi
2025-09-16 22:25:59 +00:00
d4554bc284 Revert "Set the credential to upload vLLM nightly wheels on schedule and workflow_dispatch (#163018)"
This reverts commit 61be0f1c11ef59ff8cf39138b594efe3672816c0.

Reverted https://github.com/pytorch/pytorch/pull/163018 on behalf of https://github.com/huydhn due to Missed another update on the environment ([comment](https://github.com/pytorch/pytorch/pull/163018#issuecomment-3300444271))
2025-09-16 21:44:11 +00:00
f6ea41ead2 [CPU] Adding missing brackets in native MaxUnpool log (#163039)
As stated in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163039
Approved by: https://github.com/Skylion007
2025-09-16 21:28:15 +00:00
489860f3c2 Prefer_deferred_runtime_asserts should be propagated to new tracer (#162556)
Differential Revision: [D82478650](https://our.internmc.facebook.com/intern/diff/D82478650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162556
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #162487
2025-09-16 21:25:00 +00:00
9494b09549 bf16 support for fused_moving_avg_obs_fake_quant() op (#162620)
enabling bf16 support for `torch.fused_moving_avg_obs_fake_quant()` op on cuda

**testing**
`python test/quantization/pt2e/test_quantize_pt2e.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162620
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-09-16 21:22:44 +00:00
c230ac7300 [inductor][ez] add ChoiceCaller annotations (#162672)
# why

- enable ChoiceCaller generation to provide extra information that
  feedback_saver_fns (functions registered to run at the end of
  benchmarking) can use afterwards
- users that extend ChoiceCaller creation, e.g. by creating their own
  InductorChoices, can use this to shuttle information through

# what

- add an annotations dictionary to ChoiceCaller class

# testing

n/a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162672
Approved by: https://github.com/nmacchioni
2025-09-16 20:49:55 +00:00
77cafe105a enable sync batchnorm for HPU device (#163047)
Add HPU to list of supported devices for SyncBatchNorm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163047
Approved by: https://github.com/albanD
2025-09-16 20:45:38 +00:00
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2ce37f98e01a6469a15b69f15995c1f9.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
232dd65c15 [CuTe] Change the logic of pycute manipulation ops like coalesce, complement from co-lex to lex (#162690)
PyTorch tensor iteration (.view, contiguous, broadcasting) and NumPy array indexing all follow lexicographic (row-major) order. In lexicographic (lex) order on (i0, i1, …, i{k-1}), the leftmost index (which has the larger stride) changes slowest and the rightmost index changes fastest; usually the last dim is contiguous.

However, the original pycute is entirely based on co-lex. After porting their code into PyTorch (with some cosmetic changes), we now make it lex so that we can use it for use cases like device mesh internal bookkeeping and other things as well.

Changes included in this PR:
1. We changed all ported APIs, including prefix_product (stride inference; renamed to suffix_product), idx2crd, crd2idx, coalesce, composition, complement, right_inverse and left_inverse, to make sure they work the lex way (a minimal stride sketch follows this list).
2. Added more unit test cases for some of the APIs mentioned above, since the existing unit tests do not have full coverage.
3. One bug fix inside composition, which previously led to an infinite recursive call.
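
As a concrete reference point for the lex convention, a minimal row-major stride-inference sketch (illustrative; not the ported pycute code):

```python
def suffix_product(shape):
    # Row-major (lex) strides: the last dim is contiguous,
    # the leftmost dim has the largest stride.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

assert suffix_product([16, 6, 8]) == [48, 8, 1]
```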

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162690
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534, #162414
2025-09-16 19:53:45 +00:00
505ee42570 [Graph Partition] allow sharing default device context (#162873)
Entering a device context takes 30 us and exiting a device context takes 11 us. If all graph partitions and cudagraph-unsafe ops happen on the same device, we can share the device context.

## Trace

Use vLLM as an example. The first trace shows dynamo graph partition.
<img width="1338" height="453" alt="image" src="https://github.com/user-attachments/assets/b81815fd-cdcb-4024-846a-5b64164f8bac" />

The second trace shows inductor graph partition prior to this PR.
<img width="1331" height="270" alt="image" src="https://github.com/user-attachments/assets/8d98b127-2053-4eae-9a31-5491661f14d8" />

Comparing with fx graph partition, we can see inductor graph partition shows extra overhead from enter/exit device contexts (13+6 us -> 30+11 us), but smaller runtime overhead (13 us -> 7 us). This motivates the PR to share default device context.

The third trace shows Inductor graph partition after this PR. We observe that the extra overhead from entering/exiting device contexts has been eliminated, while the smaller runtime overhead is preserved.
<img width="1336" height="276" alt="image" src="https://github.com/user-attachments/assets/77be2237-34dd-4bac-ad9c-d9af3be36417" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162873
Approved by: https://github.com/shunting314
2025-09-16 19:36:42 +00:00
9babcae1ed fix f-string in errors.py (#163074)
Add missing "f" for formatted f-string in UnsupportedOperandError, change "op_name" (undefined) to "name" for more descriptive error message in case of an unsupported operand with an unrecognized namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163074
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-09-16 19:19:30 +00:00
69a5a5ac02 Add to inductor provenance tracking doc (#162975)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162975
Approved by: https://github.com/desertfire, https://github.com/mlazos
2025-09-16 19:09:06 +00:00
a4e74f416b Fix error message (#162487)
A more proper fix here would be to directly replace shape_env with the correct sources, but that is a bit involved since we would have to manually construct dynamo sources by hand (handling list/dict, etc.). It is quite easy when operating on a string, though, so I do this as a post-processing step for now.

Differential Revision: [D82478647](https://our.internmc.facebook.com/intern/diff/D82478647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162487
Approved by: https://github.com/zhxchen17
2025-09-16 19:06:30 +00:00
cb7f45fd34 remove unnecessary sync point in AveragedModel update (#158017)
Summary:
The test `bool(self.n_averaged == 0)` is a CPU/GPU synchronization point that is called for each update.
This test is only meant to know whether the AveragedModel copy has been initialized or not.
This diff introduces a CPU-based variable for that purpose.
When loading from checkpoint we also make sure the parameter is refreshed.

After this fix, each `update_parameter` call is reduced to 6ms from 333ms (98% reduction).
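
A minimal sketch of the pattern (assumed names; not the actual `torch.optim.swa_utils` code): the initialization check reads a plain Python flag instead of comparing a device tensor, so no CPU/GPU sync is triggered.

```python
import torch

class AveragedParamSketch:
    def __init__(self, param: torch.Tensor):
        self.avg = param.detach().clone()
        self.n_averaged = torch.zeros((), dtype=torch.long, device=param.device)
        self._initialized = False  # CPU-side flag; checking it never syncs with the GPU

    def update(self, param: torch.Tensor) -> None:
        if not self._initialized:             # replaces bool(self.n_averaged == 0)
            self.avg.copy_(param.detach())
            self._initialized = True
        else:
            # Weight is computed on-device; nothing reads a GPU value back to the CPU.
            weight = (self.n_averaged + 1).to(self.avg.dtype).reciprocal()
            self.avg.lerp_(param.detach(), weight)
        self.n_averaged += 1
```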

Test Plan:
contbuild & OSS CI
Test plan from GitHub:
CI

Rollback Plan:

Differential Revision: D78074709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158017
Approved by: https://github.com/janeyx99
2025-09-16 18:57:55 +00:00
5937861eba [TEST][CUDA] Use proper dtype in test_cuda_tensor_pow_scalar_tensor_cuda (#163070)
The test `test_binary_ufuncs.py::TestBinaryUfuncsCUDA::test_cuda_tensor_pow_scalar_tensor_cuda` fails with a mismatched `dtype`:
```Python
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```
This PR forces both arguments to use the same `dtype` to fix the test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163070
Approved by: https://github.com/eqy
2025-09-16 18:28:50 +00:00
bb3f3cc65e [precompile] Store traced file information with CompileArtifacts. (#162983)
Summary:
Add some metadata to CompileArtifacts, so that it contains the source code information about the original code while they are being traced.

For now, we will not provide a verification method to end user and instead we just provide which files are inlined. It's up to user to verify the content from these files are not changed (because it's optional for many users to validate source code changes anyway in aot precompile)

Test Plan:
buck run @mode/opt test/dynamo:test_dynamo -- -k test_file_change
buck run @mode/opt test/dynamo:test_dynamo -- -k test_aot_compile_source_info

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162983
Approved by: https://github.com/yushangdi
2025-09-16 18:27:48 +00:00
0819de412d Add a new API torch.xpu.can_device_access_peer for Intel GPU (#162705)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.can_device_access_peer`, which is used in vllm distributed [scenarios](2048c4e379/vllm/distributed/device_communicators/custom_all_reduce.py (L37))
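
An illustrative usage sketch (assumptions: the signature mirrors `torch.cuda.can_device_access_peer(device, peer_device)` and at least two XPU devices are visible):

```python
import torch

if torch.xpu.is_available() and torch.xpu.device_count() >= 2:
    # Assumed signature: can_device_access_peer(device, peer_device) -> bool
    if torch.xpu.can_device_access_peer(0, 1):
        print("XPU 0 can directly access peer memory on XPU 1")
```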

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162705
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2025-09-16 18:00:22 +00:00
6db37d7206 [MPS] zeros like, narrow and enable tests (#163011)
zeros like, narrow and enable tests for SparseMPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163011
Approved by: https://github.com/malfet
2025-09-16 17:48:04 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
6702f545d8 Restore environment after NcclUserBufferRegistrationTest (#163063)
This test sets "NCCL_ALGO=NVLS" in NcclUserBufferRegistrationTest which affects tests run in the same process such as `test_on_completion_hook_*` that fail with
> invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.26.2
> ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
> Last error:
> Error : no algorithm/protocol available for function Broadcast with datatype ncclInt8. NCCL_ALGO was set to NVLS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163063
Approved by: https://github.com/ezyang
2025-09-16 17:37:09 +00:00
ddf3124b05 [FSDP][Replicate] tests replicate input device movements (#162629)
**Summary:** This test verifies that the replicate function automatically moves forward pass inputs to the correct device.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162629
Approved by: https://github.com/mori360
2025-09-16 17:35:27 +00:00
457b27f92f [FSDP][Collectives] skipping reduce_scatter when world size is 1 (#162021)
**Summary:** In its current state, FSDP collectives use cuda synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where the group size is 1 and these synchronizations and ops will be issued needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed, and I also edited an affected test which used 1-way FSDP, verifying and changing its assert statements for CommDebugMode. I have also added a test command.
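
A sketch of the CommDebugMode-based check described above (illustrative: `model` is assumed to already be wrapped with replicate()/fully_shard() on a single-rank mesh; a plain module stands in here so the snippet runs on its own):

```python
import torch
import torch.nn as nn
from torch.distributed.tensor.debug import CommDebugMode

model = nn.Linear(8, 8)      # stand-in for a world_size == 1 FSDP/replicate module
inp = torch.randn(4, 8)

comm_mode = CommDebugMode()
with comm_mode:
    model(inp).sum().backward()

# With world_size == 1, foreach_reduce should issue no reduce_scatter at all.
counts = comm_mode.get_comm_counts()
assert counts.get(torch.ops._c10d_functional.reduce_scatter_tensor, 0) == 0
```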

**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162021
Approved by: https://github.com/mori360
2025-09-16 17:18:07 +00:00
b6a48ff69f [BE] Add Documentation for Device APIs (#162834)
Added documentation for torch.cuda APIs.
Fixed docstring for xpu and mtia is_bf16_supported API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162834
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-09-16 17:01:06 +00:00
9de22bc5da Inspect schedule IR comms (#162996)
Small change to util to allow us to see comms (e.g. `SEND`, `RECV`, etc.) in the schedule IR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162996
Approved by: https://github.com/fegin
2025-09-16 16:59:06 +00:00
f638854e1d [ROCm][SymmMem] re-enable UTs (#162811)
After the UT suite moved to `MultiProcContinuousTest`, the `skipIfRocm` decorator started failing rather than skipping UTs: we now spawn multiple processes before the skip decorator is taken into account, and the decorator raised an exception to exit the process, so the parent process treated the child process exiting as a crash rather than a skip. Additionally, in `MultiProcContinuousTest`, if one UT fails all subsequent ones are also skipped, which makes sense since there is one setup for the entire suite. However, this showed up as many failing/skipped UTs in the parity.

I added multiprocess versions of the skip decorators for ROCm, including `skip_if_rocm_arch_multiprocess` and `skip_if_rocm_ver_lessthan_multiprocess`. These are needed because the symmetric memory feature is only supported on MI300 onwards, so we need to skip other archs, and some UTs only work after ROCm 7.0.

Fixes #161249
Fixes #161187
Fixes #161078
Fixes #160989
Fixes #160881
Fixes #160768
Fixes #160716
Fixes #160665
Fixes #160621
Fixes #160549
Fixes #160506
Fixes #160445
Fixes #160347
Fixes #160203
Fixes #160177
Fixes #160049
Fixes #159921
Fixes #159764
Fixes #159643
Fixes #159499
Fixes #159397
Fixes #159396
Fixes #159347
Fixes #159067
Fixes #159066
Fixes #158916
Fixes #158760
Fixes #158759
Fixes #158422
Fixes #158138
Fixes #158136
Fixes #158135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162811
Approved by: https://github.com/jeffdaily
2025-09-16 15:35:39 +00:00
3ee071aa85 Allow aot_module_simplified to return a serializable output (#162527)
This PR refactors AOTAutograd slightly:

- It adds `simple_wraps` to various wrappers so that the reference to inner functions is stored in the output of AOTAutograd.
- It saves a `serialize()` method on the result of `aot_stage2`, in the event of an eager backward compile.

I discussed the lazy backward case with @bdhirsh, and we agreed that serialization in that case would probably use a different, more AOT API anyway, so we do not implement a serialize function for the lazy backward case. AOT precompile, at least initially, will always eagerly compile the backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162527
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162171
2025-09-16 15:22:05 +00:00
e7c3f802ff Revert "[dynamo][hop] Introduce Local Map HOP (#161458)"
This reverts commit 505458db803e1ffabac08a2fc150b566d3ea3a57.

Reverted https://github.com/pytorch/pytorch/pull/161458 on behalf of https://github.com/jeffdaily due to broke rocm tests ([comment](https://github.com/pytorch/pytorch/pull/161458#issuecomment-3299230458))
2025-09-16 15:14:36 +00:00
4db203f875 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 05ee8114f818a95745c812c3cd7aa8e784e61a9a.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/jeanschmidt due to seems to have introduced errors in linting see https://github.com/pytorch/pytorch/actions/runs/17750689989/job/50444910643 ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3298626136))
2025-09-16 12:52:57 +00:00
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
fa127d9b20 Fix LBFGS wolfe max iteration (#161488)
Fixes #91581 , based on #135026

## Test Result

```bash
pytest test/test_optim.py

.........
========================== 1473 passed, 242 skipped in 2412.49s (0:40:12) ===========================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161488
Approved by: https://github.com/albanD
2025-09-16 12:07:50 +00:00
6926710adf [ATen][CUDA] CUTLASS matmuls: add sm_103a flag (#162956)
This PR adds an `sm_103a` flag for GroupMM and RowwiseScaledMM. Contrary to just #161399, this simply adds the flag as the support for `sm_103a` matmuls is going to be added to CUTLASS v4.2 (see https://github.com/pytorch/pytorch/pull/161399#issuecomment-3252892937).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162956
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-09-16 10:29:55 +00:00
e3783a9575 Replace std::runtime_error with TORCH_CHECK (#159344)
Fixes part of #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159344
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-09-16 09:00:06 +00:00
9aca0ba027 [Inductor-FX] Support IndexPutFallback (#162863)
# Feature

This PR supports lowering `IndexPutFallback` through Inductor's FX converter. The approach is very similar to the one taken in https://github.com/pytorch/pytorch/pull/162686.

Compared to `ScatterFallback`, this required one additional change: the value of `self.op_overload` for `IndexPutFallback` was inaccurate. Previously, it used `aten.index_put`, which would result in unsound FX IR. The existing Python/C++ codegen use `aten.index_put_`, since the fallback mutates its input. This PR changes `self.op_overload` to match that.

# Test plan
Added a CI test lowering deterministic index put via the FX converter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162863
Approved by: https://github.com/angelayi
2025-09-16 08:52:47 +00:00
de143bf79b [C10d] Code clean for torch.distributed.init_process_group (#163038)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163038
Approved by: https://github.com/msaroufim
2025-09-16 08:15:25 +00:00
fb1e0321da [CUDA] fix shared memory race in reduce_kernel (#162995)
Reported by compute-sanitizer, otherwise it looks like `block_y_reduce` and `block_x_reduce` both use `shared_memory` for temporaries without synchronization between them

reproduces in e.g.,

`compute-sanitizer --tool=racecheck python test/test_matmul_cuda.py -k test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_128_rhs_block_1_cuda` (note that this test requires H100 to run unless only the non-emulated (cuBLAS impl.) is commented out)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162995
Approved by: https://github.com/msaroufim
2025-09-16 07:53:21 +00:00
f8d379d29e [DTensor] Introduce DebugMode (#162665)
Introduce a lightweight TorchDispatchMode for understanding the magic behind DTensor.

- Tracks redistribution, see `redistribute_input(input_idx, from_placement, to_placement)`
- Optionally tracks torch-level functions, via `__torch_function__`
- Optionally tracks FakeTensor operations, which was needed for propagating tensor meta as a step of sharding propagation
- Optionally tracks real tensor operations, including functional c10d op, and regular ops
- Calls are shown in the hierarchical structure!
- shorthand representation
  - dt: DTensor, ft: FakeTensor, t: Tensor
  - DM(2, 2) == DeviceMesh(shape = [2, 2])
  - [R, P, S(0)] == Placement[Replicate, Partial, Shard(0)]
  - f32[8,8] == float32 with shape[8, 8]

```
  debug_mode = DTensorDebugMode(record_faketensor=False, record_realtensor=True)
  with debug_mode:
      torch.mm(x_dtensor, y_dtensor)
  print(debug_mode.debug_string())
```
produces:
```
  torch.mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
    aten::mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)])
      redistribute_input(1, [S(0)], [R])
        _c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0)
        _c10d_functional::wait_tensor(t: f32[8, 32])
      aten::mm(t: f32[1, 8], t: f32[8, 32])
```

Another example, for torch.einsum
```
  torch.functional.einsum(bld,dnh->blnh, dt: f32[16, 6, 8][P, R], dt: f32[8, 4, 4][R, P])
    aten::unsqueeze(dt: f32[16, 6, 8][P, R], 3)
      aten::unsqueeze(t: f32[16, 6, 8], 3)
    aten::unsqueeze(dt: f32[16, 6, 8, 1][P, R], 4)
      aten::unsqueeze(t: f32[16, 6, 8, 1], 4)
    aten::permute(dt: f32[16, 6, 8, 1, 1][P, R], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 8, 1, 1], [0, 1, 3, 4, 2])
    aten::unsqueeze(dt: f32[8, 4, 4][R, P], 3)
      aten::unsqueeze(t: f32[8, 4, 4], 3)
    aten::unsqueeze(dt: f32[8, 4, 4, 1][R, P], 4)
      aten::unsqueeze(t: f32[8, 4, 4, 1], 4)
    aten::permute(dt: f32[8, 4, 4, 1, 1][R, P], [3, 4, 1, 2, 0])
      aten::permute(t: f32[8, 4, 4, 1, 1], [3, 4, 1, 2, 0])
    aten::permute(dt: f32[16, 6, 1, 1, 8][P, R], [0, 1, 4, 2, 3])
      aten::permute(t: f32[16, 6, 1, 1, 8], [0, 1, 4, 2, 3])
    aten::view(dt: f32[16, 6, 8, 1, 1][P, R], [1, 96, 8])
      aten::view(t: f32[16, 6, 8, 1, 1], [1, 96, 8])
    aten::permute(dt: f32[1, 1, 4, 4, 8][R, P], [4, 2, 3, 0, 1])
      aten::permute(t: f32[1, 1, 4, 4, 8], [4, 2, 3, 0, 1])
    aten::view(dt: f32[8, 4, 4, 1, 1][R, P], [1, 8, 16])
      aten::view(t: f32[8, 4, 4, 1, 1], [1, 8, 16])
    aten::bmm(dt: f32[1, 96, 8][P, R], dt: f32[1, 8, 16][R, P])
      redistribute_input(0, [P, R], [S(2), S(2)])
        aten::chunk(t: f32[1, 96, 8], 4, 2)
        aten::cat(['t: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[4, 96, 2], sum, 4, 2)
        aten::clone(t: f32[1, 96, 1])
      redistribute_input(1, [R, P], [S(1), S(1)])
        aten::chunk(t: f32[1, 8, 16], 4, 1)
        aten::clone(t: f32[1, 2, 16])
        aten::chunk(t: f32[1, 2, 16], 2, 1)
        aten::cat(['t: f32[1, 1, 16]', 't: f32[1, 1, 16]'])
        _c10d_functional::reduce_scatter_tensor(t: f32[2, 1, 16], sum, 2, 3)
        _c10d_functional::wait_tensor(t: f32[1, 1, 16])
      aten::bmm(t: f32[1, 96, 1], t: f32[1, 1, 16])
    aten::view(dt: f32[1, 96, 16][P, P], [16, 6, 1, 4, 4])
      aten::view(t: f32[1, 96, 16], [16, 6, 1, 4, 4])
    aten::permute(dt: f32[16, 6, 1, 4, 4][P, P], [0, 1, 3, 4, 2])
      aten::permute(t: f32[16, 6, 1, 4, 4], [0, 1, 3, 4, 2])
    aten::view(dt: f32[16, 6, 4, 4, 1][P, P], [16, 6, 4, 4])
      aten::view(t: f32[16, 6, 4, 4, 1], [16, 6, 4, 4])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162665
Approved by: https://github.com/ezyang
2025-09-16 07:30:05 +00:00
2459da4a64 [Caffe2] Add float batch box cox SVE128 implementation (#159778)
Introduce SVE128 SIMD batch box-cox computation.

We've seen about 65% throughput improvement.

Privacy Context Container: L1196524

This is a no-op from an OSS point of view, therefore it could be landed without tests (see the precedent set by https://github.com/pytorch/pytorch/pull/143627), but we should delete those at some point

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159778
Approved by: https://github.com/malfet
2025-09-16 07:25:04 +00:00
76fa381eef [mps] Take into account offset (#163021)
Fixes issue when running AOTI + MPS on voxtral model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163021
Approved by: https://github.com/malfet
2025-09-16 07:14:33 +00:00
29ea6254a0 [Bug] Add more boundary check for FractionalMaxPool3d (#161876)
This PR aims to fix the bug mentioned at [#161853](https://github.com/pytorch/pytorch/issues/161853#issuecomment-3240695121)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161876
Approved by: https://github.com/malfet
2025-09-16 06:59:02 +00:00
d2ecddf1a3 [PT2]: Overriding Tensor device by SubmodNameToDevice (#162144)
Summary:
A temporary solution, mainly for weights that are not moved to cuda in fake mode during publishing but run on cuda in serving.

This has some overlap with placement, but with 2 differences:
1. OverrideWeightsDevice only changes weights, not graph.
2. Placement only handles mapping between non-empty cuda indices, while here we override everything as submodNameToDevice is the ground truth.

Test Plan:
ICE replayer with custom package:
https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/model_infra_cost_estimation/?e[select_ESTIMATION_RUN_ID]=ICE_kevinqfu_1756939411c164_replayeripnext_00

Rollback Plan:

Differential Revision: D81284723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162144
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
2025-09-16 06:56:06 +00:00
1115749da7 Fix provenance tracking kernel name for fallback kernels (#162628)
Summary:
as title

`kernel.cpp_kernel_name` is something like `at::_ops::_scaled_dot_product_efficient_attention::call`, but the actual kernel name we want is `aoti_torch_cuda__scaled_dot_product_efficient_attention`

Differential Revision: D82142287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162628
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-09-16 06:56:00 +00:00
9786243b64 Update torch-xpu-ops commit pin (#162804)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@d8c3ee](d8c3eefc29), includes:

- Optimize adaptive average pool for channel-last memory format
- Add unregister wait_tensor
- Replace deprecated `[[intel::reqd_sub_group_size(SgSize)]]` with `[[sycl::reqd_sub_group_size(SIMD)]]` and remove unnecessary attributes
- Revert "Roll back to original usage of sycl::get_kernel_bundle"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162804
Approved by: https://github.com/EikanWang
2025-09-16 06:30:48 +00:00
9009c4da39 [functional] Avoid duplicate custom get_device call in constructor (#162889)
Trying to reduce the number of `__torch_dispatch__` calls of FakeTensorMode in the AOT metadata collection pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162889
Approved by: https://github.com/Lucaskabela, https://github.com/zou3519
2025-09-16 05:00:19 +00:00
b68a5115a4 Workaround for mtia double init issue in has_triton (#162974)
Summary:
This change adds a new environment variable (`TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION`) and a configuration in `torch._inductor.config` which can be set to `"1"` to allow a user to disable triton's device detection logic in [torch/utils/_triton.py:has_triton()](c9e57d7e9f/torch/utils/_triton.py (L128)). This function is used at import scope in several places, but it has a side effect of initializing the mtia device if it is available, which is causing some of our autotuning workflows to crash.

Worth noting that, when enabled, this configuration disables all device detection, not just mtia. This is because the logic in has_triton will initialize the mtia device as a side effect even when checking for a cuda or other device via the [get_interface_for_device()](c9e57d7e9f/torch/_dynamo/device_interface.py (L570)) function.
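
An illustrative way to opt in (assuming the flag is read when the inductor/triton helpers are imported; the exact `torch._inductor.config` attribute name is not shown here):

```python
import os

# Disable device detection in torch.utils._triton.has_triton() before torch imports run.
os.environ["TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION"] = "1"

import torch  # later has_triton() checks will no longer initialize the mtia device
```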

I've tagged it `topic: not user facing` since I don't anticipate any outside of meta users making use of this, however this is my first PR here, so please indicate if it should be handled differently.

Test Plan: This has been tested in the context of internal workflows.

Differential Revision: D82347853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162974
Approved by: https://github.com/xmfan
2025-09-16 04:46:11 +00:00
2c45628813 [Flight Recorder][WP] Added mismatch tail as an arg (#162991)
Summary:
Mismatch tail is currently a fixed variable, and in cases where there are more than 10 mismatches, FR gives up on producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff adds the mismatch tail to the parsed args to make it configurable.

Also, though the variable name is `mismatch_tail` (the last 10), it is used as `mismatch_head` (the first 10). Updated it to be `num_mismatch_to_print`.

Test Plan:
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err`
Confirm no error and output 20 mismatches.

Rollback Plan:

Differential Revision: D82335995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991
Approved by: https://github.com/fduwjj
2025-09-16 04:46:05 +00:00
6c0fd747af [vllm hash update] update the pinned vllm hash (#162928)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162928
Approved by: https://github.com/pytorchbot
2025-09-16 04:25:04 +00:00
d172d0231b [pcache] Cache and AsyncCache implementations (#162777)
Summary:
Implemented caching abstractions: `Cache` and `AsyncCache`.

`Cache` provides an abstraction for defining simple key -> value stores with get and put functionality. We propose using `Cache` for implementations with very low (microseconds) overhead, for example an in-memory cache.

`AsyncCache` provides an abstraction for defining simple key -> value stores with asynchronous get and put functionality. We propose using `AsyncCache` for implementations with medium to high (> millisecond) overhead, for example an on-disk cache.

We provide an initial extension of `Cache` in the form of `InMemoryCache`. `InMemoryCache` provides fast, in-memory caching that can be later used to memoize more expensive cache accesses. `InMemoryCache` also provides a custom constructor `InMemoryCache.from_env_var` that can be used to pre-populate the in-memory cache, which will be helpful for enabling determinism in the future.

We also provide extensions of `AsyncCache`. `OnDiskCache` subclasses `AsyncCache` and serves as a generic on-disk caching implementation with atomic, write-once guarantees. `OnDiskCache` is semi-generic, allowing subclassing to alter the output directory. `InductorOnDiskCache` subclasses `OnDiskCache` to create an Inductor-specific on-disk cache that outputs to Inductor's default caching directory.
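
A minimal sketch of the synchronous abstraction and its in-memory extension (illustrative; method details beyond `get`/`put` are assumptions, not the actual `torch._inductor` code):

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class Cache(ABC, Generic[K, V]):
    @abstractmethod
    def get(self, key: K) -> Optional[V]:
        """Return the cached value for key, or None if it is not cached."""

    @abstractmethod
    def put(self, key: K, value: V) -> bool:
        """Write-once insert: return False if key was already cached."""

class InMemoryCache(Cache[K, V]):
    def __init__(self) -> None:
        self._store: dict[K, V] = {}

    def get(self, key: K) -> Optional[V]:
        return self._store.get(key)

    def put(self, key: K, value: V) -> bool:
        if key in self._store:
            return False
        self._store[key] = value
        return True
```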

Test Plan:
`Cache` Tests:
1. Get -> Set -> Get
- Checks that `get(key)` returns `None` when `key` is not cached, and that after calling `put(key, value)` subsequent `get(key)` calls return `value`
2. Set -> Set
- Checks that with duplicated `set(key, value)` calls only the initial call is successful
3. From env var
- Checks that constructing an `InMemoryCache` from an environment variable works.

`AsyncCache` Tests:
1. Get -> Set -> Get
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
2. Set -> Set
- Same as `Cache` test, but checks both with synchronous and asynchronous execution
3. Set -> Set Concurrent
- Checks that of two concurrent `set(key, value)` operations, only one passes

```
cd ~/fbsource/fbcode && buck test mode/opt //caffe2/test/inductor:pcache
```

 {F1981926248}

Rollback Plan:

Differential Revision: D82269762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162777
Approved by: https://github.com/masnesral, https://github.com/aorenste
2025-09-16 04:07:12 +00:00
fdf68fa5d7 [ONNX] Fix rotary_embedding_23 implementation (#162865)
The implementation of rotary_embedding_23 when input is 3D was incorrect.

## Tested

Locally with

```py
import onnx_ir as ir
import onnx
import torch
import os
import numpy as np

base_path = "/home/justinchu/dev/onnx/onnx/backend/test/data/node"
test_names = [
    "test_rotary_embedding",
    "test_rotary_embedding_3d_input",
    "test_rotary_embedding_interleaved",
    "test_rotary_embedding_no_position_ids",
    "test_rotary_embedding_no_position_ids_interleaved",
    "test_rotary_embedding_no_position_ids_rotary_dim",
    "test_rotary_embedding_with_interleaved_rotary_dim",
    "test_rotary_embedding_with_rotary_dim",
]
model_paths = [os.path.join(base_path, name) for name in test_names]

for path in model_paths:
    print(f"Checking {path} for issues...")

    model = onnx.load(os.path.join(path, "model.onnx"))
    input0 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_0.pb"))
    ).numpy()
    input1 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_1.pb"))
    ).numpy()
    input2 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_2.pb"))
    ).numpy()
    if os.path.exists(os.path.join(path, "test_data_set_0", "input_3.pb")):
        input3 = ir.from_proto(
            onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_3.pb"))
        ).numpy()
    else:
        input3 = None
    output0 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "output_0.pb"))
    ).numpy()

    m = ir.from_proto(model)

    node = m.graph[-1]
    print(node)
    assert node.op_type == "RotaryEmbedding"

    interleaved = node.attributes.get_int("interleaved", 0)
    num_heads = node.attributes.get_int("num_heads", 0)
    rotary_embedding_dim = node.attributes.get_int("rotary_embedding_dim", 0)

    torch_out = torch.onnx.ops.rotary_embedding(
        torch.tensor(input0),
        torch.tensor(input1),
        torch.tensor(input2),
        position_ids=torch.tensor(input3) if input3 is not None else None,
        interleaved=bool(interleaved),
        num_heads=num_heads,
        rotary_embedding_dim=rotary_embedding_dim,
    )
    torch_out = torch_out.detach().cpu().numpy()
    np.testing.assert_allclose(torch_out, output0)
```

Fix https://github.com/pytorch/pytorch/issues/162848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162865
Approved by: https://github.com/kunal-vaishnavi, https://github.com/titaiwangms
2025-09-16 03:30:05 +00:00
7924b083c1 [CI] disable rerun of distributed tests (#163025)
#162978 identified an issue that distributed test failures were wrongly muted.
Per discussion with @malfet, one solution is to disable rerun of distributed tests in `run_test.py`.
The PR makes use of the `is_distributed_test` flag to identify those tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163025
Approved by: https://github.com/malfet
2025-09-16 03:11:50 +00:00
3ae31782cc [DCP] Add timeout for checkpoint background process join (#162828)
Summary:
Cleaning up the checkpoint background process can currently block the trainer thread indefinitely if the process is hanging (notably due to Gloo pg init timeout).

This diff adds a 5s grace period for normal termination and sends SIGTERM if unable to shut down in that period.
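
A minimal sketch of the termination pattern described above (illustrative; not the actual DCP async checkpointing code):

```python
import multiprocessing as mp

def join_with_grace_period(proc: mp.Process, grace_period_s: float = 5.0) -> None:
    proc.join(timeout=grace_period_s)  # give the background process a chance to exit cleanly
    if proc.is_alive():
        proc.terminate()               # escalate with SIGTERM if it is still hanging
        proc.join()
```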

Rollback Plan:

Differential Revision: D82268979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
2025-09-16 02:32:50 +00:00
c7fa16a05c [ROCm][CI] update _rocm-test.yml based on _linux-test.yml (#163014)
Fixes missing huggingface secrets and aligns _rocm-test.yml with other updates from _linux-test.yml that it was initially based on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163014
Approved by: https://github.com/huydhn
2025-09-16 02:14:38 +00:00
1aa41eccc2 [Inductor][CPP] Reuse the pre-existing kernel for the same kernels (#158404)
Reuse the pre-existing kernel to avoid defining redundant kernels.
Without this, Inductor CPP generates identical kernels. For example:
```
# Example
class Model(torch.nn.Module):
    def __init__(self, K, N):
        super().__init__()
        self.linear0 = torch.nn.Linear(K, N)
        self.linear1 = torch.nn.Linear(N, K)
        self.linear2 = torch.nn.Linear(K, N)

    def forward(self, input):
        out = self.linear0(input)
        out = self.linear1(out)
        out = self.linear2(out)
        return out
```

For the above example, linear2 is the same as linear0, and Inductor CPP generates two identical kernels: cpp_fused_addmm_0 and cpp_fused_addmm_2.

```
# Generated code:
...
cpp_fused_addmm_0 = async_compile.cpp_pybinding(['const at::BFloat16*', 'const at::BFloat16*', 'const at::BFloat16*', 'at::BFloat16*'], '''
...
extern "C"
void kernel(const at::BFloat16* X, const at::BFloat16* W, const at::BFloat16* inp, at::BFloat16* Y)
{

    constexpr int64_t num_threads = 32;
    constexpr int64_t N = 1024;
    constexpr int64_t K = 2048;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
...
cpp_fused_addmm_1 = async_compile.cpp_pybinding(['const at::BFloat16*', 'const at::BFloat16*', 'const at::BFloat16*', 'at::BFloat16*'], '''
...
extern "C"
void kernel(const at::BFloat16* X, const at::BFloat16* W, const at::BFloat16* inp, at::BFloat16* Y)
{

    constexpr int64_t num_threads = 32;
    constexpr int64_t N = 2048;
    constexpr int64_t K = 1024;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
...
cpp_fused_addmm_2 = async_compile.cpp_pybinding(['const at::BFloat16*', 'const at::BFloat16*', 'const at::BFloat16*', 'at::BFloat16*'], '''
extern "C"
void kernel(const at::BFloat16* X, const at::BFloat16* W, const at::BFloat16* inp, at::BFloat16* Y)
{

    constexpr int64_t num_threads = 32;
    constexpr int64_t N = 1024;
    constexpr int64_t K = 2048;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
...
    def call(self, args):
        arg6_1, = args
        args.clear()
        buf0 = empty_strided_cpu((1024, 1024), (1024, 1), torch.bfloat16)
        cpp_fused_addmm_0(arg6_1, constant6, _frozen_param6, buf0)
        del arg6_1
        buf1 = empty_strided_cpu((1024, 2048), (2048, 1), torch.bfloat16)
        cpp_fused_addmm_1(buf0, constant6_0, _frozen_param8, buf1)
        buf2 = buf0; del buf0  # reuse
        cpp_fused_addmm_2(buf1, constant6_1, _frozen_param10, buf2)
        return (buf2, )
```

After reusing the pre-existing kernel, Inductor CPP reuses cpp_fused_addmm_0 instead of emitting a separate cpp_fused_addmm_2:
```
cpp_fused_addmm_0 = async_compile.cpp_pybinding(['const at::BFloat16*', 'const at::BFloat16*', 'const at::BFloat16*', 'at::BFloat16*'], '''
...
extern "C"
void kernel(const at::BFloat16* X, const at::BFloat16* W, const at::BFloat16* inp, at::BFloat16* Y)
{

    constexpr int64_t num_threads = 32;
    constexpr int64_t N = 1024;
    constexpr int64_t K = 2048;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
...
cpp_fused_addmm_1 = async_compile.cpp_pybinding(['const at::BFloat16*', 'const at::BFloat16*', 'const at::BFloat16*', 'at::BFloat16*'], '''
...
extern "C"
void kernel(const at::BFloat16* X, const at::BFloat16* W, const at::BFloat16* inp, at::BFloat16* Y)
{

    constexpr int64_t num_threads = 32;
    constexpr int64_t N = 2048;
    constexpr int64_t K = 1024;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
...
    def call(self, args):
        arg6_1, = args
        args.clear()
        buf0 = empty_strided_cpu((1024, 1024), (1024, 1), torch.bfloat16)
        cpp_fused_addmm_0(arg6_1, constant6, _frozen_param6, buf0)
        del arg6_1
        buf1 = empty_strided_cpu((1024, 2048), (2048, 1), torch.bfloat16)
        cpp_fused_addmm_1(buf0, constant6_0, _frozen_param8, buf1)
        buf2 = buf0; del buf0  # reuse
        cpp_fused_addmm_0(buf1, constant6_1, _frozen_param10, buf2)
        return (buf2, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158404
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-09-16 01:54:24 +00:00
61be0f1c11 Set the credential to upload vLLM nightly wheels on schedule and workflow_dispatch (#163018)
The build is ok, but uploading is failing at the moment https://github.com/pytorch/pytorch/actions/runs/17734972779/job/50416387786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163018
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-09-16 01:46:59 +00:00
48dbd60df4 are_strides_like_channels_last_or_false (#162354)
Note this could change suggest_memory_format behaviour for unbacked symints.

We used to return True from are_strides_like_channels_last sometimes even when the result was undecided;
now, when it is not decided, we return False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162354
Approved by: https://github.com/aorenste
2025-09-16 00:49:05 +00:00
505458db80 [dynamo][hop] Introduce Local Map HOP (#161458)
Can't actually deploy it because of: https://github.com/pytorch/pytorch/issues/161456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161458
Approved by: https://github.com/ydwu4
2025-09-16 00:37:40 +00:00
05ee8114f8 [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This pr gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process

Gonna ask for review from @huydhn as there are some changes to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-16 00:37:09 +00:00
e900a274e5 Add CUDA_KERNEL_ASSERT_PRINTF, a more flexible CUDA_KERNEL_ASSERT_MSG (#160129)
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).

# Benchmarks

* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after  https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf
This is the version of the code that appears in this diff:

https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:
Running this minimal test case:

```
import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)

if __name__ == "__main__":
    main()
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).

fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```

Rollback Plan:

Reviewed By: mradmila

Differential Revision: D79310684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
2025-09-16 00:23:48 +00:00
d08cabe314 [BC Breaking] Remove flex + njt code paths (#161734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161734
Approved by: https://github.com/jbschlosser
2025-09-16 00:13:56 +00:00
dac6a4bf6c [CP] Fix the CP FlexAttention test (#162518)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162518
Approved by: https://github.com/XilunWu, https://github.com/drisspg
2025-09-16 00:12:26 +00:00
cfc539fe15 Improved error lr last epoch (#162368)
Fixes #160626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162368
Approved by: https://github.com/janeyx99
2025-09-15 23:33:14 +00:00
955e195c7d [Triton] [Inductor] Add a Blackwell specific Template for persistent matmul (#162916)
Summary:
This adds the persistent matmul with device-side TMA from the Triton tutorials as a template option for Blackwell. It uses newer Triton features such as automatic warp specialization and loop flattening which, while still containing flaws, can improve performance on Blackwell. This does not include the epilogue subtiling section, as that will be a follow-up PR.

This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a follow-up PR with better defaults soon.

Test Plan:
Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162916
Approved by: https://github.com/NikhilAPatel
2025-09-15 23:23:04 +00:00
c77726b1d7 [inductor] fix expand_shape when copy_shape is not a string (#162739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162739
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-09-15 23:22:07 +00:00
6b608dfe81 Add DISABLE_JUSTKNOBS to torch/_utils_internal.py and use it for dynamo _maybe_set_eval_frame (#162298)
If JustKnobs is disabled (as it always is in OSS), we can easily avoid an extra layer of Python function calls.
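
As a rough sketch of the idea, with illustrative names rather than the actual `torch/_utils_internal.py` code: when the knob system is known to be off at import time, bind the raw function directly so the hot path never pays for the wrapper.

```python
# Illustrative names only -- not the actual torch internals.
DISABLE_JUSTKNOBS_LIKE = True  # e.g. always True in OSS builds


def _flag_enabled(name: str) -> bool:
    # Stand-in for a per-call knob lookup; in fbcode this would hit JustKnobs.
    return False


def _raw_work(x: int) -> int:
    return x + 1


def _work_with_flag_check(x: int) -> int:
    # One extra Python frame per call just to consult the knob.
    if _flag_enabled("pytorch/dynamo:some_knob"):
        return x
    return _raw_work(x)


# Choose the callable once at import time so the hot path has no wrapper.
work = _raw_work if DISABLE_JUSTKNOBS_LIKE else _work_with_flag_check

print(work(41))  # 42
```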

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162298
Approved by: https://github.com/ezyang
2025-09-15 23:00:39 +00:00
090e6838a0 compile_kernel enable pch (#162972)
Enabling automatic precompiled headers per https://docs.nvidia.com/cuda/nvrtc/index.html#example-automatic-pch-cuda-12-8

I'm seeing large speedups in compilation times using PCH on average, but the max compilation time with PCH is worse, which is why I can't enable it by default. `load_inline()` also supports precompiled headers and does not enable them by default.

```
Without PCH: 270.58 ms average
With PCH:    115.27 ms average
```

```
Without PCH: Max: 337.99 ms
With PCH: Max: 383.82 ms
```

```python
(source) [marksaroufim@devgpu005]~/pytorch% python simple_pch_benchmark.py
============================================================
Simple PCH Compilation Benchmark
============================================================
Device: NVIDIA B200
Iterations: 100

Testing WITHOUT PCH:
------------------------------
Compiling kernel 100 times WITHOUT PCH...
  Completed 10/100 compilations
  Completed 20/100 compilations
  Completed 30/100 compilations
  Completed 40/100 compilations
  Completed 50/100 compilations
  Completed 60/100 compilations
  Completed 70/100 compilations
  Completed 80/100 compilations
  Completed 90/100 compilations
  Completed 100/100 compilations
Average: 270.58 ms (±6.99 ms)
Min: 264.09 ms
Max: 337.99 ms

Testing WITH PCH:
------------------------------
Compiling kernel 100 times WITH PCH...
  Completed 10/100 compilations
  Completed 20/100 compilations
  Completed 30/100 compilations
  Completed 40/100 compilations
  Completed 50/100 compilations
  Completed 60/100 compilations
  Completed 70/100 compilations
  Completed 80/100 compilations
  Completed 90/100 compilations
  Completed 100/100 compilations
Average: 115.27 ms (±27.32 ms)
Min: 110.65 ms
Max: 383.82 ms

```

## Benchmarking script

```python
#!/usr/bin/env python3
import argparse
import os
import sys
import time
from statistics import mean, stdev

import torch
from torch.cuda._utils import _nvrtc_compile

def benchmark_compilation(use_pch, iterations=100):
    """Compile the same kernel many times with or without PCH."""

    # CUB kernel that benefits from PCH
    kernel_source = """
    #include <cub/block/block_reduce.cuh>
    #include <cub/block/block_scan.cuh>
    #include <cub/warp/warp_reduce.cuh>

    extern "C"
    __global__ void test_kernel(const float* input, float* output, int n) {
        using BlockReduce = cub::BlockReduce<float, 256>;
        using BlockScan = cub::BlockScan<float, 256>;
        using WarpReduce = cub::WarpReduce<float>;

        __shared__ union {
            typename BlockReduce::TempStorage reduce;
            typename BlockScan::TempStorage scan;
            typename WarpReduce::TempStorage warp[8];
        } temp_storage;

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float val = (idx < n) ? input[idx] : 0.0f;

        float sum = BlockReduce(temp_storage.reduce).Sum(val);
        __syncthreads();

        float scan_result;
        BlockScan(temp_storage.scan).ExclusiveSum(val, scan_result);
        __syncthreads();

        int warp_id = threadIdx.x / 32;
        float warp_sum = WarpReduce(temp_storage.warp[warp_id]).Sum(val);

        if (threadIdx.x == 0) {
            output[blockIdx.x] = sum + scan_result + warp_sum;
        }
    }
    """

    device = torch.cuda.current_device()
    major, minor = torch.cuda.get_device_capability(device)
    compute_capability = f"{major}{minor}"

    compile_times = []

    print(
        f"Compiling kernel {iterations} times {'WITH' if use_pch else 'WITHOUT'} PCH..."
    )

    for i in range(iterations):
        # Use unique kernel name to avoid caching between iterations
        kernel_name = f"test_kernel_{i}"
        unique_source = kernel_source.replace("test_kernel", kernel_name)

        start = time.perf_counter()

        ptx, mangled_name = _nvrtc_compile(
            unique_source,
            kernel_name,
            compute_capability,
            header_code="",
            nvcc_options=["-std=c++17"],
            auto_pch=use_pch,
        )

        elapsed = time.perf_counter() - start
        compile_times.append(elapsed * 1000)  # Convert to ms

        # Progress indicator
        if (i + 1) % 10 == 0:
            print(f"  Completed {i + 1}/{iterations} compilations")

    return compile_times

def main():
    parser = argparse.ArgumentParser(description="Simple PCH Compilation Benchmark")
    parser.add_argument("--pch", action="store_true", help="Test with PCH only")
    parser.add_argument("--no-pch", action="store_true", help="Test without PCH only")
    parser.add_argument(
        "--iterations", type=int, default=100, help="Number of compilations"
    )
    args = parser.parse_args()

    print("=" * 60)
    print("Simple PCH Compilation Benchmark")
    print("=" * 60)
    print(f"Device: {torch.cuda.get_device_name()}")
    print(f"Iterations: {args.iterations}")
    print()

    # Determine what to test
    test_both = not args.pch and not args.no_pch

    results = {}

    # Test without PCH
    if args.no_pch or test_both:
        print("Testing WITHOUT PCH:")
        print("-" * 30)
        times_no_pch = benchmark_compilation(use_pch=False, iterations=args.iterations)

        if times_no_pch:
            avg_no_pch = mean(times_no_pch)
            std_no_pch = stdev(times_no_pch) if len(times_no_pch) > 1 else 0
            print(f"Average: {avg_no_pch:.2f} ms (±{std_no_pch:.2f} ms)")
            print(f"Min: {min(times_no_pch):.2f} ms")
            print(f"Max: {max(times_no_pch):.2f} ms")
            results["no_pch"] = avg_no_pch
        print()

    # Test with PCH
    if args.pch or test_both:
        print("Testing WITH PCH:")
        print("-" * 30)
        times_with_pch = benchmark_compilation(
            use_pch=True, iterations=args.iterations
        )

        if times_with_pch:
            avg_with_pch = mean(times_with_pch)
            std_with_pch = stdev(times_with_pch) if len(times_with_pch) > 1 else 0
            print(f"Average: {avg_with_pch:.2f} ms (±{std_with_pch:.2f} ms)")
            print(f"Min: {min(times_with_pch):.2f} ms")
            print(f"Max: {max(times_with_pch):.2f} ms")
            results["pch"] = avg_with_pch
        print()

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162972
Approved by: https://github.com/albanD, https://github.com/janeyx99
2025-09-15 22:55:39 +00:00
cf7873ea8b Placement: make is_shard/is_replicate/is_partial more straightforward (#162619)
We already have method dispatch based on actual type, so just provide appropriate base class and subclass method implementations. (This is not motivated by any particular performance profiling; it just seems more straightforward to me.)
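
A minimal sketch of that dispatch pattern, with illustrative classes rather than DTensor's actual `Placement` hierarchy: the base class provides the "no" answers and each subclass overrides only its own predicate, so call sites need no isinstance() checks.

```python
class Placement:
    def is_shard(self, dim=None) -> bool:
        return False

    def is_replicate(self) -> bool:
        return False

    def is_partial(self) -> bool:
        return False


class Shard(Placement):
    def __init__(self, dim: int):
        self.dim = dim

    def is_shard(self, dim=None) -> bool:
        # True for "sharded at all", or for a specific dim if one is given.
        return dim is None or dim == self.dim


class Replicate(Placement):
    def is_replicate(self) -> bool:
        return True


print(Shard(0).is_shard())       # True
print(Shard(0).is_shard(dim=1))  # False
print(Replicate().is_shard())    # False
```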

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162619
Approved by: https://github.com/ezyang, https://github.com/tianyu-l, https://github.com/zpcore
2025-09-15 22:54:06 +00:00
0def79fdd9 [ROCm] fix conv relu fusion (#162856)
Fixes #162816.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162856
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-15 22:49:32 +00:00
8590c3a66b [DTensor] Add _foreach_pow to sharding propagation list. (#162895)
Fixes #152696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162895
Approved by: https://github.com/ezyang
2025-09-15 21:14:06 +00:00
dae5beae8e [RecordFunction] Add Scope for Record Function Fast (#162661)
Differential Revision: D82164587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162661
Approved by: https://github.com/davidberard98
2025-09-15 21:01:47 +00:00
01c3c891c1 [ROCm] Enable test_fixed_striding (#162787)
Enable the distributed test test_fixed_striding on gfx architectures that support fp8.
Test command: python test/distributed/test_c10d_functional_native.py -k test_fixed_striding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162787
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-15 20:23:43 +00:00
1247dde1f2 [BE] Improve pytest summary display for OpInfo tests (#162961)
pytest summarizes test failures by printing a truncated first line of the text of the OUTERMOST wrapped exception.

Prior to this PR, it looked like this:

```
FAILED [0.0454s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db_H_cpu_float32 - Exception: Caused by sample input at index 0: SampleInput(input=Tensor[size=(12, 12), device="cpu", dtype=torch.float32], args=(), kwargs={}, ...
```

I argue this is not so useful.  If I have a lot of test failures, I look to the test summary to understand what /kind/ of errors I have, so I can assess which ones I should look at first.  In other words, this is better:

```
FAILED [0.1387s] test/distributed/tensor/test_dtensor_ops.py::TestLocalDTensorOpsCPU::test_dtensor_op_db__softmax_backward_data_cpu_float32 - Exception: Tensor-likes are not close!
```

Now I know specifically this is a numerics problem!

This PR does it by prepending the old exception text to the wrapped exception.  This is slightly redundant, as we are exception chaining, but it does the job.  Open to bikeshedding.
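
A hedged sketch of the idea, with hypothetical helper names rather than the actual OpInfo harness code: re-raise a wrapper whose message starts with the original error text, so pytest's truncated one-line summary shows the kind of failure instead of only the SampleInput description.

```python
def run_sample(sample_desc: str, check):
    try:
        check()
    except Exception as e:
        # Prepend the inner message so the first line stays informative.
        raise Exception(f"{e}\n\nCaused by sample input: {sample_desc}") from e


def failing_check():
    raise AssertionError("Tensor-likes are not close!")


try:
    run_sample("SampleInput(input=Tensor[size=(12, 12)], ...)", failing_check)
except Exception as e:
    # The first line of the outermost exception is now the useful part.
    print(str(e).splitlines()[0])  # Tensor-likes are not close!
```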

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162961
Approved by: https://github.com/malfet
2025-09-15 19:58:19 +00:00
de3a863cd8 AMD CPU CI - Add freezing + fix label trigger (#162176)
Added the following changes:

1. Added freezing by default for AMD CPU based CI (to follow the pattern introduced by https://github.com/pytorch/pytorch/pull/152298)
2. Fixed issue with label based CI triggers

Addresses code review comment in https://github.com/pytorch/pytorch/pull/161155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162176
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2025-09-15 19:29:35 +00:00
fa919feab6 Revert "[lint][CI] Don't checkout submodules for lintrunner-noclang (#162844)"
This reverts commit 6b231af23d63ee543a81c32952138090bebcf61d.

Reverted https://github.com/pytorch/pytorch/pull/162844 on behalf of https://github.com/wdvr due to seems to be needed after all - failing lint ([comment](https://github.com/pytorch/pytorch/pull/162844#issuecomment-3293465058))
2025-09-15 18:43:53 +00:00
8e05749d5c Fix integer overflow bug in triu/tril for large diagonal values (#153240)
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.

Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.]])
```

The new implementation guards against overflow and produces correct results for all valid input values.
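
For reference, a small sanity check of the corrected behaviour (assuming a build that contains this fix): any diagonal at or beyond the number of columns should zero the whole matrix, no matter how extreme the value.

```python
import torch

a = torch.ones(5, 5)

# With the fix, an extreme diagonal no longer overflows the masking arithmetic:
# every element lies below such a diagonal, so the result is all zeros.
out = torch.triu(a, diagonal=9223372036854775807)
print(out)

# Sanity check against a mathematically equivalent, non-extreme diagonal.
assert torch.equal(out, torch.triu(a, diagonal=5))
```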
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
2025-09-15 18:07:19 +00:00
b334a5a379 [ROCm][benchmark] Add HF LLM benchmark expected accuracy (#162965)
PR #156967 added HF LLM benchmarks but did not add the CI expected-accuracy files for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162965
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-15 18:04:39 +00:00
567 changed files with 17260 additions and 6898 deletions

View File

@ -31,8 +31,7 @@ pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
@ -46,6 +45,5 @@ else
export USE_NVIDIA_PYPI_LIBS=1
fi
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@ -317,7 +317,7 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
build_vars = ""
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars += "MAX_JOBS=5 "

View File

@ -1 +1 @@
56392aa978594cc155fa8af48cd949f5b5f1823a
e0dda9059d082537cee36be6c5e4fe3b18c880c0

View File

@ -1,2 +1,2 @@
transformers==4.54.0
transformers==4.56.0
soxr==0.5.0

View File

@ -42,22 +42,27 @@ install_pip_dependencies() {
# A workaround, ExecuTorch has moved to numpy 2.0 which is not compatible with the current
# numba and scipy version used in PyTorch CI
conda_run pip uninstall -y numba scipy
# Yaspin is needed for running CI test (get_benchmark_analysis_data.py)
pip_install yaspin==3.1.0
popd
}
setup_executorch() {
pushd executorch
export PYTHON_EXECUTABLE=python
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON -DEXECUTORCH_BUILD_TESTS=ON"
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}
clone_executorch
install_buck2
install_conda_dependencies
install_pip_dependencies
setup_executorch
if [ $# -eq 0 ]; then
clone_executorch
install_buck2
install_conda_dependencies
install_pip_dependencies
pushd executorch
setup_executorch
popd
else
"$@"
fi

View File

@ -93,8 +93,9 @@ librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
#Pinned versions:
#test that import:
mypy==1.16.0
mypy==1.16.0 ; platform_system != "Windows"
# Pin MyPy version because new errors are likely to appear with each release
# Skip on Windows as lots of type annotations are POSIX specific
#Description: linter
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py

View File

@ -41,7 +41,6 @@ def sample_vllm_test_library():
"pytest -v -s basic_correctness/test_cumem.py",
"pytest -v -s basic_correctness/test_basic_correctness.py",
"pytest -v -s basic_correctness/test_cpu_offload.py",
"VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py",
],
},
"vllm_basic_models_test": {
@ -68,14 +67,11 @@ def sample_vllm_test_library():
"-v",
"-s",
"entrypoints/llm",
"--ignore=entrypoints/llm/test_lazy_outlines.py",
"--ignore=entrypoints/llm/test_generate.py",
"--ignore=entrypoints/llm/test_generate_multiple_loras.py",
"--ignore=entrypoints/llm/test_collective_rpc.py",
]
),
"pytest -v -s entrypoints/llm/test_lazy_outlines.py",
"pytest -v -s entrypoints/llm/test_generate.py ",
"pytest -v -s entrypoints/llm/test_generate.py",
"VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode",
],
},

View File

@ -1,40 +0,0 @@
#!/bin/bash
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex -o pipefail
version=${DOCS_VERSION:-nightly}
echo "version: $version"
# Build functorch docs
pushd $pt_checkout/functorch/docs
make html
popd
git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages
pushd functorch_ghpages
if [ "$version" == "main" ]; then
version=nightly
fi
git rm -rf "$version" || true
mv "$pt_checkout/functorch/docs/build/html" "$version"
git add "$version" || true
git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin gh-pages
fi
popd

View File

@ -59,7 +59,7 @@ test_python_shard() {
setup_test_python
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "$1" "$NUM_TEST_SHARDS"
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS"
assert_git_not_dirty
}

View File

@ -0,0 +1,25 @@
From 6e08c9d08e9de59c7af28b720289debbbd384764 Mon Sep 17 00:00:00 2001
From: Michael Wang <13521008+isVoid@users.noreply.github.com>
Date: Tue, 1 Apr 2025 17:28:05 -0700
Subject: [PATCH] Avoid bumping certain driver API to avoid future breakage
(#185)
Co-authored-by: isVoid <isVoid@users.noreply.github.com>
---
numba_cuda/numba/cuda/cudadrv/driver.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/numba_cuda/numba/cuda/cudadrv/driver.py b/numba_cuda/numba/cuda/cudadrv/driver.py
index 1641bf77..233e9ed7 100644
--- a/numba_cuda/numba/cuda/cudadrv/driver.py
+++ b/numba_cuda/numba/cuda/cudadrv/driver.py
@@ -365,6 +365,9 @@ def _find_api(self, fname):
else:
variants = ('_v2', '')
+ if fname in ("cuCtxGetDevice", "cuCtxSynchronize"):
+ return getattr(self.lib, fname)
+
for variant in variants:
try:
return getattr(self.lib, f'{fname}{variant}')

View File

@ -32,6 +32,16 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v
git config --global --add safe.directory /var/lib/jenkins/workspace
fi
# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
fi
echo "Environment variables:"
env
@ -312,14 +322,14 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
@ -374,7 +384,6 @@ test_dynamo_wrapped_shard() {
--exclude-distributed-tests \
--exclude-torch-export-tests \
--exclude-aot-dispatch-tests \
--exclude-quantization-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose \
--upload-artifacts-while-running
@ -1147,12 +1156,6 @@ test_distributed() {
fi
}
test_quantization() {
echo "Testing quantization"
python test/test_quantization.py
}
test_rpc() {
echo "Testing RPC C++ tests"
# NB: the ending test_rpc must match the current function name for the current
@ -1547,14 +1550,10 @@ test_executorch() {
install_torchvision
install_torchaudio
INSTALL_SCRIPT="$(pwd)/.ci/docker/common/install_executorch.sh"
pushd /executorch
export PYTHON_EXECUTABLE=python
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
bash .ci/scripts/setup-linux.sh --build-tool cmake
"${INSTALL_SCRIPT}" setup_executorch
echo "Run ExecuTorch unit tests"
pytest -v -n auto
@ -1568,10 +1567,6 @@ test_executorch() {
popd
# Test torchgen generated code for Executorch.
echo "Testing ExecuTorch op registration"
"$BUILD_BIN_DIR"/test_edge_op_registration
assert_git_not_dirty
}
@ -1579,6 +1574,7 @@ test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
distributed/elastic/timer/api_test distributed/elastic/timer/local_timer_example distributed/elastic/timer/local_timer_test \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@ -1653,8 +1649,6 @@ elif [[ "${TEST_CONFIG}" == *executorch* ]]; then
test_executorch
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
test_python_legacy_jit
elif [[ "$TEST_CONFIG" == 'quantization' ]]; then
test_quantization
elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"

View File

@ -137,7 +137,7 @@ sccache --show-stats
python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"
(
if "%BUILD_ENVIRONMENT%"=="" (
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%\envs\py_tmp` in Command Prompt before running Git Bash.
) else (
copy /Y "dist\*.whl" "%PYTORCH_FINAL_PACKAGE_DIR%"

View File

@ -3,12 +3,12 @@ if "%BUILD_ENVIRONMENT%"=="" (
) else (
set CONDA_PARENT_DIR=C:\Jenkins
)
set CONDA_ROOT_DIR=%CONDA_PARENT_DIR%\Miniconda3
:: Be conservative here when rolling out the new AMI with conda. This will try
:: to install conda as before if it couldn't find the conda installation. This
:: can be removed eventually after we gain enough confidence in the AMI
if not exist %CONDA_PARENT_DIR%\Miniconda3 (
if not exist %CONDA_ROOT_DIR% (
set INSTALL_FRESH_CONDA=1
)
@ -17,10 +17,14 @@ if "%INSTALL_FRESH_CONDA%"=="1" (
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_ROOT_DIR%
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
:: Activate conda so that we can use its commands, i.e. conda, python, pip
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3
call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%
:: Activate conda so that we can use its commands, i.e. conda, python, pip
call conda activate py_tmp
call pip install -r .ci/docker/requirements-ci.txt

View File

@ -14,7 +14,7 @@ if not errorlevel 0 exit /b
:: build\torch. Rather than changing all these references, making a copy of torch folder
:: from conda to the current workspace is easier. The workspace will be cleaned up after
:: the job anyway
xcopy /s %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
xcopy /s %CONDA_ROOT_DIR%\envs\py_tmp\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
pushd .
if "%VC_VERSION%" == "" (

View File

@ -25,7 +25,7 @@ echo Copying over test times file
robocopy /E "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.additional_ci_files" "%PROJECT_DIR_WIN%\.additional_ci_files"
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
if ERRORLEVEL 1 goto fail
popd

View File

@ -38,7 +38,14 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
python -m pip install tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
# scipy from 1.6.3 to 1.10
# expecttest from 0.1.3 to 0.3.0
# xdoctest from 1.0.2 to 1.3.0
python -m pip install "future==0.18.2" "hypothesis==5.35.1" "expecttest==0.3.0" "librosa>=0.6.2" "scipy==1.10.1" "psutil==5.9.1" "pynvml==11.4.1" "pillow==9.2.0" "unittest-xml-reporting<=3.2.0,>=2.0.0" "pytest==7.1.3" "pytest-xdist==2.5.0" "pytest-flakefinder==1.1.0" "pytest-rerunfailures==10.3" "pytest-shard==0.1.2" "sympy==1.11.1" "xdoctest==1.3.0" "pygments==2.12.0" "opt-einsum>=3.3" "networkx==2.8.8" "mpmath==1.2.1" "pytest-cpp==2.3.0" "boto3==1.35.42"
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.15.1.0
@ -52,9 +59,6 @@ python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
# Install expecttest to merge https://github.com/pytorch/pytorch/pull/155308
python -m pip install expecttest==0.3.0
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -264,7 +264,7 @@ def unzip_artifact_and_replace_files() -> None:
change_content_to_new_version(f"artifacts/dist/{old_stem}/torch/version.py")
for file in Path(f"artifacts/dist/{old_stem}").glob(
"*.dist-info/**",
"*.dist-info/*",
):
change_content_to_new_version(file)

View File

@ -6,6 +6,12 @@ inputs:
cuda-version:
description: which cuda version to install, 'cpu' for none
required: true
python-version:
required: false
type: string
default: "3.10"
description: |
The python version to be used. Will be 3.10 by default
runs:
using: composite
@ -38,18 +44,24 @@ runs:
CONDA="C:\Jenkins\Miniconda3\condabin\conda.bat"
{
echo "CONDA=${CONDA}";
echo "CONDA_RUN=${CONDA} run --no-capture-output";
echo "CONDA_BUILD=${CONDA} run conda-build";
echo "CONDA_INSTALL=${CONDA} install";
} >> "${GITHUB_ENV}"
- name: Setup Python3
env:
PYTHON_VERSION: ${{ inputs.python-version }}
shell: bash
run: |
set +e
set -x
PYTHON3=$(${CONDA_RUN} which python3)
# Create new py_tmp env with python-version
${CONDA} create -y -n py_tmp python=${PYTHON_VERSION} intel-openmp
PYTHON3=$(${CONDA_RUN} -n py_tmp which python3)
EXIT_CODE=$?
if [[ "${EXIT_CODE}" == "0" ]]; then
@ -62,7 +74,7 @@ runs:
# installation, which is Python 3 based. Its Python is default to Python 3. Further, there
# is also the Miniconda installation that is Python 2 based, and both can be installed if
# needed. In both cases, Python binary is just called python
PYTHON=$(${CONDA_RUN} which python)
PYTHON=$(${CONDA_RUN} -n py_tmp which python)
EXIT_CODE=$?
if [[ "${EXIT_CODE}" == "0" ]]; then

View File

@ -1 +1 @@
973c9d01da863cac9c51e8a5c0d390fc84b84fbc
9d1c50a5ac8726f4af0d4a4e85ad4d26a674ad26

.github/labeler.yml vendored
View File

@ -130,3 +130,6 @@
- torch/csrc/inductor/aoti_include/**
- torchgen/aoti/**
- torchgen/gen_aoti_c_shim.py
"ciflow/vllm":
- .github/ci_commit_pins/vllm.txt

View File

@ -19,7 +19,6 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/periodic-rocm-mi300
- ciflow/quantization-periodic
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390

View File

@ -135,7 +135,7 @@ ROCM_SMOKE_WORKFLOWS = [
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["6.4"],
python_versions=["3.9"],
python_versions=["3.10"],
),
ciflow_config=CIFlowConfig(
labels={

View File

@ -187,8 +187,6 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
with:
driver-version: ${{ startsWith(inputs.GPU_ARCH_VERSION, '13') && '580.65.06' || '570.133.07' }}
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: configure aws credentials

View File

@ -75,10 +75,6 @@ jobs:
runner: ${{ inputs.runner_prefix }}linux.2xlarge
# It takes less than 30m to finish python docs unless there are issues
timeout-minutes: 30
- docs_type: functorch
runner: ${{ inputs.runner_prefix }}linux.2xlarge
# It takes less than 15m to finish functorch docs unless there are issues
timeout-minutes: 15
# Set a fixed name for this job instead of using the current matrix-generated name, i.e. build-docs (cpp, linux.12xlarge, 180)
# The current name requires updating the database last docs push query from test-infra every time the matrix is updated
name: build-docs-${{ matrix.docs_type }}-${{ inputs.push }}
@ -211,16 +207,6 @@ jobs:
path: cppdocs/
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}/cppdocs
- name: Upload functorch Docs Preview
uses: seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a # v5.1.0
if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'functorch' && steps.build-docs.outcome == 'success' }}
with:
retention-days: 14
s3-bucket: doc-previews
if-no-files-found: error
path: functorch_ghpages/nightly/
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}/functorchdocs
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -169,7 +169,7 @@ jobs:
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
with:
driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '570.133.07' }}
driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '580.82.07' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && !contains(matrix.runner, 'b200') }}
- name: Setup GPU_FLAG for docker run

View File

@ -62,6 +62,11 @@ on:
required: false
type: number
default: 1
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
@ -76,10 +81,9 @@ jobs:
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
@ -131,6 +135,9 @@ jobs:
- name: Start monitoring script
id: monitor-script
if: ${{ !inputs.disable-monitor }}
shell: bash
continue-on-error: true
env:
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
@ -138,9 +145,6 @@ jobs:
WORKFLOW_RUN_ID: ${{github.run_id}}
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
if: ${{ !inputs.disable-monitor }}
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
@ -178,6 +182,12 @@ jobs:
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test
id: test
env:
@ -193,20 +203,22 @@ jobs:
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
TESTS_TO_INCLUDE: ${{ inputs.tests-to-include }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
run: |
set -x
@ -236,6 +248,7 @@ jobs:
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BASE_SHA \
-e BRANCH \
-e SHA1 \
-e AWS_DEFAULT_REGION \
@ -253,10 +266,12 @@ jobs:
-e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e TESTS_TO_INCLUDE \
-e HUGGING_FACE_HUB_TOKEN \
-e DASHBOARD_TAG \
--env-file="${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" \
--ulimit stack=10485760:83886080 \
--ulimit core=0 \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--shm-size="8g" \

View File

@ -151,7 +151,7 @@ jobs:
BUILD_WHEEL: 1
MAX_JOBS: 8
CUDA_VERSION: ${{ inputs.cuda-version }}
PYTHON_VERSION: "3.9"
PYTHON_VERSION: "3.10"
SCCACHE_BUCKET: "ossci-compiler-cache"
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SCCACHE_REGION: us-east-1

View File

@ -184,7 +184,7 @@ jobs:
env:
USE_CUDA: ${{ inputs.cuda-version != 'cpu' && '1' || '0' }}
INSTALL_WINDOWS_SDK: 1
PYTHON_VERSION: 3.9
PYTHON_VERSION: "3.10"
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}

View File

@ -178,12 +178,12 @@ jobs:
contents: read
container:
image: continuumio/miniconda3:4.12.0
environment: ${{ (github.event_name == 'push' && github.event.ref == 'refs/heads/main') && 'nightly-wheel-upload' || '' }}
environment: ${{ ((github.event_name == 'push' && github.event.ref == 'refs/heads/main') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && 'nightly-wheel-upload' || '' }}
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Configure AWS credentials(PyTorch account) for main
if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/main' }}
if: ${{ (github.event_name == 'push' && github.event.ref == 'refs/heads/main') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::749337293305:role/gha_workflow_nightly_build_wheels

View File

@ -71,8 +71,7 @@ jobs:
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter,
# Executorch pin needs update
# pytorch-linux-jammy-py3-clang12-executorch,
pytorch-linux-jammy-py3-clang12-executorch,
pytorch-linux-jammy-py3.12-triton-cpu,
pytorch-linux-noble-riscv64-py3.12-gcc14
]

View File

@ -44,7 +44,7 @@ jobs:
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_9-rocm6_4-build:
manywheel-py3_10-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -58,16 +58,16 @@ jobs:
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.9"
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-rocm6_4
build_name: manywheel-py3_10-rocm6_4
build_environment: linux-binary-manywheel-rocm
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-test: # Testing
manywheel-py3_10-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-rocm6_4-build
- manywheel-py3_10-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
@ -82,14 +82,14 @@ jobs:
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.9"
DESIRED_PYTHON: "3.10"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_9-rocm6_4
name: manywheel-py3_10-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4

View File

@ -43,6 +43,11 @@ on:
required: false
type: boolean
default: false
freezing:
description: Run freezing?
required: false
type: boolean
default: true
benchmark_configs:
description: The list of configs used the benchmark
required: false
@ -102,7 +107,7 @@ jobs:
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true-freezing-true
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}
timeout-minutes: 720
@ -116,10 +121,9 @@ jobs:
name: inductor-test
uses: ./.github/workflows/_linux-test.yml
needs: inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}
dashboard-tag: training-${{ inputs.training || 'false' }}-inference-${{ inputs.inference || 'true' }}-default-${{ inputs.default || 'true' }}-dynamic-${{ inputs.dynamic || 'true' }}-cppwrapper-${{ inputs.cppwrapper || 'true' }}-aotinductor-${{ inputs.aotinductor || 'true' }}-freezing-${{ inputs.freezing || 'true' }}
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}
timeout-minutes: 720

View File

@ -105,7 +105,7 @@ jobs:
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
submodules: false
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"

View File

@ -127,6 +127,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
@ -316,32 +318,6 @@ jobs:
]}
secrets: inherit
linux-jammy-py3-clang12-executorch-build:
if: false # Docker build needs pin update
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3-clang12-executorch
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang12-executorch
test-matrix: |
{ include: [
{ config: "executorch", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
secrets: inherit
linux-jammy-py3-clang12-executorch-test:
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-py3-clang12-executorch-build
if: false # Has been broken for a while
with:
build-environment: linux-jammy-py3-clang12-executorch
docker-image: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:
name: cuda12.8-py3.10-gcc9-sm75
uses: ./.github/workflows/_linux-build.yml

View File

@ -1,54 +0,0 @@
name: quantization-periodic
on:
push:
tags:
- ciflow/quantization-periodic/*
workflow_dispatch:
schedule:
# run weekly
- cron: "45 0 * * 0"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:
name: get-default-label-prefix
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
periodic-quantization-build:
name: periodic-quantization-build
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.9'
test-matrix: |
{ include: [
{ config: "quantization", shard: 1, num_shards: 1, runner: "${{ needs.get-default-label-prefix.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
secrets: inherit
periodic-test-quantization:
name: periodic-test-quantization
uses: ./.github/workflows/_linux-test.yml
needs: periodic-quantization-build
with:
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image: ${{ needs.periodic-quantization-build.outputs.docker-image }}
test-matrix: ${{ needs.periodic-quantization-build.outputs.test-matrix }}
secrets: inherit

View File

@ -140,6 +140,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -259,3 +259,27 @@ jobs:
docker-image: ${{ needs.verify-cachebench-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.verify-cachebench-cpu-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-py3-clang12-executorch-build:
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3-clang12-executorch
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang12-executorch
test-matrix: |
{ include: [
{ config: "executorch", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
secrets: inherit
linux-jammy-py3-clang12-executorch-test:
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-py3-clang12-executorch-build
with:
build-environment: linux-jammy-py3-clang12-executorch
docker-image: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
secrets: inherit

View File

@ -36,6 +36,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# When building vLLM, uv doesn't like that we rename wheel without changing the wheel metadata
allow-reuse-old-whl: false
build-additional-packages: "vision audio"
build-external-packages: "vllm"
build-environment: linux-jammy-cuda12.8-py3.12-gcc11

.gitignore vendored
View File

@ -259,6 +259,9 @@ gen
.pytest_cache
aten/build/*
# Linker scripts for prioritized text optimization
cmake/linker_script.ld
# Bram
plsdontbreak

View File

@ -123,6 +123,7 @@ is_formatter = true
code = 'MYPY'
include_patterns = [
'setup.py',
'functorch/dim/**/*.py',
'torch/**/*.py',
'torch/**/*.pyi',
'caffe2/**/*.py',
@ -964,7 +965,6 @@ exclude_patterns = [
'test/jit/**', # should be run through test/test_jit.py
'test/ao/sparsity/**', # should be run through test/test_ao_sparsity.py
'test/fx/**', # should be run through test/test_fx.py
'test/bottleneck_test/**', # excluded by test/run_test.py
'test/package/**', # excluded by test/run_test.py
'test/distributed/argparse_util_test.py',
'test/distributed/bin/test_script.py',
@ -1410,8 +1410,6 @@ exclude_patterns = [
'torch/utils/benchmark/utils/timer.py',
'torch/utils/benchmark/utils/valgrind_wrapper/__init__.py',
'torch/utils/benchmark/utils/valgrind_wrapper/timer_interface.py',
'torch/utils/bottleneck/__init__.py',
'torch/utils/bottleneck/__main__.py',
'torch/utils/bundled_inputs.py',
'torch/utils/checkpoint.py',
'torch/utils/collect_env.py',

View File

@ -380,6 +380,13 @@ cmake_dependent_option(BUILD_BUNDLE_PTXAS "Bundle PTX into torch/bin fodler"
OFF "USE_CUDA" OFF)
cmake_dependent_option(USE_KLEIDIAI "Use KleidiAI for the ARM CPU & AARCH64 architecture." ON
"CPU_AARCH64" OFF)
# prioritized text linker, ON by default for AArch64+Linux, option visible to all AArch64, x86 and ppc64le.
set(USE_PRIORITIZED_TEXT_DEFAULT OFF)
if(LINUX AND CPU_AARCH64)
set(USE_PRIORITIZED_TEXT_DEFAULT ON)
endif()
cmake_dependent_option(USE_PRIORITIZED_TEXT_FOR_LD "Use prioritized text linker for ld."
"${USE_PRIORITIZED_TEXT_DEFAULT}" "CPU_INTEL OR CPU_AARCH64 OR CPU_POWER" OFF)
option(USE_MIMALLOC "Use mimalloc" OFF)
# Enable third party mimalloc library to improve memory allocation performance
@ -657,6 +664,11 @@ endif(MSVC)
string(APPEND CMAKE_CUDA_FLAGS " -Xfatbin -compress-all")
# Set linker max-page-size to 64KiB on AArch64 Linux
if(LINUX AND CPU_AARCH64)
add_link_options_if_supported("-z,max-page-size=0x10000")
endif()
# Set INTERN_BUILD_MOBILE for all mobile builds. Components that are not
# applicable to mobile are disabled by this variable. Setting
# `BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN` environment variable can force it
@ -891,7 +903,7 @@ IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
endif()
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100.
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32)
message(STATUS "Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI ON)
endif()
@ -1421,3 +1433,57 @@ if(BUILD_BUNDLE_PTXAS AND USE_CUDA)
install(PROGRAMS "${PROJECT_BINARY_DIR}/ptxas"
DESTINATION "${CMAKE_INSTALL_BINDIR}")
endif()
if(USE_PRIORITIZED_TEXT_FOR_LD)
add_compile_options(
$<$<COMPILE_LANGUAGE:C,CXX>:-ffunction-sections>
$<$<COMPILE_LANGUAGE:C,CXX>:-fdata-sections>
)
set(LINKER_SCRIPT_FILE_OUT "${CMAKE_SOURCE_DIR}/cmake/linker_script.ld")
set(LINKER_SCRIPT_FILE_IN "${CMAKE_SOURCE_DIR}/cmake/prioritized_text.txt")
add_custom_command(
OUTPUT "${LINKER_SCRIPT_FILE_OUT}"
COMMAND ${Python_EXECUTABLE} ${CMAKE_SOURCE_DIR}/tools/setup_helpers/generate_linker_script.py --filein "${LINKER_SCRIPT_FILE_IN}" --fout "${LINKER_SCRIPT_FILE_OUT}"
DEPENDS ${CMAKE_SOURCE_DIR}/tools/setup_helpers/generate_linker_script.py "${LINKER_SCRIPT_FILE_IN}"
COMMENT "Generating prioritized text linker files"
VERBATIM
)
add_custom_target(generate_linker_script DEPENDS "${LINKER_SCRIPT_FILE_OUT}")
if(BUILD_PYTHON)
set(LINKER_OPT_TARGETS torch_python)
endif()
if(NOT BUILD_LIBTORCHLESS)
list(APPEND LINKER_OPT_TARGETS torch_cpu c10)
if(USE_CUDA)
list(APPEND LINKER_OPT_TARGETS torch_cuda c10_cuda)
endif()
if(USE_XPU)
list(APPEND LINKER_OPT_TARGETS torch_xpu c10_xpu)
endif()
if(USE_ROCM)
list(APPEND LINKER_OPT_TARGETS torch_hip c10_hip)
endif()
endif()
foreach(tgt IN LISTS LINKER_OPT_TARGETS)
if(TARGET ${tgt})
add_dependencies("${tgt}" generate_linker_script)
target_link_options_if_supported(${tgt} "-T,${LINKER_SCRIPT_FILE_OUT}")
set_property(TARGET ${tgt} APPEND PROPERTY LINK_DEPENDS "${LINKER_SCRIPT_FILE_OUT}")
else()
message(WARNING "Requested target '${tgt}' for linker script optimization was not found.")
endif()
endforeach()
else()
if(LINUX AND CPU_AARCH64)
message(WARNING [[
It is strongly recommend to enable linker script optimization for all AArch64 Linux builds.
To do so please export USE_PRIORITIZED_TEXT_FOR_LD=1
]])
endif()
endif()

View File

@ -180,7 +180,7 @@ void Context::setUserEnabledNNPACK(bool e) {
}
bool Context::allowTF32CuDNN(const std::string& op) const {
if (op.size() == 0){
if (op.empty()){
bool allow_tf32_rnn = float32Precision("cuda", "rnn") == "tf32";
bool allow_tf32_conv = float32Precision("cuda", "conv") == "tf32";
TORCH_CHECK(
@ -281,9 +281,6 @@ bool Context::userEnabledOverrideableSDP() const {
static constexpr const auto cublas_config_var_name = "CUBLAS_WORKSPACE_CONFIG";
static constexpr const std::array<const char*, 2> cublas_deterministic_configs = {":4096:8", ":16:8"};
#ifdef USE_ROCM
static constexpr const auto hipblaslt_allow_tf32 = "HIPBLASLT_ALLOW_TF32";
#endif
bool Context::checkCuBLASConfigDeterministic() {
// If using CUDA 10.2 or greater, need to make sure CuBLAS workspace config
@ -343,12 +340,6 @@ void Context::setImmediateMiopen(bool b) {
}
bool Context::allowTF32CuBLAS() const {
#ifdef USE_ROCM
const auto allow_tf32 = c10::utils::check_env(hipblaslt_allow_tf32);
if (allow_tf32 != true) {
return false;
}
#endif
bool legacy_allow_tf32 = float32_matmul_precision != at::Float32MatmulPrecision::HIGHEST;
bool allow_tf32_new = float32Precision("cuda", "matmul") == "tf32";
TORCH_CHECK(
@ -362,14 +353,6 @@ bool Context::allowTF32CuBLAS() const {
}
void Context::setAllowTF32CuBLAS(bool b) {
#ifdef USE_ROCM
const auto allow_tf32 = c10::utils::check_env(hipblaslt_allow_tf32);
if (allow_tf32 != true) {
C10_LOG_FIRST_N(INFO, 10) << "torch.backends.cuda.matmul.allow_tf32 is not supported on ROCm by default. "
<< "Please set environment variable HIPBLASLT_ALLOW_TF32=1 to enable it.";
return;
}
#endif
float32_matmul_precision = b ? at::Float32MatmulPrecision::HIGH : at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", b ? "tf32" : "ieee");
}
@ -443,7 +426,7 @@ void Context::setFloat32Precision(const std::string& backend, const std::string&
std::string msg;
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
for (auto p : iterp->second) {
for (const auto& p : iterp->second) {
msg += p;
msg += " ";
}

View File

@ -65,14 +65,24 @@ DLDataType getDLDataType(const Tensor& t) {
break;
// TODO(#146647): use macro here instead of spelling out each shell dtype
case ScalarType::Float8_e5m2:
dtype.code = DLDataTypeCode::kDLFloat8_e5m2;
break;
case ScalarType::Float8_e5m2fnuz:
dtype.code = DLDataTypeCode::kDLFloat8_e5m2fnuz;
break;
case ScalarType::Float8_e4m3fn:
dtype.code = DLDataTypeCode::kDLFloat8_e4m3fn;
break;
case ScalarType::Float8_e4m3fnuz:
dtype.code = DLDataTypeCode::kDLFloat8_e4m3fnuz;
break;
case ScalarType::Float8_e8m0fnu:
TORCH_CHECK_BUFFER(false, "float8 types are not supported by dlpack");
dtype.code = DLDataTypeCode::kDLFloat8_e8m0fnu;
break;
case ScalarType::Float4_e2m1fn_x2:
TORCH_CHECK_BUFFER(false, "float4 types are not supported by dlpack");
dtype.code = DLDataTypeCode::kDLFloat4_e2m1fn;
dtype.lanes = 2;
dtype.bits = 4;
break;
case ScalarType::QInt8:
case ScalarType::QUInt8:
@ -177,7 +187,11 @@ static Device getATenDevice(DLDeviceType type, c10::DeviceIndex index, void* dat
ScalarType toScalarType(const DLDataType& dtype) {
ScalarType stype = ScalarType::Undefined;
TORCH_CHECK_BUFFER(dtype.lanes == 1, "ATen does not support lanes != 1");
if (dtype.code != DLDataTypeCode::kDLFloat4_e2m1fn) {
TORCH_CHECK_BUFFER(
dtype.lanes == 1,
"ATen does not support lanes != 1 for dtype code", std::to_string(dtype.code));
}
switch (dtype.code) {
case DLDataTypeCode::kDLUInt:
switch (dtype.bits) {
@ -269,6 +283,73 @@ ScalarType toScalarType(const DLDataType& dtype) {
false, "Unsupported kDLBool bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat8_e5m2:
switch (dtype.bits) {
case 8:
stype = ScalarType::Float8_e5m2;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat8_e5m2 bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat8_e5m2fnuz:
switch (dtype.bits) {
case 8:
stype = ScalarType::Float8_e5m2fnuz;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat8_e5m2fnuz bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat8_e4m3fn:
switch (dtype.bits) {
case 8:
stype = ScalarType::Float8_e4m3fn;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat8_e4m3fn bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat8_e4m3fnuz:
switch (dtype.bits) {
case 8:
stype = ScalarType::Float8_e4m3fnuz;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat8_e4m3fnuz bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat8_e8m0fnu:
switch (dtype.bits) {
case 8:
stype = ScalarType::Float8_e8m0fnu;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat8_e8m0fnu bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat4_e2m1fn:
switch (dtype.bits) {
case 4:
switch (dtype.lanes) {
case 2:
stype = ScalarType::Float4_e2m1fn_x2;
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat4_e2m1fn lanes ", std::to_string(dtype.lanes));
}
break;
default:
TORCH_CHECK_BUFFER(
false, "Unsupported kDLFloat4_e2m1fn bits ", std::to_string(dtype.bits));
}
break;
default:
TORCH_CHECK_BUFFER(false, "Unsupported code ", std::to_string(dtype.code));
}
@ -354,8 +435,8 @@ T* toDLPackImpl(const Tensor& src) {
atDLMTensor->tensor.dl_tensor.device = torchDeviceToDLDevice(src.device());
atDLMTensor->tensor.dl_tensor.ndim = static_cast<int32_t>(src.dim());
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
atDLMTensor->tensor.dl_tensor.shape = view.sizes().data();
atDLMTensor->tensor.dl_tensor.strides = view.strides().data();
atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(view.sizes().data());
atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(view.strides().data());
atDLMTensor->tensor.dl_tensor.byte_offset = 0;
fillVersion(&atDLMTensor->tensor);
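
As a quick illustration of the dtype mapping added above, a minimal sketch of the DLPack (code, bits, lanes) triples PyTorch now emits for the new float8/float4 ScalarTypes; the struct below is a stand-in, but the values follow the DLDataTypeCode enum in dlpack.h.

// Illustrative mirror of the getDLDataType() cases added above; not the real converter.
#include <cstdint>

struct DLDataTypeLite {
  uint8_t code;   // e.g. kDLFloat8_e5m2 = 12, kDLFloat4_e2m1fn = 17
  uint8_t bits;   // 8 for the fp8 family, 4 for fp4
  uint16_t lanes; // 1 for fp8; 2 for Float4_e2m1fn_x2 (two fp4 values per byte)
};

// Float8_e5m2 maps to an 8-bit, single-lane element.
constexpr DLDataTypeLite kFloat8E5M2{12, 8, 1};

// Float4_e2m1fn_x2 packs two 4-bit elements, hence bits = 4 and lanes = 2.
constexpr DLDataTypeLite kFloat4E2M1X2{17, 4, 2};

static_assert(kFloat4E2M1X2.bits * kFloat4E2M1X2.lanes == 8,
              "one packed fp4x2 element occupies a single byte");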

View File

@ -102,7 +102,7 @@ FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& base)
// SparseTensorImpl has no storage, so we cannot query its nbytes.
// (original_storage_size is only used for storage resizing in fsdp anyway, which does not apply to sparse)
// Same for XLA
if (base.unsafeGetTensorImpl()->has_storage() && base.device().type() != c10::DeviceType::XLA) {
if (base.unsafeGetTensorImpl()->has_storage() && data_ptr().device().type() != c10::DeviceType::XLA) {
original_storage_size_ = base.unsafeGetTensorImpl()->unsafe_storage().unsafeGetStorageImpl()->sym_nbytes();
} else {
original_storage_size_ = -1;

View File

@ -133,7 +133,7 @@ FunctionalTensorWrapper::FunctionalTensorWrapper(const Tensor& view_value, const
: c10::TensorImpl(
c10::DispatchKeySet(DispatchKey::Functionalize),
view_value.dtype(),
view_value.device()
base->storage().data_ptr().device()
),
value_(view_value),
is_multi_output_view_(base->is_multi_output_view_ || meta.is_multi_output),
@ -485,7 +485,10 @@ void FunctionalTensorWrapper::shallow_copy_from(const c10::intrusive_ptr<TensorI
c10::Device FunctionalTensorWrapper::device_custom() const {
return value_.unsafeGetTensorImpl()->device();
// The storage pointer already uses the underlying tensor custom device (if
// applicable) to extract the device. So, we don't have to recurse again by
// doing value_.unsafeGetTensorImpl()->device().
return storage().data_ptr().device();
}
at::IntArrayRef FunctionalTensorWrapper::sizes_custom() const {
return value_.unsafeGetTensorImpl()->sizes();

View File

@ -94,10 +94,10 @@ inline at::DimVector infer_size_dv(IntArrayRef shape, int64_t numel) {
inline at::SymDimVector infer_size_dv(
c10::SymIntArrayRef shape,
c10::SymInt numel) {
const c10::SymInt& numel) {
auto res = at::SymDimVector(shape);
infer_size_impl<c10::SymIntArrayRef, c10::SymInt, at::SymDimVector>(
shape, std::move(numel), res);
shape, numel, res);
return res;
}

View File

@ -6,7 +6,6 @@
#include <c10/util/TypeList.h>
#include <c10/util/intrusive_ptr.h>
#include <c10/util/order_preserving_flat_hash_map.h>
#include <optional>
#include <ATen/core/TensorBody.h>
#include <ATen/core/jit_type_base.h>

View File

@ -55,8 +55,7 @@ class TORCH_API CppSignature final {
}
private:
explicit CppSignature(std::type_index signature)
: signature_(std::move(signature)) {}
explicit CppSignature(std::type_index signature) : signature_(signature) {}
std::type_index signature_;
};

View File

@ -70,7 +70,7 @@ private:
void _print_dispatch_trace(const std::string& label, const std::string& op_name, const DispatchKeySet& dispatchKeySet) {
auto nesting_value = dispatch_trace_nesting_value();
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " ";
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl;
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << '\n';
}
} // namespace detail
@ -213,9 +213,11 @@ OperatorHandle Dispatcher::findOrRegisterName_(const OperatorName& op_name) {
// Windows build doesn't produce the destructor symbol in PyTorch libs
// causing a linker failure in downstream projects.
// x-ref https://github.com/pytorch/pytorch/issues/70032
#if defined(_WIN32)
OperatorHandle::~OperatorHandle() = default;
#endif
RegistrationHandleRAII Dispatcher::registerLibrary(std::string ns, std::string debug) {
RegistrationHandleRAII Dispatcher::registerLibrary(const std::string& ns, std::string debug) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto found = libraries_.find(ns);
TORCH_CHECK(
@ -306,7 +308,7 @@ PythonModuleMapType& pythonModulesSingleton() {
}
std::optional<std::pair<const char*, const char*>> Dispatcher::getPyStub(OperatorName op_name) {
std::optional<std::pair<const char*, const char*>> Dispatcher::getPyStub(const OperatorName& op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto found = pythonModulesSingleton().find(op_name);
if (found == pythonModulesSingleton().end()) {
@ -342,7 +344,7 @@ RegistrationHandleRAII Dispatcher::registerPythonModule(
});
}
void Dispatcher::throwIfHasPythonModule(OperatorName op_name) {
void Dispatcher::throwIfHasPythonModule(const OperatorName& op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto elt = pythonModulesSingleton().find(op_name);
if (elt == pythonModulesSingleton().end()) {
@ -362,7 +364,7 @@ void Dispatcher::throwIfHasPythonModule(OperatorName op_name) {
}
RegistrationHandleRAII Dispatcher::registerImpl(
OperatorName op_name,
const OperatorName& op_name,
std::optional<DispatchKey> dispatch_key,
KernelFunction kernel,
std::optional<impl::CppSignature> cpp_signature,
@ -377,7 +379,7 @@ RegistrationHandleRAII Dispatcher::registerImpl(
*this,
dispatch_key,
std::move(kernel),
std::move(cpp_signature),
cpp_signature,
std::move(inferred_function_schema),
std::move(debug)
);
@ -406,7 +408,7 @@ void Dispatcher::deregisterImpl_(const OperatorHandle& op, const OperatorName& o
cleanup(op, op_name);
}
RegistrationHandleRAII Dispatcher::registerName(OperatorName op_name) {
RegistrationHandleRAII Dispatcher::registerName(const OperatorName& op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto op = findOrRegisterName_(op_name);
++op.operatorDef_->def_and_impl_count;

View File

@ -13,15 +13,10 @@
#include <condition_variable>
#include <list>
#include <mutex>
#include <type_traits>
#include <ATen/core/enum_tag.h>
#include <ATen/core/grad_mode.h>
#ifndef NDEBUG
#include <iostream>
#endif
namespace c10 {
TORCH_API bool show_dispatch_trace();
@ -255,7 +250,7 @@ class TORCH_API Dispatcher final {
// NB: steals the inferred function schema, as we may need to hold on to
// it for a bit until the real schema turns up
RegistrationHandleRAII registerImpl(
OperatorName op_name,
const OperatorName& op_name,
std::optional<DispatchKey> dispatch_key,
KernelFunction kernel,
std::optional<impl::CppSignature> cpp_signature,
@ -274,15 +269,15 @@ class TORCH_API Dispatcher final {
/**
* Given an operator, throws if we have a pystub.
*/
void throwIfHasPythonModule(OperatorName op_name);
void throwIfHasPythonModule(const OperatorName& op_name);
std::optional<std::pair<const char*, const char*>> getPyStub(
OperatorName op_name);
const OperatorName& op_name);
/**
* Register a new operator by name.
*/
RegistrationHandleRAII registerName(OperatorName op_name);
RegistrationHandleRAII registerName(const OperatorName& op_name);
/**
* Register a fallback kernel for a backend.
@ -300,7 +295,9 @@ class TORCH_API Dispatcher final {
* API. These invocations are only permitted once per program, so we raise
* an error if this is called again for the same namespace.
*/
RegistrationHandleRAII registerLibrary(std::string ns, std::string debug);
RegistrationHandleRAII registerLibrary(
const std::string& ns,
std::string debug);
// ------------------------------------------------------------------------
//
@ -448,8 +445,12 @@ class TORCH_API OperatorHandle {
OperatorHandle& operator=(OperatorHandle&&) noexcept = default;
OperatorHandle(const OperatorHandle&) = default;
OperatorHandle& operator=(const OperatorHandle&) = default;
#if defined(_WIN32)
// NOLINTNEXTLINE(performance-trivially-destructible)
~OperatorHandle();
#else
~OperatorHandle() = default;
#endif
const OperatorName& operator_name() const {
return operatorDef_->op.operator_name();

View File

@ -556,7 +556,7 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
// real_type versus fake_type: in order to be compatible with FunctionSchema
// parser, printing an argument with either MemoryFormat or Layout type should
// give us the original schema string, hence printing out real_type.
auto type = arg.real_type();
const auto& type = arg.real_type();
bool is_opt = type->kind() == OptionalType::Kind;
auto unopt_type = is_opt ? type->castRaw<OptionalType>()->getElementType() : type;

View File

@ -232,7 +232,7 @@ struct TORCH_API OptionalType : public UnionType {
static TypePtr ofTensor();
//
// global singleton
static TypePtr get(TypePtr inner);
static TypePtr get(const TypePtr& inner);
private:
explicit OptionalType(const TypePtr& contained);
@ -895,7 +895,7 @@ struct TORCH_API ListType
// the type List<T>.
// The extra "identifier" argument is needed because we have multiple container types
// that all re-use this function (List<T>, array<T, N>, etc.)
static TypePtr get(const std::string& identifier, TypePtr inner);
static TypePtr get(const std::string& identifier, const TypePtr& inner);
// common cast List[Tensor]
static ListTypePtr ofTensors();

View File

@ -274,7 +274,7 @@ ListTypePtr ListType::ofNumbers() {
return value;
}
TypePtr OptionalType::get(TypePtr inner) {
TypePtr OptionalType::get(const TypePtr& inner) {
static ska::flat_hash_map<TypePtr, TypePtr> containerTypePtrs;
static std::mutex mutex;
// Perf from the lock is ok because this function is guarded behind
@ -287,7 +287,7 @@ TypePtr OptionalType::get(TypePtr inner) {
return containerTypePtrs[inner];
}
TypePtr ListType::get(const std::string& identifier, TypePtr inner) {
TypePtr ListType::get(const std::string& identifier, const TypePtr& inner) {
static ska::flat_hash_map<std::tuple<std::string, TypePtr>, TypePtr> containerTypePtrs;
static std::mutex mutex;
// Perf from the lock is ok because this function is guarded behind

View File

@ -1954,8 +1954,8 @@ void scaled_gemm(
#if ROCM_VERSION >= 70000
if (at::detail::getCUDAHooks().isGPUArch({"gfx950"})) {
// TODO: add constraints based on hipblaslt internals
TORCH_CHECK((m % 32 == 0) && (n % 32 == 0) && (k % 32 == 0),
"Matrix dimensions must be multiples of 32 for MX format. "
TORCH_CHECK((m % 16 == 0) && (n % 16 == 0) && (k % 128 == 0),
"M, N must be multiples of 16 and K should be multiple of 128 for MX format. "
"Got m=", m, ", n=", n, ", k=", k);
}
#endif

View File

@ -266,11 +266,14 @@ CUDAGeneratorImpl::CUDAGeneratorImpl(
* See Note [Acquire lock when using random generators]
*/
void CUDAGeneratorImpl::set_current_seed(uint64_t seed) {
at::cuda::assertNotCapturing(
"Cannot call CUDAGeneratorImpl::set_current_seed");
state_->seed_ = seed;
state_->philox_offset_per_thread_ = 0;
no_reset_rnn_state_.clear();
if (C10_LIKELY(at::cuda::currentStreamCaptureStatus() == at::cuda::CaptureStatus::None)) {
state_->seed_ = seed;
state_->philox_offset_per_thread_ = 0;
no_reset_rnn_state_.clear();
} else {
TORCH_CHECK(state_->seed_ == seed, "CUDAGeneratorImpl::set_current_seed can be called during stream capture only if new seed is the same as the original seed.");
// no-op case
}
}
/**
@ -299,9 +302,6 @@ uint64_t CUDAGeneratorImpl::get_offset() const {
* Gets the current seed of CUDAGeneratorImpl.
*/
uint64_t CUDAGeneratorImpl::current_seed() const {
// Debatable if current_seed() should be allowed in captured regions.
// Conservatively disallow it for now.
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::current_seed");
return state_->seed_;
}
@ -346,8 +346,6 @@ c10::intrusive_ptr<c10::TensorImpl> CUDAGeneratorImpl::get_state() const {
* and size of the internal state.
*/
void CUDAGeneratorImpl::set_state(const c10::TensorImpl& new_state) {
at::cuda::assertNotCapturing(
"Please ensure to utilize the CUDAGeneratorImpl::set_state_index method during capturing.");
static const size_t seed_size = sizeof(uint64_t);
static const size_t offset_size = sizeof(int64_t);
static const size_t total_size = seed_size + offset_size;
@ -402,15 +400,27 @@ c10::intrusive_ptr<c10::GeneratorImpl> CUDAGeneratorImpl::graphsafe_get_state()
*/
void CUDAGeneratorImpl::set_philox_offset_per_thread(uint64_t offset) {
// see Note [Why enforce RNG offset % 4 == 0?]
// Note: If you use CUDNN RNN's, calling
// set_philox_offset_per_thread instead of set_offset will cause the
// cudnn RNN rng state to become stale.
TORCH_CHECK(offset % 4 == 0, "offset must be a multiple of 4");
state_->philox_offset_per_thread_ = offset;
if (C10_LIKELY(at::cuda::currentStreamCaptureStatus() == at::cuda::CaptureStatus::None)) {
state_->philox_offset_per_thread_ = offset;
} else {
state_->offset_intragraph_ = offset;
}
}
/**
* Gets the current philox_offset_per_thread_ of CUDAGeneratorImpl.
*/
uint64_t CUDAGeneratorImpl::philox_offset_per_thread() const {
return state_->philox_offset_per_thread_;
if (C10_LIKELY(at::cuda::currentStreamCaptureStatus() == at::cuda::CaptureStatus::None)) {
return state_->philox_offset_per_thread_;
} else {
return state_->offset_intragraph_;
}
}
/**

View File

@ -122,7 +122,7 @@ struct DeviceThreadHandlePool : public std::enable_shared_from_this<DeviceThread
// Called by the destructor. Releases this thread's handles back into the pool.
void release() {
if(my_handles.size() > 0) {
if(!my_handles.empty()) {
auto parent = weak_parent.lock();
if (!parent) {
// If this thread exits after atexit handlers have completed, the

View File

@ -19,7 +19,7 @@
#define DLPACK_MAJOR_VERSION 1
/*! \brief The current minor version of dlpack */
#define DLPACK_MINOR_VERSION 0
#define DLPACK_MINOR_VERSION 1
/*! \brief DLPACK_DLL prefix for windows */
#ifdef _WIN32
@ -32,9 +32,7 @@
#define DLPACK_DLL
#endif
// NOLINTNEXTLINE(modernize-deprecated-headers)
#include <stdint.h>
// NOLINTNEXTLINE(modernize-deprecated-headers)
#include <stddef.h>
#ifdef __cplusplus
@ -159,6 +157,26 @@ typedef enum {
kDLComplex = 5U,
/*! \brief boolean */
kDLBool = 6U,
/*! \brief FP8 data types */
kDLFloat8_e3m4 = 7U,
kDLFloat8_e4m3 = 8U,
kDLFloat8_e4m3b11fnuz = 9U,
kDLFloat8_e4m3fn = 10U,
kDLFloat8_e4m3fnuz = 11U,
kDLFloat8_e5m2 = 12U,
kDLFloat8_e5m2fnuz = 13U,
kDLFloat8_e8m0fnu = 14U,
/*! \brief FP6 data types
* Setting bits != 6 is currently unspecified, and the producer must ensure it is set
* while the consumer must stop importing if the value is unexpected.
*/
kDLFloat6_e2m3fn = 15U,
kDLFloat6_e3m2fn = 16U,
/*! \brief FP4 data types
* Setting bits != 4 is currently unspecified, and the producer must ensure it is set
* while the consumer must stop importing if the value is unexpected.
*/
kDLFloat4_e2m1fn = 17U,
} DLDataTypeCode;
/*!
@ -172,6 +190,12 @@ typedef enum {
* - int8: type_code = 0, bits = 8, lanes = 1
* - std::complex<float>: type_code = 5, bits = 64, lanes = 1
* - bool: type_code = 6, bits = 8, lanes = 1 (as per common array library convention, the underlying storage size of bool is 8 bits)
* - float8_e4m3: type_code = 8, bits = 8, lanes = 1 (packed in memory)
* - float6_e3m2fn: type_code = 16, bits = 6, lanes = 1 (packed in memory)
* - float4_e2m1fn: type_code = 17, bits = 4, lanes = 1 (packed in memory)
*
* When a sub-byte type is packed, DLPack requires the data to be in little bit-endian, i.e.,
* for a packed data set D ((D >> (i * bits)) && bit_mask) stores the i-th element.
*/
typedef struct {
/*!
@ -229,12 +253,12 @@ typedef struct {
/*! \brief The data type of the pointer*/
DLDataType dtype;
/*! \brief The shape of the tensor */
const int64_t* shape;
int64_t* shape;
/*!
* \brief strides of the tensor (in number of elements, not bytes)
* can be NULL, indicating tensor is compact and row-majored.
*/
const int64_t* strides;
int64_t* strides;
/*! \brief The offset in bytes to the beginning pointer to data */
uint64_t byte_offset;
} DLTensor;
@ -269,7 +293,7 @@ typedef struct DLManagedTensor {
void (*deleter)(struct DLManagedTensor * self);
} DLManagedTensor;
// bit masks used in in the DLManagedTensorVersioned
// bit masks used in the DLManagedTensorVersioned
/*! \brief bit mask to indicate that the tensor is read only. */
#define DLPACK_FLAG_BITMASK_READ_ONLY (1UL << 0UL)
@ -282,6 +306,14 @@ typedef struct DLManagedTensor {
*/
#define DLPACK_FLAG_BITMASK_IS_COPIED (1UL << 1UL)
/*
* \brief bit mask to indicate that whether a sub-byte type is packed or padded.
*
* The default for sub-byte types (ex: fp4/fp6) is assumed packed. This flag can
* be set by the producer to signal that a tensor of sub-byte type is padded.
*/
#define DLPACK_FLAG_BITMASK_IS_SUBBYTE_TYPE_PADDED (1UL << 2UL)
/*!
* \brief A versioned and managed C Tensor object, manage memory of DLTensor.
*
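
The packing rule quoted in the comment above (little bit-endian: element i of a packed data set D lives at (D >> (i * bits)) & bit_mask) can be made concrete with a small illustrative reader for fp4 data; this helper is not part of dlpack.h.

// Sketch only: extract the raw 4-bit pattern of element i from packed fp4 storage.
#include <cstdint>

inline uint8_t unpack_fp4(const uint8_t* data, int64_t i) {
  constexpr int bits = 4;
  constexpr uint8_t bit_mask = 0xF;   // (1 << bits) - 1
  uint8_t byte = data[i / 2];         // two fp4 elements per byte
  int shift = (i % 2) * bits;         // element 0 sits in the low nibble
  return (byte >> shift) & bit_mask;  // raw 4-bit pattern of element i
}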

View File

@ -139,7 +139,7 @@ static void autogradBasedTransformSendToNext(
std::bitset<default_bitset_size> outputs_aliasing_immutable; // set = 1 for all bits
if(!grad_special_case) {
for (auto idx = stack->size() - args_size; idx < stack->size(); idx++) {
const auto ivalue = (*stack)[idx];
const auto& ivalue = (*stack)[idx];
if (!ivalue.isTensor()) {
continue; // only input that can be aliased is a tensor, not a tensor list (except in ops without returns)
}

View File

@ -6,6 +6,8 @@
#include <ATen/functorch/BatchRulesHelper.h>
#include <algorithm>
namespace at::functorch {
typedef std::tuple<Tensor, std::optional<int64_t>> oneOutput;
@ -315,7 +317,7 @@ oneOutput linalg_lu_solve_batch_rule(
const auto LU_num_batch_dims = rankWithoutBatchDim(LU_, LU_bdim) - LU_min_rank;
const auto pivots_num_batch_dims = rankWithoutBatchDim(pivots_, pivots_bdim) - pivots_min_rank;
const auto B_num_batch_dims = rankWithoutBatchDim(B_, B_bdim) - B_min_rank;
const auto max_num_batch_dims = std::max(std::max(LU_num_batch_dims, pivots_num_batch_dims), B_num_batch_dims);
const auto max_num_batch_dims = std::max({LU_num_batch_dims, pivots_num_batch_dims, B_num_batch_dims});
LU_ = maybePadToLogicalRank(LU_, LU_bdim, max_num_batch_dims + LU_min_rank);
pivots_ = maybePadToLogicalRank(pivots_, pivots_bdim, max_num_batch_dims + pivots_min_rank);

View File

@ -171,6 +171,8 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
POINTWISE_BOXED(fill_.Scalar);
POINTWISE_BOXED(zero_);
// This is special because this op doesn't return anything
m.impl("_assert_tensor_metadata", native::_assert_tensor_metadata);
#undef UNARY_POINTWISE
#undef UNARY_POINTWISE_ALL

View File

@ -897,11 +897,11 @@ Tensor& div_(Tensor& self, const Scalar& other) {
}
Tensor div(const Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div(wrapped_scalar_tensor(other), std::move(rounding_mode)); // redispatch!
return self.div(wrapped_scalar_tensor(other), rounding_mode); // redispatch!
}
Tensor& div_(Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div_(wrapped_scalar_tensor(other), std::move(rounding_mode)); // redispatch!
return self.div_(wrapped_scalar_tensor(other), rounding_mode); // redispatch!
}
// divide, alias for div
@ -926,23 +926,23 @@ Tensor& divide_(Tensor& self, const Scalar& other) {
}
Tensor& divide_out(const Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode, Tensor& result) {
return at::div_out(result, self, other, std::move(rounding_mode));
return at::div_out(result, self, other, rounding_mode);
}
Tensor divide(const Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode) {
return self.div(other, std::move(rounding_mode));
return self.div(other, rounding_mode);
}
Tensor& divide_(Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode) {
return self.div_(other, std::move(rounding_mode));
return self.div_(other, rounding_mode);
}
Tensor divide(const Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div(other, std::move(rounding_mode));
return self.div(other, rounding_mode);
}
Tensor& divide_(Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div_(other, std::move(rounding_mode));
return self.div_(other, rounding_mode);
}
// true_divide, an alias for div

View File

@ -81,7 +81,7 @@ Tensor math_channel_shuffle(const Tensor& self, int64_t groups) {
// TODO: contiguous can be made to preserve the memory format
// of the input. However since the above reshape clobbers h and w
// it may not be safe to do that, since channels_last contiguous
// may think oc and and the last dim correspond to h,w?
// may think oc and the last dim correspond to h,w?
// It is not clear, however from initial looking around it feels that
// this may not be correct.
// In this case channels last will likely require custom implementation

View File

@ -1,3 +1,4 @@
#pragma once
#include <ATen/core/Tensor.h>
#include <ATen/Config.h>
#include <cstdint>

View File

@ -67,13 +67,13 @@ TORCH_PRECOMPUTE_META_FUNC(fractional_max_pool3d)(
int64_t inputH = input_.size(heightDim);
int64_t inputW = input_.size(widthDim);
TORCH_CHECK(outputT + poolSizeT - 1 < inputT,
TORCH_CHECK((poolSizeT <= inputT) && (outputT + poolSizeT - 1 < inputT),
"fractional_max_pool3d_out(): pool time ", poolSizeT,
" too large relative to input time ", inputT);
TORCH_CHECK(outputW + poolSizeW - 1 < inputW,
TORCH_CHECK((poolSizeW <= inputW) && (outputW + poolSizeW - 1 < inputW),
"fractional_max_pool3d_out(): pool width ", poolSizeW,
" too large relative to input width ", inputW);
TORCH_CHECK(outputH + poolSizeH - 1 < inputH,
TORCH_CHECK((poolSizeH <= inputH) && (outputH + poolSizeH - 1 < inputH),
"fractional_max_pool3d_out(): pool height ", poolSizeH,
" too large relative to input height ", inputH);

View File

@ -150,7 +150,7 @@ void histogramdd_prepare_out(const Tensor& input, const std::vector<int64_t>& bi
void histogramdd_prepare_out(const Tensor& input, TensorList bins,
const Tensor& hist, const TensorList& bin_edges) {
std::vector<int64_t> bin_ct(bins.size());
std::transform(bins.begin(), bins.end(), bin_ct.begin(), [](Tensor t) { return t.numel() - 1; });
std::transform(bins.begin(), bins.end(), bin_ct.begin(), [](const Tensor& t) { return t.numel() - 1; });
histogramdd_prepare_out(input, bin_ct, hist, bin_edges);
}

View File

@ -360,7 +360,7 @@ Tensor einsum(std::string_view equation, TensorList operands, at::OptionalIntArr
// to compute the number of dimensions covered by ellipsis.
for(const auto i : c10::irange(num_ops)) {
const auto& operand = operands[i];
const auto labels = op_labels[i];
const auto& labels = op_labels[i];
const auto ndims = operand.dim();
int64_t nlabels = static_cast<int64_t>(labels.size());
bool has_ellipsis = false;

View File

@ -237,7 +237,7 @@ TORCH_META_FUNC(linalg_vector_norm)(const Tensor& self, const Scalar& scalar_ord
at::detail::check_linalg_norm_dtype(opt_dtype, self.scalar_type(), "linalg.vector_norm");
auto mask = at::native::make_dim_mask(dim, self.dim());
auto shape = at::native::shape_from_dim_mask(self, std::move(mask), keepdim);
auto shape = at::native::shape_from_dim_mask(self, mask, keepdim);
auto options = self.options()
.dtype(toRealValueType(opt_dtype.value_or(self.scalar_type())));
@ -641,7 +641,7 @@ namespace {
Tensor linalg_matrix_power_impl(
const Tensor& self,
int64_t n,
std::optional<Tensor> _out) {
const std::optional<Tensor>& _out) {
NoTF32Guard disable_tf32;
auto out = _out.value_or(Tensor());
@ -1019,7 +1019,7 @@ Tensor multi_dot_impl(TensorList _tensors, std::optional<Tensor> _out) {
Tensor result;
if (_out.has_value()) {
auto out = *_out;
const auto& out = *_out;
TORCH_CHECK(
dtype == out.dtype(),
"multi_dot(): expected out tensor to have dtype ",

View File

@ -493,7 +493,7 @@ Tensor get_clamped_target_length(
// the gradient is implemented for _cudnn_ctc_loss (just in derivatives.yaml) and _ctc_loss and this function has automatic gradients
// it also handles the reduction if desired
template <typename LengthsType>
Tensor ctc_loss_impl(const Tensor& log_probs_, const Tensor& targets, LengthsType input_lengths, LengthsType target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) {
Tensor ctc_loss_impl(const Tensor& log_probs_, const Tensor& targets, const LengthsType& input_lengths, const LengthsType& target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) {
auto is_batched = log_probs_.dim() == 3;
Tensor log_probs = is_batched ? log_probs_ : log_probs_.unsqueeze(1);
bool use_cudnn =

View File

@ -599,7 +599,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, int64_t> _batch_norm_impl_index(
check_dims_match_num_input_features("weight", num_features, weight.sym_numel());
}
if (bias.defined()) {
check_dims_match_num_input_features("bias", std::move(num_features), bias.sym_numel());
check_dims_match_num_input_features("bias", num_features, bias.sym_numel());
}
BatchNormBackend backend = _select_batch_norm_backend(input, weight, bias, running_mean, running_var, training, eps);
@ -923,7 +923,7 @@ std::tuple<Tensor, Tensor, Tensor> _batch_norm_legit_no_stats_cpu(
std::tuple<Tensor, Tensor, Tensor> _batch_norm_legit_no_training(
const Tensor& self, const std::optional<Tensor>& weight_opt, const std::optional<Tensor>& bias_opt,
const Tensor& running_mean, const Tensor& running_var, double momentum, double eps) {
return at::_native_batch_norm_legit(self, weight_opt, bias_opt, const_cast<Tensor&>(running_mean), const_cast<Tensor&>(running_var), /*train=*/false, momentum, eps);
return at::_native_batch_norm_legit(self, weight_opt, bias_opt, const_cast<Tensor&>(running_mean), const_cast<Tensor&>(running_var), /*training=*/false, momentum, eps);
}

View File

@ -73,7 +73,7 @@ Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value)
for (const auto i : c10::irange((size_t)l_pad)) {
auto pad_idx = pad.size() - ((i + 1) * 2);
auto new_dim = input_sizes[l_diff + i] + pad[pad_idx] + pad[pad_idx + 1];
TORCH_CHECK(new_dim > 0, "The input size ", input_sizes[l_diff + i], ", plus negative padding ",
TORCH_CHECK(new_dim >= 0, "The input size ", input_sizes[l_diff + i], ", plus negative padding ",
pad[pad_idx], " and ", pad[pad_idx + 1], " resulted in a negative output size, "
"which is invalid. Check dimension ", l_diff + i, " of your input.");
new_shape.emplace_back(new_dim);
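
A brief worked example of why the check was relaxed from new_dim > 0 to new_dim >= 0 (numbers are hypothetical):

// input size 5, pad = {-2, -3}  ->  new_dim = 5 - 2 - 3 = 0   // now allowed (empty output)
// input size 5, pad = {-3, -3}  ->  new_dim = 5 - 3 - 3 = -1  // still rejected by new_dim >= 0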

View File

@ -1533,7 +1533,7 @@ std::tuple<Tensor, Tensor> lstm_cell(
check_rnn_cell_forward_input(input, w_ih.sym_size(1));
auto hidden_size = w_hh.sym_size(1);
check_rnn_cell_forward_hidden(input, hx[0], hidden_size, 0);
check_rnn_cell_forward_hidden(input, hx[1], std::move(hidden_size), 1);
check_rnn_cell_forward_hidden(input, hx[1], hidden_size, 1);
static at::Tensor undefined;
return LSTMCell<CellParams>{}(input, std::make_tuple(hx[0], hx[1]), CellParams{w_ih, w_hh, b_ih, b_hh, undefined});
}
@ -1612,13 +1612,13 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor> _thnn_differentiable_gru_cell
h_g = h_g + hidden_bias;
}
auto chunked_input_gates = in_g.unsafe_chunk(3, 1);
Tensor ir = chunked_input_gates[0];
Tensor ii = chunked_input_gates[1];
Tensor in = chunked_input_gates[2];
const Tensor& ir = chunked_input_gates[0];
const Tensor& ii = chunked_input_gates[1];
const Tensor& in = chunked_input_gates[2];
auto chunked_hidden_gates = h_g.unsafe_chunk(3, 1);
Tensor hr = chunked_hidden_gates[0];
Tensor hi = chunked_hidden_gates[1];
Tensor hn = chunked_hidden_gates[2];
const Tensor& hr = chunked_hidden_gates[0];
const Tensor& hi = chunked_hidden_gates[1];
const Tensor& hn = chunked_hidden_gates[2];
Tensor rg = (ir + hr).sigmoid();
Tensor ig = (ii + hi).sigmoid();
Tensor grad_hx = grad_hy * ig;

View File

@ -52,6 +52,7 @@ void apply_triu_tril_single(
int64_t self_col_stride,
bool upper) {
constexpr int64_t zero = 0;
k = std::clamp(k, -n, m); // Clamp k to [-n, m] to prevent i + k arithmetic overflow, especially if k approaches INT64_MAX/INT64_MIN.
if (upper) {
parallel_for(0, n, 0, [&](int64_t start, int64_t end) {

View File

@ -409,17 +409,17 @@ static inline Tensor& unary_op_impl_out(Tensor& result, const Tensor& self, Stub
}
template <typename Stub, typename ...Args>
static inline Tensor& unary_op_impl_float_out(Tensor& result, const Tensor& self, Stub& stub, Args... args) {
static inline Tensor& unary_op_impl_float_out(Tensor& result, const Tensor& self, Stub& stub, Args&&... args) {
auto iter = TensorIterator::unary_float_op(result, self);
stub(iter.device_type(), iter, args...);
stub(iter.device_type(), iter, std::forward<Args>(args)...);
return result;
}
template <typename Stub, typename ...Args>
static inline Tensor unary_op_impl_float(const Tensor& self, Stub& stub, Args... args) {
static inline Tensor unary_op_impl_float(const Tensor& self, Stub& stub, Args&&... args) {
Tensor result;
auto iter = TensorIterator::unary_float_op(result, self);
stub(iter.device_type(), iter, args...);
stub(iter.device_type(), iter, std::forward<Args>(args)...);
return iter.output();
}
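
The switch from Args... to Args&&... with std::forward is the standard perfect-forwarding idiom; a minimal standalone sketch (illustrative names only):

// Forwarding preserves value category, so an rvalue argument passed to the
// stub is moved rather than copied.
#include <utility>

template <typename Stub, typename... Args>
void call_stub(Stub& stub, Args&&... args) {
  stub(std::forward<Args>(args)...);  // lvalues stay lvalues, rvalues stay rvalues
}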

View File

@ -323,7 +323,7 @@ std::tuple<Tensor, Tensor, Tensor> unique_consecutive_cpu_template(
template<class ForwardIt>
ForwardIt _unique_dim_cpu_impl(ForwardIt first, ForwardIt last,
std::vector<int64_t>& indices, Tensor inverse_indices_vec, Tensor counts) {
std::vector<int64_t>& indices, const Tensor& inverse_indices_vec, const Tensor& counts) {
if (first == last) {
return last;
}

View File

@ -24,7 +24,7 @@ constexpr int64_t num_output_channels_index [[maybe_unused]] = 10;
constexpr int64_t num_input_channels_index [[maybe_unused]] = 11;
template <typename TENSOR_DTYPE, typename VEC_DTYPE>
std::vector<VEC_DTYPE> unwrap_vector(at::Tensor tensor) {
std::vector<VEC_DTYPE> unwrap_vector(const at::Tensor& tensor) {
std::vector<VEC_DTYPE> vec(tensor.numel());
TENSOR_DTYPE* tensor_data_ptr = tensor.data_ptr<TENSOR_DTYPE>();
std::copy(tensor_data_ptr, tensor_data_ptr + tensor.numel(), vec.data());
@ -39,7 +39,7 @@ std::vector<VEC_DTYPE> unwrap_vector(at::Tensor tensor) {
*/
void unpack_bcsr(
int8_t* dst,
ao::sparse::BCSR bcsr,
const ao::sparse::BCSR& bcsr,
const int64_t R,
const int64_t C,
const int64_t RB,

View File

@ -85,11 +85,11 @@ void cpu_max_unpool(
if constexpr (is_3d) {
TORCH_CHECK(false, "Found an invalid max index: ", optional_error_index.value(),
" (output volumes are of size ", output_depth,
"x", output_height, "x", output_width);
"x", output_height, "x", output_width, ")");
} else {
TORCH_CHECK(false, "Found an invalid max index: ", optional_error_index.value(),
" (output volumes are of size ", output_height,
"x", output_width);
"x", output_width, ")");
}
}

View File

@ -1138,9 +1138,14 @@ bool is_blockwise_1x16_scaling(const at::Tensor& t, const at::Tensor& scale) {
bool is_blockwise_1x32_scaling(const at::Tensor& t, const at::Tensor& scale) {
// TODO: We might want to enforce some structure on the shapes of the scale
// tensors
return (isFloat8Type(t.scalar_type()) && scale.scalar_type() == at::kFloat8_e8m0fnu
&& scale.numel() == round_up<int64_t>(t.size(0), 128) * round_up<int64_t>(ceil_div<int64_t>(t.size(1), 32), 4)
&& scale.is_contiguous());
bool is_fp8_path = (isFloat8Type(t.scalar_type()) && scale.scalar_type() == at::kFloat8_e8m0fnu
&& scale.numel() == round_up<int64_t>(t.size(0), 128) * round_up<int64_t>(ceil_div<int64_t>(t.size(1), 32), 4));
bool is_packed_fp4_path = false;
#ifdef USE_ROCM
is_packed_fp4_path = (t.scalar_type() == ScalarType::Float4_e2m1fn_x2 && scale.scalar_type() == at::kFloat8_e8m0fnu
&& scale.numel() == round_up<int64_t>(t.size(0), 128) * round_up<int64_t>(ceil_div<int64_t>(t.size(1) * 2, 32), 4));
#endif
return (is_fp8_path || is_packed_fp4_path) && scale.is_contiguous();
}
bool is_blockwise_1x128_scaling(const at::Tensor& t, const at::Tensor& scale) {
@ -1381,9 +1386,15 @@ _scaled_mm_out_cuda(const Tensor& mat1, const Tensor& mat2,
TORCH_CHECK(at::detail::getCUDAHooks().isGPUArch({"gfx950"}),
"Block-wise scaling for Float8_e8m0fnu is only supported on gfx950");
TORCH_CHECK(mat1.size(0) % 32 == 0 && mat1.size(1) % 32 == 0 &&
mat2.size(0) % 32 == 0 && mat2.size(1) % 32 == 0,
"Matrix dimensions must be multiples of 32 for block-wise scaling");
int packed_factor = 1;
if (mat1.scalar_type() == ScalarType::Float4_e2m1fn_x2) {
// For float4 data type, each byte stores two 4-bit floating-point values,
// effectively packing two elements into one byte.
packed_factor = 2;
}
TORCH_CHECK(mat1.size(0) % 16 == 0 && (mat1.size(1) * packed_factor) % 128 == 0 &&
mat2.size(1) % 16 == 0,
"M, N must be multiples of 16 and K must be multiple of 128 for block-wise scaling");
TORCH_CHECK(out.scalar_type() == ScalarType::BFloat16 ||
out.scalar_type() == ScalarType::Half,

View File

@ -416,6 +416,7 @@ struct ReduceOp {
if (config.should_block_y_reduce()) {
value = block_y_reduce<output_vec_size>(value, shared_memory);
}
__syncthreads();
if (config.should_block_x_reduce()) {
value = block_x_reduce<output_vec_size>(value, shared_memory);
}

View File

@ -17,12 +17,11 @@ __global__ static void compute_cuda_kernel(
index_t* result_ptr,
int64_t size,
int64_t result_size) {
if (C10_UNLIKELY((result_size != cumsum_ptr[size - 1]))) {
printf("%s:%d:%s: block: [%d,%d,%d], thread: [%d,%d,%d] "
CUDA_KERNEL_ASSERT_PRINTF(
result_size == cumsum_ptr[size - 1],
"Invalid input! In `repeat_interleave`, the `output_size` argument (%ld) must be the same as the sum of the elements in the `repeats` tensor (%ld).\n",
__FILE__, __LINE__, __func__,blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, result_size, cumsum_ptr[size - 1 ]);
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1])
}
result_size,
cumsum_ptr[size - 1]);
int64_t idx = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;
int64_t stride = (blockDim.x * gridDim.x) / C10_WARP_SIZE;

View File

@ -226,6 +226,38 @@ __global__ void CatArrayBatchedCopy_contig(
}
}
template <typename T, typename IndexType, int Dims, int batch_size, int stride_size, int alignment, int elems_per_vec>
__global__ void CatArrayBatchedCopy_vectorized(
char* output,
CatArrInputTensorMetadata<T, IndexType, batch_size, stride_size> inputs,
TensorSizeStride<IndexType, CAT_ARRAY_MAX_INPUT_DIMS> os,
const int concatDim,
IndexType trailingSize) {
IndexType tid = blockIdx.x * blockDim.x + threadIdx.x;
IndexType nElements = inputs.nElements[blockIdx.y] / elems_per_vec;
if(tid >= nElements) return;
const char * data = (char*)inputs.input[blockIdx.y];
IndexType offset = inputs.offset[blockIdx.y] * trailingSize / elems_per_vec;
IndexType dimSize = inputs.dimSize[blockIdx.y] * trailingSize / elems_per_vec;
int64_t dataOffset = (int64_t)offset * alignment; // in bytes
IndexType stride = gridDim.x * blockDim.x;
while( tid < nElements){
int64_t elementOffset = (int64_t)CatArrIndexToOffset<IndexType, Dims>::compute(
os.tensorSize, os.tensorStride, dimSize, concatDim, tid) * alignment; // in bytes
auto vec = at::native::memory::ld_vec<alignment>(data + (int64_t)alignment * tid);
at::native::memory::st_vec<alignment>(output + dataOffset + elementOffset, vec);
tid += stride;
}
}
/*
Specialized implementation of the CatArrayBatchedCopy written to generate wide memory loads
to improve memory bandwidth throughput.
@ -296,12 +328,27 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
scalar_t *data = (scalar_t *)(out.mutable_data_ptr());
CatArrInputTensorMetadata<scalar_t, unsigned int, batch_size, stride_size> catMetaData;
TensorSizeStride<unsigned int, CAT_ARRAY_MAX_INPUT_DIMS> outputParam;
// If all batches are contiguous we can call a specialized implementation
// which requires the input tensor addresses to be aligned to a
// 16 Byte boundary.
constexpr bool isContig = stride_size == 1;
bool isAligned = true;
constexpr int alignment = 16;
// Next, let's initialize the size, stride arrays for the output Tensor.
// for contig case, we'll canonicalize output strides, so that
// we don't have arbitrary strides for dims of size 0
size_t stride0 = 1;
if (memory_format == c10::MemoryFormat::Contiguous) {
for (int i = 0; i < nDims; ++i) {
for (int i = nDims - 1; i >= 0; --i) {
outputParam.tensorSize[i] = out.size(i);
outputParam.tensorStride[i] = out.stride(i);
if (isContig) {
outputParam.tensorStride[i] = stride0;
stride0 *= out.size(i);
} else {
outputParam.tensorStride[i] = out.stride(i);
}
}
} else if (memory_format == c10::MemoryFormat::ChannelsLast || memory_format == c10::MemoryFormat::ChannelsLast3d) {
// permute the semantics of dims from NCHW to NHWC so that the input
@ -320,12 +367,15 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
// If all batches are contiguous we can call a specialized implementation
// which requires the input tensor addresses to be aligned to a
// 16 Byte boundary.
bool isContig = true;
bool isAligned = true;
// for channels last computing slice size correctly is much more involved, so we never send it
// on the fully vectorized path
// we need output stride in cat dimension to be a multiple of alignment,
// if we ever use it to compute offsets
// for catting in 0th dimension it doesn't matter
bool isInOutAligned = isContig && at::native::memory::get_alignment(data) >= alignment &&
memory_format == c10::MemoryFormat::Contiguous && (dimension == 0 ||
outputParam.tensorStride[dimension - 1] * sizeof(scalar_t) % alignment == 0);
unsigned int max_elements_per_tensor = 0;
// Now we loop
@ -341,6 +391,16 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
// high-dimensional tensor
if (inputs[i+batchCounter].get().numel() > 0) {
dimSize = inputs[i+batchCounter].get().size(dimension);
if (isInOutAligned) {
auto t = inputs[i+batchCounter].get();
// similarly to output stride, we cannot trust stride value to
// determine slice size if the corresponding dimension is 1
// we have to multiply all the subsequent sizes
int64_t slice_size = dimension == 0 ? t.numel() : t.sizes()[dimension - 1] != 1 ?
t.strides()[dimension - 1] : c10::multiply_integers(t.sizes().begin() + dimension, t.sizes().end());
slice_size *= sizeof(scalar_t);
isInOutAligned &= (slice_size % alignment == 0);
}
}
catMetaData.input[batchCounter] = (scalar_t*)(inputs[i+batchCounter].get().const_data_ptr());
@ -351,10 +411,12 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
#ifdef USE_ROCM
// On ROCm, CatArrayBatchedCopy_contig is faster
isAligned = false;
isInOutAligned = false;
#else
// If at least one of the inputs is not aligned, we can't call the
// CatArrayBatchedCopy_alignedK_contig
isAligned &= is_aligned_vec4(catMetaData.input[batchCounter]);
isInOutAligned &= at::native::memory::get_alignment(catMetaData.input[batchCounter]) >= alignment;
#endif
if (stride_size > 1) {
@ -365,7 +427,6 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
catMetaData.tensorStride[batchCounter].tensorStride[j] = strides[j];
}
catMetaData.isContiguous[batchCounter] = false;
isContig = false;
} else {
catMetaData.isContiguous[batchCounter] = true;
}
@ -388,10 +449,13 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
max_elements_per_tensor, batchCounter);
#else
dim3 applyBlock, catGrid;
if (isContig && sizeof(scalar_t) > 2) {
if (isInOutAligned) {
std::tie(catGrid, applyBlock) = getCatGridContig<scalar_t, alignment>(
max_elements_per_tensor, batchCounter);
} else if (isContig && isAligned && sizeof(scalar_t) > 2) {
std::tie(catGrid, applyBlock) = getCatGridContig<scalar_t, ALIGNED_VEC_LOAD_BYTES_16>(
max_elements_per_tensor, batchCounter);
} else if (isContig && sizeof(scalar_t) == 2) {
} else if (isContig && isAligned && sizeof(scalar_t) == 2) {
std::tie(catGrid, applyBlock) = getCatGridContig<scalar_t, ALIGNED_VEC_LOAD_BYTES_8>(
max_elements_per_tensor, batchCounter);
} else {
@ -399,6 +463,30 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
getCatGrid(batchCounter, catGrid);
}
#endif
int32_t trailingSize;
TensorSizeStride<unsigned int, CAT_ARRAY_MAX_INPUT_DIMS> kernelOutputParam;
if (isInOutAligned) {
// in this case we can and should flatten the tensors after the cat dim
// we want to view the tensors as if consisting of `alignment`-sized elements
// however, we might not be able to cleanly divide just the last dim -
// it might not be a multiple of alignment.
// however, we know that the full concatenated slice is a multiple of alignment,
// so if we flatten all the dims after and including concat dim,
// it will be divisible by alignment
// then we need to divide last out size by elems_per_vec,
// and divide all strides except last by elems_per_vec (last stride is 1 always)
// for input, we will fix up the sizes and strides in the kernel directly
kernelOutputParam = outputParam;
nDims = dimension + 1;
constexpr auto elems_per_vec = alignment / sizeof(scalar_t);
auto out_size = dimension == 0 ? out.numel() : kernelOutputParam.tensorStride[dimension-1];
kernelOutputParam.tensorSize[dimension] = out_size / elems_per_vec;
trailingSize = outputParam.tensorStride[dimension];
kernelOutputParam.tensorStride[dimension] = 1;
for (int i = 0; i < dimension; ++i) {
kernelOutputParam.tensorStride[i] /= elems_per_vec;
}
}
if (memory_format != c10::MemoryFormat::Contiguous) {
switch (dimension) {
@ -413,7 +501,12 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i
}
// Template Declarations for dim = 1, 2, 3, 4
#define HANDLE_CASE(DIMS) \
if (isContig && isAligned && sizeof(scalar_t) > 2 && sizeof(scalar_t) <= 8) {\
if (isInOutAligned) {\
constexpr auto elems_per_vec = alignment / sizeof(scalar_t); \
CatArrayBatchedCopy_vectorized<scalar_t, unsigned int, DIMS, batch_size, stride_size, alignment, elems_per_vec><<<\
catGrid, applyBlock, 0, stream.stream()>>>(\
(char*)data, catMetaData, kernelOutputParam, dimension, trailingSize);\
} else if (isContig && isAligned && sizeof(scalar_t) > 2 && sizeof(scalar_t) <= 8) {\
CatArrayBatchedCopy_alignedK_contig<scalar_t, unsigned int, DIMS, batch_size, stride_size, ALIGNED_VEC_LOAD_BYTES_16><<<\
catGrid, applyBlock, 0, stream.stream()>>>(\
data, catMetaData, outputParam, dimension, outputParam.tensorStride[dimension]);\
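
A short sketch of the vector-width arithmetic behind the new path: with 16-byte loads, elems_per_vec = alignment / sizeof(scalar_t), and the flattened cat slice must be a multiple of the alignment for isInOutAligned to hold. The shapes below are hypothetical.

// With float (4 bytes) and 16-byte alignment, each vector load covers 4 elements.
constexpr int alignment = 16;
constexpr int elems_per_vec = alignment / sizeof(float);  // 4

// For an output of shape [8, 64] concatenated along dim = 1, the kernel views the
// flattened cat slice (64 floats = 256 bytes) as 64 / elems_per_vec = 16 vectors
// per row; 256 bytes is a multiple of 16, so isInOutAligned can hold.
static_assert(elems_per_vec == 4 && (64 * sizeof(float)) % alignment == 0,
              "per-row slice of 64 floats is vector-aligned");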

View File

@ -5,12 +5,20 @@
namespace at::native {
__global__ void weight_int8pack_mm_kernel(const float* x, const int8_t* w, const float* scale, float* out, int B, int K, int N) {
__global__ void weight_int8pack_mm_kernel(
const float* x,
const int8_t* w,
const float* scale,
float* out,
int B,
int K,
int N) {
// one thread per output element: [B, N]
int b = blockIdx.y * blockDim.y + threadIdx.y;
int n = blockIdx.x * blockDim.x + threadIdx.x;
if (b >= B || n >= N) return;
if (b >= B || n >= N)
return;
float acc = 0.0f;
for (int k = 0; k < K; ++k) {
@ -20,7 +28,11 @@ __global__ void weight_int8pack_mm_kernel(const float* x, const int8_t* w, const
out[b * N + n] = acc * scale[n];
}
void launch_weight_int8pack_mm_cuda_kernel(const Tensor& x, const Tensor& w_int8, const Tensor& scale, Tensor& out) {
void launch_weight_int8pack_mm_cuda_kernel(
const Tensor& x,
const Tensor& w_int8,
const Tensor& scale,
Tensor& out) {
const int B = x.size(0);
const int K = x.size(1);
const int N = w_int8.size(0);
@ -35,12 +47,16 @@ void launch_weight_int8pack_mm_cuda_kernel(const Tensor& x, const Tensor& w_int8
w_int8.data_ptr<int8_t>(),
scale.data_ptr<float>(),
out.data_ptr<float>(),
B, K, N);
B,
K,
N);
}
// Main GPU entry point
at::Tensor _weight_int8pack_mm_cuda(const at::Tensor& x, const at::Tensor& w_int8, const at::Tensor& scale) {
at::Tensor _weight_int8pack_mm_cuda(
const at::Tensor& x,
const at::Tensor& w_int8,
const at::Tensor& scale) {
// --- Check inputs ---
TORCH_CHECK(x.is_cuda(), "x must be a CUDA tensor");
TORCH_CHECK(w_int8.is_cuda(), "w must be a CUDA tensor");
@ -50,12 +66,16 @@ at::Tensor _weight_int8pack_mm_cuda(const at::Tensor& x, const at::Tensor& w_int
TORCH_CHECK(w_int8.dim() == 2, "w must be 2D");
TORCH_CHECK(scale.dim() == 1, "scale must be 1D");
TORCH_CHECK(x.size(1) == w_int8.size(1), "K dimension mismatch: x.size(1) != w.size(1)");
TORCH_CHECK(w_int8.size(0) == scale.size(0), "Output dim mismatch: w.size(0) != scale.size(0)");
TORCH_CHECK(
x.size(1) == w_int8.size(1),
"K dimension mismatch: x.size(1) != w.size(1)");
TORCH_CHECK(
w_int8.size(0) == scale.size(0),
"Output dim mismatch: w.size(0) != scale.size(0)");
// --- Determine shapes ---
auto B = x.size(0); // batch size
auto N = w_int8.size(0); // output dim
auto B = x.size(0); // batch size
auto N = w_int8.size(0); // output dim
// Ensure inputs are in the correct types for the kernel
auto x_f32 = x.to(at::kFloat);
@ -63,12 +83,13 @@ at::Tensor _weight_int8pack_mm_cuda(const at::Tensor& x, const at::Tensor& w_int
auto scale_f32 = scale.to(at::kFloat);
// --- Allocate output ---
auto out = at::empty({B, N}, x.options().dtype(at::kFloat));
auto out = at::empty({B, N}, x_f32.options());
// --- Launch kernel ---
launch_weight_int8pack_mm_cuda_kernel(x_f32, w_int8_contiguous, scale_f32, out);
launch_weight_int8pack_mm_cuda_kernel(
x_f32, w_int8_contiguous, scale_f32, out);
return out;
return out.to(x.dtype());
}
} // namespace at::native
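
For reference, the arithmetic computed by weight_int8pack_mm_kernel, written as a plain C++ loop that can be used to sanity-check the CUDA output: out[b][n] = scale[n] * sum_k x[b][k] * w[n][k]. This is an illustrative sketch with assumed row-major raw pointers, not the shipped CPU kernel.

#include <cstdint>

void weight_int8pack_mm_ref(const float* x, const int8_t* w, const float* scale,
                            float* out, int B, int K, int N) {
  for (int b = 0; b < B; ++b) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += x[b * K + k] * static_cast<float>(w[n * K + k]);
      }
      out[b * N + n] = acc * scale[n];
    }
  }
}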

View File

@ -482,7 +482,9 @@ auto build_graph(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA")
.set_generate_stats(return_softmaxstats)
.set_is_inference(return_softmaxstats == false)
// TODO(eqy): switch to this API once cuDNN FE is upgraded
// .set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale);
if (use_ragged_in_dense(q, k, v, o, attn_bias.has_value())) {
@ -702,7 +704,9 @@ auto build_graph_nestedtensor(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA_NESTEDTENSOR")
.set_generate_stats(return_softmaxstats)
.set_is_inference(return_softmaxstats == false)
// TODO(eqy): switch to this API once cuDNN FE is upgraded
// .set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale)
.set_seq_len_q(SEQ_LEN_Q_)

View File

@ -2,6 +2,7 @@
#include <ATen/core/Tensor.h>
#include <ATen/TensorUtils.h>
#include <ATen/div_rtn.h>
#include <c10/util/safe_numerics.h>
namespace at::native {
@ -54,6 +55,14 @@ inline void col2im_shape_check(
int64_t batch_dim = (ndim == 3) ? 0 : -1;
int64_t n_input_plane = input.size(batch_dim + 1);
uint64_t prod_kernel_size = 1;
TORCH_CHECK(!c10::mul_overflows(static_cast<uint64_t>(kernel_width), static_cast<uint64_t>(kernel_height), &prod_kernel_size),
"Given kernel_width = ",
kernel_width,
" and kernel_height = ",
kernel_height,
" the product of kernel_width and kernel_height overflowed.");
if (n_input_plane % (kernel_width * kernel_height) != 0) {
TORCH_CHECK(false,

View File

@ -35,7 +35,7 @@ C10_ALWAYS_INLINE void _check_rms_norm_inputs_symint(
std::stringstream ss;
ss << "Given normalized_shape=" << normalized_shape
<< ", expected input with shape [*";
for (auto size : normalized_shape) {
for (const auto& size : normalized_shape) {
ss << ", " << size;
}
ss << "], but got input of size" << input_shape;

View File

@ -1770,10 +1770,12 @@ std::tuple<at::Tensor, at::Tensor, at::Tensor> miopen_depthwise_convolution_back
// fusions
// ---------------------------------------------------------------------
void raw_miopen_convolution_relu_out(
void raw_miopen_convolution_add_relu_out(
const Tensor& output,
const Tensor& input,
const Tensor& weight,
const Tensor& z,
float alpha,
const Tensor& bias,
IntArrayRef stride,
IntArrayRef padding,
@ -1781,68 +1783,20 @@ void raw_miopen_convolution_relu_out(
int64_t groups,
bool benchmark,
bool deterministic) {
auto dataType = getMiopenDataType(input);
miopenConvolutionMode_t c_mode = miopenConvolution;
ConvolutionArgs args{ input, output, weight };
args.handle = getMiopenHandle();
at::MemoryFormat memory_format = miopen_conv_suggest_memory_format(input, weight);
setConvolutionParams(
&args.params,
args.handle,
raw_miopen_convolution_forward_out(
output,
input,
weight,
padding,
stride,
dilation,
groups,
deterministic,
memory_format);
args.idesc.set(input, memory_format);
args.wdesc.set(weight, memory_format, 0);
args.odesc.set(output, memory_format);
args.cdesc.set(
dataType,
c_mode,
input.dim() - 2,
args.params.padding,
args.params.stride,
args.params.dilation,
args.params.groups,
benchmark,
deterministic);
TensorDescriptor bdesc;
bdesc.set(bias.expand({1, bias.size(0)}), output.dim());
// Create the fusion plan
miopenFusionPlanDescriptor_t fusePlanDesc;
miopenFusionOpDescriptor_t convoOp;
miopenFusionOpDescriptor_t biasOp;
miopenFusionOpDescriptor_t activOp;
MIOPEN_CHECK(miopenCreateFusionPlan(&fusePlanDesc, miopenVerticalFusion, args.idesc.desc()));
MIOPEN_CHECK(miopenCreateOpConvForward(fusePlanDesc, &convoOp, args.cdesc.desc(), args.wdesc.desc()));
MIOPEN_CHECK(miopenCreateOpBiasForward(fusePlanDesc, &biasOp, bdesc.desc()));
MIOPEN_CHECK(miopenCreateOpActivationForward(fusePlanDesc, &activOp, miopenActivationRELU));
// compile fusion plan
MIOPEN_CHECK(miopenCompileFusionPlan(args.handle, fusePlanDesc));
// Set the Args
float alpha = static_cast<float>(1);
float beta = static_cast<float>(0);
float activ_alpha = static_cast<float>(0);
float activ_beta = static_cast<float>(0);
float activ_gamma = static_cast<float>(0);
miopenOperatorArgs_t fusionArgs;
MIOPEN_CHECK(miopenCreateOperatorArgs(&fusionArgs));
MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.const_data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.const_data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsActivForward(fusionArgs, activOp, &alpha, &beta, activ_alpha, activ_beta, activ_gamma));
miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.const_data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs);
// Cleanup
miopenDestroyFusionPlan(fusePlanDesc);
at::Tensor alpha_mul_z_add_bias =
at::native::reshape_bias(input.dim(), bias).add(z, alpha);
output.add_(alpha_mul_z_add_bias);
output.relu_();
}
static at::Tensor self_or_new_memory_format(at::Tensor& self, at::MemoryFormat memory_format) {
@ -1855,171 +1809,107 @@ static at::Tensor self_or_new_memory_format(at::Tensor& self, at::MemoryFormat m
Tensor miopen_convolution_add_relu(
const Tensor& input_t,
const Tensor& weight_t,
const Tensor& z,
const Tensor& z_t,
const std::optional<Scalar>& alpha,
const std::optional<Tensor>& bias,
const std::optional<Tensor>& bias_t,
IntArrayRef stride,
IntArrayRef padding,
IntArrayRef dilation,
int64_t groups) {
// MIOpen does not support fusing the add (the alpha2 * z step) of the cuDNN function below:
// y = act ( alpha1 * conv(x) + alpha2 * z + bias )
auto memory_format = miopen_conv_suggest_memory_format(input_t, weight_t);
const Tensor input = input_t.contiguous(memory_format);
const Tensor weight = weight_t.contiguous(memory_format);
Tensor z = z_t;
if (z.suggest_memory_format() != memory_format) {
z = z.to(memory_format);
}
z = z.contiguous(memory_format);
// FuseFrozenConvAddRelu performs some tensor shape checking
Tensor output_t = at::detail::empty_cuda(
conv_output_size(
input.sizes(), weight.sizes(), padding, stride, dilation),
input.options().memory_format(memory_format));
if (output_t.numel() == 0) {
return output_t;
}
auto& ctx = at::globalContext();
bool benchmark = ctx.benchmarkCuDNN();
auto _alpha = alpha.has_value() ? alpha.value().to<float>() : 1.0;
auto _bias = bias_t.has_value()
? bias_t.value()
: at::zeros(
{output_t.size(1)},
optTypeMetaToScalarType(output_t.options().dtype_opt()),
output_t.options().layout_opt(),
output_t.options().device_opt(),
output_t.options().pinned_memory_opt());
TensorArg input { input_t, "input", 1 },
weight { weight_t, "weight", 2 };
Tensor output_t = at::detail::empty_cuda(
conv_output_size(
input_t.sizes(), weight_t.sizes(), padding, stride, dilation),
input_t.options().memory_format(memory_format));
if (output_t.numel() == 0){
return output_t;
}
// Avoid ambiguity of "output" when this is being used as backwards
TensorArg output{output_t, "result", 0};
miopen_convolution_forward_out(
output,
"miopen_convolution_add_relu",
raw_miopen_convolution_add_relu_out(
output_t,
input,
weight,
padding,
z,
_alpha,
_bias,
stride,
padding,
dilation,
groups,
benchmark,
false // deterministic
);
true); // deterministic
auto contig_output_t = self_or_new_memory_format(output_t, memory_format);
if (!output_t.is_same(contig_output_t)) {
contig_output_t.copy_(output_t);
}
auto _alpha = alpha.has_value() ? alpha.value().to<float>() : 1.0;
auto _bias = bias.has_value()
? bias.value()
: at::zeros(
{contig_output_t.size(1)},
optTypeMetaToScalarType(contig_output_t.options().dtype_opt()),
contig_output_t.options().layout_opt(),
contig_output_t.options().device_opt(),
contig_output_t.options().pinned_memory_opt());
at::Tensor alpha_mul_z_add_bias = at::native::reshape_bias(input_t.dim(), _bias).add(z, _alpha);
contig_output_t.add_(alpha_mul_z_add_bias);
contig_output_t.relu_();
return contig_output_t;
return output_t;
}
Tensor miopen_convolution_relu(
const Tensor& input_t,
const Tensor& weight_t,
const std::optional<Tensor>& bias_t,
IntArrayRef stride,
IntArrayRef padding,
IntArrayRef dilation,
int64_t groups) {
auto memory_format = miopen_conv_suggest_memory_format(input_t, weight_t);
const Tensor input = input_t.contiguous(memory_format);
const Tensor weight = weight_t.contiguous(memory_format);
// FuseFrozenConvAddRelu performs some tensor shape checking
Tensor output_t = at::detail::empty_cuda(
conv_output_size(
input.sizes(), weight.sizes(), padding, stride, dilation),
input.options().memory_format(memory_format));
if (output_t.numel() == 0) {
return output_t;
}
auto& ctx = at::globalContext();
bool benchmark = ctx.benchmarkCuDNN();
auto _bias = bias_t.has_value()
? bias_t.value()
: at::zeros(
{output_t.size(1)},
optTypeMetaToScalarType(output_t.options().dtype_opt()),
output_t.options().layout_opt(),
output_t.options().device_opt(),
output_t.options().pinned_memory_opt());
raw_miopen_convolution_add_relu_out(
output_t,
input,
weight,
output_t, // use output_t as z to satisfy MIOpen API
0, // alpha
_bias,
stride,
padding,
dilation,
groups,
benchmark, // benchmark
true); // deterministic
return output_t;
}
REGISTER_CUDA_DISPATCH(miopen_convolution_backward_stub, &miopen_convolution_backward)
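For reference, the fused kernel above computes the same result as an unfused convolution followed by a residual add and a ReLU, i.e. y = relu(conv(x, w) + bias + alpha * z). Below is a minimal ATen-level sketch of that unfused semantics; the tensor names are hypothetical and this is an illustration, not the MIOpen path itself.

#include <ATen/ATen.h>

// Unfused reference for conv_add_relu: y = relu(conv2d(x, w, b) + alpha * z).
// Illustrative only; the code above lets MIOpen fuse these steps.
at::Tensor conv_add_relu_reference(
    const at::Tensor& x, const at::Tensor& w, const at::Tensor& b,
    const at::Tensor& z, double alpha,
    at::IntArrayRef stride, at::IntArrayRef padding,
    at::IntArrayRef dilation, int64_t groups) {
  auto y = at::conv2d(x, w, b, stride, padding, dilation, groups);
  return at::relu(y.add(z, alpha));
}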

View File

@ -559,4 +559,60 @@ Tensor _int_mm_xpu(const Tensor& self, const Tensor& mat2) {
at::empty({self.size(0), mat2.size(1)}, self.options().dtype(at::kInt));
return _int_mm_out_xpu(self, mat2, result);
}
Tensor _weight_int8pack_mm_xpu(
const Tensor& A,
const Tensor& B,
const Tensor& scales) {
auto M = A.size(0);
auto N = B.size(0);
auto K = A.size(1);
TORCH_CHECK(
A.dtype() == kBFloat16 || A.dtype() == kHalf || A.dtype() == kFloat,
" : expect A to be either 32-bit or 16-bit float tensor.");
TORCH_CHECK(A.dim() == 2, __func__, " : expect A to be 2D tensor.");
TORCH_CHECK(
A.stride(1) == 1, " : A must be contiguous on the last dimension.");
TORCH_CHECK(B.dtype() == kChar, " : expect B to be int8 tensor.");
TORCH_CHECK(B.is_contiguous(), " : expect B to be contiguous.");
TORCH_CHECK(B.size(1) == K, " : expect B.size(1) == ", K);
TORCH_CHECK(
scales.dim() == 1 && scales.size(0) == N,
" : expect scales to be 1d tensor with size ",
N);
auto C = at::empty({M, N}, A.options());
// --- Launch kernel ---
Tensor bias = at::Tensor();
Tensor mat2_zero_points = at::Tensor();
Tensor non_const_scales = scales;
auto post_op_args = torch::List<std::optional<at::Scalar>>();
at::native::onednn::quantized_matmul(
A.contiguous(),
1.0,
0,
B,
non_const_scales,
mat2_zero_points,
bias,
C,
1.0,
0,
C.scalar_type(),
/*other*/ std::nullopt,
/*other scale*/ 1.0,
/*other zp*/ 0,
/*binary post op*/ "none",
/*binary alpha*/ 1.0,
/*post_op_name*/ "none",
post_op_args,
/*post_op_algorithm*/ "none",
/*m2_trans*/ false);
return C;
}
} // namespace at::native
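Going by the checks above, the contract is: A is [M, K] in fp32/fp16/bf16 with stride(1) == 1, B is [N, K] contiguous int8, scales is a 1-D tensor of length N, and the result C is [M, N] in A's dtype. A hedged usage sketch follows, assuming an XPU-enabled build and that the public at::_weight_int8pack_mm entry point dispatches to this kernel for XPU tensors.

#include <ATen/ATen.h>

// Illustrative shapes/dtypes mirroring the TORCH_CHECKs in _weight_int8pack_mm_xpu.
void weight_int8pack_mm_xpu_example() {
  const int64_t M = 4, K = 64, N = 16;
  auto A = at::randn({M, K}, at::device(at::kXPU).dtype(at::kBFloat16));
  auto B = at::randint(-128, 128, {N, K}, at::device(at::kXPU).dtype(at::kChar));
  auto scales = at::rand({N}, at::device(at::kXPU).dtype(at::kBFloat16));
  auto C = at::_weight_int8pack_mm(A, B, scales); // -> [M, N], dtype of A
}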

View File

@ -110,8 +110,9 @@ void quantized_matmul(
// [Note] Quantized Matrix Multiplication at XPU
// The following code integrates oneDNN quantized gemm. The quantization
// config we support:
// activation: s8&u8; per tensor calibrated; symmetric&asymmetric
// weight: s8; per_tensor/per_channel calibrated; symmetric
// activation: s8, u8, fp16, bf16, fp32; per tensor calibrated;
//             symmetric&asymmetric
// weight: s8; per_tensor/per_channel calibrated; symmetric
auto attr = Attr(static_cast<float>(1.0 / output_scale), output_zero_point);
construct_attr_by_post_op(
binary_post_op,

View File

@ -568,7 +568,7 @@ Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor,
MPSShape* mpsStrides = getMPSShape(_tensor.strides());
check_mps_shape(mpsShape);
auto storage_numel = src.storage().nbytes() / src.element_size();
auto storage_numel = src.storage().nbytes() / src.element_size() - src.storage_offset();
TORCH_CHECK(storage_numel <= std::numeric_limits<int32_t>::max(),
"MPSGaph does not support tensor dims larger than INT_MAX");
MPSNDArrayDescriptor* srcTensorDesc = [MPSNDArrayDescriptor descriptorWithDataType:dataType

View File

@ -0,0 +1,25 @@
#pragma once
#include <c10/metal/common.h>
#ifdef __METAL__
enum class EmbeddingBagMode { SUM = 0, MEAN, MAX };
#else
#include <ATen/native/EmbeddingBag.h>
using at::native::EmbeddingBagMode;
#endif
template <typename idx_type_t = uint32_t>
struct EmbeddingBagParams {
::c10::metal::array<idx_type_t, 2> weight_strides;
::c10::metal::array<idx_type_t, 2> output_strides;
::c10::metal::array<idx_type_t, 2> max_indices_strides;
idx_type_t per_sample_weights_strides;
idx_type_t num_indices;
idx_type_t num_bags;
idx_type_t feature_size;
EmbeddingBagMode mode;
int64_t padding_idx;
};

View File

@ -0,0 +1,212 @@
#include <ATen/native/mps/kernels/EmbeddingBag.h>
#include <c10/metal/utils.h>
#include <metal_array>
#include <metal_stdlib>
using namespace metal;
using namespace c10::metal;
template <EmbeddingBagMode M, typename T>
struct ReductionOpInit {
inline opmath_t<T> operator()() {
return 0;
}
};
template <typename T>
struct ReductionOpInit<EmbeddingBagMode::MAX, T> {
inline opmath_t<T> operator()() {
return static_cast<opmath_t<T>>(-INFINITY);
}
};
template <EmbeddingBagMode M, typename T>
struct ReductionOp {
inline opmath_t<T> operator()(
T weight_val,
opmath_t<T> out_val,
uint32_t per_sample_weights_index,
constant T* per_sample_weights,
uint32_t per_sample_weights_strides);
};
template <typename T>
struct ReductionOp<EmbeddingBagMode::SUM, T> {
inline opmath_t<T> operator()(
T weight_val,
opmath_t<T> out_val,
uint32_t per_sample_weights_index,
constant T* per_sample_weights,
uint32_t per_sample_weights_strides) {
if (per_sample_weights_strides) {
T per_sample_weight = per_sample_weights
[per_sample_weights_strides * per_sample_weights_index];
return static_cast<opmath_t<T>>(per_sample_weight) *
static_cast<opmath_t<T>>(weight_val) +
out_val;
} else {
return static_cast<opmath_t<T>>(weight_val) + out_val;
}
}
};
template <typename T>
struct ReductionOp<EmbeddingBagMode::MEAN, T> {
inline opmath_t<T> operator()(
T weight_val,
opmath_t<T> out_val,
uint32_t,
constant T*,
uint32_t) {
return static_cast<opmath_t<T>>(weight_val) + out_val;
}
};
template <typename T>
struct ReductionOp<EmbeddingBagMode::MAX, T> {
inline opmath_t<T> operator()(
T weight_val,
opmath_t<T> out_val,
uint32_t,
constant T*,
uint32_t) {
return max(static_cast<opmath_t<T>>(weight_val), out_val);
}
};
template <EmbeddingBagMode M, typename T>
struct ReductionOpFinal {
inline T operator()(opmath_t<T> val, uint32_t) {
return static_cast<T>(val);
}
};
template <typename T>
struct ReductionOpFinal<EmbeddingBagMode::MEAN, T> {
inline T operator()(opmath_t<T> val, uint32_t count) {
auto out = val / count;
return static_cast<T>((count == 0) ? 0 : out);
}
};
template <typename T>
struct ReductionOpFinal<EmbeddingBagMode::MAX, T> {
inline T operator()(opmath_t<T> val, uint32_t count) {
return static_cast<T>((count == 0) ? 0 : val);
}
};
template <EmbeddingBagMode M, typename T, typename I>
void embedding_bag_impl(
constant T* weight,
constant I* indices,
constant I* offsets,
constant T* per_sample_weights,
device T* output,
device I* offset2bag,
device I* bag_size,
device I* max_indices,
constant EmbeddingBagParams<uint32_t>& params,
uint tid) {
auto num_indices = params.num_indices;
auto num_bags = params.num_bags;
auto feature_size = params.feature_size;
auto padding_idx = params.padding_idx;
auto per_sample_weights_strides = params.per_sample_weights_strides;
constant auto& output_strides = params.output_strides;
constant auto& weight_strides = params.weight_strides;
constant auto& max_indices_strides = params.max_indices_strides;
auto bag_idx = tid / feature_size;
auto feature_idx = tid % feature_size;
output += bag_idx * output_strides[0] + feature_idx * output_strides[1];
uint32_t offsets_end = min(bag_idx + 1, num_bags - 1);
bool is_last_bag = bag_idx + 1 == num_bags;
uint32_t indices_start = static_cast<uint32_t>(offsets[bag_idx]);
uint32_t indices_end = is_last_bag * (num_indices) +
(!is_last_bag) * (static_cast<uint32_t>(offsets[offsets_end]));
auto out_val = ReductionOpInit<M, T>()();
uint32_t bag_size_ = 0;
for (uint32_t indices_idx = indices_start; indices_idx < indices_end;
indices_idx++) {
I weight_idx = indices[indices_idx];
bool pad = (weight_idx == padding_idx);
T weight_val = weight
[static_cast<uint32_t>(weight_idx) * weight_strides[0] +
feature_idx * weight_strides[1]];
bag_size_ += static_cast<uint32_t>(!pad);
auto tmp_val = ReductionOp<M, T>()(
weight_val,
out_val,
indices_idx,
per_sample_weights,
per_sample_weights_strides);
out_val = pad ? out_val : tmp_val;
}
*output = ReductionOpFinal<M, T>()(out_val, bag_size_);
}
#define DISPATCH_IMPL(MODE) \
return embedding_bag_impl<MODE>( \
weight, \
indices, \
offsets, \
per_sample_weights, \
output, \
offset2bag, \
bag_size, \
max_indices, \
params, \
tid)
template <typename T, typename I>
kernel void embedding_bag(
constant T* weight [[buffer(0)]],
constant I* indices [[buffer(1)]],
constant I* offsets [[buffer(2)]],
constant T* per_sample_weights [[buffer(3)]],
device T* output [[buffer(4)]],
device I* offset2bag [[buffer(5)]],
device I* bag_size [[buffer(6)]],
device I* max_indices [[buffer(7)]],
constant EmbeddingBagParams<uint32_t>& params [[buffer(8)]],
uint tid [[thread_position_in_grid]]) {
switch (params.mode) {
case EmbeddingBagMode::SUM:
DISPATCH_IMPL(EmbeddingBagMode::SUM);
case EmbeddingBagMode::MEAN:
DISPATCH_IMPL(EmbeddingBagMode::MEAN);
case EmbeddingBagMode::MAX:
DISPATCH_IMPL(EmbeddingBagMode::MAX);
}
}
#define REGISTER_EMBEDDING_BAG_OP(T, I) \
template [[host_name("embedding_bag_" #T "_" #I)]] \
kernel void embedding_bag<T, I>( \
constant T * weight [[buffer(0)]], \
constant I * indices [[buffer(1)]], \
constant I * offsets [[buffer(2)]], \
constant T * per_sample_weights [[buffer(3)]], \
device T * output [[buffer(4)]], \
device I * offset2bag [[buffer(5)]], \
device I * bag_size [[buffer(6)]], \
device I * max_indices [[buffer(7)]], \
constant EmbeddingBagParams<uint32_t> & params [[buffer(8)]], \
uint tid [[thread_position_in_grid]]);
REGISTER_EMBEDDING_BAG_OP(float, int);
REGISTER_EMBEDDING_BAG_OP(float, long);
REGISTER_EMBEDDING_BAG_OP(half, int);
REGISTER_EMBEDDING_BAG_OP(half, long);
REGISTER_EMBEDDING_BAG_OP(bfloat, int);
REGISTER_EMBEDDING_BAG_OP(bfloat, long);
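The per-bag index range in embedding_bag_impl above is derived purely from offsets: bag b covers [offsets[b], offsets[b+1]) and the last bag runs to num_indices. A small host-side C++ sketch of the same index math (a hypothetical helper, for illustration only):

#include <cstdint>
#include <utility>
#include <vector>

// Mirrors the indices_start/indices_end computation in embedding_bag_impl.
std::pair<uint32_t, uint32_t> bag_bounds(
    const std::vector<uint32_t>& offsets, uint32_t bag_idx, uint32_t num_indices) {
  const uint32_t num_bags = static_cast<uint32_t>(offsets.size());
  const bool is_last_bag = bag_idx + 1 == num_bags;
  const uint32_t start = offsets[bag_idx];
  const uint32_t end = is_last_bag ? num_indices : offsets[bag_idx + 1];
  // e.g. offsets = {0, 2, 5}, num_indices = 7 -> bags [0,2), [2,5), [5,7)
  return {start, end};
}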

View File

@ -0,0 +1,179 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/TensorUtils.h>
#include <ATen/core/Tensor.h>
#include <ATen/mps/MPSProfiler.h>
#include <ATen/native/EmbeddingBag.h>
#include <ATen/native/Pool.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/native/mps/kernels/EmbeddingBag.h>
#include <fmt/format.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/_embedding_bag_forward_only_native.h>
#include <ATen/ops/_embedding_bag_native.h>
#include <ATen/ops/empty.h>
#endif
namespace at::native {
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = mps::MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/EmbeddingBag_metallib.h>
#endif
namespace {
std::pair<Tensor, Tensor> promoteIndicesAndOffsets(const Tensor& indices, const Tensor& offsets) {
const auto commonType = promoteTypes(offsets.scalar_type(), indices.scalar_type());
return {indices.scalar_type() == commonType ? indices : indices.toType(commonType),
offsets.scalar_type() == commonType ? offsets : offsets.toType(commonType)};
}
} // namespace
namespace mps {
static std::tuple<Tensor, Tensor, Tensor, Tensor> _embedding_bag_mps_impl(
const Tensor& weight,
const Tensor& indices_,
const Tensor& offsets_,
const bool scale_grad_by_freq,
const int64_t mode,
bool sparse,
const std::optional<Tensor>& per_sample_weights_opt,
bool include_last_offset,
int64_t padding_idx) {
TORCH_CHECK(indices_.dim() == 1, "input has to be a 1D Tensor, but got Tensor of dimension ", indices_.dim());
if (indices_.dim() == 1) {
TORCH_CHECK(offsets_.dim() == 1, "offsets has to be a 1D Tensor, but got Tensor of dimension ", offsets_.dim());
}
TORCH_CHECK(weight.dim() == 2, "weight has to be a 2D Tensor, but got Tensor of dimension ", weight.dim());
Tensor indices, offsets;
std::tie(indices, offsets) = promoteIndicesAndOffsets(indices_, offsets_);
auto indices_arg = TensorArg(indices, "indices", 1);
checkScalarTypes("embedding_bag_mps", indices_arg, {kLong, kInt});
auto offsets_arg = TensorArg(offsets, "offsets", 1);
checkScalarTypes("embedding_bag_mps", offsets_arg, {kLong, kInt});
checkSameType("embedding_bag_mps", indices_arg, offsets_arg);
auto weight_arg = TensorArg(weight, "weight", 1);
int64_t num_indices = indices.size(0);
int64_t num_bags = offsets.size(0);
if (include_last_offset) {
num_bags -= 1;
}
int64_t feature_size = weight.size(1);
auto bag_size = at::empty(offsets.sizes(), indices.options());
auto offset2bag = at::empty({indices.size(0)}, indices.options());
auto output = at::empty({num_bags, feature_size}, weight.options());
Tensor max_indices;
if (mode == EmbeddingBagMode::MAX) {
max_indices = at::empty({num_bags, feature_size}, indices.options());
} else {
max_indices = at::empty({0}, indices.options());
}
EmbeddingBagParams<uint32_t> params;
for (const auto dim : c10::irange(weight.dim())) {
params.weight_strides[dim] = safe_downcast<uint32_t, int64_t>(weight.stride(dim));
params.output_strides[dim] = safe_downcast<uint32_t, int64_t>(output.stride(dim));
if (mode == EmbeddingBagMode::MAX) {
params.max_indices_strides[dim] = safe_downcast<uint32_t, int64_t>(max_indices.stride(dim));
}
}
bool use_per_sample_weights = per_sample_weights_opt.has_value() && per_sample_weights_opt->defined();
params.per_sample_weights_strides = use_per_sample_weights ? per_sample_weights_opt->stride(0) : 0;
params.num_indices = num_indices;
params.num_bags = num_bags;
params.feature_size = feature_size;
params.mode = static_cast<EmbeddingBagMode>(mode);
params.padding_idx = padding_idx;
auto num_threads = output.numel();
MPSStream* stream = getCurrentMPSStream();
dispatch_sync_with_rethrow(stream->queue(), ^() {
@autoreleasepool {
id<MTLComputeCommandEncoder> computeEncoder = stream->commandEncoder();
auto pipeline_state = lib.getPipelineStateForFunc(
fmt::format("embedding_bag_{}_{}", scalarToMetalTypeString(weight), scalarToMetalTypeString(indices)));
getMPSProfiler().beginProfileKernel(pipeline_state, "embedding_bag", {weight, indices, offsets});
[computeEncoder setComputePipelineState:pipeline_state];
mtl_setArgs(computeEncoder,
weight,
indices,
offsets,
use_per_sample_weights ? per_sample_weights_opt : std::nullopt,
output,
offset2bag,
bag_size,
max_indices,
params);
mtl_dispatch1DJob(computeEncoder, pipeline_state, num_threads);
getMPSProfiler().endProfileKernel(pipeline_state);
}
});
return std::tuple<Tensor, Tensor, Tensor, Tensor>(
std::move(output), std::move(offset2bag), std::move(bag_size), std::move(max_indices));
}
} // namespace mps
std::tuple<Tensor, Tensor, Tensor, Tensor> _embedding_bag_mps(const Tensor& weight,
const Tensor& indices,
const Tensor& offsets,
const bool scale_grad_by_freq,
const int64_t mode,
bool sparse,
const std::optional<Tensor>& per_sample_weights_opt,
bool include_last_offset,
int64_t padding_idx) {
return mps::_embedding_bag_mps_impl(weight,
indices,
offsets,
scale_grad_by_freq,
mode,
sparse,
per_sample_weights_opt,
include_last_offset,
padding_idx);
}
std::tuple<Tensor, Tensor, Tensor, Tensor> _embedding_bag_forward_only_mps(
const Tensor& weight,
const Tensor& indices,
const Tensor& offsets,
const bool scale_grad_by_freq,
const int64_t mode,
bool sparse,
const std::optional<Tensor>& per_sample_weights_opt,
bool include_last_offset,
int64_t padding_idx) {
return _embedding_bag_mps(weight,
indices,
offsets,
scale_grad_by_freq,
mode,
sparse,
per_sample_weights_opt,
include_last_offset,
padding_idx);
}
} // namespace at::native
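With the dispatch keys added in native_functions.yaml below, this kernel should be reachable through the ordinary embedding_bag entry point on MPS tensors. A hedged C++ usage sketch, assuming an MPS-enabled build:

#include <ATen/ATen.h>
#include <tuple>

void embedding_bag_mps_example() {
  auto weight = at::randn({10, 3}, at::device(at::kMPS).dtype(at::kFloat));
  auto indices = at::randint(0, 10, {8}, at::device(at::kMPS).dtype(at::kLong));
  auto offsets = at::arange(0, 8, 4, at::device(at::kMPS).dtype(at::kLong)); // {0, 4}
  // Returns (output, offset2bag, bag_size, max_indices); default mode 0 == sum.
  auto result = at::embedding_bag(weight, indices, offsets);
  auto output = std::get<0>(result); // shape [2, 3]
}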

View File

@ -534,6 +534,18 @@ static void max_unpool_out_mps_template(const Tensor& input,
output.resize_(output_size, memory_format);
output.fill_(0);
if (indices.defined() && indices.numel() > 0) {
auto output_image_size = c10::multiply_integers(output_size_);
int64_t min_idx = indices.min().item<int64_t>();
int64_t max_idx = indices.max().item<int64_t>();
if (min_idx < 0 || max_idx >= output_image_size) {
int64_t error_idx = (min_idx < 0) ? min_idx : max_idx;
TORCH_CHECK(false, "Found an invalid max index: ", error_idx, " for output tensor of shape ", output_size_);
}
}
id<MTLDevice> device = MPSDevice::getInstance()->device();
MPSStream* mpsStream = getCurrentMPSStream();
const auto numThreads = input.numel();

View File

@ -2351,6 +2351,7 @@
dispatch:
CPU: _embedding_bag_forward_only_cpu
CUDA: _embedding_bag_forward_only_cuda
MPS: _embedding_bag_forward_only_mps
autogen: _embedding_bag_forward_only.out
- func: _rowwise_prune(Tensor weight, Tensor mask, ScalarType compressed_indices_dtype) -> (Tensor, Tensor)
@ -2372,6 +2373,7 @@
dispatch:
CPU: _embedding_bag_cpu
CUDA: _embedding_bag_cuda
MPS: _embedding_bag_mps
autogen: _embedding_bag.out
tags: core
@ -4241,6 +4243,7 @@
CPU: _weight_int8pack_mm_cpu
CUDA: _weight_int8pack_mm_cuda
MPS: _weight_int8pack_mm_mps
XPU: _weight_int8pack_mm_xpu
- func: _sparse_mm(Tensor sparse, Tensor dense) -> Tensor
python_module: sparse
@ -4372,7 +4375,7 @@
variants: function, method
dispatch:
CPU: narrow_copy_dense_cpu
SparseCPU, SparseCUDA: narrow_copy_sparse
SparseCPU, SparseCUDA, SparseMPS: narrow_copy_sparse
CompositeExplicitAutogradNonFunctional: narrow_copy_dense_symint
tags: view_copy
@ -6660,7 +6663,7 @@
- func: zeros.out(SymInt[] size, *, Tensor(a!) out) -> Tensor(a!)
dispatch:
CompositeExplicitAutograd: zeros_out
SparseCPU, SparseCUDA, SparseMeta: zeros_sparse_out
SparseCPU, SparseCUDA, SparseMPS, SparseMeta: zeros_sparse_out
- func: zeros_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
dispatch:
@ -10699,6 +10702,7 @@
dispatch:
CompositeExplicitAutograd: foreach_tensor_div_list_kernel_slow
CUDA: foreach_tensor_div_list_kernel_cuda
MTIA: foreach_tensor_div_list_kernel_mtia
- func: _foreach_div_.List(Tensor(a!)[] self, Tensor[] other) -> ()
device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices
@ -10706,6 +10710,7 @@
dispatch:
CompositeExplicitAutograd: foreach_tensor_div_list_kernel_slow_
CUDA: foreach_tensor_div_list_kernel_cuda_
MTIA: foreach_tensor_div_list_kernel_mtia_
autogen: _foreach_div.List_out
- func: _foreach_div.ScalarList(Tensor[] self, Scalar[] scalars) -> Tensor[]
@ -10729,6 +10734,7 @@
dispatch:
CompositeExplicitAutograd: foreach_tensor_div_tensor_kernel_slow
CUDA: foreach_tensor_div_tensor_kernel_cuda
MTIA: foreach_tensor_div_tensor_kernel_mtia
- func: _foreach_div_.Tensor(Tensor(a!)[] self, Tensor other) -> ()
device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices
@ -10736,6 +10742,7 @@
dispatch:
CompositeExplicitAutograd: foreach_tensor_div_tensor_kernel_slow_
CUDA: foreach_tensor_div_tensor_kernel_cuda_
MTIA: foreach_tensor_div_tensor_kernel_mtia_
autogen: _foreach_div.Tensor_out
- func: _foreach_clamp_max.Scalar(Tensor[] self, Scalar scalar) -> Tensor[]
@ -10842,6 +10849,7 @@
dispatch:
CompositeExplicitAutograd: foreach_tensor_clamp_min_scalar_kernel_slow_
CUDA: foreach_tensor_clamp_min_scalar_kernel_cuda_
MTIA: foreach_tensor_maximum_scalar_kernel_mtia_
autogen: _foreach_maximum.Scalar_out
# foreach_minimum/maximum dispatches to clamp_max/min

View File

@ -77,7 +77,7 @@ static Tensor NestedTensor_elementwise_Tensor(
const Tensor& other,
const std::string& op_name,
bool supports_striding,
Func f) {
const Func& f) {
Tensor self_contiguous = self;
Tensor other_contiguous = other;
// self is a scalar
@ -238,7 +238,7 @@ static Tensor& NestedTensor_elementwise__Tensor(
Tensor& self,
const Tensor& other,
const std::string& op_name,
Func f) {
const Func& f) {
// self is a scalar
if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) {
auto other_impl = get_nested_tensor_impl(other);

View File

@ -149,7 +149,7 @@ Tensor MakeStridedQTensorCPU(
const IntArrayRef& sizes,
const IntArrayRef& strides,
const TensorOptions& options,
QuantizerPtr quantizer) {
const QuantizerPtr& quantizer) {
AT_ASSERT(options.device().is_cpu());
at::native::check_size_nonnegative(sizes);
auto* allocator = at::getCPUAllocator();

View File

@ -37,7 +37,7 @@ struct TORCH_API PackedLinearWeight : public LinearPackedParamsBase {
col_offsets(std::move(col_offsets)),
w_scale(std::move(w_scale)),
w_zp(std::move(w_zp)),
q_scheme(std::move(q_scheme)) {}
q_scheme(q_scheme) {}
std::unique_ptr<fbgemm::PackBMatrix<int8_t>> w;
std::optional<at::Tensor> bias_;
std::vector<int32_t> col_offsets;
@ -316,7 +316,7 @@ Tensor MakeStridedQTensorCPU(
const IntArrayRef& sizes,
const IntArrayRef& strides,
const TensorOptions& options,
QuantizerPtr quantizer);
const QuantizerPtr& quantizer);
Tensor MakeEmptyAffineQuantizedChannelsLast3dTensor(
int64_t N,

View File

@ -1,5 +1,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/Dispatch.h>
#include <ATen/ceil_div.h>
#include <ATen/native/cuda/Loops.cuh>
#include <c10/cuda/CUDAGuard.h>
@ -21,10 +22,11 @@
namespace at::native {
namespace {
template <typename T>
__global__ void ChooseQuantizationParamsKernelImpl(
const int64_t* fake_quant_on,
const float* x_min,
const float* x_max,
const T* x_min,
const T* x_max,
int32_t qmin,
int32_t qmax,
int size,
@ -93,34 +95,44 @@ __global__ void ChooseQuantizationParamsKernelImpl(
}
}
__device__ inline bool isinf_device(float v) {
return ::isinf(v);
}
__device__ inline bool isinf_device(c10::BFloat16 v) {
return ::isinf(static_cast<float>(v));
}
// CUDA kernel to compute Moving Average Min/Max of the tensor.
// It uses the running_min and running_max along with averaging const, c.
// The formula used to compute the new min/max is as follows
//
// running_min = (1 - c) * running_min + c * x_min, if running_min != inf
// running_min = x_min, if running_min == inf
template <typename T>
__global__ void MovingAverageMinMax(
const int64_t* observer_on,
const float* x_min,
const float* x_max,
float* running_min,
float* running_max,
const T* x_min,
const T* x_max,
T* running_min,
T* running_max,
const float averaging_const,
const int size) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (*observer_on == 1) {
if (i < size) {
float curr_min = x_min[i];
float curr_max = x_max[i];
T curr_min = x_min[i];
T curr_max = x_max[i];
float adjusted_min = ::isinf(running_min[i])
? curr_min
: (running_min[i]) + averaging_const * (curr_min - (running_min[i]));
T averaging_const_t = static_cast<T>(averaging_const);
float adjusted_max = ::isinf(running_max[i])
? curr_max
: (running_max[i]) + averaging_const * (curr_max - (running_max[i]));
T adjusted_min = isinf_device(running_min[i]) ? curr_min
: (running_min[i]) +
averaging_const_t * (curr_min - (running_min[i]));
T adjusted_max = isinf_device(running_max[i]) ? curr_max
: (running_max[i]) +
averaging_const_t * (curr_max - (running_max[i]));
running_min[i] = adjusted_min;
running_max[i] = adjusted_max;
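The update above is the exponential moving average spelled out in the comment, with +/-inf acting as "not yet initialized". A scalar C++ sketch of the same rule (illustrative, not the kernel itself):

#include <cmath>

// running = (1 - c) * running + c * current, i.e. running + c * (current - running),
// unless running is still +/-inf, in which case it is seeded with current.
float moving_average_update(float running, float current, float averaging_const) {
  return std::isinf(running) ? current
                             : running + averaging_const * (current - running);
}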
@ -142,40 +154,51 @@ void _calculate_moving_average(
at::Tensor x_min, x_max;
int64_t* observer_on_data = observer_on.data_ptr<int64_t>();
float* running_min_data = running_min.data_ptr<float>();
float* running_max_data = running_max.data_ptr<float>();
cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream();
if (per_row_fq) {
std::tie(x_min, x_max) = at::aminmax(x, 1);
float* x_min_data = x_min.data_ptr<float>();
float* x_max_data = x_max.data_ptr<float>();
int num_threads = std::min(size, (int64_t)512);
const uint64_t num_blocks = ceil_div<uint64_t>(size, num_threads);
AT_DISPATCH_FLOATING_TYPES_AND(
at::kBFloat16, x.scalar_type(), "aminmax_kernel", [&] {
scalar_t* x_min_data = x_min.data_ptr<scalar_t>();
scalar_t* x_max_data = x_max.data_ptr<scalar_t>();
// Moving Average Min/Max observer for activations
MovingAverageMinMax<<<num_blocks, num_threads, 0, cuda_stream>>>(
observer_on_data,
x_min_data,
x_max_data,
running_min_data,
running_max_data,
averaging_const,
size);
scalar_t* running_min_data = running_min.data_ptr<scalar_t>();
scalar_t* running_max_data = running_max.data_ptr<scalar_t>();
// Moving Average Min/Max observer for activations
MovingAverageMinMax<<<num_blocks, num_threads, 0, cuda_stream>>>(
observer_on_data,
x_min_data,
x_max_data,
running_min_data,
running_max_data,
averaging_const,
size);
});
C10_CUDA_KERNEL_LAUNCH_CHECK();
} else {
std::tie(x_min, x_max) = at::aminmax(x);
float* x_min_data = x_min.data_ptr<float>();
float* x_max_data = x_max.data_ptr<float>();
// Moving Average Min/Max observer for activations
MovingAverageMinMax<<<1, 1, 0, cuda_stream>>>(
observer_on_data,
x_min_data,
x_max_data,
running_min_data,
running_max_data,
averaging_const,
1 /*size*/);
AT_DISPATCH_FLOATING_TYPES_AND(
at::kBFloat16, x.scalar_type(), "aminmax_kernel", [&] {
scalar_t* x_min_data = x_min.data_ptr<scalar_t>();
scalar_t* x_max_data = x_max.data_ptr<scalar_t>();
scalar_t* running_min_data = running_min.data_ptr<scalar_t>();
scalar_t* running_max_data = running_max.data_ptr<scalar_t>();
// Moving Average Min/Max observer for activations
MovingAverageMinMax<<<1, 1, 0, cuda_stream>>>(
observer_on_data,
x_min_data,
x_max_data,
running_min_data,
running_max_data,
averaging_const,
1 /*size*/);
});
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
}
@ -198,34 +221,44 @@ void _calc_moving_avg_qparams_helper(
cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream();
int64_t* fake_quant_on_data = fake_quant_on.data_ptr<int64_t>();
if (per_row_fq) {
float* running_min_data = running_min.data_ptr<float>();
float* running_max_data = running_max.data_ptr<float>();
int num_threads = std::min(size, (int64_t)512);
const uint64_t num_blocks = ceil_div<uint64_t>(size, num_threads);
ChooseQuantizationParamsKernelImpl<<<num_blocks, num_threads, 0, cuda_stream>>>(
fake_quant_on_data,
running_min_data,
running_max_data,
qmin,
qmax,
size,
symmetric_quant,
scale_ptr,
zp_ptr);
AT_DISPATCH_FLOATING_TYPES_AND(
at::kBFloat16, x.scalar_type(), "aminmax_kernel", [&] {
scalar_t* running_min_data = running_min.data_ptr<scalar_t>();
scalar_t* running_max_data = running_max.data_ptr<scalar_t>();
int num_threads = std::min(size, (int64_t)512);
const uint64_t num_blocks = ceil_div<uint64_t>(size, num_threads);
ChooseQuantizationParamsKernelImpl<<<
num_blocks,
num_threads,
0,
cuda_stream>>>(
fake_quant_on_data,
running_min_data,
running_max_data,
qmin,
qmax,
size,
symmetric_quant,
scale_ptr,
zp_ptr);
});
C10_CUDA_KERNEL_LAUNCH_CHECK();
} else {
float* running_min_data = running_min.data_ptr<float>();
float* running_max_data = running_max.data_ptr<float>();
ChooseQuantizationParamsKernelImpl<<<1, 1, 0, cuda_stream>>>(
fake_quant_on_data,
running_min_data,
running_max_data,
qmin,
qmax,
1, // size
symmetric_quant, // preserve_sparsity
scale_ptr,
zp_ptr);
AT_DISPATCH_FLOATING_TYPES_AND(
at::kBFloat16, x.scalar_type(), "aminmax_kernel", [&] {
scalar_t* running_min_data = running_min.data_ptr<scalar_t>();
scalar_t* running_max_data = running_max.data_ptr<scalar_t>();
ChooseQuantizationParamsKernelImpl<<<1, 1, 0, cuda_stream>>>(
fake_quant_on_data,
running_min_data,
running_max_data,
qmin,
qmax,
1, // size
symmetric_quant, // preserve_sparsity
scale_ptr,
zp_ptr);
});
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
}

View File

@ -64,7 +64,6 @@ at::Tensor _cslt_compress(const Tensor& sparse_input) {
// create sparse descriptor, dtype
cusparseLtMatDescriptor_t sparse_input_descriptor;
cudaDataType type;
auto compression_factor = 9;
#ifdef USE_ROCM
TORCH_CHECK(isHipSparseLtSupported());
@ -73,7 +72,6 @@ at::Tensor _cslt_compress(const Tensor& sparse_input) {
switch (sparse_input.scalar_type()) {
case at::ScalarType::Char:
type = CUDA_R_8I;
compression_factor = 10;
break;
case at::ScalarType::Half:
type = CUDA_R_16F;
@ -89,7 +87,6 @@ at::Tensor _cslt_compress(const Tensor& sparse_input) {
#if defined(CUSPARSELT_VERSION) && CUSPARSELT_VERSION >= 602 && !defined(USE_ROCM)
case at::ScalarType::Float8_e4m3fn:
type = CUDA_R_8F_E4M3;
compression_factor = 10;
break;
#endif
default:
@ -97,10 +94,6 @@ at::Tensor _cslt_compress(const Tensor& sparse_input) {
break;
}
// create a new compressed tensor with the same dtype as
auto compressed_tensor =
sparse_input.new_empty(sparse_input.numel() * compression_factor / 16);
TORCH_CUDASPARSE_CHECK(cusparseLtStructuredDescriptorInit(
&handle,
&sparse_input_descriptor,
@ -121,6 +114,15 @@ at::Tensor _cslt_compress(const Tensor& sparse_input) {
&compressed_size,
&compressed_buffer_size));
// create a new compressed tensor with the same dtype as the input,
// and with packed data/metadata stored in an array with original
// number of rows, and sufficient columns to provide compressed_size
// buffer (in bytes)
size_t orig_m = sparse_input.size(0);
size_t div = orig_m * sparse_input.itemsize();
size_t new_n = (compressed_size + div - 1) / div; // ceil(compressed_size / div)
auto compressed_tensor = sparse_input.new_empty({(int64_t)orig_m, (int64_t)new_n});
auto& allocator = *::c10::cuda::CUDACachingAllocator::get();
auto compressedBufferPtr = allocator.allocate(compressed_buffer_size);
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
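The replacement sizing above picks new_n so that orig_m * new_n * itemsize is at least compressed_size, i.e. new_n is a ceiling division of the compressed byte count by the bytes in one row. A standalone sketch of that arithmetic, using made-up numbers:

#include <cstddef>
#include <cstdio>

int main() {
  const size_t orig_m = 64;            // rows of the sparse input (hypothetical)
  const size_t itemsize = 2;           // e.g. fp16
  const size_t compressed_size = 9000; // bytes reported by cuSPARSELt (made-up)
  const size_t div = orig_m * itemsize;
  const size_t new_n = (compressed_size + div - 1) / div; // ceil(compressed_size / div)
  std::printf("compressed tensor: [%zu, %zu] -> %zu bytes >= %zu\n",
              orig_m, new_n, orig_m * new_n * itemsize, compressed_size);
  return 0;
}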
@ -165,7 +167,6 @@ std::tuple<at::Tensor, int64_t, int64_t, int64_t, int64_t> _cslt_sparse_mm_impl(
cudaDataType output_type;
cudaDataType C_type;
cusparseComputeType compute_type;
auto compression_factor = 9;
#ifdef USE_ROCM
TORCH_CHECK(isHipSparseLtSupported());
@ -177,7 +178,6 @@ std::tuple<at::Tensor, int64_t, int64_t, int64_t, int64_t> _cslt_sparse_mm_impl(
output_type = CUDA_R_8I;
C_type = CUDA_R_8I;
compute_type = CUSPARSE_COMPUTE_32I;
compression_factor = 10;
break;
// cuSPARSELt v0.5.2 onwards changes CUSPARSE_COMPUTE_TF32, CUSPARSE_COMPUT_16F
@ -210,7 +210,6 @@ std::tuple<at::Tensor, int64_t, int64_t, int64_t, int64_t> _cslt_sparse_mm_impl(
output_type = CUDA_R_8F_E4M3;
C_type = CUDA_R_16F;
compute_type = CUSPARSE_COMPUTE_32F;
compression_factor = 10;
break;
#endif
// cuSPARSELt <= v0.5.2 uses CUSPARSE_COMPUTE_TF32, CUSPARSE_COMPUTE_16F
@ -300,9 +299,10 @@ std::tuple<at::Tensor, int64_t, int64_t, int64_t, int64_t> _cslt_sparse_mm_impl(
}
}
TORCH_INTERNAL_ASSERT(compressed_A.dim() == 2); // encoded M x S
int64_t k = dense_B.size(0);
int64_t n = dense_B.size(1);
int64_t m = (compressed_A.numel() * 16 / compression_factor) / k;
int64_t m = compressed_A.size(0);
// initialize sparse descriptor
cusparseLtMatDescriptor_t sparse_input_descriptor;

View File

@ -7,7 +7,7 @@ QTensorImpl::QTensorImpl(
DispatchKeySet key_set,
const caffe2::TypeMeta data_type,
QuantizerPtr quantizer)
: TensorImpl(std::move(storage), std::move(key_set), data_type),
: TensorImpl(std::move(storage), key_set, data_type),
quantizer_(std::move(quantizer)) {}
QTensorImpl::QTensorImpl(
@ -16,7 +16,7 @@ QTensorImpl::QTensorImpl(
DispatchKeySet key_set,
const caffe2::TypeMeta data_type,
QuantizerPtr quantizer)
: TensorImpl(type, std::move(storage), std::move(key_set), data_type),
: TensorImpl(type, std::move(storage), key_set, data_type),
quantizer_(std::move(quantizer)) {}
const char* QTensorImpl::tensorimpl_type_name() const {

View File

@ -4,6 +4,8 @@
#include <c10/core/TensorImpl.h>
#include <c10/util/Exception.h>
#include <utility>
namespace at {
/**
@ -36,7 +38,7 @@ struct TORCH_API QTensorImpl : public c10::TensorImpl {
}
void set_quantizer_(QuantizerPtr quantizer) {
quantizer_ = quantizer;
quantizer_ = std::move(quantizer);
}
/**
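The clang-tidy changes in these last hunks follow one pattern: DispatchKeySet is a small, trivially copyable value, so std::move on it is a no-op that only obscures the code, while QuantizerPtr is a reference-counted smart pointer, so moving it into the member saves an increment/decrement pair. A generic illustration with std::shared_ptr (not PyTorch code):

#include <memory>
#include <utility>

struct Quantizerish {};

struct Holder {
  // Take by value, then move into the member: at most one refcount bump on the
  // way in, and none when storing into the field.
  void set(std::shared_ptr<Quantizerish> q) { q_ = std::move(q); }
  std::shared_ptr<Quantizerish> q_;
};

int main() {
  Holder h;
  auto q = std::make_shared<Quantizerish>();
  h.set(std::move(q)); // the caller can move too, avoiding refcount traffic entirely
  return 0;
}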

Some files were not shown because too many files have changed in this diff.