Compare commits

..

260 Commits

Author SHA1 Message Date
a6df8acdbf init version 2025-10-10 10:26:40 -07:00
3f27100d3e [torchfuzz] remove fixed xfail (#165116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165116
Approved by: https://github.com/PaulZhang12
2025-10-10 16:31:27 +00:00
253fd765bd bf16 support for fake_quantize_learnable_per_channel_affine (#165098)
Adding bf16 support for `torch._fake_quantize_learnable_per_channel_affine()` op by relaxing the type check on scale

TODO: need to add bf16 support to `per_tensor_affine_` as `torch._fake_quantize_learnable_per_tensor_affine_backward` gets called in the backward pass

**Test**
Modified unit test in `test_workflow_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165098
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2025-10-10 16:24:52 +00:00
abb2f7179e Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 68913d8f2a953bdbada4033101b04f6e8d49dabe.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to It breaks CI again, why was it landed for 3 times in a row without any changes? ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3390973016))
2025-10-10 16:10:25 +00:00
b57ab9a3f2 Fix #165125: Type "str" is not assignable to return type "None" (#165128)
Fixes #165125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165128
Approved by: https://github.com/malfet
2025-10-10 16:05:07 +00:00
fb64da0791 [2/N] Use "is" in python type comparison (#165142)
This is follow-up of #165037. It generally recommended to use `is/is not` to compare types. Therefore this series of changes apply this suggestion in the code base, and it aims to finally enabling related linter checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165142
Approved by: https://github.com/albanD
2025-10-10 15:36:44 +00:00
10a9fb641b Switch build jobs from linux.4xlarge to c7i (#165057)
Switch build jobs that use linux.4xlarge which uses c5 instance types to c7i variant. This should improve performance by ~15-20% while cutting costs by ~10-15%.

Relates to pytorch/test-infra#7175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165057
Approved by: https://github.com/huydhn
2025-10-10 15:13:40 +00:00
9420944033 Revert "[AMP][Refactor] Simplify dtype support logic in autocast context manager (#163446)"
This reverts commit 960b0d5f0d0efb1f1962bddcf62e2a698e26edd2.

Reverted https://github.com/pytorch/pytorch/pull/163446 on behalf of https://github.com/izaitsevfb due to breaks autocast tests on linux and mac ([comment](https://github.com/pytorch/pytorch/pull/163446#issuecomment-3390688642))
2025-10-10 15:12:46 +00:00
55f01a48af [ROCm] Enable and fix several FSDP + Inductor distributed unit tests (#165011)
This PR enables a number of distributed unit tests and applies necessary fixes to ensure they pass on ROCm platforms. The changes have been successfully tested on both MI200 and MI300 hardware.

This work addresses the following issues:
**https://github.com/ROCm/frameworks-internal/issues/13586
https://github.com/ROCm/frameworks-internal/issues/13578**

**Enabled Tests**

The following tests have been enabled and are now passing:
1. test_compiled_autograd_ctx
2. test_simple_mlp_fullgraph_backend_aot_eager
3. test_simple_mlp_fullgraph_backend_aot_eager_decomp_partition
4. test_simple_mlp_fullgraph_backend_inductor
5. test_nested_fully_shard_backend_aot_eager
6. test_nested_fully_shard_backend_aot_eager_decomp_partition
7. test_nested_fully_shard_backend_inductor_fullgraph_True
8. test_nested_fully_shard_backend_inductor_fullgraph_True_graph_partition
9. test_transformer_backend_aot_eager
10. test_transformer_backend_aot_eager_decomp_partition
11. test_storage_resize_zero_gpu
12. test_storage_resize_nonzero_gpu
13. test_fake_distributed_inductor

**Tests skipped due to upstream issues:**
1. test_nested_fully_shard_backend_inductor_fullgraph_False
2. test_transformer_backend_inductor_fullgraph_True
3. test_transformer_backend_inductor_fullgraph_True_graph_partition
4. test_transformer_backend_inductor_fullgraph_False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165011
Approved by: https://github.com/jeffdaily
2025-10-10 14:10:54 +00:00
68913d8f2a Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
2025-10-10 14:00:46 +00:00
b8be796a57 Revert "[2/N] More ruff SIM fixes (#165031)"
This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956.

Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870))
2025-10-10 13:42:14 +00:00
238dd5517d [PP] Move profiler record_function in schedule (#164976)
Better engineering to move the `record_function` call to also encompass the custom callback, this line is the only change: https://github.com/pytorch/pytorch/pull/164976/files#diff-1d3d91f53db88fb886901fb178d69e47776e71b8103f85688fa9ca64cc55d068R2147, the rest is just formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164976
Approved by: https://github.com/fegin
ghstack dependencies: #162016, #164962
2025-10-10 13:09:23 +00:00
d272ed4b3e Fix identity expansion (#165066)
In some cases, we wrap indexing with `Identity` to prevent expansion from int32 -> int64 range. There are some checks in codegen which intend to check for constants, which did not handle Identity. Update these checks and update Identity so that it recursively prints inputs.

Fix for https://github.com/pytorch/pytorch/issues/164700

Replaces https://github.com/pytorch/pytorch/pull/160190 cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @njriasan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165066
Approved by: https://github.com/njriasan, https://github.com/shunting314, https://github.com/jansel
2025-10-10 13:07:15 +00:00
70925bdf82 [1/N] Use "is" in python type comparison (#165037)
It generally recommended to use `is/is not` to compare types. Therefore this series of changes apply this suggestion in the code base, and it aims to finally enabling related linter checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
2025-10-10 12:36:50 +00:00
960b0d5f0d [AMP][Refactor] Simplify dtype support logic in autocast context manager (#163446)
## Description:

This PR refactors the autocast context manager in `autocast_mode.py` to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping `device_supported_dtypes` is used to associate device types with their supported dtypes, and the validation logic is unified.

In my view, this makes the code easier to maintain and extend for new devices.

Please share any suggestions and comments with me.

BTW, in the original `xla` branch, the `supported_dtype` are `[torch.float16, torch.bfloat16]`, 5d8a226e23/torch/amp/autocast_mode.py (L358-L363) but the warning message has only `torch.bfloat16`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163446
Approved by: https://github.com/FFFrog, https://github.com/albanD
2025-10-10 12:30:06 +00:00
e0abcee3b5 [Code Clean] Remove support of python3.9 (#163846)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163846
Approved by: https://github.com/ezyang
2025-10-10 11:11:56 +00:00
77bf23d85c Add an option to put store large mmap weights on disk (#164526)
As title

In windows, we cannot modify the .dll to append weights at the end, the windows .dll loader will complain it's not a valid .dll file. So we store the weight blob as a separete file.

1. We add the following API which allows passing in a pointer to the weight blob and get the size of the weight blob.

```cpp
AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize(
    AOTInductorModelContainerHandle container_handle,
    uint64_t* ret_size);

// Load weights from a single blob in weight_blob_ptr
AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob(
    AOTInductorModelContainerHandle container_handle,
    const uint8_t* weight_blob_ptr);
```

2. We also add a method in ModelContainerRunner to load the weight:

If the runner see that there is a `.blob` file in the package, if will mmap the .blob file and use the content to load the constants.

3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load the weights from external mmap'd weights.

Test Plan:

```
buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk
```

Also tested for windows-cross compilation with 6542566585/demo/main_voxtral.cpp

```
Loaded model.dll
audio_encoder loaded
C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so
Loaded model.dll
token_embedding loaded
C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so
Loaded model.dll
Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob
text_decoder loaded
Load latency (ms):
  audio_encoder: 1011.234
    archive extraction: 0.000
    .so loading: 1011.197
  token_embedding: 525.773
    archive extraction: 0.000
    .so loading: 525.704
  text_decoder: 3324.130
    archive extraction: 0.000
    .so loading: 3323.979
Run latency (ms):
  audio_encoder: 285.958
    audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000
  token_embedding: 6.676
    token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936
  text_decoder: 576.519
    text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526
Approved by: https://github.com/desertfire
2025-10-10 07:53:57 +00:00
d2cb183344 Revert "[inductor] verify determinism with inductor benchmark script (#164904)"
This reverts commit a3c700656f9a666eb33074b60333a23eb7e99a15.

Reverted https://github.com/pytorch/pytorch/pull/164904 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but there seems to be some failed vLLM failures coming out of this ([comment](https://github.com/pytorch/pytorch/pull/164904#issuecomment-3388443678))
2025-10-10 06:23:07 +00:00
38095fbd13 [2/N] More ruff SIM fixes (#165031)
This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-10 05:37:46 +00:00
ffc9559d9f [7/N] Apply ruff UP035 rule (#164653)
This PR is follow-up of #164438 to continue applying `UP035` rule. All changes are about proper `Callable` importation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164653
Approved by: https://github.com/aorenste
2025-10-10 05:16:17 +00:00
172d6ed8b8 Refactor _scaled_grouped_mm_cuda dispatch (#165060)
Summary:

* Clean & simplify different scaling recipe dispatch
* Split out recipes into separate dispatch functions

Test Plan:

```
pytest -svv -k grouped  test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165060
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
2025-10-10 04:44:25 +00:00
9a3c4b917e [CMake] Remove forcing of -O2 from torch_compile_options (#164894)
That was introduced by 75a65ffe0f
Hattip to @jathu for alerting me about the issue. As result, all our PyTorch builds were shipped with `-O2` for almost all of its modern history

Partially undo the damage introduced by https://github.com/pytorch/pytorch/pull/128406 that cause cross-ISA symbols leak, to be properly followed up in https://github.com/pytorch/pytorch/issues/165123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164894
Approved by: https://github.com/ezyang
2025-10-10 04:43:53 +00:00
df514a6d5a Revert "[inductor][eazy] change how torch.use_deterministic_algorithms affect inductor (#164905)"
This reverts commit 344e6365a0068c2d2847fcec0c55dd53291d475e.

Reverted https://github.com/pytorch/pytorch/pull/164905 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but there seems to be some failed vLLM failures coming out of this ([comment](https://github.com/pytorch/pytorch/pull/164905#issuecomment-3388258660))
2025-10-10 04:37:09 +00:00
48fe858fef Fix error, remove file from pyrefly checking (#165094)
Reported issue with formatting and parsing.

Removing suppressions and avoiding this file in future type checking until we can get a more complete fix in .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165094
Approved by: https://github.com/albanD
2025-10-10 04:34:51 +00:00
7ab00c7c17 Revert "Hotfix test scaled matmul cuda (#165104)"
This reverts commit 9aa92f246fa5fe5cfda17970d41d167b19a0612a.

Reverted https://github.com/pytorch/pytorch/pull/165104 on behalf of https://github.com/malfet due to Looks like it broke cuda tests, isn't it, see 44b1ff54e9/1 ([comment](https://github.com/pytorch/pytorch/pull/165104#issuecomment-3388247886))
2025-10-10 04:32:18 +00:00
44b1ff54e9 [CD] Do not propagate download.pytorch.org IP into container (#165075)
Followup after https://github.com/pytorch/pytorch/pull/164969

Should fix binary build test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165075
Approved by: https://github.com/seemethere, https://github.com/huydhn
ghstack dependencies: #164968, #164969
2025-10-10 04:27:29 +00:00
daea35df5c Revert "[CD] Do not propagate download.pytorch.org IP into container (#165075)"
This reverts commit 6d27a8e5093ee2a21d44dceeeffcb272e6e0f655.

Reverted https://github.com/pytorch/pytorch/pull/165075 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165075#issuecomment-3388228013))
2025-10-10 04:20:51 +00:00
7f2a902ea2 more sizelike deprecation (#164889)
remove expext_size c++ bindings and usages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164889
Approved by: https://github.com/mlazos
ghstack dependencies: #164884, #164885, #164886, #164887, #164888
2025-10-10 03:45:06 +00:00
9c057d9863 [BE] Refresh documentation for stable ABI / API (#163899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163899
Approved by: https://github.com/janeyx99
2025-10-10 03:26:28 +00:00
938869e7d3 [DTensor] Improve sharding propagation error msg in DTensor dispatch (#164623)
Fixes #164543

This PR improves the `__str__` method of DTensor's `OpSchema` to provide better readable error message when dispatch fails as the error message prints `{op_info.schema}`

example 1 `aten.embedding`
```
aten.embedding.default(Spec(f32[2048, 256](S(0))), Spec(i64[16, 2048](S(0)R))) on DeviceMesh((dp=2, tp=2), 'cuda', stride=(2, 1)))
```

example 2 `aten.mm`
```
aten.mm.default(Spec(f32[1024, 512](S(1))), Spec(f32[512, 256](S(0)))) on DeviceMesh((tp=4), 'cuda', stride=(1,)))
```

example 3 `aten._scaled_dot_product_flash_attention`
```
aten._scaled_dot_product_flash_attention.default(Spec(f16[8, 16, 128, 64](RS(1))), Spec(f16[8, 16, 128, 64](RS(1))), Spec(f16[8, 16, 128, 64](RS(1)))) on DeviceMesh((dp=2, tp=4), 'cuda', stride=(4, 1)))
```

Added test
```
python test/distributed/tensor/test_dtensor_ops.py -k test_embedding_error_msg
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164623
Approved by: https://github.com/zpcore
2025-10-10 03:16:04 +00:00
ce6b589545 Enable B904 check of flake8 (#165047)
The description of `B904` is `Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling. `

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165047
Approved by: https://github.com/Lucaskabela
2025-10-10 03:08:01 +00:00
ae25dd51fc Simplifying computation of the final result for equals op on DTensor (#164999)
Instead of collecting local results using all_gather_object followed by local reduction, with this change we switch to using a single all_reduce with MIN reduction operation to compute the final equals result.

This change is needed to enable LocalTensor work (all_gather_object introduces challenges in for DTensor and LocalTensor integration).

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164999
Approved by: https://github.com/ezyang
2025-10-10 03:01:28 +00:00
a61d0de9f9 [hop] support local_map filtered gradients (#164437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164437
Approved by: https://github.com/ezyang
ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602, #164431, #164433
2025-10-10 02:34:27 +00:00
3ad88924ad [hop] support local_map None placements (#164433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164433
Approved by: https://github.com/ezyang
ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602, #164431
2025-10-10 02:34:27 +00:00
3241b9c15f [hop] support local_map None gradients (#164431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164431
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602
2025-10-10 02:34:27 +00:00
25d4d5107e [dynamo] trace local_map with local shapes for AP (#163602)
Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0. And the description of the previous PR: https://github.com/pytorch/pytorch/pull/164340.

The previous PR adds the support on the HOP side for eager execution and AOTAutograd. Dynamo is still passing the HOP a subgraph with wrong shapes. This PR fixes that. This is similar to the HOP implementation, however we additionally need to manually keep the TensorVariable metadata in sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163602
Approved by: https://github.com/ydwu4
ghstack dependencies: #164296, #164321, #164419, #164420, #164340
2025-10-10 02:34:27 +00:00
e4fe811be8 [hop] trace local_map with local shapes in fake key (#164340)
Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0.

So for Autoparallel initial trace, we want to trace the graph with global shapes initially. But, for the local_map region, we are forced to trace with the expected local tensors. To the tracers, this looks weird, because it's a plain tensor input (representing DTensor's full tensor .to_local()) that we need to "redistribute".

After hacking a miserable version that had cross-key dependencies, @ydwu4 proposed this simpler approach to override the fake key. This means the shape conversion will be invisible to all dispatch keys above fake, this covers all current tracing mechanisms. This manifests as the joint graph for the HOP body being traced with local shapes:
```python
# HOP forward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
    def forward(self, primals_0: "f32[10, 80]"):
        # No stacktrace found for following nodes
        view: "f32[800]" = torch.ops.aten.view.default(primals_0, [-1]);  primals_0 = None
        add: "f32[800]" = torch.ops.aten.add.Tensor(view, 10);  view = None
        view_1: "f32[10, 80]" = torch.ops.aten.view.default(add, [10, 80]);  add = None
        return (view_1,)

# HOP backward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
    def forward(self, tangents_0: "f32[10, 80]"):
        # No stacktrace found for following nodes
        clone: "f32[10, 80]" = torch.ops.aten.clone.default(tangents_0);  tangents_0 = None
        return (clone,)
```

while the rest of the graph is still traced with global shapes:
```python
# Parent graph joint, note global shapes (80, 80)
class inner_f(torch.nn.Module):
    def forward(self, primals, tangents):
        primals_1: "f32[80, 80]"; tangents_1: "f32[80, 80]";

        primals_1, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
         # File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:597 in forward, code: return fn(x)
        call_local_map = torch._higher_order_ops.local_map.call_local_map(primals_1);  primals_1 = None
        getitem: "f32[80, 80]" = call_local_map[0];  call_local_map = None
        call_local_map_1 = torch._higher_order_ops.local_map.call_local_map(tangents_1);  tangents_1 = None
        getitem_1: "f32[80, 80]" = call_local_map_1[0];  call_local_map_1 = None
        return pytree.tree_unflatten([getitem, getitem_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164340
Approved by: https://github.com/ydwu4
ghstack dependencies: #164296, #164321, #164419, #164420
2025-10-10 02:34:27 +00:00
82c71af59a [hop] local_map validate partitioned fw/bw wrt placements (#164420)
Reviewed GPT-5 Summary:

**Summary / Goal**
Add validation that partitioned forward/backward graphs respect placements.

**Details**
- Validates placement alignment in local_map.
- The HOP's autograd key gets called when we are tracing the joint, we need to validate:
  - the inputs to the HOP's fwd gm (typically this is the dynamo rewritten inputs)
  - the inputs to the HOP partitioned fwd/bwd gm
  - the outputs of the HOP partitioned fwd/bwd gm

**Motivation**
Catch mismatch errors earlier, improve debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164420
Approved by: https://github.com/ezyang
ghstack dependencies: #164296, #164321, #164419
2025-10-10 02:34:27 +00:00
7bd704a346 [hop] local_map fix fw_gm/bw_gm naming (#164419)
Reviewed GPT5 summary:

**Summary / Goal**
Fix inconsistent variable naming for forward/backward graphs.

**Details**
- Those methods are actually for both fw and bw graphs now that we reuse the same op for fw/bw

**Motivation**
Improves clarity, avoids confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164419
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164296, #164321
2025-10-10 02:34:27 +00:00
ae139b73e0 [dynamo] Better error message for local_map subgraph mismatches number of inputs/outputs with placement info (#164321)
Reviewed GPT5 summary:

**Summary / Goal**
Improve error reporting when local_map subgraph input/output counts mismatch placement info.

**Details**
- Adds descriptive runtime error messages.

**Motivation**
Helps debug local_map misalignments.

```python
AssertionError: Expecting 2 inputs to local_map function based on placements, but found 1. If the count matches for eager, Dynamo may have flattened inputs to the function or found additional tensors used via closures. Please adjust the input placements to match what the traced graph sees:
class GraphModule(torch.nn.Module):
    def forward(self, l_args_0_: "f32[8, 8, 16]"):
         # File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:523 in mismatch_input, code: return x + scalar, scalar
        child: "f32[8, 8, 16]" = l_args_0_ + 10;  l_args_0_ = None
        return (child,)
        .
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164321
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #164296
2025-10-10 02:34:27 +00:00
cbaa07e438 [dtensor] add util to compute expected local sizes/strides for even sharding (#164296)
Reviewed GPT5 summary:

**Summary / Goal**
Add a utility to compute expected local tensor sizes and strides under *even sharding* in dtensor.

**Details**
- New function in `torch/distributed/tensor/_utils.py`.
- Computes local sizes/strides given global shape, mesh, and placements.
- Enforces divisibility of global dimension by mesh size (strict even sharding).
- Complements `compute_global_tensor_info`.

**Motivation**
Ensures correctness for stride/layout computations in distributed tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164296
Approved by: https://github.com/ezyang
2025-10-10 02:34:27 +00:00
bc0e2a0d2b Fix a condition error in torch/_inductor/codegen/debug_utils.py (#165033)
This PR fixes the condition
```
if arg_signatures is None and self.kernel_type == "cpp" or "extern"
```
which is interpreted as
```
if (arg_signatures is None and self.kernel_type == "cpp") or ("extern"):
```
and it is always evaluated to `True`. According to the context the intention was
```
if arg_signatures is None and (self.kernel_type == "cpp" or self.kernel_type == "extern"):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165033
Approved by: https://github.com/Skylion007
2025-10-10 02:20:00 +00:00
0747d95994 Add Loads from fixed inputs (#162031)
## TODO
Check on multi indices
```Python

    @cute.jit
    def score_mod(tSrS_ssa, b_idx, h_idx, q_idx, kv_idx, buffers):
        in_ptr4 = buffers[0]
        tmp0 = tSrS_ssa
        tmp1 = b_idx
        tmp2 = h_idx
        tmp3 = cute.make_fragment(1, cutlass.Int32)
        tmp4 = tmp3.store(32*tmp1 + tmp2)
        tmp5 = cute.make_fragment(1, cutlass.BFloat16)
        tmp6 = tmp3[0]
        tmp7 = tmp5[0] = (in_ptr4[tmp6])
        tmp8 = (tmp5.load()).to(cutlass.Float32)
        tmp9 = (tmp0 + tmp8)
        tSrS_ssa = tmp9

        return tSrS_ssa

 ```

I dont think that
```
        tmp4 = tmp3.store(32*tmp1 + tmp2)
        tmp5 = cute.make_fragment(1, cutlass.BFloat16)
        tmp6 = tmp3[0]
        tmp7 = tmp5[0] = (in_ptr4[tmp6]

```

 is right since this tmp6 value will be larger than the actual index dim int his case its B -> see if its possible to 1d index

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162031
Approved by: https://github.com/v0i0
ghstack dependencies: #161118
2025-10-10 01:23:37 +00:00
0a2cde2f06 Add Flash Attention support to FlexAttention (#161118)
Relies on this PR in Flash Attention: https://github.com/Dao-AILab/flash-attention/pull/1840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161118
Approved by: https://github.com/v0i0
2025-10-10 01:23:37 +00:00
c7b57d9349 Add gfx1100 to build target for ROCm docker builds (#165103)
Fixes issue of gfx1100 test jobs timing out

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165103
Approved by: https://github.com/jeffdaily
2025-10-10 01:18:56 +00:00
7614338b69 Revert "Add SVE128 ISA (#158932)"
This reverts commit 92284fb2ff44f09a9c7df0d8cf6cac9903e376a4.

Reverted https://github.com/pytorch/pytorch/pull/158932 on behalf of https://github.com/malfet due to Hmm, but from OSS point of view, this is a no-op ([comment](https://github.com/pytorch/pytorch/pull/158932#issuecomment-3387961238))
2025-10-10 01:17:02 +00:00
a6fa4f9c28 Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-10 00:15:00 +00:00
344e6365a0 [inductor][eazy] change how torch.use_deterministic_algorithms affect inductor (#164905)
Previously when torch.are_deterministic_algorithms_enabled() is True Inductor will
- skip autotuning pointwise kernels
- pick a fixed (and quite arbitrary) config for reduction

This PR change the behavior to
- for pointwise kernels, we still do autotuning
- for reduction kernels, we use the recent added heuristic to pick a config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164905
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532, #164904
2025-10-10 00:00:58 +00:00
a3c700656f [inductor] verify determinism with inductor benchmark script (#164904)
Verify the deterministic mode with torch.compile benchmark scripts.

Here is what my testing script does (pasted in the end):
- run a model in default mode, save it's result
- run the model again in default mode, but distort the benchmarking results. Compare it with the saved result.
- Do the above again in deterministic mode.

I tried to test a few modes
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchnmark result in the default mode. The non-determinism is gone in the deterministic mode
- DistillGPT2: I can not repro the numeric change by distorting the benchmarking result in the default mode. It does not surprise me much. Reduction order change does not always cause numeric change.

```
model=GoogleFnet

export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1

# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl

echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl

echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl

echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532
2025-10-10 00:00:58 +00:00
600db525bd [easy][while_loop] use copy_input instead of clone in _clone_aliased_inputs (#164955)
Compared with clone, ExternKernel.copy_input additionally realize the buffer, which downstream assumes the input buffer are realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164955
Approved by: https://github.com/BoyuanFeng
2025-10-09 23:39:00 +00:00
f6de195616 [dynamo][trace_rules] Add ao.quantization (#165069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165069
Approved by: https://github.com/tugsbayasgalan, https://github.com/mlazos
2025-10-09 23:08:42 +00:00
4a0df39f81 Symintify fused_scaled_matmul_reduce_scatter (#165086)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165086
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-10-09 23:07:40 +00:00
34ac9b61cb Revert "[export] Turn on install_free_tensors flag (#164691)"
This reverts commit 0e9b3a772ab96e998ab85591d5b2a9c1d41bacb0.

Reverted https://github.com/pytorch/pytorch/pull/164691 on behalf of https://github.com/izaitsevfb due to breaks tests internally, author asked to revert, see [D84230990](https://www.internalfb.com/diff/D84230990) ([comment](https://github.com/pytorch/pytorch/pull/164691#issuecomment-3387718323))
2025-10-09 22:53:50 +00:00
9aa92f246f Hotfix test scaled matmul cuda (#165104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165104
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-09 22:51:30 +00:00
a57a14868d Better handling of restore_state_dict (#164401)
After lean export, we might want to be able to restore the original fqn. This PR refactors one util function in export that sort of does this. Note that strict_export has some complicated logic of updating the graph signature as well which we don't want. I think we can gradually make this util more refined by handling constants, non persistent buffers etc and change how strict_export does it today.

Differential Revision: [D83687844](https://www.internalfb.com/diff/D83687844)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164401
Approved by: https://github.com/avikchaudhuri
2025-10-09 22:39:11 +00:00
47956196d9 Revert "Call internal log_compilation_event if it exists (#164855)"
This reverts commit 98a081a24c22072362dc536afd39a469e28939d4.

Reverted https://github.com/pytorch/pytorch/pull/164855 on behalf of https://github.com/albanD due to We should not land this kind of code in core ([comment](https://github.com/pytorch/pytorch/pull/164855#issuecomment-3387692988))
2025-10-09 22:38:45 +00:00
6d27a8e509 [CD] Do not propagate download.pytorch.org IP into container (#165075)
Followup after https://github.com/pytorch/pytorch/pull/164969

Should fix binary build test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165075
Approved by: https://github.com/seemethere, https://github.com/huydhn
ghstack dependencies: #164968, #164969
2025-10-09 21:59:31 +00:00
cd62a73dcb [cuDNN][SDPA] Handle noncontig nested tensors in cuDNN SDPA (#164958)
Previously we hardcoded the assumption in cuDNN that the inputs would be dense which breaks when e.g., the user is chunking tensors yielding noncontig inputs

New test added to check this  when `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` is set in `test/test_transformers.py`

One issue I noticed was that the old gating of nested tensor in `sdp_utils.cpp` seems to be a no-op? All of the inputs are reported as "dense" by the time that function is called in the nested tensor tests in `test/test_nestedtensor.py -k sdpa`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164958
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-10-09 21:58:54 +00:00
4d7f9f3aed Revert "[ATen] Fix CUDA reduction warp shuffle order (#164790)"
This reverts commit 8e1f409b8ccf64b2cf3933ece13587ad57e9d8a9.

Reverted https://github.com/pytorch/pytorch/pull/164790 on behalf of https://github.com/jeffdaily due to broke cuda and rocm ci ([comment](https://github.com/pytorch/pytorch/pull/164790#issuecomment-3387558806))
2025-10-09 21:36:10 +00:00
2b9ff99535 [flex attention] change "==" to "is" in inspect parameter comparison (#165003)
Patch for https://github.com/pytorch/pytorch/issues/164760.

This doesn't actually fix the underlying torch function issue though.

Explanation: `is` is traced differently compared to `__eq__`, so we end up avoiding the issue where we attempt to evaluate `torch.eq(tensor, inspect._empty)` in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165003
Approved by: https://github.com/mlazos
2025-10-09 21:18:05 +00:00
98a081a24c Call internal log_compilation_event if it exists (#164855)
Summary: For internal conda on mast jobs, call the internal version of log_compilation_event if it exists.

Test Plan: Ran a simple test job that just calls the API: https://fburl.com/scuba/dynamo_compile/dqx8d10g
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164855
Approved by: https://github.com/c00w
2025-10-09 21:15:11 +00:00
6c0125dbc0 Mark functions const in CUDACachingAllocator (#165007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165007
Approved by: https://github.com/eqy
2025-10-09 20:53:58 +00:00
0fd976b65c Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 20:49:46 +00:00
9944cac6e6 Add suppressions to torch/_inductor (#165062)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Split this directory into two PRs to keep them from being too large.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062
Approved by: https://github.com/oulgen, https://github.com/mlazos
2025-10-09 20:34:20 +00:00
e7fd296930 [CI] Add full debug build to trunk (#164974)
But not test, just import torch, as regression test for https://github.com/pytorch/pytorch/issues/164297

Test plan: Re-apply #164974 on top of this change and observer the failure in the workflows: https://github.com/pytorch/pytorch/actions/runs/18383302153/job/52375282838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164974
Approved by: https://github.com/seemethere, https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #164968, #164969
2025-10-09 20:12:16 +00:00
fac85fcfb5 [inductor] custom_graph_pass.get_hash_for_files: don't hash paths (#165020)
Summary: We have an internal user where caching broke because the paths that are unzipped are probably different per host. We can't think of a use case where a path change matters when the file content has not changed, so removing this part

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165020
Approved by: https://github.com/oulgen
2025-10-09 20:07:53 +00:00
228973df7f Fix channels-last dimension mapping in CUDA parallel_cat (#165023)
Fixes #164849
`dimension` was updated in-place, so for more than one batch of channels-last tensors the concat `dimension` for the second kernel launch was wrong

## Testing
- python -m compileall test/test_tensor_creation_ops.py

------
https://chatgpt.com/codex/tasks/task_e_68e708879b30832f89b10ae55faa68e8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165023
Approved by: https://github.com/ezyang
2025-10-09 20:04:32 +00:00
ed2d514ad8 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 724463d5a2fba369cd14e89215b84d1b01435df7.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to Not sure if it's related, but looks it triggered fuzzer compiler test failure, see a2f29bcd63/1 ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3387288464))
2025-10-09 19:53:38 +00:00
a2f29bcd63 [inductor] Remove Repeated Code in Subgraph (#164892)
Discovered some repeated code blocks in the subgraph.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164892
Approved by: https://github.com/PaulZhang12
2025-10-09 19:16:02 +00:00
5390324984 [CodeClean] Replace std::runtime_error with TORCH_CHECK (#164129)
As the title stated.

**Changes**:
- torch/csrc/Module.cpp
- torch/csrc/utils.cpp
- torch/csrc/stable
- torch/lib/libshm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164129
Approved by: https://github.com/albanD
2025-10-09 19:01:07 +00:00
ae25ec569c reorder wrappers in aot_stage2_inference to match forward compile in aot_stage2_autograd (#165016)
In aot_stage2_autograd:
Before calling fw_compiler, we run pre_compile for the following wrappers:
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper

After, we run post_compile for the following wrappers:
 * EffectTokensWrapper
 * AOTDispatchSubclassWrapper
 * FunctionalizedRngRuntimeWrapper
 * FakifiedOutWrapper

In aot_stage2_inference:
Before calling inference compiler, we run pre_compile for the following wrappers (same as above):
 * FakifiedOutWrapper
 * FunctionalizedRngRuntimeWrapper

After, we run post_compile for the following wrappers  (different than above):
 * FunctionalizedRngRuntimeWrapper
 * FakifiedOutWrapper
 * EffectTokensWrapper
 * AOTDispatchSubclassWrapper

This PR makes both do the post_compiles in the same order.

Differential Revision: D84213657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165016
Approved by: https://github.com/zhxchen17, https://github.com/bdhirsh
2025-10-09 18:36:04 +00:00
8e1f409b8c [ATen] Fix CUDA reduction warp shuffle order (#164790)
Typical warp shuffle reduction has the following pattern:
<img width="1138" height="501" alt="image" src="https://github.com/user-attachments/assets/3bd176dc-0ad2-4df6-90c7-06e467337166" />

which is exhibited in Triton generated by torch.compile:
<img width="663" height="403" alt="image" src="https://github.com/user-attachments/assets/7f9f36cd-b9eb-44c1-879e-b469668a2ea8" />

Switch the warp shuffle order to make bitwise equivalence between the 2 easier.
PTX difference between old and new, we see a few extra instructions: https://www.diffchecker.com/h6ly3INC/

Comparing the performance on different reduction operations, we see minimal differences. New represents the changes in this PR, old represents the past warp shuffle order:
```
Tensor Shape              Operation            New all dims (ms)       New dim=0 (ms)      New dim=1 (ms)     Old all dims (ms)    Old dim=0 (ms)      Old dim=1 (ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)              mean                 0.015817             0.016259             0.013642             0.015990             0.016258             0.013631
(1024, 1024)              sum                  0.015917             0.015906             0.013359             0.015707             0.016266             0.013226
(1024, 1024)              min                  0.016021             0.024625             0.015631             0.015761             0.024485             0.015317
(1024, 1024)              max                  0.016349             0.024971             0.015972             0.015771             0.025001             0.015314
(1024, 1024)              argmin               0.018070             0.024448             0.015578             0.018135             0.025370             0.015322
(1024, 1024)              argmax               0.018427             0.024859             0.015932             0.018164             0.024452             0.015639
(1024, 1024)              var                  0.020078             0.026413             0.020295             0.020199             0.026381             0.020214
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)              mean                 0.023826             0.023726             0.022273             0.023236             0.023776             0.022248
(2048, 2048)              sum                  0.023840             0.023355             0.021974             0.023294             0.023354             0.021884
(2048, 2048)              min                  0.024519             0.041263             0.024620             0.023292             0.041491             0.024358
(2048, 2048)              max                  0.024509             0.041670             0.024277             0.023334             0.041231             0.024395
(2048, 2048)              argmin               0.026125             0.041282             0.024567             0.026772             0.041773             0.024296
(2048, 2048)              argmax               0.026117             0.041487             0.024572             0.026412             0.041477             0.024273
(2048, 2048)              var                  0.026603             0.048581             0.031308             0.027587             0.048603             0.030860
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)              mean                 0.053927             0.057070             0.054073             0.053028             0.057544             0.053935
(4096, 4096)              sum                  0.053604             0.057410             0.054451             0.053076             0.057033             0.054266
(4096, 4096)              min                  0.054293             0.109122             0.058363             0.053821             0.108689             0.058382
(4096, 4096)              max                  0.054258             0.108035             0.058703             0.053492             0.110552             0.058376
(4096, 4096)              argmin               0.056805             0.111167             0.058301             0.056836             0.112325             0.058292
(4096, 4096)              argmax               0.056488             0.110958             0.058636             0.056844             0.111000             0.057928
(4096, 4096)              var                  0.058936             0.141755             0.068693             0.059735             0.141284             0.068500
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)              mean                 0.145552             0.148082             0.138647             0.145364             0.147818             0.138207
(8192, 8192)              sum                  0.145985             0.147900             0.138714             0.145755             0.148031             0.138616
(8192, 8192)              min                  0.146566             0.205359             0.192739             0.145611             0.205237             0.182335
(8192, 8192)              max                  0.146526             0.204844             0.193050             0.146073             0.205457             0.182697
(8192, 8192)              argmin               0.150190             0.206605             0.192543             0.150654             0.206847             0.182007
(8192, 8192)              argmax               0.150481             0.206368             0.192535             0.150845             0.206430             0.182022
(8192, 8192)              var                  0.150884             0.184546             0.203900             0.151594             0.184172             0.197983
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1, 1024, 128)            mean                 0.014293             0.008119             0.014533             0.013861             0.008022             0.014449
(1, 1024, 128)            sum                  0.014039             0.007877             0.014111             0.014219             0.008227             0.014045
(1, 1024, 128)            min                  0.014159             0.011354             0.023493             0.014271             0.010862             0.023644
(1, 1024, 128)            max                  0.014154             0.011027             0.023368             0.014259             0.011234             0.023692
(1, 1024, 128)            argmin               0.016403             0.005677             0.023328             0.016273             0.005683             0.024073
(1, 1024, 128)            argmax               0.016734             0.005675             0.023437             0.016580             0.005318             0.023331
(1, 1024, 128)            var                  0.018338             0.009549             0.025538             0.018528             0.009391             0.024777
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(5, 1024, 128)            mean                 0.014873             0.010131             0.015546             0.015123             0.010131             0.015481
(5, 1024, 128)            sum                  0.015334             0.009673             0.015824             0.014736             0.009671             0.015438
(5, 1024, 128)            min                  0.015047             0.013252             0.024573             0.014803             0.013163             0.024551
(5, 1024, 128)            max                  0.015050             0.013339             0.024197             0.014810             0.013525             0.024230
(5, 1024, 128)            argmin               0.017341             0.012737             0.024306             0.017471             0.012379             0.024991
(5, 1024, 128)            argmax               0.017345             0.012411             0.024421             0.017422             0.012471             0.024237
(5, 1024, 128)            var                  0.019973             0.011453             0.026188             0.020050             0.011438             0.026282
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10, 1024, 128)           mean                 0.016976             0.011575             0.016831             0.016722             0.011927             0.017173
(10, 1024, 128)           sum                  0.017039             0.011841             0.017159             0.016385             0.011860             0.016753
(10, 1024, 128)           min                  0.017036             0.015331             0.026770             0.016944             0.015205             0.027166
(10, 1024, 128)           max                  0.017369             0.015348             0.027077             0.016531             0.015716             0.026819
(10, 1024, 128)           argmin               0.019203             0.014447             0.026813             0.018994             0.014497             0.027313
(10, 1024, 128)           argmax               0.019563             0.014795             0.027140             0.019460             0.014912             0.026733
(10, 1024, 128)           var                  0.020529             0.014316             0.030405             0.020719             0.013960             0.029964
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100, 1024, 128)          mean                 0.045046             0.039168             0.046082             0.044839             0.039217             0.045782
(100, 1024, 128)          sum                  0.045094             0.039150             0.045777             0.044496             0.039542             0.046083
(100, 1024, 128)          min                  0.045768             0.054466             0.076244             0.044915             0.053943             0.076599
(100, 1024, 128)          max                  0.045748             0.054459             0.076188             0.044931             0.053949             0.076856
(100, 1024, 128)          argmin               0.048275             0.054046             0.076647             0.048694             0.054105             0.077004
(100, 1024, 128)          argmax               0.048267             0.054395             0.077401             0.048691             0.054131             0.076751
(100, 1024, 128)          var                  0.049710             0.043254             0.083077             0.050971             0.043251             0.082378
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000, 100)         mean                 0.202312             0.196723             0.197765             0.201774             0.196641             0.197459
(1000, 1000, 100)         sum                  0.202651             0.196682             0.197736             0.202175             0.196313             0.197523
(1000, 1000, 100)         min                  0.203022             0.264762             0.269200             0.202729             0.264129             0.268694
(1000, 1000, 100)         max                  0.202864             0.264396             0.269388             0.202486             0.263896             0.268720
(1000, 1000, 100)         argmin               0.226727             0.263781             0.268651             0.226597             0.264676             0.268983
(1000, 1000, 100)         argmax               0.226412             0.264469             0.269090             0.226570             0.264595             0.269178
(1000, 1000, 100)         var                  0.243223             0.204079             0.216096             0.241942             0.204079             0.215925
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10000, 100)              mean                 0.016193             0.020277             0.014316             0.016152             0.020324             0.013712
(10000, 100)              sum                  0.016289             0.020237             0.014034             0.016168             0.020265             0.013708
(10000, 100)              min                  0.016046             0.030872             0.019609             0.016208             0.030867             0.018627
(10000, 100)              max                  0.016369             0.030835             0.019257             0.016218             0.030861             0.018209
(10000, 100)              argmin               0.017957             0.031171             0.019517             0.018050             0.031556             0.018077
(10000, 100)              argmax               0.017961             0.031658             0.019521             0.018060             0.031564             0.018087
(10000, 100)              var                  0.020393             0.035652             0.019339             0.020144             0.035987             0.019171
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100000, 10)              mean                 0.015718             0.016576             0.016555             0.015999             0.016246             0.014869
(100000, 10)              sum                  0.015833             0.016247             0.016572             0.016007             0.016627             0.014872
(100000, 10)              min                  0.015888             0.020510             0.023920             0.015671             0.020821             0.021417
(100000, 10)              max                  0.015889             0.020479             0.023918             0.016077             0.020386             0.021421
(100000, 10)              argmin               0.018233             0.020863             0.023647             0.017574             0.020864             0.021103
(100000, 10)              argmax               0.017896             0.020527             0.023296             0.017569             0.020447             0.021098
(100000, 10)              var                  0.020005             0.024198             0.024372             0.020075             0.024167             0.022415
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 1023)        mean                 1.874816             1.963506             1.903909             1.873279             1.963859             1.903230
(1023, 1023, 1023)        sum                  1.875030             1.965716             1.902458             1.873566             1.960730             1.901642
(1023, 1023, 1023)        min                  1.878563             2.473455             2.179092             1.875174             2.482086             2.183027
(1023, 1023, 1023)        max                  1.879128             2.474803             2.178895             1.874831             2.482253             2.183884
(1023, 1023, 1023)        argmin               1.921800             2.476629             2.174831             1.923987             2.472641             2.170453
(1023, 1023, 1023)        argmax               1.922605             2.476688             2.177927             1.923366             2.472808             2.172979
(1023, 1023, 1023)        var                  1.972606             3.088695             2.758797             1.978679             3.095658             2.762243
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 255)         mean                 0.489984             0.500954             0.492957             0.489891             0.500654             0.491971
(1023, 1023, 255)         sum                  0.490228             0.500764             0.492289             0.489624             0.501089             0.492824
(1023, 1023, 255)         min                  0.491457             0.563560             0.553334             0.490355             0.564709             0.554754
(1023, 1023, 255)         max                  0.491396             0.563628             0.553345             0.490017             0.565004             0.554947
(1023, 1023, 255)         argmin               0.503666             0.561512             0.551831             0.503845             0.560972             0.551017
(1023, 1023, 255)         argmax               0.503602             0.561185             0.551407             0.504328             0.561267             0.551448
(1023, 1023, 255)         var                  0.510844             0.709452             0.701630             0.512693             0.710365             0.701965
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 377)         mean                 0.707439             0.727646             0.712019             0.706769             0.727101             0.711632
(1023, 1023, 377)         sum                  0.707780             0.727453             0.711554             0.706807             0.726656             0.711729
(1023, 1023, 377)         min                  0.709423             0.819809             0.794379             0.707847             0.822086             0.796664
(1023, 1023, 377)         max                  0.709297             0.819780             0.794308             0.707566             0.821913             0.796690
(1023, 1023, 377)         argmin               0.725028             0.817088             0.791695             0.726039             0.816445             0.790828
(1023, 1023, 377)         argmax               0.725301             0.817011             0.791420             0.726040             0.816917             0.791143
(1023, 1023, 377)         var                  0.740859             1.034165             1.006712             0.743413             1.035506             1.007638
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164790
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-09 18:08:30 +00:00
ee6a1ecb0a [ROCm] Enable MI355 CI on PRs, and run full set of UTs on PRs (#160215)
Useful to have PR testing for PRs such as https://github.com/pytorch/pytorch/pull/151360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160215
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-09 18:03:12 +00:00
3c0577bd15 Remove shared_ptr from MHAGraphCache (#164895)
This commit makes several cleanup changes to MHA.cpp, the main
one of which is removal of shared_ptr from MHAGraphCache as the
cache does not actually intend to share ownership. The changes are:

1. Remove shared_ptr from MHAGraphCache
2. Remove template arguments from MHAGraphCache
3. Remove unnecessary optional<shared_ptr<...>> vars
4. Change some functions with auto return type to the actual type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164895
Approved by: https://github.com/eqy
2025-10-09 17:44:28 +00:00
688efd9741 Revert "Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)"
This reverts commit 87eccf10e8484c9e59ef81ae7bdee68d3db4f605.

Reverted https://github.com/pytorch/pytorch/pull/164741 on behalf of https://github.com/malfet due to But it breaks MacOS builds, see https://github.com/pytorch/pytorch/actions/runs/18382886648/job/52373781138 ([comment](https://github.com/pytorch/pytorch/pull/164741#issuecomment-3386859778))
2025-10-09 17:30:25 +00:00
91040f4934 Revert "[Code Clean] Remove support of python3.9 (#163846)"
This reverts commit bc1690c7e859dee8c47a7f0bbd3c43cc27c6fd2a.

Reverted https://github.com/pytorch/pytorch/pull/163846 on behalf of https://github.com/izaitsevfb due to breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/163846#issuecomment-3386855437))
2025-10-09 17:27:08 +00:00
87eccf10e8 Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 16:45:31 +00:00
5d459dd609 avoid bit cast for bfloat16_t (#159946)
using bit_cast<bfloat16_t> triggers a static_assert, so replace it with intrinsics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159946
Approved by: https://github.com/aditew01, https://github.com/malfet
2025-10-09 16:42:49 +00:00
24d69c57cb Add view support for library custom Function (#164520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164520
Approved by: https://github.com/soulitzer, https://github.com/ezyang
2025-10-09 16:17:48 +00:00
eaa02655ea [CI] Run cpp tests on windows in one run_tests call (#164861)
The windows cpp tests take ~1 hour according to logs.  Each has run_test called on them individually, so I tried batching them together so it's just one run_test call for all of them.  I believe it now takes 30min.  I turned off TD since I don't think cpp tests are included in TD stuff.

As always with batch, I'm not sure if the errorlevel/error surfacing stuff is correct

This code is written with a lot of help from chatgpu and copilot
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164861
Approved by: https://github.com/huydhn
2025-10-09 16:07:28 +00:00
aea57b3aa3 AOTI MPS Shim Implementation (#163865)
## MPS Shim API

*   Updated MPS shimification API with handles and function declarations:
    *   `AOTIMetalShaderLibraryHandle` and `AOTIMetalKernelFunctionHandle` types
    *   Library management: `aoti_torch_mps_create_shader_library`, `aoti_torch_mps_delete_shader_library`, `aoti_torch_mps_get_kernel_function`
    *   Kernel execution: `aoti_torch_mps_run_command_block`, `aoti_torch_mps_start_encoding`, `aoti_torch_mps_dispatch` variants, etc

## MPS Shader Codegen

*   Modified to generate source constants instead of direct `DynamicMetalShaderLibrary` instantiation:
    *   **Before**: `at::native::mps::DynamicMetalShaderLibrary mps_lib_0(R"MTL(...)MTL");`
    *   **After**: `const char* mps_lib_0_source = R"MTL(...)MTL";`
*   Updated kernel call generation  to use shimified functions:
    *   Generates calls to shimified API instead of direct libtorch calls

## Before vs After Comparison

### Section 1: Shader Library
**Before (Direct Library Object)**
```cpp
at::native::mps::DynamicMetalShaderLibrary mps_lib_0(R"MTL(
    ...
)MTL");
```
**After (Source String)**
```cpp
const char* mps_lib_0_source = (R"MTL(
    ...
)MTL");
```

### Section 2: Getter Functions & RAII Management

**Before (Direct Library Access)**
```cpp
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}

AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
```

**After (Shim API + RAII Wrapper)**
```cpp
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static auto kernel_handle = []() {
        AOTIMetalShaderLibraryHandle lib_handle = nullptr;
        AOTIMetalKernelFunctionHandle kern_handle = nullptr;

        aoti_torch_mps_create_shader_library(mps_lib_0_source, &lib_handle);
        aoti_torch_mps_get_kernel_function(lib_handle, "generated_kernel", &kern_handle);

        // RAII wrapper with custom deleter
        auto lib_deleter = [](AOTIMetalShaderLibraryHandle h) {{
            if (h) aoti_torch_mps_delete_shader_library(h);
        }};

        using LibDeleter = decltype(lib_deleter);
        using LibPtr = std::unique_ptr<AOTIMetalShaderLibraryOpaque, LibDeleter>;

        // Return pair of kernel handle and library smart pointer for cleanup
        return std::make_pair(kern_handle, LibPtr(lib_handle, lib_deleter));
    }();
    return kernel_handle.first;
}
```

### Section 3: Runtime Execution

**Before (Direct Library Methods)**
```cpp
void AOTInductorModel::run_impl(...) {

    ...

    get_mps_lib_0()->runCommandBlock([&] {
        get_mps_lib_0()->startEncoding();
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 0, buf0);
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 1, arg0_1);
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 2, arg1_1);
        get_mps_lib_0()->dispatch({static_cast<uint64_t>(10LL)});

    });

    ...

} // AOTInductorModel::run_impl
```

**After (Shim API with Lambda Pattern)**
```cpp
void AOTInductorModel::run_impl(...) {

    ...

    auto mps_lib_0_lambda_0 = [&](AOTIMetalKernelFunctionHandle handle) {
        aoti_torch_mps_start_encoding(handle);
        aoti_torch_mps_set_arg_tensor(handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(handle, 1, arg0_1);
        aoti_torch_mps_set_arg_tensor(handle, 2, arg1_1);
        aoti_torch_mps_dispatch_single(handle, static_cast<uint64_t>(10LL));
    };

    std::function<void(AOTIMetalKernelFunctionHandle)> mps_lib_0_func_wrapper_0 = mps_lib_0_lambda_0;
    aoti_torch_mps_run_command_block(get_mps_lib_0_handle(), aoti_torch_mps_shared_callback, &mps_lib_0_func_wrapper_0);

    ...

} // AOTInductorModel::run_impl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163865
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-10-09 16:06:36 +00:00
3d1fa40ae1 Revert "[BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641)"
This reverts commit 64108bdbed2f099d527060b4c9fdd5a11cad2afc.

Reverted https://github.com/pytorch/pytorch/pull/164641 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164641#issuecomment-3386346474))
2025-10-09 15:42:51 +00:00
a7fa1a91e3 fix flex attention eager bwd: more rounding (#164317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317
Approved by: https://github.com/drisspg
ghstack dependencies: #163986
2025-10-09 15:40:49 +00:00
afeec56a5a Fix replacement reconstruct (#164937)
If we return Dtensor, the object is created via fx graph call so we never needed to reconstruct them. But if there is side effect, we do need to reconstruct it.

Differential Revision: [D84159000](https://our.internmc.facebook.com/intern/diff/D84159000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164937
Approved by: https://github.com/StrongerXi
2025-10-09 15:31:23 +00:00
724463d5a2 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 14:31:33 +00:00
f79e212733 Revert "[CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths (#163955)"
This reverts commit ab94a0d544503b5c27e889b45e45ef8cf75c8183.

Reverted https://github.com/pytorch/pytorch/pull/163955 on behalf of https://github.com/jeffdaily due to broke on cuda and rocm after landing though this PR had a clean signal initially ([comment](https://github.com/pytorch/pytorch/pull/163955#issuecomment-3386127145))
2025-10-09 14:24:56 +00:00
b28b24a9fc Switch build jobs that use linux.12xlarge to c7i (#164941)
This PR updates build jobs that currently use linux.12xlarge to the
c7i varient which should increase build times by 15% - 20% depending
on the job and reduce costs of these jobs by 10% - 15%.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-10-09 09:58:52 -04:00
17c7170ca6 Fix Avoid DDE in item numel check (#164934)
address https://github.com/pytorch/pytorch/issues/164725 and https://github.com/pytorch/pytorch/issues/164704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164934
Approved by: https://github.com/ezyang, https://github.com/aorenste, https://github.com/Skylion007
2025-10-09 13:09:06 +00:00
6a7f5c0d21 Add scaled_mm python API, test (#164142)
Summary:

* Add `torch.nn.functional.scaled_mm` as an abstraction around the C++
  methods
* Wraps `torch._scaled_mm_v2` API by default, but user can force use of
  the older `torch._scaled_mm` interface.
* Scaled MM tests now run on the new API

Test Plan:

`pytest test/test_scaled_matmul_cuda.py`

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlaytonmeta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164142
Approved by: https://github.com/drisspg
ghstack dependencies: #164141
2025-10-09 12:43:18 +00:00
512b6b59f0 Add _scaled_mm_v2 API (#164141)
Summary:

* Add new scaled-MM API to future-proof / clean-up existing code.
* Scaling is explicitly described rather than infer
* Swizzling of scaled must now be defined (vs. inferred)
* Adds API support for multi-level scaling
* Refactor dispatch logic to make it easier to add new implementations

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlaytonmeta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164141
Approved by: https://github.com/drisspg
2025-10-09 12:43:18 +00:00
bc1690c7e8 [Code Clean] Remove support of python3.9 (#163846)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163846
Approved by: https://github.com/ezyang
2025-10-09 11:54:10 +00:00
53f5af8c92 Update torch-xpu-ops commit pin (#164237)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@f30173](f301733b03), includes:

- Install xpu internal headers to PyTorch
- Fix error handling for BatchLinearAlgebra Ops
- Fix unnecessary double data type conversion
- Fix overflow when calculating workgroups count
- Fix segmentation fault and calculation error in AveragePool2dKernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164237
Approved by: https://github.com/EikanWang
2025-10-09 10:38:59 +00:00
4412026949 Revert "AOTI MPS Shim Implementation (#163865)"
This reverts commit 874efa2d72d83b00894097130f18062ce331a265.

Reverted https://github.com/pytorch/pytorch/pull/163865 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/163865#issuecomment-3385196387))
2025-10-09 10:26:01 +00:00
06d86e58d0 Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)"
This reverts commit d40a9bfb8da0dc1ac1e6e56b33a25979112874de.

Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3385056722))
2025-10-09 09:50:59 +00:00
874efa2d72 AOTI MPS Shim Implementation (#163865)
## MPS Shim API

*   Updated MPS shimification API with handles and function declarations:
    *   `AOTIMetalShaderLibraryHandle` and `AOTIMetalKernelFunctionHandle` types
    *   Library management: `aoti_torch_mps_create_shader_library`, `aoti_torch_mps_delete_shader_library`, `aoti_torch_mps_get_kernel_function`
    *   Kernel execution: `aoti_torch_mps_run_command_block`, `aoti_torch_mps_start_encoding`, `aoti_torch_mps_dispatch` variants, etc

## MPS Shader Codegen

*   Modified to generate source constants instead of direct `DynamicMetalShaderLibrary` instantiation:
    *   **Before**: `at::native::mps::DynamicMetalShaderLibrary mps_lib_0(R"MTL(...)MTL");`
    *   **After**: `const char* mps_lib_0_source = R"MTL(...)MTL";`
*   Updated kernel call generation  to use shimified functions:
    *   Generates calls to shimified API instead of direct libtorch calls

## Before vs After Comparison

### Section 1: Shader Library
**Before (Direct Library Object)**
```cpp
at::native::mps::DynamicMetalShaderLibrary mps_lib_0(R"MTL(
    ...
)MTL");
```
**After (Source String)**
```cpp
const char* mps_lib_0_source = (R"MTL(
    ...
)MTL");
```

### Section 2: Getter Functions & RAII Management

**Before (Direct Library Access)**
```cpp
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}

AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
```

**After (Shim API + RAII Wrapper)**
```cpp
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static auto kernel_handle = []() {
        AOTIMetalShaderLibraryHandle lib_handle = nullptr;
        AOTIMetalKernelFunctionHandle kern_handle = nullptr;

        aoti_torch_mps_create_shader_library(mps_lib_0_source, &lib_handle);
        aoti_torch_mps_get_kernel_function(lib_handle, "generated_kernel", &kern_handle);

        // RAII wrapper with custom deleter
        auto lib_deleter = [](AOTIMetalShaderLibraryHandle h) {{
            if (h) aoti_torch_mps_delete_shader_library(h);
        }};

        using LibDeleter = decltype(lib_deleter);
        using LibPtr = std::unique_ptr<AOTIMetalShaderLibraryOpaque, LibDeleter>;

        // Return pair of kernel handle and library smart pointer for cleanup
        return std::make_pair(kern_handle, LibPtr(lib_handle, lib_deleter));
    }();
    return kernel_handle.first;
}
```

### Section 3: Runtime Execution

**Before (Direct Library Methods)**
```cpp
void AOTInductorModel::run_impl(...) {

    ...

    get_mps_lib_0()->runCommandBlock([&] {
        get_mps_lib_0()->startEncoding();
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 0, buf0);
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 1, arg0_1);
        aoti_torch_mps_set_arg_tensor(get_mps_lib_0_handle(), 2, arg1_1);
        get_mps_lib_0()->dispatch({static_cast<uint64_t>(10LL)});

    });

    ...

} // AOTInductorModel::run_impl
```

**After (Shim API with Lambda Pattern)**
```cpp
void AOTInductorModel::run_impl(...) {

    ...

    auto mps_lib_0_lambda_0 = [&](AOTIMetalKernelFunctionHandle handle) {
        aoti_torch_mps_start_encoding(handle);
        aoti_torch_mps_set_arg_tensor(handle, 0, buf0);
        aoti_torch_mps_set_arg_tensor(handle, 1, arg0_1);
        aoti_torch_mps_set_arg_tensor(handle, 2, arg1_1);
        aoti_torch_mps_dispatch_single(handle, static_cast<uint64_t>(10LL));
    };

    std::function<void(AOTIMetalKernelFunctionHandle)> mps_lib_0_func_wrapper_0 = mps_lib_0_lambda_0;
    aoti_torch_mps_run_command_block(get_mps_lib_0_handle(), aoti_torch_mps_shared_callback, &mps_lib_0_func_wrapper_0);

    ...

} // AOTInductorModel::run_impl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163865
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-10-09 09:28:10 +00:00
e09fb44ef1 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit d386325ca9a142419f45b987391f4bb175dd7d0b.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3384769092))
2025-10-09 08:40:52 +00:00
5b8174bc28 Revert "[vllm hash update] update the pinned vllm hash (#164628)"
This reverts commit 7b691546d2949790ffc8f6bd3c674faa6a46ff7c.

Reverted https://github.com/pytorch/pytorch/pull/164628 on behalf of https://github.com/huydhn due to There are some broken vLLM tests ([comment](https://github.com/pytorch/pytorch/pull/164628#issuecomment-3384560957))
2025-10-09 07:43:02 +00:00
5209c8ce07 Revert "Fix Avoid DDE in item numel check (#164934)"
This reverts commit a9a9a3438a374f96a308b707a1718036aaec790d.

Reverted https://github.com/pytorch/pytorch/pull/164934 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164934#issuecomment-3384390621))
2025-10-09 06:57:03 +00:00
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR adds unused parameter name comments in C++ declarations to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
a753ffa9af Revert "Use runner with more memory for ASAN builds (#165000)"
This reverts commit f5fd18f7e24378bd9eb91404f697f1c81a8187d5.

Reverted https://github.com/pytorch/pytorch/pull/165000 on behalf of https://github.com/izaitsevfb due to not sure how, but this broke lint ([comment](https://github.com/pytorch/pytorch/pull/165000#issuecomment-3384286412))
2025-10-09 06:22:28 +00:00
a9a9a3438a Fix Avoid DDE in item numel check (#164934)
address https://github.com/pytorch/pytorch/issues/164725 and https://github.com/pytorch/pytorch/issues/164704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164934
Approved by: https://github.com/ezyang, https://github.com/aorenste, https://github.com/Skylion007
2025-10-09 06:06:25 +00:00
263db92563 Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions (#164591)
This PR includes a couple of changes to extend FlightRecorder dump by PyTorch watchdog

- New knobs to control FR dump as suggested in the public documentation even for watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error
  (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload progress for post-mortem analysis

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164591
Approved by: https://github.com/fduwjj
2025-10-09 05:33:35 +00:00
ed6156e3ea non-fb impls + unit tests (#164722)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Differential Revision: D83714692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164722
Approved by: https://github.com/NikhilAPatel, https://github.com/adamomainz
2025-10-09 05:10:57 +00:00
d40a9bfb8d Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
2025-10-09 04:49:44 +00:00
e532f62e0d Introduce joint_custom_pass callback (#164981)
```
        def joint_custom_pass(joint_gm: torch.fx.GraphModule, joint_inputs):
           # apply your pass for joint graph here

            return joint_gm

        class M(torch.nn.Module):
            def forward(self, x):
                return x.sin()

        x = torch.randn(10, requires_grad=False)
        compiled_fn = torch.compile(M(), backend="aot_eager")

        with torch._functorch.config.patch("joint_custom_pass", joint_custom_pass):
            out = compiled_fn(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164981
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2025-10-09 04:40:54 +00:00
1f73b96668 [PGO] log missing sources in allowlist (#164881)
Summary:
- logs missing dynamic sources
- emits MLHub insight only on size mismatch recompiles

Test Plan: test_pgo

Differential Revision: D84098898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164881
Approved by: https://github.com/bobrenjc93
2025-10-09 04:39:09 +00:00
7b691546d2 [vllm hash update] update the pinned vllm hash (#164628)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164628
Approved by: https://github.com/pytorchbot
2025-10-09 04:35:36 +00:00
f05e23e1bc Add less warps config to inner reductions (#162447)
Add less warps to ensure proper vectorization + memory coalescing for inner reductions, prefer more work per thread

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
2025-10-09 04:22:16 +00:00
d386325ca9 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 04:22:03 +00:00
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
ab94a0d544 [CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths (#163955)
As per title. Additionally, some Lt selection conditions are revisited, and some redundancy removed (especially in the ROCm vs non-ROCm paths).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163955
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-09 04:07:45 +00:00
0e9b3a772a [export] Turn on install_free_tensors flag (#164691)
The final step in removing the discrepancy between
torch.compile(fullgraph=True) and torch.export(strict=True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #164721
2025-10-09 03:25:15 +00:00
af7ca55ced [export][dynamo] Fallback to slowpath for MultiHeadAttention for strict export (#164721)
In https://github.com/pytorch/pytorch/pull/106824, export decided to slow-path for MultiHeadAttention module (look into the PR description as to why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict-export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at this time the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes a fast path, resulting in a different op being called.

This divergence is undesirable. There are 2 ways to fix it

1) Make export take the fast path - As explained in the https://github.com/pytorch/pytorch/pull/106824 , this might be difficult. So, we go to (2)
2) Make compile as well take the slow path - This is easy to implement. The con here is that Pytorch eager and compile will use different operators, which can cause numerics issues etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in  https://github.com/pytorch/pytorch/issues/164062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2025-10-09 03:25:15 +00:00
a029675f6f More ruff SIM fixes (#164695)
This PR applies ruff `SIM` rules to more files. Most changes are about simplifying `dict.get` because `None` is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695
Approved by: https://github.com/ezyang
2025-10-09 03:24:50 +00:00
54ae61c573 Change test_emulate_precision_casts_mean_ratio_chain from gelu to relu (#164997)
gelu can be instable on local builds due to libdevice differences, as we lower to libdevice.erf. That combined with the semantics in the test can lead to catastrophic cancellation. We switch this test from gelu to relu to fix this instability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164997
Approved by: https://github.com/eellison, https://github.com/jansel
2025-10-09 03:14:05 +00:00
2fe37b5fde [RecSys][Combo Kernel] skip combo kernel generation if parition group is empty (#164918)
Summary: Noticed sometimes the combo kernel partition will contain empty group. Skip kernel generation in this case to unblock head model launching. The change in this diff is safe, but it's better to root cause why empty group is being created.

Test Plan:
Lowering passed after applying the diff

Differential Revision: D84134471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164918
Approved by: https://github.com/mlazos
2025-10-09 02:55:23 +00:00
96d91da792 [dynamo] allow placement subclass to be traceble (#164985)
This pr is to unblock SimpleFSDP+`gradient_divide_factor` [here](https://github.com/pytorch/torchtitan/pull/1793). We will need to create a subclass for DTensor `Partial` placement. When tracing `SimpleFSDPPartial`, I hit the assertion error that `SimpleFSDPPartial` is not in `ok_types`. I'm updating the code to check placement dtype via `isinstance` instead of `type(val)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164985
Approved by: https://github.com/ezyang, https://github.com/eellison
2025-10-09 01:44:21 +00:00
f5fd18f7e2 Use runner with more memory for ASAN builds (#165000)
An attempt to [address OOM here](aed5ed1076/1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165000
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/huydhn
2025-10-09 01:09:28 +00:00
8ca986ee60 [fr] Enable reset the FR recording for fault tolerance (#164988)
We also want to have a python side API for users to reset FR recording for FR entries. We don't need to reset the PGNCCL's member counter since we are creating new PGNCCL anyway. FR is a global ring buffer, so we need to reset it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
2025-10-09 01:03:01 +00:00
81dbeb06f4 CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165013)
Since we have introduced CUDA aarch64 builds for all cuda versions we need to remove this constraint.
This was missed by https://github.com/pytorch/pytorch/pull/162364

Proper constraint on triton should be:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux"
```

not:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux" and platform_machine == "x86_64"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165013
Approved by: https://github.com/Camyll, https://github.com/nWEIdia, https://github.com/tinglvv
2025-10-09 00:49:28 +00:00
7a1ead755f [DeviceMesh] Add a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)
As title, we want to add a deprecate warning for slicing flattened dim from root mesh. Also cosmetic changes for adding types for `_get_slice_mesh_layout`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164993
Approved by: https://github.com/fegin
ghstack dependencies: #164750, #164954
2025-10-09 00:47:08 +00:00
90b4e130d6 [Benchmark] cleanup torchbench models (#164816)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes torchbench models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which removes timm and huggingface models from torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164816
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-09 00:31:25 +00:00
4308b8a28f [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

What does not work?
* We still have to set the context manager `torch.fx.traceback.preserve_node_meta()` in the user code because CI was unhappy. This can be fixed but with some perseverance.
* This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-08 22:41:00 +00:00
94b1ec8c7c [BE] Use torch check the way its intended (#164987)
Replace
`if (!foo) TORCH_CHECK(false, "bar");` with `TORCH_CHECK(foo, "bar");`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164987
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-08 22:28:08 +00:00
054268c9eb Consider collective inputs to be deallocated only when wait is completed (#164945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164945
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944
2025-10-08 22:19:25 +00:00
af40828bbb Limit coll bucketing within node idxs (#164944)
Respect max_coll_distance from overlap scheduler in bucketing, also, add an optimization in path searching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164944
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783
2025-10-08 22:18:53 +00:00
5a1fbf45ad [ez] remove unnecessary wrapper (#164720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164720
Approved by: https://github.com/ydwu4
2025-10-08 22:12:29 +00:00
aed5ed1076 Refactor memory estimator to use node storages, add test (#164783)
- Update the Memory Estimator to use node storages for analysis, which simplifies book keeping, as opposed to manually looking at operator schema. This will also allow me to reuse this component elsewhere.

- Factor out into separate class, so that this same logic can be used  in scheduling (node allocations / aliasing / uses)

- Adds Tests for correctness - right now only on fwd/bwd by itself, not with both.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164783
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #164738
2025-10-08 22:07:43 +00:00
af4c29fea8 [dynamo, nested graph breaks] fix nested step graph break related issues (#162737)
Turns out codegen'ing a nested step graph break is significantly more complicated than first thought. The optimized function should actually do:
- call graph/load values/do side effects etc.
- call into the leaf's resume function, but skipped (this essentially step graph break function for just the leaf function)
- call into all the other resume functions, traced.

This PR also adds `torch._dynamo.step_unsupported()`, which can be used for internal testing purposes to better test step graph break handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162737
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160601
2025-10-08 22:02:52 +00:00
486b4d2414 [dynamo, nested graph breaks] move cell codegen before side effects codegen (#160601)
This is needed because if we codegen cells for nested frames AFTER side effects, then reconstruction could get messed up. From below:

>The added test case demonstrates the reconstruction failure if we kept cell codegen at the original place (only happens with nested graph breaks since we reconstruct nested frame cells from VariableTracker rather than directly using LOAD_CLOSURE).

>At a high level, what happened before this change was that side_effects was pruning the cells (I don't recall exactly why this happens), and because cells were codegen'd after the side effects were applied, we were unable to properly reconstruct the cell. The error I was seeing was a list/tuple IndexError.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160601
Approved by: https://github.com/mlazos
2025-10-08 22:02:52 +00:00
8f83b3e71c add device generalization support for distributed checkpoint tests (#159242)
## MOTIVATION
To generalize Distributed checkpoint test cases for non-CUDA devices

## CHANGES
18 test files with minimal device abstraction changes updated in
test/distributed/checkpoint/

- Use device_type from DTensorTestBase wherever appropriate
- Replaced hard coded device names with torch.accelerator.current_accelerator()
- extend multi gpu decrator for other devices

test/distributed/checkpoint/test_state_dict_stager.py has large diff, that's because i changed the name cuda_obj  to gpu_obj. Functional change is minimum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159242
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-10-08 21:56:31 +00:00
f0c9f3bddb [PP] [BE] Remove runtime tests (#164962)
BE cleaning up dead code since we migrated the Multi-stage schedules to use schedule execution runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164962
Approved by: https://github.com/Skylion007
ghstack dependencies: #162016
2025-10-08 21:42:33 +00:00
1d182dd81c [MPS] sparse norm (#164961)
Norms for sparse mps tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164961
Approved by: https://github.com/malfet
2025-10-08 21:41:42 +00:00
0b15f7ae05 [fr] Enable dynamic path write for FR dump when it comes to torchft (#164752)
When it comes to FR dump, in the case of fault tolerance, users want to set the dump path to a different one when there is restart, so we just enable this case for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164752
Approved by: https://github.com/tushar00jain
2025-10-08 21:36:32 +00:00
f1229b6db9 [BE] Remove manual IP address resolution (#164969)
As https://github.com/pytorch/pytorch/issues/100400 has been closed a while back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164969
Approved by: https://github.com/seemethere
ghstack dependencies: #164968
2025-10-08 21:22:34 +00:00
b1ac252f55 [Replicate][Test] tests that pp model grads are the same as single-device model grads (#164890)
**Summary:** Created a test so that we can verify that a model that has been pipelined + replicated has the same gradients as a reference model. To do this, I mapped the layers and their parameters in each partial model to the original full model and then compared the gradients.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp_grads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164890
Approved by: https://github.com/H-Huang
2025-10-08 21:07:05 +00:00
5ba11df4f8 [DeviceMesh] Make all members of DeviceMesh private and add public access API (#164954)
This is mostly mechanical change which make device mesh members all private and use a public property API instead. This is not a BC breaking change since the new API still guarantee BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164954
Approved by: https://github.com/fegin
ghstack dependencies: #164750
2025-10-08 21:04:07 +00:00
15800888b6 [CI] Print GPU info during setup linux (#164968)
I.e. run `nvidia-smi` if present

Helps detecting what driver version this runner is on, which would have helped debugging some of the issues recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164968
Approved by: https://github.com/ngimel
2025-10-08 20:58:33 +00:00
e7ed1a00eb Run inductor-perf-test-nightly-h100 once per day (#164967)
To reduce inductor costs, though I'm not sure how much this one matters specifically since h100s are reserved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164967
Approved by: https://github.com/BoyuanFeng
2025-10-08 20:58:19 +00:00
2982406721 [inductor] ban benchmarking by default in deterministic mode (#164532)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164532
Approved by: https://github.com/eellison
ghstack dependencies: #164801
2025-10-08 20:55:15 +00:00
005c3d449e Support custom callback functions in schedule (#162016)
This is going to be used in https://github.com/pytorch/torchtitan/issues/1682

Add a `register_custom_function` to the `_PipelineScheduleRuntime` which allows users to implement any custom function to replace the runtime operation dynamically.

The signature of the callback should look like:

```python
class _CustomFunctionProtocol(Protocol):
    def __call__(self, action: _Action, ctx: _PipelineContext) -> None: ...
```

`_PipelineContext` contains a reference to the schedule which is executing the operations.

### Testing

Added a test which adds custom methods for `FORWARD` and `OVERLAP_F_B` which are just the same implementations as those used in the default schedule runtime. Check that the schedule can still run, numerics are correct, and the callbacks are executed the correct number of times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162016
Approved by: https://github.com/fegin
2025-10-08 20:43:26 +00:00
b2b3947565 [DeviceMesh] Remove private _set_mesh_dim_group_options API (#164750)
We allow passing in PG option via https://github.com/pytorch/pytorch/pull/159371 and we did a clean up of Meta internal usage of `_set_mesh_dim_group_options`, since this a private API, we don't have any bc guarantee, we want to directly remove so that people use the new behavior from now on.

Also since we now allow passing pg in both DeviceMesh constructor and flatten API, so that we also want to get rid of the global pg option override variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164750
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-08 20:38:17 +00:00
81994b08a0 [inductor] don't tune xblock for reduction (#164801)
It turns out that tuning XBLOCK for a reduction can also change numerics ( https://github.com/pytorch/pytorch/pull/164525#pullrequestreview-3306235454 ).

The PR skip tuning XBLOCK for a reduction. If we have multiple configs left with different XBLOCKs, the heuristic will pick the configs with second-largest XBLOCK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164801
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/v0i0
2025-10-08 20:31:39 +00:00
71aefd5595 [reland] Allow setting grad_dtype on leaf tensors (#164751)
ghstack-source-id: e44b3941530be83a630ec93f1478eec741ffca2e
Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/162815

Fixes #ISSUE_NUMBER

Relanding due to internal weirdness. Separate PR to codev w/o ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164751
Approved by: https://github.com/albanD
2025-10-08 20:23:13 +00:00
001e1d2637 Add memory estimator (#164738)
Original work by @ShatianWang, with lints applied. I am going to a few changes and add tests in subsequent prs but I want to preserve original commit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738
Approved by: https://github.com/IvanKobzarev
2025-10-08 20:04:33 +00:00
e0cb1848d0 Use TMA loads always for Triton grouped MM kernel (#164256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164256
Approved by: https://github.com/ngimel
2025-10-08 19:40:06 +00:00
a4110fedcf Use insert_or_assign instead of erase+emplace (#164868)
insert_or_assign does effectively the same thing as
erase+emplace but more efficiently since the search
does not need to be repeated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164868
Approved by: https://github.com/eqy
2025-10-08 19:13:49 +00:00
37c6087334 Add split-K control to cuBLAS reduced-precision settings (#164766)
## Summary
- add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options
- extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls
- update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations

## Testing
- python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v *(fails: ModuleNotFoundError: No module named 'psutil')*

------
https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766
Approved by: https://github.com/malfet, https://github.com/albanD
2025-10-08 18:48:45 +00:00
0b85236477 Fix refine_ranges corner case (#164075) (#164846)
Summary:
address https://github.com/pytorch/pytorch/issues/161360

u0>0 should update the range of u0 to start from [1, ..] this fix it. it was not doing that.

Test Plan: contbuild & OSS CI, see 27234792ad

D84038721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164846
Approved by: https://github.com/izaitsevfb, https://github.com/ezyang
2025-10-08 18:42:37 +00:00
4c0fec3e4d [Max Autotune][B200] Skip carveout tests (#164435)
Summary: Skip sm `carveout` tests on B200, as carveout is currently unsupported.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_honor_sm_carveout_with_triton_tma
```

Differential Revision: D83395610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164435
Approved by: https://github.com/eellison
2025-10-08 18:39:43 +00:00
cyy
fdc622b513 [CMake] Remove LLVM link code (#134940)
This handling is not needed no recent LLVM APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134940
Approved by: https://github.com/ezyang, https://github.com/malfet
2025-10-08 18:39:16 +00:00
91b9484264 [ez] fix small doc error (#164915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164915
Approved by: https://github.com/svekars
2025-10-08 18:27:44 +00:00
5c827a4133 [SymmMem] Multi-root tile reduction (#164757)
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.

- The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API.

- Currently supports NVLink SHARP scope only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: #162243
2025-10-08 17:28:00 +00:00
83458197d1 [Benchmark] remove old timm models from benchmark (#164805)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for timm models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 60 to 14 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164805
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 17:14:58 +00:00
0b01ff4de0 [ROCm] Improve non stride-one backwards indexing for small index sets (#164409)
This patch fixes a performance problem which occurs when a small set of indices is used and there are practically no duplicates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164409
Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily
2025-10-08 17:04:52 +00:00
01f3a43462 [MPS] Update OS version in error message (#164946)
Followup after https://github.com/pytorch/pytorch/pull/159912
Fixes https://github.com/pytorch/pytorch/issues/164943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164946
Approved by: https://github.com/Camyll
2025-10-08 16:43:50 +00:00
f332017294 C++ API handle optimizer defaults (#161825)
Fixes #141884

This fixes the issue for all optimizers and parameter options.
A member function `overwrite_from` is added to the optimizer base class. Each optimizer then implements this function for comparing their accepted parameters to defaults. A SFINAE approach to handle the different optimizer parameters generically (in optimizer.h only) was evaluated, but I think this is easier to review and maintain.

This mirrors the Python API up to one edge case. An example of the edge case is provided below.

Python can distinguish between 1) Key not present in dict = "not specified"  and 2) Key present in dict = "explicitly set". The C++ implementation cannot.
The issue hinges on whether or not to track if a particular parameter was set by the user explicitly or not (discrepancy in the case when the constructor default is explicitly passed in).

To track this seems like it will take more intervention than would be worth it (modify TORCH_ARG to keep track, use std::optional for the parameter types, use bitset tracking) and was not pursued in the current PR. I'm happy to alter the design if appropriate.

### Example of edge case hinging on CONSTRUCTOR DEFAULTS vs OPTIMIZER DEFAULTS

1. CONSTRUCTOR DEFAULTS:
   These are the values you get when calling AdamOptions()
   AdamOptions().lr() = 0.001
   AdamOptions().weight_decay() = 0
   AdamOptions().eps() = 1e-08

2. OPTIMIZER DEFAULTS:
   These are the values the user chose when creating the optimizer
   User's optimizer defaults:
   optimizer.lr() = 0.005
   optimizer.weight_decay() = 0.1
   optimizer.eps() = 1e-07

3. THE PROBLEM SCENARIO:
   User wants to add a parameter group with explicit weight_decay=0.0
   User sets: weight_decay(0)

4. THE CONFUSION:
   Constructor default weight_decay: 0
   User's explicit weight_decay:     0
   Are they equal? YES

   Since they're equal, our overwrite_from() logic thinks:
   "User didn't set weight_decay explicitly, use optimizer default"

5. CURRENT BEHAVIOR:
   Final weight_decay: 0.1
   User expected:      0
   Match?  NO

=== KEY INSIGHT ===
Constructor defaults are built into the C++ class definition.
Optimizer defaults are chosen by the user at runtime. We want to respect the user intention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161825
Approved by: https://github.com/janeyx99
2025-10-08 16:40:45 +00:00
0a3e4e894c [PP]: Optimize memory by early releasing stage inputs' gradients (#164329)
Seems that we can release input activations' gradients early in `stage_backward()` in PP, which helps to reduce the peak memory.

I tested this using `1F1B` and `Interleaved1F1B` PP strategy (for simplicity, I use 4 decoder layers of llama3, set PP size to 2 and set num_microbatches to 128)  based on torchtitan
run command using torchtitan:
```bash
CUDA_VISIBLE_DEVICES=4,5 LOG_RANK=0,1 NGPU=2 CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml ./run_train.sh --metrics.log_freq 1  --training.seq_len 8192 --training.steps 10 --parallelism.data_parallel_shard_degree 1 --activation_checkpoint.mode full --model.tokenizer_path /workspace/torchtitan-v0.1.0/torchtitan/torchtitan/datasets/tokenizer/original/tokenizer.model --tr
aining.dataset wikipedia  --parallelism.pipeline_parallel_degree 2  --training.local_batch_size 128 --parallelism.pipeline_parallel_microbatch_size 1 --training.dataset_path /workspace/wikipedia_subset --training.seed 42 --parallelism.pipeline_parallel_schedule 1F1B
```
## 1F1B torchtitan train results
### before fix
<img width="1526" height="606" alt="b8e281cce1dac15e827c216e7d83f402" src="https://github.com/user-attachments/assets/545c0a80-6276-40c0-893f-fd2df0a53b8d" />

### after fix
<img width="1526" height="594" alt="70d5ceba311a8398d041189bf8897cfc" src="https://github.com/user-attachments/assets/0d606e08-238a-4115-a1c0-b40df101d867" />

after fix, the memory usage on rank1, i.e., non first stages saving 6.9GB compare to before fix. the memory usage on rank0 remains unchanged (rank0 represents stage0)

## Interleaved1F1B torchtitan train results
### before fix
<img width="1514" height="601" alt="a28b7f9704b9234870619c43194e8a72" src="https://github.com/user-attachments/assets/2c28565f-ffff-4747-a8f5-722b5c65dc7e" />

### after fix
<img width="1526" height="621" alt="2d8d6d956b72885186f8c7059146c41a" src="https://github.com/user-attachments/assets/8c4a4ff2-336b-4e0b-8ac4-014ae22c2ed1" />

after fix, the memory usage on rank1 saving 14.57GB (rank1 holds layer1 and layer3) and rank0 saving 7.5GB (rank0 holds layer0 and layer2)

## Memory snapshot results
also, I have dumped the memory snapshot to observe the memory under the 1F1B PP strategy.

### before fix
<img width="1906" height="918" alt="6fd4e4ba82b8bacf9ca6edee4f3d5581" src="https://github.com/user-attachments/assets/d1b9245c-b09f-43c5-87ce-87ba48533a70" />

we can see the memory is increasing as pp step_microbatches running. (the lifetime of input activation's gradient, i.e., the output of `FusedRMSNormBackward`  lasts too long)

### after fix
<img width="1903" height="918" alt="2e415f25af6750d06e5e647683b212b9" src="https://github.com/user-attachments/assets/b657c8f6-5a56-46bd-8743-f3b8375c81b0" />

after fix, we got more steady memory usage during training. (the input activation's gradient will be released or return allocator soon)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164329
Approved by: https://github.com/H-Huang
2025-10-08 16:12:00 +00:00
73adac05d1 Triton 3.5.x pin update to 7416ffc (#164587)
Updates triton pin to latest: https://github.com/triton-lang/triton/commits/release/3.5.x/

This updates contains 1 cherry-pick to fix flex_attention_fwd regression on B200:
- https://github.com/triton-lang/triton/pull/8366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164587
Approved by: https://github.com/atalman
2025-10-08 16:07:18 +00:00
eqy
0d39ecb2ce [cuDNN][RNN] cuDNN RNN supports BFloat16 inputs since 9.13 (#164411)
seems to work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164411
Approved by: https://github.com/Skylion007
2025-10-08 15:26:50 +00:00
90c0825e2d [GHF] Allow reverts from pytorch-auto-revert app (#164911)
This is a bit weird, but author_login is not a unique field, but author_url is.

Explicitly allow https://github.com/apps/pytorch-auto-revert to issue revert commands

Update mocks by running
```
sed -i -e s/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/47860a8f57a214a426d1150c29893cbc2aa49507f12b731483b1a1254bca3428/ gql_mocks.json
```

Test plan: Run
```python
from trymerge import GitHubPR
pr=GitHubPR("pytorch", "pytorch", 164660)
print(pr.get_last_comment().author_url, pr.get_comment_by_id(3375785595).author_url)
```
that should produce
```
https://github.com/pytorch-auto-revert https://github.com/apps/pytorch-auto-revert
```
Plus added a regression test that checks two particular comments for revert validity

`pytorch-auto-revert` user is my alter ego :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164911
Approved by: https://github.com/jeanschmidt
2025-10-08 15:15:45 +00:00
fd4bde430a Revert "list_stored_sd_metadata API. (#160610)"
This reverts commit da903b6a8be422529d47649e89c0d50bb95c37ca.

Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022))
2025-10-08 15:10:38 +00:00
b5e93ffdcf Revert "Limit path search within range (#164581)"
This reverts commit 415e641572473479fc9d9eaea12762e1a223a9e0.

Reverted https://github.com/pytorch/pytorch/pull/164581 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f8d0d65ddc Revert "Add memory estimator (#164738)"
This reverts commit ab01a0d7d352e7fd07989b8d6bf035bf82aea74e.

Reverted https://github.com/pytorch/pytorch/pull/164738 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f46ddb1e65 [ROCm][CI] add gfx1150 gfx1151 to docker images for binary builds (#164854)
Fixes #164346.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164854
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-08 14:34:22 +00:00
20082d7136 Revert "fix flex attention eager bwd: more rounding (#164317)"
This reverts commit 41808b2ba9a61ab2f4c7af394c1668d09a4a0331.

Reverted https://github.com/pytorch/pytorch/pull/164317 on behalf of https://github.com/jeffdaily due to inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_score_mod4_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18330774537/job/52207370954) [HUD commit link](41808b2ba9) ([comment](https://github.com/pytorch/pytorch/pull/164317#issuecomment-3381812090))
2025-10-08 14:29:10 +00:00
7158aa22e8 remove more (#164753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753
Approved by: https://github.com/aorenste, https://github.com/mlazos
ghstack dependencies: #164664, #164665, #164667, #164668
2025-10-08 14:23:38 +00:00
2035f6b2e6 use check_size instead of check_is_size in ops.py (#164668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164668
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665, #164667
2025-10-08 14:23:38 +00:00
2b58adc3bd [inductor][templates] Distinguish between kernel input nodes and codegen input nodes (#163752)
If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752
Approved by: https://github.com/eellison
2025-10-08 14:12:14 +00:00
322091d8d8 [opaque_obj] Add make_fx tracing support (#163278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163278
Approved by: https://github.com/zou3519
ghstack dependencies: #163279, #163277
2025-10-08 09:09:16 +00:00
2bb4e6876c [opaque obj] Error for torch.library.custom_op infer_schema (#163277)
Unsure how we can get infer_schema to infer the scriptObject type from just the type annotation, so for now will just error clearly and ask users to specify a schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163277
Approved by: https://github.com/zou3519
ghstack dependencies: #163279
2025-10-08 09:09:16 +00:00
56ef7743fc [opaque_obj] Add __eq__ and __deepcopy__ (#163279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163279
Approved by: https://github.com/zou3519
2025-10-08 09:09:16 +00:00
64108bdbed [BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641)
This PR removes `torch._cast_XXX` from generated OPs. They were deprecated in PyTorch 1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641
Approved by: https://github.com/albanD, https://github.com/justinchuby
2025-10-08 08:27:58 +00:00
c855f8632e Pyrefly suppressions 7/n (#164913)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
 INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164913
Approved by: https://github.com/oulgen
2025-10-08 07:27:17 +00:00
12d2ef557f Update round size with 1 division behavior (#162203)
have round size return nearest power of 2 greater than or equal to size with 1 division

Fixes #161139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162203
Approved by: https://github.com/ezyang
2025-10-08 06:41:46 +00:00
65aa62d50d Use codegen for the boxed interpreters (#164573)
Authored with claude code.  The arg parsing is kind of horrible, open
to more suggestions.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164573
Approved by: https://github.com/albanD, https://github.com/jansel
2025-10-08 06:27:44 +00:00
6a09f9306c Fix #164742, all header-impl'd userfacing functions should be inline (#164871)
It is as @mxmpl pointed out; we are missing an inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164871
Approved by: https://github.com/mikaylagawarecki
2025-10-08 05:57:19 +00:00
19bf67be32 multimem reduce (#164517)
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-08 05:25:16 +00:00
1927783aa3 Revert "Reland vision pinned commit hash update (#164492)"
This reverts commit 6861a270624b44954826688f8dad668eb0154452.

Reverted https://github.com/pytorch/pytorch/pull/164492 on behalf of https://github.com/izaitsevfb due to see autorevert msg above, inductor breakage is legit ([comment](https://github.com/pytorch/pytorch/pull/164492#issuecomment-3379537888))
2025-10-08 04:38:26 +00:00
184817c7a8 locks + unit tests (#164636)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

D83714690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164636
Approved by: https://github.com/aorenste
2025-10-08 04:34:22 +00:00
da903b6a8b list_stored_sd_metadata API. (#160610)
Summary:
1\ Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2\ These usecases include data loader checkpoints, reading data for post processing (when the original model definition is not available).
3\ There, we have to use saved checkpoint  (metadata) as our source of truth.
4\ This RFC proposal exposes the checkpoint metadata using a public API.

In this proposal we expose the stored state-dict metadata  (minus associated storage/chunk metadata).

Chunk/storage details should not be exposed to the users and is a impl detail of the storage writer/reader.

Test Plan:
UT.

Rollback Plan:

Differential Revision: D80231457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
2025-10-08 04:33:51 +00:00
f76fdcaaf8 [Benchmark] cleanup huggingface models (#164815)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for hugging face models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 46 to 27 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164815
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 03:21:04 +00:00
608792153f [inductor][codecache] Print bytes in codecache debug output (#164898)
Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters.

Test Plan:
Registered a dummy post_grad_custom_pass and printed codecache debug output
`TORCH_LOGS=+torch._inductor.codecache python ~/foo.py`

Yields something like:
```
V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������...
```

Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898
Approved by: https://github.com/oulgen
2025-10-08 02:45:20 +00:00
086dec3235 Pyrefly suppressions 6/n (#164877)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

INFO 0 errors (5,064 ignored)

Only four directories left to enable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877
Approved by: https://github.com/oulgen
2025-10-08 02:30:57 +00:00
ad7b2bebc6 Use tuples to have a deterministic ordering. (#164851)
When debugging I noticed some non-deterministic behavior and tracked it down to this literal set. Changed to be a tuple for determinism. Changed two other small literal sets also because using a set for a small lookup like that is slow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164851
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-10-08 02:12:03 +00:00
d444384003 [SymmMem] Tiled reduce (#162243)
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`

For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.

Testing on right-bottom quandrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243
Approved by: https://github.com/ngimel
2025-10-08 02:03:04 +00:00
3040a5d294 Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 801e282f39e9ef4424dfd3ecfd2b550a44595229.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to breaks executorch internally, see [D84068062](https://www.internalfb.com/diff/D84068062?entry_point=16) ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3379281844))
2025-10-08 01:49:34 +00:00
97463d4cf3 Revert "Fix double dispatch to Python for detach (#163671)"
This reverts commit c32118dc3e50505fd285e6e448a90883fce11535.

Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3379281422))
2025-10-08 01:46:45 +00:00
c813617c53 [PP] Migrate other schedules to use PipelineScheduleRuntime (#164777)
Second fix for https://github.com/pytorch/pytorch/issues/164756

This has been a TODO to make the all schedules execute using the same runtime. Now after this change, schedules will use the same logic for `_PipelineScheduleRuntime` where it adds `UNSHARD` and `RESHARD` operations to the schedules which fixes the issue mentioned above.

<img width="920" height="406" alt="image" src="https://github.com/user-attachments/assets/a4d5bcd0-7dac-43cd-96f9-8ca33cfd8b91" />

A test is failing after the conversion:
- Fixed a gradient scaling issue for dWeight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777
Approved by: https://github.com/fegin
ghstack dependencies: #164775
2025-10-08 01:45:57 +00:00
e659661ffa [PP] Fix FSDP unshard/reshard (#164775)
First fix for https://github.com/pytorch/pytorch/issues/164756

In the pipeline IR we call `UNSHARD` and `RESHARD`,  but there is a bug because when we call `module.unshard()` these do not recursively call the FSDP modules, hence leading to sometime call allgather before the module forward.

Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead which ensures that all the modules are unsharded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
2025-10-08 01:45:57 +00:00
41808b2ba9 fix flex attention eager bwd: more rounding (#164317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317
Approved by: https://github.com/drisspg
ghstack dependencies: #163986
2025-10-08 01:17:45 +00:00
c0510dc447 [ContextParallel] add _LoadBalancer classes, and load-balance interface to Context Parallel APIs (#161062)
**Summary**
This PR provides an interface for users to specify how to load-balance the attention
input. The load-balance is essentially a rearrangement of the input tensor(s) over the
seq_dim before sharding and can be specified via an index tensor `rearrange` such
that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new
index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added
in https://github.com/pytorch/pytorch/pull/155442.

**New `_LoadBalancer` classes**
New `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`)
provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`.

When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such
that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies
to K and V).

When `restore == True`, this method outputs an index Tensor (namely `restore_idx` such that
`Q'[restore_idx] == Q` (same applies to K and V).

**Impact**
2 public CP APIs and 1 private CP API is modified. This PR should be backward-compatible by:
- For uses w/ SDPA, existing users must be using the `context_parallel()` API which does not
take in the extra `load_balancer` argument and solely determines from the global var
`_cp_options.enable_load_balance`.
- For new users including who want to try `flex_attention()`, we require to use the new API
`_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()`
because we no longer rely on TorchDispatchMode nor TorchFunctionMode for op replacement. And
we also require users to explicitly pass in a `load_balancer` argument if load-balancing is demanded.

**Load-Balance Behavior**
`context_parallel_unshard()`, and `create_cp_block_mask()` APIs now take an extra optional argument
`load_balancer`. This argument is optional because of backward compatibility but we require new users
to explicitly pass in a `load_balancer` if load-balancing is demanded:
- if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs
no load-balancing on input Tensors.
- if `load_balancer == None` and `_cp_options.enable_load_balance ==True`, CP performs
head-tail load-balancing (e.g. split a Tensor into 2*N chunks and first N are called head and
the rest are called tail. Place the first head chunk the last tail chunk on rank 0, and the second
head along with the second last tail chunk on rank 1, and so on).

`_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior
is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance` :
- if `load_balancer == None`, no load-balancing will be performed
- otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding.

**Changes**
This PR moves the index Tensor generation logic into a set of LoadBalancer classes and
make LoadBalancer the common interface for Context Parallel APIs that leverages
load-balancing:
* _context_parallel_buffers
* context_parallel_unshard
* create_cp_block_mask

The `_LoadBalancer` classes added are:
- `_LoadBalancer`: the abstract base class that provides “_generate_indices” interface index Tensor generation.
- `_HeadTailLoadBalancer`: Implements head-tail balancing logic.
- `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences.

**Test**
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062
Approved by: https://github.com/fegin
2025-10-08 01:09:14 +00:00
9ec10dc26a utils + unit tests (#164551)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164551
Approved by: https://github.com/aorenste
2025-10-08 01:05:45 +00:00
43fc859625 Don't return values in void functions (#164809)
This PR fixes returning values in void C++ functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164809
Approved by: https://github.com/janeyx99
2025-10-08 01:04:14 +00:00
f713abab16 Revert "Enable all flake8-logging-format rules (#164655)"
This reverts commit e98c4e835b1db22092fc93b49d2cddd7b3537d1f.

Reverted https://github.com/pytorch/pytorch/pull/164655 on behalf of https://github.com/malfet due to Looks like it broke lint in trunk, see bd3b98a8a5/1 ([comment](https://github.com/pytorch/pytorch/pull/164655#issuecomment-3379209309))
2025-10-08 00:55:17 +00:00
bd3b98a8a5 [dynamic shapes] make backed_size_oblivious behavior consistent b/w symbolic_shapes/inductor (#164796)
Summary: call guard_or_ directly to enable backed_size_obl in inductor calls to guard_or

Test Plan: CI and unit test added.

Differential Revision: D84009392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164796
Approved by: https://github.com/laithsakka
2025-10-08 00:19:06 +00:00
e98c4e835b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99
2025-10-08 00:16:13 +00:00
7b15534434 [export] Fix weight sharing when there is no complete tensor (#164857)
Summary: As titled.

Test Plan: CI

Differential Revision: D84079625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164857
Approved by: https://github.com/yushangdi
2025-10-07 23:40:13 +00:00
c32118dc3e Fix double dispatch to Python for detach (#163671)
This fixes #71725.

Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-07 23:34:37 +00:00
e3ae80fc03 [PP] Let PP split BlockMask into micro-BlockMask (#164111)
BlockMask has batch dimension information. So PP has to split it as well just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can just split it without too many issues. However, `mask_mod` requires the batch index as the input, which the value is going to be changed after the split. So we have to wrap it inside a closure to modify the batch index.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111
Approved by: https://github.com/H-Huang
2025-10-07 23:25:34 +00:00
483f4e0db9 CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 rc testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds

This is related to:
 https://github.com/pytorch/pytorch/issues/152756

Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262

Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
    import torch
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
    main()
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-10-07 22:52:53 +00:00
d1a62c8036 [BE][Ez]: Enable RUF007 Prefer itertools.pairwise over zip slicing (#164856)
Now that our min version is 3.10 we can support this rule. This is more concise, readable, and efficient than the previous zip slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164856
Approved by: https://github.com/williamwen42
2025-10-07 22:51:17 +00:00
6861a27062 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-07 22:45:05 +00:00
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
9f5e1beaf3 [multi-kernel] base tensor sizes for shape cache key (#164499)
to match shape key in 3ca09d65f1/torch/_inductor/select_algorithm.py (L3571)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164499
Approved by: https://github.com/ColinPeppler
2025-10-07 21:27:40 +00:00
2e027e8742 [inductor] Improve bound on the number of dims to match for the block (#163755)
- Removes redundant broadcast code when `len(kernel.range_tree_nodes)` is much larger than `len(range_tree.nodes)`. For example:
```python
# before, the broadcast is to [1, 1, XBLOCK, R0_BLOCK]
tmp0 = tl.reshape(tl.broadcast_to(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last')[:, None, :, :], [(511 + XBLOCK) // 512, ((1) * ((1) <= ((511 + XBLOCK) // 512)) + ((511 + XBLOCK) // 512) * (((511 + XBLOCK) // 512) < (1))), ((512) * ((512) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (512))), R0_BLOCK]), [XBLOCK, R0_BLOCK])
# after
tmp0 = tl.reshape(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last'), [XBLOCK, R0_BLOCK])
```
- Fix: also save range_tree_nodes per subgraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163755
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2025-10-07 21:02:37 +00:00
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
f505caa71b Revert "multimem reduce (#164517)"
This reverts commit d1cbb74fb16406488a174832e1b58b7c242f418d.

Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))
2025-10-07 20:12:38 +00:00
65f10becdf Support OVERLAP_F_B in schedule (#161072)
Previously, we converted the overlap_f_b into separate forward and backward operations in the plan. This is a small change that includes it in the plan and handles it in the runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161072
Approved by: https://github.com/fegin, https://github.com/wconstab
2025-10-07 19:55:10 +00:00
df640df68a Revert "Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)"
This reverts commit 8c0bc879b97bc580aaa0777b2d266bdd068cb528.

Reverted https://github.com/pytorch/pytorch/pull/164519 on behalf of https://github.com/malfet due to Still breaks internal workflows ([comment](https://github.com/pytorch/pytorch/pull/164519#issuecomment-3378469432))
2025-10-07 19:46:17 +00:00
4c3c0ef2f1 [precompile] Load source cache for AOT compile as well. (#164773)
Adding source_get_cache also to AOT compile case. Since the guard manager loader code can be shared between AOT and caching, we added a new function load_guard_manager to avoid code duplication between two workflows, for loading guards.

Test Plan: test_guard_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773
Approved by: https://github.com/yiming0416, https://github.com/dolpm
2025-10-07 18:47:09 +00:00
bc33b10202 fix copy_ for scalar in inductor (#164167)
Fixes #158437

### Summary

- TorchInductor was not properly handling scalar copy operations `(tensor.copy_(scalar_value))`
- Ensured scalar sources are converted to appropriate tensor representations with correct dtype and device

### Impact

- Enables compilation of models using ` tensor.copy_(scalar) `patterns
- module: inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164167
Approved by: https://github.com/shunting314
2025-10-07 18:31:37 +00:00
2855a045b3 Use sym_eq and sym_and on symbolic shapes in common_meta_baddbmm_bmm (#164781)
Differential Revision: D84005053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164781
Approved by: https://github.com/Skylion007
2025-10-07 18:25:00 +00:00
9ecd092bd9 Add python bindings for NCCL CTA policies (#164309)
NCCLConfig can now be constructed with non-default [cta policies][1]

```python
import torch
from torch.distributed import ProcessGroupNCCL as nccl

config = nccl.NCCLConfig()
config.cta_policy = nccl.NCCL_CTA_POLICY_ZERO  # NCCL version >= 2.28
```

[1]: https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2283/user-guide/docs/api/flags.html#nccl-communicator-cta-policy-flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164309
Approved by: https://github.com/eqy
2025-10-07 18:16:20 +00:00
078d475d3b move partition and compiler fns from stage 1 to stage 2 (#164765)
Differential Revision: D83995689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164765
Approved by: https://github.com/zhxchen17
2025-10-07 18:02:03 +00:00
f37a6523ef Move version.h to torch/headeronly (#164381)
Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381
Approved by: https://github.com/janeyx99
2025-10-07 17:47:30 +00:00
b13cd141b3 Add pyrefly suppressions (#164748)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the `project-excludes` field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

0 errors (4,263 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164748
Approved by: https://github.com/oulgen
2025-10-07 17:31:18 +00:00
5e47b4dd60 Remove device_id param from DeviceCachingAllocator::malloc (#164798)
The `malloc` call in DeviceCachingAllocator accepts a DeviceIndex param which
can be confusion because the allocator can only allocate memory for the device
that it corresponds to. This associated device is fixed at construction time
and the runtime param can be misleading.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164798
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/eqy
2025-10-07 16:42:04 +00:00
ee5389d520 Enable batch samples in sparse tests (#164677)
The test cases are enabled because the issue was fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164677
Approved by: https://github.com/albanD
2025-10-07 15:58:37 +00:00
ab01a0d7d3 Add memory estimator (#164738)
Original work by @ShatianWang, with lints applied. I am going to a few changes and add tests in subsequent prs but I want to preserve original commit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164568, #164569, #164581
2025-10-07 15:32:27 +00:00
801e282f39 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-07 14:54:26 +00:00
87c9fbda22 Follow up to PR 163980 for s390x (#164464)
Now with same updates propagated to s390x it works on s390x runners too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164464
Approved by: https://github.com/atalman
2025-10-07 12:02:29 +00:00
3cc8af2d67 torch.topk: refactor global histogram/cumsum into a dedicated kernel to eliminate redundant memory access (#164459)
# TLDR
This PR removes the regression in torch.topk introduced from torch 2.7.0 and delivers much better performance for large inputs.

The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1000,000)  | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |

# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.

Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).

# Analysis

`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:

1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.

Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.

In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:

- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.

# This PR

To address this inefficiency, we introduce the following optimizations:

1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.

# Performance
We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.

| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1000,000)  | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |

Besides, I have verified the correctness of this PR with different inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-10-07 11:04:03 +00:00
1fb072ac2a exceptions + unit tests (#164550)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164550
Approved by: https://github.com/aorenste
2025-10-07 10:04:58 +00:00
cac5e13e13 [dynamo] Inline nn module calls using __call__ methods (#164817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164817
Approved by: https://github.com/SherlockNoMad, https://github.com/mlazos
2025-10-07 08:57:20 +00:00
68350660ee Increase timeout for nightly macOS performance tests to 300 minutes (#164793)
the Test step time recently went slightly up.

hopefully this fixes https://github.com/pytorch/alerting-infra/issues/263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164793
Approved by: https://github.com/seemethere
2025-10-07 08:44:07 +00:00
ef7e2ca77e remove check_is_size from test_misc.py (#164667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164667
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665
2025-10-07 07:33:50 +00:00
cdaaf3e4a3 remove size-like based size-oblivious special max simplifications (#164665)
As we removed guard_size_oblivious this simplification is no longer relevant, this is part of the process of
deprecation for guard_size_oblivious and its dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164665
Approved by: https://github.com/aorenste
ghstack dependencies: #164664
2025-10-07 07:33:50 +00:00
0ea59c3c55 do not suggest torch._check_is_size() (#164664)
size like concept for data dependency is not relevant anymore as we removed all guard_size_oblivious calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164664
Approved by: https://github.com/angelayi, https://github.com/mlazos
2025-10-07 07:33:50 +00:00
8f705d019a context + unit tests (#164549)
Summary:
the context module provides configurable context selection + isolation key hashing;

context selection is broken into runtime and compile context. runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes).

callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regards to the function which is being cached.

these selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key, which says that some function signature key result is valid under a given context (isolation schema)

Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

 D83714689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
2025-10-07 06:02:10 +00:00
4bcc05777e [torchfuzz] synthesize inputs for data dependent ops (#164716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164716
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694, #164715
2025-10-07 05:40:32 +00:00
2a6cdba6e5 [torchfuzz] various edge case fixes (#164715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164715
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694
2025-10-07 05:30:46 +00:00
53f6cc7529 [torchfuzz] make ops_fuzzer deterministic (#164694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164694
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693
2025-10-07 05:30:46 +00:00
ac901bf79a [torchfuzz] consolidate on a base implementation of args_codegen (#164693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164693
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688
2025-10-07 05:20:28 +00:00
c965d6dbb2 [torchfuzz] move into experimental dir (#164688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164688
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687
2025-10-07 05:09:08 +00:00
ac08556f67 [torchfuzz] support more unbacked functions (#164687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164687
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649
2025-10-07 05:00:03 +00:00
5fe7f29b9e [torchfuzz] add support for operator weights (#164649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164649
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647
2025-10-07 05:00:03 +00:00
ded099ecbf [torchfuzz] don't use the first gpu in multi process fuzzer (#164647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164647
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646
2025-10-07 04:59:56 +00:00
63fcc3e6c4 [torchfuzz] update README.md (#164646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164646
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514
2025-10-07 04:59:50 +00:00
fd3e15c14f Fix typo in class definition of bytecodedispatchtable (#164762)
ghstack-source-id: 84f0d7bb7e3780ca75473782abfae530010be56e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164761

Fixes the type in naming of bytecodedispatchtable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164762
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-10-07 04:36:09 +00:00
ff5faa744a Remove unused THPXXX macros (#164660)
These macros are not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164660
Approved by: https://github.com/albanD
2025-10-07 04:04:21 +00:00
4725871a81 Return fake mode from export graph capture API (#164730)
This PR is to temporarily unblock various experiments to re-use dynamo create fake mode. Note that this is still not what we want as the end state. The end state should look sth like:
```
out = fulllgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm  = out.module()
```
This doesn't work today because export requires we need to wrap the original module to setup a flat module to trace for easier handling of pytree. As a result, we would need to carry export specific flag in fullgraph_capture which seems not ideal.
Regardless, the end state is that we need to give downstream user a graph module and a fake mode in some form, so I think _dynamo_graph_capture_for_export returning a fake mode within graph module itself via gm.meta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
2025-10-07 03:42:46 +00:00
bcd96cc6ff [annotate] Copy fwd to bwd metadata for subgraphs as well (#164795)
The test is in the next PR. My older PR on dynamo annotate - https://github.com/pytorch/pytorch/pull/164678 is getting reverted due to unknown reasons, so difficult to add a test right now in this PR.  When I reland, I can add a test for this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164795
Approved by: https://github.com/yushangdi
ghstack dependencies: #164772
2025-10-07 02:42:47 +00:00
50e077beaa Fix outdated info in requirements-ci.txt (#164441)
Fixes installation instructions and descriptions for `numba` and `scikit-image`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164441
Approved by: https://github.com/albanD
2025-10-07 02:10:41 +00:00
56d66ac0d7 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
2025-10-07 02:05:09 +00:00
49f7d8d19d [ROCm] Fix test_cuda_synchronize failure on ROCm (#164735)
This PR skips the hipify step of torch/csrc/jit/ir/ir.h to avoid a build-time error for the JIT cuda namespace.  This fixes two skipped tests in test/jit/test_cuda.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164735
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 01:14:24 +00:00
afee8062d5 Revert "Fix mesh.get_local_rank when it is > 1d (#164473)"
This reverts commit 83d71dfb2fd993a6242372b8123549acaa85ffdb.

Reverted https://github.com/pytorch/pytorch/pull/164473 on behalf of https://github.com/izaitsevfb due to appears to be causing vision_maskrcnn regression ([comment](https://github.com/pytorch/pytorch/pull/164473#issuecomment-3374738997))
2025-10-07 00:37:41 +00:00
e89d12bf5d Numpy zerotensor handling (#164487)
Fixes #89034

Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb
2025-10-07 00:34:14 +00:00
d4752bc7f6 [caffe2] tweak Unpickler::readInstruction handling TUPLE (#164764)
Summary: Creating the vector was a bit awkward. Use the natural iterator-pair constructor with move-iterators.

Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D83995108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164764
Approved by: https://github.com/drisspg
2025-10-07 00:18:10 +00:00
44a5d41993 [ROCm] add gfx1150 gfx1151 to supported gemm lists (#164744)
This is one of a few PRs needed to address https://github.com/pytorch/pytorch/pull/164744 fully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164744
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 00:02:23 +00:00
361c5d362c [fx][traceback] Actually disable preservation of node metadata when enable=False (#164772)
This will come in handy when we run graph passes that add new nodes, and
create_proxy can add seq_nr meta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164772
Approved by: https://github.com/SherlockNoMad
2025-10-06 23:39:12 +00:00
1fc71d1b57 Revert "Numpy zerotensor handling (#164487)"
This reverts commit f7ad6dbad67161333a1473d1e0b478b7475a0ec1.

Reverted https://github.com/pytorch/pytorch/pull/164487 on behalf of https://github.com/malfet due to Did it break torchbench?, see 8c728e129d/1 ([comment](https://github.com/pytorch/pytorch/pull/164487#issuecomment-3374635051))
2025-10-06 23:32:12 +00:00
8f54e27e5d [ROCm][CI] rebuild magma binary for gfx1150 gfx1151 (#164782)
After #164763 added gfx1150 gfx1151 to list of targets, this PR will trigger rebuild of magma binary for ROCm 7 with the new targets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164782
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 23:29:21 +00:00
8c0bc879b9 Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)
This makes Placement data representation available in C++ via pybind11. Reapply with fix for internal errors.

D83788896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164519
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-10-06 23:19:14 +00:00
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
3912ba3e94 Revert "Fix refine_ranges corner case (#164075)"
This reverts commit 27234792add2ee9bedd84ca02dbf34f8f244bc5c.

Reverted https://github.com/pytorch/pytorch/pull/164075 on behalf of https://github.com/izaitsevfb due to fails executorch builds, see [D83938444](https://www.internalfb.com/diff/D83938444) ([comment](https://github.com/pytorch/pytorch/pull/164075#issuecomment-3374430964))
2025-10-06 22:09:39 +00:00
cfc5cc17dc Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 2883b5ab773daf5861d43ff0b65be49a441ab3f9.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to fails inductor:max_autotune tests internally, see D83948169 ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3374407009))
2025-10-06 22:03:42 +00:00
fdc8ccc5bc Make Adam, AdamW work with nonzero-dim Tensor betas (#149939)
Fixes #147921

## Changes

- Convert tensor `betas` using `_to_scalar`
- Change annotation of `betas` param
- Change param type in docs

## Test Result

```bash
pytest -s test/test_optim.py -k test_tensor_lr -vv
```

![image](https://github.com/user-attachments/assets/312ee045-1e8b-4789-aa6e-ba63e6df7e81)

![image](https://github.com/user-attachments/assets/7e6ec274-645b-46b9-b1a6-2b340a685203)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149939
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-10-06 22:03:25 +00:00
48b54b45d6 Replace pynvml with nvidia-ml-py in win-test.sh (#164681)
pynvml was deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164681
Approved by: https://github.com/Aidyn-A, https://github.com/eqy
2025-10-06 21:57:26 +00:00
1185 changed files with 19318 additions and 15883 deletions

View File

@ -181,7 +181,7 @@ case "$tag" in
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950;gfx1100"
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
@ -344,7 +344,7 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \

View File

@ -1 +1 @@
27664085f804afc83df26f740bb46c365854f2c4
7416ffcb92cdbe98d9f97e4e6f95247e46dfc9fd

View File

@ -46,9 +46,9 @@ case ${DOCKER_TAG_PREFIX} in
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;

View File

@ -115,6 +115,9 @@ RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
# cmake-3.28.0 from pip for onnxruntime
RUN python3 -mpip install cmake==3.28.0
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# build onnxruntime 1.21.0 from sources.
# it is not possible to build it from sources using pip,
# so just build it from upstream repository.

View File

@ -84,9 +84,9 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;

View File

@ -10,11 +10,6 @@ BAD_SSL = "https://self-signed.badssl.com"
print("Testing SSL certificate checking for Python:", sys.version)
if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
EXC = OSError
print(f"Connecting to {GOOD_SSL} should work")

View File

@ -120,9 +120,8 @@ ninja==1.11.1.4
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#Pinned versions: 0.55.2, 0.60.0
#test that import: test_numba_integration.py
#For numba issue see https://github.com/pytorch/pytorch/issues/51511
#Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
#numpy
@ -242,10 +241,9 @@ pygments==2.15.0
#Pinned versions: 14.1.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
scikit-image==0.22.0
#Description: image processing routines
#Pinned versions:
#Pinned versions: 0.22.0
#test that import: test_nn.py
#scikit-learn

View File

@ -5,7 +5,7 @@ DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -18,7 +18,6 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
.PHONY:
clean:
@ -34,8 +33,3 @@ magma-rocm70:
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)

View File

@ -233,7 +233,9 @@ if [[ "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *-full-debug* ]]; then
export CMAKE_BUILD_TYPE=Debug
elif [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
@ -299,6 +301,11 @@ else
python -m build --wheel --no-isolation
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "$BUILD_ENVIRONMENT" == *full-debug* ]]; then
# Regression test for https://github.com/pytorch/pytorch/issues/164297
# Torch should be importable and that's about it
pushd /; python -c "import torch;print(torch.__config__.show(), torch.randn(5) + 1.7)"; popd
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *vision* ]]; then
install_torchvision

View File

@ -67,7 +67,7 @@ fi
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' && "$(uname -m)" != "s390x" ]]; then
if [[ "$(uname)" != 'Darwin' ]]; then
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"

View File

@ -256,7 +256,7 @@ test_torchbench_smoketest() {
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
local models=(llama BERT_pytorch dcgan yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor vgg16)
for backend in eager inductor; do
@ -319,7 +319,7 @@ test_aoti_torchbench_smoketest() {
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
local models=(llama BERT_pytorch dcgan yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor vgg16)
echo "Launching torchbench inference performance run for AOT Inductor and dtype ${dtype}"
local dtype_arg="--${dtype}"

View File

@ -838,7 +838,7 @@ test_dynamo_benchmark() {
elif [[ "${suite}" == "timm_models" ]]; then
export TORCHBENCH_ONLY_MODELS="inception_v3"
elif [[ "${suite}" == "torchbench" ]]; then
export TORCHBENCH_ONLY_MODELS="hf_Bert"
export TORCHBENCH_ONLY_MODELS="BERT_pytorch"
fi
fi
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
@ -869,13 +869,13 @@ test_inductor_torchbench_smoketest_perf() {
mkdir -p "$TEST_REPORTS_DIR"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only BERT_pytorch \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
for test in BERT_pytorch yolov3; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
--disable-cudagraphs --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" \
--only $test --output "$TEST_REPORTS_DIR/inductor_training_smoketest_$test.csv"
@ -886,7 +886,7 @@ test_inductor_torchbench_smoketest_perf() {
done
# Perform some "warm-start" runs for a few huggingface models.
for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
for test in AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \

View File

@ -15,37 +15,35 @@ if errorlevel 1 exit /b 1
if not errorlevel 0 exit /b 1
cd %TMP_DIR_WIN%\build\torch\test
:: Enable delayed variable expansion to make the list
setlocal enabledelayedexpansion
set EXE_LIST=
for /r "." %%a in (*.exe) do (
call :libtorch_check "%%~na" "%%~fa"
if "%%~na" == "c10_intrusive_ptr_benchmark" (
@REM NB: This is not a gtest executable file, thus couldn't be handled by
@REM pytest-cpp and is excluded from test discovery by run_test
call "%%~fa"
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
) else (
if "%%~na" == "verify_api_visibility" (
@REM Skip verify_api_visibility as it is a compile-level test
) else (
set EXE_LIST=!EXE_LIST! cpp/%%~na
)
)
)
goto :eof
:libtorch_check
cd %CWD%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\test
:: Skip verify_api_visibility as it a compile level test
if "%~1" == "verify_api_visibility" goto :eof
:: Run python test\run_test.py on the list
set NO_TD=True && python test\run_test.py --cpp --verbose -i !EXE_LIST!
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
echo Running "%~2"
if "%~1" == "c10_intrusive_ptr_benchmark" (
:: NB: This is not a gtest executable file, thus couldn't be handled by pytest-cpp
call "%~2"
goto :eof
)
python test\run_test.py --cpp --verbose -i "cpp/%~1"
if errorlevel 1 (
echo %1 failed with exit code %errorlevel%
goto fail
)
if not errorlevel 0 (
echo %1 failed with exit code %errorlevel%
goto fail
)
goto :eof
:eof
exit /b 0

View File

@ -38,7 +38,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move this to .ci/docker/requirements-ci.txt
python -m pip install "psutil==5.9.1" "pynvml==11.4.1" "pytest-shard==0.1.2"
python -m pip install "psutil==5.9.1" nvidia-ml-py "pytest-shard==0.1.2"
run_tests() {
# Run nvidia-smi if available

View File

@ -71,14 +71,7 @@ export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.9/13.0 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]] || [[ "$DESIRED_CUDA" == "cu130" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
TRITON_CONSTRAINT="platform_system == 'Linux'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

View File

@ -12,7 +12,7 @@ ignore =
# to line this up with executable bit
EXE001,
# these ignores are from flake8-bugbear; please fix!
B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907,B908,B910
B007,B008,B017,B019,B023,B028,B903,B905,B906,B907,B908,B910
# these ignores are from flake8-comprehensions; please fix!
C407,
# these ignores are from flake8-logging-format; please fix!

View File

@ -274,8 +274,6 @@ runs:
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
# Propagate download.pytorch.org IP to container
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts"
echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"

View File

@ -28,6 +28,10 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Print GPU info (if present)
shell: bash
run: if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi
- name: Check if in a container runner
shell: bash
id: check_container_runner
@ -82,37 +86,6 @@ runs:
# Prune all of the docker images
docker system prune -af
- name: Manually resolve download.pytorch.org
shell: bash
continue-on-error: true
run: |
set +e
set -x
PT_DOMAIN=download.pytorch.org
# TODO: Flaky access to download.pytorch.org https://github.com/pytorch/pytorch/issues/100400,
# cleaning this up once the issue is fixed. There are more than one resolved IP here, the last
# one is returned at random
RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" | tail -n1)
if [ -z "${RESOLVED_IP}" ]; then
echo "Couldn't resolve ${PT_DOMAIN}, retrying with Google DNS..."
RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" @8.8.8.8 | tail -n1)
if [ -z "${RESOLVED_IP}" ]; then
echo "Couldn't resolve ${PT_DOMAIN}, exiting..."
exit 1
fi
fi
if grep -r "${PT_DOMAIN}" /etc/hosts; then
# Clean up any old records first
sudo sed -i "/${PT_DOMAIN}/d" /etc/hosts
fi
echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts
cat /etc/hosts
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true

View File

@ -33,10 +33,6 @@ runs:
)
echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" && "${GPU_ARCH_TYPE}" != "xpu" ]]; then
# Propagate download.pytorch.org IP to container. This is only needed on Linux non aarch64 runner
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"
fi
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
# Generate test script

View File

@ -30,6 +30,7 @@ ciflow_push_tags:
- ciflow/riscv64
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/rocm-mi355
- ciflow/s390
- ciflow/slow
- ciflow/torchbench

Binary file not shown.

View File

@ -18,6 +18,7 @@ class GitHubComment:
body_text: str
created_at: str
author_login: str
author_url: Optional[str]
author_association: str
editor_login: Optional[str]
database_id: int

Binary file not shown.

View File

@ -38,6 +38,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text="mock_body_text",
created_at="",
author_login="",
author_url=None,
author_association="",
editor_login=None,
database_id=1,
@ -48,6 +49,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text=" #" + LABEL_ERR_MSG_TITLE.replace("`", ""),
created_at="",
author_login=BOT_AUTHORS[1],
author_url=None,
author_association="",
editor_login=None,
database_id=2,

View File

@ -32,6 +32,7 @@ from trymerge import (
main as trymerge_main,
MandatoryChecksMissingError,
MergeRule,
PostCommentError,
RE_GHSTACK_DESC,
read_merge_rules,
remove_job_name_suffix,
@ -588,6 +589,23 @@ class TestTryMerge(TestCase):
self.assertEqual(mock_merge_base, pr.get_merge_base())
mocked_gh_fetch_merge_base.assert_called_once()
def test_app_can_revert(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 164660)
repo = DummyGitRepo()
app_comment_id, impostor_comment_id = 3375785595, 3377647892
# Check that app can revert
self.assertIsNotNone(validate_revert(repo, pr, comment_id=app_comment_id))
# But impostor can not
self.assertRaises(
PostCommentError,
lambda: validate_revert(repo, pr, comment_id=impostor_comment_id),
)
# Despite it's name being the name of the bot
self.assertEqual(
pr.get_comment_by_id(impostor_comment_id).author_login,
"pytorch-auto-revert",
)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

View File

@ -234,6 +234,7 @@ query ($owner: String!, $name: String!, $number: Int!) {
createdAt
author {
login
url
}
authorAssociation
editor {
@ -1093,6 +1094,7 @@ class GitHubPR:
body_text=node["bodyText"],
created_at=node["createdAt"] if "createdAt" in node else "",
author_login=node["author"]["login"],
author_url=node["author"].get("url", None),
author_association=node["authorAssociation"],
editor_login=editor["login"] if editor else None,
database_id=node["databaseId"],
@ -2029,6 +2031,11 @@ def validate_revert(
# For some reason, one can not be a member of private repo, only CONTRIBUTOR
if pr.is_base_repo_private():
allowed_reverters.append("CONTRIBUTOR")
# Special case the pytorch-auto-revert app, whose does not have association
# But should be able to issue revert command
if comment.author_url == "https://github.com/apps/pytorch-auto-revert":
allowed_reverters.append("NONE")
if author_association not in allowed_reverters:
raise PostCommentError(
f"Will not revert as @{author_login} is not one of "

View File

@ -389,8 +389,6 @@ jobs:
"${DOCKER_IMAGE}" \
${DOCKER_SHELL_CMD}
)
# Propagate download.pytorch.org IP to container
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts"
echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then

View File

@ -37,7 +37,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.12xlarge"
runner: "linux.c7i.12xlarge"
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-dist
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'

View File

@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100
on:
schedule:
- cron: 15 0,12 * * 1-6
- cron: 15 0 * * 1-6
- cron: 0 7 * * 0
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs

View File

@ -63,6 +63,7 @@ jobs:
# Same as the build job
python-version: 3.12.7
test-matrix: ${{ needs.macos-perf-py3-arm64-build.outputs.test-matrix }}
timeout-minutes: 300
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4

View File

@ -1,6 +1,9 @@
name: rocm-mi355
on:
push:
tags:
- ciflow/rocm-mi355/*
workflow_dispatch:
schedule:
- cron: 30 11,1 * * * # about 4:30am PDT and 6:30pm PDT
@ -64,5 +67,7 @@ jobs:
build-environment: linux-noble-rocm-py3.12-mi355
docker-image: ${{ needs.linux-noble-rocm-py3_12-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-noble-rocm-py3_12-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"
tests-to-include: >-
${{ github.event_name == 'schedule' && 'test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor test_matmul_cuda test_scaled_matmul_cuda'
|| '' }}
secrets: inherit

View File

@ -59,3 +59,29 @@ jobs:
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-rocm-py3_10-gfx1100-test:
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3_10-gfx1100
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
]}
tests-to-include: >
test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs
test_autograd inductor/test_torchinductor inductor/test_kernel_benchmark
inductor/test_pad_mm inductor/test_benchmark_fusion inductor/test_aot_inductor
inductor/test_torchinductor inductor/test_decompose_mem_bound_mm
inductor/test_flex_attention inductor/test_max_autotune
secrets: inherit

View File

@ -56,7 +56,7 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
build-generates-artifacts: false
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.4xlarge"
runner: "linux.c7i.4xlarge"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -249,3 +249,14 @@ jobs:
docker-image: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-py3_10-gcc11-full-debug-build-only:
name: linux-jammy-py3.10-gcc11-full-debug-build-only
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.2xlarge.memory
build-environment: linux-jammy-py3.10-gcc11-full-debug-build-only
docker-image-name: ci-image:pytorch-linux-jammy-py3.10-gcc11
secrets: inherit

View File

@ -35,7 +35,7 @@ jobs:
runner_prefix: ${{ needs.get-label-type.outputs.label-type }}
build-environment: linux-jammy-xpu-n-1-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-xpu-n-1-py3
runner: linux.12xlarge
runner: linux.c7i.12xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.idc.xpu" },
@ -56,7 +56,7 @@ jobs:
runner_prefix: ${{ needs.get-label-type.outputs.label-type }}
build-environment: linux-jammy-xpu-n-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-xpu-n-py3
runner: linux.12xlarge
runner: linux.c7i.12xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 8, runner: "linux.idc.xpu" },

2
.gitignore vendored
View File

@ -88,7 +88,7 @@ torch_compile_debug/
# Listed manually because some files in this directory are not generated
torch/testing/_internal/generated/annotated_fn_args.py
torch/testing/_internal/data/*.pt
torch/csrc/api/include/torch/version.h
torch/headeronly/version.h
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/generated
torch/csrc/generic/TensorMethods.cpp

View File

@ -28,7 +28,7 @@ exclude_patterns = [
'torch/lib/**',
'venv/**',
'**/*.pyi',
"tools/experimental/dynamic_shapes/torchfuzz/**",
"tools/experimental/torchfuzz/**",
'tools/test/test_selective_build.py',
]
command = [
@ -198,7 +198,7 @@ exclude_patterns = [
'tools/test/gen_operators_yaml_test.py',
'tools/test/gen_oplist_test.py',
'tools/test/test_selective_build.py',
'tools/experimental/dynamic_shapes/torchfuzz/**',
'tools/experimental/torchfuzz/**',
]
command = [
'python3',

View File

@ -13,6 +13,9 @@ load(":build_variables.bzl", "jit_core_sources", "lazy_tensor_ts_sources", "libt
load(":ufunc_defs.bzl", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cuda_sources")
load("//:tools/bazel.bzl", "rules")
# Export files for use by torch/headeronly (where version.h generation now lives)
exports_files(["version.txt"])
define_targets(rules = rules)
COMMON_COPTS = [
@ -690,7 +693,9 @@ cc_library(
"torch/csrc/*/generated/*.h",
"torch/csrc/jit/serialization/mobile_bytecode_generated.h",
] + torch_cuda_headers,
) + GENERATED_AUTOGRAD_CPP + [":version_h"],
) + GENERATED_AUTOGRAD_CPP + [
"//torch/headeronly:version_h",
],
includes = [
"third_party/kineto/libkineto/include",
"torch/csrc",

View File

@ -388,9 +388,9 @@ cmake_dependent_option(USE_PRIORITIZED_TEXT_FOR_LD "Use prioritized text linker
option(USE_MIMALLOC "Use mimalloc" OFF)
# Enable third party mimalloc library to improve memory allocation performance
# on Windows.
# on Windows and AArch64.
option(USE_MIMALLOC_ON_MKL "Use mimalloc on MKL" OFF)
if(WIN32)
if(WIN32 OR (CPU_AARCH64 AND NOT APPLE))
set(USE_MIMALLOC ON)
# Not enable USE_MIMALLOC_ON_MKL due to it caused issue:

View File

@ -28,4 +28,19 @@ inline std::ostream& operator<<(std::ostream& stream, at::BlasBackend backend) {
return stream << BlasBackendToString(backend);
}
namespace blas {
enum class ScalingType : std::uint8_t {
TensorWise, // fp32 scales
RowWise, // fp32 scales
BlockWise1x16, // fp8_e4m3fn scales
BlockWise1x32, // fp8_e8m0fnu scales
BlockWise1x128, // fp32 scales
BlockWise128x128, // fp32 scales
};
enum class SwizzleType : std::uint8_t { NO_SWIZZLE = 0, SWIZZLE_32_4_4 = 1 };
} // namespace blas
} // namespace at

View File

@ -144,8 +144,7 @@ inline std::string _all_equal_numel_error(at::ArrayRef<Tensor> tensors) {
inline bool _apply_preamble(ArrayRef<Tensor> tensors) {
checkDeviceType("CPU_tensor_apply", tensors, kCPU);
checkLayout("CPU_tensor_apply", tensors, kStrided);
if (!_all_equal_numel(tensors))
TORCH_CHECK(false, _all_equal_numel_error(tensors));
TORCH_CHECK(_all_equal_numel(tensors), _all_equal_numel_error(tensors));
// An empty tensor has no elements
for (auto& t : tensors)
if (t.numel() == 0)

View File

@ -483,8 +483,8 @@ at::BlasBackend Context::blasPreferredBackend() {
#if ROCM_VERSION >= 60300
"gfx1100", "gfx1101", "gfx1200", "gfx1201", "gfx908",
#endif
#if ROCM_VERSION >= 60500
"gfx950"
#if ROCM_VERSION >= 70000
"gfx950", "gfx1150", "gfx1151"
#endif
};
for (auto index: c10::irange(detail::getCUDAHooks().deviceCount())) {
@ -587,20 +587,33 @@ void Context::setROCmFAPreferredBackend(at::ROCmFABackend b) {
rocm_fa_preferred_backend = b;
}
bool Context::allowFP16ReductionCuBLAS() const {
CuBLASReductionOption Context::allowFP16ReductionCuBLAS() const {
return allow_fp16_reduction_cublas;
}
void Context::setAllowFP16ReductionCuBLAS(bool b) {
allow_fp16_reduction_cublas = b;
CuBLASReductionOption inline get_reduction_option(bool allow_reduced_precision, bool allow_splitk) {
TORCH_CHECK(
!(allow_reduced_precision && !allow_splitk),
"allow_splitk=False is not supported when reduced precision reductions are enabled");
if (allow_reduced_precision) {
return CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
} else if (allow_splitk) {
return CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK;
} else {
return CuBLASReductionOption::DisallowReducedPrecisionDisallowSplitK;
}
}
bool Context::allowBF16ReductionCuBLAS() const {
void Context::setAllowFP16ReductionCuBLAS(bool allow_reduced_precision, bool allow_splitk) {
allow_fp16_reduction_cublas = get_reduction_option(allow_reduced_precision, allow_splitk);
}
CuBLASReductionOption Context::allowBF16ReductionCuBLAS() const {
return allow_bf16_reduction_cublas;
}
void Context::setAllowBF16ReductionCuBLAS(bool b) {
allow_bf16_reduction_cublas = b;
void Context::setAllowBF16ReductionCuBLAS(bool allow_reduced_precision, bool allow_splitk) {
allow_bf16_reduction_cublas = get_reduction_option(allow_reduced_precision, allow_splitk);
}
bool Context::allowFP16AccumulationCuBLAS() const {

View File

@ -38,6 +38,12 @@ namespace at {
class Tensor;
enum class TORCH_API Float32MatmulPrecision { HIGHEST, HIGH, MEDIUM };
enum class CuBLASReductionOption : uint8_t {
AllowReducedPrecisionWithSplitK = 0,
DisallowReducedPrecisionAllowSplitK = 1,
DisallowReducedPrecisionDisallowSplitK = 2,
};
enum class TORCH_API Float32Backend { GENERIC, CUDA, MKLDNN };
enum class TORCH_API Float32Op { ALL, CONV, RNN, MATMUL };
enum class TORCH_API Float32Precision { NONE, IEEE, TF32, BF16 };
@ -220,15 +226,15 @@ class TORCH_API Context {
bool userEnabledMkldnn() const;
void setUserEnabledMkldnn(bool e);
bool benchmarkCuDNN() const;
void setBenchmarkCuDNN(bool);
void setBenchmarkCuDNN(bool /*b*/);
int benchmarkLimitCuDNN() const;
void setBenchmarkLimitCuDNN(int);
void setBenchmarkLimitCuDNN(int /*b*/);
bool immediateMiopen() const;
void setImmediateMiopen(bool);
void setImmediateMiopen(bool /*b*/);
bool deterministicCuDNN() const;
void setDeterministicCuDNN(bool);
void setDeterministicCuDNN(bool /*b*/);
bool deterministicMkldnn() const;
void setDeterministicMkldnn(bool);
void setDeterministicMkldnn(bool /*b*/);
bool userEnabledNNPACK() const;
void setUserEnabledNNPACK(bool e);
@ -246,32 +252,32 @@ class TORCH_API Context {
void setSDPPriorityOrder(const std::vector<int64_t>& order);
std::array<at::SDPBackend, at::num_sdp_backends> sDPPriorityOrder();
void setSDPUseFlash(bool);
void setSDPUseFlash(bool /*e*/);
bool userEnabledFlashSDP() const;
void setSDPUseMemEfficient(bool);
void setSDPUseMemEfficient(bool /*e*/);
bool userEnabledMemEfficientSDP() const;
void setSDPUseMath(bool);
void setSDPUseMath(bool /*e*/);
bool userEnabledMathSDP() const;
void setSDPUseCuDNN(bool);
void setSDPUseCuDNN(bool /*e*/);
bool userEnabledCuDNNSDP() const;
void setAllowFP16BF16ReductionMathSDP(bool);
void setAllowFP16BF16ReductionMathSDP(bool /*e*/);
bool allowFP16BF16ReductionMathSDP() const;
void setSDPUseOverrideable(bool);
void setSDPUseOverrideable(bool /*e*/);
bool userEnabledOverrideableSDP() const;
at::LinalgBackend linalgPreferredBackend() const;
void setLinalgPreferredBackend(at::LinalgBackend);
void setLinalgPreferredBackend(at::LinalgBackend /*b*/);
at::BlasBackend blasPreferredBackend();
void setBlasPreferredBackend(at::BlasBackend);
void setBlasPreferredBackend(at::BlasBackend /*b*/);
at::ROCmFABackend getROCmFAPreferredBackend();
void setROCmFAPreferredBackend(at::ROCmFABackend);
void setROCmFAPreferredBackend(at::ROCmFABackend /*b*/);
// Note [Enabling Deterministic Operations]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -304,9 +310,9 @@ class TORCH_API Context {
bool deterministicAlgorithms() const;
bool deterministicAlgorithmsWarnOnly() const;
void setDeterministicAlgorithms(bool, bool);
void setDeterministicAlgorithms(bool /*b*/, bool /*warn_only*/);
bool deterministicFillUninitializedMemory() const;
void setDeterministicFillUninitializedMemory(bool);
void setDeterministicFillUninitializedMemory(bool /*b*/);
// Note [Writing Nondeterministic Operations]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -350,19 +356,23 @@ class TORCH_API Context {
Float32Op op,
Float32Precision p);
bool allowTF32CuDNN(std::optional<Float32Op> op = std::nullopt) const;
void setAllowTF32CuDNN(bool);
void setAllowTF32CuDNN(bool /*b*/);
bool allowTF32OneDNN() const;
void setAllowTF32OneDNN(bool);
void setAllowTF32OneDNN(bool /*b*/);
bool allowTF32CuBLAS() const;
void setAllowTF32CuBLAS(bool);
void setAllowTF32CuBLAS(bool /*b*/);
Float32MatmulPrecision float32MatmulPrecision() const;
Float32Precision float32Precision(Float32Backend backend, Float32Op op) const;
bool allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(bool);
bool allowBF16ReductionCuBLAS() const;
void setAllowBF16ReductionCuBLAS(bool);
CuBLASReductionOption allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(
bool allow_reduced_precision,
bool allow_splitk = true);
CuBLASReductionOption allowBF16ReductionCuBLAS() const;
void setAllowBF16ReductionCuBLAS(
bool allow_reduced_precision,
bool allow_splitk = true);
bool allowFP16AccumulationCuBLAS() const;
void setAllowFP16AccumulationCuBLAS(bool);
void setAllowFP16AccumulationCuBLAS(bool /*b*/);
// Matmuls can use a so-called "persistent" kernel which launches one CUDA
// block for each SM on the GPU, and each block then iterates over multiple
@ -374,7 +384,7 @@ class TORCH_API Context {
// to make matmuls target only a subset of the SMs, so they can fully schedule
// even next to a comms kernel, and only be a few percent slower.
std::optional<int32_t> _SMCarveout_EXPERIMENTAL() const;
void _setSMCarveout_EXPERIMENTAL(std::optional<int32_t>);
void _setSMCarveout_EXPERIMENTAL(std::optional<int32_t> /*c*/);
at::QEngine qEngine() const;
void setQEngine(at::QEngine e);
@ -395,7 +405,7 @@ class TORCH_API Context {
void setDefaultMobileCPUAllocator();
void unsetDefaultMobileCPUAllocator();
bool allowFP16ReductionCPU() const;
void setAllowFP16ReductionCPU(bool);
void setAllowFP16ReductionCPU(bool /*b*/);
// Preserved for BC
void lazyInitCUDA() {
@ -452,8 +462,10 @@ class TORCH_API Context {
: at::Float32MatmulPrecision::HIGHEST;
int benchmark_limit_cudnn = 10;
bool allow_tf32_cudnn = true;
bool allow_fp16_reduction_cublas = true;
bool allow_bf16_reduction_cublas = true;
CuBLASReductionOption allow_fp16_reduction_cublas =
CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
CuBLASReductionOption allow_bf16_reduction_cublas =
CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
bool allow_fp16_accumulation_cublas = false;
std::optional<int32_t> sm_carveout = std::nullopt;
bool enabled_mkldnn = true;

View File

@ -16,8 +16,8 @@ inline void check_size_nonnegative(ArrayRef<int64_t> size) {
inline void check_size_nonnegative(ArrayRef<c10::SymInt> size) {
for (const auto& x : size) {
TORCH_CHECK(
x.expect_size(__FILE__, __LINE__),
TORCH_SYM_CHECK(
x.sym_ge(0),
"Trying to create tensor with negative dimension ",
x,
": ",

View File

@ -62,7 +62,7 @@ constexpr const char* unknown_eventname = "eventname not specified";
#endif
} // namespace (anonymous)
MapAllocator::MapAllocator(WithFd, std::string_view filename, int fd, int flags, size_t size)
MapAllocator::MapAllocator(WithFd /*unused*/, std::string_view filename, int fd, int flags, size_t size)
: filename_(filename.empty() ? unknown_filename : filename)
, size_(0) // to be filled later
#ifdef _WIN32
@ -494,7 +494,7 @@ RefcountedMapAllocator::RefcountedMapAllocator(const char *filename, int flags,
initializeAlloc();
}
RefcountedMapAllocator::RefcountedMapAllocator(WithFd, const char *filename, int fd, int flags, size_t size)
RefcountedMapAllocator::RefcountedMapAllocator(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size)
: RefcountedMapAllocatorArgCheck(flags)
, MapAllocator(WITH_FD, filename, flags, fd, size + map_alloc_alignment) {
@ -614,7 +614,7 @@ at::DataPtr MapAllocator::makeDataPtr(std::string_view filename, int flags, size
return {context->data(), context, &deleteMapAllocator, at::DeviceType::CPU};
}
at::DataPtr MapAllocator::makeDataPtr(WithFd, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
at::DataPtr MapAllocator::makeDataPtr(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
auto* context = new MapAllocator(WITH_FD, filename, fd, flags, size);
if (actual_size_out) *actual_size_out = context->size();
return {context->data(), context, &deleteMapAllocator, at::DeviceType::CPU};
@ -626,7 +626,7 @@ at::DataPtr RefcountedMapAllocator::makeDataPtr(const char *filename, int flags,
return {context->data(), context, &deleteRefcountedMapAllocator, at::DeviceType::CPU};
}
at::DataPtr RefcountedMapAllocator::makeDataPtr(WithFd, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
at::DataPtr RefcountedMapAllocator::makeDataPtr(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
auto* context = new RefcountedMapAllocator(WITH_FD, filename, fd, flags, size);
if (actual_size_out) *actual_size_out = context->size() - map_alloc_alignment;
return {context->data(), context, &deleteRefcountedMapAllocator, at::DeviceType::CPU};

View File

@ -25,7 +25,7 @@ class TORCH_API MapAllocator {
public:
MapAllocator(std::string_view filename, int flags, size_t size);
MapAllocator(
WithFd,
WithFd /*unused*/,
std::string_view filename,
int fd,
int flags,
@ -59,14 +59,14 @@ class TORCH_API MapAllocator {
return flags_;
}
static MapAllocator* fromDataPtr(const at::DataPtr&);
static MapAllocator* fromDataPtr(const at::DataPtr& /*dptr*/);
static at::DataPtr makeDataPtr(
std::string_view filename,
int flags,
size_t size,
size_t* actual_size_out);
static at::DataPtr makeDataPtr(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,
@ -105,13 +105,13 @@ class TORCH_API RefcountedMapAllocator : private RefcountedMapAllocatorArgCheck,
public:
RefcountedMapAllocator(const char* filename, int flags, size_t size);
RefcountedMapAllocator(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,
size_t size);
static RefcountedMapAllocator* fromDataPtr(const at::DataPtr&);
static RefcountedMapAllocator* fromDataPtr(const at::DataPtr& /*dptr*/);
RefcountedMapAllocator(const RefcountedMapAllocator&) = delete;
RefcountedMapAllocator(RefcountedMapAllocator&&) = delete;
RefcountedMapAllocator& operator=(const RefcountedMapAllocator&) = delete;
@ -122,7 +122,7 @@ class TORCH_API RefcountedMapAllocator : private RefcountedMapAllocatorArgCheck,
size_t size,
size_t* actual_size_out);
static at::DataPtr makeDataPtr(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,

View File

@ -273,7 +273,7 @@ c10::SymInt NestedTensorImpl::sym_numel_custom() const {
return NestedTensorImpl::numel_custom();
}
c10::SymBool NestedTensorImpl::sym_is_contiguous_custom(MemoryFormat) const {
c10::SymBool NestedTensorImpl::sym_is_contiguous_custom(MemoryFormat /*memory_format*/) const {
return nested_tensor_impl_is_contiguous(this);
}
IntArrayRef NestedTensorImpl::sizes_custom() const {

View File

@ -115,7 +115,8 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl {
// with real implementations
int64_t numel_custom() const override;
c10::SymInt sym_numel_custom() const override;
c10::SymBool sym_is_contiguous_custom(MemoryFormat) const override;
c10::SymBool sym_is_contiguous_custom(
MemoryFormat /*memory_format*/) const override;
int64_t size_custom(int64_t d) const override {
return this->size(d);
}

View File

@ -14,7 +14,7 @@ inline int64_t divup(int64_t x, int64_t y) {
TORCH_API void init_num_threads();
// Sets the number of threads to be used in parallel region
TORCH_API void set_num_threads(int);
TORCH_API void set_num_threads(int /*nthreads*/);
// Returns the maximum number of threads that may be used in a parallel region
TORCH_API int get_num_threads();
@ -37,7 +37,7 @@ inline void lazy_init_num_threads() {
}
}
TORCH_API void set_thread_num(int);
TORCH_API void set_thread_num(int /*id*/);
class TORCH_API ThreadIdGuard {
public:
@ -130,7 +130,7 @@ inline scalar_t parallel_reduce(
TORCH_API std::string get_parallel_info();
// Sets number of threads used for inter-op parallelism
TORCH_API void set_num_interop_threads(int);
TORCH_API void set_num_interop_threads(int /*nthreads*/);
// Returns the number of threads used for inter-op parallelism
TORCH_API size_t get_num_interop_threads();

View File

@ -252,7 +252,7 @@ void SparseCsrTensorImpl::set_stride(int64_t dim, int64_t new_stride) {
void SparseCsrTensorImpl::set_storage_offset(int64_t storage_offset) {
TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_storage_offset.");
}
c10::SymBool SparseCsrTensorImpl::sym_is_contiguous_custom(MemoryFormat) const {
c10::SymBool SparseCsrTensorImpl::sym_is_contiguous_custom(MemoryFormat /*memory_format*/) const {
TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have is_contiguous");
}
} // namespace at

View File

@ -32,10 +32,10 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
public:
explicit SparseCsrTensorImpl(
at::DispatchKeySet,
at::DispatchKeySet /*key_set*/,
at::Device device,
Layout layout,
const caffe2::TypeMeta);
const caffe2::TypeMeta /*data_type*/);
void resize_(int64_t nnz, IntArrayRef size);
void resize_and_clear_(
@ -86,7 +86,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
protected:
IntArrayRef strides_custom() const override;
SymIntArrayRef sym_strides_custom() const override;
SymBool sym_is_contiguous_custom(MemoryFormat) const override;
SymBool sym_is_contiguous_custom(
MemoryFormat /*memory_format*/) const override;
public:
void set_size(int64_t dim, int64_t new_size) override;

View File

@ -46,7 +46,9 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
public:
// Public for now...
explicit SparseTensorImpl(at::DispatchKeySet, const caffe2::TypeMeta);
explicit SparseTensorImpl(
at::DispatchKeySet /*key_set*/,
const caffe2::TypeMeta /*data_type*/);
void release_resources() override;
@ -229,14 +231,14 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
}
void resize_(int64_t sparse_dim, int64_t dense_dim, ArrayRef<int64_t> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
void resize_(
int64_t sparse_dim,
int64_t dense_dim,
ArrayRef<c10::SymInt> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
// NOTE: this function will resize the sparse tensor and also set `indices`
@ -384,8 +386,8 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
private:
explicit SparseTensorImpl(
at::DispatchKeySet,
const caffe2::TypeMeta,
at::DispatchKeySet /*key_set*/,
const caffe2::TypeMeta /*data_type*/,
at::Tensor indices,
at::Tensor values);

View File

@ -59,7 +59,7 @@ static inline void set_item(const Tensor& self, ArrayRef<TensorIndex> indices, c
}
}
return set_item(self, indices, value);
set_item(self, indices, value);
}
} // namespace indexing

View File

@ -112,10 +112,10 @@ TORCH_API std::ostream& operator<<(std::ostream& stream, const Slice& slice);
// `torch.tensor([1, 2])`) | `torch::tensor({1, 2})`
struct TORCH_API TensorIndex final {
// Case 1: `at::indexing::None`
TensorIndex(std::nullopt_t) : type_(TensorIndexType::None) {}
TensorIndex(std::nullopt_t /*unused*/) : type_(TensorIndexType::None) {}
// Case 2: "..." / `at::indexing::Ellipsis`
TensorIndex(at::indexing::EllipsisIndexType)
TensorIndex(at::indexing::EllipsisIndexType /*unused*/)
: type_(TensorIndexType::Ellipsis) {}
TensorIndex(const char* str) : TensorIndex(at::indexing::Ellipsis) {
TORCH_CHECK_VALUE(

View File

@ -765,7 +765,8 @@ void TensorIteratorBase::for_each(loop2d_t loop, int64_t grain_size) {
if (numel == 0) {
return;
} else if (numel < grain_size || at::get_num_threads() == 1) {
return serial_for_each(loop, {0, numel});
serial_for_each(loop, {0, numel});
return;
} else {
at::parallel_for(0, numel, grain_size, [&](int64_t begin, int64_t end) {
serial_for_each(loop, {begin, end});

View File

@ -250,7 +250,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
using PtrVector = SmallVector<char*, 4>;
using StrideVector = SmallVector<int64_t, 6>;
void build(TensorIteratorConfig&);
void build(TensorIteratorConfig& /*config*/);
// The inner-loop function operates on the fastest moving dimension. It
// implements element-wise operations in terms of 1-d strided tensors.
@ -618,20 +618,20 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
#undef TORCH_DISALLOW_TEMPORARIES
protected:
// Mutable reference as it moves tensors out of TensorIteratorConfig
void populate_operands(TensorIteratorConfig&);
void populate_operands(TensorIteratorConfig& /*config*/);
void mark_outputs();
void mark_resize_outputs(const TensorIteratorConfig&);
void compute_mem_overlaps(const TensorIteratorConfig&);
void compute_shape(const TensorIteratorConfig&);
void compute_strides(const TensorIteratorConfig&);
void mark_resize_outputs(const TensorIteratorConfig& /*config*/);
void compute_mem_overlaps(const TensorIteratorConfig& /*config*/);
void compute_shape(const TensorIteratorConfig& /*config*/);
void compute_strides(const TensorIteratorConfig& /*config*/);
void reorder_dimensions();
void permute_dimensions(IntArrayRef perm);
void compute_types(const TensorIteratorConfig&);
void compute_types(const TensorIteratorConfig& /*config*/);
ScalarType compute_common_dtype();
void allocate_or_resize_outputs();
bool fast_set_up(const TensorIteratorConfig&);
FastSetupType compute_fast_setup_type(const TensorIteratorConfig&);
void compute_names(const TensorIteratorConfig&);
bool fast_set_up(const TensorIteratorConfig& /*config*/);
FastSetupType compute_fast_setup_type(const TensorIteratorConfig& /*config*/);
void compute_names(const TensorIteratorConfig& /*config*/);
void propagate_names_to_outputs();
void coalesce_dimensions();

View File

@ -20,7 +20,7 @@
namespace at {
TORCH_API int _crash_if_asan(int);
TORCH_API int _crash_if_asan(int /*arg*/);
// Converts a TensorList (i.e. ArrayRef<Tensor> to vector of TensorImpl*)
// NB: This is ONLY used by legacy TH bindings, and ONLY used by cat.

View File

@ -103,9 +103,7 @@ std::string get_cpu_capability() {
#elif defined(HAVE_ZVECTOR_CPU_DEFINITION)
case native::CPUCapability::ZVECTOR:
return "Z VECTOR";
#elif defined(HAVE_SVE_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
case native::CPUCapability::SVE128:
return "SVE128";
#elif defined(HAVE_SVE256_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
case native::CPUCapability::SVE256:
return "SVE256";
#else

View File

@ -148,7 +148,7 @@ Tensor cached_cast(at::ScalarType to_type, const Tensor& arg, DeviceType device_
Banned functions
*******************************/
static Tensor binary_cross_entropy_banned(const Tensor &, const Tensor &, const std::optional<Tensor>&, int64_t) {
static Tensor binary_cross_entropy_banned(const Tensor & /*unused*/, const Tensor & /*unused*/, const std::optional<Tensor>& /*unused*/, int64_t /*unused*/) {
TORCH_CHECK(false, "torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.\n"
"Many models use a sigmoid layer right before the binary cross entropy layer.\n"
"In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits\n"

View File

@ -49,7 +49,7 @@ static void check_unique_names(DimnameList names) {
}
void check_names_valid_for(const TensorBase& tensor, DimnameList names) {
return impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
}
void check_names_valid_for(size_t tensor_dim, DimnameList names) {

View File

@ -27,11 +27,11 @@ struct TORCH_API NamedTensorMeta final : public c10::NamedTensorMetaInterface {
HasNonWildcard
};
explicit NamedTensorMeta(HAS_NON_WILDCARD, DimnameList names)
explicit NamedTensorMeta(HAS_NON_WILDCARD /*unused*/, DimnameList names)
: names_(names.vec()) {
check_invariants();
}
explicit NamedTensorMeta(HAS_NON_WILDCARD, std::vector<Dimname>&& names)
explicit NamedTensorMeta(HAS_NON_WILDCARD /*unused*/, std::vector<Dimname>&& names)
: names_(std::move(names)) {
check_invariants();
}
@ -52,13 +52,13 @@ struct TORCH_API NamedTensorMeta final : public c10::NamedTensorMetaInterface {
std::any_of(names_.begin(), names_.end(), [](const Dimname& n) { return !n.isWildcard(); }));
}
void set_names(HAS_NON_WILDCARD, DimnameList new_names) {
void set_names(HAS_NON_WILDCARD /*unused*/, DimnameList new_names) {
TORCH_INTERNAL_ASSERT(new_names.size() == names_.size());
std::copy(new_names.begin(), new_names.end(), names_.begin());
check_invariants();
}
void set_names(HAS_NON_WILDCARD, std::vector<Dimname>&& new_names) {
void set_names(HAS_NON_WILDCARD /*unused*/, std::vector<Dimname>&& new_names) {
TORCH_INTERNAL_ASSERT(new_names.size() == names_.size());
names_ = std::move(new_names);
check_invariants();

View File

@ -13,7 +13,7 @@ class TORCH_API PythonOpRegistrationTrampoline final {
public:
// Returns true if you successfully registered yourself (that means
// you are in the hot seat for doing the operator registrations!)
static bool registerInterpreter(c10::impl::PyInterpreter*);
static bool registerInterpreter(c10::impl::PyInterpreter* /*interp*/);
// Returns nullptr if no interpreter has been registered yet.
static c10::impl::PyInterpreter* getInterpreter();

View File

@ -138,7 +138,7 @@ void Tensor::_backward(TensorList inputs,
const std::optional<Tensor>& gradient,
std::optional<bool> keep_graph,
bool create_graph) const {
return impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
}
const TensorBase& TensorBase::requires_grad_(bool _requires_grad) const {
@ -173,4 +173,12 @@ unsigned TensorBase::_register_hook(std::function<TensorBase(const TensorBase&)>
return impl::GetVariableHooks()->_register_hook(*this, std::move(hook));
}
std::optional<ScalarType> TensorBase::grad_dtype() const {
return impl::GetVariableHooks()->grad_dtype(*this);
}
void TensorBase::set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const {
return impl::GetVariableHooks()->set_grad_dtype(*this, grad_dtype);
}
} // namespace at

View File

@ -100,7 +100,7 @@ class TORCH_API TensorBase {
// Create a Tensor with a +0 reference count. Special care must be
// taken to avoid decrementing this reference count at destruction
// time. Intended to support MaybeOwnedTraits<Tensor>.
explicit TensorBase(unsafe_borrow_t, const TensorBase& rhs)
explicit TensorBase(unsafe_borrow_t /*unused*/, const TensorBase& rhs)
: impl_(c10::intrusive_ptr<at::TensorImpl, UndefinedTensorImpl>(rhs.impl_.get(), c10::raw::DontIncreaseRefcount{})) {}
friend MaybeOwnedTraits<TensorBase>;
@ -930,6 +930,10 @@ public:
const TensorBase& requires_grad_(bool _requires_grad=true) const;
std::optional<ScalarType> grad_dtype() const;
void set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const;
// View Variables
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -950,7 +954,7 @@ protected:
c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_;
private:
TensorBase __dispatch_contiguous(c10::MemoryFormat) const;
TensorBase __dispatch_contiguous(c10::MemoryFormat /*memory_format*/) const;
};
inline DeviceIndex get_device(const TensorBase& self) {

View File

@ -68,6 +68,8 @@ struct TORCH_API VariableHooksInterface {
const c10::OperatorHandle& op,
c10::DispatchKeySet dispatch_keys,
torch::jit::Stack* stack) const = 0;
virtual std::optional<c10::ScalarType> grad_dtype(const TensorBase&) const = 0;
virtual void set_grad_dtype(const TensorBase&, const std::optional<c10::ScalarType>&) const = 0;
};
TORCH_API void SetVariableHooks(VariableHooksInterface* hooks);

View File

@ -18,10 +18,10 @@ class KernelFunction;
// implementation notes; notably, this does NOT actually go through the
// boxing/unboxing codepath.
TORCH_API void fallthrough_kernel(
OperatorKernel*,
const OperatorHandle&,
DispatchKeySet,
Stack*);
OperatorKernel* /*unused*/,
const OperatorHandle& /*unused*/,
DispatchKeySet /*unused*/,
Stack* /*unused*/);
// Note [Ambiguity in AutogradOther kernel]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -62,10 +62,10 @@ TORCH_API void fallthrough_kernel(
// than arbitrarily pick one or the other, we just register a kernel that raises
// an error and let the user decide how to proceed.
TORCH_API void ambiguous_autogradother_kernel(
OperatorKernel*,
const OperatorHandle&,
DispatchKeySet,
Stack*);
OperatorKernel* /*unused*/,
const OperatorHandle& /*op*/,
DispatchKeySet /*unused*/,
Stack* /*unused*/);
// Note [named_not_supported_kernel]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -75,10 +75,10 @@ TORCH_API void ambiguous_autogradother_kernel(
// give a good error message in cases when boxing is not supported). When
// boxing is universally supported this can be removed.
[[noreturn]] TORCH_API void named_not_supported_kernel(
OperatorKernel*,
const OperatorHandle&,
DispatchKeySet,
Stack*);
OperatorKernel* /*unused*/,
const OperatorHandle& /*op*/,
DispatchKeySet /*unused*/,
Stack* /*unused*/);
/**
* BoxedKernel is similar to a std::function storing a boxed kernel.
@ -185,16 +185,16 @@ class TORCH_API BoxedKernel final {
template <BoxedKernelFunction* func>
static void make_boxed_function(
OperatorKernel*,
OperatorKernel* /*unused*/,
const OperatorHandle& opHandle,
DispatchKeySet,
DispatchKeySet /*unused*/,
Stack* stack);
template <BoxedKernelFunction_withDispatchKeys* func>
static void make_boxed_function(
OperatorKernel*,
OperatorKernel* /*unused*/,
const OperatorHandle& opHandle,
DispatchKeySet,
DispatchKeySet /*ks*/,
Stack* stack);
explicit BoxedKernel(

View File

@ -11,9 +11,9 @@ inline BoxedKernel::BoxedKernel(
template <BoxedKernel::BoxedKernelFunction* func>
inline void BoxedKernel::make_boxed_function(
OperatorKernel*,
OperatorKernel* /*unused*/,
const OperatorHandle& opHandle,
DispatchKeySet,
DispatchKeySet /*unused*/,
Stack* stack) {
// Note that we're dropping the DispatchKeySet argument.
// See Note [Plumbing Keys Through The Dispatcher 2] for details.
@ -22,7 +22,7 @@ inline void BoxedKernel::make_boxed_function(
template <BoxedKernel::BoxedKernelFunction_withDispatchKeys* func>
inline void BoxedKernel::make_boxed_function(
OperatorKernel*,
OperatorKernel* /*unused*/,
const OperatorHandle& opHandle,
DispatchKeySet ks,
Stack* stack) {

View File

@ -10,7 +10,7 @@ namespace c10 {
// be handled specially. Its semantics is that it redispatches to the
// *next* dispatch key that would have been processed, skipping the current
// one.
void fallthrough_kernel(OperatorKernel*, const OperatorHandle&, DispatchKeySet, Stack*) {
void fallthrough_kernel(OperatorKernel* /*unused*/, const OperatorHandle& /*unused*/, DispatchKeySet /*unused*/, Stack* /*unused*/) {
TORCH_INTERNAL_ASSERT(0,
"fallthrough_kernel was executed but it should have been short-circuited by the dispatcher. "
"This could occur if you registered a fallthrough kernel as a override for a specific operator "
@ -19,7 +19,7 @@ void fallthrough_kernel(OperatorKernel*, const OperatorHandle&, DispatchKeySet,
"let us know in the bug tracker.");
}
void ambiguous_autogradother_kernel(OperatorKernel*, const OperatorHandle& op, DispatchKeySet, Stack*) {
void ambiguous_autogradother_kernel(OperatorKernel* /*unused*/, const OperatorHandle& op, DispatchKeySet /*unused*/, Stack* /*unused*/) {
TORCH_INTERNAL_ASSERT(0,
op.operator_name(), " has kernels registered to both CompositeImplicitAutograd and a backend mapped to AutogradOther. "
"This makes the backend kernel unreachable; the dispatcher will always prefer the CompositeImplicitAutograd lowering "
@ -32,7 +32,7 @@ void ambiguous_autogradother_kernel(OperatorKernel*, const OperatorHandle& op, D
"\nCanonical state\n~~~~~~~~~~~\n", op.dumpState(), "\n\n");
}
void named_not_supported_kernel(OperatorKernel*, const OperatorHandle& op, DispatchKeySet, Stack*) {
void named_not_supported_kernel(OperatorKernel* /*unused*/, const OperatorHandle& op, DispatchKeySet /*unused*/, Stack* /*unused*/) {
// DO NOT LOOK AT STACK, YOU HAVE SHORT CIRCUITED BOXING
// See Note [named_not_supported_kernel]
TORCH_CHECK(0,

View File

@ -229,7 +229,7 @@ class TORCH_API KernelFunction final {
* &unboxed_func>();
*/
template <class FuncPtr, bool AllowLegacyTypes = false>
static KernelFunction makeFromUnboxedFunction(FuncPtr);
static KernelFunction makeFromUnboxedFunction(FuncPtr /*func_ptr*/);
/**
* Create a KernelFunction from an unboxed function.
@ -271,7 +271,7 @@ class TORCH_API KernelFunction final {
std::string dumpState() const;
// For testing internal invariants only
bool _equalsBoxedAndUnboxed(const KernelFunction&) const;
bool _equalsBoxedAndUnboxed(const KernelFunction& /*other*/) const;
// Register a token to be invalidated when this KernelFunction is destroyed
void registerToken(std::weak_ptr<KernelToken> token) const;

View File

@ -131,7 +131,7 @@ C10_ALWAYS_INLINE_UNLESS_MOBILE void boxToStack(
new (dest++) IValue(options.pinned_memory());
}
inline void boxArgsToStack(IValue*&) {}
inline void boxArgsToStack(IValue*& /*unused*/) {}
template <typename T, typename... Args>
C10_ALWAYS_INLINE_UNLESS_MOBILE void boxArgsToStack(
@ -185,7 +185,7 @@ struct PopResult<std::tuple<Types...>> final {
template <size_t... indices>
static Result pop_to_tuple_impl(
Stack& stack,
std::index_sequence<indices...>) {
std::index_sequence<indices...> /*unused*/) {
return std::make_tuple((std::move(stack[indices]).template to<Types>())...);
}
};

View File

@ -561,7 +561,7 @@ struct wrap_kernel_functor_unboxed_<
// doesn't use &&
static ReturnType call(
OperatorKernel* functor,
DispatchKeySet,
DispatchKeySet /*unused*/,
ParameterTypes... args) {
KernelFunctor* functor_ = static_cast<KernelFunctor*>(functor);
// Note [Plumbing Keys Through The Dispatcher 2]
@ -629,8 +629,8 @@ call_functor_with_args_from_stack_(
OperatorKernel* functor,
DispatchKeySet dispatchKeySet,
Stack* stack,
std::index_sequence<ivalue_arg_indices...>,
guts::typelist::typelist<ArgTypes...>*) {
std::index_sequence<ivalue_arg_indices...> /*unused*/,
guts::typelist::typelist<ArgTypes...>* /*unused*/) {
(void)(stack); // when sizeof...(ivalue_arg_indices) == 0, this argument would
// be unused and we have to silence the compiler warning.
@ -708,7 +708,7 @@ struct push_outputs<std::tuple<OutputTypes...>, AllowDeprecatedTypes> final {
static void call_(
std::tuple<OutputTypes...>&& output,
Stack* stack,
std::index_sequence<indices...>) {
std::index_sequence<indices...> /*unused*/) {
torch::jit::push(
*stack,
return_to_ivalue<OutputTypes, AllowDeprecatedTypes>::call(
@ -718,7 +718,7 @@ struct push_outputs<std::tuple<OutputTypes...>, AllowDeprecatedTypes> final {
static void copy_(
const std::tuple<OutputTypes...>& output,
Stack* stack,
std::index_sequence<indices...>) {
std::index_sequence<indices...> /*unused*/) {
torch::jit::push(
*stack,
return_to_ivalue<OutputTypes, AllowDeprecatedTypes>::copy(
@ -741,7 +741,7 @@ struct make_boxed_from_unboxed_functor final {
static void call(
OperatorKernel* functor,
const OperatorHandle&,
const OperatorHandle& /*unused*/,
DispatchKeySet dispatchKeySet,
Stack* stack) {
using ReturnType =

View File

@ -63,13 +63,13 @@ struct BuiltinOpFunction : public Function {
bool call(
Stack& stack,
std::optional<size_t>,
c10::function_ref<void(const Code&)>) override {
std::optional<size_t> /*unused*/,
c10::function_ref<void(const Code&)> /*unused*/) override {
run(stack);
return false;
}
bool call(Stack& stack, c10::function_ref<void(const mobile::Code&)>)
bool call(Stack& stack, c10::function_ref<void(const mobile::Code&)> /*unused*/)
override {
run(stack);
return false;

View File

@ -80,7 +80,8 @@ struct MultiDispatchKeySet : at::IterArgs<MultiDispatchKeySet> {
ts = ts | x.key_set();
}
}
[[noreturn]] void operator()(at::ArrayRef<std::optional<at::Tensor>>) {
[[noreturn]] void operator()(
at::ArrayRef<std::optional<at::Tensor>> /*unused*/) {
// Just checking that the handling of Tensor?[] didn't change.
TORCH_INTERNAL_ASSERT(false);
}
@ -95,7 +96,7 @@ struct MultiDispatchKeySet : at::IterArgs<MultiDispatchKeySet> {
}
}
template <typename T>
void operator()(const T&) {
void operator()(const T& /*unused*/) {
// do nothing
}
};

View File

@ -496,7 +496,7 @@ class TORCH_API OperatorHandle {
}
void checkInvariants() const {
return operatorDef_->op.checkInvariants();
operatorDef_->op.checkInvariants();
}
c10::ArrayRef<at::Tag> getTags() const {
@ -633,7 +633,7 @@ class TypedOperatorHandle<Return(Args...)> final : public OperatorHandle {
namespace detail {
template <class... Args>
inline void unused_arg_(const Args&...) {}
inline void unused_arg_(const Args&... /*unused*/) {}
// CaptureKernelCall is intended to capture return values from Dispatcher
// unboxed kernel calls. A record function may request to get outputs from the
@ -932,7 +932,7 @@ inline void Dispatcher::redispatchBoxed(
}
#endif
const auto& kernel = entry.lookup(dispatchKeySet);
return kernel.callBoxed(op, dispatchKeySet, stack);
kernel.callBoxed(op, dispatchKeySet, stack);
}
} // namespace c10

View File

@ -105,7 +105,7 @@ class TORCH_API OperatorEntry final {
// versa that is an error. (Refcounting for the registrations is
// handled in the OperatorHandle in Dispatcher)
void registerSchema(
FunctionSchema&&,
FunctionSchema&& /*schema*/,
std::string&& debug,
std::vector<at::Tag> tags = {});
void deregisterSchema();

View File

@ -177,7 +177,7 @@ bool DynamicType::equals(const Type& rhs) const {
return equals(*create(rhs));
}
bool DynamicType::isSubtypeOfExt(const Type& rhs, std::ostream*) const {
bool DynamicType::isSubtypeOfExt(const Type& rhs, std::ostream* /*why_not*/) const {
auto other = create(rhs);
if (tag_ == other->tag_) {
if (equals(*other)) {
@ -371,7 +371,7 @@ DynamicTypePtr ivalue::TupleTypeFactory<c10::DynamicType>::create(
}
DynamicTypePtr ivalue::TupleTypeFactory<c10::DynamicType>::fallback(
const Type&) {
const Type& /*unused*/) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false);
return nullptr;
}

View File

@ -138,8 +138,8 @@ class DynamicType : public SharedType {
struct Arguments {
Arguments() = default;
Arguments(c10::ArrayRef<TypePtr>);
Arguments(const std::vector<std::string_view>&, c10::ArrayRef<TypePtr>);
Arguments(c10::ArrayRef<TypePtr> /*args*/);
Arguments(const std::vector<std::string_view>& /*names*/, c10::ArrayRef<TypePtr> /*args*/);
std::vector<LabeledDynamicType> elems;
};
@ -156,15 +156,15 @@ class DynamicType : public SharedType {
static const TypeKind Kind = TypeKind::DynamicType;
static TORCH_API DynamicTypePtr create(Type& ty);
explicit DynamicType(Tag, Arguments);
explicit DynamicType(Tag, std::string_view, Arguments);
explicit DynamicType(Tag /*tag*/, Arguments /*arguments*/);
explicit DynamicType(Tag /*tag*/, std::string_view /*name*/, Arguments /*arguments*/);
DynamicType(DynamicType&& other) = delete;
DynamicType(const DynamicType&) = delete;
DynamicType& operator=(const DynamicType&) = delete;
DynamicType& operator=(DynamicType&&) = delete;
TypePtr containedType(size_t) const override;
TypePtr containedType(size_t /*i*/) const override;
size_t containedTypeSize() const override;
Tag tag() const {
return tag_;

View File

@ -96,15 +96,15 @@ struct TORCH_API Function {
// Overload for server interpreter, a bailout size is needed for graph
// executor.
virtual bool call(
Stack&,
std::optional<size_t>,
c10::function_ref<void(const Code&)>) {
Stack& /*unused*/,
std::optional<size_t> /*unused*/,
c10::function_ref<void(const Code&)> /*unused*/) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false);
return false;
}
// Overload for mobile interpreter.
virtual bool call(Stack&, c10::function_ref<void(const mobile::Code&)>) {
virtual bool call(Stack& /*unused*/, c10::function_ref<void(const mobile::Code&)> /*unused*/) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false);
return false;
}

View File

@ -847,7 +847,7 @@ struct TORCH_API IValue final {
IValue(std::optional<T> v);
template <class T, enable_if_list_is_ivalue_constructible<T> = nullptr>
IValue(c10::OptionalArrayRef<T> v);
IValue(std::nullopt_t);
IValue(std::nullopt_t /*unused*/);
// ClassType
IValue(c10::intrusive_ptr<ivalue::Object> v);

View File

@ -660,7 +660,7 @@ struct TORCH_API TupleTypeFactory<TupleType> {
template <>
struct TORCH_API TupleTypeFactory<c10::DynamicType> {
static DynamicTypePtr create(const std::vector<TypePtr>& elemTypes);
static DynamicTypePtr fallback(const Type&);
static DynamicTypePtr fallback(const Type& /*unused*/);
};
struct TORCH_API Tuple : c10::intrusive_ptr_target {
@ -1682,7 +1682,7 @@ struct ivalue::EnumHolder : c10::intrusive_ptr_target {
namespace detail {
struct _guarded_unsigned_long_unique_dummy final {
_guarded_unsigned_long_unique_dummy(int64_t){}
_guarded_unsigned_long_unique_dummy(int64_t /*unused*/){}
};
using _guarded_unsigned_long = std::conditional_t<
std::is_same_v<unsigned long, uint32_t> ||
@ -1776,7 +1776,7 @@ template <class Elem>
// native_functions.yaml still return std::vector.
// C10_DEPRECATED_MESSAGE("IValues based on std::vector<T> are potentially slow
// and deprecated. Please use torch::List<T> instead.")
std::vector<Elem> generic_to(IValue ivalue, _fake_type<std::vector<Elem>>) {
std::vector<Elem> generic_to(IValue ivalue, _fake_type<std::vector<Elem>> /*unused*/) {
// We need to do a deep copy of the vector because there might be other
// references to this same IValue that also use the list. We can't just
// move the elements out.
@ -1826,18 +1826,18 @@ c10::intrusive_ptr<T> IValue::toCustomClass() const& {
}
template <typename T>
T generic_to(IValue ivalue, _fake_type<T>) {
T generic_to(IValue ivalue, _fake_type<T> /*unused*/) {
using ElemType = typename std::remove_pointer<T>::type::element_type;
return std::move(ivalue).template toCustomClass<ElemType>();
}
template <typename T>
tagged_capsule<T> generic_to(IValue ivalue, _fake_type<tagged_capsule<T>>) {
tagged_capsule<T> generic_to(IValue ivalue, _fake_type<tagged_capsule<T>> /*unused*/) {
return tagged_capsule<T>{std::move(ivalue)};
}
template <typename Elem>
c10::List<Elem> generic_to(IValue ivalue, _fake_type<c10::List<Elem>>) {
c10::List<Elem> generic_to(IValue ivalue, _fake_type<c10::List<Elem>> /*unused*/) {
return impl::toTypedList<Elem>(std::move(ivalue).toList());
}
@ -1867,7 +1867,7 @@ std::vector<T> createVectorFromList(const c10::List<T>& impl) {
}
template <typename T>
OptionalArray<T> generic_to(IValue ivalue, _fake_type<OptionalArray<T>>) {
OptionalArray<T> generic_to(IValue ivalue, _fake_type<OptionalArray<T>> /*unused*/) {
if (ivalue.isNone()) {
return {};
}
@ -1880,8 +1880,8 @@ namespace detail {
template <typename Elem, size_t... I>
std::array<Elem, sizeof...(I)> generic_to_array(
IValue ivalue,
_fake_type<std::array<Elem, sizeof...(I)>>,
std::index_sequence<I...>) {
_fake_type<std::array<Elem, sizeof...(I)>> /*unused*/,
std::index_sequence<I...> /*unused*/) {
// We need to do a deep copy of the array because there might be other
// references to this same IValue that also use the list. We can't just
// move the elements out.
@ -1906,7 +1906,7 @@ std::array<Elem, N> generic_to(
template <typename Key, typename Value>
c10::Dict<Key, Value> generic_to(
IValue ivalue,
_fake_type<c10::Dict<Key, Value>>) {
_fake_type<c10::Dict<Key, Value>> /*unused*/) {
return impl::toTypedDict<Key, Value>(std::move(ivalue).toGenericDict());
}
@ -1915,7 +1915,7 @@ C10_DEPRECATED_MESSAGE(
"IValues based on std::unordered_map are slow and deprecated. Please use c10::Dict<K, V> instead.")
std::unordered_map<K, V> generic_to(
IValue ivalue,
_fake_type<std::unordered_map<K, V>>) {
_fake_type<std::unordered_map<K, V>> /*unused*/) {
std::unordered_map<K, V> specialized_dict;
for (const auto& item : std::move(ivalue).toGenericDict()) {
@ -1926,7 +1926,7 @@ std::unordered_map<K, V> generic_to(
}
template <typename T>
std::optional<T> generic_to(IValue ivalue, _fake_type<std::optional<T>>) {
std::optional<T> generic_to(IValue ivalue, _fake_type<std::optional<T>> /*unused*/) {
if (ivalue.isNone()) {
return std::nullopt;
}
@ -1937,7 +1937,7 @@ namespace detail {
template <typename Tuple, std::size_t... INDEX>
Tuple generic_to_tuple_impl(
const ivalue::TupleElements& t,
std::index_sequence<INDEX...>) {
std::index_sequence<INDEX...> /*unused*/) {
return std::make_tuple(
t[INDEX].to<typename std::tuple_element<INDEX, Tuple>::type>()...);
}
@ -1951,7 +1951,7 @@ template <
std::is_lvalue_reference<Args>...,
std::negation<std::is_constructible<IValue, Args>>...>,
std::nullptr_t> = nullptr>
std::tuple<Args...> generic_to(const IValue& ivalue, _fake_type<std::tuple<Args...>>) {
std::tuple<Args...> generic_to(const IValue& ivalue, _fake_type<std::tuple<Args...>> /*unused*/) {
const auto& vals = ivalue.toTupleRef().elements();
TORCH_CHECK(vals.size() == sizeof...(Args));
return detail::generic_to_tuple_impl<std::tuple<Args...>>(vals, Indices{});
@ -2311,7 +2311,7 @@ inline IValue::IValue(std::optional<T> v) : IValue() {
}
}
inline IValue::IValue(std::nullopt_t) : IValue() {}
inline IValue::IValue(std::nullopt_t /*unused*/) : IValue() {}
inline IValue::IValue(c10::intrusive_ptr<ivalue::Object> v)
: tag(Tag::Object) {
@ -2482,15 +2482,15 @@ namespace ivalue {
namespace detail {
template <typename T>
IValue from_(T&& x, std::true_type) {
IValue from_(T&& x, std::true_type /*unused*/) {
return IValue(std::forward<T>(x));
}
template <typename T>
IValue from_(c10::intrusive_ptr<T> x, std::false_type) {
IValue from_(c10::intrusive_ptr<T> x, std::false_type /*unused*/) {
return IValue(std::move(x));
}
template <typename T>
IValue from_(T&& /*x*/, std::false_type) {
IValue from_(T&& /*x*/, std::false_type /*unused*/) {
static_assert(
guts::false_t<T>::value,
"You are calling from with a type that it doesn't support, and isn't a potential custom class (ie: is an intrusive_ptr)");
@ -2546,19 +2546,19 @@ struct MaybeOwnedTraits<IValue> {
return &borrow;
}
static bool debugBorrowIsValid(const borrow_type&) {
static bool debugBorrowIsValid(const borrow_type& /*unused*/) {
return true;
}
};
template <>
struct IValue::TagType<c10::Type> {
static TORCH_API c10::TypePtr get(const IValue&);
static TORCH_API c10::TypePtr get(const IValue& /*v*/);
};
template <>
struct IValue::TagType<c10::DynamicType> {
static TORCH_API c10::TypePtr get(const IValue&);
static TORCH_API c10::TypePtr get(const IValue& /*v*/);
};
template <typename T>

View File

@ -44,7 +44,7 @@ constexpr int checkStaticTypes() {
}
template <typename... Ts, size_t... Is>
constexpr std::array<ArgumentDef, sizeof...(Ts)> createArgumentVectorFromTypes(std::index_sequence<Is...>) {
constexpr std::array<ArgumentDef, sizeof...(Ts)> createArgumentVectorFromTypes(std::index_sequence<Is...> /*unused*/) {
return (
// Check types for common errors
checkStaticTypes<Ts...>(),

View File

@ -83,7 +83,7 @@ inline bool operator!=(const OperatorName& lhs, const OperatorName& rhs) {
}
TORCH_API std::string toString(const OperatorName& opName);
TORCH_API std::ostream& operator<<(std::ostream&, const OperatorName&);
TORCH_API std::ostream& operator<<(std::ostream& /*os*/, const OperatorName& /*opName*/);
} // namespace c10

View File

@ -16,7 +16,7 @@ class SingletonTypePtr {
/* implicit */ SingletonTypePtr(T* p) : repr_(p) {}
// We need this to satisfy Pybind11, but it shouldn't be hit.
explicit SingletonTypePtr(std::shared_ptr<T>) { TORCH_CHECK(false); }
explicit SingletonTypePtr(std::shared_ptr<T> /*unused*/) { TORCH_CHECK(false); }
using element_type = typename std::shared_ptr<T>::element_type;

View File

@ -102,31 +102,8 @@ struct VecReduceAllSIMD<float, Op> {
#endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) &&
// !defined(C10_MOBILE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
#if defined(CPU_CAPABILITY_SVE256)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
const Op& vec_fun,
const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;
// 128-bit shuffle
svuint32_t ind = svdupq_n_u32(4, 5, 6, 7);
Vec v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 64-bit shuffle
ind = svdupq_n_u32(2, 3, 0, 1);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 32-bit shuffle
ind = svdupq_n_u32(1, 0, 2, 3);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
return svlasta(svpfalse(), v);
}
};
#else
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
@ -163,8 +140,35 @@ struct VecReduceAllSIMD<float, std::plus<Vectorized<float>>> {
return vaddvq_f32(acc_vec);
}
};
#endif // defined(CPU_CAPABILITY_SVE256)
#endif // defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
// && !defined(CPU_CAPABILITY_SVE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
defined(CPU_CAPABILITY_SVE256)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
const Op& vec_fun,
const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;
// 128-bit shuffle
svuint32_t ind = svdupq_n_u32(4, 5, 6, 7);
Vec v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 64-bit shuffle
ind = svdupq_n_u32(2, 3, 0, 1);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 32-bit shuffle
ind = svdupq_n_u32(1, 0, 2, 3);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
return svlasta(svpfalse(), v);
}
};
#endif // defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
// && defined(CPU_CAPABILITY_SVE256)
template <typename scalar_t, typename Op>
inline scalar_t vec_reduce_all(

View File

@ -1,21 +1,9 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <c10/macros/Macros.h>
#include <cstdint>
#include <ATen/cpu/vec/vec_base.h>
#if defined(__aarch64__) && \
(defined(AT_BUILD_ARM_VEC256_WITH_SLEEF) || \
defined(AT_BUILD_ARM_VECSVE_WITH_SLEEF))
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
#else
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
#if defined(CPU_CAPABILITY_SVE)
// Define the data type of VLS(vector-length specific).

View File

@ -2,6 +2,7 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/sve/vec_common_sve.h>
#include <ATen/cpu/vec/sve/vec_float.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/bit_cast.h>
@ -307,8 +308,8 @@ Vectorized<c10::BFloat16> inline operator/(
}
inline Vectorized<BFloat16>::Vectorized() {
const short zero = 0;
values = svdup_n_bf16(c10::bit_cast<bfloat16_t>(zero));
auto vals_f = svdup_n_f32(0);
values = convert_float_bfloat16(vals_f, vals_f);
}
inline Vectorized<BFloat16>::Vectorized(int val) {

View File

@ -8,48 +8,13 @@
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/vec_base.h>
#ifdef CPU_CAPABILITY_SVE128
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#include <ATen/cpu/vec/vec128/vec128_bfloat16_neon.h>
#include <ATen/cpu/vec/vec128/vec128_half_neon.h>
#include <ATen/cpu/vec/vec128/vec128_convert.h>
#include <ATen/cpu/vec/sve/vec_qint.h>
#elif defined(CPU_CAPABILITY_SVE)
#include <ATen/cpu/vec/sve/vec_float.h>
#if defined(CPU_CAPABILITY_SVE)
#include <ATen/cpu/vec/sve/vec_bfloat16.h>
#include <ATen/cpu/vec/sve/vec_double.h>
#include <ATen/cpu/vec/sve/vec_float.h>
#include <ATen/cpu/vec/sve/vec_int.h>
#include <ATen/cpu/vec/sve/vec_qint.h>
#include <ATen/cpu/vec/vec256/vec256_half.h>
#include <ATen/cpu/vec/vec256/vec256_convert.h>
#else // NEON
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#include <ATen/cpu/vec/vec128/vec128_half_neon.h>
#include <ATen/cpu/vec/vec128/vec128_bfloat16_neon.h>
#include <ATen/cpu/vec/vec128/vec128_convert.h>
#include <ATen/cpu/vec/vec256/vec256_qint.h>
#endif // defined(CPU_CAPABILITY_SVE128)
#include <ATen/cpu/vec/functional.h>
#endif
namespace at::vec {
// Note [CPU_CAPABILITY namespace]
@ -83,6 +48,12 @@ DEFINE_SVE_CAST(int32_t, s32, float, f32)
DEFINE_SVE_CAST(int16_t, s16, float, f32)
DEFINE_SVE_CAST(float, f32, double, f64)
#ifdef __ARM_FEATURE_BF16
DEFINE_SVE_CAST(int64_t, s64, c10::BFloat16, bf16)
DEFINE_SVE_CAST(int32_t, s32, c10::BFloat16, bf16)
DEFINE_SVE_CAST(int16_t, s16, c10::BFloat16, bf16)
#endif // __ARM_FEATURE_BF16
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template <int64_t scale = 1>
@ -202,11 +173,9 @@ std::pair<
// group cols crossing lanes:
// return {a0, b0, a1, b1, a2, b2, a3, b3}
// {a4, b4, a5, b5, a6, b6, a7, b7}
svbfloat16_t aReg = a;
svbfloat16_t bReg = b;
Vectorized<c10::BFloat16> c = svzip1_bf16(aReg, bReg);
Vectorized<c10::BFloat16> d = svzip2_bf16(aReg, bReg);
return std::make_pair(c, d);
return std::make_pair(
Vectorized<c10::BFloat16>(svzip1_bf16(a, b)),
Vectorized<c10::BFloat16>(svzip2_bf16(a, b)));
}
#endif // __ARM_FEATURE_BF16
@ -255,27 +224,12 @@ std::pair<
// swap lanes:
// return {a0, a1, a2, a3, a4, a5, a6, a7}
// {b0, b1, b2, b3, b4, b5, b6, b7}
svbfloat16_t aReg = a;
svbfloat16_t bReg = b;
Vectorized<c10::BFloat16> c = svuzp1_bf16(aReg, bReg);
Vectorized<c10::BFloat16> d = svuzp2_bf16(aReg, bReg);
return std::make_pair(c, d);
return std::make_pair(
Vectorized<c10::BFloat16>(svuzp1_bf16((svbfloat16_t)a, (svbfloat16_t)b)),
Vectorized<c10::BFloat16>(svuzp2_bf16((svbfloat16_t)a, (svbfloat16_t)b)));
}
#endif // __ARM_FEATURE_BF16
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FLIP ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#define DEFINE_FLIP_FUNC(type, sve_func) \
inline Vectorized<type> flip(const Vectorized<type>& v) { \
return Vectorized<type>(sve_func(v)); \
}
// Use the macro to define the flip functions
DEFINE_FLIP_FUNC(float, svrev_f32)
DEFINE_FLIP_FUNC(double, svrev_f64)
DEFINE_FLIP_FUNC(int64_t, svrev_s64)
DEFINE_FLIP_FUNC(int32_t, svrev_s32)
DEFINE_FLIP_FUNC(int16_t, svrev_s16)
DEFINE_FLIP_FUNC(int8_t, svrev_s8)
#endif // defined(CPU_CAPABILITY_SVE)
} // namespace CPU_CAPABILITY

View File

@ -1,8 +1,6 @@
#pragma once
#if defined(__aarch64__)
#include <ATen/cpu/vec/vec_common_aarch64.h>
#elif defined(CPU_CAPABILITY_AVX512)
#if defined(CPU_CAPABILITY_AVX512)
#include <ATen/cpu/vec/vec512/vec512.h>
#else
#include <ATen/cpu/vec/vec128/vec128.h>
@ -13,34 +11,6 @@ namespace at::vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
inline std::ostream& operator<<(std::ostream& stream, const c10::qint32& val) {
stream << val.val_;
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::qint8& val) {
stream << static_cast<int>(val.val_);
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::quint8& val) {
stream << static_cast<unsigned int>(val.val_);
return stream;
}
template <typename T>
std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
T buf[Vectorized<T>::size()];
vec.store(buf);
stream << "vec[";
for (int i = 0; i != Vectorized<T>::size(); i++) {
if (i != 0) {
stream << ", ";
}
stream << buf[i];
}
stream << "]";
return stream;
}
inline Vectorized<bool> convert_to_bool(Vectorized<int8_t> x) {
__at_align__ bool buffer[x.size()];
x.ne(Vectorized<int8_t>(0)).store(buffer);

View File

@ -2,7 +2,6 @@
// DO NOT DEFINE STATIC DATA IN THIS HEADER!
// See Note [Do not compile initializers with AVX]
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#include <ATen/cpu/vec/vec128/vec128_reduced_precision_common_neon.h>
#include <ATen/cpu/vec/vec_base.h>
@ -263,13 +262,6 @@ class Vectorized<c10::BFloat16> : public Vectorized16<
c10::bit_cast<at_bfloat16_t>(val6.x),
c10::bit_cast<at_bfloat16_t>(val7.x)}) {}
#ifdef CPU_CAPABILITY_SVE128
Vectorized(svbfloat16_t v) : Vectorized16(svget_neonq(v)) {}
operator svbfloat16_t() const {
return svset_neonq(svundef_bf16(), values);
}
#endif
static Vectorized<c10::BFloat16> blendv(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b,
@ -382,23 +374,6 @@ class Vectorized<c10::BFloat16> : public Vectorized16<
Vectorized ge(const Vectorized& other) const;
Vectorized lt(const Vectorized& other) const;
Vectorized le(const Vectorized& other) const;
#ifdef CPU_CAPABILITY_SVE128
template <typename step_t>
static Vectorized<BFloat16> arange(
BFloat16 base = 0.f,
step_t step = static_cast<step_t>(1)) {
__at_align__ BFloat16 buffer[size()];
for (int64_t i = 0; i < size(); i++) {
buffer[i] = base + i * step;
}
return svget_neonq(
svld1_bf16(ptrue, reinterpret_cast<bfloat16_t*>(buffer)));
}
#endif // CPU_CAPABILITY_SVE128
}; // Vectorized<c10::BFloat16>
inline std::tuple<Vectorized<float>, Vectorized<float>> convert_bfloat16_float(
@ -422,24 +397,6 @@ inline Vectorized<c10::BFloat16> convert_float_bfloat16(
return Vectorized<c10::BFloat16>(at_vcombine_bf16(x1, x2));
}
inline void load_fp32_from_bf16(const BFloat16* data, Vectorized<float>& out) {
__at_align__ float values[Vectorized<float>::size()];
for (const auto k : c10::irange(Vectorized<float>::size())) {
values[k] = data[k];
}
out = Vectorized<float>::loadu(values);
}
inline void load_fp32_from_bf16(
const BFloat16* data,
Vectorized<float>& out1,
Vectorized<float>& out2) {
Vectorized<BFloat16> bf16_vec = Vectorized<BFloat16>::loadu(data);
auto floats = convert_bfloat16_float(bf16_vec);
out1 = std::get<0>(floats);
out2 = std::get<1>(floats);
}
template <typename Op>
Vectorized<c10::BFloat16> binary_operator_via_float(
Op op,
@ -622,12 +579,6 @@ Vectorized<c10::BFloat16> inline fnmsub(
return -a * b - c;
}
#else //
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16)
LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16)
#endif // !defined(C10_MOBILE) && defined(__aarch64__)
} // namespace CPU_CAPABILITY

View File

@ -4,7 +4,7 @@
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256)
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
template <typename src_t>
struct VecConvert<
float,
@ -60,7 +60,6 @@ struct VecConvert<float, 1, BFloat16, 1> {
}
};
#endif // defined(__aarch64__) && (!defined(CPU_CAPABILITY_SVE) ||
// defined(CPU_CAPABILITY_SVE128))
#endif // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256)
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -4,10 +4,13 @@
// See Note [Do not compile initializers with AVX]
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(__aarch64__) && defined(AT_BUILD_ARM_VEC256_WITH_SLEEF)
#include <sleef.h>
#endif
// Sleef offers vectorized versions of some transcedentals
// such as sin, cos, tan etc..
// However for now opting for STL, since we are not building
@ -32,6 +35,12 @@ inline namespace CPU_CAPABILITY {
#error "Big endian is not supported."
#endif
#if defined(AT_BUILD_ARM_VEC256_WITH_SLEEF)
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
#else
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
template <int index, bool mask_val>
struct BlendRegs {
static float32x4_t impl(
@ -85,12 +94,6 @@ class Vectorized<float> {
operator float32x4_t() const {
return values;
}
#ifdef CPU_CAPABILITY_SVE128
Vectorized(svfloat32_t v) : values(svget_neonq(v)) {}
operator svfloat32_t() const {
return svset_neonq(svundef_f32(), values);
}
#endif
template <int64_t mask>
static Vectorized<float> blend(
const Vectorized<float>& a,

View File

@ -4,6 +4,7 @@
// See Note [Do not compile initializers with AVX]
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec128/vec128_convert.h>
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#include <ATen/cpu/vec/vec128/vec128_reduced_precision_common_neon.h>
#include <ATen/cpu/vec/vec_base.h>
@ -24,6 +25,7 @@ inline namespace CPU_CAPABILITY {
// https://bugs.llvm.org/show_bug.cgi?id=45824
// Most likely we will do aarch32 support with inline asm.
#if !defined(C10_MOBILE) && defined(__aarch64__)
#ifdef __BIG_ENDIAN__
#error "Big endian is not supported."
#endif
@ -419,24 +421,6 @@ Vectorized<c10::Half> inline operator+(
#endif
}
inline void load_fp32_from_fp16(const c10::Half* data, Vectorized<float>& out) {
__at_align__ float values[Vectorized<float>::size()];
for (const auto k : c10::irange(Vectorized<float>::size())) {
values[k] = data[k];
}
out = Vectorized<float>::loadu(values);
}
inline void load_fp32_from_fp16(
const c10::Half* data,
Vectorized<float>& out1,
Vectorized<float>& out2) {
Vectorized<c10::Half> f16_vec = Vectorized<c10::Half>::loadu(data);
auto floats = convert_half_float(f16_vec);
out1 = std::get<0>(floats);
out2 = std::get<1>(floats);
}
template <>
Vectorized<c10::Half> inline operator-(
const Vectorized<c10::Half>& a,
@ -672,53 +656,6 @@ Vectorized<c10::Half> inline fnmsub(
return -a * b - c;
#endif
}
#else
#define CONVERT_NON_VECTORIZED_INIT(type, name) \
inline std::tuple<Vectorized<float>, Vectorized<float>> \
convert_##name##_float(const Vectorized<type>& a) { \
constexpr int64_t K = Vectorized<type>::size(); \
__at_align__ float arr[K]; \
__at_align__ type arr2[K]; \
a.store(arr2); \
convert(arr2, arr, K); \
return std::make_tuple( \
Vectorized<float>::loadu(arr), \
Vectorized<float>::loadu(arr + Vectorized<float>::size())); \
} \
inline Vectorized<type> convert_float_##name( \
const Vectorized<float>& a, const Vectorized<float>& b) { \
constexpr int64_t K = Vectorized<type>::size(); \
__at_align__ float arr[K]; \
__at_align__ type arr2[K]; \
a.store(arr); \
b.store(arr + Vectorized<float>::size()); \
convert(arr, arr2, K); \
return Vectorized<type>::loadu(arr2); \
}
#define LOAD_FP32_NON_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name( \
const type* data, Vectorized<float>& out) { \
__at_align__ float values[Vectorized<float>::size()]; \
for (const auto k : c10::irange(Vectorized<float>::size())) { \
values[k] = data[k]; \
} \
out = Vectorized<float>::loadu(values); \
} \
\
inline void load_fp32_from_##name( \
const type* data, Vectorized<float>& out1, Vectorized<float>& out2) { \
load_fp32_from_##name(data, out1); \
data += Vectorized<float>::size(); \
load_fp32_from_##name(data, out2); \
}
CONVERT_NON_VECTORIZED_INIT(Half, half)
LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16)
#endif // !defined(C10_MOBILE) && defined(__aarch64__)
} // namespace CPU_CAPABILITY

View File

@ -9,16 +9,21 @@
#if !( \
defined(__VSX__) || defined(CPU_CAPABILITY_VSX) || \
defined(CPU_CAPABILITY_ZVECTOR))
#include <ATen/cpu/vec/vec256/vec256_double.h>
#if defined(CPU_CAPABILITY_SVE256)
#include <ATen/cpu/vec/sve/vec_common_sve.h>
#else
// clang-format off
#include <ATen/cpu/vec/vec256/vec256_float.h>
#include <ATen/cpu/vec/vec256/vec256_double.h>
#include <ATen/cpu/vec/vec256/vec256_int.h>
#include <ATen/cpu/vec/vec256/vec256_qint.h>
#endif
#if !defined(CPU_CAPABILITY_SVE256) || !defined(__ARM_FEATURE_BF16)
#include <ATen/cpu/vec/vec256/vec256_bfloat16.h>
#endif
#include <ATen/cpu/vec/vec256/vec256_complex_double.h>
#include <ATen/cpu/vec/vec256/vec256_complex_float.h>
#include <ATen/cpu/vec/vec256/vec256_half.h>
#include <ATen/cpu/vec/vec256/vec256_complex_float.h>
#include <ATen/cpu/vec/vec256/vec256_complex_double.h>
// clang-format on
#elif defined(__VSX__) || defined(CPU_CAPABILITY_VSX)
#include <ATen/cpu/vec/vec256/vsx/vec256_common_vsx.h>
@ -51,6 +56,34 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
inline std::ostream& operator<<(std::ostream& stream, const c10::qint32& val) {
stream << val.val_;
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::qint8& val) {
stream << static_cast<int>(val.val_);
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::quint8& val) {
stream << static_cast<unsigned int>(val.val_);
return stream;
}
template <typename T>
std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
T buf[Vectorized<T>::size()];
vec.store(buf);
stream << "vec[";
for (int i = 0; i != Vectorized<T>::size(); i++) {
if (i != 0) {
stream << ", ";
}
stream << buf[i];
}
stream << "]";
return stream;
}
#if defined(CPU_CAPABILITY_AVX2)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST (AVX2) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@ -268,7 +268,9 @@ LOAD_FP32_VECTORIZED_INIT(BFloat16, bf16)
#else // defined(CPU_CAPABILITY_AVX2)
#if !(defined(__aarch64__))
#if !( \
defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE256))
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16)
#endif

View File

@ -342,19 +342,19 @@ class Vectorized<c10::complex<double>> {
return _mm256_cmp_pd(values, other.values, _CMP_NEQ_UQ);
}
Vectorized<c10::complex<double>> operator<(
const Vectorized<c10::complex<double>>&) const {
const Vectorized<c10::complex<double>>& /*unused*/) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator<=(
const Vectorized<c10::complex<double>>&) const {
const Vectorized<c10::complex<double>>& /*unused*/) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator>(
const Vectorized<c10::complex<double>>&) const {
const Vectorized<c10::complex<double>>& /*unused*/) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator>=(
const Vectorized<c10::complex<double>>&) const {
const Vectorized<c10::complex<double>>& /*unused*/) const {
TORCH_CHECK(false, "not supported for complex numbers");
}

View File

@ -268,7 +268,9 @@ LOAD_FP32_VECTORIZED_INIT(Half, fp16)
#else // defined(CPU_CAPABILITY_AVX2)
#if !defined(__aarch64__) || defined(CPU_CAPABILITY_SVE256)
#if !( \
defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE256))
CONVERT_NON_VECTORIZED_INIT(Half, half)
#endif

View File

@ -5,13 +5,6 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#ifdef __aarch64__
#if defined(CPU_CAPABILITY_SVE128) || !defined(CPU_CAPABILITY_SVE)
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#endif
#endif
#include <ATen/native/quantized/AffineQuantizerBase.h>
#include <c10/util/irange.h>
@ -922,7 +915,7 @@ Vectorized<c10::quint8> inline maximum(
return a.maximum(b);
}
#else
#elif !defined(CPU_CAPABILITY_SVE256)
// NOTE: These are low-performance implementations that we fall back on
// if we are not building with AVX2. This may not be an issue, because
@ -1379,18 +1372,12 @@ Vectorized<c10::quint8> inline maximum(
return a.maximum(b);
}
#if defined(__aarch64__) && \
(defined(CPU_CAPABILITY_SVE128) || !defined(CPU_CAPABILITY_SVE))
#endif // if defined(CPU_CAPABILITY_AVX2)
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
std::pair<Vectorized<float>, Vectorized<float>> inline convert_int8_to_float(
at::vec::Vectorized<int8_t> src) {
#ifdef CPU_CAPABILITY_SVE
svint8_t x = src;
auto s8x8 = vget_low_s8(svget_neonq(x));
#else
auto s8x8 = vld1_s8(src.operator const int8_t*());
#endif
auto s16x8 = vmovl_s8(s8x8);
auto s32x4_hi = vmovl_s16(vget_high_s16(s16x8));
@ -1415,14 +1402,7 @@ std::pair<Vectorized<float>, Vectorized<float>> inline convert_int8_to_float(
Vectorized<float> inline convert_int8_half_register_to_float(
at::vec::Vectorized<int8_t> src) {
#ifdef CPU_CAPABILITY_SVE
svint8_t x = src;
auto s8x8 = vget_low_s8(svget_neonq(x));
#else
auto s8x8 = vld1_s8(src.operator const int8_t*());
#endif
auto s16x8 = vmovl_s8(s8x8);
auto s32x4_lo = vmovl_s16(vget_low_s16(s16x8));
@ -1440,8 +1420,5 @@ Vectorized<float> inline convert_int8_half_register_to_float(
}
#endif
#endif // if defined(CPU_CAPABILITY_AVX2)
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -31,6 +31,34 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
inline std::ostream& operator<<(std::ostream& stream, const c10::qint32& val) {
stream << val.val_;
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::qint8& val) {
stream << static_cast<int>(val.val_);
return stream;
}
inline std::ostream& operator<<(std::ostream& stream, const c10::quint8& val) {
stream << static_cast<unsigned int>(val.val_);
return stream;
}
template <typename T>
std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
T buf[Vectorized<T>::size()];
vec.store(buf);
stream << "vec[";
for (int i = 0; i != Vectorized<T>::size(); i++) {
if (i != 0) {
stream << ", ";
}
stream << buf[i];
}
stream << "]";
return stream;
}
#if defined(CPU_CAPABILITY_AVX512)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST (AVX512)

View File

@ -67,7 +67,18 @@ Windows llvm will not have this definition.
#endif
#define VECTOR_WIDTH 64
#define int_vector __m512i
#elif defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_SVE256)
#elif defined(__aarch64__) && \
!defined(CPU_CAPABILITY_SVE) // CPU_CAPABILITY_AVX512
// SVE code expects 256-vectors; leave that set for SVE?
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(16)))
#elif defined(_WIN32)
#define __at_align__ __declspec(align(16))
#else
#define __at_align__
#endif
#define VECTOR_WIDTH 16
#else // CPU_CAPABILITY_AVX512
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(32)))
#elif defined(_WIN32)
@ -77,27 +88,7 @@ Windows llvm will not have this definition.
#endif
#define VECTOR_WIDTH 32
#define int_vector __m256i
#elif defined(__aarch64__)
// Define alignment and vector width for SVE128/Default (e.g., NEON)
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(16)))
#elif defined(_WIN32)
#define __at_align__ __declspec(align(16))
#else
#define __at_align__
#endif
#define VECTOR_WIDTH 16
#else
// Fallback: define default alignment and vector width
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(32)))
#elif defined(_WIN32)
#define __at_align__ __declspec(align(32))
#else
#define __at_align__
#endif
#define VECTOR_WIDTH 32
#endif
#endif // CPU_CAPABILITY_AVX512
namespace at::vec {
// See Note [CPU_CAPABILITY namespace]

View File

@ -422,18 +422,34 @@ static inline bool bgemm_internal_cublaslt(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYPE(D
abType = CUDA_R_16F;
cType = (std::is_same_v<C_Dtype, float>) ? CUDA_R_32F : CUDA_R_16F;
#ifndef USE_ROCM
if (!at::globalContext().allowFP16ReductionCuBLAS()) {
preference.setAttribute(CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK,
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE | CUBLASLT_REDUCTION_SCHEME_NONE);
auto fp16_reduction = at::globalContext().allowFP16ReductionCuBLAS();
if (fp16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
uint32_t mask =
fp16_reduction ==
at::CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK
? (CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE |
CUBLASLT_REDUCTION_SCHEME_NONE)
: CUBLASLT_REDUCTION_SCHEME_NONE;
preference.setAttribute(
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, mask);
}
#endif
} else if constexpr (std::is_same_v<Dtype, at::BFloat16>) {
abType = CUDA_R_16BF;
cType = (std::is_same_v<C_Dtype, float>) ? CUDA_R_32F : CUDA_R_16BF;
#ifndef USE_ROCM
if (!at::globalContext().allowBF16ReductionCuBLAS()) {
preference.setAttribute(CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK,
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE | CUBLASLT_REDUCTION_SCHEME_NONE);
auto bf16_reduction = at::globalContext().allowBF16ReductionCuBLAS();
if (bf16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
uint32_t mask =
bf16_reduction ==
at::CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK
? (CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE |
CUBLASLT_REDUCTION_SCHEME_NONE)
: CUBLASLT_REDUCTION_SCHEME_NONE;
preference.setAttribute(
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, mask);
}
#endif
} else {
@ -1120,8 +1136,15 @@ inline void gemm_internal_cublas_half_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DTYPE(
}
if (prop->major >= 5) {
cublasMath_t cublas_flags = CUBLAS_DEFAULT_MATH;
if (!at::globalContext().allowFP16ReductionCuBLAS()) {
cublas_flags = static_cast<cublasMath_t>(cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
auto fp16_reduction = at::globalContext().allowFP16ReductionCuBLAS();
TORCH_CHECK(fp16_reduction !=
at::CuBLASReductionOption::DisallowReducedPrecisionDisallowSplitK,
"torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction("
"..., allow_splitk=False) requires the cuBLASLt backend");
if (fp16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
cublas_flags = static_cast<cublasMath_t>(
cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
}
// Disallow fp16 reductions that could lead to unexpected overflow issues.
TORCH_CUDABLAS_CHECK(cublasSetMathMode(handle, cublas_flags));
@ -1180,8 +1203,15 @@ inline void gemm_internal_cublas_bfloat16_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DT
GEMM_CHECK_ARGVALUES(at::BFloat16);
#ifndef USE_ROCM
cublasMath_t cublas_flags = CUBLAS_DEFAULT_MATH;
if (!at::globalContext().allowBF16ReductionCuBLAS()) {
cublas_flags = static_cast<cublasMath_t>(cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
auto bf16_reduction = at::globalContext().allowBF16ReductionCuBLAS();
TORCH_CHECK(bf16_reduction !=
at::CuBLASReductionOption::DisallowReducedPrecisionDisallowSplitK,
"torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction("
"..., allow_splitk=False) requires the cuBLASLt backend");
if (bf16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
cublas_flags = static_cast<cublasMath_t>(
cublas_flags | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
}
#endif
#if defined(USE_ROCM)
@ -1270,7 +1300,7 @@ void gemm_internal<float>(CUDABLAS_GEMM_ARGTYPES(float))
}
#if defined(USE_ROCM) && defined(USE_ROCM_CK_GEMM)
else if (at::globalContext().blasPreferredBackend() == BlasBackend::Ck) {
if (at::detail::getCUDAHooks().isGPUArch({"gfx1100"})) { //no CK GEMM version for gfx1100
if (at::detail::getCUDAHooks().isGPUArch({"gfx11", "gfx12"})) { //no CK GEMM version
gemm_internal_cublaslt<float>(CUDABLAS_GEMM_ARGS(float));
} else{
at::native::gemm_internal_ck<float>(CUDABLAS_GEMM_ARGS(float));
@ -1577,18 +1607,34 @@ bool gemm_and_bias(
abType = CUDA_R_16F;
cType = (std::is_same_v<C_Dtype, float>) ? CUDA_R_32F : CUDA_R_16F;
#ifndef USE_ROCM
if (!at::globalContext().allowFP16ReductionCuBLAS()) {
preference.setAttribute(CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK,
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE | CUBLASLT_REDUCTION_SCHEME_NONE);
auto fp16_reduction = at::globalContext().allowFP16ReductionCuBLAS();
if (fp16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
uint32_t mask =
fp16_reduction ==
at::CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK
? (CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE |
CUBLASLT_REDUCTION_SCHEME_NONE)
: CUBLASLT_REDUCTION_SCHEME_NONE;
preference.setAttribute(
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, mask);
}
#endif
} else if constexpr (std::is_same_v<Dtype, at::BFloat16>) {
abType = CUDA_R_16BF;
cType = (std::is_same_v<C_Dtype, float>) ? CUDA_R_32F : CUDA_R_16BF;
#ifndef USE_ROCM
if (!at::globalContext().allowBF16ReductionCuBLAS()) {
preference.setAttribute(CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK,
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE | CUBLASLT_REDUCTION_SCHEME_NONE);
auto bf16_reduction = at::globalContext().allowBF16ReductionCuBLAS();
if (bf16_reduction !=
at::CuBLASReductionOption::AllowReducedPrecisionWithSplitK) {
uint32_t mask =
bf16_reduction ==
at::CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK
? (CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE |
CUBLASLT_REDUCTION_SCHEME_NONE)
: CUBLASLT_REDUCTION_SCHEME_NONE;
preference.setAttribute(
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, mask);
}
#endif
}
@ -1815,6 +1861,8 @@ template bool gemm_and_bias(
int64_t result_ld,
GEMMAndBiasActivationEpilogue activation);
using at::blas::ScalingType;
int get_scale_mode(ScalingType scaling_type, ScalarType scale_dtype, bool use_fast_accum) {
switch (scaling_type) {
case ScalingType::BlockWise1x32:

View File

@ -14,6 +14,7 @@
*/
#include <ATen/cuda/CUDAContext.h>
#include <ATen/BlasBackend.h>
#include <ATen/OpMathType.h>
namespace at::cuda::blas {
@ -136,15 +137,6 @@ void int8_gemm(
int32_t* result_ptr,
int64_t result_ld);
enum class ScalingType : std::uint8_t {
TensorWise, // fp32 scales
RowWise, // fp32 scales
BlockWise1x16, // fp8_e4m3fn scales
BlockWise1x32, // fp8_e8m0fnu scales
BlockWise1x128, // fp32 scales
BlockWise128x128, // fp32 scales
};
void scaled_gemm(
char transa,
char transb,
@ -156,13 +148,13 @@ void scaled_gemm(
int64_t mat1_ld,
ScalarType mat1_dtype,
ScalarType mat1_scale_dtype,
ScalingType mat1_scaling_type,
at::blas::ScalingType mat1_scaling_type,
const void* mat2_ptr,
const void* mat2_scale_ptr,
int64_t mat2_ld,
ScalarType mat2_dtype,
ScalarType mat2_scale_dtype,
ScalingType mat2_scaling_type,
at::blas::ScalingType mat2_scaling_type,
const void* bias_ptr,
ScalarType bias_dtype,
void* result_ptr,

Some files were not shown because too many files have changed in this diff Show More