Compare commits


123 Commits

Author SHA1 Message Date
39e77ce851 [dynamo] Add most recent bytecode to graph break with developer initiation
ghstack-source-id: 8b538f2e1ac703a4538468a758f08db0c89b91a7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163720

Add the most recent bytecode to dynamo graph breaks called by the user

Fix other user-initiated graph breaks and issues

Fix linter
2025-10-01 17:21:03 -07:00
22c5e8c17c Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
bcb893acb0 [ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)
Fixes build timeouts >4h on libtorch build jobs: 75e7f49f9c/1

Brings back code to narrow down CK compilation targets from 69a25f6888 (diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777)

gfx942 supports fp8

Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738

Validation:
[rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162648
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 18:55:35 +00:00
8e6b0c71fb [Inductor] Remove no_type_check annotation on properties (#163570)
Some properties with `cache_on_self` were previously annotated with `no_type_check`, to get around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, to enable type checking.
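
A minimal, self-contained sketch of what such a combined decorator could look like; `cache_property_on_self` here is a stand-in, not the actual Inductor helper:

```python
import functools

def cache_property_on_self(fn):
    # Stand-in for the helper the PR refers to: behaves like @property combined
    # with @cache_on_self, caching the computed value on the instance.
    attr = "_cached_" + fn.__name__

    @property
    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, fn(self))
        return getattr(self, attr)

    return wrapper

class Kernel:
    @cache_property_on_self
    def num_loads(self) -> int:
        print("computed once")
        return 4

k = Kernel()
print(k.num_loads, k.num_loads)  # the body runs only on the first access
```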

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
2025-09-23 18:20:04 +00:00
0696a4b0b8 [EZ] Perma-ignore UP038 (#163649)
As it has been removed, see https://docs.astral.sh/ruff/rules/non-pep604-isinstance/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163649
Approved by: https://github.com/Skylion007
ghstack dependencies: #163648
2025-09-23 17:58:18 +00:00
ca35dc2fdd [EZ] Fix UP041 violations (#163648)
I.e. use `TimeoutError` instead of `socket.timeout`
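
For illustration (since Python 3.10, `socket.timeout` is an alias of the builtin `TimeoutError`):

```python
import socket

try:
    # 192.0.2.1 is a TEST-NET address, so this connection attempt times out.
    socket.create_connection(("192.0.2.1", 80), timeout=0.01)
except TimeoutError:        # previously: except socket.timeout:
    print("connection timed out")
except OSError as e:        # any other socket-level failure
    print(f"other socket error: {e}")
```
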
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163648
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-09-23 17:58:18 +00:00
649ceda8a5 [export] handling NamedTuple inputs (#162959)
Fixes #160547
### Summary:
bug
```
    def test_namedtuple(self):
        from collections import namedtuple
        Point = namedtuple('Point', 'x y')

        class M(torch.nn.Module):
            def forward(self, x, y):
                return x + y

        inp = Point(torch.ones(3), torch.ones(3))
        print(M()(*inp))

        # errors
        ep = torch.export.export(M(), inp, strict=False)
        print(ep)

        # succeeds
        ep = torch.export.export(M(), inp, strict=True)
        print(ep)

        # workaround could be to convert namedtuple to a kwarg
        inp_kwargs =  {field: getattr(inp, field) for field in inp._fields}
        ep = torch.export.export(M(), (), inp_kwargs)
        print(ep)
```
Fix:
A namedtuple is a subclass of tuple, but the export input handling did not expect a namedtuple, so this change handles the namedtuple case.
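
A hedged sketch of the kind of namedtuple check this implies (the actual export input handling may differ):

```python
from collections import namedtuple

def is_namedtuple_instance(obj) -> bool:
    # Common heuristic: a namedtuple is a tuple subclass that carries _fields.
    return isinstance(obj, tuple) and hasattr(type(obj), "_fields")

Point = namedtuple("Point", "x y")
print(is_namedtuple_instance(Point(1, 2)))  # True
print(is_namedtuple_instance((1, 2)))       # False
```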

I have added a 🧪 test case for this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <angelayi@meta.com>
2025-09-23 17:43:50 +00:00
2aadcea05c [ROCm] Improve perf for elementwise broadcast with mixed dtype (#163562)
* Unroll loops manually to hide memory access latency

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163562
Approved by: https://github.com/jeffdaily
2025-09-23 17:42:48 +00:00
fde929c8a8 [AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
This should fix the AOTI UTs in `test_aot_inductor_package.py`; these cases fail at `compile_so`.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor_package.py -v -k test_multiple_methods
```
<img width="1262" height="95" alt="image" src="https://github.com/user-attachments/assets/49458536-1cfe-498e-a12a-2bfd8da67a9e" />

The major fix is in `get_cpp_compile_command`. The code is aligned with the cpp_builder frontend code:  3ef1bef36c/torch/_inductor/cpp_builder.py (L1780-L1790)
3ef1bef36c/torch/_inductor/cpp_builder.py (L1959-L1976)

Fixed on Windows:
<img width="1261" height="89" alt="Image" src="https://github.com/user-attachments/assets/9bf43b11-aac1-4161-a625-e602e313a299" />

Also validated on Linux:
<img width="1039" height="81" alt="Image" src="https://github.com/user-attachments/assets/46063e16-6cf1-4a28-8466-0496871b8619" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163561
Approved by: https://github.com/jansel
2025-09-23 17:38:18 +00:00
134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, the dequantization process was producing numerically incorrect results due to improper block-to-slice coordinate mapping. The previous implementation calculated block boundaries relative to the sliced tensor dimensions instead of the original full tensor dimensions, causing scale factors to be applied to wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

2. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.
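
A hedged, 1-D sketch of the block-aligned mapping described above (names such as `block_size` and the per-block `scales` layout are assumptions, not the actual DCP code):

```python
import torch

def dequantize_slice(q_slice, scales, block_size, slice_start):
    out = q_slice.to(torch.float32).clone()
    for i in range(q_slice.numel()):
        global_idx = slice_start + i          # coordinate in the *full* tensor
        block_idx = global_idx // block_size  # quantization block it falls in
        out[i] = out[i] * scales[block_idx]
    return out

q = torch.tensor([1, 1, 1, 1], dtype=torch.int8)
# slice_start=4 with block_size=4 lands the slice in block 1, so scale 0.5 applies
print(dequantize_slice(q, scales=torch.tensor([2.0, 0.5]), block_size=4, slice_start=4))
```

The bug described above corresponds to computing `block_idx` from the local index `i` (the sliced tensor) instead of from `slice_start + i` (the full tensor).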

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to full tensor shape
3. Scale factors are applied to correct tensor regions
4. Tensor shapes map is built efficiently using only metadata

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
221ac81043 Revert "[precompile] Add option to disable guard check on aot-compiled function. (#163432)"
This reverts commit 539e84e289fa7563032410706ede50a4eaa7a15d.

Reverted https://github.com/pytorch/pytorch/pull/163432 on behalf of https://github.com/Camyll due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/163432#issuecomment-3324757069))
2025-09-23 16:31:30 +00:00
6e5dddba64 Use accelerator API in common_dtensor (#163498)
Fixes #ISSUE_NUMBER

Try to unify the device checking in common_dtensor (the testing module) via the accelerator API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163498
Approved by: https://github.com/albanD, https://github.com/H-Huang
2025-09-23 16:30:20 +00:00
ebddbe787a [ROCm][CI] skip test_sparse_triangular_solve (#163651)
Need more time to debug, but we also need a clean CI signal. The test was unskipped by #163495, but had been skipped on ROCm prior.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 15:55:51 +00:00
5f0c7cb4aa Add B200 smoke test (#159494)
Okay, running test_max_autotune locally on B200 is a horrible read; for now, to get something landed, I am focusing on test_matmul_cuda.py and test_fp8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159494
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
ghstack dependencies: #163460, #163537, #163552
2025-09-23 15:45:05 +00:00
b3cf5c79dd Skip on sm100 later since tests are non-deterministic (#163552)
This is tracked https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think that this is overkill, but feel free to disagree. Also, the one test I ran that failed earlier up in the stack failed with a 1 ULP difference, so I think this is kind of an edge case in how we do testing (I will write up an issue with my thoughts later).

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Okay, the bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
720a7b2887 [export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
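
A small illustration (separate from the export code) of how `.contiguous()` on a non-contiguous view drops the original storage size:

```python
import torch

base = torch.arange(10, dtype=torch.float32)
view = base[::2]                      # non-contiguous view into base's 10-element storage
print(view.untyped_storage().nbytes())               # 40: still the full original storage
print(view.contiguous().untyped_storage().nbytes())  # 20: the original storage size is gone
```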

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing

Differential Revision: D83016250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
2025-09-23 15:44:56 +00:00
49e7b2f69d [inductor] Fix error from custom CUDA allocators (#163422)
Fixes #163257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163422
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412
2025-09-23 15:37:45 +00:00
6ef74879f6 [dynamo] Fix TorchFunctionMode handling with get_rng_state (#163412)
Fixes #162624
Fixes #162586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163412
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393
2025-09-23 15:37:45 +00:00
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
ed84e808f0 [inductor] Freeze layouts in FlexAttention (#163434)
Fixes #163300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163434
Approved by: https://github.com/drisspg
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419
2025-09-23 15:37:29 +00:00
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
4264fd34ec Add basic tests for torch.distributed.tensor._utils.compute_global_tensor_info (#162968)
Next PR writes a C++ implementation. Seems good to have tests first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162968
Approved by: https://github.com/ezyang
ghstack dependencies: #161695, #162508
2025-09-23 14:56:32 +00:00
e05c9c0c84 [ROCm][CI] cudagraph trees ut fixes (#163592)
Fixes #162125.
Fixes #160719.
Fixes #157901.
Fixes #157871.
Fixes #157761.
Fixes #157723.
Fixes #157643.
Fixes #157616.
Fixes #157556.
Fixes #157533.
Fixes #157449.
Fixes #157428.
Fixes #157413.
Fixes #157367.
Fixes #157350.
Fixes #157339.
Fixes #157312.
Fixes #157280.
Fixes #157258.
Fixes #157173.
Fixes #157143.
Fixes #157112.
Fixes #157086.
Fixes #157058.
Fixes #157035.
Fixes #156984.
Fixes #156957.
Fixes #156954.
Fixes #156922.
Fixes #156886.
Fixes #156838.
Fixes #156808.
Fixes #156801.
Fixes #156778.
Fixes #156755.
Fixes #156735.
Fixes #156693.
Fixes #152561.
Fixes #130749.
Fixes #100074.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163592
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 14:45:00 +00:00
aff76c046d Revert "Add fake_impl for _native_multi_head_attention (#163167)"
This reverts commit 27164b6788cab6e6d8095012839e51c958a819d6.

Reverted https://github.com/pytorch/pytorch/pull/163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see 1a42656d6c/1 ([comment](https://github.com/pytorch/pytorch/pull/163167#issuecomment-3324302026))
2025-09-23 14:36:45 +00:00
1a42656d6c [Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314

In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**

This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose their shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets, and we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding/H broadcasting.

The B=None, H=None case works accidentally: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer, and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes silent errors.
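
For illustration, a small plain-tensor example of the indexing pitfall described above (not the actual BlockMask code): integer indexing collapses the Q-tile dimension, while a length-1 slice preserves it.

```python
import torch

kv_num_blocks = torch.zeros(2, 4, 8, dtype=torch.int32)  # [B, H, Q_tiles]
print(kv_num_blocks[:, :, 3].shape)    # torch.Size([2, 4])    -> Q dimension lost
print(kv_num_blocks[:, :, 3:4].shape)  # torch.Size([2, 4, 1]) -> layout preserved
```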

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
2025-09-23 13:01:51 +00:00
bda9ab291d [inductor] fix as_strided lowering with .view(dtype) inputs (#163319)
FIXES https://github.com/pytorch/pytorch/issues/163286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163319
Approved by: https://github.com/eellison
2025-09-23 12:50:57 +00:00
3c64b2abab CUDA 13.0 Warning update for supported architectures (#163585)
Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display the correct warning:
```
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet
2025-09-23 11:27:11 +00:00
5d749ceb92 Remove test conditions for CUDA<12 (#163495)
Because CUDA >= 12 is now required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163495
Approved by: https://github.com/janeyx99
2025-09-23 07:52:00 +00:00
8d81564df5 [pt2][cache] rework cache for true generic usage + better tests (#163488)
Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly and too constrained to actually be generic. For example, InMemoryCache[str, bytes] was odd, since we'd probably want to be able to store more than just str keys with bytes values. So, tl;dr: everything is now generic, with the one constraint that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.
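
A hedged sketch of the "pickle everything into a bytes-keyed store" idea described above (illustrative names, not the actual PT2 cache API):

```python
import pickle
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class InMemoryCache(Generic[K, V]):
    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}

    def put(self, key: K, value: V) -> None:
        # Any pickle-able key/value pair is reduced to bytes on the way in.
        self._store[pickle.dumps(key)] = pickle.dumps(value)

    def get(self, key: K) -> Optional[V]:
        blob = self._store.get(pickle.dumps(key))
        return None if blob is None else pickle.loads(blob)

cache: InMemoryCache[tuple[str, int], dict] = InMemoryCache()
cache.put(("kernel", 3), {"autotune": True})
print(cache.get(("kernel", 3)))
```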

With this change, there were also some improvements to the testing: mainly better coverage, but we now also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there; for example, we now use literal_eval (I forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).

Putting this PR up now for feedback. In the process of generalizing the code I did remove the documentation, since it was becoming outdated, but I will add that back in after the PR is green.

I have the next PR ready as well (it implements a fresh-cache context manager); I will export it once this lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-09-23 07:31:48 +00:00
b426ba1d5e [torchfuzz] introduce tensor and scalar pointwise ops (#163558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163558
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557
2025-09-23 06:20:13 +00:00
375f3e3a61 [OpenReg][Docs] Correct docs about openreg usage example. (#163235)
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).

Since I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install `gcc`, I believe it's necessary to slightly modify the command to add `-I ./` to explicitly indicate the header file search path.

## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-09-23 06:16:45 +00:00
45d9dcccc5 Update Kineto Submodule (#162222)
Summary: Update

Test Plan:
CI

Rollback Plan:

Differential Revision: D81727392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162222
Approved by: https://github.com/sanrise
2025-09-23 06:08:55 +00:00
309fe03f4b [torchfuzz] remove unneeded try catch (#163557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163557
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556
2025-09-23 06:05:08 +00:00
1545bb1c00 [torchfuzz] shuffle compatible ops (#163556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163556
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555
2025-09-23 05:53:44 +00:00
d5e51d34f7 [torchfuzz] decompose -> fuzz_inputs_specs (#163555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163555
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554
2025-09-23 05:44:59 +00:00
08c5efde5f [torchfuzz] cache operators (#163554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163554
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553
2025-09-23 05:28:07 +00:00
19b754dff8 Revert "Update cutlass version for fbcode (#163091)"
This reverts commit 509c4e86270cc4decca58905d0f446e1fc0cf618.

Reverted https://github.com/pytorch/pytorch/pull/163091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163091#issuecomment-3322428791))
2025-09-23 05:08:42 +00:00
d3a1345ed8 Use functools.cache on has_efa (#163439)
Cache the result of `has_efa` with `functools.cache`.
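
For illustration, a minimal example of the pattern (the body here is a hypothetical stand-in for the real EFA probe):

```python
import functools

@functools.cache  # memoize: the body runs once, later calls return the cached result
def has_efa() -> bool:
    # Hypothetical stand-in for the real check, which probes for AWS EFA support.
    print("probing for EFA...")
    return False

has_efa()  # prints "probing for EFA..." and caches the result
has_efa()  # returns the cached result without re-running the body
```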

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163439
Approved by: https://github.com/janeyx99
2025-09-23 05:03:03 +00:00
e3b392bdfd [BC breaking] Remove deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438)
This PR removes the import tricks for `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They were declared to be removed in PyTorch 2.1 but were not.
Before the change:
```
from torch.utils.data.datapipes.iter.grouping import SHARDING_PRIORITIES
from torch.utils.data.datapipes.iter.grouping import ShardingFilterIterDataPipe
```
works.
After the change:
an `ImportError` is raised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
2025-09-23 05:02:06 +00:00
bb5be56619 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-09-23 04:48:19 +00:00
0e122380c2 [torchfuzz] remove supports_variable_inputs for now (#163553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163553
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547
2025-09-23 04:44:54 +00:00
fcd79d5228 [vllm hash update] update the pinned vllm hash (#163590)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163590
Approved by: https://github.com/pytorchbot
2025-09-23 04:44:15 +00:00
95ac7d724e Rename to _debug_mode.py to make it private (#163534)
rename debug_mode.py to _debug_mode.py to make it private, per @alban's request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163534
Approved by: https://github.com/albanD
2025-09-23 04:27:10 +00:00
0b75a16200 [torchfuzz] Encapsulate fuzzing and codegen logic into ops (#163547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163547
Approved by: https://github.com/laithsakka
2025-09-23 04:26:00 +00:00
27164b6788 Add fake_impl for _native_multi_head_attention (#163167)
Test Plan:
See added test in test_export.py

Rollback Plan:

Reviewed By: henryoier

Differential Revision: D77747446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163167
Approved by: https://github.com/angelayi
2025-09-23 04:02:20 +00:00
cyy
447b8fc56d [2/N] Use filesystem in inductor (#163465)
Use std::filesystem in most inductor code. This is a follow-up of https://github.com/pytorch/pytorch/pull/152288.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163465
Approved by: https://github.com/Skylion007
2025-09-23 03:56:16 +00:00
6a48f57d2f [1/N] Remove 'type: ignore' suppressions (#163468)
Remove some unnecessary 'type: ignore' suppressions from python code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163468
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-09-23 03:53:11 +00:00
e9300b2b7c remove allow-untyped-defs from ./torch/onnx/_internal/torchscript_exporter/_globals.py (#163472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163472
Approved by: https://github.com/Skylion007
ghstack dependencies: #163246, #163469, #163470
2025-09-23 03:50:29 +00:00
8f30a8dc47 [AOTInductor] Add grid information for Triton Kernels (#160131)
Summary:
Add grid information for Triton Kernels for profiling in Kineto.

Test Plan:
Before change:
<img width="539" height="625" alt="Screenshot 2025-08-07 at 1 09 07 PM" src="https://github.com/user-attachments/assets/dd0778a9-2ff3-4819-acd3-de585cf7f9d1" />

After change:
<img width="550" height="898" alt="Screenshot 2025-08-07 at 1 05 49 PM" src="https://github.com/user-attachments/assets/d84988df-bb83-41ed-80ac-8a6d843a1a9d" />

*Note: we can extract grid size etc. from the device-side trace, but we're focusing on the host side specifically for this PR, mainly to add more host-side information needed for performance profiling in the future.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160131
Approved by: https://github.com/desertfire
2025-09-23 02:15:24 +00:00
2c7959eee9 [ignore][codex-test] Add typing to simple library registry (#161367)
## Summary
- add type annotations for simple library registry and dispatch rule holder
- remove allow-untyped-defs directive

## Testing
- `python -m mypy torch/_library/simple_registry.py` *(fails: repo expects mypy==1.16.0)*
- `lintrunner -a torch/_library/simple_registry.py` *(fails: attr-defined error in torchgen/gen_schema_utils.py)*
- `python test/test_torch.py TestTorch.test_dir` *(fails: ModuleNotFoundError: No module named 'torch')*

------
https://chatgpt.com/codex/tasks/task_e_68aa3cc210488326befdd992c79115a0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161367
Approved by: https://github.com/Skylion007
2025-09-23 02:08:55 +00:00
3ef1bef36c [sdpa] make sure to recompile if alignment is different than before (#163083)
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is
divisible by 8 (no padding needed). Call this `Run1`.
- If we then run the compiled model with a different length that isn't
divisible by 8 (requires padding), call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
    buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.

### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)

## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
2025-09-23 01:33:33 +00:00
539e84e289 [precompile] Add option to disable guard check on aot-compiled function. (#163432)
Summary:
Under some circumstances it seems reasonable to return a callable directly, without a guard check, when the user uses aot_compile on a function with a single compilation result.

When there are multiple entries (aot_compile_module), we should start enabling the guard check to differentiate the compiled functions.

Test Plan: CI

Differential Revision: D82904540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm
2025-09-23 01:00:05 +00:00
68e75be86a Update pytorch_sphinx_theme2 to latest hash (#163269)
The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD
2025-09-22 23:20:23 +00:00
8da008678f Remove outdated commented CMake code (#163442)
Policies `CMP0023` and `CMP0022` have been removed in CMake 4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163442
Approved by: https://github.com/janeyx99
2025-09-22 23:07:36 +00:00
fa15fb01ab [EZ] Remove XLA from unstable.yml (#163564)
It runs for 30 min on linux.12xlarge and then fails, and it has been like that since Aug 7th.

Besides, there are no more python-3.9 builds left.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163564
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/huydhn
2025-09-22 22:11:50 +00:00
clr
33daaad7d0 dynamo: Handle objects in graph that do not support weakref (#163168)
We are seeing crashes of the form
```
Traceback (most recent call last):
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1487, in run
    while self.step():
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1348, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2437, in LOAD_ATTR
    self._load_attr(inst)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2425, in _load_attr
    result = BuiltinVariable(getattr).call_function(
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 1347, in call_function
    return handler(tx, args, kwargs)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <lambda>
    tx, [v.realize() for v in args], kwargs
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <listcomp>
    tx, [v.realize() for v in args], kwargs
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 72, in realize
    self._cache.realize()
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 33, in realize
    self.vt = builder.VariableBuilder(tx, self.source)(self.value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 445, in __call__
    vt = self._wrap(value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 1043, in _wrap
    torch._dynamo.utils.store_user_object_weakref(value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/utils.py", line 4694, in store_user_object_weakref
    user_obj_id_to_weakref[obj_id] = weakref.ref(obj)
torch._dynamo.exc.InternalTorchDynamoError: TypeError: cannot create weak reference to 'torch.Event' object
```

This pull request makes us gracefully graph break, vs explicitly crashing.

I've added a test which reproduces the issue. There is a side discussion re: how torch.Event support ever worked here, since it appears you cannot take a weakref to a torch.Event.
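
For illustration, a minimal sketch of catching the failing `weakref.ref` call and treating the object as non-weak-referenceable instead of crashing (the real fix graph-breaks inside dynamo; `NoWeakref` here is just a stand-in for objects like torch.Event):

```python
import weakref

class Plain:
    pass

class NoWeakref:
    __slots__ = ("value",)   # no __weakref__ slot, so weakref.ref() raises TypeError

def try_store_weakref(obj) -> bool:
    # Instead of letting the TypeError escape (and crash the trace), report
    # that the object cannot be weak-referenced and fall back gracefully.
    try:
        weakref.ref(obj)
        return True
    except TypeError:
        return False

print(try_store_weakref(Plain()))      # True
print(try_store_weakref(NoWeakref()))  # False, handled instead of crashing
```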

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163168
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-09-22 22:11:09 +00:00
60c2bdedcd Replace Literal[None] with None in typing (#163489)
This PR replaces Literal[None] with None in typing annotations.
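
A tiny illustration of the substitution:

```python
from typing import Literal, Optional

def before(flag: Literal[None] = None) -> Optional[int]: ...
def after(flag: None = None) -> Optional[int]: ...  # equivalent and simpler
```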

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163489
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-22 22:10:08 +00:00
b756b580fb Improve fake tensor leakage detection in export by not relying on gc too much (#163516)
Previously we relied on gc to get a snapshot of fake tensors before and after export, to get the list of fake tensors created during export. This caused some flakiness in our test suite (https://github.com/pytorch/pytorch/issues/162232). It seems super hard to make gc deterministic, so we just instrument fake tensor creation, which seems a lot better. In addition, it is also quite a bit faster than the previous approach because we are no longer manually triggering the garbage collector.

Differential Revision: [D82966648](https://our.internmc.facebook.com/intern/diff/D82966648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163516
Approved by: https://github.com/ezyang
2025-09-22 22:04:24 +00:00
e0cbab46ad [Inductor] avoid CUDA__equal when constant tensors are from different device (#163529)
Summary:
Otherwise, we may hit:
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
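
A hedged sketch of the kind of device-safe comparison this implies (illustrative names, not the actual Inductor code):

```python
import torch

def constants_equal(a: torch.Tensor, b: torch.Tensor) -> bool:
    if a.device != b.device:
        # Compare on a common device to avoid the wrapper_CUDA__equal error above.
        a, b = a.cpu(), b.cpu()
    return a.shape == b.shape and a.dtype == b.dtype and torch.equal(a, b)

print(constants_equal(torch.ones(2), torch.ones(2)))  # True
```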

Test Plan: UTs

Reviewed By: yushangdi

Differential Revision: D82974062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
2025-09-22 22:04:11 +00:00
4fc271e559 [inductor] Don't require_dense for grid_sampler_2d_backward (#163415)
Fixes #163372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163415
Approved by: https://github.com/Skylion007
ghstack dependencies: #163386, #163398, #163387, #163414
2025-09-22 21:53:01 +00:00
c8fd2b45e5 [inductor] Skip test_baddmm on XPU (#163414)
Fixes #161484
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163414
Approved by: https://github.com/Skylion007
ghstack dependencies: #163386, #163398, #163387
2025-09-22 21:53:01 +00:00
a1bd9248eb [inductor] Fallback on strided complex add (#163387)
Fixes #163243
Fixes #162561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163387
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398
2025-09-22 21:52:53 +00:00
36c2a1325c [inductor] Fix bug where viewed outputs get padded (#163398)
Fixes #163328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163398
Approved by: https://github.com/eellison
ghstack dependencies: #163386
2025-09-22 21:52:45 +00:00
7ea8998c0b Better decomp for torch.eye (#163386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163386
Approved by: https://github.com/eellison
2025-09-22 21:52:37 +00:00
2b036632ca Allow add_persistent_r_block to scale up rblock up to a limit (#162296)
<img width="654" height="392" alt="Screenshot 2025-09-18 at 4 22 53 PM" src="https://github.com/user-attachments/assets/975650ec-f769-43a6-bdf5-2885a8d40d3c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162296
Approved by: https://github.com/eellison
2025-09-22 21:41:46 +00:00
0256f91558 [BUG] MaxUnpool2d/3d should check output dim before accessing its elements (#163507)
Fixes #163409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163507
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-22 21:36:48 +00:00
da05aa7a9d [BE] Use output_t directly (#163518)
Rather than deref the safe tensor wrapped in `TensorArg`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163518
Approved by: https://github.com/Skylion007
2025-09-22 21:33:42 +00:00
e558f7a222 [vllm hash update] update the pinned vllm hash (#163463)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163463
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-22 21:24:56 +00:00
09cb34c1dc [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-22 21:12:18 +00:00
4027e97791 [BE] Delete skipIfMPSOnMacOS13 (#163515)
As PyTorch needs MacOS-14 or newer to use MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163515
Approved by: https://github.com/Skylion007
2025-09-22 21:10:22 +00:00
8e62d01f7a Add dynamic shapes doc (#159428)
This PR adds new Dynamic Shapes documentation and expands on the existing one.
- Adds a new structure with Intro, Core Concepts, Troubleshooting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159428
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-09-22 21:01:27 +00:00
8abc2af9b9 [STABLE ABI] Add clone method to torch::stable::Tensor (#161896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161896
Approved by: https://github.com/janeyx99
2025-09-22 20:39:24 +00:00
02da4753f5 Triton template IMA reads on B200 (#163460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163460
Approved by: https://github.com/eqy, https://github.com/alexsamardzic
2025-09-22 20:34:39 +00:00
cf28ab2c88 remove allow-untyped-defs from ./torch/ao/quantization/pt2e/duplicate_dq_pass.py (#163470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163470
Approved by: https://github.com/aorenste
ghstack dependencies: #163246, #163469
2025-09-22 20:29:09 +00:00
46e1b7d70b remove allow-untyped-defs from ./torch/utils/data/datapipes/iter/fileopener.py (#163469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163469
Approved by: https://github.com/aorenste, https://github.com/Skylion007
ghstack dependencies: #163246
2025-09-22 20:29:09 +00:00
e065d35fd3 [BE]: Add a few more missing move from return indices (#163456)
@ezyang A follow-up where I found a few more missing returns of this style in the codebase. Follow-up to #163416.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163456
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-09-22 20:24:23 +00:00
fd785b1762 Add NestedTensor dispatch for _is_any_true/_is_all_true (#162096)
Fixes: https://github.com/pytorch/pytorch/issues/161818

### Summary
Add NestedTensor support for `_is_any_true` and `_is_all_true`.

### Changes
- Register dispatch for `aten._is_any_true.default` and
  `aten._is_all_true.default`
- Add CPU tests:
  - `test_is_any_true_jagged`: dispatch_matches_values_buffer,
    all_false_returns_false, one_true_returns_true
  - `test_is_all_true_jagged`: dispatch_matches_values_buffer,
    all_true_returns_true, any_false_returns_false

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:
```
FAILED [0.0129s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu - NotImplementedError: aten._is_all_true.default
FAILED [0.0007s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu - NotImplementedError: aten._is_any_true.default
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:

```
Running 2 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu PASSED [0.0277s]                                                                                                                               [ 50%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu PASSED [0.0013s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162096
Approved by: https://github.com/jbschlosser
2025-09-22 20:22:44 +00:00
d0086708dd [triton] update 3.5 pin to bbb06c0334a6772b92d24bde54956e675c8c6604 (#163382)
Includes:
* https://github.com/triton-lang/triton/pull/8211 to work around a PTXAS bug that was causing 03-matrix-multiplication tutorial matmuls to underperform due to excessive WGMMA waits
* https://github.com/triton-lang/triton/pull/8157 to fix a convert_layout bug

Verified that this passes Triton CI in https://github.com/pytorch/pytorch/pull/159158 and improves gemm perf (see https://github.com/pytorch/pytorch/issues/159704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163382
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-09-22 20:20:59 +00:00
6f9aef5fef [2/n] Support module.to("cuda:0") in FakeTensorMode on cuda-less machine (#163433)
Summary:
To support exporting a cuda model on a CPU-only machine under fake tensor mode.
Users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call.
This diff supports that.

I expect the following pattern to work

```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])

    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)

```

Before:
Moving module.to("cuda:0") under fake tensor mode would leave parameters on the `meta` device.

After:
Parameters are on "cuda:0".

Test Plan: buck2 run  fbcode//caffe2/test:fake_tensor -- --r test_move_module

Reviewed By: mikaylagawarecki

Differential Revision: D80102876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163433
Approved by: https://github.com/albanD
2025-09-22 20:16:32 +00:00
d15048493c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 20:02:29 +00:00
bf28990c3d Add support for NestedTensor share_memory_ (#162272)
Fixes: https://github.com/pytorch/pytorch/issues/161915

### Summary

Implements share_memory_() support for NestedTensor!

### Changes

- Added share_memory_() method to NestedTensor class.
  - Shares storage for all NestedTensor components: _values, _offsets, _lengths, and cached seqlen tensors.
  - Guard for CUDA Tensors.

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py Fatal Python error: Segmentation fault
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_share_memory_cpu PASSED [0.0753s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162272
Approved by: https://github.com/jbschlosser
2025-09-22 19:59:58 +00:00
eaa613bf66 Revert "[opaque_obj] Add set_payload + docs (#163276)"
This reverts commit dd30667f6c2204a15e91eaeb61c84f9080be7748.

Reverted https://github.com/pytorch/pytorch/pull/163276 on behalf of https://github.com/ZainRizvi due to Sorry but this fails lint on trunk: [GH job link](https://github.com/pytorch/pytorch/actions/runs/17924886989/job/50968430537) [HUD commit link](dd30667f6c) ([comment](https://github.com/pytorch/pytorch/pull/163276#issuecomment-3321054061))
2025-09-22 19:32:30 +00:00
1818c36d6e [Fix] Restrict stride normalization to 1D tensors on export (#163282)
This change restricts the DLPack stride normalization to apply only to 1D tensors of shape (1,).

### Rationale
The previous implementation normalized the strides for any multi-dimensional tensor containing a dimension of size 1. While well-intentioned, this "over-normalization" discards critical memory layout information, causing issues for downstream consumers who rely on strides to infer alignment and contiguity.

For example:

* A row-major tensor with `shape=(1, 128)` and `stride=(128, 1)` would be incorrectly normalized to `stride=(1, 1)`.

* A column-major tensor with `shape=(1024, 1)` and `stride=(1, 1024)` would also be normalized to `stride=(1, 1)`.

This loss of stride information makes it impossible for consumers to detect the original memory layout (e.g., row-major vs. column-major) and breaks assumptions about memory alignment needed for optimized indexing or specialized hardware APIs like GPU TMA.

The original intent of the normalization was to handle the simple case of a 1D tensor with shape=(1,) and a non-standard stride. This fix reverts to that specific, non-problematic behavior, ensuring that multi-dimensional tensors retain their precise stride information during DLPack export.
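
A minimal sketch of the restricted rule, assuming a standalone helper rather than the actual torch DLPack export code:

```python
def normalize_strides(shape: tuple[int, ...], strides: tuple[int, ...]) -> tuple[int, ...]:
    # Only a 1-D tensor of shape (1,) gets its stride rewritten; every other
    # tensor keeps its original strides so layout information is preserved.
    if len(shape) == 1 and shape[0] == 1:
        return (1,)
    return strides

print(normalize_strides((1,), (99,)))         # (1,)     -> normalized
print(normalize_strides((1, 128), (128, 1)))  # (128, 1) -> left untouched
```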

### Related Issues
#163274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163282
Approved by: https://github.com/eqy
2025-09-22 19:10:05 +00:00
7e9781174c Fix lint (#163542)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163542
Approved by: https://github.com/malfet
2025-09-22 19:10:00 +00:00
4941719061 Enable logging for absolute memory estimation (#158799)
Summary: Update the Auto AC logging so that it also provides the *absolute* memory estimations for each node.

Test Plan:
(aps-gem_omnifm_v2_mwb_dynamic_005_budget-f23a84c3d8): https://fburl.com/ai_infra/0r738h5r

{F1980393481}

* Memory Recorded in bytes

---

```
buck2 test //caffe2/test/functorch:test_ac_logging
```
https://www.internalfb.com/intern/testinfra/testrun/14918173863021573

Rollback Plan:

Differential Revision: D78580107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158799
Approved by: https://github.com/jansel
2025-09-22 18:36:49 +00:00
dd30667f6c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 18:30:28 +00:00
3be9c86c74 [opaque obj] Initial OpaqueObject (#162660)
A big pain point people have with custom ops is that they do not accept arbitrary inputs/outputs. In this PR we create the concept of an "OpaqueObject" which allows users to pass arbitrary python objects into custom operators.

Some parts of this implementation are still slightly annoying:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of the actual Python type
- `@torch.library.custom_op` doesn't work... yet?

UX:
```python
from torch._library.opaque_object import make_opaque, get_payload

# your custom python class
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)

# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)

# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertEqual(queue.size(), 1)
```

Authoring a custom op:

```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    f"_TestOpaqueObject::queue_push",
    "(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=lib,
)

@torch.library.impl("_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
    # We can get the payload directly by get_payload(q)
    queue = get_payload(q)
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
2025-09-22 18:30:28 +00:00
bec967eaa4 Remove C++ and test branches for CUDA<12 (#163443)
Remove conditional branches for CUDA<12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163443
Approved by: https://github.com/eqy
2025-09-22 18:20:08 +00:00
d279a6a6f1 ci: Add a way to lint all files in a PR from label (#163525)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163525
Approved by: https://github.com/ZainRizvi
2025-09-22 18:06:39 +00:00
281f8f407e Combine strong and weak refcounts in intrusive_ptr in a single refcount (#163394)
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored at an 8-byte-aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of the C++ language standard, so it's essentially undefined behavior.

This change eliminates that assumption by combining refcount_ and weakcount_ into a single 64-bit count, using the lower 32 bits for refcount_ and the upper 32 bits for weakcount_.

In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
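
For intuition, a small sketch of the packing scheme in Python (illustrative only; the real implementation uses C++ atomics inside intrusive_ptr):

```python
# Lower 32 bits hold refcount_, upper 32 bits hold weakcount_.
REF_ONE = 1
WEAK_ONE = 1 << 32

def refcount(combined: int) -> int:
    return combined & 0xFFFFFFFF

def weakcount(combined: int) -> int:
    return combined >> 32

combined = REF_ONE + WEAK_ONE       # one strong ref, one weak ref
prev = combined                     # a single atomic fetch_sub would return this value
combined -= REF_ONE                 # drop the strong ref
# Both counts were observed from one 64-bit value; no separate weakcount_ read is needed.
print(refcount(prev), weakcount(prev))          # 1 1
print(refcount(combined), weakcount(combined))  # 0 1
```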

Reviewed By: yfeldblum

Differential Revision: D82869192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
2025-09-22 17:53:28 +00:00
5e7be98800 [BE] Update Python min version to 3.10 (#162310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-22 17:04:21 +00:00
06fe5b9025 [AOTI] fix TestAOTInductorPackage temp file locked handler. (#163499)
Fix the `check_model` function of the common class `TestAOTInductorPackage` in `test/inductor/test_aot_inductor_package.py`: the temp file kept a locked file handle on Windows, which caused the C++ backend to fail when opening the file:
```cmd
FAILED [4.5918s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_add - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmp21sjnnhl.pt2 cannot be opened.
FAILED [4.1703s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_bool_input - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmp5kd3apub.pt2 cannot be opened.
FAILED [4.2266s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_linear - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmpkyy3pxow.pt2 cannot be opened.
FAILED [4.2134s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_metadata - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmphyer7wi9.pt2 cannot be opened.
......
```

Fix it via `WritableTempFile`, which releases the file handle so the backend can open the file.
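
A sketch of the Windows-safe pattern using the standard `tempfile` module for illustration (the actual fix uses the repo's `WritableTempFile` helper, whose exact API is not shown here):

```python
import os
import tempfile

# On Windows, an open temp-file handle keeps the file locked, so the C++
# backend cannot open the same path. Close the handle first, hand the path
# to the backend, and clean up afterwards.
fd, path = tempfile.mkstemp(suffix=".pt2")
os.close(fd)  # release the lock before the backend opens the file
try:
    ...  # e.g. package/compile into `path`, then load it from the C++ side
finally:
    os.remove(path)
```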

After the fix:

<img width="1904" height="176" alt="image" src="https://github.com/user-attachments/assets/e71b3182-0204-497b-9aca-cbbb33bc4687" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163499
Approved by: https://github.com/jansel, https://github.com/desertfire
2025-09-22 16:54:18 +00:00
9ca183e933 switch from stack based to graph based approach (#163459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163459
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #163417
2025-09-22 16:41:35 +00:00
e310cc5e06 Update fbgemm submodule (#163411)
Test Plan:

As titled, includes some new changes in fbgemm to see if the CUDA 13 breakage is fixed.

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163411
Approved by: https://github.com/Skylion007
2025-09-22 15:46:11 +00:00
eaac218b64 [ROCm] Fix environment variable AOTRITON_INSTALLED_PREFIX (#163373)
Early assignment of `__AOTRITON_LIB` breaks the use of the environment variable `$AOTRITON_INSTALLED_PREFIX`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-22 15:01:18 +00:00
509c4e8627 Update cutlass version for fbcode (#163091)
Differential Revision: [D82567751](https://our.internmc.facebook.com/intern/diff/D82567751/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163091
Approved by: https://github.com/drisspg
2025-09-22 14:31:11 +00:00
10adeb9044 Revert "[BE] Update Python min version to 3.10 (#162310)"
This reverts commit 9f5a644f0768258bc81f8b38492754d297399f74.

Reverted https://github.com/pytorch/pytorch/pull/162310 on behalf of https://github.com/malfet due to Broke lint, but to the best of my knowledge it's no longer possible to run lint for all files on PRs ([comment](https://github.com/pytorch/pytorch/pull/162310#issuecomment-3319289031))
2025-09-22 14:13:59 +00:00
9f5a644f07 [BE] Update Python min version to 3.10 (#162310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-22 13:37:02 +00:00
60b4791d08 [MPS] Fix compile linalg inv (#163452)
Fixes #161969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163452
Approved by: https://github.com/Skylion007
2025-09-22 10:36:52 +00:00
96a3afb8ec Simplify BFLOAT16_AVAILABLE (#163445)
Simplify `BFLOAT16_AVAILABLE` by using `torch.cuda.is_bf16_supported()`  and `torch.xpu.is_bf16_supported()`. Outdated comments are also removed.
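
A sketch of the simplified check (the exact guards around availability are assumptions, not the literal diff):

```python
import torch

# bf16 support reported directly by the accelerator backends.
BFLOAT16_AVAILABLE = (
    (torch.cuda.is_available() and torch.cuda.is_bf16_supported())
    or (torch.xpu.is_available() and torch.xpu.is_bf16_supported())
)
print(BFLOAT16_AVAILABLE)
```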

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163445
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-09-22 07:31:46 +00:00
edafc902d7 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit d1993c27ae59842c887d549a3f8936fbcd769498.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/wdvr due to reverted internally, please see D82771705 @PaliC ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3317110247))
2025-09-22 06:22:37 +00:00
ae5be038a6 Revert "Delete functorch C extension entirely. (#163340)"
This reverts commit 1faf6367e396b1d0894e8735912a47ac465f469d.

Reverted https://github.com/pytorch/pytorch/pull/163340 on behalf of https://github.com/wdvr due to temporary revert to pull out #162659 ([comment](https://github.com/pytorch/pytorch/pull/163340#issuecomment-3317105243))
2025-09-22 06:20:04 +00:00
f0078941cf Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530))
2025-09-22 05:39:07 +00:00
3a7db34cf9 Revert "[SymmMem] Promote @requires_nvshmem instead of enable_triton (#163423)"
This reverts commit 5d8a226e23339e7243a2a84afd174f685f145b68.

Reverted https://github.com/pytorch/pytorch/pull/163423 on behalf of https://github.com/wdvr due to temporary reverting to back out #162594 ([comment](https://github.com/pytorch/pytorch/pull/163423#issuecomment-3317011500))
2025-09-22 05:35:41 +00:00
281bb56cc5 Enable half precision types on test_conv_cudnn_nhwc_support (#163444)
This PR adds float16 and bfloat16 cases to `test_conv_cudnn_nhwc_support` and removes outdated comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163444
Approved by: https://github.com/Skylion007
2025-09-22 04:11:20 +00:00
01f927eb40 Remove workarounds for Python 3.6 (#163440)
This PR removes tuple unpacking workarounds for Py 3.6 from two distributed files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163440
Approved by: https://github.com/ezyang
2025-09-22 04:08:04 +00:00
0b59492853 [export] Fix wrap_with_set_grad_enabled retracing (#163295)
Fixes https://github.com/pytorch/pytorch/issues/163294

The code `with torch.set_grad_enabled(enable_grad)` calls `torch._C._set_grad_enabled` three times -- (1) when [initializing set_grad_enabled](bb7c9a2d41/torch/autograd/grad_mode.py (L187C9-L187C35)), (2) when [entering the context](bb7c9a2d41/torch/autograd/grad_mode.py (L194)), and (3) when [exiting the context](bb7c9a2d41/torch/autograd/grad_mode.py (L197)).

This results in the retraced export module having a duplicate `torch._C._set_grad_enabled`, like:
```
def forward(self, arg0_1):
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    _set_grad_enabled = torch._C._set_grad_enabled(False);  _set_grad_enabled = None
    _set_grad_enabled = torch._C._set_grad_enabled(False);  _set_grad_enabled = None
    add_1 = torch.ops.aten.add.Tensor(add, 2);  add = None
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True);  _set_grad_enabled_1 = None
    add_2 = torch.ops.aten.add.Tensor(add_1, 3);  add_1 = None
    return (add_2,)
```

When export runs the `replace_set_grad_with_hop_pass`, it looks through the graph for `torch._C._set_grad_enabled` and creates subgraphs. The duplicate `torch._C._set_grad_enabled` results in an empty submod in the graph, which caused the error in [this post](https://fb.workplace.com/groups/1028545332188949/posts/1844720036398281/?comment_id=1862175381319413).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163295
Approved by: https://github.com/yushangdi
2025-09-21 22:54:40 +00:00
8a281d7214 [submodule] Bump libfmt to 12.0.0 (#163441)
libfmt 12.0 brings new optimisations and fixes some compilation issues for clang 21 (https://github.com/fmtlib/fmt/pull/4477).
For a detailed release log, see https://github.com/fmtlib/fmt/releases/tag/12.0.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163441
Approved by: https://github.com/Skylion007
2025-09-21 22:37:25 +00:00
6ac2b3ae35 [BE] Adding aliases for CUDA and XPU API documentation (#162984)
This PR reorganizes the CUDA and XPU API documentation with additional alias pages. Duplicate entries for APIs under torch.cuda are thus removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162984
Approved by: https://github.com/janeyx99
2025-09-21 22:28:27 +00:00
8b14f43da9 [torch] DRY a couple of lines in unpickler (#163447)
Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D82660989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163447
Approved by: https://github.com/Skylion007
2025-09-21 20:29:33 +00:00
4d3d32f14c Add torchfuzz initial impl. (#163417)
All details are in readme.md.
Note: one thing I want to do soon is to switch to a graph representation instead of a stack representation
for the fuzzed ops; it should make things easier as things get more complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163417
Approved by: https://github.com/bobrenjc93
2025-09-21 19:17:54 +00:00
5599f487ef Fully native DTensor.__new__ (#162508)
Move the entirety of `__new__` into C++, saving a layer of disable_dynamo and making progress toward all-C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162508
Approved by: https://github.com/ezyang
ghstack dependencies: #161695
2025-09-21 18:36:05 +00:00
51152efa67 Remove autograd code for Python < 3.9 (#163313)
As PyTorch is moving to Python 3.10, it is safe to remove code for Python < 3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163313
Approved by: https://github.com/ezyang
2025-09-21 15:35:06 +00:00
f34744d2a5 [inductor] bugfix: keep WeakDeps (WAR deps) during fusion (#162316)
Fixes #159855. This was not triggered in other tests since it took
more than one round of fusion to reach the problematic code,
which prunes WeakDeps. The WeakDeps are important to inhibit
fusion of kernels that read/write data into mutated buffers
with different indexing.

We modify the code to a) always prune before fusion, rather
than after, which improves its coverage and makes our basic
vertical fusion tests surface this issue as well and b)
check whether the weak dep is fusable before eliminating it
(which basically means checking that the producing code and
the consuming code are sufficiently compatible).

The test that triggers this with change (a) is
test_fusing_write_into_disjoint_read, introduced in #118210.
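
A schematic of the revised pruning rule in change (b), with hypothetical names (`weak_deps`, `producer`, `can_fuse`) standing in for the real scheduler structures:

```python
# Keep a WAR weak dep unless the producing and consuming code are compatible
# enough that fusing them would be safe anyway; only then may the dep be pruned.
def prune_weak_deps(node, can_fuse):
    node.weak_deps = [
        dep for dep in node.weak_deps
        if not can_fuse(dep.producer, node)  # incompatible -> keep the dep to inhibit fusion
    ]
```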

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162316
Approved by: https://github.com/eellison, https://github.com/mlazos, https://github.com/shunting314
2025-09-21 13:08:11 +00:00
5d8a226e23 [SymmMem] Promote @requires_nvshmem instead of enable_triton (#163423)
### Issue
The previous `enable_triton` UI required the user-defined Triton kernel to have "nvshmem" in its name.
If users did not do so, the kernel would miss the NVSHMEM init and silently hit a CUDA IMA.

The `@requires_nvshmem` decorator eliminates the above name requirement (and the `enable_triton` call).

### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
    ...

foo[(1, 1)](...)
```
It also removes the need to pass `extern_lib` to `foo` (this is handled by the decorator now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
2025-09-21 10:03:20 +00:00
d8cbbc0f70 [Easy][AMP] Refactor the AMP logic for getting dtype (#162796)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162796
Approved by: https://github.com/ezyang
2025-09-21 06:32:35 +00:00
9ba918082a Add api info for torch._C._nn.pyi (#162707)
Fix part of #148404

APIs involved are as follows:

- multilabel_margin_loss
- multi_margin_loss
- nll_loss_nd
- relu6
- relu6_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162707
Approved by: https://github.com/ezyang
2025-09-21 06:17:15 +00:00
1faf6367e3 Delete functorch C extension entirely. (#163340)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163340
Approved by: https://github.com/aorenste
ghstack dependencies: #160236
2025-09-21 06:02:21 +00:00
4a96a6fa4a [Docs] Fix indentations in cond.md (#156147)
This is a follow-up PR to fix indentations mentioned by https://github.com/pytorch/pytorch/pull/155653#issuecomment-2971660356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156147
Approved by: https://github.com/svekars, https://github.com/cyyever
2025-09-21 05:50:50 +00:00
f591bb5056 Remove data_source argument from Sampler (#163134)
`data_source` was declared to be removed in PT 2.2 but never was.
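
A minimal sketch of a custom sampler after the removal; note that nothing is passed to the base class constructor:

```python
from torch.utils.data import Sampler

class EveryOtherSampler(Sampler[int]):
    def __init__(self, length: int) -> None:
        super().__init__()  # Sampler.__init__ no longer takes data_source
        self.length = length

    def __iter__(self):
        return iter(range(0, self.length, 2))

    def __len__(self) -> int:
        return (self.length + 1) // 2

print(list(EveryOtherSampler(5)))  # [0, 2, 4]
```
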
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163134
Approved by: https://github.com/ezyang
2025-09-21 05:44:41 +00:00
1ca9445229 [BE][Ez]: Prevent copies of std::vector in CUDA ForeachOps (#163416)
Avoid unnecessary copies of std::vector. This tensor list is copied throughout the foreach paths, and this code is on a hot path for torch optimizers. Automatic move elision will not happen on the return statement, since the returned value is a subelement of a vector and would otherwise be copied out before the std::vector is destroyed. This should remove quite a few list copies along this path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163416
Approved by: https://github.com/ezyang
2025-09-21 05:24:13 +00:00
5b386ee16e [vllm hash update] update the pinned vllm hash (#163392)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163392
Approved by: https://github.com/pytorchbot
2025-09-21 04:34:14 +00:00
269 changed files with 8582 additions and 2195 deletions

View File

@ -241,7 +241,7 @@ def wait_for_connection(addr, port, timeout=15, attempt_cnt=5):
try:
with socket.create_connection((addr, port), timeout=timeout):
return
except (ConnectionRefusedError, socket.timeout): # noqa: PERF203
except (ConnectionRefusedError, TimeoutError): # noqa: PERF203
if i == attempt_cnt - 1:
raise
time.sleep(timeout)

View File

@ -262,13 +262,10 @@ case "$tag" in
TRITON_CPU=yes
;;
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
PYTHON_VERSION=3.10
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter)
PYTHON_VERSION=3.10
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)

View File

@ -1 +1 @@
5ae38bdb0dc066c5823e34dc9797afb9de42c866
bbb06c0334a6772b92d24bde54956e675c8c6604

View File

@ -1,7 +1,7 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@1657ad2fc1acdc98aa719eebecbb0128a7c13ce4#egg=pytorch_sphinx_theme2
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@d53b0ffb9b1cda68260693ea98f3483823c88d8e#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably

View File

@ -72,7 +72,7 @@ def sample_vllm_test_library():
]
),
"pytest -v -s entrypoints/llm/test_generate.py",
"VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode",
"pytest -v -s entrypoints/offline_mode",
],
},
"vllm_regression_test": {

View File

@ -334,11 +334,17 @@ test_python() {
}
test_python_smoke() {
# Smoke tests for H100
# Smoke tests for H100/B200
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 inductor/test_max_autotune $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_python_smoke_b200() {
# Targeted smoke tests for B200 - staged approach to avoid too many failures
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_h100_distributed() {
# Distributed tests at H100
time python test/run_test.py --include distributed/_composable/test_composability/test_pp_composability.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
@ -1773,6 +1779,8 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == smoke_b200 ]]; then
test_python_smoke_b200
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then

View File

@ -1 +1 @@
9d1c50a5ac8726f4af0d4a4e85ad4d26a674ad26
090197034faf3b193c4467cedeb9281e3078892d

View File

@ -36,6 +36,7 @@ ciflow_push_tags:
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
- ciflow/b200
retryable_workflows:
- pull
- trunk

View File

@ -2,6 +2,12 @@ name: Get Changed Files
on:
workflow_call:
inputs:
all_files:
description: "Whether to return all files instead of just changed files"
required: false
type: boolean
default: false
outputs:
changed-files:
description: "List of changed files (space-separated) or '*' if not in a PR"
@ -26,17 +32,23 @@ jobs:
# Get the PR number from the github context
PR_NUMBER="${{ github.event.number }}"
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
# Check if all_files is requested
if [ "${{ inputs.all_files }}" = "true" ]; then
echo "all_files input is true, returning all files"
echo "changed-files=*" >> "$GITHUB_OUTPUT"
else
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
else
echo "Not in PR context, setting changed files to '*'"
echo "changed-files=*" >> "$GITHUB_OUTPUT"

View File

@ -70,7 +70,7 @@ jobs:
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter,
pytorch-linux-jammy-py3-clang12-executorch,
pytorch-linux-jammy-py3.12-triton-cpu,
pytorch-linux-noble-riscv64-py3.12-gcc14

View File

@ -31,6 +31,8 @@ jobs:
if: github.repository_owner == 'pytorch'
name: Get changed files
uses: ./.github/workflows/_get-changed-files.yml
with:
all_files: ${{ contains(github.event.pull_request.labels.*.name, 'lint-all-files') || contains(github.event.pull_request.labels.*.name, 'Reverted') }}
lintrunner-clang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
@ -53,7 +55,7 @@ jobs:
with:
timeout: 120
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
docker-image: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter
docker-image: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
@ -264,10 +266,10 @@ jobs:
with:
submodules: false
fetch-depth: 1
- name: Setup Python 3.9
- name: Setup Python 3.10
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.9'
python-version: '3.10'
architecture: x64
cache: pip
- name: Install dependencies

View File

@ -127,8 +127,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -140,8 +140,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

76
.github/workflows/test-b200.yml vendored Normal file
View File

@ -0,0 +1,76 @@
# B200 Smoke Tests CI Workflow
#
# This workflow runs smoke tests on B200 hardware
#
# Flow:
# 1. Builds PyTorch with CUDA 12.8+ and sm100 architecture for B200
# 2. Runs smoke tests on linux.dgx.b200 runner
# 3. Tests executed are defined in .ci/pytorch/test.sh -> test_python_smoke() function
#
# Triggered by:
# - Pull requests modifying this workflow file
# - Manual dispatch
# - Schedule (every 6 hours)
# - Adding ciflow/b200 label to a PR (creates ciflow/b200/* tag)
name: B200 Smoke Tests
on:
pull_request:
paths:
- .github/workflows/test-b200.yml
workflow_dispatch:
schedule:
- cron: 0 4,10,16,22 * * * # every 6 hours
push:
tags:
- ciflow/b200/*
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm100-build:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "smoke_b200", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
# config: "smoke_b200" maps to test_python_smoke_b200() in .ci/pytorch/test.sh
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm100-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm100-build
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -53,27 +53,3 @@ jobs:
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-py3_9-clang9-xla-build:
name: linux-jammy-py3_9-clang9-xla
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.3-lite
test-matrix: |
{ include: [
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
secrets: inherit
linux-jammy-py3_9-clang9-xla-test:
name: linux-jammy-py3_9-clang9-xla
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-py3_9-clang9-xla-build
with:
build-environment: linux-jammy-py3.9-clang9-xla
docker-image: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.test-matrix }}
secrets: inherit

View File

@ -196,6 +196,7 @@ exclude_patterns = [
'tools/test/gen_operators_yaml_test.py',
'tools/test/gen_oplist_test.py',
'tools/test/test_selective_build.py',
'tools/experimental/dynamic_shapes/torchfuzz/**',
]
command = [
'python3',

View File

@ -1,5 +1,4 @@
cmake_minimum_required(VERSION 3.27 FATAL_ERROR)
# cmake_policy(SET CMP0022 NEW) cmake_policy(SET CMP0023 NEW)
# Use compiler ID "AppleClang" instead of "Clang" for XCode. Not setting this
# sometimes makes XCode C compiler gets detected as "Clang", even when the C++
@ -1486,4 +1485,4 @@ else()
To do so please export USE_PRIORITIZED_TEXT_FOR_LD=1
]])
endif()
endif()
endif()

View File

@ -317,10 +317,20 @@ IF(USE_FBGEMM_GENAI)
-greedy-reverse-local-assignment=1
-fhip-new-launch-api)
# Only compile for gfx942 for now.
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
if("gfx942" IN_LIST PYTORCH_ROCM_ARCH)
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
endif()
set(HIP_CLANG_FLAGS ${FILTERED_HIP_CLANG_FLAGS})
hip_add_library(
fbgemm_genai STATIC
${fbgemm_genai_native_rocm_hip}
HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})
set(HIP_CLANG_FLAGS ${HIP_CLANG_FLAGS_ORIGINAL})
set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)

View File

@ -401,30 +401,13 @@ T* toDLPackImpl(const Tensor& src) {
// The following code detects whether the src follows
// a continuous pattern. If the src follows such pattern (common-case)
// then we do not need to normalize the strides.
bool need_normalize_strides = false;
int64_t expected_stride = 1;
for (int i = src.dim() - 1; i >= 0; i--) {
// detect if we do not meet continuous pattern
// and the size is 1, so there is opportunity to normalize
if (src.stride(i) != expected_stride && src.size(i) == 1) {
need_normalize_strides = true;
break;
}
expected_stride *= src.size(i);
}
bool need_normalize_strides = src.dim() == 1 && src.size(0) == 1 && src.stride(0) != 1;
// less common case, try normalizing the strides
if (need_normalize_strides) {
// create a new tensor with possibly normalized strides
// gh-83069
auto shape = src.sizes();
auto strides = src.strides().vec();
for (int i = 0; i < src.dim(); i++) {
if (shape[i] < 2) {
strides[i] = 1;
}
}
view = src.as_strided(shape, strides, src.storage_offset());
view = src.as_strided(shape, {1}, src.storage_offset());
}
ATenDLMTensor<T>* atDLMTensor(new ATenDLMTensor<T>);

View File

@ -94,10 +94,10 @@ inline at::DimVector infer_size_dv(IntArrayRef shape, int64_t numel) {
inline at::SymDimVector infer_size_dv(
c10::SymIntArrayRef shape,
const c10::SymInt& numel) {
c10::SymInt numel) {
auto res = at::SymDimVector(shape);
infer_size_impl<c10::SymIntArrayRef, c10::SymInt, at::SymDimVector>(
shape, numel, res);
shape, std::move(numel), res);
return res;
}

View File

@ -6,6 +6,7 @@
#include <c10/util/TypeList.h>
#include <c10/util/intrusive_ptr.h>
#include <c10/util/order_preserving_flat_hash_map.h>
#include <optional>
#include <ATen/core/TensorBody.h>
#include <ATen/core/jit_type_base.h>

View File

@ -55,7 +55,8 @@ class TORCH_API CppSignature final {
}
private:
explicit CppSignature(std::type_index signature) : signature_(signature) {}
explicit CppSignature(std::type_index signature)
: signature_(std::move(signature)) {}
std::type_index signature_;
};

View File

@ -70,7 +70,7 @@ private:
void _print_dispatch_trace(const std::string& label, const std::string& op_name, const DispatchKeySet& dispatchKeySet) {
auto nesting_value = dispatch_trace_nesting_value();
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " ";
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << '\n';
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl;
}
} // namespace detail
@ -213,11 +213,9 @@ OperatorHandle Dispatcher::findOrRegisterName_(const OperatorName& op_name) {
// Windows build doesn't produce the destructor symbol in PyTorch libs
// causing a linker failure in downstream projects.
// x-ref https://github.com/pytorch/pytorch/issues/70032
#if defined(_WIN32)
OperatorHandle::~OperatorHandle() = default;
#endif
RegistrationHandleRAII Dispatcher::registerLibrary(const std::string& ns, std::string debug) {
RegistrationHandleRAII Dispatcher::registerLibrary(std::string ns, std::string debug) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto found = libraries_.find(ns);
TORCH_CHECK(
@ -308,7 +306,7 @@ PythonModuleMapType& pythonModulesSingleton() {
}
std::optional<std::pair<const char*, const char*>> Dispatcher::getPyStub(const OperatorName& op_name) {
std::optional<std::pair<const char*, const char*>> Dispatcher::getPyStub(OperatorName op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto found = pythonModulesSingleton().find(op_name);
if (found == pythonModulesSingleton().end()) {
@ -344,7 +342,7 @@ RegistrationHandleRAII Dispatcher::registerPythonModule(
});
}
void Dispatcher::throwIfHasPythonModule(const OperatorName& op_name) {
void Dispatcher::throwIfHasPythonModule(OperatorName op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto elt = pythonModulesSingleton().find(op_name);
if (elt == pythonModulesSingleton().end()) {
@ -364,7 +362,7 @@ void Dispatcher::throwIfHasPythonModule(const OperatorName& op_name) {
}
RegistrationHandleRAII Dispatcher::registerImpl(
const OperatorName& op_name,
OperatorName op_name,
std::optional<DispatchKey> dispatch_key,
KernelFunction kernel,
std::optional<impl::CppSignature> cpp_signature,
@ -379,7 +377,7 @@ RegistrationHandleRAII Dispatcher::registerImpl(
*this,
dispatch_key,
std::move(kernel),
cpp_signature,
std::move(cpp_signature),
std::move(inferred_function_schema),
std::move(debug)
);
@ -408,7 +406,7 @@ void Dispatcher::deregisterImpl_(const OperatorHandle& op, const OperatorName& o
cleanup(op, op_name);
}
RegistrationHandleRAII Dispatcher::registerName(const OperatorName& op_name) {
RegistrationHandleRAII Dispatcher::registerName(OperatorName op_name) {
std::lock_guard<std::mutex> lock(guard_->mutex);
auto op = findOrRegisterName_(op_name);
++op.operatorDef_->def_and_impl_count;

View File

@ -13,10 +13,15 @@
#include <condition_variable>
#include <list>
#include <mutex>
#include <type_traits>
#include <ATen/core/enum_tag.h>
#include <ATen/core/grad_mode.h>
#ifndef NDEBUG
#include <iostream>
#endif
namespace c10 {
TORCH_API bool show_dispatch_trace();
@ -250,7 +255,7 @@ class TORCH_API Dispatcher final {
// NB: steals the inferred function schema, as we may need to hold on to
// it for a bit until the real schema turns up
RegistrationHandleRAII registerImpl(
const OperatorName& op_name,
OperatorName op_name,
std::optional<DispatchKey> dispatch_key,
KernelFunction kernel,
std::optional<impl::CppSignature> cpp_signature,
@ -269,15 +274,15 @@ class TORCH_API Dispatcher final {
/**
* Given an operator, throws if we have a pystub.
*/
void throwIfHasPythonModule(const OperatorName& op_name);
void throwIfHasPythonModule(OperatorName op_name);
std::optional<std::pair<const char*, const char*>> getPyStub(
const OperatorName& op_name);
OperatorName op_name);
/**
* Register a new operator by name.
*/
RegistrationHandleRAII registerName(const OperatorName& op_name);
RegistrationHandleRAII registerName(OperatorName op_name);
/**
* Register a fallback kernel for a backend.
@ -295,9 +300,7 @@ class TORCH_API Dispatcher final {
* API. These invocations are only permitted once per program, so we raise
* an error if this is called again for the same namespace.
*/
RegistrationHandleRAII registerLibrary(
const std::string& ns,
std::string debug);
RegistrationHandleRAII registerLibrary(std::string ns, std::string debug);
// ------------------------------------------------------------------------
//
@ -445,12 +448,8 @@ class TORCH_API OperatorHandle {
OperatorHandle& operator=(OperatorHandle&&) noexcept = default;
OperatorHandle(const OperatorHandle&) = default;
OperatorHandle& operator=(const OperatorHandle&) = default;
#if defined(_WIN32)
// NOLINTNEXTLINE(performance-trivially-destructible)
~OperatorHandle();
#else
~OperatorHandle() = default;
#endif
const OperatorName& operator_name() const {
return operatorDef_->op.operator_name();

View File

@ -556,7 +556,7 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
// real_type versus fake_type: in order to be compatible with FunctionSchema
// parser, printing an argument with either MemoryFormat or Layout type should
// give us the original schema string, hence printing out real_type.
const auto& type = arg.real_type();
auto type = arg.real_type();
bool is_opt = type->kind() == OptionalType::Kind;
auto unopt_type = is_opt ? type->castRaw<OptionalType>()->getElementType() : type;

View File

@ -232,7 +232,7 @@ struct TORCH_API OptionalType : public UnionType {
static TypePtr ofTensor();
//
// global singleton
static TypePtr get(const TypePtr& inner);
static TypePtr get(TypePtr inner);
private:
explicit OptionalType(const TypePtr& contained);
@ -895,7 +895,7 @@ struct TORCH_API ListType
// the type List<T>.
// The extra "identifier" argument is needed beccause we have multiple container types
// that all re-use this function (List<T>, array<T, N>, etc.)
static TypePtr get(const std::string& identifier, const TypePtr& inner);
static TypePtr get(const std::string& identifier, TypePtr inner);
// common cast List[Tensor]
static ListTypePtr ofTensors();

View File

@ -274,7 +274,7 @@ ListTypePtr ListType::ofNumbers() {
return value;
}
TypePtr OptionalType::get(const TypePtr& inner) {
TypePtr OptionalType::get(TypePtr inner) {
static ska::flat_hash_map<TypePtr, TypePtr> containerTypePtrs;
static std::mutex mutex;
// Perf from the lock is ok because this function is guarded behind
@ -287,7 +287,7 @@ TypePtr OptionalType::get(const TypePtr& inner) {
return containerTypePtrs[inner];
}
TypePtr ListType::get(const std::string& identifier, const TypePtr& inner) {
TypePtr ListType::get(const std::string& identifier, TypePtr inner) {
static ska::flat_hash_map<std::tuple<std::string, TypePtr>, TypePtr> containerTypePtrs;
static std::mutex mutex;
// Perf from the lock is ok because this function is guarded behind

View File

@ -1637,9 +1637,7 @@ bool gemm_and_bias(
if (activation == GEMMAndBiasActivationEpilogue::RELU) {
epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
} else if (activation == GEMMAndBiasActivationEpilogue::GELU) {
#if CUDA_VERSION >= 11040 || defined(USE_ROCM)
epilogue = CUBLASLT_EPILOGUE_GELU_BIAS;
#endif
}
if (bias != nullptr) {
@ -1931,7 +1929,6 @@ void scaled_gemm(
bool use_fast_accum) {
// Note: see `cublasCommonArgs` for various non-intuitive manupulations
// of input arguments to this function.
#if CUDA_VERSION >= 11080 || defined(USE_ROCM)
const auto computeType = CUBLAS_COMPUTE_32F;
const auto scaleType = CUDA_R_32F;
const float alpha_val = 1.0;
@ -2133,8 +2130,6 @@ void scaled_gemm(
" scaleType ",
scaleType);
return;
#endif // if CUDA_VERSION >= 11080 || defined(USE_ROCM)
TORCH_CHECK(false, "scaled_gemm is only supported for CUDA 11.8 and above");
}
void int8_gemm(

View File

@ -122,7 +122,7 @@ struct DeviceThreadHandlePool : public std::enable_shared_from_this<DeviceThread
// Called by the destructor. Releases this thread's handles back into the pool.
void release() {
if(!my_handles.empty()) {
if(my_handles.size() > 0) {
auto parent = weak_parent.lock();
if (!parent) {
// If this thread exits after atexit handlers have completed, the

View File

@ -139,7 +139,7 @@ static void autogradBasedTransformSendToNext(
std::bitset<default_bitset_size> outputs_aliasing_immutable; // set = 1 for all bits
if(!grad_special_case) {
for (auto idx = stack->size() - args_size; idx < stack->size(); idx++) {
const auto& ivalue = (*stack)[idx];
const auto ivalue = (*stack)[idx];
if (!ivalue.isTensor()) {
continue; // only input that can be aliased is a tensor, not a tensor list (expect in ops without returns)
}

View File

@ -6,8 +6,6 @@
#include <ATen/functorch/BatchRulesHelper.h>
#include <algorithm>
namespace at::functorch {
typedef std::tuple<Tensor, std::optional<int64_t>> oneOutput;
@ -317,7 +315,7 @@ oneOutput linalg_lu_solve_batch_rule(
const auto LU_num_batch_dims = rankWithoutBatchDim(LU_, LU_bdim) - LU_min_rank;
const auto pivots_num_batch_dims = rankWithoutBatchDim(pivots_, pivots_bdim) - pivots_min_rank;
const auto B_num_batch_dims = rankWithoutBatchDim(B_, B_bdim) - B_min_rank;
const auto max_num_batch_dims = std::max({LU_num_batch_dims, pivots_num_batch_dims, B_num_batch_dims});
const auto max_num_batch_dims = std::max(std::max(LU_num_batch_dims, pivots_num_batch_dims), B_num_batch_dims);
LU_ = maybePadToLogicalRank(LU_, LU_bdim, max_num_batch_dims + LU_min_rank);
pivots_ = maybePadToLogicalRank(pivots_, pivots_bdim, max_num_batch_dims + pivots_min_rank);

View File

@ -897,11 +897,11 @@ Tensor& div_(Tensor& self, const Scalar& other) {
}
Tensor div(const Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div(wrapped_scalar_tensor(other), rounding_mode); // redispatch!
return self.div(wrapped_scalar_tensor(other), std::move(rounding_mode)); // redispatch!
}
Tensor& div_(Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div_(wrapped_scalar_tensor(other), rounding_mode); // redispatch!
return self.div_(wrapped_scalar_tensor(other), std::move(rounding_mode)); // redispatch!
}
// divide, alias for div
@ -926,23 +926,23 @@ Tensor& divide_(Tensor& self, const Scalar& other) {
}
Tensor& divide_out(const Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode, Tensor& result) {
return at::div_out(result, self, other, rounding_mode);
return at::div_out(result, self, other, std::move(rounding_mode));
}
Tensor divide(const Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode) {
return self.div(other, rounding_mode);
return self.div(other, std::move(rounding_mode));
}
Tensor& divide_(Tensor& self, const Tensor& other, std::optional<std::string_view> rounding_mode) {
return self.div_(other, rounding_mode);
return self.div_(other, std::move(rounding_mode));
}
Tensor divide(const Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div(other, rounding_mode);
return self.div(other, std::move(rounding_mode));
}
Tensor& divide_(Tensor& self, const Scalar& other, std::optional<std::string_view> rounding_mode) {
return self.div_(other, rounding_mode);
return self.div_(other, std::move(rounding_mode));
}
// true_divide, an alias for div

View File

@ -150,7 +150,7 @@ void histogramdd_prepare_out(const Tensor& input, const std::vector<int64_t>& bi
void histogramdd_prepare_out(const Tensor& input, TensorList bins,
const Tensor& hist, const TensorList& bin_edges) {
std::vector<int64_t> bin_ct(bins.size());
std::transform(bins.begin(), bins.end(), bin_ct.begin(), [](const Tensor& t) { return t.numel() - 1; });
std::transform(bins.begin(), bins.end(), bin_ct.begin(), [](Tensor t) { return t.numel() - 1; });
histogramdd_prepare_out(input, bin_ct, hist, bin_edges);
}

View File

@ -360,7 +360,7 @@ Tensor einsum(std::string_view equation, TensorList operands, at::OptionalIntArr
// to compute the number of dimensions covered by ellipsis.
for(const auto i : c10::irange(num_ops)) {
const auto& operand = operands[i];
const auto& labels = op_labels[i];
const auto labels = op_labels[i];
const auto ndims = operand.dim();
int64_t nlabels = static_cast<int64_t>(labels.size());
bool has_ellipsis = false;

View File

@ -237,7 +237,7 @@ TORCH_META_FUNC(linalg_vector_norm)(const Tensor& self, const Scalar& scalar_ord
at::detail::check_linalg_norm_dtype(opt_dtype, self.scalar_type(), "linalg.vector_norm");
auto mask = at::native::make_dim_mask(dim, self.dim());
auto shape = at::native::shape_from_dim_mask(self, mask, keepdim);
auto shape = at::native::shape_from_dim_mask(self, std::move(mask), keepdim);
auto options = self.options()
.dtype(toRealValueType(opt_dtype.value_or(self.scalar_type())));
@ -641,7 +641,7 @@ namespace {
Tensor linalg_matrix_power_impl(
const Tensor& self,
int64_t n,
const std::optional<Tensor>& _out) {
std::optional<Tensor> _out) {
NoTF32Guard disable_tf32;
auto out = _out.value_or(Tensor());
@ -1019,7 +1019,7 @@ Tensor multi_dot_impl(TensorList _tensors, std::optional<Tensor> _out) {
Tensor result;
if (_out.has_value()) {
const auto& out = *_out;
auto out = *_out;
TORCH_CHECK(
dtype == out.dtype(),
"multi_dot(): expected out tensor to have dtype ",

View File

@ -493,7 +493,7 @@ Tensor get_clamped_target_length(
// the gradient is implemented for _cudnn_ctc_loss (just in derivatives.yaml) and _ctc_loss and this function has automatic gradients
// it also handles the reduction if desired
template <typename LengthsType>
Tensor ctc_loss_impl(const Tensor& log_probs_, const Tensor& targets, const LengthsType& input_lengths, const LengthsType& target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) {
Tensor ctc_loss_impl(const Tensor& log_probs_, const Tensor& targets, LengthsType input_lengths, LengthsType target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) {
auto is_batched = log_probs_.dim() == 3;
Tensor log_probs = is_batched ? log_probs_ : log_probs_.unsqueeze(1);
bool use_cudnn =

View File

@ -23,8 +23,6 @@ Tensor& max_unpooling2d_forward_out_cpu(
// Nondeterministic with duplicate indices
at::globalContext().alertNotDeterministic("max_unpooling2d_forward_out");
auto oheight = output_size[0];
auto owidth = output_size[1];
TORCH_CHECK(
indices_.scalar_type() == at::ScalarType::Long,
"elements in indices should be type int64 but got: ", indices_.scalar_type());
@ -45,6 +43,9 @@ Tensor& max_unpooling2d_forward_out_cpu(
self_.sizes(), " with dimension ", i , " being empty.");
}
auto oheight = output_size[0];
auto owidth = output_size[1];
auto memory_format = self_.suggest_memory_format();
auto self = self_.contiguous(memory_format);
auto indices = indices_.contiguous(memory_format);

View File

@ -599,7 +599,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, int64_t> _batch_norm_impl_index(
check_dims_match_num_input_features("weight", num_features, weight.sym_numel());
}
if (bias.defined()) {
check_dims_match_num_input_features("bias", num_features, bias.sym_numel());
check_dims_match_num_input_features("bias", std::move(num_features), bias.sym_numel());
}
BatchNormBackend backend = _select_batch_norm_backend(input, weight, bias, running_mean, running_var, training, eps);
@ -923,7 +923,7 @@ std::tuple<Tensor, Tensor, Tensor> _batch_norm_legit_no_stats_cpu(
std::tuple<Tensor, Tensor, Tensor> _batch_norm_legit_no_training(
const Tensor& self, const std::optional<Tensor>& weight_opt, const std::optional<Tensor>& bias_opt,
const Tensor& running_mean, const Tensor& running_var, double momentum, double eps) {
return at::_native_batch_norm_legit(self, weight_opt, bias_opt, const_cast<Tensor&>(running_mean), const_cast<Tensor&>(running_var), /*training=*/false, momentum, eps);
return at::_native_batch_norm_legit(self, weight_opt, bias_opt, const_cast<Tensor&>(running_mean), const_cast<Tensor&>(running_var), /*train=*/false, momentum, eps);
}

View File

@ -1533,7 +1533,7 @@ std::tuple<Tensor, Tensor> lstm_cell(
check_rnn_cell_forward_input(input, w_ih.sym_size(1));
auto hidden_size = w_hh.sym_size(1);
check_rnn_cell_forward_hidden(input, hx[0], hidden_size, 0);
check_rnn_cell_forward_hidden(input, hx[1], hidden_size, 1);
check_rnn_cell_forward_hidden(input, hx[1], std::move(hidden_size), 1);
static at::Tensor undefined;
return LSTMCell<CellParams>{}(input, std::make_tuple(hx[0], hx[1]), CellParams{w_ih, w_hh, b_ih, b_hh, undefined});
}
@ -1612,13 +1612,13 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor> _thnn_differentiable_gru_cell
h_g = h_g + hidden_bias;
}
auto chunked_input_gates = in_g.unsafe_chunk(3, 1);
const Tensor& ir = chunked_input_gates[0];
const Tensor& ii = chunked_input_gates[1];
const Tensor& in = chunked_input_gates[2];
Tensor ir = chunked_input_gates[0];
Tensor ii = chunked_input_gates[1];
Tensor in = chunked_input_gates[2];
auto chunked_hidden_gates = h_g.unsafe_chunk(3, 1);
const Tensor& hr = chunked_hidden_gates[0];
const Tensor& hi = chunked_hidden_gates[1];
const Tensor& hn = chunked_hidden_gates[2];
Tensor hr = chunked_hidden_gates[0];
Tensor hi = chunked_hidden_gates[1];
Tensor hn = chunked_hidden_gates[2];
Tensor rg = (ir + hr).sigmoid();
Tensor ig = (ii + hi).sigmoid();
Tensor grad_hx = grad_hy * ig;

View File

@ -409,17 +409,17 @@ static inline Tensor& unary_op_impl_out(Tensor& result, const Tensor& self, Stub
}
template <typename Stub, typename ...Args>
static inline Tensor& unary_op_impl_float_out(Tensor& result, const Tensor& self, Stub& stub, Args&&... args) {
static inline Tensor& unary_op_impl_float_out(Tensor& result, const Tensor& self, Stub& stub, Args... args) {
auto iter = TensorIterator::unary_float_op(result, self);
stub(iter.device_type(), iter, std::forward<Args>(args)...);
stub(iter.device_type(), iter, args...);
return result;
}
template <typename Stub, typename ...Args>
static inline Tensor unary_op_impl_float(const Tensor& self, Stub& stub, Args&&... args) {
static inline Tensor unary_op_impl_float(const Tensor& self, Stub& stub, Args... args) {
Tensor result;
auto iter = TensorIterator::unary_float_op(result, self);
stub(iter.device_type(), iter, std::forward<Args>(args)...);
stub(iter.device_type(), iter, args...);
return iter.output();
}

View File

@ -323,7 +323,7 @@ std::tuple<Tensor, Tensor, Tensor> unique_consecutive_cpu_template(
template<class ForwardIt>
ForwardIt _unique_dim_cpu_impl(ForwardIt first, ForwardIt last,
std::vector<int64_t>& indices, const Tensor& inverse_indices_vec, const Tensor& counts) {
std::vector<int64_t>& indices, Tensor inverse_indices_vec, Tensor counts) {
if (first == last) {
return last;
}

View File

@ -24,7 +24,7 @@ constexpr int64_t num_output_channels_index [[maybe_unused]] = 10;
constexpr int64_t num_input_channels_index [[maybe_unused]] = 11;
template <typename TENSOR_DTYPE, typename VEC_DTYPE>
std::vector<VEC_DTYPE> unwrap_vector(const at::Tensor& tensor) {
std::vector<VEC_DTYPE> unwrap_vector(at::Tensor tensor) {
std::vector<VEC_DTYPE> vec(tensor.numel());
TENSOR_DTYPE* tensor_data_ptr = tensor.data_ptr<TENSOR_DTYPE>();
std::copy(tensor_data_ptr, tensor_data_ptr + tensor.numel(), vec.data());
@ -39,7 +39,7 @@ std::vector<VEC_DTYPE> unwrap_vector(const at::Tensor& tensor) {
*/
void unpack_bcsr(
int8_t* dst,
const ao::sparse::BCSR& bcsr,
ao::sparse::BCSR bcsr,
const int64_t R,
const int64_t C,
const int64_t RB,

View File

@ -999,12 +999,41 @@ void gpu_kernel_impl(TensorIteratorBase& iter, const func_t& f) {
dtypes[i] = iter.dtype(i);
}
auto offset_calc = ::make_offset_calculator<traits::arity + 1>(iter);
#ifdef USE_ROCM
constexpr int grp_sz = 128;
launch_legacy_kernel_manual_unroll<grp_sz, 4>(numel, [=] GPU_LAMBDA(int idx, bool unrl) {
if (unrl) {
auto offsets0 = offset_calc.get(idx);
auto offsets1 = offset_calc.get(idx + grp_sz);
auto offsets2 = offset_calc.get(idx + grp_sz * 2);
auto offsets3 = offset_calc.get(idx + grp_sz * 3);
void* out0 = data[0] + offsets0[0];
void* out1 = data[0] + offsets1[0];
void* out2 = data[0] + offsets2[0];
void* out3 = data[0] + offsets3[0];
arg0_t result0 = invoke(f, &data[1], &offsets0[1], &dtypes[1], 1);
arg0_t result1 = invoke(f, &data[1], &offsets1[1], &dtypes[1], 1);
arg0_t result2 = invoke(f, &data[1], &offsets2[1], &dtypes[1], 1);
arg0_t result3 = invoke(f, &data[1], &offsets3[1], &dtypes[1], 1);
c10::cast_and_store<arg0_t>(dtypes[0], out0, result0);
c10::cast_and_store<arg0_t>(dtypes[0], out1, result1);
c10::cast_and_store<arg0_t>(dtypes[0], out2, result2);
c10::cast_and_store<arg0_t>(dtypes[0], out3, result3);
} else {
auto offsets = offset_calc.get(idx);
void* out = data[0] + offsets[0];
arg0_t result = invoke(f, &data[1], &offsets[1], &dtypes[1], 1);
c10::cast_and_store<arg0_t>(dtypes[0], out, result);
}
});
#else
launch_legacy_kernel<128, 4>(numel, [=] GPU_LAMBDA(int idx) {
auto offsets = offset_calc.get(idx);
void* out = data[0] + offsets[0];
arg0_t result = invoke(f, &data[1], &offsets[1], &dtypes[1], 1);
c10::cast_and_store<arg0_t>(dtypes[0], out, result);
});
#endif
}
}

View File

@ -51,7 +51,7 @@ std::vector<Tensor> foreach_tensor_list_op(
Op<opmath_t>(),
alpha.to<opmath_t>());
return tensor_lists[2];
return std::move(tensor_lists[2]);
}
template <typename T, template <class> class Op>

View File

@ -45,7 +45,7 @@ std::vector<Tensor> foreach_binary_op(
/* res_arg_index */ 1>(),
Op<opmath_t>(),
scalar.to<opmath_t>());
return tensor_lists[1];
return std::move(tensor_lists[1]);
}
template <typename T, template <class> class Op>

View File

@ -33,7 +33,7 @@ std::vector<Tensor> foreach_binary_op(
}
tensor_lists.emplace_back(tensors.vec());
tensor_lists.emplace_back(vec_res);
tensor_lists.emplace_back(std::move(vec_res));
using opmath_t = at::opmath_type<T>;
multi_tensor_apply<2, opmath_t>(
@ -46,7 +46,7 @@ std::vector<Tensor> foreach_binary_op(
/* res_arg_index */ 1>(),
Op<opmath_t>());
return tensor_lists[1];
return std::move(tensor_lists[1]);
}
template <typename T, template <class> class Op>

View File

@ -56,7 +56,7 @@ std::vector<Tensor> foreach_binary_op(
Op<opmath_t>(),
scalar.data_ptr<T>(),
alpha.to<opmath_t>());
return tensor_lists[1];
return std::move(tensor_lists[1]);
}
template <typename T, template <class> class Op>

View File

@ -57,7 +57,7 @@ std::vector<Tensor> foreach_pointwise_op(
scalar.to<opmath_t>());
});
return tensor_lists[3];
return std::move(tensor_lists[3]);
}
template <template <class> class Op>
@ -160,7 +160,7 @@ std::vector<Tensor> foreach_pointwise_op(
Op<opmath_t>());
});
return tensor_lists[3];
return std::move(tensor_lists[3]);
}
#define FOREACH_POINTWISE_OP_SCALAR(NAME, OP) \

View File

@ -37,7 +37,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_ternary_cuda(
vec_res.emplace_back(at::native::empty_like(t));
}
std::vector<std::vector<at::Tensor>> tensor_lists{
tensors1.vec(), tensors2.vec(), tensors3.vec(), vec_res};
tensors1.vec(), tensors2.vec(), tensors3.vec(), std::move(vec_res)};
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
at::ScalarType::Half,
@ -56,7 +56,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_ternary_cuda(
LerpFunctor<opmath_t>());
});
return tensor_lists[3];
return std::move(tensor_lists[3]);
}
void foreach_tensor_lerp_ternary_cuda_(
@ -104,7 +104,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_list_cuda(
vec_res.emplace_back(at::native::empty_like(t));
}
std::vector<std::vector<at::Tensor>> tensor_lists{
tensors1.vec(), tensors2.vec(), vec_res};
tensors1.vec(), tensors2.vec(), std::move(vec_res)};
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
at::ScalarType::Half,
@ -124,7 +124,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_list_cuda(
weight.to<opmath_t>());
});
return tensor_lists[2];
return std::move(tensor_lists[2]);
}
void foreach_tensor_lerp_list_cuda_(
@ -173,7 +173,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_scalarlist_cuda(
vec_res.emplace_back(at::native::empty_like(t));
}
std::vector<std::vector<at::Tensor>> tensor_lists{
tensors1.vec(), tensors2.vec(), vec_res};
tensors1.vec(), tensors2.vec(), std::move(vec_res)};
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
at::ScalarType::Half,
@ -193,7 +193,7 @@ std::vector<at::Tensor> foreach_tensor_lerp_scalarlist_cuda(
LerpFunctor<opmath_t>());
});
return tensor_lists[2];
return std::move(tensor_lists[2]);
}
void foreach_tensor_lerp_scalarlist_cuda_(

View File

@ -67,7 +67,7 @@ std::vector<Tensor> foreach_unary_op(TensorList tensors) {
/* res_arg_index */ 1>(),
Op<opmath_t>());
return tensor_lists[1];
return std::move(tensor_lists[1]);
}
template <typename scalar_t, template <class> class Op>

View File

@ -125,8 +125,6 @@ Tensor& max_unpooling2d_forward_out_cuda(const Tensor& self_,
TORCH_CHECK(
indices_.scalar_type() == at::ScalarType::Long,
"elements in indices should be type int64 but got: ", indices_.scalar_type());
auto oheight = output_size[0];
auto owidth = output_size[1];
TensorArg output_arg{output, "output", 1}, self_arg{self_, "self_", 2},
indices_arg{indices_, "indices_", 3};
@ -149,6 +147,9 @@ Tensor& max_unpooling2d_forward_out_cuda(const Tensor& self_,
output_size.size() == 2,
"There should be exactly two elements (height, width) in output_size, but got ", output_size.size(), " elements.");
auto oheight = output_size[0];
auto owidth = output_size[1];
int64_t dimw = 2;
int64_t dimh = 1;
int64_t numBatch = 1;
@ -217,9 +218,6 @@ static void max_unpooling3d_shape_check(
IntArrayRef stride,
IntArrayRef padding,
const char *fn_name) {
int64_t oT = output_size[0];
int64_t oH = output_size[1];
int64_t oW = output_size[2];
TORCH_CHECK(
indices.scalar_type() == at::ScalarType::Long,
"elements in indices should be type int64 but got: ", indices.scalar_type());
@ -250,6 +248,10 @@ static void max_unpooling3d_shape_check(
"strides should be greater than zero, but got stride: ",
stride);
int64_t oT = output_size[0];
int64_t oH = output_size[1];
int64_t oW = output_size[2];
int dimw = 3;
int dimh = 2;
int dimt = 1;
@ -402,8 +404,6 @@ at::Tensor& max_unpooling2d_backward_out_cuda(const Tensor& grad_output_,
const Tensor& indices_,
IntArrayRef output_size,
Tensor& grad_input) {
int64_t oheight = output_size[0];
int64_t owidth = output_size[1];
TORCH_CHECK(grad_input.is_contiguous(), "grad_input must be contiguous");
TORCH_CHECK(
indices_.scalar_type() == at::ScalarType::Long,
@ -426,6 +426,9 @@ at::Tensor& max_unpooling2d_backward_out_cuda(const Tensor& grad_output_,
TORCH_CHECK(output_size.size() == 2, "output_size must have two elements, got size: ", output_size.size());
int64_t oheight = output_size[0];
int64_t owidth = output_size[1];
int64_t nInputCols, nInputRows, nInputPlane;
int dimw = 2;
@ -505,13 +508,14 @@ at::Tensor& max_unpooling3d_backward_out_cuda(const Tensor& grad_output_,
IntArrayRef padding,
Tensor& grad_input) {
TORCH_CHECK(grad_input.is_contiguous(), "grad_input must be contiguous");
int64_t oT = output_size[0];
int64_t oH = output_size[1];
int64_t oW = output_size[2];
max_unpooling3d_shape_check(
self_, grad_output_, indices_, output_size, stride, padding, "max_unpooling3d_backward_out_cuda()");
int64_t oT = output_size[0];
int64_t oH = output_size[1];
int64_t oW = output_size[2];
int batchSize = 0;
int inputSlices = 0;
int inputTime = 0;

View File

@ -300,8 +300,6 @@ void nonzero_static_cuda_out_impl(
int64_t size,
int64_t fill_value,
Tensor& out) {
#if defined(CUDA_VERSION) || defined(USE_ROCM)
Tensor self_contiguous_ = self.contiguous();
// see comment in nonzero_cuda_out_impl on reqs for out
bool out_correct_size =
@ -377,9 +375,6 @@ void nonzero_static_cuda_out_impl(
if (need_to_copy) {
out.copy_(out_temp);
}
#else
TORCH_CHECK(false, "Nonzero_static is not supported for cuda <= 11.4");
#endif
}
Tensor& nonzero_out_cuda(const Tensor& self, Tensor& out) {

View File

@ -221,22 +221,9 @@ static const Tensor& _exec_fft(Tensor& out, const Tensor& self, IntArrayRef out_
std::optional<CuFFTConfig> uncached_plan;
const CuFFTConfig * config = nullptr;
// Workaround for gh-63152, gh-58724
// Bluestein plans in CUDA 11.1 (cufft 10.3) cannot be re-used
// Bluestein's algorithm is only used when a size has large prime factors,
// sizes with only small prime factors can still be cached
bool use_caching = true;
#ifdef CUFFT_VERSION
if constexpr (10300 <= CUFFT_VERSION && CUFFT_VERSION < 10400) {
// Only cache plans for transforms with small prime factors
use_caching = std::none_of(
signal_size.begin() + 1, signal_size.end(), [](int64_t dim_size) {
return has_large_prime_factor(dim_size);
});
}
#endif
if (use_caching && plan_cache.max_size() > 0) {
if (plan_cache.max_size() > 0) {
guard.lock();
if (plan_cache.max_size() > 0) { // check again after acquiring the lock
config = &plan_cache.lookup(Params);

View File

@ -35,7 +35,7 @@ C10_ALWAYS_INLINE void _check_rms_norm_inputs_symint(
std::stringstream ss;
ss << "Given normalized_shape=" << normalized_shape
<< ", expected input with shape [*";
for (const auto& size : normalized_shape) {
for (auto size : normalized_shape) {
ss << ", " << size;
}
ss << "], but got input of size" << input_shape;

View File

@ -198,7 +198,7 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
if (input_t.is_contiguous(memory_format) && output_t.is_contiguous(memory_format) && is_macOS_15_0_or_newer) {
inputNDArray = getMPSNDArray(input_t, inputShape);
outputNDArray = getMPSNDArray(*output, outputShape);
outputNDArray = getMPSNDArray(output_t, outputShape);
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
@ -302,7 +302,7 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
}
}
auto outputPlaceholder = outputNDArray ? Placeholder(cachedGraph->outputTensor_, outputNDArray)
: Placeholder(cachedGraph->outputTensor_, *output);
: Placeholder(cachedGraph->outputTensor_, output_t);
NSMutableDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds =
[[[NSMutableDictionary alloc] initWithCapacity:3] autorelease];
@ -315,7 +315,7 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
return *output;
return output_t;
}
Tensor _mps_convolution(const Tensor& input_t,

View File

@ -20,6 +20,7 @@
#include <ATen/ops/baddbmm_native.h>
#include <ATen/ops/bmm_native.h>
#include <ATen/ops/cholesky_native.h>
#include <ATen/ops/eye_native.h>
#include <ATen/ops/linalg_cholesky_ex_native.h>
#include <ATen/ops/linalg_inv_ex_native.h>
#include <ATen/ops/linalg_lu_factor_ex_native.h>
@ -496,26 +497,24 @@ static void linalg_inv_ex_out_mps_impl(const Tensor& A, bool check_errors, const
using namespace mps;
TORCH_CHECK(result.is_mps(), "Output tensor is not MPS");
TORCH_CHECK(!A.is_complex(), "linalg_inv: not supported for complex types yet!");
using CachedGraph = MPSUnaryCachedGraph;
MPSStream* stream = getCurrentMPSStream();
info.zero_();
if (A.numel() == 0) {
return;
}
if (!result.is_contiguous()) {
result.unsafeGetTensorImpl()->empty_tensor_restride(MemoryFormat::Contiguous);
}
auto A_sizes = A.sizes();
int ndim = A.dim();
Tensor LU = empty_like(A);
Tensor identity = zeros_like(A);
Tensor LU = empty_like(A, MemoryFormat::Contiguous);
Tensor identity = eye(A.size(-2), A.size(-1), A.scalar_type(), A.options().layout(), A.device()).expand_as(A);
Tensor pivots = empty({A_sizes.begin(), A_sizes.end() - 1}, A.options().dtype(kInt));
(ndim == 2 ? identity.diagonal() : identity.diagonal(0, -2, -1)).fill_(1);
linalg_solve_out_mps_impl(A, identity, true, check_errors, result, LU, pivots, info);
// need to do this to keep the strides of the result tensor
// mps's solve expects row major layout, while inductor
// expects result to be column major
Tensor tmp = empty_like(A, MemoryFormat::Contiguous);
linalg_solve_out_mps_impl(A, identity, true, check_errors, tmp, LU, pivots, info);
result.copy_(tmp);
}
static Tensor& mm_out_mps_impl(const Tensor& self, const Tensor& other, Tensor& output) {

View File

@ -519,6 +519,13 @@ static void max_unpool_out_mps_template(const Tensor& input,
Tensor& output,
const int32_t pooling_dims,
const std::string& op_name) {
TORCH_CHECK(output_size_.size() == static_cast<size_t>(pooling_dims),
op_name,
"There should be exactly ",
pooling_dims,
" elements but got ",
output_size_.size());
auto dims = input.dim();
auto leading_dims = input.dim() - pooling_dims;

View File

@ -77,7 +77,7 @@ static Tensor NestedTensor_elementwise_Tensor(
const Tensor& other,
const std::string& op_name,
bool supports_striding,
const Func& f) {
Func f) {
Tensor self_contiguous = self;
Tensor other_contiguous = other;
// self is a scalar
@ -238,7 +238,7 @@ static Tensor& NestedTensor_elementwise__Tensor(
Tensor& self,
const Tensor& other,
const std::string& op_name,
const Func& f) {
Func f) {
// self is a scalar
if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) {
auto other_impl = get_nested_tensor_impl(other);

View File

@ -149,7 +149,7 @@ Tensor MakeStridedQTensorCPU(
const IntArrayRef& sizes,
const IntArrayRef& strides,
const TensorOptions& options,
const QuantizerPtr& quantizer) {
QuantizerPtr quantizer) {
AT_ASSERT(options.device().is_cpu());
at::native::check_size_nonnegative(sizes);
auto* allocator = at::getCPUAllocator();

View File

@ -37,7 +37,7 @@ struct TORCH_API PackedLinearWeight : public LinearPackedParamsBase {
col_offsets(std::move(col_offsets)),
w_scale(std::move(w_scale)),
w_zp(std::move(w_zp)),
q_scheme(q_scheme) {}
q_scheme(std::move(q_scheme)) {}
std::unique_ptr<fbgemm::PackBMatrix<int8_t>> w;
std::optional<at::Tensor> bias_;
std::vector<int32_t> col_offsets;
@ -316,7 +316,7 @@ Tensor MakeStridedQTensorCPU(
const IntArrayRef& sizes,
const IntArrayRef& strides,
const TensorOptions& options,
const QuantizerPtr& quantizer);
QuantizerPtr quantizer);
Tensor MakeEmptyAffineQuantizedChannelsLast3dTensor(
int64_t N,

View File

@ -7,7 +7,7 @@ QTensorImpl::QTensorImpl(
DispatchKeySet key_set,
const caffe2::TypeMeta data_type,
QuantizerPtr quantizer)
: TensorImpl(std::move(storage), key_set, data_type),
: TensorImpl(std::move(storage), std::move(key_set), data_type),
quantizer_(std::move(quantizer)) {}
QTensorImpl::QTensorImpl(
@ -16,7 +16,7 @@ QTensorImpl::QTensorImpl(
DispatchKeySet key_set,
const caffe2::TypeMeta data_type,
QuantizerPtr quantizer)
: TensorImpl(type, std::move(storage), key_set, data_type),
: TensorImpl(type, std::move(storage), std::move(key_set), data_type),
quantizer_(std::move(quantizer)) {}
const char* QTensorImpl::tensorimpl_type_name() const {

View File

@ -4,8 +4,6 @@
#include <c10/core/TensorImpl.h>
#include <c10/util/Exception.h>
#include <utility>
namespace at {
/**
@ -38,7 +36,7 @@ struct TORCH_API QTensorImpl : public c10::TensorImpl {
}
void set_quantizer_(QuantizerPtr quantizer) {
quantizer_ = std::move(quantizer);
quantizer_ = quantizer;
}
/**

View File

@ -107,7 +107,7 @@ static int64_t get_sub_byte_tensor_size(IntArrayRef sizes, size_t dtype_itemsize
inline Tensor new_qtensor(
IntArrayRef sizes,
const TensorOptions& options,
const QuantizerPtr& quantizer) {
QuantizerPtr quantizer) {
auto memory_format = options.memory_format_opt().value_or(MemoryFormat::Contiguous);
auto device = options.device();
at::Allocator* allocator = nullptr;
@ -338,7 +338,7 @@ Tensor from_blob_quantized_per_tensor_affine(
const std::size_t datasize = size * itemsize;
DataPtr data_ptr = InefficientStdFunctionContext::makeDataPtr(
data, std::move(deleter), options.device());
data, deleter, options.device());
Storage storage{Storage::use_byte_size_t{}, datasize, std::move(data_ptr)};
@ -411,7 +411,7 @@ Tensor from_blob_quantized_per_channel_affine(
const std::size_t datasize = size * itemsize;
DataPtr data_ptr = InefficientStdFunctionContext::makeDataPtr(
data, std::move(deleter), options.device());
data, deleter, options.device());
Storage storage{Storage::use_byte_size_t{}, datasize, std::move(data_ptr)};

View File

@ -196,8 +196,8 @@ struct TORCH_API PerChannelAffineFloatQParamsQuantizer : public PerChannelAffine
Tensor zero_points,
int64_t axis)
: PerChannelAffineQuantizer(scalar_type,
std::move(scales),
std::move(zero_points),
scales,
zero_points,
axis) {}
QScheme qscheme() const override {
@ -246,7 +246,7 @@ TORCH_API QuantizerPtr make_unknown_quantizer(ScalarType scalar_type);
TORCH_API Tensor new_qtensor(
IntArrayRef sizes,
const TensorOptions& options,
const QuantizerPtr& quantizer);
QuantizerPtr quantizer);
TORCH_API void set_quantizer_(const Tensor& self, ConstQuantizerPtr quantizer);

View File

@ -3269,7 +3269,7 @@ class C10_TensorImpl_Size_Check_Dummy_Class : private TensorImpl {
is_le<sizeof(autograd_meta_), 16, FieldNameEnum::autograd_meta_>();
is_le<sizeof(extra_meta_), 16, FieldNameEnum::extra_meta_>();
are_equal<sizeof(version_counter_), 8, FieldNameEnum::version_counter_>();
are_equal<sizeof(pyobj_slot_), 8, FieldNameEnum::pyobj_slot_>();
are_equal<sizeof(pyobj_slot_), 16, FieldNameEnum::pyobj_slot_>();
are_equal<sizeof(sizes_and_strides_), 88, FieldNameEnum::sizes_and_strides_>();
are_equal<sizeof(storage_offset_), 8, FieldNameEnum::storage_offset_>();
are_equal<sizeof(numel_), 8, FieldNameEnum::numel_>();

View File

@ -13,10 +13,11 @@ struct C10_API PyInterpreterHooksInterface {
// Get the PyInterpreter instance
// Stub implementation throws error when Python is not available
// We return nullptr rather than throwing an error since there are bits of c10
// that expect an empty PyObjectSlot when python is not available.
virtual PyInterpreter* getPyInterpreter() const {
return nullptr;
TORCH_CHECK(
false,
"PyTorch was compiled without Python support. "
"Cannot access Python interpreter from C++.");
}
};

View File

@ -2,7 +2,7 @@
namespace c10::impl {
PyObjectSlot::PyObjectSlot() : pyobj_(nullptr) {}
PyObjectSlot::PyObjectSlot() : pyobj_interpreter_(nullptr), pyobj_(nullptr) {}
PyObjectSlot::~PyObjectSlot() {
maybe_destroy_pyobj();
@ -10,9 +10,9 @@ PyObjectSlot::~PyObjectSlot() {
void PyObjectSlot::maybe_destroy_pyobj() {
if (owns_pyobj()) {
TORCH_INTERNAL_ASSERT(getGlobalPyInterpreter() != nullptr);
TORCH_INTERNAL_ASSERT(pyobj_interpreter_ != nullptr);
TORCH_INTERNAL_ASSERT(pyobj_ != nullptr);
(*getGlobalPyInterpreter())
(*pyobj_interpreter_.load(std::memory_order_acquire))
->decref(_unchecked_untagged_pyobj(), /*has_pyobj_slot*/ true);
// NB: this destructor can only be entered when there are no
// references to this C++ object (obviously), NOR any references
@ -25,7 +25,7 @@ void PyObjectSlot::maybe_destroy_pyobj() {
}
PyInterpreter* PyObjectSlot::pyobj_interpreter() {
return getGlobalPyInterpreter();
return pyobj_interpreter_.load(std::memory_order_acquire);
}
PyObject* PyObjectSlot::_unchecked_untagged_pyobj() const {
@ -35,7 +35,7 @@ PyObject* PyObjectSlot::_unchecked_untagged_pyobj() const {
}
PyInterpreter& PyObjectSlot::load_pyobj_interpreter() const {
auto interpreter = getGlobalPyInterpreter();
auto interpreter = pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter) {
return *interpreter;
}

View File

@ -6,17 +6,10 @@
#include <c10/util/python_stub.h>
#include <optional>
#include <atomic>
namespace c10::impl {
// Function pointer type for getting the global interpreter
using GetPyInterpreterFn = PyInterpreter* (*)();
// Global function pointer (set by csrc initialization)
C10_API extern GetPyInterpreterFn g_get_pyinterpreter_fn;
// Helper function to get the global interpreter
C10_API PyInterpreter* getGlobalPyInterpreter();
struct C10_API PyObjectSlot {
public:
PyObjectSlot();
@ -33,6 +26,8 @@ struct C10_API PyObjectSlot {
// NB: THIS FUNCTION CAN RAISE AN EXCEPTION. Make sure to clean up after
// PyObject if necessary!
void init_pyobj(PyObject* pyobj) {
pyobj_interpreter_.store(
getGlobalPyInterpreter(), std::memory_order_relaxed);
pyobj_ = pyobj;
}
@ -60,15 +55,18 @@ struct C10_API PyObjectSlot {
// @todo alban: I'm not too sure what's going on here, we can probably delete
// it but it's worthwhile making sure
std::optional<PyObject*> check_pyobj() const {
impl::PyInterpreter* interpreter = getGlobalPyInterpreter();
if (interpreter == nullptr || pyobj_ == nullptr) {
std::optional<PyObject*> check_pyobj(bool ignore_hermetic_tls = false) const {
impl::PyInterpreter* interpreter =
pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter == nullptr) {
return std::nullopt;
}
if (c10::impl::HermeticPyObjectTLS::get_state()) {
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else {
return _unchecked_untagged_pyobj();
}
return _unchecked_untagged_pyobj();
}
PyInterpreter& load_pyobj_interpreter() const;
@ -78,6 +76,30 @@ struct C10_API PyObjectSlot {
void set_owns_pyobj(bool b);
private:
// This field contains the interpreter tag for this object. See
// Note [Python interpreter tag] for general context
//
// Note [Memory ordering on Python interpreter tag]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// What memory_order do we need when accessing this atomic? We don't
// need a single total modification order (as provided by
// memory_order_seq_cst) as pyobj_interpreter_ is monotonic: it can only
// transition from -1 to some positive integer and never changes afterwards.
// Because there is only one modification, it trivially already has a total
// modification order (e.g., we don't need fences or locked instructions on
// x86)
//
// In fact, one could make a reasonable argument that relaxed reads are OK,
// due to the presence of external locking (GIL) to ensure that interactions
// with other data structures are still correctly synchronized, so that
// we fall in the "Single-Location Data Structures" case as described in
// http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf
// However, on x86, it doesn't matter if I use acquire or relaxed on the load
// as I get the same assembly in both cases. So I just use the more
// conservative acquire (which will impede compiler optimizations but I don't
// care)
std::atomic<PyInterpreter*> pyobj_interpreter_;
// This field contains a reference to a PyObject representing this Tensor.
// If pyobj is nullptr, when we transfer Tensor to Python, we allocate a new
// PyObject for it and set this field. This field does not have to be

View File

@ -14,7 +14,6 @@ namespace c10::cuda::CUDACachingAllocator::CudaMallocAsync {
using namespace c10::CachingAllocator;
using namespace c10::CachingDeviceAllocator;
#if CUDA_VERSION >= 11040 || defined(USE_ROCM)
// CUDA device allocator that uses cudaMallocAsync to implement
// the same interface as CUDACachingAllocator.cpp.
@ -926,13 +925,4 @@ CUDAAllocator* allocator() {
return &device_allocator;
}
#else
// NOLINTNEXTLINE(misc-use-internal-linkage)
CUDAAllocator* allocator() {
TORCH_CHECK(false, "Cannot use CudaMallocAsyncAllocator with cuda < 11.4.");
return nullptr;
}
#endif
} // namespace c10::cuda::CUDACachingAllocator::CudaMallocAsync

View File

@ -35,26 +35,26 @@ struct ExclusivelyOwnedTensorTraits {
// incremented.
const bool isUndefined = toDestroy == UndefinedTensorImpl::singleton();
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
toDestroy->refcount_ == 1 || (toDestroy->refcount_ == 0 && isUndefined),
toDestroy->refcount() == 1 ||
(toDestroy->refcount() == 0 && isUndefined),
"ExclusivelyOwned<Tensor> destroyed with isUndefined ",
isUndefined,
" and refcount ",
toDestroy->refcount_,
toDestroy->refcount(),
", expected 1 or, if isUndefined, 0!");
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
toDestroy->weakcount_ == 1 ||
(toDestroy->weakcount_ == 0 &&
toDestroy->weakcount() == 1 ||
(toDestroy->weakcount() == 0 &&
toDestroy == UndefinedTensorImpl::singleton()),
"ExclusivelyOwned<Tensor> destroyed with isUndefined ",
isUndefined,
" and weakcount ",
toDestroy->weakcount_,
toDestroy->weakcount(),
", expected 1 or, if isUndefined, 0!");
if (!isUndefined) {
#ifndef NDEBUG
// Needed to pass the debug assertions in ~intrusive_ptr_target.
toDestroy->refcount_ = 0;
toDestroy->weakcount_ = 0;
toDestroy->combined_refcount_.store(0, std::memory_order_relaxed);
#endif
delete toDestroy;
}

View File

@ -27,7 +27,78 @@ struct DontIncreaseRefcount {};
} // namespace raw
namespace detail {
constexpr uint32_t kImpracticallyHugeReferenceCount = 0x0FFFFFFF;
constexpr uint64_t kImpracticallyHugeReferenceCount = 0x0FFFFFFF;
constexpr uint64_t kImpracticallyHugeWeakReferenceCount =
(kImpracticallyHugeReferenceCount << 32);
constexpr uint64_t kReferenceCountOne = 1;
constexpr uint64_t kWeakReferenceCountOne = (kReferenceCountOne << 32);
constexpr uint64_t kUniqueRef = (kReferenceCountOne | kWeakReferenceCountOne);
template <class TTarget>
struct intrusive_target_default_null_type final {
static constexpr TTarget* singleton() noexcept {
return nullptr;
}
};
template <class TTarget, class ToNullType, class FromNullType>
TTarget* assign_ptr_(TTarget* rhs) {
if (FromNullType::singleton() == rhs) {
return ToNullType::singleton();
} else {
return rhs;
}
}
inline uint32_t refcount(uint64_t combined_refcount) {
return static_cast<uint32_t>(combined_refcount);
}
inline uint32_t weakcount(uint64_t combined_refcount) {
return static_cast<uint32_t>(combined_refcount >> 32);
}
// The only requirement for refcount increment is that it happens-before
// decrement, so no additional memory ordering is needed.
inline uint64_t atomic_combined_refcount_increment(
std::atomic<uint64_t>& combined_refcount,
uint64_t inc) {
return combined_refcount.fetch_add(inc, std::memory_order_relaxed) + inc;
}
inline uint32_t atomic_refcount_increment(
std::atomic<uint64_t>& combined_refcount) {
return detail::refcount(atomic_combined_refcount_increment(
combined_refcount, kReferenceCountOne));
}
inline uint32_t atomic_weakcount_increment(
std::atomic<uint64_t>& combined_refcount) {
return detail::weakcount(atomic_combined_refcount_increment(
combined_refcount, kWeakReferenceCountOne));
}
// The requirement is that all modifications to the managed object happen-before
// invocation of the managed object destructor, and that allocation of the
// managed object storage happens-before deallocation of the storage.
//
// To get this ordering, all non-final decrements must synchronize-with the
// final decrement. So all non-final decrements have to store-release while the
// final decrement has to load-acquire, either directly or with the help of
// fences. But it's easiest just to have all decrements be acq-rel. And it turns
// out, on modern architectures and chips, it's also fastest.
inline uint64_t atomic_combined_refcount_decrement(
std::atomic<uint64_t>& combined_refcount,
uint64_t dec) {
return combined_refcount.fetch_sub(dec, std::memory_order_acq_rel) - dec;
}
inline uint32_t atomic_weakcount_decrement(
std::atomic<uint64_t>& combined_refcount) {
return detail::weakcount(atomic_combined_refcount_decrement(
combined_refcount, kWeakReferenceCountOne));
}
} // namespace detail
/**
@ -80,8 +151,14 @@ class C10_API intrusive_ptr_target {
// atomically increment the use count, if it is greater than 0.
// If it is not, you must report that the storage is dead.
//
mutable std::atomic<uint32_t> refcount_;
mutable std::atomic<uint32_t> weakcount_;
//.We use a single combined count for refcount and weakcount so that
// we can atomically operate on both at the same time for performance
// and defined behaviors.
//
mutable std::atomic<uint64_t> combined_refcount_;
static_assert(sizeof(std::atomic<uint64_t>) == 8);
static_assert(alignof(std::atomic<uint64_t>) == 8);
static_assert(std::atomic<uint64_t>::is_always_lock_free);
template <typename T, typename NullType>
friend class intrusive_ptr;
@ -126,16 +203,16 @@ class C10_API intrusive_ptr_target {
// caller of unsafe_adapt_non_heap_allocated wanted to
// use). We choose our reference count such that the count
// will not dip below kImpracticallyHugeReferenceCount regardless.
refcount_.load() == 0 ||
refcount_.load() >= detail::kImpracticallyHugeReferenceCount,
refcount() == 0 ||
refcount() >= detail::kImpracticallyHugeReferenceCount,
"Tried to destruct an intrusive_ptr_target that still has intrusive_ptr to it; refcount was ",
refcount_.load());
refcount());
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
// See ~intrusive_ptr for optimization that will frequently result in 1
// at destruction time.
weakcount_.load() == 1 || weakcount_.load() == 0 ||
weakcount_.load() == detail::kImpracticallyHugeReferenceCount - 1 ||
weakcount_.load() == detail::kImpracticallyHugeReferenceCount,
weakcount() == 1 || weakcount() == 0 ||
weakcount() == detail::kImpracticallyHugeReferenceCount - 1 ||
weakcount() == detail::kImpracticallyHugeReferenceCount,
"Tried to destruct an intrusive_ptr_target that still has weak_intrusive_ptr to it");
#if defined(_MSC_VER) && !defined(__clang__)
#pragma warning(pop)
@ -144,7 +221,7 @@ class C10_API intrusive_ptr_target {
#endif
}
constexpr intrusive_ptr_target() noexcept : refcount_(0), weakcount_(0) {}
constexpr intrusive_ptr_target() noexcept : combined_refcount_(0) {}
// intrusive_ptr_target supports copy and move: but refcount and weakcount
// don't participate (since they are intrinsic properties of the memory
@ -177,54 +254,17 @@ class C10_API intrusive_ptr_target {
* destructed), this function WILL NOT be called.
*/
virtual void release_resources() {}
};
namespace detail {
template <class TTarget>
struct intrusive_target_default_null_type final {
static constexpr TTarget* singleton() noexcept {
return nullptr;
uint32_t refcount(std::memory_order order = std::memory_order_relaxed) const {
return detail::refcount(combined_refcount_.load(order));
}
uint32_t weakcount(
std::memory_order order = std::memory_order_relaxed) const {
return detail::weakcount(combined_refcount_.load(order));
}
};
template <class TTarget, class ToNullType, class FromNullType>
TTarget* assign_ptr_(TTarget* rhs) {
if (FromNullType::singleton() == rhs) {
return ToNullType::singleton();
} else {
return rhs;
}
}
// The only requirement for refcount increment is that it happens-before
// decrement, so no additional memory ordering is needed.
inline uint32_t atomic_refcount_increment(std::atomic<uint32_t>& refcount) {
return refcount.fetch_add(1, std::memory_order_relaxed) + 1;
}
inline uint32_t atomic_weakcount_increment(std::atomic<uint32_t>& weakcount) {
return weakcount.fetch_add(1, std::memory_order_relaxed) + 1;
}
// The requirement is that all modifications to the managed object happen-before
// invocation of the managed object destructor, and that allocation of the
// managed object storage happens-before deallocation of the storage.
//
// To get this ordering, all non-final decrements must synchronize-with the
// final decrement. So all non-final decrements have to store-release while the
// final decrement has to load-acquire, either directly or with the help of
// fences. But it's easiest just to have all decrements be acq-rel. And it turns
// out, on modern architectures and chips, it's also fastest.
inline uint32_t atomic_refcount_decrement(std::atomic<uint32_t>& refcount) {
return refcount.fetch_sub(1, std::memory_order_acq_rel) - 1;
}
inline uint32_t atomic_weakcount_decrement(std::atomic<uint32_t>& weakcount) {
return weakcount.fetch_sub(1, std::memory_order_acq_rel) - 1;
}
} // namespace detail
template <class TTarget, class NullType>
class weak_intrusive_ptr;
@ -275,7 +315,7 @@ class intrusive_ptr final {
void retain_() {
if (target_ != NullType::singleton()) {
uint32_t new_refcount =
detail::atomic_refcount_increment(target_->refcount_);
detail::atomic_refcount_increment(target_->combined_refcount_);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
new_refcount != 1,
"intrusive_ptr: Cannot increase refcount after it reached zero.");
@ -284,41 +324,25 @@ class intrusive_ptr final {
void reset_() noexcept {
if (target_ != NullType::singleton()) {
#if defined(__linux__) && (defined(__aarch64__) || defined(__x86_64__))
if constexpr (
std::atomic<uint64_t>::is_always_lock_free &&
std::atomic<uint32_t>::is_always_lock_free &&
sizeof(std::atomic<uint64_t>) == 8 &&
sizeof(std::atomic<uint32_t>) == 4) {
auto both_counts_ =
reinterpret_cast<std::atomic<uint64_t>*>(&target_->refcount_);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
(reinterpret_cast<std::uintptr_t>(both_counts_) %
sizeof(std::atomic<uint64_t>)) == 0 &&
(reinterpret_cast<std::uintptr_t>(&target_->weakcount_) -
reinterpret_cast<std::uintptr_t>(both_counts_)) ==
sizeof(std::atomic<uint32_t>));
// 0x100000001ULL is a 64-bit number combination of both the refcount_
// and weakcount_ being 1.
constexpr uint64_t unique_ref_ = 0x100000001ULL;
if (both_counts_->load(std::memory_order_acquire) == unique_ref_) {
// Both counts are 1, so there are no weak references and
// we are releasing the last strong reference. No other
// threads can observe the effects of this target_ deletion
// call (e.g. calling use_count()) without a data race.
target_->refcount_.store(0, std::memory_order_relaxed);
delete target_;
return;
}
if (target_->combined_refcount_.load(std::memory_order_acquire) ==
detail::kUniqueRef) {
// Both counts are 1, so there are no weak references and
// we are releasing the last strong reference. No other
// threads can observe the effects of this target_ deletion
// call (e.g. calling use_count()) without a data race.
target_->combined_refcount_.store(0, std::memory_order_relaxed);
delete target_;
return;
}
#endif
if (detail::atomic_refcount_decrement(target_->refcount_) == 0) {
auto combined_refcount = detail::atomic_combined_refcount_decrement(
target_->combined_refcount_, detail::kReferenceCountOne);
if (detail::refcount(combined_refcount) == 0) {
bool should_delete =
(combined_refcount == detail::kWeakReferenceCountOne);
// See comment above about weakcount. As long as refcount>0,
// weakcount is one larger than the actual number of weak references.
// So we need to decrement it here.
bool should_delete =
target_->weakcount_.load(std::memory_order_acquire) == 1;
if (!should_delete) {
// justification for const_cast: release_resources is basically a
// destructor and a destructor always mutates the object, even for
@ -326,8 +350,8 @@ class intrusive_ptr final {
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
const_cast<std::remove_const_t<TTarget>*>(target_)
->release_resources();
should_delete =
detail::atomic_weakcount_decrement(target_->weakcount_) == 0;
should_delete = detail::atomic_weakcount_decrement(
target_->combined_refcount_) == 0;
}
if (should_delete) {
delete target_;
@ -354,12 +378,12 @@ class intrusive_ptr final {
// `mov`, whereas an atomic increment does a lock-prefixed `add`, which is
// much more expensive: https://godbolt.org/z/eKPzj8.)
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
target_->refcount_ == 0 && target_->weakcount_ == 0,
target_->combined_refcount_.load(std::memory_order_relaxed) == 0,
"intrusive_ptr: Newly-created target had non-zero refcounts. Does its "
"constructor do something strange like incref or create an "
"intrusive_ptr from `this`?");
target_->refcount_.store(1, std::memory_order_relaxed);
target_->weakcount_.store(1, std::memory_order_relaxed);
target_->combined_refcount_.store(
detail::kUniqueRef, std::memory_order_relaxed);
}
}
@ -482,14 +506,14 @@ class intrusive_ptr final {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->refcount_.load(std::memory_order_relaxed);
return target_->refcount(std::memory_order_relaxed);
}
uint32_t weak_use_count() const noexcept {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->weakcount_.load(std::memory_order_relaxed);
return target_->weakcount(std::memory_order_relaxed);
}
bool unique() const noexcept {
@ -518,8 +542,8 @@ class intrusive_ptr final {
*/
static intrusive_ptr reclaim(TTarget* owning_ptr) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
owning_ptr == NullType::singleton() ||
owning_ptr->refcount_.load() == 0 || owning_ptr->weakcount_.load(),
owning_ptr == NullType::singleton() || owning_ptr->refcount() == 0 ||
owning_ptr->weakcount(),
"TTarget violates the invariant that refcount > 0 => weakcount > 0");
return intrusive_ptr(owning_ptr, raw::DontIncreaseRefcount{});
}
@ -590,11 +614,11 @@ class intrusive_ptr final {
#ifdef NDEBUG
expected_decrefs = 0;
#endif
result.target_->refcount_.store(
detail::kImpracticallyHugeReferenceCount + expected_decrefs,
result.target_->combined_refcount_.store(
detail::refcount(
detail::kImpracticallyHugeReferenceCount + expected_decrefs) |
detail::kImpracticallyHugeWeakReferenceCount,
std::memory_order_relaxed);
result.target_->weakcount_.store(
detail::kImpracticallyHugeReferenceCount, std::memory_order_relaxed);
return result;
}
@ -611,7 +635,7 @@ class intrusive_ptr final {
static intrusive_ptr unsafe_reclaim_from_nonowning(TTarget* raw_ptr) {
// See Note [Stack allocated intrusive_ptr_target safety]
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
raw_ptr == NullType::singleton() || raw_ptr->refcount_.load() > 0,
raw_ptr == NullType::singleton() || raw_ptr->refcount() > 0,
"intrusive_ptr: Can only reclaim pointers that are owned by someone");
auto ptr = reclaim(raw_ptr); // doesn't increase refcount
ptr.retain_();
@ -745,7 +769,7 @@ class weak_intrusive_ptr final {
void retain_() {
if (target_ != NullType::singleton()) {
uint32_t new_weakcount =
detail::atomic_weakcount_increment(target_->weakcount_);
detail::atomic_weakcount_increment(target_->combined_refcount_);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
new_weakcount != 1,
"weak_intrusive_ptr: Cannot increase weakcount after it reached zero.");
@ -754,7 +778,7 @@ class weak_intrusive_ptr final {
void reset_() noexcept {
if (target_ != NullType::singleton() &&
detail::atomic_weakcount_decrement(target_->weakcount_) == 0) {
detail::atomic_weakcount_decrement(target_->combined_refcount_) == 0) {
// NOLINTNEXTLINE(clang-analyzer-cplusplus.NewDelete)
delete target_;
}
@ -887,7 +911,7 @@ class weak_intrusive_ptr final {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->refcount_.load(
return target_->refcount(
std::memory_order_relaxed); // refcount, not weakcount!
}
@ -895,7 +919,7 @@ class weak_intrusive_ptr final {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->weakcount_.load(std::memory_order_relaxed);
return target_->weakcount(std::memory_order_relaxed);
}
bool expired() const noexcept {
@ -906,16 +930,17 @@ class weak_intrusive_ptr final {
if (target_ == NullType::singleton()) {
return intrusive_ptr<TTarget, NullType>();
} else {
auto refcount = target_->refcount_.load(std::memory_order_relaxed);
auto combined_refcount =
target_->combined_refcount_.load(std::memory_order_relaxed);
do {
if (refcount == 0) {
if (detail::refcount(combined_refcount) == 0) {
// Object already destructed, no strong references left anymore.
// Return nullptr.
return intrusive_ptr<TTarget, NullType>();
}
} while (!target_->refcount_.compare_exchange_weak(
refcount,
refcount + 1,
} while (!target_->combined_refcount_.compare_exchange_weak(
combined_refcount,
combined_refcount + detail::kReferenceCountOne,
std::memory_order_acquire,
std::memory_order_relaxed));
@ -952,9 +977,9 @@ class weak_intrusive_ptr final {
// if refcount == 0, weakcount only must be >0.
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
owning_weak_ptr == NullType::singleton() ||
owning_weak_ptr->weakcount_.load() > 1 ||
(owning_weak_ptr->refcount_.load() == 0 &&
owning_weak_ptr->weakcount_.load() > 0),
owning_weak_ptr->weakcount() > 1 ||
(owning_weak_ptr->refcount() == 0 &&
owning_weak_ptr->weakcount() > 0),
"weak_intrusive_ptr: Can only weak_intrusive_ptr::reclaim() owning pointers that were created using weak_intrusive_ptr::release().");
return weak_intrusive_ptr(owning_weak_ptr);
}
@ -1033,7 +1058,7 @@ namespace intrusive_ptr {
// NullType::singleton to this function
inline void incref(intrusive_ptr_target* self) {
if (self) {
detail::atomic_refcount_increment(self->refcount_);
detail::atomic_refcount_increment(self->combined_refcount_);
}
}
@ -1067,7 +1092,7 @@ inline uint32_t use_count(intrusive_ptr_target* self) {
namespace weak_intrusive_ptr {
inline void incref(weak_intrusive_ptr_target* self) {
detail::atomic_weakcount_increment(self->weakcount_);
detail::atomic_weakcount_increment(self->combined_refcount_);
}
inline void decref(weak_intrusive_ptr_target* self) {

View File

@ -396,8 +396,7 @@ size_t PyTorchStreamReader::getRecordMultiReaders(
size_t perThreadSize = (n + nthread - 1) / nthread;
std::vector<size_t> readSizes(nthread, 0);
std::lock_guard<std::mutex> guard(reader_lock_);
loaderThreads.reserve(nthread);
for (size_t i = 0; i < nthread; i++) {
for (size_t i = 0; i < nthread; i++) {
loaderThreads.emplace_back([this,
name,
i,
@ -416,7 +415,7 @@ for (size_t i = 0; i < nthread; i++) {
size =
read(recordOff + startPos, (char*)dst + startPos, threadReadSize);
} else {
const auto& reader = additionalReaders[i - 1];
auto reader = additionalReaders[i - 1];
size = reader->read(
recordOff + startPos, (char*)dst + startPos, threadReadSize);
}
@ -642,7 +641,7 @@ size_t PyTorchStreamReader::getRecordSize(const std::string& name) {
size_t PyTorchStreamReader::getRecordOffsetNoRead(
size_t cursor,
const std::string& filename,
std::string filename,
size_t size,
uint64_t alignment) {
std::string full_name = archive_name_plus_slash_ + filename;
@ -698,7 +697,7 @@ PyTorchStreamWriter::PyTorchStreamWriter(
}
PyTorchStreamWriter::PyTorchStreamWriter(
const std::function<size_t(const void*, size_t)>& writer_func,
const std::function<size_t(const void*, size_t)> writer_func,
bool compute_crc32,
uint64_t alignment)
: archive_name_("archive"),
@ -713,7 +712,7 @@ void PyTorchStreamWriter::setup(const string& file_name) {
memset(ar_.get(), 0, sizeof(mz_zip_archive));
archive_name_plus_slash_ = archive_name_ + "/"; // for writeRecord().
if (archive_name_.empty()) {
if (archive_name_.size() == 0) {
CAFFE_THROW("invalid file name: ", file_name);
}

View File

@ -180,7 +180,7 @@ class TORCH_API PyTorchStreamReader final {
size_t getRecordOffset(const std::string& name);
size_t getRecordOffsetNoRead(
size_t cursor,
const std::string& filename,
std::string filename,
size_t size,
uint64_t alignment);
bool hasRecord(const std::string& name);
@ -232,7 +232,7 @@ class TORCH_API PyTorchStreamWriter final {
bool compute_crc32 = true,
uint64_t alignment = 64);
explicit PyTorchStreamWriter(
const std::function<size_t(const void*, size_t)>& writer_func,
const std::function<size_t(const void*, size_t)> writer_func,
bool compute_crc32 = true,
uint64_t alignment = 64);

View File

@ -46,9 +46,10 @@ if(NOT __AOTRITON_INCLUDED)
set(__AOTRITON_BASE_URL "https://github.com/ROCm/aotriton/releases/download/") # @lint-ignore
set(__AOTRITON_Z "gz")
# Set the default __AOTRITON_LIB path
set(__AOTRITON_LIB "${__AOTRITON_INSTALL_DIR}/lib/libaotriton_v2.so")
if(WIN32)
set(__AOTRITON_LIB "${__AOTRITON_INSTALL_DIR}/lib/aotriton_v2.lib")
if(NOT WIN32)
set(__AOTRITON_LIB "lib/libaotriton_v2.so")
else()
set(__AOTRITON_LIB "lib/aotriton_v2.lib")
endif()
function(aotriton_build_windows_dependencies dlfcn-win32_external xz_external dlfcn-win32_DIR liblzma_DIR)
@ -143,8 +144,7 @@ if(NOT __AOTRITON_INCLUDED)
-DHIP_PLATFORM=amd
$<$<BOOL:${WIN32}>:-Ddlfcn-win32_DIR=${dlfcn-win32_DIR}>
$<$<BOOL:${WIN32}>:-Dliblzma_DIR=${liblzma_DIR}>
BUILD_BYPRODUCTS
"${__AOTRITON_LIB}"
BUILD_BYPRODUCTS "${__AOTRITON_INSTALL_DIR}/${__AOTRITON_LIB}"
USES_TERMINAL_DOWNLOAD TRUE
USES_TERMINAL_CONFIGURE TRUE
USES_TERMINAL_BUILD TRUE
@ -177,7 +177,7 @@ if(NOT __AOTRITON_INCLUDED)
INSTALL_COMMAND ${CMAKE_COMMAND} -E copy_directory
"${CMAKE_CURRENT_BINARY_DIR}/aotriton_runtime"
"${__AOTRITON_INSTALL_DIR}"
BUILD_BYPRODUCTS "${__AOTRITON_LIB}"
BUILD_BYPRODUCTS "${__AOTRITON_INSTALL_DIR}/${__AOTRITON_LIB}"
)
message(STATUS "Using AOTriton Runtime from pre-compiled binary ${__AOTRITON_URL}.\
Set env variables AOTRITON_INSTALL_FROM_SOURCE=1 to build from source.")
@ -267,7 +267,7 @@ if(NOT __AOTRITON_INCLUDED)
endforeach()
endforeach()
endif()
target_link_libraries(__caffe2_aotriton INTERFACE ${__AOTRITON_LIB})
target_link_libraries(__caffe2_aotriton INTERFACE "${__AOTRITON_INSTALL_DIR}/${__AOTRITON_LIB}")
target_include_directories(__caffe2_aotriton INTERFACE ${__AOTRITON_INSTALL_DIR}/include)
set(AOTRITON_FOUND TRUE)
endif() # __AOTRITON_INCLUDED

11 binary image files added (not shown).

View File

@ -0,0 +1,239 @@
(dynamic_shapes_advanced_control_options)=
# Advanced Options to Control Dynamic Behavior
PyTorch provides several advanced options to control dynamic behavior.
These options require a deep understanding of PyTorch internals and
may involve setting up additional tooling. These options include:
* Profile-Guided Optimization (PGO) is a technique that allows the compiler
to save automatic dynamic decisions and reuse them across jobs.
* Compiler Collective is a feature that is used to modify automatic dynamic
shapes behavior by inferring if an input is dynamic based on whether
its size varies across ranks.
## Profile-Guided Optimization (PGO)
Profile-Guided Optimization (PGO) enhances automatic dynamic by sharing profiling decisions across runs of your model. Specifically, it serializes all the choices made by automatic dynamic into a file on disk. You can then copy this file—or store it in a centralized metadata service like S3—and reuse it on other machines to ensure consistent behavior across environments.
For the purposes of the rest of this tutorial, you can turn on PGO locally with the following environment variables: `TORCH_COMPILE_JOB_ID=1 TORCH_DYNAMO_AUTOMATIC_DYNAMIC_LOCAL_PGO=1`.
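As a minimal local sketch (the function and shapes are hypothetical; the environment variables are set before importing torch so dynamo picks them up):
```python
# A minimal sketch: enable local PGO via the environment variables above, then
# run this script twice. The cold run records automatic dynamic decisions and
# the warm run reuses the persisted profile.
import os
os.environ["TORCH_COMPILE_JOB_ID"] = "1"
os.environ["TORCH_DYNAMO_AUTOMATIC_DYNAMIC_LOCAL_PGO"] = "1"

import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))
f(torch.randn(8))
```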
(identifying-dynamic-elements-marked-by-pgo)=
### Identifying Dynamic Elements Marked by PGO
Use `tlparse` to find line numbers of interest and check for multiple values
seen for inputs.
To determine which elements are marked as dynamic by Profile-Guided Optimization (PGO),
follow these steps using `tlparse`:
1. In the `tlparse` output, identify the line number of the frame of interest. Example:
```{image} ../_static/img/dynamic_shapes/tlparse4_pgo.png
```
2. Open `local_code` using `put_local_code_state_` or `put_remote_code_state_` for the
latest frame (for example, 6/1).
Each `?` indicates that multiple values have been observed for this input.
For instance, the following output shows that the input `L['m']` has been seen with
multiple sizes at `size[0]`, but the stride has consistently been 1:
```
/data/users/bobren/a/pytorch/r2.py:2:func:
L['m']: fully dynamic scalar or tensor
L['x']: tensor size=[?] stride=[1]
L['y']: tensor size=[?] stride=[1]
L['z']: tensor size=[?] stride=[1]
```
```{note}
If an element is marked as dynamic by PGO, it does not guarantee that it will remain dynamic in the graph. Specialization can revert it to a static state.
```
## Compiler Collective
Different ranks can communicate with each other to share observed sizes. In the second
iteration, automatic dynamic uses this information to determine which elements to mark
as dynamic based on inputs seen across all ranks. Check this [PR](https://github.com/pytorch/pytorch/pull/130935) for more details.
To enable this feature, use `enable_compiler_collectives=True` with the `@config.patch`
decorator.
```python
@config.patch(enable_compiler_collectives=True)
```
```{note}
This feature enables the use of collectives during compilation to
synchronize behavior across ranks. Currently, it is used to modify
automatic dynamic shapes behavior by inferring if an input is dynamic
based on whether its size varies across ranks. Since this synchronization
uses collectives, all ranks must run compilation simultaneously; ranks must
not diverge with graph breaks. This is most reliably achieved by ensuring
torch is only run on SPMD programs. Violating this invariant may result in
deadlocking NCCL and encountering a NCCL timeout.
```
## Reducing Compilations: Step by Step
If you have a model that you can run on your master job and have a `tlparse`,
here's what you should do next:
### Step 1: Mark Dynamic Elements
The first step is to reduce initial compilations that are eventually optimized away
by automatic dynamic or PGO. This is straightforward because we know it will work
upfront. If, in one run, a frame starts with static graphs and converges to
dynamic graphs, and if you notice a reduction in the number of compiled
frames in a second (warm) PGO-enabled run, it's likely due to this optimization.
This is a two-step process:
1. Find elements marked as dynamic by PGO or automatic dynamic.
2. Mark them as dynamic using one of the {ref}`user_annotations`, as in the sketch at the end of this section.
#### How to Identify Elements to Mark as Dynamic
Follow these guidelines:
1. **PGO artifact:** Follow the steps in {ref}`identifying-dynamic-elements-marked-by-pgo`.
2. **Dynamic Logs:** If you have a run with `TORCH_LOGS="+dynamic"`, each
time a new dynamic dimension is allocated, a debug line will specify it
along with the input name.
3. **Compare Graphs:** For frames with reduced compilations across runs,
inspect the Dynamo graphs in the second run or the latest runs in the
cold run. Look for elements marked as dynamic in those graphs. Specifically,
find graphs that are similar (once specialized and once dynamic).
Even without a warm run, you can inspect all graphs for a specific frame
to see if some are similar and converge to a dynamic version.
For example, in the following `tlparse` snapshot, Dynamo graphs 20/0,
20/1, and 20/2 are similar except for different sizes (for example,
graph 20/0 vs. graph 20/2). In the Dynamo graph of 20/2, sizes `s0`,
`s1`, and `s5` are used for `rotary_pos_emb_` and `x`.
```{image} ../_static/img/dynamic_shapes/tlparse5_dynamic_shapes.png
```
```{tip}
Two graphs are considered similar if they have the same sequence of calls for
torch operations and the same tensor inputs. Variations may exist in integer
inputs that could be inlined in the specialized version or arithmetic
computations that only exist in the dynamic version due to inlining in the
static version.
```
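For example, once PGO or the graph comparison above has pointed at specific inputs, a minimal sketch of marking them dynamic up front looks like this (the input names and dimensions are hypothetical):
```python
# A minimal sketch: mark the dimensions that PGO/automatic dynamic reported as
# varying, so the very first compile already uses a symbolic size for them.
import torch
import torch._dynamo

@torch.compile
def f(x, y):
    return x + y

x = torch.randn(8, 16)
y = torch.randn(8, 16)
torch._dynamo.mark_dynamic(x, 0)  # dim 0 of x was seen with multiple sizes
torch._dynamo.mark_dynamic(y, 0)
f(x, y)
```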
### Step 2: Debugging: Identifying Missed Opportunities
The complexity of debugging can vary greatly depending on the issues you
encounter. The end result is often to find a bug, enable a flag, or modify
user/framework code.
#### Finding Similar Graphs
Start by identifying a group of similar graphs that you might want to combine
into one dynamic graph, as discussed in the previous section on comparing
graphs. If you can't find any similar graphs, there's nothing further to do
in this step.
#### Quick Checks: Fail Fast
After finding similar graphs, you want to understand why they recompile.
Check the following:
1. **Check Recompile Reasons:** For graphs you believe are similar, click on
`recompile_reason` in the `tlparse` output for the later graph. Ensure the
reason is size-related and not due to other factors. For example, while
in this screenshot the recompile reason is size-related:
```{image} ../_static/img/dynamic_shapes/tlparse6_size_related_recompilations.png
```
In the one below it is not, which indicates that dynamic shapes won't resolve it:
```{image} ../_static/img/dynamic_shapes/tlparse7_not_size_related_recompilations.png
:width: 500px
:align: center
```
2. **Compare Guards Files:** Ensure there are no guards on non-size-related
elements that exist in one graph but not the others.
3. **Early Check for Custom Triton Kernels:** Check if your model calls custom
Triton kernels with `tl.constexpr` arguments, as these are always
specialized. If your model receives different values for these arguments,
it could be a source of recompilation.
## **Identifying and Fixing Recompilation Causes**
1. **Is Something Not Marked Dynamic but Should Be?** Determine if an input was
marked dynamic and got specialized or was not marked dynamic at all. You can
identify this by:
* Checking the Dynamo graph - look for `Sym(number)`. For example:
```
Sym(256) vs Sym(s0)
```
* Using dynamic logs:
```
["TORCH_LOGS=+dynamic"]
create_symbol s2 = 2 for L['self']._modules['cle ...
```
* Reviewing guards files. If a tensor size is dynamic, it will be indicated as `None`:
```
TENSOR_MATCH:check_tensor(L['self'].x._parameters['weight']], Parameter, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=True, size=[None, None], stride=[None, 1])
```
2. **Why Is It Not Marked Dynamic?** If you determine an element is not marked dynamic, consider:
* Checking if it's an `nn` module property, parameter, or field. Verify the settings of these flags:
* `force_parameter_static_shapes = True`
* `force_nn_module_property_static_shapes = True`
* `allow_unspec_int_on_nn_module = False`
* Or using the dynamic allow list to mark it dynamic, which should have the highest priority.
```{tip}
Marking elements one by one can be time-consuming. Initially, flip the flags to
identify any blocking specializations, then decide how to mark them
dynamic at the end of the process.
```
* If you feel like it could be a bug, please file a bug report and mark it
with the `module: dynamic shapes` label. Check the list of known issues in
[this list](https://github.com/pytorch/pytorch/issues?q=sort%3Aupdated-desc+state%3Aopen+label%3A%22module%3A+dynamic+shapes%22).
3. **Is a Dynamic Element Getting Specialized?** Determine why it is specialized.
It could be due to user code (such as an `if` condition), framework code, or a
call to a Triton kernel. To identify the reason for specialization:
* **Using tlparse:** Check the `compilation_metrics` for a specialization section, which will indicate what got specialized and the user and framework stack when it happened. Example:
```{image} ../_static/img/dynamic_shapes/tlparse8_compilation_metrics.png
```
The log above indicates that `s0` is specialized to `33` due to the following code:
```
`if self.x ==33` at example4.py line 16.
```
* **+Dynamic Logs:** pass `["TORCH_LOGS=+dynamic"]`. Look for the first specialization, as once a variable is specialized, all dependent variables get specialized too.
Example log:
```
torch/fx/experimental/symbolic_shapes.py:6557] [0/2] eval Eq(s0, 33) [guard added] if self.x ==33: # example4.py:16 in forward (_dynamo/variables/tensor.py:1242 in evaluate_expr), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(s0, 33)"
V0228 12:04:24.190000 2990033 torch/fx/experimental/symbolic_shapes.py:6000] [0/2] _update_var_to_range s0 = VR[33, 33] (update)
```
The log above indicates that `s0` is specialized to `33` due to the following code:
```
if self.x == 33 at example4.py line 16.
```

View File

@ -0,0 +1,45 @@
(backed-vs-unbacked-symints)=
# Backed vs Unbacked Symints
Backed `SymInts` are symbolic integers that have a concrete value or "hint"
associated with them. This means that torch can use these values to make
decisions about control flow, such as determining which branch of code
to execute. They are typically derived from operations where the size or
value is known or can be inferred.
Unbacked `SymInts` are symbolic integers that do not have a concrete value or
hint. They often arise from data-dependent operations, such as `.nonzero()`
or `.item()`, where the size or value cannot be determined at compile time.
Since they lack a concrete value, they cannot be used for control flow
decisions, and attempting to do so requires a graph break.
Unbacked `SymInts` use *size-oblivious reasoning*, which is particularly
useful when you are dealing with the
{ref}`0/1 specialization recompilation problem <zero-one-specialization>`.
In summary, backed `SymInts` have known values that can be used for
decision-making, while unbacked `SymInts` do not, requiring special handling
to avoid graph breaks.
Unbacked symbolic integers can be too restrictive, causing most PyTorch programs
to fail. To address this, you can use the following methods and APIs as
workarounds:
* Use higher-level APIs like `empty` instead of `empty_strided` to create tensors.
This ensures the tensor is non-overlapping and dense, avoiding unnecessary stride
sorting, guard creation, and recomputation of these properties.
* Modify your code to make precomputed properties *lazy*. This ensures that
guards on unbacked symbolic integers are only applied when necessary,
reducing computational overhead.
## How to use unbacked
To use unbacked APIs, replace `mark_dynamic` with `mark_unbacked` and
`TORCH_COMPILE_DYNAMIC_SOURCES` with `TORCH_COMPILE_UNBACKED_SOURCES`.
This tells the compiler to treat an input as unbacked.
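A minimal sketch of the swap, assuming `torch._dynamo.decorators.mark_unbacked` as the unbacked counterpart of the `mark_dynamic` call shown elsewhere in these docs:
```python
# A minimal sketch: treat dim 0 of x as unbacked rather than merely dynamic,
# so the compiler never relies on a concrete hint (including 0 or 1) for it.
import torch
import torch._dynamo

x = torch.randn(7, 3)
torch._dynamo.decorators.mark_unbacked(x, 0)

@torch.compile(fullgraph=True)
def f(x):
    return x.sum(dim=0)

f(x)
```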
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
```

View File

@ -0,0 +1,10 @@
(dynamic_shapes_beyond_the_basics)=
# Beyond the Basics
This section covers some advanced topics related to dynamic shapes. This includes more complex explanations of how dynamic shapes work, 0/1 specialization problems, and so on.
```{toctree}
:maxdepth: 1
dynamic_shapes_zero_one_specialization
dynamic_shapes_backed_unbacked
```

View File

@ -0,0 +1,134 @@
(dynamic_shapes_core_concepts)=
# Dynamic Shapes Core Concepts
This section describes the core concepts of dynamic shapes in PyTorch. It is intended to be a
reference for engineers working on the PyTorch compiler stack and anyone who wants to understand
the inner workings of dynamic shapes.
## Symbolic integers
Symbolic integers (`SymInts`) are used to represent variables that can span a range. For example:
```python
x = torch.randn(5, 5) # this tensor has a shape [5, 5]
torch._dynamo.decorators.mark_dynamic(x, 0)
x = torch.randn(5, 5) # this tensor has a shape [s0, 5]
y = torch.cat([x, x], dim=0) # this tensor has a shape [2*s0, 5]
```
However, `z = x * y` would throw an error, since pointwise operations like multiply must
operate on same-sized tensors but we know statically that `s0 != 2 * s0`. Astute readers may point out
that this is not true when `s0 == 0`; the reason why that doesn't matter here is described in
{ref}`zero-one-specialization`.
## Guards
In `torch.compile`, a guard is a mechanism that is used to ensure the validity of a compiled code graph.
By default, when you make a variable dynamic, it can range from `[-inf, inf]`. For example:
```python
def foo(x):
    return x / 2
```
This works for any dynamic `x`. But if your code is:
```python
def foo(x):
    if x > 5:
        return x / 2
    return x / 3
```
If you call `foo(6)`, it returns `x / 2` and adds a guard `x > 5`. Calling `foo(4)` later will
require recompilation because the guard is broken.
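An illustrative sketch of this behavior (the exact guards and recompile points depend on dynamo's specialization heuristics for integer inputs):
```python
# Illustrative only: a data-dependent branch on a dynamic input installs a
# guard, and a later call that violates it triggers recompilation.
# Run with TORCH_LOGS=recompiles to see the recompile reasons.
import torch

@torch.compile(dynamic=True)
def foo(x, t):
    if x > 5:
        return t / 2
    return t / 3

t = torch.randn(4)
foo(6, t)  # compiles; the branch adds a guard along the lines of x > 5
foo(4, t)  # the guard no longer holds, so this call recompiles
```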
## Runtime Asserts
You can use runtime asserts to provide hints when you know certain facts, like batch size being less than 100:
```python
def foo(batch_size):
    torch._check(batch_size < 100)
    if batch_size < 100:
        return do_something()
    return do_something_else()
```
## "Hint" Value
A "hint value" in the context of `torch.compile` refers to the actual values known during the compilation process that help the JIT compiler make decisions about expressions. Hint values are particularly useful for handling dynamic shapes, as they provide concrete information that guides the compilation without requiring recompilation for varying dimensions.
## Dynamic Behavior Overview
PyTorch assumes static shapes by default. When a size change is detected, it attempts to
recompile with dynamic input, although this may fail if there are conditional branches
or missing support for dynamic shapes. To diagnose overspecialization, you can set
`TORCH_LOGS=dynamic` to view "eval" entries that indicate when and why guards are added.
If you anticipate a dimension will be dynamic, you can use `torch._dynamo.mark_dynamic(tensor, dim)`
to mark it in advance, specifying `min` and `max` values if known. Using `torch.compile(dynamic=False)`
disables automatic dynamic shapes, leading to recompilation for each unique size. Conversely,
`torch.compile(dynamic=True)` aims to use dynamic shapes as much as possible, which is most useful
for small models and may not be suitable for large models due to potential crashes or performance issues.
You can whitelist specific sources to be marked as dynamic using the `TORCH_COMPILE_DYNAMIC_SOURCES` environment variable or `torch.compiler.config.dynamic_sources`. This is particularly useful for large
models with graph breaks, as you can maintain dynamism across graph breaks since
source names stay consistent. You can also use this to mark integers as dynamic. The format is a comma-delimited list of source names, for example, `"L['x'], L['y']"`.
You can also use regexes, for example, `"L\['x.*'\], L\['y.*'\]"`.
This whitelist takes precedence over other flags like `dynamic=False`, `force_nn_module_property_static_shapes`, and `force_parameter_static_shapes`.
Sometimes it can be cumbersome to find the right inputs to mark as dynamic. If
you're willing to take a performance hit for the first batch, one other affordable
option we have are the `eager_then_compile` stances which derive dynamism for you.
See {func}`torch.compiler.set_stance` for more details.
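For example, a minimal sketch of marking a dimension dynamic ahead of time with known bounds (the bounds and sizes here are illustrative):
```python
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(16, 3)
# Mark dim 0 as dynamic up front; min/max are optional bounds if known.
torch._dynamo.mark_dynamic(x, 0, min=2, max=1024)
f(x)                   # compiles once with a symbolic batch dimension
f(torch.randn(64, 3))  # reuses the compiled code, no recompilation
```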
## Overall Architecture
Symbolic shapes workflow:
1. When compiling a frame in Dynamo, we allocate a `ShapeEnv` (attached to `FakeTensorMode`) to
track symbolic shapes.
2. We allocate symbolic sizes for tensors on entry, based on policy decisions.
3. We propagate symbolic sizes through operators, maintaining both FX IR for symbolic compute export
and Sympy expressions for reasoning.
4. We add guards based on conditionals during Dynamo tracing or Inductor optimization, induced from both Python and C++.
5. Guards can simplify symbolic variables. For instance, asserting `s0 == 4` allows replacing all occurrences of `s0` with `4`.
6. After tracing and optimizing, we install all guards with the compiled code, ensuring reusability only if all guards evaluate true.
## Internal API Class Hierarchy
### Python Classes
- **`SymInt`/`SymFloat`/`SymBool`**: User-visible classes that simulate their `int`/`float`/`bool` counterparts. Adding two `SymInts` produces a new `SymInt` that symbolically tracks the integer addition.
- **`SymNode`**: Internal structure (accessible via `symint.node`) that holds actual symbolic tracking information. `SymNode` is type-erased, making it convenient to represent mixed-type operations.
- **`ShapeEnv`**: Per-compile context state that tracks all free symbols and guards accumulated so far. Every `SymNode` records its `ShapeEnv` (but not vice versa; `SymNodes` are only used if they participate in a guard).
### C++ Equivalents
- **`c10::SymInt`/`SymFloat`/`SymBool`**: User-visible classes that simulate `int`/`float`/`bool`
- **`c10::SymNode`/`SymNodeImpl`**: Analogous to Python `SymNode`
- **No C++ `ShapeEnv`**: For debugging ease, the entire symbolic reasoning apparatus remains in Python
When writing code traceable with `make_fx`, it must handle `SymInt`/`SymFloat`/`SymBool` flowing through it.
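For example, a minimal sketch of symbolic tracing with `make_fx`, where sizes flow through as `SymInts`:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # Under symbolic tracing, x.shape[0] is a SymInt, so this arithmetic is
    # recorded symbolically (sym_size and mul nodes) rather than as a constant.
    return x.new_zeros(x.shape[0] * 2)

gm = make_fx(f, tracing_mode="symbolic")(torch.randn(4, 3))
print(gm.graph)
```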
## Value Ranges and Constraints
Symbolic variables maintain **value ranges** that specify the set of possible values. By default:
- Size-like unbacked `SymInts` have value range `[0, Inf]`
- Regular unbacked `SymInts` have value range `[-Inf, Inf]`
When assertions are made (e.g., `torch._check(x == y)`), the system:
1. Attempts to replace unbacked symbols with equivalent expressions
2. Refines value ranges based on the assertion
3. Remembers boolean expressions that are always true
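As a minimal sketch of these effects (the sizes are illustrative; `capture_scalar_outputs` is needed so `.item()` does not graph break):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(n, y):
    u0 = n.item()                   # regular unbacked SymInt, range [-Inf, Inf]
    torch._check(u0 >= 0)           # refines the value range to [0, Inf]
    torch._check(u0 <= y.shape[0])  # remembered as an always-true expression
    return y.narrow(0, 0, u0)       # no guard on the data-dependent size needed

f(torch.tensor(3), torch.randn(10))
```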
Important files:
- C++ SymInt API: `c10/core/SymInt.h`, `SymFloat.h`, `SymBool.h`
- Python SymInt API: `torch/__init__.py` (look for `SymInt/SymFloat/SymBool`)
- C++ plumbing: `c10/core/SymNodeImpl.h`, `torch/csrc/utils/python_symnode.h`, `torch/csrc/jit/python/init.cpp`
- Python infrastructure: `torch/fx/experimental/symbolic_shapes.py`
- Other important files: `torch/_subclasses/fake_tensor.py`, `torch/_meta_registrations.py`, decomps, PrimTorch refs
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`dynamic_shapes_troubleshooting`
```

View File

@ -0,0 +1,101 @@
(debugging-tlparse-torch-logs)=
# Debugging with `tlparse` and `TORCH_LOGS=dynamic`
`tlparse` is a tool used for analyzing and understanding the compilation
process in PyTorch, particularly when dealing with dynamic shapes. It helps
identify where guards and specializations occur in your code.
`TORCH_LOGS=dynamic` is an environment variable setting that enables detailed
logging of dynamic shape operations, providing insights into how symbolic
shapes are handled during execution.
This section will guide you through using `tlparse` and `TORCH_LOGS=dynamic` to
troubleshoot dynamic shape issues in your code, including debugging
specialization, guards, and more.
## Debugging Specialization
In the following example, `x.shape[0]` is dynamic but becomes specialized due to multiplication:
```python
import torch
@torch.compile
def fn(x, y):
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_dynamic(x, 0)
fn(x, y)
```
By using `TORCH_LOGS=dynamic`, you can observe this specialization in the logs:
```xml
TORCH_LOGS=dynamic python tl.py
I0721 11:10:00.950000 845259 torch/fx/experimental/symbolic_shapes.py:3776] [0/0] create_env
I0721 11:10:01.030000 845259 torch/fx/experimental/symbolic_shapes.py:5117] [0/0] create_symbol s77 = 5 for L['x'].size()[0] [2, int_oo] return x * y # tl.py:5 in fn (_dynamo/variables/builder.py:3466 in <lambda>), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s77" or to suppress this message run with TORCHDYNAMO_EXTENDED_ADVICE="0"
I0721 11:10:01.038000 845259 torch/fx/experimental/symbolic_shapes.py:7211] [0/0] eval Eq(s77, 5) [guard added] return x * y # tl.py:5 in fn (_subclasses/fake_impls.py:922 in infer_size), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(s77, 5)"
```
The line `eval Eq(s77, 5) [guard added] return x * y # tl.py:5` indicates the specialization.
## Debugging Guards
Consider the following code, which may cause recompilations due to dynamic
shapes:
```python
import torch
@torch.compile
def fn(x, y):
if x.shape[0] < 10:
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_dynamic(x, 0)
torch._dynamo.decorators.mark_dynamic(y, 0)
fn(x, y)
```
To identify where dynamic shape guards originate, use `tlparse`. Here is an example tlparse output:
```{image} ../_static/img/dynamic_shapes/tlparse9_debugging_guards.png
```
By clicking on the `dynamo_cpp_guards` link, you can view all guards from the compilation, including the symbolic shape guard `L['x'].size()[0] <= 9`.
Astute readers will notice the 0/1 specialization where we guard on `L['x'].size()[0] >= 2`. By modifying the code to use unbacked symbols, this guard is removed:
```python
import torch
@torch.compile
def fn(x, y):
# Necessary runtime assert since we can't guard on unbacked
torch._check(x.shape[0] < 10)
if x.shape[0] < 10:
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_unbacked(x, 0)
torch._dynamo.decorators.mark_unbacked(y, 0)
fn(x, y)
```
Now, this compiled region can be used for inputs of size 0 and 1:
```{image} ../_static/img/dynamic_shapes/tlparse10_debugging_guards_unbacked.png
```
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`troubleshooting_guardondatadependentsymnode_errors`
```

View File

@ -0,0 +1,14 @@
(dynamic_shapes_troubleshooting)=
# Troubleshooting Dynamic Shapes
This section contains a list of common issues that you may encounter when using
dynamic shapes. The section describes how to use `TORCH_LOGS` and `tlparse` to
debug the issues, as well as provides some general tips and tricks to help you
resolve the issues.
```{toctree}
:maxdepth: 1
dynamic_shapes_debugging_tlparse_torch_logs
dynamic_shapes_troubleshooting_guardon_errors
```

View File

@ -0,0 +1,411 @@
(troubleshooting_guardondatadependentsymnode_errors)=
# Troubleshooting GuardOnDataDependentSymNode Errors
When working with PyTorch models that have data-dependent control flow (using functions
like `item()`, `tolist()`, or `nonzero()`), you may encounter `GuardOnDataDependentSymNode` errors.
This section explains what these errors are and how to fix them.
## Common Error Pattern
The following output shows the common error pattern for `GuardOnDataDependentSymNode` errors:
```sh
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u2, -1) (unhinted: Eq(u2, -1)). (Size-like symbols: none)
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/ezyang/a/pytorch/torch/_prims_common/__init__.py", line 855, in infer_size
if d == -1:
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u2"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
```
## Root Cause
These errors occur when PyTorch tries to convert a symbolic quantity (for example, `u2 == -1`)
into a concrete value (such as, `False`) to make branching decisions. In a typical scenario,
where data-dependent sizes are not involved, PyTorch can determine the concrete value at
compile time and install a guard to ensure the compilation result remains valid. However,
with data-dependent quantities, the true value is unknown at compile time, resulting in errors.
You can often rewrite your model, by adding `torch._check` or `torch._check_is_size` to
bypass these issues. This document aims to teach you how.
## Debugging Tools
Here is the list of some of the debugging tools available in PyTorch that you can use to troubleshoot these errors:
* `TORCH_LOGS="dynamic"` - Shows detailed logs about symbolic operations
* `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u2"` - Provides extended logs for specific symbols
* `TORCHDYNAMO_EXTENDED_DEBUG_CPP=1` - Helps when guards are triggered from C++
## Error Variations
Here is the list of error variations that you might encounter:
| Error Variations | Description |
|------------------|-------------|
| "Could not guard on data-dependent expression" | Occurs when trying to extract a concrete boolean from expressions like u0 == 0 or u0 > 10 |
| "Could not extract specialized integer from data-dependent expression" | Occurs when trying to extract a concrete integer value. <br/> **Common causes:** <br/> - Control flow that depends on the integer (such as, looping `u0` times) <br/> - Overspecialization in code that could work symbolically |
## How to Diagnose Your Problem
### Step 1: Examine the Potential Framework Culprit (Python Backtrace)
The exception provides a backtrace, which often indicates the problem.
Given that PT2 backtraces can be lengthy, the error message will also
suggest a potential framework culprit. For example:
```sh
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/ezyang/a/pytorch/torch/_prims_common/__init__.py", line 855, in infer_size
if d == -1:
```
**Consider the Following:**
* Does it make sense that this condition is triggering a guard on a
data-dependent symbol?
* Should we know if the quantity in question is size-like?
(The exception lists size-like symbols; if a symbol is not listed,
it might be an arbitrary integer.)
* If the equation involves two distinct symbols, should we know
they are actually equal?
* If all symbols are size-like but the equation involves 0 or 1,
are we missing a `guard_size_oblivious` wrapper? (Remember, for
`guard_size_oblivious` between two size tuples, use `sym_eq` instead
of regular equality.)
In the example above, testing if `d` (a data-dependent value) is `-1` suggests
that `d` should be non-negative if it were a size. This indicates a missing
`torch._check_is_size`. If `d` is already size-like but `numel() == 0` fails,
consider wrapping it in `guard_size_oblivious`.
Using `TORCH_LOGS=dynamic` and examining the user stack trace is crucial for
understanding how to fix the problem, as they guide you on how to modify the
user program.
```sh
[INFO] create_unbacked_symint u0 [-9223372036854775808, 9223372036854775807] (w.py:40 in custom_op_meta)
```
This log message indicates where (`w.py:40`) the unbacked `SymInt` was
allocated. An unbacked `SymInt` may be allocated multiple times, so track
their equalities:
```sh
[INFO] set_replacement u1 = u0 (trivial_lhs) ValueRanges(lower=0, upper=9223372036854775807, is_bool=False)
```
### Step 2: Examine the C++ Backtrace
If the framework code culprit is uninformative, the guard might be in C++. You can
force a C++ backtrace by running with `TORCHDYNAMO_EXTENDED_DEBUG_CPP=1`. This
provides a detailed C++ backtrace with Python, CPython, and C10/ATen/libtorch
frames interspersed. Look for symbols in the `at::` or `c10::` namespace that
resemble kernel-specific code, likely related to the kernel executed per the Python
backtrace. If using a non-debug build of PyTorch, inlining may cause missing
frames, requiring source code investigation to locate the issue. For example, see https://github.com/pytorch/pytorch/pull/118579.
Here is an example C++ backtrace from a debugging session:
```
[2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes: [INFO] File "../__gen_aten__/out/RegisterCompositeImplicitAutograd.cpp", line 2025, in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_Tensor_narrow(at::Tensor const&, long, at::Tensor const&, c10::SymInt)
[2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes: [INFO] File "../aten/src/ATen/native/TensorShape.cpp", line 1410, in at::native::narrow_tensor_symint(at::Tensor const&, long, at::Tensor const&, c10::SymInt)
[2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes: [INFO] File "../__gen_aten__/out/core/TensorMethods.cpp", line 52, in long at::Tensor::item<long>() const
[2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes: [INFO] File "../ATen/core/TensorBody.h", line 4274, in at::Tensor::item() const
```
In this example, `at::native::narrow_tensor_symint` calls into `item`, which
triggers the guard on a data-dependent `SymNode`. You can modify the C++ code to
avoid specializing, or verify if you should be in this C++ code (e.g., `start` was
not expected to be a `Tensor`, and modifying this fixed the problem).
## Tools for Fixing Errors
There are a few important functions which you should use to troubleshoot this problem.
### torch._check(cond, msg_fn)
`torch._check` is a function used to assert conditions at runtime, particularly when dealing with symbolic integers (`SymInts`) in PyTorch.
**Example Usage:**
```python
torch._check(x.size(0) == y, lambda: f"size mismatch: {x.size(0)} != {y}")
```
The code above does the following:
* Creates a deferred runtime assertion instead of a compile-time guard
* Teaches the symbolic reasoning system facts about your unbacked SymInts
* Can eliminate unbacked symbols by replacing them with equivalent expressions
* Refines value ranges of symbols
* Remembers boolean expressions that are always true
Semantically, the function behaves like a conditional check:
```python
if not cond:
raise RuntimeError(msg_fn())
```
But there are a number of key differences:
* The condition is always assumed true at compile time, even if it involves unbacked `SymInts`. The actual check is deferred to runtime, avoiding
compile-time errors. Instead of setting up a guard, we implement a
deferred runtime assertion to verify the condition at runtime. At compile
time, we assume the condition won't trigger an error, so we don't need
to determine if it evaluates to `True` or `False`.
* If you perform an equality test `u0 == RHS`, we try to replace all instances
of `u0` with RHS. We will ALWAYS do this if RHS has no unbacked symbols,
as removing unbacked symbols is beneficial—eliminating them prevents
the creation of a `GuardOnDataDependentSymNode`. Even if we are not able
to eliminate u0, we can refine its value range. The value range specifies
what the set of possible values for a variable are. By default, size-like
unbacked SymInts have a value range of `[0, Inf]`; if you assert it is
equal to an expression with a refined value range, say `[2, 20]`, then
`u0`'s value range will be updated to `[2, 20]`. We also have limited
support for propagating value ranges in reverse.
* If you perform a boolean test `f(u0)`, we will remember that this expression always evaluates to True, and if you evaluate an expression that contains this expression, we will substitute it with True. We also support some limited reasoning on logically equivalent statements. For example, if you `torch._check(u0 < 4)`, we will also know that `u0 >= 4` evaluates to `False`, and so performing a test like this in a normal non-check conditional will go through fine.
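Putting these behaviors together, a minimal sketch (the shapes and bounds are illustrative):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(n, y):
    u0 = n.item()
    # Equality with an expression containing no unbacked symbols: u0 is
    # replaced by y.shape[0] wherever it appears afterwards.
    torch._check(u0 == y.shape[0])
    # Boolean fact remembered: u0 >= 100 is now known to evaluate to False.
    torch._check(u0 < 100)
    return torch.zeros(u0) + y

f(torch.tensor(8), torch.randn(8))
```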
### `torch._check_is_size(size)` and `guard_size_oblivious(cond)`
Example:
```python
u0 = y.item()
torch._check_is_size(u0)
```
**Semantic Equivalent:**
```python
if u0 < 0:
raise RuntimeError("u0 is not a size")
```
**Key Differences:**
Like `torch._check`, this test will always succeed at compile time, and it will establish that `u0 >= 0`. This refines the value range of `u0` to `[0, Inf]` instead of `[-Inf, Inf]`.
Marking `u0` as size-like is crucial. Size-like unbacked `SymInts` behave like
their regular counterparts, except when involved in a boolean expression
evaluated with `guard_size_oblivious`. In such cases, they are assumed not to equal zero or one, temporarily setting their value range to `[2, Inf]`. For instance, a conditional check like `u0 == 1` will evaluate to `False` when `u0` is size-like, instead of causing an error.
For example, `guard_size_oblivious(u0 == 1)` will always return `False` when `u0`
is size-like.
Marking unbacked symbols as size-like is essential in contexts where tensor
sizes are expected. PyTorch internals often check if sizes are zero or one to
handle special cases related to empty or single-element tensors. If you pass an
unbacked symbol to a factory function like `torch.empty`, it will automatically
be marked as size-like. However, some quantities, like arguments to `Tensor.view`,
cannot be inferred as size-like because `-1` is a valid argument. In such cases,
you need to explicitly use `torch._check_is_size` on an unbacked `SymInt` before
passing it to `view`.
In PyTorch framework code, if you need to test a size for zero or one, wrap the
test in `guard_size_oblivious` to assume that size-like unbacked `SymInts` will
not pass this test. Generally, most framework code has logic for the `>= 2`
case, which works for the `0/1` case. If using `guard_size_oblivious` in
PyTorch framework code resolves your issue, it's likely acceptable. However,
avoid using `guard_size_oblivious` in user code, especially if different
behavior is required for the `0/1` case at runtime, such as in a
hand-tracking application.
In C++, this can be done with `TORCH_GUARD_SIZE_OBLIVIOUS(u0.sym_eq(0))`, for example.
### torch._check_is_size(size, max=upper_bound) (New)
This function is semantically equivalent to `torch._check(size <= upper_bound)`.
However, under `guard_size_oblivious`, it assumes that `size < upper_bound`.
This functionality only works when the upper bound is an integer constant. If
`upper_bound` is a symbolic expression, normal semantics apply. There is
potential to extend this functionality to symbolic expressions with further
development.
For more details, see the related issue https://github.com/pytorch/pytorch/issues/120288.
### `torch._constrain_as_value` and `torch._constrain_as_size`
These APIs are more specialized and are effectively equivalent to
`torch._check` and `torch._check_is_size`, with the added capability
of adjusting the value range of a variable by specifying minimum and
maximum values. However, in recommendation models, these functions are
unlikely to resolve `GuardOnDataDependentSymNode` errors effectively.
While `constrain_as_value` might seem like a convenient way to ensure a
variable stays within the bounds of another tensor, it is often impractical.
This is because value ranges only support constant bounds, and it's common
for the tensor you want to index into to have a symbolic dimension (for
example, `s0`). Using its size as the maximum value for a value range
will force specialization, which is usually undesirable. Instead, if
necessary, manually handle range checks by using `torch._check()` on
appropriate expressions based on the errors you encounter.
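A minimal sketch of this manual range-check approach (the function name and shapes are illustrative):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def select_row(idx, table):
    u0 = idx.item()
    # Manual range checks instead of constraining u0 with a symbolic bound:
    torch._check(u0 >= 0)
    torch._check(u0 < table.shape[0])
    return table[u0]

select_row(torch.tensor(2), torch.randn(5, 4))
```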
## Common Fix Patterns
There are several common methods to resolve issues like this. Below,
we outline the most frequently used solutions.
### When It's Unfixable
In some cases, the issue is genuinely unfixable due to the nature of the code.
Consider the following example:
```python
i = x.item()
if i > 4:
return x * 2
else:
return x + 3
```
If the user code is branching on a data-dependent value, it is impossible to
trace as is. In such cases, you may need to consider alternative approaches,
such as using `torch.cond`.
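For the branching example above, a sketch of a `torch.cond` rewrite (keeping the predicate as a tensor instead of calling `.item()`):
```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    # Both branches are traced into the graph; no guard is placed on the
    # data-dependent predicate.
    return torch.cond(x.sum() > 4, lambda x: x * 2, lambda x: x + 3, (x,))

f(torch.randn(3))
```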
Another common pattern involves indexing with a data-dependent value:
```python
return self.mlps[x.item()]
```
Here, `self.mlps` is a Python list or `ModuleList`, and the code branches on a data-dependent value. The simplest solution is to induce a graph break before the indexing operation.
### `u0` is a Size, but We Don't Know It
Some guards fail on tests that essentially ask, "Is this a size?" but we don't know it is a size. These fall into two categories:
1. **Regular Tests:**
These are tests like `u0 >= 0` or `u0 != -1` that are unconditionally true
for sizes. Adding a `torch._check_is_size(...)` on the relevant size will
assert that these tests are true. This is typically uncommon because if
the test is for error checking, we can infer that the condition must be
true, as an error would occur otherwise. An important exception is APIs
that accept both sizes and `-1`; in such cases, the user must indicate that
the input data-dependent quantity cannot be `-1`, as something unusual would
happen otherwise. For an example, see
https://github.com/pytorch/pytorch/pull/107788.
Sometimes, you can refactor an error-checking API to split a logical
disjunction of conditionals into separate conditionals. If you can do so
to achieve a single `torch._check(x == y)` statement, it will enable
the automatic generation of a deferred runtime assertion. For an example,
see https://github.com/pytorch/pytorch/pull/110979.
2. **Edge Case Tests:**
These are tests like `u0 == 0` or `u0 == 1`, which are not always true for
sizes, but where our choice doesn't really matter. These tests handle edge
cases, such as dealing with an empty tensor or testing for broadcasting when
we want to assume broadcasting is not occurring. To resolve these situations,
two steps are needed:
* First, the guard itself must be evaluated via `guard_size_oblivious`,
which assumes that size-like integers cannot equal zero or one, with the
promise that if they do, something reasonable will happen.
* Second, the symbols themselves must be marked as size-like, either
inferred because they were passed to tensor factory functions or explicitly
specified with `torch._check_is_size(...)`. For examples of making guards
size-oblivious, see https://github.com/pytorch/pytorch/pull/118579.
Sometimes, these tests can occur in C++. While there are corresponding
C++ APIs for these tests, it can be more challenging to localize the problem,
as you do not get a useful backtrace by default.
### `u0` is Actually Equal to `u1`, but We Don't Know It
Multiple unbacked `SymInts` can be known to be equal at compile time:
```python
i0 = x.sum().item()
i1 = x.sum().item()
return torch.randn(i0) + torch.randn(i1)
```
If there is a `torch._check(i0 == i1)` somewhere (in the example above, this
check would occur inside the shape-checking rule for addition), we will
automatically unify the two unbacked `SymInts` and recognize them as equal.
However, if such an assertion is missing, you may need to explicitly add an
assertion to achieve this unification. For an example, see
https://github.com/pytorch/pytorch/issues/111950.
```{note}
If we allocate an unbacked `SymInt` and
immediately set it equal to another, these instances are benign and not easily
eliminated entirely from the framework.
```
### `u0` is a Tensor
Another reason you might be overallocating unbacked `SymInts` is due to passing
around a `Tensor` and relying on its implicit conversion to an integer. Many
functions that accept an integer will also accept a `Tensor` and automatically
call `item()` on the tensor passed in its place. It's beneficial to examine
`TORCH_LOGS=dynamic` to determine whether the number of unbacked `SymInts` is
as expected or excessive. When this occurs, a new `SymInt` will be allocated at
the line where a PyTorch function is invoked.
This issue is less likely to cause problems now because the return value of
`t.item()` is memoized, ensuring that you consistently receive the same unbacked
`SymInt` if you call it multiple times.
### Overspecialization Issue
In non-strict export mode, consider the following code:
```python
u0 = x.sum().item()
return y[:u0]
```
This code will fail when trying to evaluate `u0` because, when a `SymInt` is
used directly inside a Python slice (without using Dynamo), Python forces the
integer to be specialized and fails if it is unbacked.
To resolve this, you can rewrite the program to avoid specialization.
For the example above, you can fix it by not using slices:
```python
u0 = x.sum().item()
return y.narrow(0, 0, u0)
```
For more details, see the related issue
https://github.com/pytorch/pytorch/issues/111950.
### Use Lengths Instead of Offsets
When working with variable sequence lengths, it's common to have tensors
representing either the lengths or offsets of the sequences. For example, given
`values = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]`, you might have `lengths = [3, 2, 4]`
and `offsets = [0, 3, 5, 9]`. While these representations are interconvertible,
it's better to work with lengths when dealing with them as integers (by calling
`lengths.tolist()`), rather than offsets.
The reason is that when you perform a `torch.split()` on your `values` tensor, you
need to create tensors for each sub-sequence, such as tensors of sizes 3, 2, and 4.
If you have unbacked `SymInts` for sizes, they become `u0`, `u1`, and `u2`. You can
easily indicate that they are size-like, and you're done. However, if you have
unbacked `SymInts` for offsets, they become `u1 - u0`, `u2 - u1`, `u3 - u2`, which
complicates matters. These quantities cannot be conveniently marked as size-like,
leading to potential issues. Since it's relatively straightforward to write code
using either lengths or offsets, you should prefer using lengths.
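A minimal sketch of the lengths-based pattern (names and sizes are illustrative):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # needed for .tolist()

@torch.compile
def split_by_lengths(values, lengths):
    # Each length becomes its own size-like unbacked SymInt (u0, u1, ...),
    # which is much easier to reason about than offset differences.
    sizes = lengths.tolist()
    for s in sizes:
        torch._check_is_size(s)
    return torch.split(values, sizes)

values = torch.arange(9.0)
lengths = torch.tensor([3, 2, 4])
split_by_lengths(values, lengths)
```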
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`debugging-tlparse-torch-logs`
```

View File

@ -0,0 +1,33 @@
(zero-one-specialization)=
# The Zero-One Specialization Problem
Before you read this section, you should understand the basics of
dynamic shapes. Make sure you have read the following sections:
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
In `torch.compile`, we specialize automatically on inputs with sizes
0 or 1 and assume that any remaining inputs cannot be 0 or 1. This
simplifies tasks like contiguity and broadcasting checks, as it
avoids adding extra guards. However, this can cause problems for
sparse models with many symbolic integers that in practice have
tensors of size 0, 1, or 2. For example, consider a task such as
collecting likes on a page.
While it's possible to stop specializing on 0/1 upfront, executing
normal PyTorch code often reintroduces 0/1 guards, as many conditions
in PyTorch check for values being 0 or 1. Although models that work
for `N > 2` often generalize to `N = 1`, this isn't guaranteed, especially
with symbolic variables. For example, in hand tracking, a dimension
size of `N = 0`, `1`, or `2` may lead to different graph behaviors.
Simply hoping that the `N > 2` model generalizes can expose soundness issues.
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
* {ref}`backed-vs-unbacked-symints`
```

View File

@ -34,75 +34,75 @@ Read more about feature classification at: https://pytorch.org/blog/pytorch-feat
Below is an example that uses cond to branch based on input shape:
```python
import torch

def true_fn(x: torch.Tensor):
    return x.cos() + x.sin()

def false_fn(x: torch.Tensor):
    return x.sin()

class DynamicShapeCondPredicate(torch.nn.Module):
    """
    A basic usage of cond based on dynamic shape predicate.
    """

    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        def true_fn(x: torch.Tensor):
            return x.cos()

        def false_fn(x: torch.Tensor):
            return x.sin()

        return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

dyn_shape_mod = DynamicShapeCondPredicate()
```
We can eagerly run the model and expect the results vary based on input shape:
```python
inp = torch.randn(3)
inp2 = torch.randn(5)
assert torch.equal(dyn_shape_mod(inp), false_fn(inp))
assert torch.equal(dyn_shape_mod(inp2), true_fn(inp2))
```
We can export the model for further transformations and deployment:
```python
inp = torch.randn(4, 3)
dim_batch = torch.export.Dim("batch", min=2)
ep = torch.export.export(DynamicShapeCondPredicate(), (inp,), {}, dynamic_shapes={"x": {0: dim_batch}})
print(ep)
```
This gives us an exported program as shown below:
```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: f32[s0, 3]):
        sym_size: Sym(s0) = torch.ops.aten.sym_size.int(arg0_1, 0)
        gt: Sym(s0 > 4) = sym_size > 4;  sym_size = None
        true_graph_0 = self.true_graph_0
        false_graph_0 = self.false_graph_0
        conditional: f32[s0, 3] = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [arg0_1]);  gt = true_graph_0 = false_graph_0 = arg0_1 = None
        return (conditional,)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1: f32[s0, 3]):
            cos: f32[s0, 3] = torch.ops.aten.cos.default(arg0_1)
            sin: f32[s0, 3] = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
            add: f32[s0, 3] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            return add

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1: f32[s0, 3]):
            sin: f32[s0, 3] = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
            return sin
```
Notice that `torch.cond` is lowered to `torch.ops.higher_order.cond`, its predicate becomes a symbolic expression over the shape of the input,
@ -111,41 +111,41 @@ and branch functions become two sub-graph attributes of the top level graph module.
Here is another example that showcases how to express a data-dependent control flow:
```python
class DataDependentCondPredicate(torch.nn.Module):
    """
    A basic usage of cond based on data dependent predicate.
    """

    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cond(x.sum() > 4.0, true_fn, false_fn, (x,))
```
The exported program we get after export:
```
class GraphModule(torch.nn.Module):
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: f32[s0, 3]):
        sum_1: f32[] = torch.ops.aten.sum.default(arg0_1)
        gt: b8[] = torch.ops.aten.gt.Scalar(sum_1, 4.0);  sum_1 = None
        true_graph_0 = self.true_graph_0
        false_graph_0 = self.false_graph_0
        conditional: f32[s0, 3] = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [arg0_1]);  gt = true_graph_0 = false_graph_0 = arg0_1 = None
        return (conditional,)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1: f32[s0, 3]):
            cos: f32[s0, 3] = torch.ops.aten.cos.default(arg0_1)
            sin: f32[s0, 3] = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
            add: f32[s0, 3] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            return add

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1: f32[s0, 3]):
            sin: f32[s0, 3] = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
            return sin
```
## Invariants of torch.ops.higher_order.cond

View File

@ -509,10 +509,6 @@ coverage_ignore_functions = [
"custom_fwd",
# torch.cuda.amp.common
"amp_definitely_not_available",
# torch.cuda.graphs
"graph_pool_handle",
"is_current_stream_capturing",
"make_graphed_callables",
# torch.mtia.memory
"reset_peak_memory_stats",
# torch.cuda.nccl
@ -524,25 +520,11 @@ coverage_ignore_functions = [
"reduce_scatter",
"unique_id",
"version",
# torch.cuda.nvtx
"range",
"range_end",
"range_start",
# torch.cuda.profiler
"init",
"profile",
"start",
"stop",
# torch.cuda.random
"get_rng_state",
"get_rng_state_all",
"initial_seed",
"manual_seed",
"manual_seed_all",
"seed",
"seed_all",
"set_rng_state",
"set_rng_state_all",
# torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook
"hook_with_zero_step",
"hook_with_zero_step_interleaved",
@ -2172,8 +2154,6 @@ coverage_ignore_classes = [
"EventHandler",
"SynchronizationError",
"UnsynchronizedAccessError",
# torch.cuda.memory
"MemPool",
# torch.distributed.elastic.multiprocessing.errors
"ChildFailedError",
"ProcessFailure",
@ -2479,10 +2459,6 @@ coverage_ignore_classes = [
# torch.amp.grad_scaler
"GradScaler",
"OptState",
# torch.cuda.graphs
"CUDAGraph",
# torch.cuda.streams
"Event",
# torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook
"PostLocalSGDState",
# torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook
@ -3176,8 +3152,6 @@ coverage_ignore_classes = [
"WeakIdKeyDictionary",
"WeakIdRef",
"WeakTensorKeyDictionary",
# torch.utils.debug_mode
"DebugMode",
]
# The suffix(es) of source filenames.

View File

@ -0,0 +1,47 @@
# Aliases in torch.cuda
The following are aliases to their counterparts in ``torch.cuda`` in the nested namespaces in which they are defined. For any of these APIs, feel free to use the top-level version in ``torch.cuda`` like ``torch.cuda.seed`` or the nested version ``torch.cuda.random.seed``.
```{eval-rst}
.. automodule:: torch.cuda.random
.. currentmodule:: torch.cuda.random
.. autosummary::
:toctree: generated
:nosignatures:
get_rng_state
get_rng_state_all
set_rng_state
set_rng_state_all
manual_seed
manual_seed_all
seed
seed_all
initial_seed
```
```{eval-rst}
.. automodule:: torch.cuda.graphs
.. currentmodule:: torch.cuda.graphs
.. autosummary::
:toctree: generated
:nosignatures:
is_current_stream_capturing
graph_pool_handle
CUDAGraph
graph
make_graphed_callables
```
```{eval-rst}
.. automodule:: torch.cuda.streams
.. currentmodule:: torch.cuda.streams
.. autosummary::
:toctree: generated
:nosignatures:
Stream
ExternalStream
Event
```

View File

@ -274,10 +274,6 @@ See the docs for {class}`~torch.cuda.gds.GdsFile` for an example of how to use t
.. py:module:: torch.cuda.gds
```
```{eval-rst}
.. py:module:: torch.cuda.graphs
```
```{eval-rst}
.. py:module:: torch.cuda.jiterator
```
@ -294,14 +290,13 @@ See the docs for {class}`~torch.cuda.gds.GdsFile` for an example of how to use t
.. py:module:: torch.cuda.profiler
```
```{eval-rst}
.. py:module:: torch.cuda.random
```
```{eval-rst}
.. py:module:: torch.cuda.sparse
```
```{eval-rst}
.. py:module:: torch.cuda.streams
```
.. toctree::
:hidden:
cuda.aliases.md
```

View File

@ -82,55 +82,48 @@ Some of the most commonly used backends include:
## Read More
```{eval-rst}
.. toctree::
:caption: Getting Started for PyTorch Users
:maxdepth: 1
```{toctree}
:caption: Getting Started for PyTorch Users
:maxdepth: 2
torch.compiler_get_started
torch.compiler_api
torch.compiler.config
torch.compiler_fine_grain_apis
torch.compiler_backward
torch.compiler_aot_inductor
torch.compiler_inductor_profiling
torch.compiler_profiling_torch_compile
torch.compiler_faq
torch.compiler_troubleshooting
torch.compiler_performance_dashboard
torch.compiler_inductor_provenance
torch.compiler_get_started
torch.compiler_api
torch.compiler.config
torch.compiler_dynamic_shapes
torch.compiler_fine_grain_apis
torch.compiler_backward
torch.compiler_aot_inductor
torch.compiler_inductor_profiling
torch.compiler_profiling_torch_compile
torch.compiler_faq
torch.compiler_troubleshooting
torch.compiler_performance_dashboard
torch.compiler_inductor_provenance
```
```{eval-rst}
.. toctree::
:caption: `torch.compile` Programming Model
```{toctree}
:caption: torch.compile Programming Model
:maxdepth: 2
compile/programming_model
compile/programming_model
```
% _If you want to contribute a developer-level topic
% that provides in-depth overview of a torch._dynamo feature,
% add in the below toc.
```{toctree}
:caption: Deep Dive for PyTorch Developers
:maxdepth: 1
```{eval-rst}
.. toctree::
:caption: Deep Dive for PyTorch Developers
:maxdepth: 1
torch.compiler_dynamo_overview
torch.compiler_dynamo_deepdive
torch.compiler_dynamic_shapes
torch.compiler_nn_module
torch.compiler_cudagraph_trees
torch.compiler_fake_tensor
torch.compiler_dynamo_overview
torch.compiler_dynamo_deepdive
torch.compiler_nn_module
torch.compiler_cudagraph_trees
torch.compiler_fake_tensor
```
```{eval-rst}
.. toctree::
:caption: HowTo for PyTorch Backend Vendors
:maxdepth: 1
```{toctree}
:caption: HowTo for PyTorch Backend Vendors
:maxdepth: 1
torch.compiler_custom_backends
torch.compiler_transformations
torch.compiler_ir
torch.compiler_custom_backends
torch.compiler_transformations
torch.compiler_ir
```

View File

@ -1,129 +1,295 @@
# Dynamic Shapes
---
file_format: mystnb
kernelspec:
name: python3
mystnb:
execution_timeout: 30
execution_show_tb: True
merge_streams: True
---
Code: [symbolic_shapes.py](https://github.com/pytorch/pytorch/blob/db4572dbf18f1cf50cf662547e272d3117063747/torch/fx/experimental/symbolic_shapes.py)
```{code-cell}
:tags: [remove-cell]
import torch
from compile import header_code
See also: [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.fh8zzonyw8ng)
## Motivation
Deep learning compilers commonly only work for static shapes, that is to say, they produced compiled programs which only work for a single specific configuration of input shapes, and must recompile if any input shape changes. This assumption works great for the majority of commonly run deep learning models today, but there are a few situations where it is insufficient:
- Some dimensions, such as batch size or sequence length, may vary. For example, an inference service performing adaptive batching will execute inference requests with varying batch sizes depending on how many requests it received within its batching window. We may also want to consider padding out variable size sequences only to the maximum sequence length within a batch, which may vary from batch-to-batch.
- Some models exhibit data-dependent output shapes, that is to say, the size of their outputs and intermediates may depend on the actual input data which may vary across runs. For example, detection models may first generate a variable number of potential bounding boxes before running a more expensive image recognition model to identify if the subject is in a bounding box. The number of bounding boxes is data dependent.
- One particularly important case of data-dependent shapes occurs when dealing with sparse representations, such as sparse tensors, jagged tensors, and graph neural networks. In all of these cases, the amount of data to be processed depends on the sparse structure of the problem, which will typically vary in a data-dependent way.
In supporting dynamic shapes, we chose not to support dynamic rank programs, e.g., programs whose inputs tensors change in dimensionality, as this pattern rarely occurs in real-world deep learning programs, and it avoids the need to reason inductively over symbolic lists of shapes.
## Abridged public API
The default dynamic behavior in PyTorch 2.1 is:
- PT2 assumes everything is static by default
- If we recompile because a size changed, we will instead attempt to recompile
that size as being dynamic (sizes that have changed are likely to change in
the future). This generalization may fail (e.g., because user code does a
conditional branch on the size in question or missing dynamic shapes support
in PT2). If you are trying to understand why PT2 has overspecialized some
code, run with `TORCH_LOGS=dynamic` and look for "eval" entries that say
when guards are added and why.
- If you know ahead of time something will be dynamic, you can skip the first
recompile with `torch._dynamo.mark_dynamic(tensor, dim)`. If you know ahead of time
the `min` and `max` value this dimension can take, you can specify `torch._dynamo.mark_dynamic(tensor, dim, min=min, max=max)`
- If you say `torch.compile(dynamic=False)`, we will turn off automatic
dynamic shapes on recompiles and always recompile for each distinct size.
Conversely, if you say `torch.compile(dynamic=True)`, we will try to make
everything as dynamic as possible. This is mostly useful for small
operators; if you try it on a big model it will (1) probably crash PT2 and (2) run slow for no good reason.
- You can whitelist specific sources to be marked as dynamic using the
`TORCH_COMPILE_DYNAMIC_SOURCES` environment variable or by setting
`torch.compiler.config.dynamic_sources`. This is particularly useful for large
models with graph breaks, as you can maintain dynamism across graph breaks since
source names stay consistent. You can also use this to mark integers as dynamic.
The format is a comma-delimited list of source names, e.g., `"L['x'], L['y']"`.
You can also use regexes, e.g., `"L\['x.*'\], L\['y.*'\]"`.
This whitelist takes precedence over other flags like `dynamic=False`,
`force_nn_module_property_static_shapes`, and `force_parameter_static_shapes`.
- Sometimes it can be cumbersome to find the right inputs to mark as dynamic. If
you're willing to take a performance hit for the first batch, one other affordable
option we have are the eager_then_compile stances which derive dynamism for you.
See [torch.compiler.set_stance](https://docs.pytorch.org/docs/stable/generated/torch.compiler.set_stance.html) for more details.
## The Guard Model
When considering how to add support for dynamic shapes to TorchDynamo and TorchInductor, we made a major design decision: in order to reuse decompositions and other preexisting code written in Python/C++ targeting the PyTorch API, we must be able to trace through dynamic shapes. Unlike a fully symbolic system which might capture both branches of a conditional, we always pick one branch and specialize our trace under the assumption that we only use this trace when we would have made the same choice for that branch in the future. To do this, we maintain a "hint" for every symbolic size saying what its concrete value is at compile time (as TorchDynamo is a just-in-time compiler, it always knows what the actual input sizes are.) When we perform a condition on a tensor, we simply consult the hint to find out which branch to take.
This greatly simplifies the symbolic shape formulas we produce, but means we have a much more involved system for managing guards. Consider, for example, the following program:
```python
def f(x, y):
z = torch.cat([x, y])
if z.size(0) > 2:
return z.mul(2)
else:
return z.add(2)
torch._logging.set_logs(graph_breaks=True, graph_code=True)
```
The final IR we will compile with TorchInductor will either be `torch.cat([x, y]).add(2)` or `torch.cat([x, y]).mul(2)` (with the condition flattened away), but to determine which branch we are in, we would need to know the size of `z`, an intermediate. Because TorchDynamo must know upfront if a compiled trace is valid (we do not support bailouts, like some JIT compilers), we must be able to reduce `z.size(0)` as an expression in terms of the inputs, `x.size(0) + y.size(0)`. This is done by writing meta functions for all operators in PyTorch which can propagate size information to the output of a tensor without actually performing computation on the node.
(dynamic_shapes)=
# Dynamic Shapes
## Overall architecture
This section explains how to work with dynamic shapes in PyTorch, including how
to debug and fix common errors, implement support for dynamic shapes in
operators, and understand the underlying mechanisms.
Symbolic shapes workflow:
Dynamic shapes allow PyTorch models to handle inputs with varying dimensions
without recompilation. This enables more flexible models that can process
different batch sizes, sequence lengths, or image dimensions in a single
compiled artifact. Dynamic shapes work by symbolically tracing tensor
dimensions rather than using concrete values, creating a computation
graph that adapts to different input shapes at runtime. By default,
PyTorch assumes all input shapes to be static.
1. When we start compiling a frame in Dynamo, we allocate a ShapeEnv (attached to FakeTensorMode) which keeps track of symbolic shapes state.
2. We allocate symbolic sizes for tensors on entry (what is static or dynamic is a policy decision, with some knobs).
3. We propagate the symbolic sizes through operators, maintaining both (1) FX IR so that we can faithfully export symbolic compute, and (2) Sympy expressions representing the size vars, so we can reason about them.
4. When we condition on symbolic sizes, either in Dynamo tracing or in Inductor optimization, we add guards based on the conditional. These can be induced from both Python and C++.
5. These guards can induce further simplifications on symbolic variables. For example, if you assert `s0 == 4`, we can now replace all occurrences of `s0` with `4`.
6. When we're done tracing and optimizing, we install all of these guards with the compiled code; the compiled code is only reusable if all the guards evaluate true.
Typically, deep learning compilers only support static shapes, requiring
recompilation for input shape changes. While this approach covers many use cases,
there are situations where this is insufficient:
Important files:
- **Variable Dimensions** - Batch sizes or sequence lengths vary, such as in
adaptive batching.
- **Data-Dependent Outputs** - Models produce outputs based on input data,
like variable bounding boxes in detection models.
- **Sparse Representations** - Processing depends on data-varying sparse structures,
such as in sparse tensors, jagged tensors, and graph neural networks.
- C++ SymInt API: `c10/core/SymInt.h`, `SymFloat.h`, `SymBool.h`
- Python SymInt API: `torch/__init__.py` (look for `SymInt/SymFloat/SymBool`)
- C++ plumbing: `c10/core/SymNodeImpl.h`, `torch/csrc/utils/python_symnode.h`, `torch/csrc/jit/python/init.cpp`
- Python infrastructure: `torch/fx/experimental/symbolic_shapes.py`
- Other important files: `torch/_subclasses/fake_tensor.py`, `torch/_meta_registrations.py`, decomps, PrimTorch refs
Dynamic shapes do not support dynamic rank programs, that is, programs whose input tensors
change in dimensionality, as this pattern is uncommon and would add unnecessary complexity.
## Abridged internal API
Understanding the Python class hierarchy:
## What does it mean for a size/integer to be dynamic?
- SymInt/SymFloat/SymBool: these are user-visible classes that simulate their int/float/bool counterparts. If you add two SymInts, we give you a new SymInt that symbolically tracks that the integer addition had occurred.
- SymNode: this is the internal structure (accessible via e.g., `symint.node`) which holds the actual symbolic tracking info. SymNode is type erased; this makes it more convenient to represent mixed-type operations. Note that technically you don't have to call into Python SymNode from SymInt; for example, XLA's C++ `SymNodeImpl` would take the place of SymNode.
- ShapeEnv: per-compile context state which keeps track of all the free symbols and guards we have accumulated so far. Every SymNode records its ShapeEnv (but not vice versa; SymNodes only get used if they participate in a guard).
Dynamic shapes allow avoiding recompilations by making certain dimensions or integers
dynamic. For example, if a function `f(x)` is compiled with a static size, it will need
recompilation for different sizes:
C++ is fairly similar:
```{note}
For simplicity, this example uses `@torch.compile(dynamic=True)`. Note that
this option is not recommended because it is error prone.
For a recommended way of enabling dynamic shapes, see {ref}`enable-dynamic-behavior`.
```
- c10::SymInt/SymFloat/SymBool: user-visible classes that simulate int/float/bool.
- c10::SymNode/SymNodeImpl: analogous to SymNode
- There is no ShapeEnv in C++; for ease of debugging, the entire symbolic reasoning apparatus is in Python.
When you write code that is traceable with `make_fx`, it must be able to deal with SymInt/SymFloat/SymBool flowing through it. [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.fh8zzonyw8ng) gives some guidance for how to do this.
```{code-cell}
import torch

@torch.compile(dynamic=False)
def f(x):
    return x * x.size()[0]

f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
```
## DimDynamic policy
In the produced output, you can see that four graphs were generated.
See the corresponding <a href="_static/img/dynamic_shapes/tlparse1_dynamic_shapes_false.png" target="_blank">tlparse output</a>
Symbolic reasoning:
By making the size dynamic, the function can handle various sizes without recompilation:
- Value ranges
- Sympy usage notes
- Constraints
- DimDynamic/Constraint
```{code-cell}
import torch

@torch.compile(dynamic=True)
def f(x):
    return x * x.size()[0]

f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
```
With dynamic shapes enabled, only one graph is created. See the
corresponding <a href="_static/img/dynamic_shapes/tlparse2_dynamic_shapes_true.png" target="_blank">tlparse output</a>.
## Unbacked SymInts
To resolve control flow, we check the hint, aka actual value, of a symbolic integer to determine which branch to take. However, in some cases, we may not have a hint: so-called unbacked symbolic integers arise when a size variable emerges from a data-dependent operation like `.nonzero()` or `.item()`. It is illegal to perform control flow on these symbolic integers, so we must graph break on these operations.
Naively implemented, this is too restrictive: most PyTorch programs will immediately fail if you try to do anything with unbacked symbolic integers. Here are the most important enhancements to make this actually work:
While compilation time differences
are minimal for this small example, more complex use cases would show significant
performance improvements.
- On tensor creation, PyTorch precomputes a lot of data about a tensor; for example, if you use `empty_strided` to create a tensor, we will eagerly sort the strides and determine if the tensor is non-overlapping and dense. Sorts produce a lot of guards. However, it is more common to produce a tensor directly with a higher-level API like `empty`, which is guaranteed to produce a non-overlapping and dense tensor. We modified PyTorch to avoid needlessly recomputing these properties.
- Even if nontrivial compute is needed, sometimes a property is never actually queried at all. Making these precomputed properties lazy allows us to avoid guarding on an unbacked symbolic integer unless it is actually needed.
- The data in an integer tensor is generally not known to be non-negative. However, we provide an API `constrain_range` whereby a user can specify that a size is bounded above and below by known limits.
(what_is_a_specialization)=
## What is a specialization?
Similar to the dynamic APIs, there are corresponding unbacked APIs: namely you can use mark_unbacked instead of `mark_dynamic` and `TORCH_COMPILE_UNBACKED_SOURCES` instead of `TORCH_COMPILE_DYNAMIC_SOURCES` to tell the compiler to mark an input as unbacked.
**Specialization** refers to optimizing a computational graph for specific input shapes
by examining shape conditions during control flow. If a branch is taken based on a
shape condition, the graph is tailored for that condition. If a new input doesn't meet
this condition, the system will recompile the graph.
In future versions of PT2 (beyond PT2.1), we will extend our reasoning system
to infer that an unbacked symbolic integer is size-like based on usage. For
example, if you pass the result of an `.item()` call to a factory function
like `torch.empty`, we will automatically infer that the result is a size
(because if it was not, it would fail.) This assumption would get validated
at runtime, raising an error if it was not fulfilled.
Specialization allows you to create optimized computational graphs for specific input
shapes, which can significantly improve execution speed.
```{code-cell}
import torch
@torch.compile(dynamic=True)
def f(x):
if x.size()[0] == 10:
return x * 10
if x.size()[0] <= 30:
return x*200
return x*x.size()[0]
f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
f(torch.rand(50))
```
In the code above, the graph is specialized for an input size of exactly 10, in which
case it returns `x * 10`. For input sizes of at most 30, it returns `x * 200`; otherwise it returns `x * x.size()[0]`.
In the output, you can see that this creates three graphs.
See the corresponding <a href="_static/img/dynamic_shapes/tlparse3_specialization.png" target="_blank">tlparse output</a>
This is how the graphs are created for the above function:
```{image} _static/img/dynamic_shapes/dynamic_shapes_example_specialization.png
```
(enable-dynamic-behavior)=
## Enabling Dynamic Behavior
There are the following ways to make things dynamic:
* {ref}`automatic_dynamic`
* {ref}`user_annotations` (preferred)
* {ref}`torch_compile_dynamic_true` (for testing only)
* {ref}`dynamic_shapes_advanced_control_options` (for advanced use cases)
Each of these options is described below.
(automatic_dynamic)=
### Automatic dynamic
**Automatic dynamic** is the default behavior where {func}`torch.compile` performs
the initial compilation assuming static shapes are used, while tracking the
input sizes from that first compilation. When a recompile is triggered, it
uses this information to identify which dimensions have changed and marks
those as dynamic for the second compilation.
(user_annotations)=
### User Annotations
Several APIs allow users to explicitly mark specific inputs
by name or code as dynamic. This is useful for avoiding initial compilations that
would eventually become dynamic with the previous tools. It is also used to mark
elements that do not automatically get marked as dynamic, such as neural network
module parameters, and so on. User annotations are the preferred way to enable
dynamic shapes.
#### `mark_dynamic(tensor, dim, min=min, max=max)`
The {func}`torch._dynamo.mark_dynamic` function marks a tensor dimension as dynamic and will fail if it
gets specialized. It does not work for integers. Use this function only if you know
all graphs in the frame using this input converge to a single dynamic graph.
Otherwise, you may encounter a misleading constraint violation error.
In such cases, consider using {func}`torch._dynamo.maybe_mark_dynamic`. Currently,
{func}`torch._dynamo.mark_dynamic`
does not take precedence over `force_parameter_static_shapes = True` or `force_nn_module_property_static_shapes = True`.
If you know in advance that a particular dimension will be dynamic, you
can avoid the initial recompilation by using {func}`torch._dynamo.mark_dynamic(tensor, dim)`.
Additionally, if you already know the minimum and maximum possible
values for this dimension, you can specify them with
{func}`torch._dynamo.mark_dynamic(tensor, dim, min=min, max=max)`.
Here is a quick example:
```{code-cell}
import torch

@torch.compile(dynamic=True)
def f(x):
    return x * x.size()[0]

x = torch.randn(10)
torch._dynamo.mark_dynamic(x, 0)
# the first invocation is given a tensor marked as dynamic
f(x)
# the remaining invocations reuse the dynamically compiled code
f(torch.randn(20))
f(torch.randn(30))
f(torch.randn(40))
```
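If the bounds are known in advance, the same example can pass them explicitly. This is a minimal sketch of the `min`/`max` form described above; the bound values here are arbitrary:

```python
import torch

@torch.compile
def f(x):
    return x * x.size()[0]

x = torch.randn(16)
# Mark dimension 0 as dynamic and declare that it always stays within [2, 1024].
torch._dynamo.mark_dynamic(x, 0, min=2, max=1024)
f(x)
f(torch.randn(64))   # reuses the dynamically compiled code
f(torch.randn(512))  # still within the declared bounds
```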
#### `maybe_mark_dynamic(tensor, dim)`
The {func}`torch._dynamo.maybe_mark_dynamic` function shares all properties
with {func}`torch._dynamo.mark_dynamic`
but does not fail if the size gets specialized. Use it for inputs shared by
multiple graphs or if the number of graphs does not converge to one for a specific
frame. For instance, in the example above, prefer {func}`torch._dynamo.maybe_mark_dynamic` because graphs
with sizes 0 and 1 will specialize. Use {func}`torch._dynamo.mark_dynamic` instead if you want a
guarantee that the dimension never specializes (it will raise an error rather than specialize).
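A minimal sketch of the same pattern with {func}`torch._dynamo.maybe_mark_dynamic`; the only difference from the previous example is that a later specialization of this dimension would not raise an error:

```python
import torch

@torch.compile
def f(x):
    return x * x.size()[0]

x = torch.randn(10)
# A hint rather than a hard requirement: if dimension 0 ends up specialized
# (for example, for sizes 0 or 1), compilation proceeds without an error.
torch._dynamo.maybe_mark_dynamic(x, 0)
f(x)
f(torch.randn(20))
f(torch.randn(30))
```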
#### `mark_unbacked(tensor, dim)`
The {func}`torch._dynamo.mark_unbacked` function marks a tensor dimension as unbacked. It is unlikely
to be the tool you need, but it can be useful when a specialization happens inside a
`guard_size_oblivious(...)` condition and marking the dimension unbacked removes it.
Make sure it actually removes the specialization and does not instead introduce a
data-dependent error (which turns into a graph break) at or before the location
you are trying to fix. The next option is often a better choice.
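For completeness, here is a minimal sketch of {func}`torch._dynamo.mark_unbacked` on a function that never branches on the marked size (if it did branch, you would get a data-dependent error rather than a specialization):

```python
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(10)
# Dimension 0 becomes an unbacked symbol: the compiler never sees its concrete
# value, so it cannot specialize on it.
torch._dynamo.mark_unbacked(x, 0)
f(x)
f(torch.randn(20))  # other sizes reuse the same graph
```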
(dynamic_sources_allow_list)=
#### Dynamic Allow List (`DYNAMIC_SOURCES`)
Use the environment variable `TORCH_COMPILE_DYNAMIC_SOURCES` to pass a comma-separated
list of source names to be marked as dynamic. For example:
`TORCH_COMPILE_DYNAMIC_SOURCES=L[x],L[y]`
It's easiest to find these dynamic source names using the PGO artifact in `tlparse`;
you can copy and paste them from there. This method works
for integers and tensor sizes and takes precedence over all other flags
that force static shapes. It will not throw an error if what is marked dynamic
gets specialized or if the provided source does not exist.
Here is an example:
```{code-cell}
import torch

@torch.compile()
def f(x):
    return x * x.size()[0]

with torch.compiler.config.patch(dynamic_sources="L['x']"):
    f(torch.rand(10))
    f(torch.rand(20))
    f(torch.rand(30))
    f(torch.rand(40))
```
(torch.compiler.set_stance_eager_then_compile)=
#### `torch.compiler.set_stance("eager_then_compile")`
At times, identifying the appropriate inputs to mark as dynamic can
be challenging. If you are willing to accept a performance cost for
the first batch, another convenient option is to use the
`eager_then_compile` stance, which automatically determines dynamic
inputs for you. For more information, see {func}`torch.compiler.set_stance` and [Dynamic Compilation Control with torch.compiler.set_stance](https://docs.pytorch.org/tutorials/recipes/torch_compiler_set_stance_tutorial.html).
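Here is a minimal sketch of this stance; the exact compilation points follow the behavior described in the linked tutorial (first call eager, later calls compiled with the changed dimensions dynamic):

```python
import torch

@torch.compile
def f(x):
    return x * x.size()[0]

torch.compiler.set_stance("eager_then_compile")
f(torch.randn(10))  # runs eagerly; input shapes are recorded
f(torch.randn(20))  # compiled, with the varying dimension treated as dynamic
f(torch.randn(30))  # reuses the compiled dynamic graph
```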
(torch_compile_dynamic_true)=
### `torch.compile(dynamic=True)` (Not recommended)
This setting forces all sizes and integers to be dynamic, which increases the
chance of encountering dynamic-shape bugs. It is not recommended because it is
error prone: making every input size dynamic can cause performance
regressions and ultimately increase compilation time.
PyTorch also provides advanced control options for dynamic shapes, see:
{ref}`dynamic_shapes_advanced_control_options`.
## Where Do I Go From Here?
If you encounter a framework code bug or an issue with specialization,
file an issue so it can be reviewed and potentially improved. If the issue
is within your user code, consider whether you are willing to rewrite your
code to avoid it. Determine if it affects correctness or if it's a redundant
check. If the issue involves a Triton custom kernel with a `constexpr`
argument, evaluate whether you can rewrite it to address the problem.
```{toctree}
:maxdepth: 1
compile/dynamic_shapes_core_concepts
compile/dynamic_shapes_troubleshooting
compile/dynamic_shapes_advanced_control_options
compile/dynamic_shapes_beyond_the_basics
```
```{seealso}
* [tlparse documentation](https://github.com/pytorch/tlparse)
* [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.fh8zzonyw8ng)
```


@@ -78,7 +78,6 @@ for tracking purposes -->
.. py:module:: torch.utils.data.graph
.. py:module:: torch.utils.data.graph_settings
.. py:module:: torch.utils.data.sampler
.. py:module:: torch.utils.debug_mode
.. py:module:: torch.utils.dlpack
.. py:module:: torch.utils.file_baton
.. py:module:: torch.utils.flop_counter
