Compare commits


162 Commits

Author SHA1 Message Date
3411990fa0 Fix compile 2025-08-19 12:16:29 +00:00
fc32f3d5eb Update aten/src/ATen/cpu/vec/vec128/vec128_float_neon.h 2025-08-18 14:43:19 +01:00
ef8f493676 Update aten/src/ATen/cpu/vec/sve/vec_float.h 2025-08-18 13:14:40 +01:00
92eaa3d3b8 Update aten/src/ATen/cpu/vec/vec128/vec128_float_neon.h 2025-08-18 13:11:16 +01:00
e0340e599e [feat]: Add optimization for SVE exp function
Signed-off-by: Analle Abuammar <analle.abuammar@arm.com>
Co-authored-by: Fadi Arafeh <Fadi.Arafeh@arm.com>
2025-08-15 15:47:00 +00:00
c3e4e4079e Fix tests 2025-08-15 15:15:48 +00:00
62f61292e3 add SVE dispatch 2025-08-15 15:15:36 +00:00
41cbceee59 Make size non-constexpr 2025-08-15 15:15:24 +00:00
46706e7c34 Vec length agnostic SVE Vectorized class POC 2025-08-15 15:15:05 +00:00
6662a76f59 [cutlass backend] Fix EVT tests post buf name change (#159541)
Differential Revision: [D79317791](https://our.internmc.facebook.com/intern/diff/D79317791/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159541
Approved by: https://github.com/mlazos
2025-07-31 16:39:49 +00:00
eqy
05aade1b6d [CUDA] Add serialTest decorator to largeTensorTest in test_cuda.py (#159271)
Hopefully helps with disabled tests due to OOM such as https://github.com/pytorch/pytorch/issues/159069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159271
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-07-31 16:27:16 +00:00
f946b25865 [MPS] Speedup argmax/argmin (#159524)
By using efficient `threadgroup_arg[max|min]` primitives.
- Fixed bug in `simd_argmax` when result of the `simd_ballot` were prematurely cast to `ushort` and adjusted unit test
- Fixed nan handling in compiled argmax, but can't reliably test it as the MPS (eager) implementation of argmax is buggy

Now, according to `bench_mps_ops.py`, `max(x, dim=0)` is reliably faster than the eager implementation:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159524
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #158990
2025-07-31 16:18:32 +00:00
d2e02585b8 [AOTI] Explicitly delete wait_tensor returned tensor (#159502)
Summary: In the Python wrapper codegen, the tensor returned from wait_tensor is not assigned or used anywhere, because wait_tensor always returns its input; see more discussion in https://github.com/pytorch/pytorch/issues/126773. Similarly, in the cpp wrapper codegen we should immediately delete the tensor handle returned from aoti_torch_cpu__c10d_functional_wait_tensor, otherwise it may extend the tensor's lifetime and even cause OOM in some cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159502
Approved by: https://github.com/yushangdi, https://github.com/jingsh
ghstack dependencies: #159476, #159487
2025-07-31 15:33:36 +00:00
3dd7ebf418 [BE] Fix buf name mismatch in test_c10d_functional_native.py (#159487)
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses regex matching to find buffer names at run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls; we can expand it later if needed.
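
A minimal sketch of that name-matching idea (illustrative, not the actual test helper):

```python
import re

# Hedged sketch: recover Inductor buffer names from the generated code by
# matching the left-hand side of empty_strided_* allocations instead of
# hard-coding names like "buf7".
generated = "buf3 = empty_strided_cuda((8, 8), (8, 1), torch.float32)"
buf_names = re.findall(r"(\w+) = empty_strided_\w+\(", generated)
print(buf_names)  # ['buf3']
```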

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
2025-07-31 15:33:36 +00:00
8273ee0646 [BE] Fix global config leak in test_c10d_functional_native.py (#159476)
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context.
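
A minimal sketch of the patch-context pattern, assuming `torch._inductor.config.patch` is used as the context manager (illustrative, not the exact test code):

```python
import torch

def f(x):
    return x + 1

# Scope the cpp_wrapper setting to this block instead of mutating the global
# config, so other tests in the same file still see the default value.
with torch._inductor.config.patch(cpp_wrapper=True):
    torch.compile(f)(torch.randn(4))
# cpp_wrapper reverts automatically on exit
```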
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
2025-07-31 15:33:36 +00:00
c57382a493 Move BFloat16.h to headeronly (#159412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412
Approved by: https://github.com/desertfire
2025-07-31 15:29:17 +00:00
e7cc42df58 [inductor] consolidate common GEMM triton param retrieval (#159383)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out common logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-07-31 13:05:04 +00:00
cyy
72c69e731f set MSVC debug information only on debug builds (#159533)
Fixes: https://github.com/pytorch/pytorch/issues/159515
This reduces the binary size increase in release builds by removing debug information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159533
Approved by: https://github.com/atalman
2025-07-31 12:57:33 +00:00
78b9dea754 [inductor] Fix set_linter's handling of f-strings for Python 3.12 and up (fix #159056) (#159252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159252
Approved by: https://github.com/Skylion007
2025-07-31 12:56:09 +00:00
838924436e update the baseline for nightly max_autotune tests (#154973)
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:

In the latest nightly test, two models require baseline updates:

- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.

```
vision_maskrcnn                     IMPROVED:           graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
    vision_maskrcnn
```

```
detectron2_fcos_r_50_fpn            XFAIL
detectron2_fcos_r_50_fpn            FAIL:               graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
    detectron2_fcos_r_50_fpn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
2025-07-31 11:38:55 +00:00
2ffb510942 [Break XPU][Inductor UT] Fix failures introduced by community. (#159463)
Fixes #159000, Fixes #159335, Fixes #159334, Fixes #159332, Fixes #159331, Fixes #159330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159463
Approved by: https://github.com/jansel
2025-07-31 08:37:41 +00:00
20b5f694f8 [Dynamo] Make frozen dataclasses hashable (#159529)
Fixes https://github.com/pytorch/pytorch/issues/159424
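
A hedged sketch of the kind of pattern this enables (the `Config` class below is hypothetical, not taken from the issue): hashing a frozen dataclass instance inside a compiled region.

```python
import torch
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:  # hypothetical example class
    dim: int

@torch.compile
def scale(x, cfg):
    table = {cfg: 2.0}  # requires hashing the frozen dataclass instance
    return x * table[cfg]

print(scale(torch.ones(3), Config(dim=3)))
```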

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159529
Approved by: https://github.com/oulgen
ghstack dependencies: #159513
2025-07-31 07:03:01 +00:00
447e300d55 [Dynamo] Frozen dataclass attr access test (#159513)
Verifies https://github.com/pytorch/pytorch/issues/159424, but perhaps the issue is not fixed yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159513
Approved by: https://github.com/oulgen
2025-07-31 07:03:01 +00:00
5b2ad9279c [draft export] logging (#159004)
Summary: adds logging for draft export

Test Plan:
loggercli stage actualize-stage TorchDraftExportUsageLoggerConfig

Rollback Plan:

Differential Revision: D78308105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159004
Approved by: https://github.com/angelayi
2025-07-31 05:52:13 +00:00
78d7f0cdec disable execution frame cleanup (#159531)
Summary: Want to disable execution frame cleanup until fix in D78621408 is merged

Test Plan:
CI

Rollback Plan:

Differential Revision: D79306602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159531
Approved by: https://github.com/SherlockNoMad
2025-07-31 05:02:36 +00:00
d5c719ec3c [inductor] fix open temp file failed on Windows. (#159342)
Fix a failure to open a temp file on Windows. Error message:
<img width="1181" height="239" alt="image" src="https://github.com/user-attachments/assets/e4a6f438-cb06-44c6-959b-0a6a49d2f44f" />

Here are two options to fix this issue (see https://stackoverflow.com/questions/66744497/python-tempfile-namedtemporaryfile-cant-use-generated-tempfile):
1. `tempfile.NamedTemporaryFile` must set `delete=False` on Windows (see the sketch after this list).
2. Use `WritableTempFile` to handle this case on Windows.
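
A minimal sketch of option 1 (illustrative; `WritableTempFile` from option 2 is an internal helper and not shown here):

```python
import os
import tempfile

# On Windows a NamedTemporaryFile cannot be reopened by name while it is still
# open unless delete=False is passed; clean it up manually afterwards.
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
    f.write("print('hello')\n")
    path = f.name
try:
    with open(path) as g:  # reopening by name now also works on Windows
        print(g.read())
finally:
    os.unlink(path)
```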

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159342
Approved by: https://github.com/jansel
2025-07-31 04:58:02 +00:00
c44efc3755 [Refactor] Fix Compile Warning: possibly dangling reference to a temporary (#159517)
```bash
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:25: warning: possibly dangling reference to a temporary [-Wdangling-reference]
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       |                         ^~~
DEBUG pytorch/torch/csrc/dynamo/compiled_autograd.h:1388:1: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<c10::IValue, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const c10::IValue&, c10::IValue>()’
DEBUG  1388 |     for (const at::IValue& elt : lst) {
DEBUG       | ^
```

This PR fixes this warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159517
Approved by: https://github.com/xmfan
2025-07-31 04:49:43 +00:00
6b9473469f [Graph Partition] add log for graph partition reasons and #partitions (#159425)
Previously, we log `skipping cudagraphs due to [xxx reasons]` when there are cudagraph-unsafe ops. With graph partition, we split off these ops and cudagraph the remaining parts, but that log message is then skipped as well.

In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
2025-07-31 04:21:06 +00:00
7a4167a164 support fabric handles with symmetric memory (#159319)
enable fabric handles for symmetric memory

Enables handle exchange via CU_MEM_HANDLE_TYPE_FABRIC on the systems that support it. This is needed to enable symmetric memory on NVLS72 systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159319
Approved by: https://github.com/malfet, https://github.com/kwen2501
2025-07-31 04:16:20 +00:00
8e67a6ae89 [vllm hash update] update the pinned vllm hash (#159320)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159320
Approved by: https://github.com/pytorchbot
2025-07-31 04:08:14 +00:00
c68ad1bd6a [dynamo][guards] Always record user.stack for informative tlparse guards (#159526)
Before
<img width="1146" height="280" alt="image" src="https://github.com/user-attachments/assets/4ddb11b2-dec8-4010-a28d-63b3cd4a7929" />

After
<img width="1248" height="248" alt="image" src="https://github.com/user-attachments/assets/8aafc5be-92cd-4468-bb8f-ad966de8c717" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159526
Approved by: https://github.com/Lucaskabela
2025-07-31 03:18:33 +00:00
3e5e094615 Revert "Fix large_tensor_test skipping cpu (#158617)"
This reverts commit debc0591b888f211bfe846bdc7cfa0626a5f6f6a.

Reverted https://github.com/pytorch/pytorch/pull/158617 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16631113381/job/47062415099) [HUD commit link](debc0591b8) ([comment](https://github.com/pytorch/pytorch/pull/158617#issuecomment-3138387762))
2025-07-31 02:57:22 +00:00
clr
c65efc8ea1 torch.compile: Record a pt2_compile_event for combo kernels (#159306)
This is off by default, but some jobs have it on. Having this show up in
perfetto and be globally queryable would be useful to see how expensive this
is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159306
Approved by: https://github.com/masnesral
2025-07-31 02:51:38 +00:00
a9049413e2 [dynamo] Turn on recursive dict tag optimization (#159186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159186
Approved by: https://github.com/jansel
2025-07-31 02:36:37 +00:00
d7a5ec9355 Fix the Doc of padding in avg_poolnd (#159142)
Fixes #159141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159142
Approved by: https://github.com/mikaylagawarecki
2025-07-31 02:02:48 +00:00
2c46922ce4 Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR, passes after).
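
A hedged sketch of the stride check such a test presumably makes (not the actual unit test):

```python
import torch

@torch.compile
def f(x):
    return torch.rand_like(x)

base = torch.randn(4, 6).t()        # non-contiguous, strides (1, 6)
out = f(base)
print(base.stride(), out.stride())  # expected to match after this fix
```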

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-31 01:36:50 +00:00
668d414ae7 [CPU] Fix bias dtype issue for FP8 qlinear (#159125)
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`

With bf16 autocast, the bias is converted to BFloat16, but fp8_qlinear_onednn_ref does not support a bf16 bias.
This PR converts the bias dtype accordingly in fp8_qlinear_onednn_ref.

Added this case to the unit tests; reproduce with:
`python test/test_quantization.py -k test_qlinear_fp8`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
2025-07-31 01:26:45 +00:00
4541509237 [Triton] [Inductor] Fix an incorrect descriptor (#159407)
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.

Test Plan:
Found by inspection.

Rollback Plan:

Reviewed By: NoamPaz

Differential Revision: D79178538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
2025-07-31 00:34:19 +00:00
6c7f88c2c9 Check addmm dtypes (#159509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159509
Approved by: https://github.com/eqy
2025-07-31 00:15:46 +00:00
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950.

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e3m4.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G  | M | N | K | Runtime Ms | GB/S | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
25c3a7e317 [CUDA][CUDA Graphs] Move cuda graphs test to subprocess to avoid polluting mempool tests (#159305)
Otherwise the mempool test will fail: the previous graph capture failed, but its state in the caching allocator is not fully cleaned up. See also #159301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159305
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng, https://github.com/naromero77amd
2025-07-30 23:31:38 +00:00
de7376537f Fix ep deepcopy when there is python builitin name (#159478)
Summary: title

Test Plan:
CI

Rollback Plan:

Differential Revision: D79261007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159478
Approved by: https://github.com/pianpwk
2025-07-30 23:14:31 +00:00
fd2c64e286 Fix duplicated sources in inductor provenance tracking (#159484)
Summary:

The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.

This also means that if there are two nested passes like:

```
with GraphTransformObserver(gm, "outer"):
      with GraphTransformObserver(gm, "inner"):
              .....
```

We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```

Rollback Plan:

Differential Revision: D79203058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
2025-07-30 23:03:11 +00:00
2b1ae29960 [Dynamo][Better Engineering] Add typing annotations to guard and source (#158397) (#159491)
Summary:
X-link: https://github.com/pytorch/executorch/pull/12986

As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`

Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```

|   | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta    | +990 | +9 | +44.43% | +155 | 0 | +42.82% |

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Rollback Plan:

Reviewed By: JacobSzwejbka, yangw-dev

Differential Revision: D79199389

Pulled By: Lucaskabela

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159491
Approved by: https://github.com/anijain2305, https://github.com/yangw-dev
2025-07-30 22:57:50 +00:00
1293405c8d [MPS] Add simd_[arg][max|min] (#158990)
And add eager tests for those.
Re-implement `threadgroup_[max|min]` using those functions, as they are significantly faster than before (though much slower than eager, due to the arg part), which can be verified by running the following script:
```python
import itertools
import timeit
import torch
from torch.utils.benchmark import Compare, Measurement, Timer

def bench_unary_op(func, x, label) -> Measurement:
    sync_cmd = "torch.mps.synchronize()" if "mps" in str(x.device) else ""
    t = Timer(
        stmt=f"f(x);{sync_cmd}",
        globals={"f": func, "x": x},
        language="python",
        timer=timeit.default_timer,
        sub_label=f"{func.__name__} ({str(x.dtype)})",
        description=label,
        env=torch.__version__,
    )
    return t.blocked_autorange()

def bench_reduction(
    reduction_func, device: str = "mps", dtype: torch.dtype = torch.float32
) -> list[Measurement]:
    rc = []

    # Bench 2D with reduction over dim=0
    def f(t):
        return reduction_func(t, dim=0)[0]

    f.__name__ = reduction_func.__name__
    f_c = torch.compile(f, dynamic=False, fullgraph=True)

    for size in (512, 1024, 2048, 4096):
        x = torch.testing.make_tensor(size, size, device=device, dtype=dtype)
        rc_c, rc_e = f(x), f_c(x)
        rc_c, rc_e = (rc_c[0], rc_e[0]) if isinstance(rc_c, tuple) else (rc_c, rc_e)
        rc.append(bench_unary_op(f, x, f"eager-{size}x{size}"))
        rc.append(bench_unary_op(f_c, x, f"compile-{size}x{size}"))
    return rc

def main() -> None:
    #dtypes = [torch.float16, torch.float32, torch.bfloat16, torch.int32, torch.int64]
    dtypes = [torch.float32, torch.int32, torch.int64]

    # Profile reduction ops
    rc = []
    for op, dtype in itertools.product([torch.max], dtypes):
        rc.extend(bench_reduction(op, dtype=dtype))
    Compare(rc).print()

if __name__ == "__main__":
    torch._dynamo.config.cache_size_limit = 2**16
    main()
```

Produces the following table before
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      297.3      |       531.6       |       394.1       |        2550.5       |       773.0       |        4904.7       |       3647.2      |        9682.0
      max (torch.int32)    |      297.8      |       359.2       |       387.7       |        1179.4       |       768.2       |        2175.0       |       3677.1      |        4495.9
      max (torch.int64)    |      278.7      |       541.4       |       410.2       |        2873.3       |       858.9       |        5620.4       |       6107.2      |       11176.1

Times are in microseconds (us).
```
And after
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      307.9      |       265.3       |       401.0       |        340.8        |       766.5       |        661.9        |       3463.5      |        2829.5
      max (torch.int32)    |      293.5      |       263.1       |       405.0       |        338.8        |       761.4       |        672.5        |       3050.0      |        2688.6
      max (torch.int64)    |      308.2      |       255.7       |       417.4       |        341.4        |       877.0       |        695.0        |       5812.2      |        5762.2

```

`argmax`/`argmin` are much trickier due to the nan-handling logic that needs to be added there.

Also fixes `torch.max/min` compilation for half-precision types, and adds regression tests for it.

This PR also introduces a bunch of helper functions, such as `simd_broadcast`, which works for int64, and the `c10::metal::pair` template, which are used by `simd_argmax` to return both the value and the index

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158990
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-07-30 21:57:25 +00:00
3a65ff84b6 [dynamo, easy] add comment on skipping sys.monitoring frames (#159493)
Add a comment so we know why we're doing this code (followup to https://github.com/pytorch/pytorch/pull/159369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159493
Approved by: https://github.com/azahed98, https://github.com/Lucaskabela, https://github.com/zou3519, https://github.com/jingsh
ghstack dependencies: #159369
2025-07-30 21:54:38 +00:00
acf13a9b75 Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend (#158903)
Fixes #158902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158903
Approved by: https://github.com/H-Huang
2025-07-30 21:44:29 +00:00
3a55676200 fix strategy hashing arg mismatch (#159506)
Reland https://github.com/pytorch/pytorch/pull/159289.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159506
Approved by: https://github.com/XilunWu
2025-07-30 21:37:13 +00:00
af39144a93 Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480)
Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead.
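
A minimal sketch of the API swap, assuming only that both attributes exist as described above:

```python
import torch

# The deprecated flag and its replacement; the cache key should be derived
# from fp32_precision rather than allow_tf32 going forward.
print(torch.backends.cuda.matmul.allow_tf32)      # deprecated
print(torch.backends.cuda.matmul.fp32_precision)  # preferred replacement
```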

Fixes https://github.com/pytorch/pytorch/issues/159440

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480
Approved by: https://github.com/xmfan, https://github.com/oulgen
2025-07-30 21:29:38 +00:00
25343b343e [ATen][CUDA][cuFFT] Guard against deprecated error codes (#159466)
This PR adds a guard based on CUDA version, per latest cuFFT [documentation](https://docs.nvidia.com/cuda/cufft/index.html#return-value-cufftresult):
>The following error codes are deprecated and will be removed in a future release: `CUFFT_INCOMPLETE_PARAMETER_LIST`, `CUFFT_PARSE_ERROR`, `CUFFT_LICENSE_ERROR`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159466
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/Skylion007
2025-07-30 21:10:32 +00:00
07fad04181 [ContextParallel][FlexAttention] Prototype of supporting FlexAttention in Context Parallel (#158692)
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.

This PR makes the following changes:

- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)

What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`

**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
2025-07-30 21:01:53 +00:00
7ac70ac4cd Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit a3a51282dbabe0220c2c3947a89f7d2ecc514d33.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to failed internal build Failed to load config ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3137796767))
2025-07-30 20:59:19 +00:00
e221a1c853 [Code Motion]Restructure flex attention kernel into flex subdirectory (#159437)
Mostly code motion, updating relative paths, moving some imports that had to be lazy before to top level scope now that we are free from the curse.

This will make it easier to add newer templates and provide some organization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159437
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/Skylion007
2025-07-30 20:12:35 +00:00
4defea1e2c [c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
Summary:
We found that we don't really set group_name correctly inside group_split, because we set group_name to `deviceTypeToBackend_`, which is set after `setBackend`. The same applies to group_desc. I added more unit tests for it.

We need to set the group name correctly, otherwise this will break the DeviceMesh use case when split_group is used in DeviceMesh.

Also, ncclx needs to be aware that its Option is a subclass of BackendOption.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
2025-07-30 19:55:55 +00:00
53d68b95de [ROCm CI] Migrate to MI325 Capacity. (#159059)
This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture gfx942 and our testing plans don't change within an architecture, so we pool them under the same label `linux.rocm.gpu.gfx942.<#gpus>` with this PR as well to reduce overhead and confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159059
Approved by: https://github.com/jithunnair-amd, https://github.com/atalman

Co-authored-by: deedongala <deekshitha.dongala@amd.com>
2025-07-30 19:47:59 +00:00
f74842d57f [PP] Fix zero bubble schedules for eval() (#159475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159475
Approved by: https://github.com/tianyu-l, https://github.com/Skylion007
2025-07-30 19:46:10 +00:00
644fee2610 Fix TestAutogradFallback flaky tests under Dynamo: migrate to lib._destroy() (#159443)
Under Dynamo, the libraries couldn't be properly cleared unless we manually ran `gc.collect()`, but that's slow. It also worked to just use the `_destroy()` method to tear them down.
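
A minimal sketch of that teardown pattern (the namespace and op below are hypothetical, not the actual test library):

```python
import torch

lib = torch.library.Library("my_test_ns", "FRAGMENT")  # hypothetical namespace
lib.define("foo(Tensor x) -> Tensor")
lib.impl("foo", lambda x: x.clone(), "CompositeExplicitAutograd")

print(torch.ops.my_test_ns.foo(torch.ones(2)))

# Tear down the registrations immediately instead of waiting for gc.collect().
lib._destroy()
```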

FIXES
#159398
#159349
#159254
#159237
#159153
#159114
#159040
#158910
#158841
#158763
#158735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159443
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 19:30:55 +00:00
7821fbc560 [BE] Clarify comment to not revert when command has been edited (#159495)
This is mostly a nit. I was a bit confused when I saw
<img width="1032" height="183" alt="image" src="https://github.com/user-attachments/assets/7a18f167-78c1-4c33-ba6f-3588914c642e" />
in https://github.com/pytorch/pytorch/pull/159172

So I decided I should clean up this message a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159495
Approved by: https://github.com/yangw-dev, https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-30 19:23:33 +00:00
73ee323380 [ONNX] RMS Norm (#159377)
- Implement rms norm using onnx RMSNormalization-23
- Use the correct eps for float32
  eaadd1282c/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L1844-L1866)
  <img width="743" height="107" alt="image" src="https://github.com/user-attachments/assets/a6fd45aa-01d9-4667-924d-3012232cfcde" />

- Created facility to run tests with the reference runtime by extending ONNXProgram and assert_onnx_program.

Fix https://github.com/pytorch/pytorch/issues/159257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159377
Approved by: https://github.com/titaiwangms
2025-07-30 18:55:47 +00:00
176c6446f8 Update CODEOWNERS for ONNX (#159390)
Update CODEOWNERS for ONNX to reflect current maintainers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159390
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-07-30 18:54:25 +00:00
debc0591b8 Fix large_tensor_test skipping cpu (#158617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158617
Approved by: https://github.com/BoyuanFeng
2025-07-30 18:48:07 +00:00
0df78f0c11 Remove /d2implyavx512upperregs- flag (#159431)
And reopen https://github.com/pytorch/pytorch/issues/145702

As this flag is not documented anywhere, slows down sccache-accelerated builds, and per https://developercommunity.visualstudio.com/t/Invalid-code-gen-when-using-AVX2-and-SSE/10527298#T-N10562579 does not work around a compiler bug, but rather disables some optimizations of AVX512 instructions that are invoked in the AVX2 codepath

Fixes https://github.com/pytorch/pytorch/issues/159082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159431
Approved by: https://github.com/clee2000
2025-07-30 18:47:03 +00:00
d0e8a0ec4c Add CPython test for heapq (#159370)
Not used directly but used internally by `collections.Counter`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159370
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2025-07-30 18:43:06 +00:00
22492848b6 [BE]: Update CUTLASS submodule to 4.1.0 (#158854)
Update the CUTLASS submodule to the latest version with new supported architectures and new features we can use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158854
Approved by: https://github.com/henrylhtsang
2025-07-30 17:44:38 +00:00
5c14315b05 fixed typo error (#159451)
Fixes #159375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159451
Approved by: https://github.com/albanD
2025-07-30 17:41:30 +00:00
1b99c1859c [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from  the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-30 17:29:43 +00:00
435edbcb5d [Graph Partition] add graph partition doc (#159450)
This pr adds doc for graph partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450
Approved by: https://github.com/eellison
2025-07-30 17:01:10 +00:00
6c6e11c206 Revert "Fix max_width computation in _tensor_str._Formatter (#126859)"
This reverts commit 1465757959dd7e63715b7621650896eca977aefa.

Reverted https://github.com/pytorch/pytorch/pull/126859 on behalf of https://github.com/yangw-dev due to broke trunk with test  distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single - RuntimeError: Expected to find buf7 = empty but did not find it ([comment](https://github.com/pytorch/pytorch/pull/126859#issuecomment-3137137030))
2025-07-30 16:56:32 +00:00
a775c8e73e [Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446)
Hi team,

Please help review this patch.

This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but rather a C call event loss bug in Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.

So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.

There are ways to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
2025-07-30 16:35:51 +00:00
24d07b3a67 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
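
A hedged sketch of the idea behind the fix, assuming the decomposition only needs the flattened size (the helper name below is hypothetical):

```python
import functools
import operator

def flat_size(shape):
    # A single product keeps SymInt-valued sizes symbolic, whereas looping and
    # branching on each dimension can force them to concrete values.
    return functools.reduce(operator.mul, shape, 1)

print(flat_size((3, 1)))  # 3; works the same for symbolic shapes
```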

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-30 16:34:15 +00:00
90fd06be71 Various bugfixes for running NanoGPT training (#159166)
Fix various small bugs with running nanogpt on torchbenchmark in OSS under python 3.10. After these changes, the following now succeeds:

```
tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --training --backend inductor  --caching-precompile --warm-start-latency
```

Cold start: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp12LuZ5/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm start (we are investigating the recompile):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpT5YTB2/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159166
Approved by: https://github.com/zhxchen17
2025-07-30 16:30:22 +00:00
002f18807e [DCP] Improve error handling for process based async checkpointing (#159374)
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79177410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
2025-07-30 16:25:28 +00:00
259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
ee343ce60c [RPC][TensorPipe] Fix import torch if compiled without TensorPipe (#159461)
This is a follow up on the PR #154382, as the issue still persists:
```
  File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
    from . import api, backend_registry, functions
  File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
    from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
  File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
    from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
2025-07-30 16:04:02 +00:00
ea5369113a unflatten closure (#159418)
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.
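
A hedged sketch of the stack property described above (an illustrative helper, not the actual `unflatten` code): keep only entries where each FQN is a dot-prefix of the next and drop the rest.

```python
def restore_stack_property(fqns):
    kept = []
    for fqn in fqns:
        if not kept or kept[-1] == "" or fqn.startswith(kept[-1] + "."):
            kept.append(fqn)
        # else: drop the entry; the previous FQN keeps its whole flat subgraph
    return kept

print(restore_stack_property(["", "mod", "mod.sub", "other"]))
# ['', 'mod', 'mod.sub'] -- 'other' breaks the stack property and is dropped
```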

Test Plan:
added test, updated another

Rollback Plan:

Differential Revision: D79204669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
2025-07-30 15:42:18 +00:00
b268f22ab2 Move Float4 to headeronly (#159414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414
Approved by: https://github.com/desertfire
2025-07-30 15:34:01 +00:00
52a52d1b78 [dynamo][guards] Skip no tensor aliasing guard on inbuilt nn module buffers (#159453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159453
Approved by: https://github.com/jansel
2025-07-30 15:31:07 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
1465757959 Fix max_width computation in _tensor_str._Formatter (#126859)
The previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values, leading to a weird discrepancy.

Now the code first checks whether it should be in sci_mode, then computes `max_width`.

Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])

print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")

print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```

In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([   10.0000,     0.1000,     0.0100]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```

One can see that with `sci_mode=False`, the values of A are padded with unneeded zeros and A does not have the same `max_width` as B (it keeps the `max_width` from `sci_mode=None`).

Also, with `sci_mode=True`, for B the `max_width` is 7 but each value takes 10 chars. (This happens to be fine, as the code that uses `max_width` does not rely much on it, but it is still misleading.)

After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000,  0.1000,  0.0100]) Formatter max_width: 7
tensor([10.0000,  0.1000,  0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```

This also aligns A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
2025-07-30 14:01:00 +00:00
17b9c618dd [a2av] not returning out tensor from ops (#159435)
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```

This PR selects option (1).
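
A minimal sketch of option (1) with a hypothetical custom op: the mutated output carries the alias annotation on the argument and the op returns nothing, so there is no aliased return to functionalize.

```python
import torch

lib = torch.library.Library("demo_a2av", "FRAGMENT")  # hypothetical namespace
lib.define("fill_out(Tensor input, Tensor(a!) out) -> ()")

def fill_out_impl(input, out):
    out.copy_(input)  # mutate `out` in place; nothing is returned

lib.impl("fill_out", fill_out_impl, "CompositeExplicitAutograd")

x, y = torch.arange(4.0), torch.empty(4)
torch.ops.demo_a2av.fill_out(x, y)
print(y)  # tensor([0., 1., 2., 3.])
```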

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
2025-07-30 08:30:25 +00:00
d3ce45012e Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
2025-07-30 06:37:15 +00:00
1fc010a9d8 Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-30 06:37:15 +00:00
dfacf11f66 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-30 06:37:06 +00:00
c8cf811995 Enable AcceleratorAllocatorConfig key check (#157908)
# Motivation
Add a mechanism to raise an error if an unrecognized key is found in the allocator config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
2025-07-30 06:36:56 +00:00
914b1a3873 Introduce AcceleratorAllocatorConfig as the common class (#149601)
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
A new class, `ConfigTokenizer`, is introduced to help process the environment-variable config key-value pairs.

## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]` (see the sketch after this list).
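
A hedged Python sketch of that format (the real `ConfigTokenizer` is C++; this only illustrates the documented syntax):

```python
import re

def parse_alloc_conf(conf: str) -> dict:
    # "key:value" pairs separated by commas; list values wrapped in [...]
    out = {}
    for m in re.finditer(r"(\w+)\s*:\s*(\[[^\]]*\]|[^,]+)", conf):
        key, val = m.group(1), m.group(2).strip()
        out[key] = [v.strip() for v in val[1:-1].split(",")] if val.startswith("[") else val
    return out

print(parse_alloc_conf("key1:123, key2:[val1,val2]"))
# {'key1': '123', 'key2': ['val1', 'val2']}
```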

## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.

Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
2025-07-30 06:36:46 +00:00
7eb5fdb358 [dynamo][guards] Recursive dict tag optimization (#159183)
Design doc here - https://docs.google.com/document/d/1W29DrWID5miGWlZXspsQVN5U0zydE3kjZpziOXrhuaY/edit?tab=t.0#bookmark=id.sba04iw9sp68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159183
Approved by: https://github.com/jansel
2025-07-30 06:01:32 +00:00
f1fb57d854 Add user annotation for FX graph cache key (#159318)
Summary: AI system co-design team requested to add user annotation for FX graph cache key in PyTorch Kineto trace and Execution trace. With this annotation, they can know the FX graph to which the kernels belong.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Rollback Plan:

Differential Revision: D79019069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-07-30 05:52:50 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
e785c087c5 [audio hash update] update the pinned audio hash (#159321)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159321
Approved by: https://github.com/pytorchbot
2025-07-30 04:35:01 +00:00
d214901133 Add a title to distributed._dist2.md (#159385)
Sphinx likes titles and complains when they are not there, so adding a title to address this warning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
2025-07-30 04:09:41 +00:00
96ac64d00c Migrate easy q(u)int/bits stuff to torch/headeronly (#159302)
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.

It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-30 03:41:27 +00:00
46d34d6766 (should_fold) gso to guard_or_false when checking folding whether to 3d bmm into 2d mm (#159184)
Switch from guard_size_oblivious to guard_or_false: where a DDE would be raised, we now simply avoid folding this 3d bmm into a 2d mm (a minimal sketch of the pattern follows the error traces below).

806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)

## DDE
```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
    if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4472 in should_fold)
```

```
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
    elif should_fold(tensor1, tensor2, is_out):
  File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
    return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)).  (Size-like symbols: none)

Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
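
For readers unfamiliar with the helpers involved, a minimal, hedged sketch of the pattern; `guard_or_false` is assumed importable from `torch.fx.experimental.symbolic_shapes` (the module named in the traces), and the function below is illustrative rather than the actual `should_fold` code.

```python
# Hedged sketch of the guard pattern, not the real should_fold.
import torch
from torch.fx.experimental.symbolic_shapes import guard_or_false

def is_provably_empty(t: torch.Tensor) -> bool:
    # Before: guard_size_oblivious(t.numel() == 0) raises
    # GuardOnDataDependentSymNode when numel depends on an unbacked SymInt.
    # After: guard_or_false answers False ("not provably empty") instead, so
    # the caller simply declines the 3d-bmm -> 2d-mm fold rather than erroring.
    return guard_or_false(t.numel() == 0)

print(is_provably_empty(torch.empty(0)))    # True  (statically empty)
print(is_provably_empty(torch.ones(2, 3)))  # False (statically non-empty)
```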

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
2025-07-30 03:12:14 +00:00
clr
880249adbc dynamo: handle AttributeErrors from nn_module when infer_parameters throws. (#158501)
This only handles AttributeError, but in general any exception coming from
here is a user exception. Let me know if we would prefer to catch all exceptions and then reraise them as observed exceptions.

```
 File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
    initialize_lazy_module(tx, mod, args, kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
    mod._infer_parameters(mod, fake_args, fake_kwargs)
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
  ...,
  File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```

Note that we crash with a slightly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
    exp_res = opt_m(x, y)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
    return super().__call__(*args, **kwargs)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
    self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
    module.initialize_parameters(*args, **kwargs)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
    self.foo += 1
    ^^^^^^^^
  File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
        f"'{type(self).__name__}' object has no attribute '{name}'"
    )
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```
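
A hedged, self-contained repro sketch of this failure mode; the class name mirrors the test in the trace above, and the exact exception type surfaced depends on this change and on how the module is invoked.

```python
# Hedged repro sketch: a lazy module whose initialize_parameters touches an
# attribute that was never set.
import torch
import torch.nn as nn
from torch.nn.modules.lazy import LazyModuleMixin

class LazyModuleBadInferParams(LazyModuleMixin, nn.Module):
    def __init__(self):
        super().__init__()
        self.param = nn.UninitializedParameter()

    def initialize_parameters(self, x):
        self.foo += 1  # 'foo' was never assigned -> AttributeError in __getattr__
        self.param.materialize(x.shape)

    def forward(self, x):
        return x * self.param

mod = torch.compile(LazyModuleBadInferParams())
try:
    mod(torch.randn(2, 2))
except Exception as e:  # surfaced as a user error rather than a dynamo crash
    print(type(e).__name__, e)
```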

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-07-30 02:41:41 +00:00
846ada4973 [AOTI] disable crashed AOTI UTs on Windows. (#159427)
Disable the crashing AOTI UTs on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159427
Approved by: https://github.com/angelayi
2025-07-30 02:23:27 +00:00
badd0618e4 Remove unused parameter on CUDA AllocParams (#159159)
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `AllocParams` is defined in a .cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-30 02:05:25 +00:00
a753a72b14 [BE] Modify PyObjectSlot to assume only a single interpreter is in use (#158407)
This PR makes some of the less risky changes to PyObjectSlot, since there is a lot of code we no longer need now that there is only one interpreter. Specifically, `check_interpreter` and `has_pyobj_nonhermetic` are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
2025-07-30 01:36:03 +00:00
b57d1ef110 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR surfaces some mypy errors, but I believe those were already present in the code base beforehand and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
2025-07-30 01:36:03 +00:00
dd7c996d5c [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
2025-07-30 01:36:03 +00:00
70d2e9ba45 [MPS] Avoid outputting zeros from exponential_ for MPS (#159386)
Fixes #159103
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159386
Approved by: https://github.com/malfet
2025-07-30 00:20:31 +00:00
eqy
62f98dbb44 [CUDA][Convolution] Add tf32_on_and_off decorator to test_deconv_freezing_cuda (#159280)
Blackwell seems to select TF32 kernels for this case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159280
Approved by: https://github.com/zou3519, https://github.com/jingsh, https://github.com/Skylion007
2025-07-29 23:44:10 +00:00
e288c258f7 Revert "Remove tensorexpr tests (#158928)"
This reverts commit d742a2896c571a535003d5928fe80397325575a5.

Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/yangw-dev due to this breaks bunch of internal dependency since some tests are still using the deleted test files from this pr, the internal reviewer please help fix this using codev ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3134378616))
2025-07-29 23:32:07 +00:00
df58db8831 [dynamo, docs] add recompilation, observability, reporting issues docs (#159062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062
Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305
2025-07-29 23:23:51 +00:00
15bb81ea4f [2/N][CI] Remove MacOS-13 workarounds from tests (#159304)
Part of https://github.com/pytorch/pytorch/issues/159275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #159277, #159278
2025-07-29 23:12:13 +00:00
8d37073bac [ROCm] Update jit_utils.cpp trait modification based on HIP version. (#159292)
The MI355 CI is regressing: hiprtc kernel compilation is failing due to duplicate definitions of traits, leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Checking the HIP version instead of the ROCm version should help resolve this, since the ROCm and HIP versions aren't synced; the ROCm 7.0 Alpha build used in CI is still on HIP 6.5.

Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292

Also, this PR increases the frequency of the MI355 CI to twice a day so that we can catch and identify regressions more easily for now.

Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
2025-07-29 22:45:27 +00:00
dc286aef61 Fused RMSNorm Housekeeping (#159317)
Small PR to address comments made on the original fused RMSNorm PR that were not addressed before it landed.

Changes:
- Warning message when input.dtype doesn't match weight.dtype
- Ensure default epsilon value is correct

Comments:
https://github.com/pytorch/pytorch/pull/153666#discussion_r2114735005
https://github.com/pytorch/pytorch/pull/153666#discussion_r2223518064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159317
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy
2025-07-29 22:39:18 +00:00
b4619f0272 Pin Helion to 0.0.10 in PyTorch CI (#159420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159420
Approved by: https://github.com/aorenste, https://github.com/malfet
2025-07-29 22:06:50 +00:00
477c2273e1 [dynamo] better way to skip tracing sys.monitoring callables (#159369)
Better approach to https://github.com/pytorch/pytorch/pull/158171, according to https://github.com/python/cpython/issues/137178#issuecomment-3131617493.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159369
Approved by: https://github.com/Skylion007
2025-07-29 21:54:58 +00:00
2176d481c1 [DTensor] dispatch to sharding prop over decomps (#159324)
Fixes #159110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159324
Approved by: https://github.com/ezyang
2025-07-29 21:28:36 +00:00
b97274e8ac [iter] Raise TypeError if iter arg cannot be iterable (#158410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158410
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371, #156416, #156460
2025-07-29 21:24:21 +00:00
f9be65cea4 [iter] Wrap iter(..) call in a ObjectIteratorVariable (#156460)
This object keeps track when the iterator is exhausted (raise Stopiteration).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156460
Approved by: https://github.com/zou3519
ghstack dependencies: #156371, #156416
2025-07-29 21:24:20 +00:00
4e3e3dc0a7 [iter] support iter(callable, sentinel) (#156416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156416
Approved by: https://github.com/XuehaiPan, https://github.com/zou3519
ghstack dependencies: #156371
2025-07-29 21:24:20 +00:00
fcf59df2b6 [iter] Add support for sequence protocol in iter(..) (#156371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156371
Approved by: https://github.com/zou3519
2025-07-29 21:24:20 +00:00
1bcb2f41e0 [BE] Eliminate workspace info in templates with new API (#159055)
Summary: Moves the workspace info calculations to the old TMA API.

Test Plan:
NFC

Rollback Plan:

Differential Revision: D78904434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159055
Approved by: https://github.com/NikhilAPatel
2025-07-29 21:22:36 +00:00
8460131087 [nativert] Add OSS version of ModelRunner (#159268)
Summary: Implement a ModelRunner from scratch with the minimum features for OSS only

Test Plan:
test_export -r NativeRT

Rollback Plan:

Differential Revision: D78979812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159268
Approved by: https://github.com/dolpm
2025-07-29 21:08:14 +00:00
c0c24b61ff Revert "Partitioner: Fix to align partition node order with original graph (#157892)"
This reverts commit 2d1e92307d3e67622f4fe8058d62e44fe4fa2f4e.

Reverted https://github.com/pytorch/pytorch/pull/157892 on behalf of https://github.com/yangw-dev due to fails internal tests : [executorch/backends/xnnpack/partition/xnnpack_partitioner.py:101:24] Incompatible parameter type [6]: In call `Partition.__init__`, for argument `nodes`, expected `Optional[Iterable[Tuple[Node, Optional[int]]]]` but got `dict_keys[Node, str]`. ([comment](https://github.com/pytorch/pytorch/pull/157892#issuecomment-3134004881))
2025-07-29 20:41:45 +00:00
4fac43b21f [BE] Move _freeze.py to torch/fb/utils (#159307)
Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary.

Test Plan:
It's a targets change so CI should suffice

Rollback Plan:

Differential Revision: D78910653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
2025-07-29 20:07:17 +00:00
b794e77b7b Disable cudagraph GCs by default (#158649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158649
Approved by: https://github.com/eellison
ghstack dependencies: #158193
2025-07-29 19:56:11 +00:00
d987a6f7f0 Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397)"
This reverts commit abcb24f4de11f8fedf2c2c9ff53b6092ef42306d.

Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))
2025-07-29 19:49:40 +00:00
5d93127c87 Revert "[HOP, map] Rework of map autograd to the new interface (#153343)"
This reverts commit 24b1f10ca13d682430725c511812e43a35fcd6a6.

Reverted https://github.com/pytorch/pytorch/pull/153343 on behalf of https://github.com/yangw-dev due to a older pr this pr dependes on needed to revert, rebase it after it's in ([comment](https://github.com/pytorch/pytorch/pull/153343#issuecomment-3133816812))
2025-07-29 19:46:42 +00:00
a3a51282db Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-07-29 19:26:20 +00:00
e557b3d5e5 Revert "[inductor] Fix mm decomposition evaluating symints (#158998)"
This reverts commit 52e180c3799a7638ee668b1291a711865ab8cfec.

Reverted https://github.com/pytorch/pytorch/pull/158998 on behalf of https://github.com/yangw-dev due to it broke trunk with pr_time_benchmark test  ([comment](https://github.com/pytorch/pytorch/pull/158998#issuecomment-3133696775))
2025-07-29 19:04:11 +00:00
f3a9e99036 Fix inductor cuda sort nan behavior (#159308)
Fix for https://github.com/pytorch/pytorch/issues/152423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159308
Approved by: https://github.com/isuruf
2025-07-29 19:02:45 +00:00
f7d6e9f500 [dynamo][guards] More small guard optimizations (#159345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159345
Approved by: https://github.com/williamwen42
ghstack dependencies: #159288
2025-07-29 18:36:49 +00:00
e43e09e6c1 [dynamo][guards] Use lambda guards for object aliasing to improve object aliasing guards (#159288)
# Note - On Lambda guarding of object aliasing
        # We previously installed object‑aliasing guards as relational guards,
        # but that undermined the recursive‑dict guard optimization: placing the
        # aliasing guard at a leaf prevented the parent dict node from
        # qualifying as a recursive‑dict guard root. Because aliasing guards are
        # rare, we now emit them as epilogue guards via a small Python lambda.
        # This repeats the access in Python—adding a bit of work—but the
        # overhead is outweighed by the gains from enabling recursive‑dict guard
        # optimization.
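
Purely for illustration, a hedged sketch of what an aliasing check emitted as a small Python lambda amounts to; this is not dynamo's actual generated guard code, and the `f_locals` keys are made up.

```python
# Hedged sketch: instead of a relational leaf guard inside the guard tree, the
# aliasing condition is re-checked by a tiny epilogue lambda that repeats the
# accesses in Python.
from types import SimpleNamespace

aliasing_guard = lambda f_locals: f_locals["a"].weight is f_locals["b"].weight

shared = object()
print(aliasing_guard({"a": SimpleNamespace(weight=shared),
                      "b": SimpleNamespace(weight=shared)}))  # True
```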

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
2025-07-29 18:36:49 +00:00
2004f8aa10 FXConverter handling of generic output in inductor fallback kernel (#159002) (#159297)
Summary:

A fallback kernel's output may not be a list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such a case.

Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.

Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
    line.codegen_fx(self)(line)
  File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
    inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
 (Full error paste in P1878839403)

With this diff the error is no longer present.

Rollback Plan:

Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
2025-07-29 18:29:01 +00:00
31b3b38e3a Ensure export joint with descriptors + compile works (#159337)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159337
Approved by: https://github.com/wconstab
ghstack dependencies: #159336
2025-07-29 17:43:52 +00:00
2f0db0444e Track previous MetricsContext edits for ease of debugging. (#159336)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159336
Approved by: https://github.com/wconstab
2025-07-29 17:43:52 +00:00
6162e650b0 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a number of things here since a bunch of older files are touched. It may be better to fix these in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-29 17:40:49 +00:00
5d89634ca8 Graph break with error message (#158800)
Fixes #157452

Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```

### Release Notes

Change to nn.Parameter Constructor Behavior in Dynamo

A semantic change is introduced in the nn.Parameter constructor: previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. That approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source; users are now advised to manually move the constructor out of the graph in such cases (a minimal sketch follows). This change improves clarity and reduces complexity in graph construction and debugging. If moving the constructor is not feasible, users can escape-hatch back to the old semantics with `torch.dynamo.config.graph_break_on_nn_param_ctor=False`.
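
A minimal, hedged sketch of both options mentioned above; the config module path (`torch._dynamo.config`) is an assumption, with the flag name taken from the release note.

```python
# Hedged sketch of both workarounds described in the release note.
import torch

# 1) Construct the nn.Parameter outside the compiled region and pass it in.
w = torch.nn.Parameter(torch.randn(4, 4))

@torch.compile(fullgraph=True)
def f(x, w):
    return x @ w

print(f(torch.randn(2, 4), w).shape)

# 2) Or opt back into the old lifting semantics (escape hatch; assumed path):
# torch._dynamo.config.graph_break_on_nn_param_ctor = False
```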

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
2025-07-29 17:34:49 +00:00
52e180c379 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead (a hedged sketch of the idea follows). A benchmark is currently running at https://hud.pytorch.org/benchmark/compilers
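
A hedged sketch of the underlying idea, with illustrative names rather than the actual decomposition code.

```python
# Hedged sketch: reducing the shape with a product keeps SymInts symbolic,
# whereas something like range(symint) in a Python loop forces them to
# concrete values (specialization).
import functools
import operator

def symbolic_numel(shape):
    # Builds a single (possibly symbolic) product expression over the sizes.
    return functools.reduce(operator.mul, shape, 1)

print(symbolic_numel((3, 1)))  # 3 with concrete ints; stays symbolic under tracing
```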

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-29 17:29:38 +00:00
c55e72bea1 [Re-land][Inductor] Support native Inductor as backend for MTIA (#159211)
The previous [diff/PR](https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error:
<img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" />
I didn't add the docstring because I thought I wasn't supposed to add a docstring for an EXISTING function.

So this diff/PR is an exact copy of the previous one, except for adding the docstring.

-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (Triton kernel + Python wrapper code) similar to CUDA, and the Triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA-specific graph optimizations and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2025-07-29 17:03:24 +00:00
750348b579 [NativeRT] Clean up use of TargetDevice in KernelFactory (#159298)
Summary:
Remove use of targetDevice in KernelFactory.

AOTI will infer the device when creating AOTIDelegateExecutor.

Test Plan:
CI

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D79007317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
2025-07-29 16:24:33 +00:00
52b9af163c Add avg_pool3d for MPS (#158877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158877
Approved by: https://github.com/malfet
2025-07-29 15:22:22 +00:00
f4bfac11c7 [Precompile] [easy] API For Editable PrecompileCacheArtifacts (#158586)
This adds an option for backend precompile artifacts to be *editable*, i.e., they are not serialized right away; instead, a Callable edit_fn can be applied to them first.

This allows us, in the next PR, to support editing the precompile artifact with updated autotune results at a later time. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(); it takes the new autotuning results at runtime and saves only the necessary Triton kernels.

This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.
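
For intuition only, a generic, hedged sketch of the editable-artifact idea described above; `EditableArtifact`, its methods, and the payload shape are made-up names for illustration, not the actual precompile API.

```python
# Generic, hedged sketch of an "editable" artifact.
import pickle
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditableArtifact:
    payload: dict

    def edit(self, edit_fn: Callable[[dict], dict]) -> None:
        # Defer serialization: callers may rewrite the payload first,
        # e.g. swapping in fresher autotune results.
        self.payload = edit_fn(self.payload)

    def serialize(self) -> bytes:
        return pickle.dumps(self.payload)

art = EditableArtifact({"kernels": ["k_a", "k_b", "k_c"]})
art.edit(lambda p: {**p, "kernels": [k for k in p["kernels"] if k != "k_b"]})
print(len(art.serialize()))
```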

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
2025-07-29 14:53:21 +00:00
8d00833fdb [PP] Fix eval step under no_grad() (#159293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159293
Approved by: https://github.com/tianyu-l, https://github.com/wconstab
2025-07-29 14:42:33 +00:00
de529ef002 [ONNX] onnx.md to simplify deprecated entities (#159312)
Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312
Approved by: https://github.com/titaiwangms
2025-07-29 14:24:17 +00:00
61aa2ae20f Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341)"
This reverts commit e469414b59ceeaae2860e36708de8852b9892776.

Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))
2025-07-29 13:56:20 +00:00
9d32aa9789 Help fix numpy detection in cross compiled layouts (#137084)
We had trouble at conda-forge getting numpy detected on aarch64 due to our splayed layout and cross-compilation needs.

see:
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/256
* https://github.com/conda-forge/pytorch-cpu-feedstock/issues/266
* https://github.com/conda-forge/pytorch-cpu-feedstock/pull/267

This is my attempt at making an "upstreamable patch" that tries to follow your structure.

It could introduce a new environment variable `Python_NumPy_INCLUDE_DIR` if you want, but CMake doesn't use it as an environment variable, so I feel like that would be weird.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137084
Approved by: https://github.com/atalman
2025-07-29 12:08:56 +00:00
5cf77a0ea2 Fix redistribution costs for slice_scatter (#159223)
We were previously assuming that `input_strategy == src_strategy`, which is not true in all cases.

This PR should fix that.

On the side, I also realized that for `slice_scatter` some DTensorSpecs don't have TensorMeta, e.g., https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L524

It would be good to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159223
Approved by: https://github.com/ezyang, https://github.com/wconstab
2025-07-29 12:00:39 +00:00
efcf87654e [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-29 08:05:56 +00:00
2523e58781 unbacked handling for view_copy (#159244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159244
Approved by: https://github.com/bobrenjc93
2025-07-29 07:10:46 +00:00
222fa451a2 Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-29 05:43:53 +00:00
6de24135e5 Fix flaky test_inductor_multiple_specializations (#159264)
Summary: This test was using do_bench, so it was flaky because performance is non-deterministic.

Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled

Rollback Plan:

Differential Revision: D79098692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
2025-07-29 05:16:55 +00:00
27ae72036d [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable (#159276)
Differential Revision: [D79106238](https://our.internmc.facebook.com/intern/diff/D79106238/)

This is in prep for cutlass upgrade.

More context: https://github.com/NVIDIA/cutlass/issues/2487

Tested in https://github.com/pytorch/pytorch/pull/159115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159276
Approved by: https://github.com/adamomainz, https://github.com/njriasan, https://github.com/Skylion007
2025-07-29 04:40:24 +00:00
e924df23a6 [NativeRT] Strengthen matcher check for StaticDispatch kernel (#159187)
Summary:
Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU.

Previously, we only checked that the output tensor is on CPU, which misses the case where we do a device-to-host aten._to_copy.

This prepares for turning on static dispatch kernels by default.

Test Plan:
I should add some tests before landing.

Rollback Plan:

Differential Revision: D78747600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
2025-07-29 04:03:49 +00:00
67e68e0785 [c10d] Cleanup split_group logic using the newly built splitGroup (#158488)
With https://github.com/pytorch/pytorch/pull/157716 merged, we want to further clean up the code on the Python side for the `split_group` API. We do need to keep some old global bookkeeping for BC; the rest of the logic is now all in C++. Regarding the change brought in https://github.com/pytorch/pytorch/pull/152175, we did clean up in https://github.com/pytorch/pytorch/pull/158790 (including internal changes) so that we can safely remove it.

Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488
Approved by: https://github.com/d4l3k
2025-07-29 03:27:11 +00:00
775788f93b [BE][PYFMT] migrate PYFMT for test/[i-z]*/ to ruff format (#144556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144556
Approved by: https://github.com/ezyang
2025-07-29 03:26:09 +00:00
19ce1beb05 [AOTInductor] Add test for enabling CUDACachingAllocator for AOTInductor's Weight (#159279)
Summary:
Add test for enabling CUDACachingAllocator for AOTInductor's Weight.
Implementation TBD

Test Plan:
N/A, commit is adding a test.

Rollback Plan:

Differential Revision: D79107507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159279
Approved by: https://github.com/desertfire, https://github.com/jingsh
2025-07-29 02:52:10 +00:00
a91ddea61f Add CPython tests for collections module (#158950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158950
Approved by: https://github.com/zou3519
2025-07-29 02:24:27 +00:00
ffccb90ff4 [dynamo, docs] add fullgraph=False docs (#159050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050
Approved by: https://github.com/svekars, https://github.com/anijain2305
ghstack dependencies: #157985, #158055, #158531
2025-07-29 01:53:47 +00:00
f916f34739 [dynamo, docs] non-strict programming model docs (#158531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531
Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #157985, #158055

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:47 +00:00
c32994ce4b [docs, dynamo] add fullgraph=True, common graph breaks docs (#158055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055
Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305
ghstack dependencies: #157985

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:41 +00:00
433e43cbec [dynamo, docs] programming model dynamo core concepts (#157985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985
Approved by: https://github.com/svekars, https://github.com/anijain2305
2025-07-29 01:53:34 +00:00
e469414b59 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`, so when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-07-29 01:14:50 +00:00
657e5e9aa6 All custom operators go through Inductor's graph.call_function (#159174)
Fixes #158892

All custom operators should go through the graph.call_function path. The other fallback path is for aten/prim operations that lack support for certain things (like torch.float8_e8m0fn).
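
A hedged sketch of the kind of user-defined custom operator this change routes through `graph.call_function`; the op name and body below are illustrative.

```python
# Hedged sketch: a user-defined custom op compiled with Inductor.
import torch

@torch.library.custom_op("mylib::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@add_one.register_fake
def _(x):
    # fake/meta implementation so the op can be traced with FakeTensors
    return torch.empty_like(x)

@torch.compile
def f(x):
    # the custom op lands in the FX graph as a call_function node
    return add_one(x) * 2

print(f(torch.randn(3)))
```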

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
2025-07-29 00:31:57 +00:00
f02b783aae [1/N] Remove MacOS-13 MPS testing (#159278)
Starts addressing https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159278
Approved by: https://github.com/dcci
ghstack dependencies: #159277
2025-07-28 23:52:47 +00:00
8ad96a563c [inductor] normalize path of the code. (#159255)
Error stack:
<img width="1361" height="345" alt="image" src="https://github.com/user-attachments/assets/50fb2baa-34fd-4a48-a3e7-76e3185391d4" />

After fix:
<img width="1103" height="398" alt="image" src="https://github.com/user-attachments/assets/ece5a9ba-a085-46fe-b061-0c2ebda3a2df" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159255
Approved by: https://github.com/desertfire
2025-07-28 23:42:11 +00:00
59e261bbd8 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit f5130bf339f12ccf5c6296130c47685bdc4858e4.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/yangw-dev due to this pr failed internally when build torchgen due to rror: fail: Unknown PyPI project: pyyaml, it seems like this is caused by change PyYAML into  pyyaml, please fix it ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3129995414))
2025-07-28 22:02:10 +00:00
08ea8fccaf [ez][docker] Remove some unused vars and scripts (#158680)
`CUDNN_VERSION` isn't used in any Dockerfiles; it's picked automatically based on the CUDA version in `install_cuda.sh`

`install_cudnn.sh` isn't used anywhere; cuDNN installation happens in `install_cuda.sh`

I didn't find any mentions of `GRADLE_VERSION` or `TENSORRT_VERSION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158680
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-07-28 21:44:47 +00:00
41754539be Add 3.14 triton wheel build (#159261)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159261
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-28 20:34:16 +00:00
716d52779f [BE] Delete non-existing labels (#159277)
As no such runners have been online for the last 2+ months
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159277
Approved by: https://github.com/clee2000
2025-07-28 20:28:57 +00:00
3bf41f26c8 [cutlass] rename EVT args within kernels for code caching (#159243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159243
Approved by: https://github.com/henrylhtsang
2025-07-28 19:01:40 +00:00
636 changed files with 57414 additions and 9995 deletions

View File

@ -104,7 +104,6 @@ If your new Docker image needs a library installed from a specific pinned commit
```bash
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes

View File

@ -93,7 +93,6 @@ tag=$(echo $image | awk -F':' '{print $2}')
case "$tag" in
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11)
CUDA_VERSION=12.4
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
@ -104,7 +103,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
@ -115,7 +113,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -127,7 +124,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
@ -139,7 +135,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
@ -151,7 +146,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -162,7 +156,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
@ -173,7 +166,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -185,7 +177,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
@ -197,7 +188,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
@ -209,7 +199,6 @@ case "$tag" in
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
@ -299,7 +288,6 @@ case "$tag" in
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
CLANG_VERSION=12
VISION=yes
TRITON=yes
@ -378,7 +366,6 @@ case "$tag" in
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
@ -430,9 +417,6 @@ docker build \
--build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \

View File

@ -1,26 +0,0 @@
#!/bin/bash
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.9" || ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -103,5 +103,5 @@ fi
# It depends on torch and triton. We don't want to install
# triton and torch from production on Docker CI images
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion --no-deps
pip_install helion==0.0.10 --no-deps
fi

View File

@ -50,8 +50,8 @@ IPython==8.12.0
#Pinned versions: 8.12.0
myst-nb==0.17.2
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 0.13.2
#Description: This is used to generate PyTorch functorch and torch.compile docs
#Pinned versions: 0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
@ -59,4 +59,3 @@ sphinx-copybutton==0.5.0
sphinx-design==0.4.0
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1
myst-nb

View File

@ -50,6 +50,9 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE
fi
# Enable LLVM dependency for TensorExpr testing
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
@ -189,6 +192,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export USE_ASAN=1
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then

View File

@ -1039,10 +1039,20 @@ test_libtorch_api() {
mkdir -p $TEST_REPORTS_DIR
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml
"$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml
else
# Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"
# On s390x, pytorch is built without llvm.
# Even if it would be built with llvm, llvm currently doesn't support used features on s390x and
# test fails with errors like:
# JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
# unknown file: Failure
# C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
fi
fi
# quantization is not fully supported on s390x yet

View File

@ -53,16 +53,12 @@ self-hosted-runner:
- linux.rocm.gpu.mi250
- linux.rocm.gpu.2
- linux.rocm.gpu.4
# MI300 runners
- linux.rocm.gpu.mi300.2
- linux.rocm.gpu.mi300.4
# gfx942 runners
- linux.rocm.gpu.gfx942.2
- linux.rocm.gpu.gfx942.4
- rocm-docker
# Repo-specific Apple hosted runners
- macos-m1-ultra
- macos-m2-14
# Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-13
- macos-m1-14
# GitHub-hosted MacOS runners
- macos-latest-xlarge

View File

@ -1 +1 @@
f6dfe1231dcdd221a68416e49ab85c2575cbb824
bf305f538005f2e900f8850ed57146024a8bc559

View File

@ -1 +1 @@
8f605ee30912541126c0fe46d0c8c413101b600a
ca9e2be3ed6320b51f52f536595cd24e254f8bb2

View File

@ -2,7 +2,7 @@ boto3==1.35.42
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.13.1
filelock==3.18.0
hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0

View File

@ -1891,7 +1891,9 @@ def validate_revert(
else pr.get_comment_by_id(comment_id)
)
if comment.editor_login is not None:
raise PostCommentError("Don't want to revert based on edited command")
raise PostCommentError(
"Halting the revert as the revert comment has been edited."
)
author_association = comment.author_association
author_login = comment.author_login
allowed_reverters = ["COLLABORATOR", "MEMBER", "OWNER"]

View File

@ -269,8 +269,8 @@ jobs:
# copy test results back to the mounted workspace, needed sudo, resulting permissions were correct
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"
- name: Change permissions (only needed for MI300 and MI355 kubernetes runners for now)
if: ${{ always() && steps.test.conclusion && (contains(matrix.runner, 'mi300') || contains(matrix.runner, 'mi355')) }}
- name: Change permissions (only needed for kubernetes runners for now)
if: ${{ always() && steps.test.conclusion && (contains(matrix.runner, 'gfx942') || contains(matrix.runner, 'mi355')) }}
run: |
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R 1001:1001 test"

View File

@ -50,7 +50,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
device: ["cuda", "rocm", "xpu", "aarch64"]
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
@ -126,6 +126,12 @@ jobs:
3.13t)
PYTHON_EXECUTABLE=/opt/python/cp313-cp313t/bin/python
;;
3.14)
PYTHON_EXECUTABLE=/opt/python/cp314-cp314/bin/python
;;
3.14t)
PYTHON_EXECUTABLE=/opt/python/cp314-cp314t/bin/python
;;
*)
echo "Unsupported python version ${PY_VERS}"
exit 1

View File

@ -88,23 +88,23 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.gfx942.2" },
]}
secrets: inherit

View File

@ -47,8 +47,8 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.2" },
]}
secrets: inherit

View File

@ -28,7 +28,6 @@ jobs:
# than our AWS macos-m1-14 runners
test-matrix: |
{ include: [
{ config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
{ config: "test_mps", shard: 1, num_shards: 1, runner: "macos-m2-15" },
]}

View File

@ -59,9 +59,9 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi300.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4", owners: ["module:rocm", "oncall:distributed"] },
]}
secrets: inherit

View File

@ -48,12 +48,12 @@ jobs:
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.mi300.2" },
{ config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.gfx942.2" },
]}
secrets: inherit

View File

@ -3,7 +3,7 @@ name: rocm-mi355
on:
workflow_dispatch:
schedule:
- cron: 30 9 * * * # about 2:30am PDT
- cron: 30 11,1 * * * # about 4:30am PDT and 6:30pm PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

View File

@ -94,7 +94,6 @@ jobs:
{ config: "default", shard: 1, num_shards: 3, runner: "macos-m1-stable" },
{ config: "default", shard: 2, num_shards: 3, runner: "macos-m1-stable" },
{ config: "default", shard: 3, num_shards: 3, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-15" },
]}

View File

@ -164,7 +164,7 @@ init_command = [
'types-setuptools==79.0.0.20250422',
'types-jinja2==2.11.9',
'types-colorama==0.4.6',
'filelock==3.13.1',
'filelock==3.18.0',
'junitparser==2.1.1',
'rich==14.1.0',
'pyyaml==6.0.2',

View File

@ -679,6 +679,7 @@ cc_library(
[
"torch/*.h",
"torch/csrc/**/*.h",
"torch/nativert/**/*.h",
"torch/csrc/distributed/c10d/**/*.hpp",
"torch/lib/libshm/*.h",
],

View File

@ -564,7 +564,7 @@ if(MSVC)
set(CMAKE_NINJA_CMCLDEPS_RC OFF)
if(MSVC_Z7_OVERRIDE)
# CMake set debug flags to use /Z7
set(CMAKE_MSVC_DEBUG_INFORMATION_FORMAT Embedded)
set(CMAKE_MSVC_DEBUG_INFORMATION_FORMAT "$<$<CONFIG:Debug,RelWithDebInfo>:Embedded>")
endif()
foreach(
flag_var
@ -872,6 +872,14 @@ cmake_dependent_option(
"USE_CUDA OR USE_ROCM;NOT MSVC"
OFF)
cmake_dependent_option(
USE_FBGEMM_GENAI
"Whether to build FBGEMM GenAI quantized GEMM kernels.\
Will be disabled if not supported by the platform"
OFF
"USE_CUDA OR USE_ROCM"
OFF)
# CAVEAT: Again, Flash Attention2 will error while building for sm52 while Mem
# Eff Attention won't
cmake_dependent_option(
@ -905,6 +913,10 @@ if(USE_FBGEMM)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_FBGEMM")
endif()
if(USE_FBGEMM_GENAI)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_FBGEMM_GENAI")
endif()
if(USE_PYTORCH_QNNPACK)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_PYTORCH_QNNPACK")
endif()

View File

@ -51,12 +51,12 @@ nn/qat/ @jerryzh168
/torch/csrc/distributed/c10d/Ops.* @kwen2501
# ONNX Export
/torch/_dynamo/backends/onnxrt.py @wschin
/torch/csrc/jit/passes/onnx.h @titaiwangms @shubhambhokare1
/torch/csrc/jit/passes/onnx.cpp @titaiwangms @shubhambhokare1
/torch/csrc/jit/passes/onnx/ @titaiwangms @shubhambhokare1
/torch/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin
/test/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin
/torch/_dynamo/backends/onnxrt.py @titaiwangms @xadupre @justinchuby
/torch/csrc/jit/passes/onnx.h @titaiwangms @xadupre
/torch/csrc/jit/passes/onnx.cpp @titaiwangms @xadupre
/torch/csrc/jit/passes/onnx/ @titaiwangms @xadupre
/torch/onnx/ @titaiwangms @xadupre @justinchuby
/test/onnx/ @titaiwangms @xadupre @justinchuby
# CI
/.ci @pytorch/pytorch-dev-infra

View File

@ -247,6 +247,50 @@ if(USE_MEM_EFF_ATTENTION)
list(APPEND ATen_ATTENTION_KERNEL_SRCS ${mem_eff_attention_cuda_kernels_cu})
endif()
IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")
set(USE_FBGEMM_GENAI off)
endif()
# FBGEMM GenAI
IF(USE_FBGEMM_GENAI)
set(FBGEMM_THIRD_PARTY ${PROJECT_SOURCE_DIR}/third_party/fbgemm/external/)
set(FBGEMM_GENAI_DIR ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize)
if(USE_ROCM)
# Only include the kernels we want to build to avoid increasing binary size.
file(GLOB_RECURSE fbgemm_genai_native_rocm_hip
"${FBGEMM_GENAI_DIR}/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped*.hip"
"${FBGEMM_GENAI_DIR}/ck_extensions/fp8_rowwise_grouped/fp8_rowwise_grouped_gemm.hip")
set_source_files_properties(${fbgemm_genai_native_rocm_hip} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Add additional HIPCC compiler flags for performance
set(FBGEMM_GENAI_EXTRA_HIPCC_FLAGS
-mllvm
-amdgpu-coerce-illegal-types=1
-mllvm
-enable-post-misched=0
-mllvm
-greedy-reverse-local-assignment=1
-fhip-new-launch-api)
hip_add_library(
fbgemm_genai STATIC
${fbgemm_genai_native_rocm_hip}
HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})
set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)
target_include_directories(fbgemm_genai PUBLIC
# FBGEMM version of Composable Kernel is used due to some customizations
${FBGEMM_THIRD_PARTY}/composable_kernel/include
${FBGEMM_THIRD_PARTY}/composable_kernel/library/include
${FBGEMM_GENAI_DIR}/include/
${FBGEMM_GENAI_DIR}/common/include/
)
endif()
endif()
# XNNPACK
file(GLOB native_xnnpack "native/xnnpack/*.cpp")

View File

@ -10,6 +10,10 @@
#include <ideep.hpp>
#endif
#if !defined(__s390x__) && !defined(__powerpc__)
#include <cpuinfo.h>
#endif
#include <caffe2/core/common.h>
#include <ATen/native/DispatchStub.h>
@ -103,7 +107,9 @@ std::string get_cpu_capability() {
#elif defined(HAVE_ZVECTOR_CPU_DEFINITION)
case native::CPUCapability::ZVECTOR:
return "Z VECTOR";
#elif defined(HAVE_SVE256_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
#elif defined(HAVE_SVE_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
case native::CPUCapability::SVE:
return "SVE";
case native::CPUCapability::SVE256:
return "SVE256";
#else
@ -118,6 +124,12 @@ std::string get_cpu_capability() {
return "";
}
int get_sve_len() {
// It is possible that we override the cpu_capability with
// environment variable
return cpuinfo_get_max_arm_sve_length();
}
static std::string used_cpu_capability() {
// It is possible that we override the cpu_capability with
// environment variable
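
The hunk above exposes the hardware SVE vector length through `get_sve_len()`, backed by `cpuinfo_get_max_arm_sve_length()`, so vector-length-agnostic code can size buffers and pick kernels at runtime. A minimal sketch of the same query done purely with the ACLE counting intrinsic, assuming an SVE-enabled AArch64 toolchain (the helper name below is illustrative, not part of the diff). `svcntb()` reports the vector length currently in effect for the process, which is what matters when sizing buffers.

```
#include <cstdint>
#include <cstdio>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Illustrative helper: bytes per SVE vector on this core, or 0 when the
// binary was not built with SVE support.
static std::uint64_t sve_vector_length_bytes() {
#if defined(__ARM_FEATURE_SVE)
  return svcntb();  // number of 8-bit lanes == vector length in bytes
#else
  return 0;
#endif
}

int main() {
  const std::uint64_t bytes = sve_vector_length_bytes();
  if (bytes == 0) {
    std::printf("SVE not available in this build\n");
  } else {
    std::printf("SVE vector length: %llu bytes (%llu float lanes)\n",
                static_cast<unsigned long long>(bytes),
                static_cast<unsigned long long>(bytes / sizeof(float)));
  }
  return 0;
}
```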

View File

@ -15,4 +15,6 @@ TORCH_API std::string get_cxx_flags();
TORCH_API std::string get_cpu_capability();
TORCH_API int get_sve_len();
} // namespace at

View File

@ -34,9 +34,9 @@ inline scalar_t vec_reduce_all(
scalar_t acc_arr[Vec::size()];
acc_vec.store(acc_arr);
for (const auto i : c10::irange(1, size)) {
std::array<scalar_t, Vec::size()> acc_arr_next = {0};
scalar_t acc_arr_next[Vec::size()] = {0};
acc_arr_next[0] = acc_arr[i];
Vec acc_vec_next = Vec::loadu(acc_arr_next.data());
Vec acc_vec_next = Vec::loadu(acc_arr_next);
acc_vec = vec_fun(acc_vec, acc_vec_next);
}
acc_vec.store(acc_arr);
@ -102,8 +102,7 @@ struct VecReduceAllSIMD<float, Op> {
#endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) &&
// !defined(C10_MOBILE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
@ -143,8 +142,7 @@ struct VecReduceAllSIMD<float, std::plus<Vectorized<float>>> {
#endif // defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
// && !defined(CPU_CAPABILITY_SVE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
defined(CPU_CAPABILITY_SVE256)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && (defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE))
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
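
The functional_base.h hunks above swap the `std::array` temporaries in `vec_reduce_all` for plain buffers and extend the SIMD-specialisation guards, because `Vec::size()` is no longer a compile-time constant under the vector-length-agnostic SVE build. The underlying operation is just a fold over the stored lanes; a minimal scalar model of that fallback path (names are illustrative, not the actual templates):

```
#include <cstddef>
#include <functional>
#include <vector>

// Scalar model of the vec_reduce_all fallback: store every lane to a buffer,
// then fold the lanes with the caller-supplied binary functor. `lane_values`
// stands in for the Vec::size() lanes, which is now a runtime quantity.
template <typename T, typename Op>
T reduce_all_scalar(const std::vector<T>& lane_values, Op op) {
  T acc = lane_values[0];
  for (std::size_t i = 1; i < lane_values.size(); ++i) {
    acc = op(acc, lane_values[i]);
  }
  return acc;
}

int main() {
  const std::vector<float> lanes = {1.f, 2.f, 3.f, 4.f};
  const float sum = reduce_all_scalar(lanes, std::plus<float>{});  // 10
  const float mx =
      reduce_all_scalar(lanes, [](float a, float b) { return a > b ? a : b; });  // 4
  return (sum == 10.f && mx == 4.f) ? 0 : 1;
}
```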

View File

@ -4,7 +4,7 @@
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)
// Define the data type of VLS(vector-length specific).
typedef svbool_t vls_pred_t
@ -77,4 +77,4 @@ typedef svfloat64_t vls_float64_t
#define ALL_F64_TRUE_MASK svreinterpret_f64_s64(ALL_S64_TRUE_MASK)
#define ALL_F64_FALSE_MASK svreinterpret_f64_s64(ALL_S64_FALSE_MASK)
#endif // defined(CPU_CAPABILITY_SVE)
#endif // defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)

View File

@ -19,7 +19,7 @@ namespace vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE256) && defined(__ARM_FEATURE_BF16)
#if (defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)) && defined(__ARM_FEATURE_BF16)
template <>
struct is_vec_specialized_for<BFloat16> : std::bool_constant<true> {};
@ -230,8 +230,6 @@ __attribute__((optimize("no-tree-vectorize")))
#endif
inline std::tuple<Vectorized<float>, Vectorized<float>>
convert_bfloat16_float(const Vectorized<c10::BFloat16>& a) {
static_assert(
Vectorized<c10::BFloat16>::size() == 2 * Vectorized<float>::size());
auto zero = svreinterpret_bf16_f32(svdup_n_f32(0.0f));
auto bf16_vec1 = svzip1_bf16(zero, a);
auto bf16_vec2 = svzip2_bf16(zero, a);
@ -243,19 +241,18 @@ convert_bfloat16_float(const Vectorized<c10::BFloat16>& a) {
inline Vectorized<c10::BFloat16> convert_float_bfloat16(
const Vectorized<float>& a,
const Vectorized<float>& b) {
static_assert(
Vectorized<c10::BFloat16>::size() == 2 * Vectorized<float>::size());
svbfloat16_t x1 = svcvt_bf16_f32_z(ptrue, a);
svbfloat16_t x2 = svcvt_bf16_f32_z(ptrue, b);
return Vectorized<c10::BFloat16>(svuzp1_bf16(x1, x2));
}
inline void load_fp32_from_bf16(const BFloat16* data, Vectorized<float>& out) {
__at_align__ float values[Vectorized<float>::size()];
__at_align__ float * values = new float[Vectorized<float>::size()];
for (const auto k : c10::irange(Vectorized<float>::size())) {
values[k] = data[k];
}
out = Vectorized<float>::loadu(values);
delete[] values;
}
inline void load_fp32_from_bf16(
@ -308,8 +305,8 @@ Vectorized<c10::BFloat16> inline operator/(
}
inline Vectorized<BFloat16>::Vectorized() {
const short zero = 0;
values = svdup_n_bf16(c10::bit_cast<bfloat16_t>(zero));
auto vals_f = svdup_n_f32(0);
values = convert_float_bfloat16(vals_f, vals_f);
}
inline Vectorized<BFloat16>::Vectorized(int val) {
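
The SVE bfloat16 hunks above rely on the fact that bfloat16 is simply the upper 16 bits of an IEEE-754 binary32: `svzip1_bf16`/`svzip2_bf16` interleave a zero vector with the bf16 lanes so that, viewed as 32-bit lanes, each payload sits in the high half and the result is the exact float value. A scalar model of that widening, plus a truncating narrowing for contrast (helper names are illustrative; the SVE path narrows with `svcvt_bf16_f32`, which rounds to nearest-even):

```
#include <cstdint>
#include <cstring>

// bf16 -> f32: place the 16-bit payload in the high half of a 32-bit word
// (low half zero); this is what zipping with a zero vector achieves lane-wise.
static float bf16_bits_to_float(std::uint16_t bits) {
  const std::uint32_t word = static_cast<std::uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &word, sizeof(f));
  return f;
}

// f32 -> bf16 by truncation (round-toward-zero), for illustration only.
static std::uint16_t float_to_bf16_bits_truncate(float f) {
  std::uint32_t word;
  std::memcpy(&word, &f, sizeof(word));
  return static_cast<std::uint16_t>(word >> 16);
}

int main() {
  // 1.0f is 0x3f800000, so its bf16 payload is 0x3f80 and the round trip is exact.
  return (float_to_bf16_bits_truncate(1.0f) == 0x3f80 &&
          bf16_bits_to_float(0x3f80) == 1.0f) ? 0 : 1;
}
```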

View File

@ -8,7 +8,7 @@
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE) || defined(CPU_CAPABILITY_SVE256)
#include <ATen/cpu/vec/sve/vec_bfloat16.h>
#include <ATen/cpu/vec/sve/vec_double.h>
#include <ATen/cpu/vec/sve/vec_float.h>
@ -27,7 +27,7 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#define DEFINE_SVE_CAST(t1_t, t1_prefix, t2_t, t2_prefix) \
@ -231,6 +231,5 @@ std::pair<
#endif // __ARM_FEATURE_BF16
#endif // defined(CPU_CAPABILITY_SVE)
} // namespace CPU_CAPABILITY
} // namespace at::vec
}

View File

@ -22,7 +22,7 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)
template <>
struct is_vec_specialized_for<double> : std::bool_constant<true> {};
@ -55,10 +55,11 @@ class Vectorized<double> {
operator svfloat64_t() const {
return values;
}
template <uint64_t mask>
static Vectorized<double> blend(
const Vectorized<double>& a,
const Vectorized<double>& b) {
const Vectorized<double>& b,
int64_t mask
) {
// Build an array of flags: each element is 1 if the corresponding bit in
// 'mask' is set, 0 otherwise.
__at_align__ int64_t flag_arr[size()];
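
With `size()` no longer a compile-time constant, the `blend` above (and its float and integer counterparts later in this set) takes the lane mask as a runtime argument instead of a template parameter. The selection logic itself is unchanged: bit i of the mask picks lane i of `b`, a clear bit picks lane i of `a`. A scalar model of that behaviour (the function name is illustrative):

```
#include <cstdint>
#include <vector>

// Scalar model of the runtime blend: bit i of `mask` selects b[i], otherwise
// a[i]. The SVE version expands the same bits into a per-lane flag array,
// loads it, and converts it to a predicate with svcmpne before svsel.
template <typename T>
std::vector<T> blend_scalar(const std::vector<T>& a,
                            const std::vector<T>& b,
                            std::uint64_t mask) {
  std::vector<T> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = (mask & (1ULL << i)) ? b[i] : a[i];
  }
  return out;
}

int main() {
  const std::vector<double> a = {0, 0, 0, 0};
  const std::vector<double> b = {1, 2, 3, 4};
  const auto r = blend_scalar(a, b, 0b1010);  // lanes 1 and 3 taken from b
  return (r[0] == 0 && r[1] == 2 && r[2] == 0 && r[3] == 4) ? 0 : 1;
}
```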

View File

@ -2,8 +2,10 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <ATen/cpu/vec/vec_base.h>
#include <algorithm>
#include <cmath>
#if defined(__aarch64__) && defined(AT_BUILD_ARM_VEC256_WITH_SLEEF)
#include <sleef.h>
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
@ -22,7 +24,7 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE) || defined(CPU_CAPABILITY_SVE256)
template <>
struct is_vec_specialized_for<float> : std::bool_constant<true> {};
@ -30,52 +32,77 @@ struct is_vec_specialized_for<float> : std::bool_constant<true> {};
template <>
class Vectorized<float> {
private:
vls_float32_t values;
__at_align__ float values[2048 / sizeof(float)];
public:
using value_type = float;
using size_type = int;
static constexpr size_type size() {
return VECTOR_WIDTH / sizeof(float);
static inline size_type size() {
return svcntw();
}
Vectorized() {
values = svdup_n_f32(0);
inline Vectorized() {svst1_f32(ptrue, values, svdup_n_f32(0));}
inline Vectorized(const float val) {
svst1_f32(ptrue, values, svdup_n_f32(val));
}
Vectorized(svfloat32_t v) : values(v) {}
Vectorized(float val) {
values = svdup_n_f32(val);
inline Vectorized(const svfloat32_t val) {
svst1_f32(ptrue, values, val);
}
template <
typename... Args,
typename = std::enable_if_t<(sizeof...(Args) == size())>>
Vectorized(Args... vals) {
__at_align__ float buffer[size()] = {vals...};
values = svld1_f32(ptrue, buffer);
template<typename T,
typename = std::enable_if_t<std::is_pointer_v<T>>>
inline Vectorized(float * val) {
svst1_f32(ptrue, values, svld1_f32(ptrue, val));
}
operator svfloat32_t() const {
return values;
template<typename... Args,
typename = std::enable_if_t<(sizeof...(Args) == size())>>
inline Vectorized(Args... vals) {
values = { vals... };
}
template <uint64_t mask>
static Vectorized<float> blend(
const Vectorized<float>& a,
const Vectorized<float>& b) {
// Build an array of flags: each element is 1 if the corresponding bit in
// 'mask' is set, 0 otherwise.
__at_align__ int32_t flag_arr[size()];
inline operator svfloat32_t() const {
return svld1_f32(ptrue, values);
}
static inline Vectorized<float> from_ptr(const float * vs) {
Vectorized<float> v;
svst1_f32(ptrue, v.values, svld1_f32(ptrue, static_cast<const float *>(vs)));
return v;
}
static inline Vectorized<float> from_ptr(const float * vs, int count) {
Vectorized<float> v;
svst1_f32(ptrue, v.values, svld1_f32(svwhilelt_b32_s32(0, count), static_cast<const float *>(vs)));
return v;
}
inline void set_lane(int i, float value) {
values[i] = value;
}
inline Vectorized<float> map(float (*fn)(float)) const {
Vectorized<float> result;
for (int64_t i = 0; i < size(); ++i) {
result.set_lane(i, fn(values[i]));
}
return result;
}
inline Vectorized<float> map2(float (*fn)(float, float), const Vectorized<float> &b) const {
Vectorized<float> result;
for (int64_t i = 0; i < size(); ++i) {
result.set_lane(i, fn(values[i], b.values[i]));
}
return result;
}
static inline Vectorized<float> blend(const Vectorized<float>& a, const Vectorized<float>& b, const uint64_t mask) {
// Build an array of flags: each element is 1 if the corresponding bit in 'mask' is set, 0 otherwise.
__at_align__ int32_t * flag_arr = new int32_t[size()];
for (int i = 0; i < size(); i++) {
flag_arr[i] = (mask & (1ULL << i)) ? 1 : 0;
}
// Load the flag array into an SVE int32 vector.
svint32_t int_mask = svld1_s32(svptrue_b32(), flag_arr);
// Compare each lane of int_mask to 0; returns an svbool_t predicate where
// true indicates a nonzero flag.
svbool_t blend_mask = svcmpne_n_s32(svptrue_b32(), int_mask, 0);
// Use svsel to select elements from b where the predicate is true, else
// from a.
svfloat32_t result = svsel_f32(blend_mask, b.values, a.values);
return Vectorized<float>(result);
svint32_t int_mask = svld1_s32(ptrue, flag_arr);
delete[] flag_arr;
// Compare each lane of int_mask to 0; returns an svbool_t predicate where true indicates a nonzero flag.
svbool_t blend_mask = svcmpne_n_s32(ptrue, int_mask, 0);
// Use svsel to select elements from b where the predicate is true, else from a.
return svsel_f32(blend_mask, b, a);
}
static Vectorized<float> blendv(
static inline Vectorized<float> blendv(
const Vectorized<float>& a,
const Vectorized<float>& b,
const Vectorized<float>& mask_) {
@ -84,16 +111,18 @@ class Vectorized<float> {
return svsel_f32(mask, b, a);
}
template <typename step_t>
static Vectorized<float> arange(
static inline Vectorized<float> arange(
float base = 0.f,
step_t step = static_cast<step_t>(1)) {
__at_align__ float buffer[size()];
__at_align__ float * buffer = new float[size()];
for (int64_t i = 0; i < size(); i++) {
buffer[i] = base + i * step;
}
return svld1_f32(ptrue, buffer);
auto tmp = Vectorized<float>::from_ptr(buffer);
delete[] buffer;
return tmp;
}
static Vectorized<float> set(
static inline Vectorized<float> set(
const Vectorized<float>& a,
const Vectorized<float>& b,
int64_t count = size()) {
@ -169,271 +198,219 @@ class Vectorized<float> {
poly = svsel_f32(svcmpgt_f32(pg, x, max_input), inf, poly);
return poly;
}
static Vectorized<float> loadu(const void* ptr, int64_t count = size()) {
if (count == size())
return svld1_f32(ptrue, reinterpret_cast<const float*>(ptr));
svbool_t pg = svwhilelt_b32(0ull, count);
return svld1_f32(pg, reinterpret_cast<const float*>(ptr));
static inline Vectorized<float> loadu(const void* ptr) {
return Vectorized<float>::from_ptr(reinterpret_cast<const float *>(ptr));
}
void store(void* ptr, int64_t count = size()) const {
if (count == size()) {
svst1_f32(ptrue, reinterpret_cast<float*>(ptr), values);
} else {
svbool_t pg = svwhilelt_b32(0ull, count);
svst1_f32(pg, reinterpret_cast<float*>(ptr), values);
}
static inline Vectorized<float> loadu(const void* ptr, int64_t count) {
return Vectorized<float>::from_ptr(reinterpret_cast<const float *>(ptr), count);
}
const float& operator[](int idx) const = delete;
float& operator[](int idx) = delete;
int64_t zero_mask() const {
// returns an integer mask where all zero elements are translated to 1-bit
// and others are translated to 0-bit
inline void store(void* ptr) const {
svst1_f32(ptrue, static_cast<float *>(ptr), svld1_f32(ptrue, values));
}
inline void store(void* ptr, int count) const {
svst1_f32(svwhilelt_b32_s32(0, count), static_cast<float *>(ptr), svld1_f32(ptrue, values));
}
inline const float& operator[](int idx) const {
return values[idx];
};
inline float& operator[](int idx) {
return values[idx];
};
inline int64_t zero_mask() const {
// returns an integer mask where all zero elements are translated to 1-bit and others are translated to 0-bit
int64_t mask = 0;
__at_align__ int32_t mask_array[size()];
__at_align__ int32_t * mask_array = new int32_t[size()];
svbool_t svbool_mask = svcmpeq_f32(ptrue, values, ZERO_F32);
svst1_s32(
ptrue,
mask_array,
svsel_s32(svbool_mask, ALL_S32_TRUE_MASK, ALL_S32_FALSE_MASK));
for (int64_t i = 0; i < size(); ++i) {
if (mask_array[i])
mask |= (1ull << i);
svbool_t svbool_mask = svcmpeq_f32(ptrue, *this, ZERO_F32);
svst1_s32(ptrue, mask_array, svsel_s32(svbool_mask,
ALL_S32_TRUE_MASK,
ALL_S32_FALSE_MASK));
for (int64_t j = 0; j < size(); ++j) {
if (mask_array[j]) mask |= (1ull << j);
}
delete[] mask_array;
return mask;
}
Vectorized<float> isnan() const {
inline Vectorized<float> isnan() const {
// NaN check
svbool_t mask = svcmpuo_f32(ptrue, values, ZERO_F32);
auto mask = svcmpuo_f32(ptrue, *this, ZERO_F32);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
bool has_inf_nan() const {
return svptest_any(
ptrue,
svcmpuo_f32(ptrue, svsub_f32_x(ptrue, values, values), ZERO_F32));
inline bool has_inf_nan() const {
return svptest_any(ptrue, svcmpuo_f32(ptrue, svsub_f32_x(ptrue, *this, *this), ZERO_F32));
}
Vectorized<float> map(float (*f)(float)) const {
__at_align__ float tmp[size()];
store(tmp);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = f(tmp[i]);
}
return loadu(tmp);
inline Vectorized<float> abs() const {
return svabs_f32_x(ptrue, *this);
}
Vectorized<float> abs() const {
return svabs_f32_x(ptrue, values);
}
Vectorized<float> angle() const {
inline Vectorized<float> angle() const {
const auto nan_vec = svdup_n_f32(NAN);
const auto nan_mask = svcmpuo_f32(ptrue, values, ZERO_F32);
const auto nan_mask = svcmpuo_f32(ptrue, *this, ZERO_F32);
const auto pi = svdup_n_f32(c10::pi<float>);
const auto neg_mask = svcmplt_f32(ptrue, values, ZERO_F32);
const auto neg_mask = svcmplt_f32(ptrue, *this, ZERO_F32);
auto angle = svsel_f32(neg_mask, pi, ZERO_F32);
angle = svsel_f32(nan_mask, nan_vec, angle);
return angle;
return svsel_f32(nan_mask, nan_vec, angle);
}
Vectorized<float> real() const {
return values;
inline Vectorized<float> real() const {
return *this;
}
Vectorized<float> imag() const {
inline Vectorized<float> imag() const {
return Vectorized<float>(0.f);
}
Vectorized<float> conj() const {
return values;
inline Vectorized<float> conj() const {
return *this;
}
Vectorized<float> acos() const {
return USE_SLEEF(
Vectorized<float>(Sleef_acosfx_u10sve(values)), map(std::acos));
inline Vectorized<float> acos() const {
return USE_SLEEF(Sleef_acosfx_u10sve(*this), map(std::acos));
}
Vectorized<float> acosh() const {
return USE_SLEEF(
Vectorized<float>(Sleef_acoshfx_u10sve(values)), map(std::acosh));
inline Vectorized<float> acosh() const {
return USE_SLEEF(Sleef_acoshfx_u10sve(*this), map(std::acosh));
}
Vectorized<float> asin() const {
return USE_SLEEF(
Vectorized<float>(Sleef_asinfx_u10sve(values)), map(std::asin));
inline Vectorized<float> asin() const {
return USE_SLEEF(Sleef_asinfx_u10sve(*this), map(std::asin));
}
Vectorized<float> asinh() const {
return USE_SLEEF(
Vectorized<float>(Sleef_asinhfx_u10sve(values)), map(std::asinh));
inline Vectorized<float> asinh() const {
return USE_SLEEF(Sleef_asinhfx_u10sve(*this), map(std::asinh));
}
Vectorized<float> atan() const {
return USE_SLEEF(
Vectorized<float>(Sleef_atanfx_u10sve(values)), map(std::atan));
inline Vectorized<float> atan() const {
return USE_SLEEF(Sleef_atanfx_u10sve(*this), map(std::atan));
}
Vectorized<float> atanh() const {
return USE_SLEEF(
Vectorized<float>(Sleef_atanhfx_u10sve(values)), map(std::atanh));
inline Vectorized<float> atanh() const {
return USE_SLEEF(Sleef_atanhfx_u10sve(*this), map(std::atanh));
}
Vectorized<float> atan2(const Vectorized<float>& b) const {USE_SLEEF(
{ return Vectorized<float>(Sleef_atan2fx_u10sve(values, b)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::atan2(tmp[i], tmp_b[i]);
}
return loadu(tmp);
})} Vectorized<float> copysign(const Vectorized<float>& sign) const {
USE_SLEEF(
{ return Vectorized<float>(Sleef_copysignfx_sve(values, sign)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_sign[size()];
store(tmp);
sign.store(tmp_sign);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::copysign(tmp[i], tmp_sign[i]);
}
return loadu(tmp);
})} Vectorized<float> erf() const {
return USE_SLEEF(
Vectorized<float>(Sleef_erffx_u10sve(values)), map(std::erf));
inline Vectorized<float> atan2(const Vectorized<float> &b) const {
return USE_SLEEF(Sleef_atan2fx_u10sve(*this, b), map2(std::atan2, b));
}
Vectorized<float> erfc() const {
return USE_SLEEF(
Vectorized<float>(Sleef_erfcfx_u15sve(values)), map(std::erfc));
inline Vectorized<float> copysign(const Vectorized<float> &sign) const {
return USE_SLEEF(Sleef_copysignfx_sve(*this, sign), map2(std::copysign, sign));
}
Vectorized<float> erfinv() const {
inline Vectorized<float> erf() const {
return USE_SLEEF(Sleef_erffx_u10sve(*this), map(std::erf));
}
inline Vectorized<float> erfc() const {
return USE_SLEEF(Sleef_erfcfx_u15sve(*this), map(std::erfc));
}
inline Vectorized<float> erfinv() const {
return map(calc_erfinv);
}
Vectorized<float> exp() const {
return USE_SLEEF(
Vectorized<float>(Sleef_expfx_u10sve(values)), map(std::exp));
inline Vectorized<float> exp() const {
return USE_SLEEF(Sleef_expfx_u10sve(*this), map(std::exp));
}
Vectorized<float> exp2() const {
return USE_SLEEF(
Vectorized<float>(Sleef_exp2fx_u10sve(values)), map(std::exp2));
inline Vectorized<float> exp2() const {
return USE_SLEEF(Sleef_exp2fx_u10sve(*this), map(std::exp2));
}
Vectorized<float> expm1() const {
return USE_SLEEF(
Vectorized<float>(Sleef_expm1fx_u10sve(values)), map(std::expm1));
inline Vectorized<float> expm1() const {
return USE_SLEEF(Sleef_expm1fx_u10sve(*this), map(std::expm1));
}
Vectorized<float> exp_u20() const {
return exp();
// Implementation copied from Arm Optimized Routines:
// https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/sve/expf.c
inline Vectorized<float> exp_u20() {
// Load values into an SVE vector
svfloat32_t val_vec = svld1(svptrue_b32(), values); // 'values' is float*
// Check for special case: |x| >= 87.3...
svbool_t is_special_case = svacgt(svptrue_b32(), val_vec, 0x1.5d5e2ap+6f);
if (svptest_any(svptrue_b32(), is_special_case)) {
return exp(); // fallback to scalar exp() for special cases
}
Vectorized<float> fexp_u20() const {
return exp();
// Constants
const svfloat32_t ln2_hi = svdup_f32(0x1.62e4p-1f);
const svfloat32_t ln2_lo = svdup_f32(0x1.7f7d1cp-20f);
const svfloat32_t c1 = svdup_f32(0.5f);
const svfloat32_t inv_ln2 = svdup_f32(0x1.715476p+0f);
const svfloat32_t shift_vec = svdup_f32(0x1.803f8p17f); // scalar to vector
// n = round(x / ln2)
svfloat32_t z = svmad_x(svptrue_b32(), inv_ln2, val_vec, shift_vec);
svfloat32_t n = svsub_x(svptrue_b32(), z, shift_vec);
// r = x - n * ln2
svfloat32_t r = svsub_x(svptrue_b32(), val_vec, svmul_x(svptrue_b32(), n, ln2_hi));
r = svsub_x(svptrue_b32(), r, svmul_x(svptrue_b32(), n, ln2_lo));
// scale = 2^(n)
svfloat32_t scale = svexpa(svreinterpret_u32(z));
// poly(r) = exp(r) - 1 ≈ r + 0.5 * r^2
svfloat32_t r2 = svmul_x(svptrue_b32(), r, r);
svfloat32_t poly = svmla_x(svptrue_b32(), r, r2, c1);
// return scale * (1 + poly)
return svmla_x(svptrue_b32(), scale, scale, poly);
}
Vectorized<float> fmod(const Vectorized<float>& q) const {USE_SLEEF(
{ return Vectorized<float>(Sleef_fmodfx_sve(values, q)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_q[size()];
store(tmp);
q.store(tmp_q);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::fmod(tmp[i], tmp_q[i]);
}
return loadu(tmp);
})} Vectorized<float> hypot(const Vectorized<float>& b) const {
USE_SLEEF(
{ return Vectorized<float>(Sleef_hypotfx_u05sve(values, b)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::hypot(tmp[i], tmp_b[i]);
}
return loadu(tmp);
})} Vectorized<float> i0() const {
inline Vectorized<float> fexp_u20() {
return exp_u20();
}
inline Vectorized<float> fmod(const Vectorized<float>& q) const {
return USE_SLEEF(Sleef_fmodfx_sve(*this, q), map2(std::fmod, q));
}
inline Vectorized<float> hypot(const Vectorized<float> &b) const {
return USE_SLEEF(Sleef_hypotfx_u05sve(*this, b), map2(std::hypot, b));
}
inline Vectorized<float> i0() const {
return map(calc_i0);
}
Vectorized<float> i0e() const {
return map(calc_i0e);
inline Vectorized<float> i0e() const {
return map(calc_i0e<float>);
}
Vectorized<float> digamma() const {
inline Vectorized<float> digamma() const {
return map(calc_digamma);
}
Vectorized<float> igamma(const Vectorized<float>& x) const {
__at_align__ float tmp[size()];
__at_align__ float tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igamma(tmp[i], tmp_x[i]);
}
return loadu(tmp);
inline Vectorized<float> igamma(const Vectorized<float> &x) const {
return map2(calc_igamma<float>, x);
}
Vectorized<float> igammac(const Vectorized<float>& x) const {
__at_align__ float tmp[size()];
__at_align__ float tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igammac(tmp[i], tmp_x[i]);
}
return loadu(tmp);
inline Vectorized<float> igammac(const Vectorized<float> &x) const {
return map2(calc_igammac<float>, x);
}
Vectorized<float> nextafter(const Vectorized<float>& b) const {USE_SLEEF(
{ return Vectorized<float>(Sleef_nextafterfx_sve(values, b)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::nextafter(tmp[i], tmp_b[i]);
}
return loadu(tmp);
})} Vectorized<float> log() const {
return USE_SLEEF(
Vectorized<float>(Sleef_logfx_u10sve(values)), map(std::log));
inline Vectorized<float> nextafter(const Vectorized<float> &b) const {
return USE_SLEEF(Sleef_nextafterfx_sve(*this, b), map2(std::nextafter, b));
}
Vectorized<float> log2() const {
return USE_SLEEF(
Vectorized<float>(Sleef_log2fx_u10sve(values)), map(std::log2));
inline Vectorized<float> log() const {
return USE_SLEEF(Sleef_logfx_u10sve(*this), map(std::log));
}
Vectorized<float> log10() const {
return USE_SLEEF(
Vectorized<float>(Sleef_log10fx_u10sve(values)), map(std::log10));
inline Vectorized<float> log2() const {
return USE_SLEEF(Sleef_log2fx_u10sve(*this), map(std::log2));
}
Vectorized<float> log1p() const {
return USE_SLEEF(
Vectorized<float>(Sleef_log1pfx_u10sve(values)), map(std::log1p));
inline Vectorized<float> log10() const {
return USE_SLEEF(Sleef_log10fx_u10sve(*this), map(std::log10));
}
Vectorized<float> frac() const;
Vectorized<float> sin() const {
return USE_SLEEF(
Vectorized<float>(Sleef_sinfx_u10sve(values)), map(std::sin));
inline Vectorized<float> log1p() const {
return USE_SLEEF(Sleef_log1pfx_u10sve(*this), map(std::log1p));
}
Vectorized<float> sinh() const {
return USE_SLEEF(
Vectorized<float>(Sleef_sinhfx_u10sve(values)), map(std::sinh));
inline Vectorized<float> frac() const;
inline Vectorized<float> sin() const {
return USE_SLEEF(Sleef_sinfx_u10sve(*this), map(std::sin));
}
Vectorized<float> cos() const {
return USE_SLEEF(
Vectorized<float>(Sleef_cosfx_u10sve(values)), map(std::cos));
inline Vectorized<float> sinh() const {
return USE_SLEEF(Sleef_sinhfx_u10sve(*this), map(std::sinh));
}
Vectorized<float> cosh() const {
return USE_SLEEF(
Vectorized<float>(Sleef_coshfx_u10sve(values)), map(std::cosh));
inline Vectorized<float> cos() const {
return USE_SLEEF(Sleef_cosfx_u10sve(*this), map(std::cos));
}
Vectorized<float> ceil() const {
return svrintp_f32_x(ptrue, values);
inline Vectorized<float> cosh() const {
return USE_SLEEF(Sleef_coshfx_u10sve(*this), map(std::cosh));
}
Vectorized<float> floor() const {
return svrintm_f32_x(ptrue, values);
inline Vectorized<float> ceil() const {
return svrintp_f32_x(ptrue, *this);
}
Vectorized<float> neg() const {
return svneg_f32_x(ptrue, values);
inline Vectorized<float> floor() const {
return svrintm_f32_x(ptrue, *this);
}
Vectorized<float> round() const {
return svrinti_f32_x(ptrue, values);
inline Vectorized<float> neg() const {
return svneg_f32_x(ptrue, *this);
}
Vectorized<float> tan() const {
return USE_SLEEF(
Vectorized<float>(Sleef_tanfx_u10sve(values)), map(std::tan));
inline Vectorized<float> round() const {
return svrinti_f32_x(ptrue, *this);
}
inline Vectorized<float> tan() const {
return USE_SLEEF(Sleef_tanfx_u10sve(*this), map(std::tan));
}
// Implementation is picked from
// https://github.com/ARM-software/ComputeLibrary/blob/v25.01/src/core/NEON/SVEMath.inl#L179
Vectorized<float> tanh() const {
inline Vectorized<float> tanh() const {
// Constants used for the tanh calculation.
const svfloat32_t CONST_1 =
svdup_n_f32(1.f); // Constant 1.0f for the tanh formula.
@ -450,7 +427,7 @@ class Vectorized<float> {
// instability. svmax_f32_z ensures values are greater than -10, and
// svmin_f32_z ensures they are less than 10.
svfloat32_t x = svmin_f32_z(
ptrue, svmax_f32_z(ptrue, values, CONST_MIN_TANH), CONST_MAX_TANH);
ptrue, svmax_f32_z(ptrue, *this, CONST_MIN_TANH), CONST_MAX_TANH);
// Step 2: Calculate exp(2 * x), where x is the clamped value.
// svmul_f32_z computes 2 * x, and svexp_f32_z computes the exponential of
@ -472,104 +449,85 @@ class Vectorized<float> {
// Return the calculated tanh values.
return tanh;
}
Vectorized<float> trunc() const {
return svrintz_f32_x(ptrue, values);
inline Vectorized<float> trunc() const {
return svrintz_f32_x(ptrue, *this);
}
Vectorized<float> lgamma() const {
return USE_SLEEF(
Vectorized<float>(Sleef_lgammafx_u10sve(values)), map(std::lgamma));
inline Vectorized<float> lgamma() const {
return USE_SLEEF(Sleef_lgammafx_u10sve(*this), map(std::lgamma));
}
Vectorized<float> sqrt() const {
return svsqrt_f32_x(ptrue, values);
inline Vectorized<float> sqrt() const {
return svsqrt_f32_x(ptrue, *this);
}
Vectorized<float> reciprocal() const {
return svdivr_f32_x(ptrue, values, ONE_F32);
inline Vectorized<float> reciprocal() const {
return svdivr_f32_x(ptrue, *this, svdup_n_f32(1.f));
}
Vectorized<float> rsqrt() const {
return svdivr_f32_x(ptrue, svsqrt_f32_x(ptrue, values), ONE_F32);
inline Vectorized<float> rsqrt() const {
return svdivr_f32_x(ptrue, svsqrt_f32_x(ptrue, *this), ONE_F32);
}
Vectorized<float> pow(const Vectorized<float>& b) const {USE_SLEEF(
{ return Vectorized<float>(Sleef_powfx_u10sve(values, b)); },
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::pow(tmp[i], tmp_b[i]);
}
return loadu(tmp);
})} // Comparison using the _CMP_**_OQ predicate.
// `O`: get false if an operand is NaN
// `Q`: do not raise if an operand is NaN
Vectorized<float> operator==(const Vectorized<float>& other) const {
svbool_t mask = svcmpeq_f32(ptrue, values, other);
inline Vectorized<float> pow(const Vectorized<float> &b) const {
return USE_SLEEF(Sleef_powfx_u10sve(*this, b), map2(std::pow, b));
}
// Comparison using the _CMP_**_OQ predicate.
// `O`: get false if an operand is NaN
// `Q`: do not raise if an operand is NaN
inline Vectorized<float> operator==(const Vectorized<float>& other) const {
svbool_t mask = svcmpeq_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
inline Vectorized<float> operator!=(const Vectorized<float>& other) const {
svbool_t mask = svcmpne_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
inline Vectorized<float> operator<(const Vectorized<float>& other) const {
svbool_t mask = svcmplt_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator!=(const Vectorized<float>& other) const {
svbool_t mask = svcmpne_f32(ptrue, values, other);
inline Vectorized<float> operator<=(const Vectorized<float>& other) const {
svbool_t mask = svcmple_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator<(const Vectorized<float>& other) const {
svbool_t mask = svcmplt_f32(ptrue, values, other);
inline Vectorized<float> operator>(const Vectorized<float>& other) const {
svbool_t mask = svcmpgt_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator<=(const Vectorized<float>& other) const {
svbool_t mask = svcmple_f32(ptrue, values, other);
inline Vectorized<float> operator>=(const Vectorized<float>& other) const {
svbool_t mask = svcmpge_f32(ptrue, *this, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator>(const Vectorized<float>& other) const {
svbool_t mask = svcmpgt_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator>=(const Vectorized<float>& other) const {
svbool_t mask = svcmpge_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> eq(const Vectorized<float>& other) const;
Vectorized<float> ne(const Vectorized<float>& other) const;
Vectorized<float> gt(const Vectorized<float>& other) const;
Vectorized<float> ge(const Vectorized<float>& other) const;
Vectorized<float> lt(const Vectorized<float>& other) const;
Vectorized<float> le(const Vectorized<float>& other) const;
inline Vectorized<float> eq(const Vectorized<float>& other) const;
inline Vectorized<float> ne(const Vectorized<float>& other) const;
inline Vectorized<float> gt(const Vectorized<float>& other) const;
inline Vectorized<float> ge(const Vectorized<float>& other) const;
inline Vectorized<float> lt(const Vectorized<float>& other) const;
inline Vectorized<float> le(const Vectorized<float>& other) const;
};
template <>
Vectorized<float> inline operator+(
const Vectorized<float>& a,
const Vectorized<float>& b) {
inline Vectorized<float> operator+(const Vectorized<float>& a, const Vectorized<float>& b) {
return svadd_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator-(
const Vectorized<float>& a,
const Vectorized<float>& b) {
inline Vectorized<float> operator-(const Vectorized<float>& a, const Vectorized<float>& b) {
return svsub_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator*(
const Vectorized<float>& a,
const Vectorized<float>& b) {
inline Vectorized<float> operator*(const Vectorized<float>& a, const Vectorized<float>& b) {
return svmul_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator/(
const Vectorized<float>& a,
const Vectorized<float>& b) {
inline Vectorized<float> operator/(const Vectorized<float>& a, const Vectorized<float>& b) {
return svdiv_f32_x(ptrue, a, b);
}
// frac. Implement this here so we can use subtraction
Vectorized<float> inline Vectorized<float>::frac() const {
inline Vectorized<float> Vectorized<float>::frac() const {
return *this - this->trunc();
}
@ -585,115 +543,91 @@ Vectorized<float> inline maximum(
// Implements the IEEE 754 201X `minimum` operation, which propagates NaN if
// either input is a NaN.
template <>
Vectorized<float> inline minimum(
const Vectorized<float>& a,
const Vectorized<float>& b) {
inline Vectorized<float> minimum(const Vectorized<float>& a, const Vectorized<float>& b) {
return svmin_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline clamp(
const Vectorized<float>& a,
const Vectorized<float>& min,
const Vectorized<float>& max) {
inline Vectorized<float> clamp(const Vectorized<float>& a, const Vectorized<float>& min, const Vectorized<float>& max) {
return svmin_f32_x(ptrue, max, svmax_f32_x(ptrue, min, a));
}
template <>
Vectorized<float> inline clamp_max(
const Vectorized<float>& a,
const Vectorized<float>& max) {
inline Vectorized<float> clamp_max(const Vectorized<float>& a, const Vectorized<float>& max) {
return svmin_f32_x(ptrue, max, a);
}
template <>
Vectorized<float> inline clamp_min(
const Vectorized<float>& a,
const Vectorized<float>& min) {
inline Vectorized<float> clamp_min(const Vectorized<float>& a, const Vectorized<float>& min) {
return svmax_f32_x(ptrue, min, a);
}
template <>
Vectorized<float> inline operator&(
const Vectorized<float>& a,
const Vectorized<float>& b) {
return svreinterpret_f32_s32(
svand_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
inline Vectorized<float> operator&(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(svand_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
template <>
Vectorized<float> inline operator|(
const Vectorized<float>& a,
const Vectorized<float>& b) {
return svreinterpret_f32_s32(
svorr_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
inline Vectorized<float> operator|(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(svorr_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
template <>
Vectorized<float> inline operator^(
const Vectorized<float>& a,
const Vectorized<float>& b) {
return svreinterpret_f32_s32(
sveor_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
inline Vectorized<float> operator^(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(sveor_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
Vectorized<float> inline Vectorized<float>::eq(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::eq(const Vectorized<float>& other) const {
return (*this == other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::ne(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::ne(const Vectorized<float>& other) const {
return (*this != other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::gt(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::gt(const Vectorized<float>& other) const {
return (*this > other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::ge(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::ge(const Vectorized<float>& other) const {
return (*this >= other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::lt(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::lt(const Vectorized<float>& other) const {
return (*this < other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::le(
const Vectorized<float>& other) const {
inline Vectorized<float> Vectorized<float>::le(const Vectorized<float>& other) const {
return (*this <= other) & Vectorized<float>(1.0f);
}
template <>
inline void convert(const float* src, float* dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
const int64_t fraction = n % svcntw();
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
for (int64_t i = 0; i < n - fraction; i += svcntw()) {
svst1_f32(ptrue, dst + i, svldnt1_f32(ptrue, src + i));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
for (int64_t i = n - fraction; i < n; i += svcntw()) {
svbool_t pg = svwhilelt_b32(i, n);
svst1_f32(pg, dst + i, svldnt1_f32(pg, src + i));
}
}
template <>
inline void convert(const float* src, at::Half* dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_16 = svwhilelt_b16(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
inline void convert(const float *src, at::Half *dst, int64_t n) {
const int64_t fraction = n % svcntw();
svbool_t pg_16 = svwhilelt_b16(0ull, svcntw());
svbool_t pg_32 = svwhilelt_b32(0ull, svcntw());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svfloat16_t src_vec = svuzp1_f16(
svcvt_f16_f32_x(ptrue, svldnt1_f32(pg_32, src + i)), ZERO_F16);
for (int64_t i = 0; i < n - fraction; i += svcntw()) {
svfloat16_t src_vec = svuzp1_f16(svcvt_f16_f32_x(ptrue, svldnt1_f32(pg_32, src + i)),
ZERO_F16);
svst1_f16(pg_16, reinterpret_cast<float16_t*>(dst) + i, src_vec);
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
for (int64_t i = n - fraction; i < n; i += svcntw()) {
pg_16 = svwhilelt_b16(i, n);
pg_32 = svwhilelt_b32(i, n);
svfloat16_t src_vec = svuzp1_f16(
@ -703,19 +637,18 @@ inline void convert(const float* src, at::Half* dst, int64_t n) {
}
template <>
inline void convert(const at::Half* src, float* dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_16 = svwhilelt_b16(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
inline void convert(const at::Half *src, float *dst, int64_t n) {
const int64_t fraction = n % svcntw();
svbool_t pg_16 = svwhilelt_b16(0ull, svcntw());
svbool_t pg_32 = svwhilelt_b32(0ull, svcntw());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svfloat16_t src_vec = svzip1_f16(
svldnt1_f16(pg_16, reinterpret_cast<const float16_t*>(src) + i),
ZERO_F16);
for (int64_t i = 0; i < n - fraction; i += svcntw()) {
svfloat16_t src_vec = svzip1_f16(svldnt1_f16(pg_16, reinterpret_cast<const float16_t*>(src) + i),
ZERO_F16);
svst1_f32(pg_32, dst + i, svcvt_f32_f16_x(ptrue, src_vec));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
for (int64_t i = n - fraction; i < n; i += svcntw()) {
pg_16 = svwhilelt_b16(i, n);
pg_32 = svwhilelt_b32(i, n);
svfloat16_t src_vec = svzip1_f16(
@ -726,20 +659,19 @@ inline void convert(const at::Half* src, float* dst, int64_t n) {
}
template <>
inline void convert(const bool* src, float* dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_8 = svwhilelt_b8(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
inline void convert(const bool *src, float *dst, int64_t n) {
const int64_t fraction = n % svcntw();
svbool_t pg_8 = svwhilelt_b8(0ull, svcntw());
svbool_t pg_32 = svwhilelt_b32(0ull, svcntw());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svuint8_t src_vec_u8 =
svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
for (int64_t i = 0; i < n - fraction; i += svcntw()) {
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint32_t src_vec_u32 = svunpklo_u32(svunpklo_u16(src_vec_u8));
svbool_t mask = svcmpne_u32(pg_32, src_vec_u32, ZERO_U32);
svst1_f32(pg_32, dst + i, svsel_f32(mask, ONE_F32, ZERO_F32));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
for (int64_t i = n - fraction; i < n; i += svcntw()) {
pg_8 = svwhilelt_b8(i, n);
pg_32 = svwhilelt_b32(i, n);
svuint8_t src_vec_u8 =
@ -751,10 +683,7 @@ inline void convert(const bool* src, float* dst, int64_t n) {
}
template <>
Vectorized<float> inline fmadd(
const Vectorized<float>& a,
const Vectorized<float>& b,
const Vectorized<float>& c) {
inline Vectorized<float> fmadd(const Vectorized<float>& a, const Vectorized<float>& b, const Vectorized<float>& c) {
return svmad_f32_x(ptrue, a, b, c);
}
@ -785,4 +714,4 @@ Vectorized<float> inline fnmsub(
#endif // defined(CPU_CAPABILITY_SVE)
} // namespace CPU_CAPABILITY
} // namespace at::vec
} // namespace at::vec
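
The vec_float.h diff above is the core of the vector-length-agnostic rework: `size()` now returns `svcntw()` at runtime, the lanes live in a fixed-size in-class float buffer that is reloaded into an SVE register inside each operation, elementwise fallbacks go through the new `map`/`map2` helpers, and the count-limited `loadu`/`store` overloads build `svwhilelt` predicates instead of assuming a full vector. Keeping a plain buffer leaves the class with a fixed, copyable layout even though the hardware vector length is only known at runtime, at the cost of a store/load round-trip per operation. The reworked `exp_u20` in the same file follows the classic range-reduction scheme, walked through after the NEON float diff further down. Below is a minimal sketch of the predication pattern those load/store overloads use, assuming an SVE toolchain (the function is illustrative, not part of the diff):

```
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// Vector-length-agnostic loop in the style of the new loadu/store(count)
// overloads: svwhilelt builds a predicate covering only the lanes still in
// range, so the tail iteration needs no scalar cleanup.
void scale_by_two(const float* src, float* dst, std::int64_t n) {
  const std::int64_t lanes = static_cast<std::int64_t>(svcntw());
  for (std::int64_t i = 0; i < n; i += lanes) {
    const svbool_t pg = svwhilelt_b32_s64(i, n);  // active lanes: i .. min(i+lanes, n)-1
    svfloat32_t v = svld1_f32(pg, src + i);       // predicated (partial) load
    v = svmul_n_f32_x(pg, v, 2.0f);
    svst1_f32(pg, dst + i, v);                    // predicated (partial) store
  }
}
#endif
```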

View File

@ -15,7 +15,7 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)
#define VEC_INT_SVE_TEMPLATE(vl, bit) \
template <> \
@ -49,10 +49,11 @@ inline namespace CPU_CAPABILITY {
operator svint##bit##_t() const { \
return values; \
} \
template <uint64_t mask> \
static Vectorized<int##bit##_t> blend( \
const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
const Vectorized<int##bit##_t>& b, \
uint64_t mask \
) { \
__at_align__ int##bit##_t flag_arr[size()]; \
for (int i = 0; i < size(); ++i) { \
flag_arr[i] = (i < 64 && (mask & (1ULL << i))) ? 1 : 0; \
@ -493,7 +494,7 @@ Vectorized<int8_t> inline operator>>(
return svasr_s8_x(ptrue, a, svreinterpret_u8_s8(b));
}
#endif // defined(CPU_CAPABILITY_SVE)
#endif // defined(CPU_CAPABILITY_SVE256)
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -46,7 +46,7 @@ namespace at::vec {
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)
// NOTE: These are low-performance implementations that we fall back on
// if we are not building with SVE. This may not be an issue, because
@ -100,12 +100,12 @@ struct VectorizedQuantizedConverter {
Vectorized<float> zero_point,
Vectorized<float> scale_zp_premul) const {
float_vec_return_type rv;
float tmp_scale[Vectorized<float>::size()];
float tmp_zero_point[Vectorized<float>::size()];
float * tmp_scale = new float[Vectorized<float>::size()];
float * tmp_zero_point = new float[Vectorized<float>::size()];
scale.store(tmp_scale);
zero_point.store(tmp_zero_point);
for (int i = 0; i < float_num_vecs(); ++i) {
float tmp_vals[Vectorized<float>::size()];
float * tmp_vals = new float[Vectorized<float>::size()];
for (int j = 0; j < Vectorized<float>::size(); ++j) {
tmp_vals[j] = at::native::dequantize_val<T>(
tmp_scale[j],
@ -113,6 +113,10 @@ struct VectorizedQuantizedConverter {
T(vals[Vectorized<float>::size() * i + j]));
}
rv[i] = Vectorized<float>::loadu(tmp_vals);
delete[] tmp_scale;
delete[] tmp_zero_point;
delete[] tmp_vals;
}
return rv;
}
@ -121,12 +125,12 @@ struct VectorizedQuantizedConverter {
Vectorized<float> scale,
Vectorized<float> zero_point) const {
float_vec_return_type rv;
float tmp_scale[Vectorized<float>::size()];
float tmp_zero_point[Vectorized<float>::size()];
float * tmp_scale = new float[Vectorized<float>::size()];
float * tmp_zero_point = new float[Vectorized<float>::size()];
scale.store(tmp_scale);
zero_point.store(tmp_zero_point);
for (int i = 0; i < float_num_vecs(); ++i) {
float tmp_vals[Vectorized<float>::size()];
float * tmp_vals = new float[Vectorized<float>::size()];
for (int j = 0; j < Vectorized<float>::size(); ++j) {
tmp_vals[j] = at::native::dequantize_val<T>(
tmp_scale[j],
@ -134,6 +138,9 @@ struct VectorizedQuantizedConverter {
T(vals[Vectorized<float>::size() * i + j]));
}
rv[i] = Vectorized<float>::loadu(tmp_vals);
delete[] tmp_scale;
delete[] tmp_zero_point;
delete[] tmp_vals;
}
return rv;
}
@ -205,7 +212,7 @@ struct Vectorized<c10::qint32> : public VectorizedQuantizedConverter<
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
float * float_vals = new float[float_num_vecs() * Vectorized<float>::size()];
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(
@ -216,10 +223,11 @@ struct Vectorized<c10::qint32> : public VectorizedQuantizedConverter<
at::native::quantize_vec<c10::qint32, /*precision=*/32>(
scale,
zero_point,
float_vals.data(),
float_vals,
(c10::qint32*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
delete[] float_vals;
return Vectorized<c10::qint32>::loadu(qvals.data());
}
@ -359,7 +367,7 @@ struct Vectorized<c10::qint8> : public VectorizedQuantizedConverter<
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
float * float_vals = new float[float_num_vecs() * Vectorized<float>::size()];
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(
@ -370,10 +378,11 @@ struct Vectorized<c10::qint8> : public VectorizedQuantizedConverter<
at::native::quantize_vec<c10::qint8>(
scale,
zero_point,
float_vals.data(),
float_vals,
(c10::qint8*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
delete[] float_vals;
return Vectorized<c10::qint8>::loadu(qvals.data());
}
@ -511,7 +520,7 @@ struct Vectorized<c10::quint8> : public VectorizedQuantizedConverter<
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
float * float_vals = new float[float_num_vecs() * Vectorized<float>::size()];
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(
@ -522,10 +531,11 @@ struct Vectorized<c10::quint8> : public VectorizedQuantizedConverter<
at::native::quantize_vec<c10::quint8>(
scale,
zero_point,
float_vals.data(),
float_vals,
(c10::quint8*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
delete[] float_vals;
return Vectorized<c10::quint8>::loadu(qvals.data());
}
@ -600,7 +610,7 @@ Vectorized<c10::quint8> inline maximum(
return a.maximum(b);
}
#endif // defined(CPU_CAPABILITY_SVE)
#endif // defined(CPU_CAPABILITY_SVE256)
} // namespace CPU_CAPABILITY
} // namespace at::vec
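
The quantized-vector hunks above mostly swap stack arrays for heap buffers (again because `Vectorized<float>::size()` is no longer constexpr), but the per-element loops they feed implement the standard affine quantization scheme: `real = scale * (q - zero_point)` and `q = clamp(round(real / scale) + zero_point, qmin, qmax)`. A scalar sketch of that scheme for int8, with illustrative helper names (the diff itself delegates to `at::native::dequantize_val` and `quantize_vec`):

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Real value represented by a quantized int8 lane.
static float dequantize_int8(std::int8_t q, float scale, std::int32_t zero_point) {
  return scale * (static_cast<std::int32_t>(q) - zero_point);
}

// Quantize a real value back to int8, clamping to the representable range.
static std::int8_t quantize_int8(float x, float scale, std::int32_t zero_point) {
  const std::int32_t q =
      static_cast<std::int32_t>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<std::int8_t>(std::clamp(q, -128, 127));
}

int main() {
  const float scale = 0.5f;
  const std::int32_t zp = 10;
  const std::int8_t q = quantize_int8(3.0f, scale, zp);  // round(3/0.5) + 10 = 16
  const float back = dequantize_int8(q, scale, zp);      // 0.5 * (16 - 10) = 3.0
  return (q == 16 && back == 3.0f) ? 0 : 1;
}
```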

View File

@ -4,7 +4,9 @@
#include <ATen/cpu/vec/intrinsics.h>
#ifdef __aarch64__
#if !defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE) || defined(CPU_CAPABILITY_SVE256)
#include <ATen/cpu/vec/sve/vec_common_sve.h>
#else
#include <ATen/cpu/vec/vec128/vec128_bfloat16_neon.h>
#include <ATen/cpu/vec/vec128/vec128_float_neon.h>
#include <ATen/cpu/vec/vec128/vec128_half_neon.h>

View File

@ -241,7 +241,7 @@ class Vectorized<c10::BFloat16> : public Vectorized16<
Vectorized() = default;
Vectorized(c10::BFloat16 val)
: Vectorized16(at_vdupq_n_bf16(c10::bit_cast<at_bfloat16_t>(val.x))) {}
: Vectorized16(at_vdupq_n_bf16(val.x)) {}
Vectorized(float val) : Vectorized(c10::BFloat16(val)) {}
Vectorized(
value_type val0,
@ -253,14 +253,14 @@ class Vectorized<c10::BFloat16> : public Vectorized16<
value_type val6,
value_type val7)
: Vectorized16(at_bfloat16x8_t{
c10::bit_cast<at_bfloat16_t>(val0.x),
c10::bit_cast<at_bfloat16_t>(val1.x),
c10::bit_cast<at_bfloat16_t>(val2.x),
c10::bit_cast<at_bfloat16_t>(val3.x),
c10::bit_cast<at_bfloat16_t>(val4.x),
c10::bit_cast<at_bfloat16_t>(val5.x),
c10::bit_cast<at_bfloat16_t>(val6.x),
c10::bit_cast<at_bfloat16_t>(val7.x)}) {}
val0.x,
val1.x,
val2.x,
val3.x,
val4.x,
val5.x,
val6.x,
val7.x}) {}
static Vectorized<c10::BFloat16> blendv(
const Vectorized<c10::BFloat16>& a,

View File

@ -4,7 +4,7 @@
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE))
template <typename src_t>
struct VecConvert<
float,

View File

@ -41,32 +41,16 @@ inline namespace CPU_CAPABILITY {
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
template <int index, bool mask_val>
template <int index>
struct BlendRegs {
static float32x4_t impl(
const float32x4_t& a,
const float32x4_t& b,
float32x4_t& res);
};
template <int index>
struct BlendRegs<index, true> {
static float32x4_t impl(
const float32x4_t& a,
const float32x4_t& b,
float32x4_t& res) {
return vsetq_lane_f32(vgetq_lane_f32(b, index), res, index);
}
};
template <int index>
struct BlendRegs<index, false> {
static float32x4_t impl(
const float32x4_t& a,
const float32x4_t& b,
float32x4_t& res) {
return vsetq_lane_f32(vgetq_lane_f32(a, index), res, index);
}
float32x4_t& res,
bool mask_val
) {
return vsetq_lane_f32(vgetq_lane_f32(mask_val ? b : a, index), res, index);
}
};
template <>
@ -94,19 +78,15 @@ class Vectorized<float> {
operator float32x4_t() const {
return values;
}
template <int64_t mask>
static Vectorized<float> blend(
const Vectorized<float>& a,
const Vectorized<float>& b) {
const Vectorized<float>& b,
int64_t mask) {
Vectorized<float> vec;
vec.values = BlendRegs < 0,
(mask & 0x01) != 0 > ::impl(a.values, b.values, vec.values);
vec.values = BlendRegs < 1,
(mask & 0x02) != 0 > ::impl(a.values, b.values, vec.values);
vec.values = BlendRegs < 2,
(mask & 0x04) != 0 > ::impl(a.values, b.values, vec.values);
vec.values = BlendRegs < 3,
(mask & 0x08) != 0 > ::impl(a.values, b.values, vec.values);
vec.values = BlendRegs <0>::impl(a.values, b.values, vec.values, (mask & 0x01) != 0);
vec.values = BlendRegs <1> ::impl(a.values, b.values, vec.values, (mask & 0x02) != 0);
vec.values = BlendRegs <2> ::impl(a.values, b.values, vec.values, (mask & 0x04) != 0);
vec.values = BlendRegs <3> ::impl(a.values, b.values, vec.values, (mask & 0x08) != 0);
return vec;
}
static Vectorized<float> blendv(
@ -307,11 +287,50 @@ class Vectorized<float> {
DEFINE_SLEEF_COMPATIBLE_UNARY_ELEMENTWISE_FUNC(exp)
DEFINE_SLEEF_COMPATIBLE_UNARY_ELEMENTWISE_FUNC(exp2)
DEFINE_SLEEF_COMPATIBLE_UNARY_ELEMENTWISE_FUNC(expm1)
// Implementation copied from Arm Optimized Routine https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/expf.c
Vectorized<float> exp_u20() const {
return exp();
// bail out to sleef if it's a special case:
// i.e. there's an input s.t. |input| > 87.3....
const float32x4_t special_bound = vdupq_n_f32(0x1.5d5e2ap+6f);
uint32x4_t cmp = vcagtq_f32 (values, special_bound);
if (vpaddd_u64 (vreinterpretq_u64_u32 (cmp)) != 0) {
return exp();
}
const float32x4_t inv_ln2 = vdupq_n_f32(0x1.715476p+0f);
const float ln2_hi = 0x1.62e4p-1f;
const float ln2_lo = 0x1.7f7d1cp-20f;
const float c0 = 0x1.0e4020p-7f;
const float c2 = 0x1.555e66p-3f;
const float32x4_t ln2_c02 = {ln2_hi, ln2_lo, c0, c2};
const uint32x4_t exponent_bias = vdupq_n_u32(0x3f800000);
const float32x4_t c1 = vdupq_n_f32(0x1.573e2ep-5f);
const float32x4_t c3 = vdupq_n_f32(0x1.fffdb6p-2f);
const float32x4_t c4 = vdupq_n_f32(0x1.ffffecp-1f);
/* exp(x) = 2^n (1 + poly(r)), with 1 + poly(r) in [1/sqrt(2),sqrt(2)]
x = ln2*n + r, with r in [-ln2/2, ln2/2]. */
float32x4_t n = vrndaq_f32 (vmulq_f32 (values, inv_ln2));
float32x4_t r = vfmsq_laneq_f32 (values, n, ln2_c02, 0);
r = vfmsq_laneq_f32 (r, n, ln2_c02, 1);
uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_s32 (vcvtq_s32_f32 (n)), 23);
float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, exponent_bias));
float32x4_t r2 = vmulq_f32 (r, r);
float32x4_t p = vfmaq_laneq_f32 (c1, r, ln2_c02, 2);
float32x4_t q = vfmaq_laneq_f32 (c3, r, ln2_c02, 3);
q = vfmaq_f32 (q, p, r2);
p = vmulq_f32 (c4, r);
float32x4_t poly = vfmaq_f32 (p, q, r2);
return vfmaq_f32 (scale, poly, scale);
}
Vectorized<float> fexp_u20() const {
return exp();
return exp_u20();
}
DEFINE_SLEEF_COMPATIBLE_BINARY_ELEMENTWISE_FUNC_WITH_SLEEF_NAME(
fmod,
@ -645,4 +664,4 @@ inline Vectorized<float> Vectorized<float>::erf() const {
#endif /* defined(aarch64) */
} // namespace CPU_CAPABILITY
} // namespace at::vec
} // namespace at::vec
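
Both this NEON diff and the SVE float diff earlier replace the `exp_u20` stub (which previously just deferred to `exp()`) with a kernel adapted from Arm Optimized Routines. The structure is the classic range reduction `exp(x) = 2^n * exp(r)` with `n = round(x / ln2)` and `r = x - n*ln2`, where ln2 is split into a high and a low part to keep `r` accurate; the scale `2^n` is then assembled from the IEEE-754 exponent bits on NEON and via the FEXPA-based `svexpa` path on SVE. A scalar walk-through using the same reduction constants as the diff; the polynomial here is a plain Taylor expansion rather than the tuned coefficients, and the function name is illustrative:

```
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

static float exp_range_reduced(float x) {
  const float inv_ln2 = 0x1.715476p+0f;   // 1/ln(2)
  const float ln2_hi  = 0x1.62e4p-1f;     // high part of ln(2)
  const float ln2_lo  = 0x1.7f7d1cp-20f;  // low part of ln(2)

  const float n = std::nearbyint(x * inv_ln2);
  const float r = (x - n * ln2_hi) - n * ln2_lo;  // r in [-ln2/2, ln2/2]

  // 2^n via the exponent bits: (n + 127) << 23. Valid while n stays in the
  // normal-exponent range; the vector kernels bail out for |x| > ~87.3 for
  // exactly this reason.
  const std::uint32_t bits =
      static_cast<std::uint32_t>((static_cast<std::int32_t>(n) + 127) << 23);
  float scale;
  std::memcpy(&scale, &bits, sizeof(scale));

  // Low-degree polynomial for exp(r) on the reduced interval.
  const float poly = 1.0f + r + 0.5f * r * r + (1.0f / 6.0f) * r * r * r;
  return scale * poly;
}

int main() {
  for (const float x : {-3.0f, -0.5f, 0.0f, 1.0f, 5.0f}) {
    std::printf("x=%+.2f  approx=%.6f  std::exp=%.6f\n",
                x, exp_range_reduced(x), std::exp(x));
  }
  return 0;
}
```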

View File

@ -813,11 +813,12 @@ static inline Vectorized<T> binary_op_as_fp32(
#define LOAD_FP32_NON_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name( \
const type* data, Vectorized<float>& out) { \
__at_align__ float values[Vectorized<float>::size()]; \
__at_align__ float * values = new float[Vectorized<float>::size()]; \
for (const auto k : c10::irange(Vectorized<float>::size())) { \
values[k] = data[k]; \
} \
out = Vectorized<float>::loadu(values); \
delete[] values; \
} \
\
inline void load_fp32_from_##name( \

View File

@ -269,12 +269,13 @@ LOAD_FP32_VECTORIZED_INIT(BFloat16, bf16)
#else // defined(CPU_CAPABILITY_AVX2)
#if !( \
defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE256))
defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__))
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16)
#endif
#if !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16)
#endif
#endif // defined(CPU_CAPABILITY_AVX2)
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -294,7 +294,7 @@ struct VecConvert<
};
#endif
#if defined(CPU_CAPABILITY_SVE256) && defined(__ARM_FEATURE_BF16)
#if (defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)) && defined(__ARM_FEATURE_BF16)
template <>
struct VecConvert<float, 1, BFloat16, 1> {

View File

@ -270,7 +270,7 @@ LOAD_FP32_VECTORIZED_INIT(Half, fp16)
#if !( \
defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE256))
!defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE))
CONVERT_NON_VECTORIZED_INIT(Half, half)
#endif

View File

@ -915,7 +915,7 @@ Vectorized<c10::quint8> inline maximum(
return a.maximum(b);
}
#elif !defined(CPU_CAPABILITY_SVE256)
#elif !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
// NOTE: These are low-performance implementations that we fall back on
// if we are not building with AVX2. This may not be an issue, because
@ -1374,11 +1374,11 @@ Vectorized<c10::quint8> inline maximum(
#endif // if defined(CPU_CAPABILITY_AVX2)
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
std::pair<Vectorized<float>, Vectorized<float>> inline convert_int8_to_float(
at::vec::Vectorized<int8_t> src) {
auto s8x8 = vld1_s8(src.operator const int8_t*());
auto s16x8 = vmovl_s8(s8x8);
#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
std::pair<Vectorized<float>, Vectorized<float>>
inline convert_int8_to_float(at::vec::Vectorized<int8_t> src) {
auto s8x8 = vld1_s8(src.operator const int8_t*());
auto s16x8 = vmovl_s8(s8x8);
auto s32x4_hi = vmovl_s16(vget_high_s16(s16x8));
auto s32x4_lo = vmovl_s16(vget_low_s16(s16x8));


@ -68,7 +68,7 @@ Windows llvm will not have this definition.
#define VECTOR_WIDTH 64
#define int_vector __m512i
#elif defined(__aarch64__) && \
!defined(CPU_CAPABILITY_SVE) // CPU_CAPABILITY_AVX512
!defined(CPU_CAPABILITY_SVE) && !defined(CPU_CAPABILITY_SVE256) // CPU_CAPABILITY_AVX512
// SVE code expects 256-vectors; leave that set for SVE?
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(16)))
@ -79,6 +79,18 @@ Windows llvm will not have this definition.
#endif
#define VECTOR_WIDTH 16
#else // CPU_CAPABILITY_AVX512
#if defined(CPU_CAPABILITY_SVE)
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(16)))
#elif defined(_WIN32)
#define __at_align__ __declspec(align(16))
#else
#define __at_align__
#endif
#define VECTOR_WIDTH 16
#define int_vector __m256i
#else // CPU_CAPABILITY_SVE256 || CPU_CAPABILITY_SVE
#if defined(CPU_CAPABILITY_SVE256)
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(32)))
#elif defined(_WIN32)
@ -88,6 +100,18 @@ Windows llvm will not have this definition.
#endif
#define VECTOR_WIDTH 32
#define int_vector __m256i
#else // CPU_CAPABILITY_SVE
#if defined(__GNUC__)
#define __at_align__ __attribute__((aligned(16)))
#elif defined(_WIN32)
#define __at_align__ __declspec(align(16))
#else
#define __at_align__
#endif
#define VECTOR_WIDTH 16
#define int_vector __m256i
#endif // CPU_CAPABILITY_SVE256
#endif // CPU_CAPABILITY_SVE256 || CPU_CAPABILITY_SVE
#endif // CPU_CAPABILITY_AVX512
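// Net effect of the ladder above for the aarch64 branches, as written in this hunk:
//   CPU_CAPABILITY_SVE256               -> __at_align__ aligned(32), VECTOR_WIDTH 32
//   CPU_CAPABILITY_SVE (length-agnostic)-> __at_align__ aligned(16), VECTOR_WIDTH 16
//   other aarch64 (NEON only)           -> __at_align__ aligned(16), VECTOR_WIDTH 16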
namespace at::vec {
@ -210,8 +234,7 @@ struct Vectorized {
auto as_bytes() const -> const char* {
return reinterpret_cast<const char*>(values);
}
template <int64_t mask_>
static Vectorized<T> blend(const Vectorized<T>& a, const Vectorized<T>& b) {
static Vectorized<T> blend(const Vectorized<T>& a, const Vectorized<T>& b, const int64_t mask_) {
int64_t mask = mask_;
Vectorized vector;
for (const auto i : c10::irange(size())) {
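// Call-site sketch for the blend() signature change above (illustrative):
//   before: auto c = Vectorized<float>::blend<0b0011>(a, b);  // mask was a template argument
//   after:  auto c = Vectorized<float>::blend(a, b, 0b0011);  // mask is now a runtime argument,
//           // usable even when Vectorized<T>::size() is not a compile-time constant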
@ -1312,7 +1335,7 @@ std::
T const* base_addr,
const Vectorized<int_same_size_t<T>>& vindex,
Vectorized<T>& mask) {
static constexpr int size = Vectorized<T>::size();
static const int size = Vectorized<T>::size();
T src_arr[size];
int_same_size_t<T> mask_arr[size]; // use int type so we can logical and
int_same_size_t<T> index_arr[size];
@ -1405,7 +1428,7 @@ inline Vectorized<T> convert_to_fp_of_same_size(
// clang-format on
template <typename T>
inline std::enable_if_t<
Vectorized<T>::size() % 2 == 0,
true,
std::pair<Vectorized<T>, Vectorized<T>>>
deinterleave2(const Vectorized<T>& a, const Vectorized<T>& b) {
static constexpr int size = Vectorized<T>::size();
@ -1444,7 +1467,7 @@ VECTORIZED_SUPPORT_SCALARS_FOR_BINARY_FUNC(deinterleave2)
// clang-format on
template <typename T>
inline std::enable_if_t<
Vectorized<T>::size() % 2 == 0,
true,
std::pair<Vectorized<T>, Vectorized<T>>>
interleave2(const Vectorized<T>& a, const Vectorized<T>& b) {
static constexpr int size = Vectorized<T>::size();
@ -1486,7 +1509,7 @@ inline void convert(const src_T* src, dst_T* dst, int64_t n) {
template <typename T>
inline Vectorized<T> flip(const Vectorized<T>& data) {
static constexpr int size = Vectorized<T>::size();
static const int size = Vectorized<T>::size();
T output[size];
T buffer[size];
data.store(static_cast<void*>(buffer));


@ -15,7 +15,7 @@ template <
struct VecConvert {
static inline VectorizedN<dst_t, dst_n> apply(
const VectorizedN<src_t, src_n>& src) {
constexpr int count = std::min(
const int count = std::min(
VectorizedN<src_t, src_n>::size(), VectorizedN<dst_t, dst_n>::size());
__at_align__ src_t src_buf[VectorizedN<src_t, src_n>::size()];
src.store(src_buf);


@ -2,6 +2,8 @@
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_n.h>
#include <cassert>
namespace at::vec {
inline namespace CPU_CAPABILITY {
@ -38,9 +40,9 @@ struct VecMaskLoad {
static inline VectorizedN<data_t, data_n> apply(
const data_t* ptr,
const VecMask<mask_t, mask_n>& vec_mask) {
constexpr typename VecMask<mask_t, mask_n>::size_type size =
const typename VecMask<mask_t, mask_n>::size_type size =
VecMask<mask_t, mask_n>::size();
static_assert(VectorizedN<data_t, data_n>::size() >= size);
assert((VectorizedN<data_t, data_n>::size() >= size));
__at_align__ data_t data[size];
__at_align__ mask_t mask[size];
auto mask_ = VectorizedN<mask_t, mask_n>(vec_mask);
@ -134,7 +136,7 @@ class VecMask {
template <typename U, int L>
static VecMask<T, N> from(const VectorizedN<U, L>& b_vec) {
__at_align__ U b_buf[size()];
if constexpr (size() >= VectorizedN<U, L>::size()) {
if (size() >= VectorizedN<U, L>::size()) {
b_vec.store(b_buf);
for (int i = VectorizedN<U, L>::size(); i < size(); i++) {
b_buf[i] = static_cast<U>(0);
@ -235,16 +237,18 @@ class VecMask {
template <
typename U,
int L,
std::enable_if_t<L >= 2 && VectorizedN<U, L>::size() >= size(), int> = 0>
std::enable_if_t<L >= 2, int> = 0>
VectorizedN<U, L> loadu(const U* ptr) const {
assert((VectorizedN<U, L>::size() >= size()));
return VecMaskLoad<U, L, T, N>::apply(ptr, *this);
}
template <
typename U,
int L,
std::enable_if_t<L == 1 && Vectorized<U>::size() >= size(), int> = 0>
std::enable_if_t<L == 1, int> = 0>
Vectorized<U> loadu(const U* ptr) const {
assert((Vectorized<U>::size() >= size()));
return VecMaskLoad<U, L, T, N>::apply(ptr, *this);
}
};


@ -28,7 +28,7 @@ class VectorizedN {
using size_type = int;
static constexpr size_type size_T = sizeof(T);
static constexpr size_type size() {
static size_type size() {
return Vectorized<T>::size() * N;
}


@ -162,7 +162,7 @@ struct CUDACachingHostAllocatorImpl
}
bool pinned_use_background_threads() override {
return c10::cuda::CUDACachingAllocator::CUDAAllocatorConfig::
return c10::CachingAllocator::AcceleratorAllocatorConfig::
pinned_use_background_threads();
}


@ -1157,6 +1157,7 @@ REGISTER_AVX512_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_AVX2_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_VSX_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_ZVECTOR_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_SVE_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_SVE256_DISPATCH(cholesky_stub, &cholesky_kernel)
REGISTER_ARCH_DISPATCH(cholesky_inverse_stub, DEFAULT, &cholesky_inverse_kernel_impl)
@ -1164,6 +1165,7 @@ REGISTER_AVX512_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_AVX2_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_VSX_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_ZVECTOR_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_SVE_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_SVE256_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl)
REGISTER_ARCH_DISPATCH(linalg_eig_stub, DEFAULT, &linalg_eig_kernel)
@ -1171,6 +1173,7 @@ REGISTER_AVX512_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_AVX2_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_VSX_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_ZVECTOR_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_SVE_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_SVE256_DISPATCH(linalg_eig_stub, &linalg_eig_kernel)
REGISTER_ARCH_DISPATCH(linalg_eigh_stub, DEFAULT, &linalg_eigh_kernel)
@ -1178,6 +1181,7 @@ REGISTER_AVX512_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_AVX2_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_VSX_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_ZVECTOR_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_SVE_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_SVE256_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel)
REGISTER_ARCH_DISPATCH(geqrf_stub, DEFAULT, &geqrf_kernel)
@ -1185,6 +1189,7 @@ REGISTER_AVX512_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_AVX2_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_VSX_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_ZVECTOR_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_SVE_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_SVE256_DISPATCH(geqrf_stub, &geqrf_kernel)
REGISTER_ARCH_DISPATCH(orgqr_stub, DEFAULT, &orgqr_kernel_impl)
@ -1192,6 +1197,7 @@ REGISTER_AVX512_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_AVX2_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_VSX_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_ZVECTOR_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_SVE_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_SVE256_DISPATCH(orgqr_stub, &orgqr_kernel_impl)
REGISTER_ARCH_DISPATCH(ormqr_stub, DEFAULT, &ormqr_kernel)
@ -1199,6 +1205,7 @@ REGISTER_AVX512_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_AVX2_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_VSX_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_ZVECTOR_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_SVE_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_SVE256_DISPATCH(ormqr_stub, &ormqr_kernel)
REGISTER_ARCH_DISPATCH(lstsq_stub, DEFAULT, &lstsq_kernel)
@ -1206,6 +1213,7 @@ REGISTER_AVX512_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_AVX2_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_VSX_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_ZVECTOR_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_SVE_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_SVE256_DISPATCH(lstsq_stub, &lstsq_kernel)
REGISTER_ARCH_DISPATCH(triangular_solve_stub, DEFAULT, &triangular_solve_kernel)
@ -1213,6 +1221,7 @@ REGISTER_AVX512_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_AVX2_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_VSX_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_ZVECTOR_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_SVE_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_SVE256_DISPATCH(triangular_solve_stub, &triangular_solve_kernel)
REGISTER_ARCH_DISPATCH(lu_factor_stub, DEFAULT, &lu_factor_kernel)
@ -1220,6 +1229,7 @@ REGISTER_AVX512_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_AVX2_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_VSX_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_ZVECTOR_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_SVE_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_SVE256_DISPATCH(lu_factor_stub, &lu_factor_kernel)
REGISTER_ARCH_DISPATCH(ldl_factor_stub, DEFAULT, &ldl_factor_kernel)
@ -1227,6 +1237,7 @@ REGISTER_AVX512_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_AVX2_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_VSX_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_ZVECTOR_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_SVE_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_SVE256_DISPATCH(ldl_factor_stub, &ldl_factor_kernel)
REGISTER_ARCH_DISPATCH(ldl_solve_stub, DEFAULT, &ldl_solve_kernel)
@ -1234,6 +1245,7 @@ REGISTER_AVX512_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_AVX2_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_VSX_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_ZVECTOR_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_SVE_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_SVE256_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
REGISTER_ARCH_DISPATCH(lu_solve_stub, DEFAULT, &lu_solve_kernel)
@ -1241,6 +1253,7 @@ REGISTER_AVX512_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_AVX2_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_VSX_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_ZVECTOR_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_SVE_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_SVE256_DISPATCH(lu_solve_stub, &lu_solve_kernel)
REGISTER_ARCH_DISPATCH(svd_stub, DEFAULT, &svd_kernel)
@ -1248,6 +1261,7 @@ REGISTER_AVX512_DISPATCH(svd_stub, &svd_kernel)
REGISTER_AVX2_DISPATCH(svd_stub, &svd_kernel)
REGISTER_VSX_DISPATCH(svd_stub, &svd_kernel)
REGISTER_ZVECTOR_DISPATCH(svd_stub, &svd_kernel)
REGISTER_SVE_DISPATCH(svd_stub, &svd_kernel)
REGISTER_SVE256_DISPATCH(svd_stub, &svd_kernel)
REGISTER_ARCH_DISPATCH(unpack_pivots_stub, DEFAULT, &unpack_pivots_cpu_kernel)
@ -1255,5 +1269,6 @@ REGISTER_AVX512_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
REGISTER_AVX2_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
REGISTER_VSX_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
REGISTER_SVE_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
REGISTER_SVE256_DISPATCH(unpack_pivots_stub, &unpack_pivots_cpu_kernel)
} // namespace at::native


@ -38,17 +38,27 @@ static CPUCapability compute_cpu_capability() {
return CPUCapability::ZVECTOR;
}
#elif defined(HAVE_SVE_CPU_DEFINITION)
int sve_vl = cpuinfo_get_max_arm_sve_length(); //Returns maximum SVE VL supported by your HW.
#ifdef HAVE_SVE256_CPU_DEFINITION
int sve_vl = cpuinfo_get_max_arm_sve_length(); // Returns maximum SVE VL supported by your HW.
#ifdef HAVE_SVE_CPU_DEFINITION
if (envar == "sve256") {
if (sve_vl == 256) {
#ifdef HAVE_ARM_BF16_CPU_DEFINITION
if (cpuinfo_has_arm_bf16()) {
if (cpuinfo_has_arm_bf16()) {
if (sve_vl == 256) {
return CPUCapability::SVE256;
} else if (sve_vl > 0) {
return CPUCapability::SVE;
}
#endif
}
TORCH_WARN("SVE256 capability not available on hardware. Falling back to DEFAULT");
#endif
TORCH_WARN("SVE capability not available on hardware. Falling back to DEFAULT");
return CPUCapability::DEFAULT;
} else if (envar == "sve") {
#ifdef HAVE_ARM_BF16_CPU_DEFINITION
if (cpuinfo_has_arm_bf16() && sve_vl > 0) {
return CPUCapability::SVE;
}
#endif
TORCH_WARN("SVE capability not available on hardware. Falling back to DEFAULT");
return CPUCapability::DEFAULT;
}
#endif
@ -100,19 +110,15 @@ static CPUCapability compute_cpu_capability() {
#if defined(__linux__) && defined(HAVE_SVE_CPU_DEFINITION)
if (cpuinfo_initialize() && cpuinfo_has_arm_sve()) {
int sve_vl = cpuinfo_get_max_arm_sve_length(); //Returns maximum SVE VL supported by your HW.
if (sve_vl <= 0) {
// SVE is not supported on this system.
// Return the default CPU capability.
return CPUCapability::DEFAULT;
#ifdef HAVE_ARM_BF16_CPU_DEFINITION
if (cpuinfo_has_arm_bf16()) {
if (sve_vl == 256) { // Check for SVE256
return CPUCapability::SVE256;
} else if (sve_vl > 0) {
return CPUCapability::SVE;
}
}
#ifdef HAVE_SVE256_CPU_DEFINITION
if (sve_vl == 256) { // Check for SVE256
#ifdef HAVE_ARM_BF16_CPU_DEFINITION
if (cpuinfo_has_arm_bf16())
return CPUCapability::SVE256;
#endif
}
#endif
#endif
// Return the default CPU capability.
return CPUCapability::DEFAULT;
}
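Condensed, the selection order implemented above is: prefer the fixed 256-bit SVE256 kernels when the hardware vector length is exactly 256 bits, fall back to the new length-agnostic SVE kernels for any other non-zero vector length, and otherwise return DEFAULT. A minimal sketch of that priority, assuming the cpuinfo and bf16 checks have already passed (not the literal function):

static CPUCapability pick_sve_capability_sketch(int sve_vl) {
  if (sve_vl == 256) {
    return CPUCapability::SVE256; // fixed 256-bit kernels
  }
  if (sve_vl > 0) {
    return CPUCapability::SVE;    // vector-length-agnostic kernels
  }
  return CPUCapability::DEFAULT;  // no usable SVE on this machine
}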
@ -144,7 +150,8 @@ DispatchResult DispatchStubImpl::try_get_call_ptr(
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
) {
@ -182,7 +189,8 @@ DispatchResult DispatchStubImpl::try_get_call_ptr(
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, SVE
, SVE256
#endif
);
@ -239,7 +247,8 @@ void* DispatchStubImpl::get_call_ptr(
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
) {
@ -263,7 +272,9 @@ void* DispatchStubImpl::get_call_ptr(
,
ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
,
SVE
,
SVE256
#endif
@ -298,7 +309,8 @@ DispatchResult DispatchStubImpl::try_choose_cpu_impl(
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
){
@ -333,7 +345,7 @@ DispatchResult DispatchStubImpl::try_choose_cpu_impl(
return ZVECTOR != nullptr ? DispatchResult(ZVECTOR) : ErrorType::MissingDeviceKernel;
}
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
if (capability >= static_cast<int>(CPUCapability::SVE256)) {
if (C10_UNLIKELY(!SVE256)) {
// dispatch to DEFAULT, since the SVE kernel is missing
@ -342,6 +354,14 @@ DispatchResult DispatchStubImpl::try_choose_cpu_impl(
return DispatchResult(SVE256);
}
}
if (capability >= static_cast<int>(CPUCapability::SVE)) {
if (C10_UNLIKELY(!SVE)) {
// dispatch to DEFAULT, since the SVE kernel is missing
return DEFAULT != nullptr ? DispatchResult(DEFAULT) : ErrorType::MissingDeviceKernel;
} else {
return DispatchResult(SVE);
}
}
#endif
return DEFAULT != nullptr ? DispatchResult(DEFAULT) : ErrorType::MissingDeviceKernel;
}
@ -360,7 +380,8 @@ void* DispatchStubImpl::choose_cpu_impl(
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
) {
@ -398,7 +419,7 @@ void* DispatchStubImpl::choose_cpu_impl(
return ZVECTOR;
}
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
if (capability >= static_cast<int>(CPUCapability::SVE256)) {
if (C10_UNLIKELY(!SVE256)) {
// dispatch to DEFAULT, since the SVE kernel is missing
@ -408,6 +429,15 @@ void* DispatchStubImpl::choose_cpu_impl(
return SVE256;
}
}
if (capability >= static_cast<int>(CPUCapability::SVE)) {
if (C10_UNLIKELY(!SVE)) {
// dispatch to DEFAULT, since the SVE kernel is missing
TORCH_INTERNAL_ASSERT(DEFAULT, "DispatchStub: missing default kernel");
return DEFAULT;
} else {
return SVE;
}
}
#endif
TORCH_INTERNAL_ASSERT(DEFAULT, "DispatchStub: missing default kernel");
return DEFAULT;


@ -64,8 +64,9 @@ enum class CPUCapability {
VSX = 1,
#elif defined(HAVE_ZVECTOR_CPU_DEFINITION)
ZVECTOR = 1,
#elif defined(HAVE_SVE256_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
SVE256 = 1,
#elif defined(HAVE_SVE_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
SVE=1,
SVE256 = 2,
#else
AVX2 = 1,
AVX512 = 2,
@ -115,7 +116,8 @@ struct TORCH_API DispatchStubImpl {
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
);
@ -136,7 +138,8 @@ struct TORCH_API DispatchStubImpl {
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
);
@ -157,7 +160,8 @@ struct TORCH_API DispatchStubImpl {
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
);
@ -181,7 +185,8 @@ struct TORCH_API DispatchStubImpl {
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, void *ZVECTOR
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, void *SVE
, void *SVE256
#endif
);
@ -238,7 +243,8 @@ private:
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, reinterpret_cast<void*>(ZVECTOR)
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, reinterpret_cast<void*>(SVE)
, reinterpret_cast<void*>(SVE256)
#endif
)
@ -299,7 +305,8 @@ public:
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
, reinterpret_cast<void*>(ZVECTOR)
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
, reinterpret_cast<void*>(SVE)
, reinterpret_cast<void*>(SVE256)
#endif
);
@ -322,7 +329,8 @@ public:
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
static TORCH_API FnPtr ZVECTOR;
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
static TORCH_API FnPtr SVE;
static TORCH_API FnPtr SVE256;
#endif
private:
@ -426,9 +434,11 @@ struct RegisterPRIVATEUSE1Dispatch {
#define REGISTER_ZVECTOR_DISPATCH(name, fn)
#endif
#ifdef HAVE_SVE256_CPU_DEFINITION
#ifdef HAVE_SVE_CPU_DEFINITION
#define REGISTER_SVE_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, SVE, fn)
#define REGISTER_SVE256_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, SVE256, fn)
#else
#define REGISTER_SVE_DISPATCH(name, fn)
#define REGISTER_SVE256_DISPATCH(name, fn)
#endif
@ -440,6 +450,7 @@ struct RegisterPRIVATEUSE1Dispatch {
REGISTER_AVX2_DISPATCH(name, fn) \
REGISTER_VSX_DISPATCH(name, fn) \
REGISTER_ZVECTOR_DISPATCH(name, fn) \
REGISTER_SVE_DISPATCH(name, fn) \
REGISTER_SVE256_DISPATCH(name, fn)
#define REGISTER_NO_CPU_DISPATCH(name) \
@ -488,6 +499,7 @@ struct RegisterPRIVATEUSE1Dispatch {
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif
#define ALSO_REGISTER_AVX512_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#define ALSO_REGISTER_SVE_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#define ALSO_REGISTER_SVE256_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif
} // namespace at::native
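For orientation, here is a sketch of how a kernel stub would pick up the new SVE slot next to the existing SVE256 one. my_op_stub, my_op_fn, and my_op_kernel are hypothetical names, not part of this diff, and the remaining capability registrations are elided:

using my_op_fn = void (*)(TensorIteratorBase&);
DECLARE_DISPATCH(my_op_fn, my_op_stub);
DEFINE_DISPATCH(my_op_stub);

REGISTER_ARCH_DISPATCH(my_op_stub, DEFAULT, &my_op_kernel)
REGISTER_SVE_DISPATCH(my_op_stub, &my_op_kernel)     // new vector-length-agnostic slot
REGISTER_SVE256_DISPATCH(my_op_stub, &my_op_kernel)  // existing fixed 256-bit slot

At runtime, choose_cpu_impl walks these slots in the capability order above and falls back to DEFAULT when a slot is missing, as the DispatchStub.cpp hunks earlier show.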


@ -466,6 +466,7 @@ REGISTER_AVX2_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cp
REGISTER_AVX512_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cpu_kernel)
REGISTER_VSX_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cpu_kernel)
REGISTER_SVE_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cpu_kernel)
REGISTER_SVE256_DISPATCH(_segment_reduce_lengths_stub, &_segment_reduce_lengths_cpu_kernel)
// offsets dispatches
@ -477,6 +478,7 @@ REGISTER_AVX2_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cp
REGISTER_AVX512_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
REGISTER_VSX_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
REGISTER_SVE_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
REGISTER_SVE256_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
// Currently some computation is being duplicated across forward and backward.
@ -548,6 +550,9 @@ REGISTER_VSX_DISPATCH(
REGISTER_ZVECTOR_DISPATCH(
_segment_reduce_lengths_backward_stub,
&_segment_reduce_cpu_lengths_backward_kernel)
REGISTER_SVE_DISPATCH(
_segment_reduce_lengths_backward_stub,
&_segment_reduce_cpu_lengths_backward_kernel)
REGISTER_SVE256_DISPATCH(
_segment_reduce_lengths_backward_stub,
&_segment_reduce_cpu_lengths_backward_kernel)
@ -568,6 +573,9 @@ REGISTER_VSX_DISPATCH(
REGISTER_ZVECTOR_DISPATCH(
_segment_reduce_offsets_backward_stub,
&_segment_reduce_cpu_offsets_backward_kernel)
REGISTER_SVE_DISPATCH(
_segment_reduce_offsets_backward_stub,
&_segment_reduce_cpu_offsets_backward_kernel)
REGISTER_SVE256_DISPATCH(
_segment_reduce_offsets_backward_stub,
&_segment_reduce_cpu_offsets_backward_kernel)


@ -274,7 +274,7 @@ inline Vectorized<scalar_t> div_floor_floating_vec(
return floordiv;
}
#if defined(CPU_CAPABILITY_SVE256) && defined(__ARM_FEATURE_BF16)
#if (defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE)) && defined(__ARM_FEATURE_BF16)
// Since sve lacks sufficient bf16 intrinsics, do the calculations in f32 to
// avoid rounding errors. This should not cause performance issues as


@ -11,6 +11,7 @@
#include <ATen/native/transformers/attention.h>
#include <ATen/native/transformers/sdp_utils_cpp.h>
#include <c10/util/irange.h>
#include <variant>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
@ -44,13 +45,23 @@ inline void _scale_attn_mask_fusion_kernel(
#endif
const auto vec_size1 = at::vec::Vectorized<T1>::size();
const auto vec_size2 = at::vec::Vectorized<T2>::size();
constexpr int64_t T1_n =
const int64_t T1_n =
(vec_size2 == vec_size1 * 2 && is_reduced_floating_point_v<T2>) ? 2 : 1;
constexpr int64_t T2_n = 1;
auto vec_scale = at::vec::VectorizedN<T1, T1_n>(val);
std::variant<at::vec::VectorizedN<T1, 2>, at::vec::VectorizedN<T1, 1>> vec_scale;
if (T1_n == 2)
vec_scale = at::vec::VectorizedN<T1, 2>(val);
else if (T1_n == 1)
vec_scale = at::vec::VectorizedN<T1, 1>(val);
int64_t i = 0;
for (; i < size - (size % vec_size2); i += vec_size2) {
auto a_n = at::vec::VectorizedN<T1, T1_n>::loadu(a + i);
std::variant<at::vec::VectorizedN<T1, 2>, at::vec::VectorizedN<T1, 1>> a_n;
if (T1_n == 2)
a_n = at::vec::VectorizedN<T1, 2>::loadu(a + i);
else if (T1_n == 1)
a_n = at::vec::VectorizedN<T1, 1>::loadu(a + i);
at::vec::VectorizedN<T2, T2_n> b_n;
#if __GNUC__ == 11 && defined(__ARM_FEATURE_SVE)
if (is_b_stride_zero) {
@ -61,9 +72,16 @@ inline void _scale_attn_mask_fusion_kernel(
} else {
b_n = at::vec::VectorizedN<T2, T2_n>::loadu(b + i);
}
auto b_n_convert = at::vec::convert<T1, T1_n, T2, T2_n, true>(b_n);
auto res = a_n * vec_scale + b_n_convert;
res.store(out + i);
std::variant<at::vec::VectorizedN<T1, 2>, at::vec::VectorizedN<T1, 1>> b_n_convert;
if (T1_n == 2) {
auto b_n_convert = at::vec::convert<T1, 2, T2, T2_n, true>(b_n);
auto res = std::get<at::vec::VectorizedN<T1, 2>>(a_n) * std::get<at::vec::VectorizedN<T1, 2>>(vec_scale) + b_n_convert;
res.store(out + i);
} else if(T1_n == 1) {
auto b_n_convert = at::vec::convert<T1, 1, T2, T2_n, true>(b_n);
auto res = std::get<at::vec::VectorizedN<T1, 1>>(a_n) * std::get<at::vec::VectorizedN<T1, 1>>(vec_scale) + b_n_convert;
res.store(out + i);
}
}
for (; i < size; i++) {
auto tmp0 = a[i];
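Because T1_n is now a runtime value, VectorizedN<T1, T1_n> can no longer be spelled as a type, so the kernel keeps either width inside a std::variant and re-checks T1_n at every use. A condensed sketch of the same pattern with generic stand-in types (not the ATen classes):

#include <variant>

template <int N>
struct WideVec { /* stands in for N hardware registers */ };

std::variant<WideVec<2>, WideVec<1>> make_vec(int n) {
  // n plays the role of T1_n above: fixed for the whole call, but only known at runtime.
  if (n == 2) {
    return WideVec<2>{};
  }
  return WideVec<1>{};
}

An alternative design would be std::visit with a generic lambda, which would avoid repeating the if (T1_n == ...) branch at each use site.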


@ -694,7 +694,7 @@ struct ApplyGridSample<scalar_t, 2, GridSamplerInterpolation::Bilinear,
gx = gx * gx_mult;
gy = gy * gy_mult;
constexpr int64_t step = Vec::size();
const int64_t step = Vec::size();
auto interleaved_gGrid = interleave2(gx, gy);
auto gGrid_ptr = gGrid_slice.data() + offset * 2;
std::get<0>(interleaved_gGrid).store(gGrid_ptr,
@ -1010,7 +1010,7 @@ struct ApplyGridSample<scalar_t, 2, GridSamplerInterpolation::Bicubic,
gx = gx * gx_mult;
gy = gy * gy_mult;
constexpr int64_t step = Vec::size();
const int64_t step = Vec::size();
auto interleaved_gGrid = interleave2(gx, gy);
auto gGrid_ptr = gGrid_slice.data() + offset * 2;
std::get<0>(interleaved_gGrid).store(gGrid_ptr,
@ -1041,7 +1041,7 @@ static inline void grid_sample_2d_grid_slice_iterator(
using Vec = Vectorized<scalar_t>;
using iVec = Vectorized<int_same_size_t<scalar_t>>;
constexpr int64_t step = Vec::size();
const int64_t step = Vec::size();
// Loop over each output pixel in grid.
// We consider the following three cases (after slicing out the batch


@ -19,7 +19,7 @@ Vectorized<scalar_t> is_lerp_weight_small(Vectorized<scalar_t> weight) {
// is_lerp_weight_small doesn't work for complex because z.abs() returns a
// complex vector which can't be compared. Either implement it with z.abs_2_(),
// or fallback to the scalar function.
#if !(defined(CPU_CAPABILITY_DEFAULT) || defined(_MSC_VER) || defined(CPU_CAPABILITY_SVE))
#if !(defined(CPU_CAPABILITY_DEFAULT) || defined(_MSC_VER) || defined(CPU_CAPABILITY_SVE256) || defined(CPU_CAPABILITY_SVE))
template <typename value_t>
Vectorized<c10::complex<value_t>> is_lerp_weight_small(Vectorized<c10::complex<value_t>> weight) {
using vec_reg_t = decltype(weight.abs_2_());


@ -210,13 +210,22 @@ vectorized_loop(char** C10_RESTRICT data_, int64_t n, int64_t S, func_t&& op, ve
Vec opt_scalar = Vec(S > 0 ? c10::load((scalar_t*)data[S]) : scalar_t(0));
int64_t i = 0;
for (; i <= n - 2 * Vec::size(); i += 2 * Vec::size()) {
int size = Vec::size();
#if !defined(CPU_CAPABILITY_SVE) && !defined(CPU_CAPABILITY_SVE256)
// Loop unrolling prevents compiler from optimizing the SVE classes
for (; i <= n - 2 * size; i += 2 * size) {
auto args1 = dereference_vec<traits>(&data[1], opt_scalar, S, i);
auto args2 = dereference_vec<traits>(&data[1], opt_scalar, S, i + Vec::size());
auto args2 = dereference_vec<traits>(&data[1], opt_scalar, S, i + size);
auto out1 = c10::guts::apply(vop, std::move(args1));
auto out2 = c10::guts::apply(vop, std::move(args2));
out1.store(data[0] + i * sizeof(scalar_t));
out2.store(data[0] + (i + Vec::size()) * sizeof(scalar_t));
out2.store(data[0] + (i + size) * sizeof(scalar_t));
}
#endif
for (; i <= n - size; i += size) {
auto args1 = dereference_vec<traits>(&data[1], opt_scalar, S, i);
auto out1 = c10::guts::apply(vop, std::move(args1));
out1.store(data[0] + i * sizeof(scalar_t));
}
if (i < n) {
int64_t strides[ntensors];


@ -80,7 +80,7 @@ inline void UNARY_OUTER_LOOP(char* data[2], const int64_t strides[2], int64_t n,
template <typename func_t, typename vec_func_t>
inline void vectorized_inner_reduction(char** data, int64_t n, func_t op, vec_func_t vop) {
VEC_LOOP_HEADER(func_t, data)
constexpr int64_t vector_stride = 4 * Vec::size() * sizeof(scalar_t);
const int64_t vector_stride = 4 * Vec::size() * sizeof(scalar_t);
int64_t count = n / (4 * Vec::size());
if (count > 0) {
vectorized_reduction(data, count, vector_stride, op, vop, /*reduce=*/true);
@ -96,7 +96,7 @@ inline void vectorized_outer_reduction(char** data, int64_t inner_stride, int64_
VEC_LOOP_HEADER(func_t, data)
// reduce down each column of 4 * Vec::size() elements.
constexpr int64_t vector_stride = 4 * Vec::size() * sizeof(scalar_t);
const int64_t vector_stride = 4 * Vec::size() * sizeof(scalar_t);
int64_t outer_stride[2] = { vector_stride, vector_stride };
UNARY_OUTER_LOOP(data, outer_stride, size1 / (4 * Vec::size()), [&] {
vectorized_reduction(data, size0, inner_stride, op, vop, /*reduce=*/false);


@ -154,8 +154,8 @@ inline void map_acc(
using Vec = vec::Vectorized<scalar_t>;
using aVec = vec::Vectorized<accumut>;
int64_t d = 0;
constexpr int64_t kVecSize = Vec::size();
constexpr int64_t kaVecSize = aVec::size();
const int64_t kVecSize = Vec::size();
const int64_t kaVecSize = aVec::size();
for (d = 0; d < size - (size % kVecSize); d += kVecSize) {
Vec data2_vec = Vec::loadu(input_data2 + d);
auto [data2_avec0, data2_avec1] = convert_to_float<scalar_t>(data2_vec);


@ -22,8 +22,8 @@ inline namespace CPU_CAPABILITY {
constexpr auto kF32RegisterPairsPerIteration = 4;
constexpr auto kF32RegistersPerIteration = kF32RegisterPairsPerIteration * 2;
constexpr auto kF32ElementsPerRegister = vec::Vectorized<float>::size();
constexpr auto kF32ElementsPerIteration = kF32RegistersPerIteration * kF32ElementsPerRegister;
const auto kF32ElementsPerRegister = vec::Vectorized<float>::size();
const auto kF32ElementsPerIteration = kF32RegistersPerIteration * kF32ElementsPerRegister;
namespace {
template <typename T>
@ -150,16 +150,16 @@ float reduce(vec::VectorizedN<float, kF32RegistersPerIteration>& x) {
// BFDOT. Deferring that for now to get the NEON/ASIMD BFDOT path
// working.
#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && defined(__clang__) && __clang_major__ > 15
#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && !defined(CPU_CAPABILITY_SVE256) && defined(__clang__) && __clang_major__ > 15
// https://godbolt.org/z/z8P4Yncra
#define COMPILER_SUPPORTS_BF16_TARGET 1
#elif defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
#elif defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE) && !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
// https://gcc.gnu.org/gcc-10/changes.html
// https://godbolt.org/z/cdGG7vn8o
#define COMPILER_SUPPORTS_BF16_TARGET 1
#else // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && defined(__clang__) && __clang_major__ > 15
#else // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE) && defined(__clang__) && __clang_major__ > 15
#define COMPILER_SUPPORTS_BF16_TARGET 0
#endif // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && defined(__clang__) && __clang_major__ > 15
#endif // defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && !defined(CPU_CAPABILITY_SVE256) && defined(__clang__) && __clang_major__ > 15
#else // __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
#define COMPILER_SUPPORTS_BF16_TARGET 0
#endif // __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
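COMPILER_SUPPORTS_BF16_TARGET is typically used to gate a per-function target attribute so a single translation unit can carry a BF16 code path. A hypothetical sketch of that use; the attribute string, macro, and function name here are illustrative and not taken from this diff:

#if COMPILER_SUPPORTS_BF16_TARGET
#define SKETCH_BF16_ATTR __attribute__((target("arch=armv8.2-a+bf16")))
#else
#define SKETCH_BF16_ATTR
#endif

// Compiled with the bf16 ISA enabled only when the compiler supports the
// target attribute; callers still need a runtime CPU feature check.
SKETCH_BF16_ATTR void bf16_dot_sketch(const void* a, const void* b, float* out, int n);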
@ -212,7 +212,7 @@ std::pair<vec::Vectorized<float>, vec::Vectorized<float>> fmadd(
const vec::Vectorized<c10::Half>& b,
const vec::Vectorized<float>& acc_low,
const vec::Vectorized<float>& acc_high) {
#if defined(__ARM_FEATURE_FP16_FML) && !defined(CPU_CAPABILITY_SVE)
#if defined(__ARM_FEATURE_FP16_FML) && !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
return std::make_pair(vfmlalq_low_f16(acc_low, a, b), vfmlalq_high_f16(acc_high, a, b));
#else
const auto [a_float_low, a_float_high] = convert_half_float(a);


@ -28,8 +28,8 @@ inline void _update(at::opmath_type<scalar_t>* out_ptr, int64_t e, int64_t c, co
using opmath_t = at::opmath_type<scalar_t>;
using Vec = vec::Vectorized<scalar_t>;
using aVec = VecType<scalar_t>;
constexpr int64_t kVecSize = Vec::size();
constexpr int64_t kVLEN = kVecSize * 4;
const int64_t kVecSize = Vec::size();
const int64_t kVLEN = kVecSize * 4;
int64_t k = 0;
aVec val_vec = aVec((opmath_t)val);


@ -21,11 +21,11 @@ Vectorized<acc_t> load_reduce_vec(const scalar_t* data, F reduce, acc_t ident) {
using vacc_t = Vectorized<acc_t>;
static_assert(vacc_t::size() <= vec_t::size());
const auto val = vec_t::loadu(data);
alignas(64) std::array<scalar_t, vec_t::size()> values;
val.store(values.data());
alignas(64) scalar_t values[vec_t::size()];
val.store(values);
constexpr int vstride = vec_t::size() / vacc_t::size();
alignas(64) std::array<acc_t, vacc_t::size()> acc;
alignas(64) acc_t acc[vacc_t::size()];
for (const auto i : c10::irange(vacc_t::size())) { acc[i] = ident; } // raw array has no .fill()
for (const auto k : c10::irange(vstride)) {
for (const auto i : c10::irange(vacc_t::size())) {
@ -33,7 +33,7 @@ Vectorized<acc_t> load_reduce_vec(const scalar_t* data, F reduce, acc_t ident) {
}
}
return vacc_t::loadu(acc.data());
return vacc_t::loadu(acc);
}
template <typename scalar_t>
@ -138,7 +138,7 @@ struct OuterSumCastLoadPolicy <vec_t, vacc_t,
using scalar_t = vechold_type<vec_t>;
using acc_t = vechold_type<vacc_t>;
static constexpr int64_t memsize() {
static int64_t memsize() {
return sizeof(scalar_t) * vacc_t::size();
}
@ -161,7 +161,7 @@ template <typename vec_t, typename vacc_t>
struct OuterSumCastLoadPolicy <vec_t, vacc_t, std::enable_if_t<is_reduced_floating_point_v<vechold_type<vec_t>>>> {
using scalar_t = vechold_type<vec_t>;
static constexpr int64_t memsize() {
static int64_t memsize() {
return sizeof(scalar_t) * vacc_t::size();
}
@ -198,7 +198,7 @@ template <typename scalar_t>
struct NanSumLoadPolicy<Vectorized<scalar_t>> {
using vec_t = Vectorized<scalar_t>;
static constexpr int64_t memsize() {
static int64_t memsize() {
return LoadPolicy<vec_t>::memsize();
}
@ -267,7 +267,7 @@ struct InnerNanSumCastLoadPolicy <vec_t, vacc_t, std::enable_if_t<is_reduced_flo
template <typename vec_t, typename vacc_t>
struct OuterNanSumCastLoadPolicy {
static constexpr int64_t memsize() {
static int64_t memsize() {
return OuterSumCastLoadPolicy<vec_t, vacc_t>::memsize();
}
@ -300,13 +300,23 @@ static void store(char * C10_RESTRICT data, int64_t stride, int64_t index,
}
}
template <typename StorePolicy, typename scalar_t>
static void store(char * C10_RESTRICT data, int64_t stride, int64_t index,
const scalar_t *values, size_t numel) {
auto *base_ptr = data + stride * index;
for (const auto k : c10::irange(numel)) {
auto val = values[k];
StorePolicy::store(base_ptr, stride, k, val);
}
}
template <typename StorePolicy, typename scalar_t>
static void store(char * C10_RESTRICT data, int64_t stride, int64_t index,
const Vectorized<scalar_t> &values) {
using vec_t = Vectorized<scalar_t>;
alignas(64) std::array<scalar_t, vec_t::size()> array_values{};
values.store(array_values.data());
store<StorePolicy>(data, stride, index, array_values);
alignas(64) scalar_t array_values[vec_t::size()] = {};
values.store(array_values);
store<StorePolicy, scalar_t>(data, stride, index, array_values, vec_t::size());
}
/** Simultaneously sum over n rows at once
@ -436,9 +446,9 @@ void vectorized_inner_sum(
char * C10_RESTRICT data[2], int64_t outer_stride, int64_t out_stride,
int64_t size0, int64_t size1) {
using vacc_t = Vectorized<acc_t>;
constexpr int64_t vec_stride = VecLoadPolicy::memsize();
constexpr int64_t scalar_stride = ScalarLoadPolicy::memsize();
constexpr int64_t vec_numel = vec_stride / scalar_stride;
const int64_t vec_stride = VecLoadPolicy::memsize();
const int64_t scalar_stride = ScalarLoadPolicy::memsize();
const int64_t vec_numel = vec_stride / scalar_stride;
const int64_t vec_size = size0 / vec_numel;
// Input is contiguous over the first (reduced) dimension
@ -451,9 +461,9 @@ void vectorized_inner_sum(
final_acc += ScalarLoadPolicy::load(row_in, scalar_stride, k);
}
alignas(64) std::array<acc_t, vacc_t::size()> partials{};
vec_acc.store(partials.data());
for (const auto k : c10::irange(partials.size())) {
alignas(64) acc_t partials[vacc_t::size()] = {};
vec_acc.store(partials);
for (const auto k : c10::irange(vacc_t::size())) {
final_acc += partials[k];
}
store<StorePolicy>(data[0], out_stride, j, final_acc);
@ -479,7 +489,7 @@ void vectorized_outer_sum(
int64_t size0, int64_t size1) {
using vacc_t = Vectorized<acc_t>;
constexpr int64_t scalar_stride = ScalarLoadPolicy::memsize();
constexpr int64_t vec_stride = VecLoadPolicy::memsize();
const int64_t vec_stride = VecLoadPolicy::memsize();
constexpr int64_t nrows = 4;
// Input is contiguous over the second (non-reduced) dimension


@ -93,7 +93,7 @@ ColumnwiseMoments(
int64_t C,
int64_t D) {
using Vec = vec::Vectorized<T>;
constexpr int64_t K = Vec::size();
const int64_t K = Vec::size();
const int64_t inner_size = D / K * K;
Vec acc0_vec{0}, acc1_vec{0};
for (const auto m : c10::irange(HxW)) {
@ -668,20 +668,20 @@ void GroupNormInputBackward(
const opmath_t s = opmath_t(1) / static_cast<opmath_t>(D * HxW);
const bool gamma_null = (gamma == nullptr);
at::parallel_for(0, N * G, 1, [=](int64_t start, int64_t end) {
constexpr int64_t K = vec::Vectorized<PT>::size();
const int64_t K = vec::Vectorized<PT>::size();
const int64_t d = D / K * K;
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
std::array<opmath_t, at::vec::Vectorized<opmath_t>::size()> ds_arr;
opmath_t ds_arr[at::vec::Vectorized<opmath_t>::size()];
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
std::array<opmath_t, at::vec::Vectorized<opmath_t>::size()> db_arr;
opmath_t db_arr[at::vec::Vectorized<opmath_t>::size()];
for (const auto i : c10::irange(start, end)) {
const int64_t g = i % G;
const opmath_t* ds_ptr = ds + i * D;
const opmath_t* db_ptr = db + i * D;
const PT* gamma_ptr = gamma_null ? nullptr : (gamma + g * D);
CalcDsDb(ds_ptr, db_ptr, gamma_ptr, d, K, ds_arr.data(), db_arr.data());
opmath_t ds_val = std::accumulate(ds_arr.cbegin(), ds_arr.cend(), opmath_t(0));
opmath_t db_val = std::accumulate(db_arr.cbegin(), db_arr.cend(), opmath_t(0));
CalcDsDb(ds_ptr, db_ptr, gamma_ptr, d, K, ds_arr, db_arr);
opmath_t ds_val = std::accumulate(&ds_arr[0], &ds_arr[at::vec::Vectorized<opmath_t>::size()], opmath_t(0));
opmath_t db_val = std::accumulate(&db_arr[0], &db_arr[at::vec::Vectorized<opmath_t>::size()], opmath_t(0));
for (const auto j : c10::irange(d, D)) {
const opmath_t gamma_v = gamma_null ? opmath_t(1) : opmath_t(gamma[g * D + j]);
ds_val += ds_ptr[j] * gamma_v;
@ -718,7 +718,7 @@ GammaBackward(
PT* dgamma) {
const int64_t G = group;
const int64_t D = C / G;
constexpr int64_t K = at::vec::Vectorized<PT>::size();
const int64_t K = at::vec::Vectorized<PT>::size();
using Vec = at::vec::Vectorized<PT>;
const int64_t inner_size = D / K * K;
for (const auto g : c10::irange(G)) {
@ -818,7 +818,7 @@ template <typename PT, typename opmath_t>
std::enable_if_t<std::is_same_v<PT, opmath_t>, void>
BetaBackward(int64_t N, int64_t C, const opmath_t* db, PT* dbeta) {
using Vec = at::vec::Vectorized<PT>;
constexpr int64_t K = Vec::size();
const int64_t K = Vec::size();
Vec acc_vec{0}, zero{0};
const int64_t inner_size = C / K * K;
int64_t i = 0;
@ -943,7 +943,7 @@ DsDbRowwiseMomentsChannelsLast(
opmath_t* db_ptr,
int64_t C) {
using Vec = vec::Vectorized<T>;
constexpr int64_t K = vec::Vectorized<T>::size();
const int64_t K = vec::Vectorized<T>::size();
const int64_t inner_size = C / K * K;
int64_t d = 0;
for (; d < inner_size; d += K) {
@ -1247,7 +1247,7 @@ inline typename std::
int64_t D) {
using Vec = vec::Vectorized<T>;
const bool gamma_null = (gamma_ptr == nullptr);
constexpr int64_t K = Vec::size();
const int64_t K = Vec::size();
const int64_t inner_size = D / K * K;
int64_t d = 0;
opmath_t ds_gamma{0}, db_gamma{0};


@ -625,7 +625,7 @@ void weight_to_int4pack_kernel(
int K = weight.size(1);
// 64 for avx512 and 32 for avx2/non-vectorized
constexpr int BLOCK_N = vec::Vectorized<float>::size() * 4;
const int BLOCK_N = vec::Vectorized<float>::size() * 4;
const int NB = (N + BLOCK_N - 1) / BLOCK_N;
// parallel on NB blocks
@ -713,7 +713,7 @@ void int4pack_mm_kernel_(
constexpr int BLOCK_M = 4;
// 64 for avx512 and 32 for avx2/non-vectorized
constexpr int BLOCK_N = vec::Vectorized<float>::size() * 4;
const int BLOCK_N = vec::Vectorized<float>::size() * 4;
// 32, 64, 128, 256
const int BLOCK_K = qGroupSize;


@ -109,8 +109,8 @@ template <typename T, int64_t kMaxDepth>
std::pair<opmath_t<T>, opmath_t<T>> RowwiseMomentsImpl(const T* X, int64_t N, int64_t ddof = 0) {
using math_t = opmath_t<T>;
constexpr int64_t kVecSize = vec::Vectorized<T>::size();
constexpr int64_t kAccVecSize = vec::Vectorized<math_t>::size();
const int64_t kVecSize = vec::Vectorized<T>::size();
const int64_t kAccVecSize = vec::Vectorized<math_t>::size();
const int64_t n = N / kVecSize;
const int64_t m = divup(n, kChunkSize);
const int64_t depth = utils::CeilLog2(m);
@ -155,10 +155,10 @@ std::pair<opmath_t<T>, opmath_t<T>> RowwiseMomentsImpl(const T* X, int64_t N, in
m0_stk[i], m1_stk[i], m2_stk[i], m0_stk[0], m1_stk[0], m2_stk[0]);
}
std::array<math_t, kAccVecSize> m1_arr{};
std::array<math_t, kAccVecSize> m2_arr{};
m1_stk[0].store(m1_arr.data());
m2_stk[0].store(m2_arr.data());
math_t m1_arr[kAccVecSize] = {};
math_t m2_arr[kAccVecSize] = {};
m1_stk[0].store(m1_arr);
m2_stk[0].store(m2_arr);
int64_t m0 = 0;
math_t m1 = 0;
@ -182,7 +182,7 @@ std::pair<opmath_t<T>, opmath_t<T>> RowwiseMomentsImpl(const T* X, int64_t N, in
template <typename T>
std::pair<opmath_t<T>, opmath_t<T>> RowwiseMoments(const T* X, int64_t N, int64_t ddof = 0) {
using Vec = vec::Vectorized<T>;
constexpr int64_t kVecSize = Vec::size();
const int64_t kVecSize = Vec::size();
const int64_t n = N / kVecSize;
const int64_t m = divup(n, kChunkSize);
const int64_t depth = utils::CeilLog2(m);


@ -21,6 +21,10 @@
#include <ATen/native/cuda/GroupMM.h>
#include <ATen/ceil_div.h>
#ifdef USE_FBGEMM_GENAI
#include <fbgemm_gpu/torch_ops.h>
#endif
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
@ -1216,7 +1220,7 @@ std::pair<ScalingType, ScalingType> get_joint_scaling(
// - `scale_a`: a tensor with the inverse scale of `mat1`, whose shape/strides/dtype depend on the scaling scheme
// - `scale_b`: a tensor with the inverse scale of `mat2`, whose shape/strides/dtype depend on the scaling scheme
// - `scale_result`: a scalar tensor with the scale of the output, only utilized if the output is a float8 type
// - `use_fast_accum`: if true, enables fast float8 accumulation
// - `use_fast_accum`: if true, enables fast float8 accumulation. Backends may ignore this option if not applicable.
// - `out`: a reference to the output tensor
Tensor&
@ -1525,6 +1529,7 @@ namespace {
const auto out_dtype_ = out_dtype.value_or(kBFloat16);
TORCH_CHECK(out_dtype_ == kBFloat16, "Only bf16 high precision output types are supported for grouped gemm");
#ifndef USE_ROCM
// For TMA transfers, strides of output tensor have to be either
// 1, or aligned to 16 bytes.
const auto last_dim = out_size.size() - 1;
@ -1536,9 +1541,10 @@ namespace {
} else {
out_stride = {out_size[1] * size_padded, size_padded, 1};
}
auto out = at::empty_strided(out_size, out_stride, mat_a.options().dtype(out_dtype_));
return out;
return at::empty_strided(out_size, out_stride, mat_a.options().dtype(out_dtype_));
#else
return at::empty(out_size, mat_a.options().dtype(out_dtype_));
#endif
}
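A worked instance of the 16-byte alignment requirement mentioned above, assuming a bf16 output (2 bytes per element) and the usual round-up-to-a-multiple formula; the exact computation of size_padded sits outside the lines shown here, so this is a sketch rather than the code itself:

// bf16: 16 bytes / 2 bytes per element -> strides padded to multiples of 8 elements.
int64_t align_elems = 16 / sizeof(at::BFloat16);                             // = 8
int64_t size = 45;                                                           // e.g. the padded output dim
int64_t size_padded = (size + align_elems - 1) / align_elems * align_elems;  // = 48
// out_stride = {out_size[1] * size_padded, size_padded, 1}: the innermost stride is 1,
// the outer strides are multiples of 8 elements, i.e. 16 bytes.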
bool check_valid_strides_and_return_transposed(const Tensor& mat) {
@ -1619,12 +1625,9 @@ const std::optional<at::Tensor>& bias,
const std::optional<at::Tensor>& scale_result,
std::optional<c10::ScalarType> out_dtype,
bool use_fast_accum) {
#ifndef USE_ROCM
bool allowed_device = _scaled_mm_allowed_device(/*sm90_only*/true);
TORCH_CHECK(allowed_device, "torch._scaled_grouped_mm is only supported on CUDA devices with compute capability = 9.0");
bool allowed_device = _scaled_mm_allowed_device();
TORCH_CHECK(allowed_device, "torch._scaled_grouped_mm is only supported on CUDA devices with compute capability = 9.0, or ROCm MI300+");
TORCH_CHECK(mat_a.dtype() == at::kFloat8_e4m3fn, "Expected mat_a to be Float8_e4m3 matrix got ", mat_a.scalar_type());
TORCH_CHECK(mat_b.dtype() == at::kFloat8_e4m3fn, "Expected mat_a to be Float8_e4m3 matrix got ", mat_b.scalar_type());
TORCH_CHECK(!check_valid_strides_and_return_transposed(mat_a), "Expected mat1 to not be transposed");
TORCH_CHECK(check_valid_strides_and_return_transposed(mat_b), "Expected mat2 to be transposed");
TORCH_CHECK(mat_a.dim() == 2 || mat_a.dim() == 3, "mat_a has to be 2 or 3d");
@ -1664,6 +1667,10 @@ bool use_fast_accum) {
Tensor out = create_grouped_gemm_output_tensor(mat_a, mat_b, offs, out_dtype);
#ifndef USE_ROCM
TORCH_CHECK(mat_a.dtype() == at::kFloat8_e4m3fn, "Expected mat_a to be Float8_e4m3 matrix got ", mat_a.scalar_type());
TORCH_CHECK(mat_b.dtype() == at::kFloat8_e4m3fn, "Expected mat_b to be Float8_e4m3 matrix got ", mat_b.scalar_type());
at::cuda::detail::f8f8bf16_grouped_mm(
mat_a,
mat_b,
@ -1674,12 +1681,23 @@ bool use_fast_accum) {
use_fast_accum,
out);
return out;
#else
TORCH_CHECK(false, "grouped gemm is not supported on ROCM")
#ifdef USE_FBGEMM_GENAI
TORCH_CHECK(mat_a.dtype() == at::kFloat8_e4m3fnuz, "Expected mat_a to be Float8_e4m3fnuz matrix got ", mat_a.scalar_type());
TORCH_CHECK(mat_b.dtype() == at::kFloat8_e4m3fnuz, "Expected mat_b to be Float8_e4m3fnuz matrix got ", mat_b.scalar_type());
fbgemm_gpu::f8f8bf16_rowwise_grouped_mm(
mat_a,
// FBGEMM expects B matrix shape to be (.., N, K)
mat_b.transpose(-2, -1),
scale_a,
scale_b,
offs,
out);
return out;
#else
TORCH_CHECK(false, "grouped gemm is not supported without USE_FBGEMM_GENAI on ROCM")
#endif
#endif
}


@ -38,17 +38,19 @@ static inline std::string _cudaGetErrorEnum(cufftResult error)
return "CUFFT_INVALID_SIZE";
case CUFFT_UNALIGNED_DATA:
return "CUFFT_UNALIGNED_DATA";
case CUFFT_INCOMPLETE_PARAMETER_LIST:
return "CUFFT_INCOMPLETE_PARAMETER_LIST";
case CUFFT_INVALID_DEVICE:
return "CUFFT_INVALID_DEVICE";
case CUFFT_PARSE_ERROR:
return "CUFFT_PARSE_ERROR";
case CUFFT_NO_WORKSPACE:
return "CUFFT_NO_WORKSPACE";
case CUFFT_NOT_IMPLEMENTED:
return "CUFFT_NOT_IMPLEMENTED";
#if !defined(USE_ROCM)
#if CUDA_VERSION <= 12090
case CUFFT_INCOMPLETE_PARAMETER_LIST:
return "CUFFT_INCOMPLETE_PARAMETER_LIST";
case CUFFT_PARSE_ERROR:
return "CUFFT_PARSE_ERROR";
#endif
#if !defined(USE_ROCM) && CUDA_VERSION <= 12090
case CUFFT_LICENSE_ERROR:
return "CUFFT_LICENSE_ERROR";
#endif


@ -9,6 +9,7 @@
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-but-set-parameter")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmissing-field-initializers")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-but-set-variable")
// Determine if the architecture supports rowwise scaled mm
// Currently failing on windows with:
@ -44,6 +45,7 @@ C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmissing-field-initializers")
#include <ATen/native/cuda/cutlass_common.cuh>
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()


@ -10,6 +10,7 @@
// Two warnings in Cutlass included header files
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-but-set-parameter")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-but-set-variable")
// Determine if the architecture supports rowwise scaled mm
// Currently failing on windows with:
@ -44,6 +45,7 @@ C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-but-set-parameter")
#include <cutlass/gemm/kernel/gemm_universal.hpp>
#include <cutlass/util/packed_stride.hpp>
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()


@ -45,7 +45,7 @@ namespace at::cuda::jit {
// Copied from aten/src/ATen/cuda/llvm_basic.cpp, then modified as above.
// If not compiling for ROCm, return the original get_traits_string().
std::string get_traits_string_but_hiprtc_safe() {
#if defined(USE_ROCM) && ROCM_VERSION < 70000
#if defined(USE_ROCM) && HIP_VERSION_MAJOR < 7
return R"ESCAPE(
namespace std {


@ -342,8 +342,8 @@ Tensor rms_norm_symint(
if (weight_opt.has_value() && weight_opt.value().defined() && weight_opt.value().dtype() != input.dtype()) {
TORCH_WARN_ONCE(
"Mismatch dtype between input and module: input dtype = ", input.dtype(),
", module dtype = ", weight_opt.value().dtype(), ", Can not dispatch to fused implementation"
"Mismatch dtype between input and weight: input dtype = ", input.dtype(),
", weight dtype = ", weight_opt.value().dtype(), ", Cannot dispatch to fused implementation."
);
return std::get<0>(rms_norm_composite(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}


@ -165,6 +165,7 @@ REGISTER_AVX2_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_co
REGISTER_AVX512_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_conjugate_symmetry_cpu_)
REGISTER_ZVECTOR_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_conjugate_symmetry_cpu_)
REGISTER_VSX_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_conjugate_symmetry_cpu_)
REGISTER_SVE_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_conjugate_symmetry_cpu_)
REGISTER_SVE256_DISPATCH(fft_fill_with_conjugate_symmetry_stub, &_fft_fill_with_conjugate_symmetry_cpu_)
// _out variants can be shared between PocketFFT and MKL


@ -22,6 +22,22 @@ struct PoolingParams {
bool return_indices;
};
template <unsigned N = 5, typename idx_type_t = int32_t>
struct AvgPoolingParams {
int32_t dims;
int32_t pooling_dims;
::c10::metal::array<idx_type_t, N> input_sizes;
::c10::metal::array<idx_type_t, N> input_strides;
::c10::metal::array<idx_type_t, N> output_sizes;
::c10::metal::array<idx_type_t, N> output_strides;
::c10::metal::array<idx_type_t, N - 2> kernel_size;
::c10::metal::array<idx_type_t, N - 2> stride;
::c10::metal::array<idx_type_t, N - 2> padding;
bool count_include_pad;
bool has_divisor_override;
int32_t divisor_override;
};
template <unsigned N = 5, typename idx_type_t = int32_t>
struct PoolingBackwardParams {
int32_t dims;


@ -292,12 +292,154 @@ kernel void max_pool_backward(
pooling_dims);
}
#define REGISTER_MAX_POOL_OP(DTYPE) \
template <typename T>
struct AvgPoolIterBounds {
T start;
T end;
T count;
};
template <int32_t dim>
AvgPoolIterBounds<int32_t> get_avg_pool_input_iter_bounds(
constant int32_t* input_sizes,
thread int32_t (&pooling_dim_indices)[3],
constant int32_t* kernel_size,
constant int32_t* stride,
constant int32_t* padding,
bool count_include_pad) {
auto start = stride[dim] * pooling_dim_indices[dim] - padding[dim];
auto end = start + kernel_size[dim];
auto end_corrected = min(start + kernel_size[dim], input_sizes[dim]);
auto start_corrected = (start < 0) ? 0 : start;
auto count = count_include_pad
? (min(end, input_sizes[dim] + padding[dim]) - start)
: (end_corrected - start_corrected);
return {start_corrected, end_corrected, count};
}
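// Worked example of the bounds above, for one pooling dimension (numbers chosen
// for illustration): input_sizes[dim] = 5, kernel_size[dim] = 3, stride[dim] = 2,
// padding[dim] = 1, pooling_dim_indices[dim] = 0 (first output element):
//   start = 2*0 - 1 = -1, end = -1 + 3 = 2
//   start_corrected = 0, end_corrected = min(2, 5) = 2
//   count_include_pad = true  -> count = min(2, 5 + 1) - (-1) = 3  (full window)
//   count_include_pad = false -> count = 2 - 0 = 2                 (in-bounds elements only)
// avg_pool_3d_input_iter below divides by the product of the per-dimension counts
// unless divisor_override is set.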
// Iterates through all the input elements that this kernel needs to
// average over. Specialized for 3 pooling dimensions.
template <typename T>
void avg_pool_3d_input_iter(
constant T* input,
device T* output,
constant int32_t* input_sizes,
constant int32_t* input_strides,
thread int32_t (&pooling_dim_indices)[3],
constant int32_t* kernel_size,
constant int32_t* stride,
constant int32_t* padding,
bool count_include_pad,
bool has_divisor_override,
int32_t divisor_override) {
auto bounds0 = get_avg_pool_input_iter_bounds<0>(
input_sizes,
pooling_dim_indices,
kernel_size,
stride,
padding,
count_include_pad);
auto bounds1 = get_avg_pool_input_iter_bounds<1>(
input_sizes,
pooling_dim_indices,
kernel_size,
stride,
padding,
count_include_pad);
auto bounds2 = get_avg_pool_input_iter_bounds<2>(
input_sizes,
pooling_dim_indices,
kernel_size,
stride,
padding,
count_include_pad);
T value_sum = 0;
auto divisor = has_divisor_override
? divisor_override
: (bounds0.count) * (bounds1.count) * (bounds2.count);
auto size12 = input_sizes[1] * input_sizes[2];
for (auto i0 = bounds0.start; i0 < bounds0.end; i0++) {
auto offset0 = input_strides[0] * i0;
for (auto i1 = bounds1.start; i1 < bounds1.end; i1++) {
auto offset1 = input_strides[1] * i1;
for (auto i2 = bounds2.start; i2 < bounds2.end; i2++) {
auto offset2 = input_strides[2] * i2;
auto input_value = input[offset0 + offset1 + offset2];
value_sum += input_value;
}
}
}
*output = value_sum / static_cast<T>(divisor);
}
// Kernel computes one element of the output per kernel call.
template <typename T>
kernel void avg_pool(
constant T* input [[buffer(0)]],
device T* output [[buffer(1)]],
constant AvgPoolingParams<5>& params [[buffer(2)]],
uint tid [[thread_position_in_grid]]) {
auto pooling_dims = params.pooling_dims;
auto dims = params.dims;
auto input_sizes = params.input_sizes.data();
auto input_strides = params.input_strides.data();
auto output_sizes = params.output_sizes.data();
auto output_strides = params.output_strides.data();
auto kernel_size = params.kernel_size.data();
auto stride = params.stride.data();
auto padding = params.padding.data();
auto leading_dims = dims - pooling_dims;
// This buffer keeps track of the pooling dimension indices of this thread's
// element of the output. We need to fill it with the proper values below.
int32_t pooling_dim_indices[3];
PoolOffsets offsets = find_pool_offsets(
output_sizes,
output_strides,
/*indices_strides=*/nullptr,
input_strides,
pooling_dim_indices,
dims,
leading_dims,
/*return_indices=*/false,
tid);
output += offsets.output;
input += offsets.input_leading;
input_sizes += leading_dims;
input_strides += leading_dims;
avg_pool_3d_input_iter<T>(
input,
output,
input_sizes,
input_strides,
pooling_dim_indices,
kernel_size,
stride,
padding,
params.count_include_pad,
params.has_divisor_override,
params.divisor_override);
}
#define REGISTER_POOL_OP(DTYPE) \
template [[host_name("max_pool_" #DTYPE)]] kernel void max_pool<DTYPE>( \
constant DTYPE * input [[buffer(0)]], \
device DTYPE * output [[buffer(1)]], \
device int64_t* indices [[buffer(2)]], \
constant PoolingParams<5>& params [[buffer(3)]], \
uint tid [[thread_position_in_grid]]); \
\
template [[host_name("avg_pool_" #DTYPE)]] kernel void avg_pool<DTYPE>( \
constant DTYPE * input [[buffer(0)]], \
device DTYPE * output [[buffer(1)]], \
constant AvgPoolingParams<5> & params [[buffer(2)]], \
uint tid [[thread_position_in_grid]]);
#define REGISTER_MAX_POOL_BACKWARD_OP(DTYPE) \
@ -309,19 +451,19 @@ kernel void max_pool_backward(
constant PoolingBackwardParams<5>& params [[buffer(3)]], \
uint tid [[thread_position_in_grid]]);
REGISTER_MAX_POOL_OP(float);
REGISTER_MAX_POOL_OP(half);
REGISTER_MAX_POOL_OP(int);
REGISTER_MAX_POOL_OP(long);
REGISTER_MAX_POOL_OP(short);
REGISTER_MAX_POOL_OP(char);
REGISTER_MAX_POOL_OP(uchar);
REGISTER_MAX_POOL_OP(bool);
REGISTER_POOL_OP(float);
REGISTER_POOL_OP(half);
REGISTER_POOL_OP(int);
REGISTER_POOL_OP(long);
REGISTER_POOL_OP(short);
REGISTER_POOL_OP(char);
REGISTER_POOL_OP(uchar);
REGISTER_POOL_OP(bool);
REGISTER_MAX_POOL_BACKWARD_OP(float);
REGISTER_MAX_POOL_BACKWARD_OP(half);
#if __METAL_VERSION__ >= 310
REGISTER_MAX_POOL_OP(bfloat);
REGISTER_POOL_OP(bfloat);
REGISTER_MAX_POOL_BACKWARD_OP(bfloat);
#endif


@ -418,8 +418,9 @@ Tensor& exponential_mps_(Tensor& self, double lambda, std::optional<Generator> g
MPSGraphTensor* logTensor = [mpsGraph logarithmWithTensor:subtractTensor name:nil];
return [mpsGraph divisionWithPrimaryTensor:logTensor secondaryTensor:minusLambdaTensor name:nil];
};
auto eps = std::numeric_limits<float>::epsilon();
return mps::random_mps_impl<double>(self,
0.0,
eps,
1.0,
std::nullopt,
std::nullopt,


@ -14,6 +14,7 @@
#include <ATen/ops/avg_pool2d_backward.h>
#include <ATen/ops/avg_pool2d_backward_native.h>
#include <ATen/ops/avg_pool2d_native.h>
#include <ATen/ops/avg_pool3d_native.h>
#include <ATen/ops/max_pool2d_backward_native.h>
#include <ATen/ops/max_pool2d_native.h>
#include <ATen/ops/max_pool2d_with_indices_backward_native.h>
@ -265,13 +266,13 @@ using PoolSizes = std::tuple<int32_t,
std::vector<int32_t>,
std::vector<int32_t>,
std::vector<int32_t>,
std::vector<int32_t>>;
std::optional<std::vector<int32_t>>>;
static PoolSizes process_pool_sizes(const Tensor& input,
IntArrayRef kernel_size,
IntArrayRef stride,
IntArrayRef padding,
IntArrayRef dilation,
std::optional<IntArrayRef> dilation_opt,
bool ceil_mode,
const int32_t pooling_dims,
const std::string& op_name) {
@ -305,18 +306,22 @@ static PoolSizes process_pool_sizes(const Tensor& input,
pooling_dims,
" ints");
TORCH_CHECK(dilation.size() == 1 || dilation.size() == pooling_dims,
op_name,
": dilation must be either a single int, or a tuple of ",
pooling_dims,
" ints");
if (dilation_opt.has_value()) {
auto dilation = dilation_opt.value();
TORCH_CHECK(dilation.size() == 1 || dilation.size() == pooling_dims,
op_name,
": dilation must be either a single int, or a tuple of ",
pooling_dims,
" ints");
}
int32_t leading_dims = input.dim() - pooling_dims;
const auto kernel_size_expanded = copy_and_maybe_expand(kernel_size, pooling_dims);
const auto stride_expanded = copy_and_maybe_expand(stride.empty() ? kernel_size : stride, pooling_dims);
const auto padding_expanded = copy_and_maybe_expand(padding, pooling_dims);
const auto dilation_expanded = copy_and_maybe_expand(dilation, pooling_dims);
const auto dilation_expanded = dilation_opt.has_value() ? copy_and_maybe_expand(dilation_opt.value(), pooling_dims)
: std::vector<int32_t>(pooling_dims, 1);
for (const auto dim : c10::irange(pooling_dims)) {
TORCH_CHECK(padding_expanded[dim] >= 0, op_name, ": pad must be non-negative");
@ -362,7 +367,12 @@ static PoolSizes process_pool_sizes(const Tensor& input,
output_size[leading_dims + dim] = output_pooling_size[dim];
}
return PoolSizes(dims, output_size, kernel_size_expanded, stride_expanded, padding_expanded, dilation_expanded);
return PoolSizes(dims,
output_size,
kernel_size_expanded,
stride_expanded,
padding_expanded,
dilation_opt.has_value() ? std::make_optional(dilation_expanded) : std::nullopt);
}
static void max_pool_with_indices_out_mps_template(const Tensor& output,
@ -375,8 +385,10 @@ static void max_pool_with_indices_out_mps_template(const Tensor& output,
bool ceil_mode,
const int32_t pooling_dims,
const std::string& op_name) {
auto [dims, output_size, kernel_size, stride, padding, dilation] =
auto [dims, output_size, kernel_size, stride, padding, dilation_opt] =
process_pool_sizes(input, _kernel_size, _stride, _padding, _dilation, ceil_mode, pooling_dims, op_name);
TORCH_INTERNAL_ASSERT(dilation_opt.has_value());
auto dilation = dilation_opt.value();
const Tensor& indices = *(at::borrow_from_optional_tensor(indices_opt));
const bool return_indices = indices.defined();
@ -442,7 +454,7 @@ static void max_pool_with_indices_backward_out_mps_template(Tensor& grad_input,
bool ceil_mode,
const int32_t pooling_dims,
const std::string& op_name) {
auto [dims, output_size, kernel_size, stride, padding, dilation] =
auto [dims, output_size, kernel_size, stride, padding, dilation_opt] =
process_pool_sizes(input, _kernel_size, _stride, _padding, _dilation, ceil_mode, pooling_dims, op_name);
const auto memory_format = input.suggest_memory_format();
@ -601,6 +613,62 @@ static void avg_pool2d_template(const Tensor& input,
op_name);
}
static void avg_pool_out_mps_template(const Tensor& output,
const Tensor& input,
IntArrayRef _kernel_size,
IntArrayRef _stride,
IntArrayRef _padding,
bool ceil_mode,
bool count_include_pad,
std::optional<int64_t> divisor_override,
const int32_t pooling_dims,
const std::string& op_name) {
auto [dims, output_size, kernel_size, stride, padding, _] =
process_pool_sizes(input, _kernel_size, _stride, _padding, std::nullopt, ceil_mode, pooling_dims, op_name);
const auto memory_format = input.suggest_memory_format();
output.resize_(output_size, memory_format);
id<MTLDevice> device = MPSDevice::getInstance()->device();
MPSStream* mpsStream = getCurrentMPSStream();
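// One thread per output element; each thread reduces its own pooling window.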
const auto numThreads = output.numel();
AvgPoolingParams<5> params;
params.dims = dims;
params.pooling_dims = pooling_dims;
params.count_include_pad = count_include_pad;
params.has_divisor_override = divisor_override.has_value();
if (divisor_override.has_value()) {
params.divisor_override = safe_downcast<int32_t, int64_t>(divisor_override.value());
}
for (const auto dim : c10::irange(dims)) {
params.input_sizes[dim] = safe_downcast<int32_t, int64_t>(input.size(dim));
params.input_strides[dim] = safe_downcast<int32_t, int64_t>(input.stride(dim));
params.output_sizes[dim] = safe_downcast<int32_t, int64_t>(output.size(dim));
params.output_strides[dim] = safe_downcast<int32_t, int64_t>(output.stride(dim));
}
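// kernel_size/stride/padding were expanded to exactly pooling_dims entries earlier.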
memcpy(params.kernel_size.data(), kernel_size.data(), pooling_dims * sizeof(int32_t));
memcpy(params.stride.data(), stride.data(), pooling_dims * sizeof(int32_t));
memcpy(params.padding.data(), padding.data(), pooling_dims * sizeof(int32_t));
dispatch_sync_with_rethrow(mpsStream->queue(), ^() {
@autoreleasepool {
id<MTLComputeCommandEncoder> computeEncoder = mpsStream->commandEncoder();
auto PSO = lib.getPipelineStateForFunc("avg_pool_" + scalarToMetalTypeString(input));
getMPSProfiler().beginProfileKernel(PSO, op_name, {input});
[computeEncoder setComputePipelineState:PSO];
mtl_setArgs(computeEncoder, input, output, params);
mtl_dispatch1DJob(computeEncoder, PSO, numThreads);
getMPSProfiler().endProfileKernel(PSO);
}
});
}
} // namespace mps
Tensor mps_max_pool2d(const Tensor& input,
@ -876,4 +944,25 @@ TORCH_IMPL_FUNC(avg_pool2d_backward_out_mps)
"avg_pool2d_backward");
}
TORCH_IMPL_FUNC(avg_pool3d_out_mps)
(const Tensor& input,
IntArrayRef kernel_size,
IntArrayRef stride,
IntArrayRef padding,
bool ceil_mode,
bool count_include_pad,
std::optional<int64_t> divisor_override,
const Tensor& output) {
mps::avg_pool_out_mps_template(output,
input,
kernel_size,
stride,
padding,
ceil_mode,
count_include_pad,
divisor_override,
/*pooling_dims=*/3,
"avg_pool3d");
}
} // namespace at::native

View File

@ -12334,6 +12334,7 @@
dispatch:
CPU: avg_pool3d_out_cpu
CUDA: avg_pool3d_out_cuda
MPS: avg_pool3d_out_mps
MkldnnCPU: mkldnn_avg_pool3d_out
- func: avg_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None) -> Tensor
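With the Metal kernel, the host-side template, and the MPS dispatch entry above in place, avg_pool3d reaches the new path through the ordinary ATen dispatcher. A minimal C++ sketch (illustrative only, not part of the diff; it assumes a Metal-capable build with the MPS backend available):

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // 5-D NCDHW input on the MPS device; dtype defaults to float.
  auto input = at::rand({1, 3, 8, 8, 8}, at::device(at::kMPS));
  // kernel_size 2; stride defaults to kernel_size, padding to 0.
  auto out = at::avg_pool3d(input, /*kernel_size=*/{2, 2, 2});
  std::cout << out.sizes() << '\n'; // [1, 3, 4, 4, 4]
  return 0;
}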

View File

@ -142,7 +142,7 @@ Tensor qcat_nhwc_kernel(
continue;
}
constexpr auto VLEN = Vec::size();
const auto VLEN = Vec::size();
int64_t c = 0;
// Vectorized loop
@ -170,16 +170,16 @@ Tensor qcat_nhwc_kernel(
}
// Vectorized loop for channel between 8 and 32 (avx2)
constexpr auto kVLEN = Vectorized<float>::size();
const auto kVLEN = Vectorized<float>::size();
int64_t elem_size = curr_C - c;
if ((VLEN == 4 * kVLEN) && elem_size >= kVLEN) {
auto curr_scale_vec = Vectorized<float>(curr_scale);
auto curr_zero_pt_vec = Vectorized<float>((float)curr_zero_pt);
auto scale_neg_zp_premul = curr_scale_vec * curr_zero_pt_vec.neg();
int64_t vec_num = elem_size / kVLEN;
std::array<typename scalar_t::underlying, VLEN> buf_in{};
memcpy(buf_in.data(), iptr + c, vec_num * kVLEN);
auto inp_vec = Vec::loadu(buf_in.data());
typename scalar_t::underlying buf_in[VLEN] = {};
memcpy(buf_in, iptr + c, vec_num * kVLEN);
auto inp_vec = Vec::loadu(buf_in);
auto float_values = inp_vec.dequantize(
curr_scale_vec, curr_zero_pt_vec, scale_neg_zp_premul);
Vec::float_vec_return_type retvals;
@ -1487,7 +1487,7 @@ void _qmaxpool_2d_nhwc_kernel(
int64_t c = 0;
// Interleaved vector loop 4x
constexpr auto vec_width = Vectorized<scalar_t>::size();
const auto vec_width = Vectorized<scalar_t>::size();
for (; c + 4 * vec_width <= iC; c += 4 * vec_width) {
Vectorized<scalar_t> acc{
scalar_t(std::numeric_limits<scalar_t_underlying>::lowest())};
@ -1623,7 +1623,7 @@ void qmaxpool_3d_nthwc_kernel(
w_start += dW;
int64_t c = 0;
constexpr auto vec_width = Vectorized<scalar_t>::size();
const auto vec_width = Vectorized<scalar_t>::size();
// Vector loop
for (; c + vec_width <= iC; c += vec_width) {
Vectorized<scalar_t> acc{
@ -2449,7 +2449,7 @@ void q_batch_norm_kernel(
reinterpret_cast<scalar_t::underlying*>(input.data_ptr());
scalar_t::underlying* Y = reinterpret_cast<scalar_t::underlying*>(output.data_ptr());
constexpr int kVLen = Vectorized<float>::size();
const int kVLen = Vectorized<float>::size();
const int64_t outer_size = N * HxW;
using Vec = Vectorized<scalar_t>;
// Hoisted variables
@ -2975,7 +2975,7 @@ void quantized_normalize_kernel(
float y_scale = Y->q_scale();
float y_inv_scale = 1.0f / y_scale;
constexpr int kFloatVLen = fVec::size();
const int kFloatVLen = fVec::size();
int64_t kIntVLen = kFloatVLen * qVec::float_num_vecs();
int64_t kNumIntVecInLayer = N / kIntVLen;
int64_t kNonVecRemInLayer = N % kIntVLen;
@ -3263,7 +3263,7 @@ void quantized_groupnorm_nhwc_kernel(
float y_scale = Y->q_scale();
float y_inv_scale = 1.0f / y_scale;
constexpr int kFloatVLen = fVec::size();
const int kFloatVLen = fVec::size();
int64_t kIntVLen = kFloatVLen * qVec::float_num_vecs();
int64_t channels_per_group = C / G;
int64_t HxW = N / channels_per_group;
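
The constexpr-to-const changes in this file (and the matching ones in the vectorization tests further down) follow one pattern: with a vector-length-agnostic SVE Vectorized class, Vectorized<T>::size() is only known at run time, so it can no longer size constexpr arrays or feed `if constexpr`. A small sketch of the resulting idiom (illustrative only; on fixed-width builds it simply evaluates to the usual constant):

#include <ATen/cpu/vec/vec.h>
#include <vector>

int main() {
  using vec = at::vec::Vectorized<float>;
  const auto n = vec::size();       // runtime value on length-agnostic SVE builds
  std::vector<float> buf(n, 1.0f);  // runtime-sized buffer instead of a constexpr C array
  auto v = vec::loadu(buf.data());
  (v + v).store(buf.data());        // buf now holds 2.0f in every lane
  return 0;
}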

View File

@ -955,7 +955,10 @@ static at::Tensor fp8_qlinear_onednn_ref(
std::vector<int64_t> w_scales_new_shape(weight.dim(), 1);
w_scales_new_shape[0] = -1;
auto dqw = weight.to(at::kFloat) * weight_scales.reshape(w_scales_new_shape);
auto y_f32 = at::linear(dqx, dqw, bias);
auto y_f32 = at::linear(dqx, dqw);
if (bias.has_value()) {
y_f32 += bias.value().to(at::kFloat);
}
if (binary_post_op == "none") {
if (unary_post_op == "relu") {
at::relu_(y_f32);

View File

@ -27,6 +27,7 @@ REGISTER_AVX512_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
REGISTER_AVX2_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
REGISTER_VSX_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
REGISTER_SVE_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
REGISTER_SVE256_DISPATCH(flatten_indices_stub, &flatten_indices_cpu_kernel)
} // namespace at::native

View File

@ -161,6 +161,7 @@ REGISTER_AVX512_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_
REGISTER_AVX2_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_kernel)
REGISTER_VSX_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_kernel)
REGISTER_SVE_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_kernel)
REGISTER_SVE256_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cpu_kernel)
REGISTER_ARCH_DISPATCH(sparse_mask_intersection_out_stub, DEFAULT, &sparse_mask_intersection_out_cpu_kernel)
@ -168,6 +169,7 @@ REGISTER_AVX512_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_interse
REGISTER_AVX2_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_cpu_kernel)
REGISTER_VSX_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_cpu_kernel)
REGISTER_SVE_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_cpu_kernel)
REGISTER_SVE256_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_cpu_kernel)
REGISTER_ARCH_DISPATCH(sparse_mask_projection_out_stub, DEFAULT, &sparse_mask_projection_out_cpu_kernel)
@ -175,5 +177,6 @@ REGISTER_AVX512_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projectio
REGISTER_AVX2_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_cpu_kernel)
REGISTER_VSX_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_cpu_kernel)
REGISTER_ZVECTOR_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_cpu_kernel)
REGISTER_SVE_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_cpu_kernel)
REGISTER_SVE256_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_cpu_kernel)
}

View File

@ -448,6 +448,7 @@ REGISTER_AVX2_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_AVX512_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_VSX_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_ZVECTOR_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_SVE_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_SVE256_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_cpp)
REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta)

View File

@ -1,8 +1,7 @@
#include <gtest/gtest.h>
#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <c10/util/irange.h>
#include <test/cpp/tensorexpr/test_base.h>
#include <thread>
@ -10,7 +9,7 @@
// numbers of threads set and also whether the scheduler
// will throw an exception when multiple threads call
// their first parallel construct.
static void test(int given_num_threads) {
void test(int given_num_threads) {
auto t = at::ones({1000 * 1000}, at::CPU(at::kFloat));
ASSERT_TRUE(given_num_threads >= 0);
ASSERT_EQ(at::get_num_threads(), given_num_threads);
@ -20,7 +19,7 @@ static void test(int given_num_threads) {
}
}
TEST(ThreadInitTest, ThreadInit) {
int main() {
at::init_num_threads();
at::set_num_threads(4);
@ -33,11 +32,13 @@ TEST(ThreadInitTest, ThreadInit) {
#if !AT_PARALLEL_NATIVE
at::set_num_threads(5);
ASSERT_EQ(at::get_num_threads(), 5);
ASSERT_TRUE(at::get_num_threads() == 5);
#endif
// test inter-op settings
at::set_num_interop_threads(5);
ASSERT_EQ(at::get_num_interop_threads(), 5);
ASSERT_ANY_THROW(at::set_num_interop_threads(6));
return 0;
}

View File

@ -134,7 +134,7 @@ namespace {
TYPED_TEST(Memory, UnAlignedLoadStore) {
using vec = TypeParam;
using VT = ValueType<TypeParam>;
constexpr size_t b_size = vec::size() * sizeof(VT);
const size_t b_size = vec::size() * sizeof(VT);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN unsigned char ref_storage[128 * b_size];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
@ -164,7 +164,7 @@ namespace {
for (size_t offset = 0; offset < b_size; offset += 1) {
unsigned char* p1 = ref_storage + offset;
unsigned char* p2 = storage + offset;
for (; p1 + b_size <= std::end(ref_storage); p1 += b_size, p2 += b_size) {
for (; p1 + b_size <= &ref_storage[128 * b_size]; p1 += b_size, p2 += b_size) {
vec v = vec::loadu(p1);
v.store(p2);
}
@ -381,7 +381,7 @@ namespace {
TYPED_TEST(Hyperbolic, Tanh) {
using vec = TypeParam;
// NOTE: Because SVE uses ACL logic, the precision changes, hence the adjusted tolerance.
#if defined(CPU_CAPABILITY_SVE)
#if defined(CPU_CAPABILITY_SVE) || defined(CPU_CAPABILITY_SVE256)
using UVT = UvalueType<vec>;
UVT tolerance = getDefaultTolerance<UVT>();
test_unary<vec>(
@ -586,7 +586,7 @@ namespace {
}
}
}
#if defined(CPU_CAPABILITY_SVE) && defined(__ARM_FEATURE_BF16)
#if (defined(CPU_CAPABILITY_SVE256)) && defined(__ARM_FEATURE_BF16)
TEST(NanBfloat16, IsNan) {
for (unsigned int ii = 0; ii < 0xFFFF; ++ii) {
c10::BFloat16 val(ii, c10::BFloat16::from_bits());
@ -598,6 +598,19 @@ namespace {
}
}
}
#endif
#if (defined(CPU_CAPABILITY_SVE)) && defined(__ARM_FEATURE_BF16)
TEST(NanBfloat16, IsNan) {
for (unsigned int ii = 0; ii < 0xFFFF; ++ii) {
c10::BFloat16 val(ii, c10::BFloat16::from_bits());
bool expected = std::isnan(val);
CACHE_ALIGN c10::BFloat16 actual_vals[at::vec::SVE::Vectorized<c10::BFloat16>::size()];
at::vec::SVE::Vectorized<c10::BFloat16>(val).isnan().store(actual_vals);
for (int jj = 0; jj < at::vec::SVE::Vectorized<c10::BFloat16>::size(); ++jj) {
EXPECT_EQ(expected, c10::bit_cast<uint16_t>(actual_vals[jj]) != 0) << "bf16 isnan failure for bit pattern " << std::hex << ii << std::dec;
}
}
}
#endif
TYPED_TEST(LGamma, LGamma) {
using vec = TypeParam;
@ -653,7 +666,7 @@ namespace {
TYPED_TEST(Interleave, Interleave) {
using vec = TypeParam;
using VT = ValueType<TypeParam>;
constexpr auto N = vec::size() * 2LL;
const auto N = vec::size() * 2LL;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN VT vals[N];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
@ -663,7 +676,7 @@ namespace {
for (VT& v : vals) {
v = generator.get();
}
copy_interleave(vals, interleaved);
copy_interleave<VT>(vals, interleaved, N);
auto a = vec::loadu(vals);
auto b = vec::loadu(vals + vec::size());
auto cc = interleave2(a, b);
@ -673,7 +686,7 @@ namespace {
TYPED_TEST(Interleave, DeInterleave) {
using vec = TypeParam;
using VT = ValueType<TypeParam>;
constexpr auto N = vec::size() * 2LL;
const auto N = vec::size() * 2LL;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN VT vals[N];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
@ -683,7 +696,7 @@ namespace {
for (VT& v : vals) {
v = generator.get();
}
copy_interleave(vals, interleaved);
copy_interleave<VT>(vals, interleaved, N);
// test interleaved with vals this time
auto a = vec::loadu(interleaved);
auto b = vec::loadu(interleaved + vec::size());
@ -1017,78 +1030,70 @@ namespace {
RESOLVE_OVERLOAD(filter_fmadd));
}
#endif
template<typename vec, typename VT, int64_t mask>
typename std::enable_if_t<(mask < 0 || mask> 255), void>
template<typename vec, typename VT>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
test_blend(VT expected_val[vec::size()], VT a[vec::size()], VT b[vec::size()])
{
void test_blend(VT * expected_val, VT * a, VT * b, int64_t mask) {
if (mask >= 0 && mask <= 255) {
// generate expected_val
int64_t m = mask;
for (int64_t i = 0; i < vec::size(); i++) {
expected_val[i] = (m & 0x01) ? b[i] : a[i];
m = m >> 1;
}
// test with blend
auto vec_a = vec::loadu(a);
auto vec_b = vec::loadu(b);
auto expected = vec::loadu(expected_val);
auto actual = vec::blend(vec_a, vec_b, mask);
auto mask_str = std::string("\nblend mask: ") + std::to_string(mask);
if (AssertVectorized<vec>(std::string(NAME_INFO(test_blend)) + mask_str, expected, actual).check()) return;
test_blend<vec, VT>(expected_val, a, b, mask - 1);
}
}
template<typename vec, typename VT, int64_t mask>
typename std::enable_if_t<(mask >= 0 && mask <= 255), void>
template<typename vec, typename VT>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
test_blend(VT expected_val[vec::size()], VT a[vec::size()], VT b[vec::size()]) {
// generate expected_val
int64_t m = mask;
for (int64_t i = 0; i < vec::size(); i++) {
expected_val[i] = (m & 0x01) ? b[i] : a[i];
m = m >> 1;
}
// test with blend
auto vec_a = vec::loadu(a);
auto vec_b = vec::loadu(b);
auto expected = vec::loadu(expected_val);
auto actual = vec::template blend<mask>(vec_a, vec_b);
auto mask_str = std::string("\nblend mask: ") + std::to_string(mask);
if (AssertVectorized<vec>(std::string(NAME_INFO(test_blend)) + mask_str, expected, actual).check()) return;
test_blend<vec, VT, mask - 1>(expected_val, a, b);
bool test_blendv(VT * expected_val, VT * a, VT * b, VT * mask, int64_t idx, size_t N) {
if ((size_t) idx == N) {
using bit_rep = BitType<VT>;
// generate expected_val
for (int64_t i = 0; i < vec::size(); i++) {
bit_rep hex_mask = 0;
hex_mask=c10::bit_cast<bit_rep>(mask[i]);
expected_val[i] = (hex_mask & 0x01) ? b[i] : a[i];
}
// test with blendv
auto vec_a = vec::loadu(a);
auto vec_b = vec::loadu(b);
auto vec_m = vec::loadu(mask);
auto expected = vec::loadu(expected_val);
auto actual = vec::blendv(vec_a, vec_b, vec_m);
auto mask_str = std::string("\nblendv mask: ");
for (int64_t i = 0; i < vec::size(); i++) {
mask_str += std::to_string(mask[i]) + " ";
}
if (AssertVectorized<vec>(std::string(NAME_INFO(test_blendv)) + mask_str, expected, actual).check()) {
return false;
}
return true;
} else {
// shuffle mask and do blendv test
VT m = mask[idx];
if (!test_blendv<vec, VT>(expected_val, a, b, mask, idx+1, N)) return false;
if (m != (VT)0) {
mask[idx] = (VT)0;
}
else {
uint64_t hex_mask = 0xFFFFFFFFFFFFFFFF;
std::memcpy(&mask[idx], &hex_mask, sizeof(VT));
}
if (!test_blendv<vec, VT>(expected_val, a, b, mask, idx+1, N)) return false;
mask[idx] = m;
return true;
}
}
template<typename vec, typename VT, int64_t idx, int64_t N>
std::enable_if_t<(!is_complex<VT>::value && idx == N), bool>
template<typename T>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
test_blendv(VT expected_val[vec::size()], VT a[vec::size()], VT b[vec::size()], VT mask[vec::size()]) {
using bit_rep = BitType<VT>;
// generate expected_val
for (int64_t i = 0; i < vec::size(); i++) {
bit_rep hex_mask = 0;
hex_mask=c10::bit_cast<bit_rep>(mask[i]);
expected_val[i] = (hex_mask & 0x01) ? b[i] : a[i];
}
// test with blendv
auto vec_a = vec::loadu(a);
auto vec_b = vec::loadu(b);
auto vec_m = vec::loadu(mask);
auto expected = vec::loadu(expected_val);
auto actual = vec::blendv(vec_a, vec_b, vec_m);
auto mask_str = std::string("\nblendv mask: ");
for (int64_t i = 0; i < vec::size(); i++) {
mask_str += std::to_string(mask[i]) + " ";
}
if (AssertVectorized<vec>(std::string(NAME_INFO(test_blendv)) + mask_str, expected, actual).check()) {
return false;
}
return true;
}
template<typename vec, typename VT, int64_t idx, int64_t N>
std::enable_if_t<(!is_complex<VT>::value && idx != N), bool>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
test_blendv(VT expected_val[vec::size()], VT a[vec::size()], VT b[vec::size()], VT mask[vec::size()]) {
// shuffle mask and do blendv test
VT m = mask[idx];
if (!test_blendv<vec, VT, idx+1, N>(expected_val, a, b, mask)) return false;
if (m != (VT)0) {
mask[idx] = (VT)0;
}
else {
uint64_t hex_mask = 0xFFFFFFFFFFFFFFFF;
std::memcpy(&mask[idx], &hex_mask, sizeof(VT));
}
if (!test_blendv<vec, VT, idx+1, N>(expected_val, a, b, mask)) return false;
mask[idx] = m;
return true;
}
template<typename T, int N>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
void blend_init(T(&a)[N], T(&b)[N]) {
void blend_init(T * a, T * b, int N) {
a[0] = (T)1.0;
b[0] = a[0] + (T)N;
for (const auto i : c10::irange(1, N)) {
@ -1107,8 +1112,8 @@ namespace {
CACHE_ALIGN VT mask[vec::size()] = {0};
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN VT expected_val[vec::size()];
blend_init(a, b);
test_blendv<vec, VT, 0, vec::size()>(expected_val, a, b, mask);
blend_init(a, b, vec::size());
test_blendv<vec, VT>(expected_val, a, b, mask, 0, vec::size());
}
TYPED_TEST(BitwiseFloatsAdditional2, Blend) {
using vec = TypeParam;
@ -1119,9 +1124,9 @@ namespace {
CACHE_ALIGN VT b[vec::size()];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN VT expected_val[vec::size()];
blend_init(a, b);
constexpr int64_t power_sets = 1LL << (vec::size());
test_blend<vec, VT, power_sets - 1>(expected_val, a, b);
blend_init(a, b, vec::size());
const int64_t power_sets = 1LL << (vec::size());
test_blend<vec, VT>(expected_val, a, b, power_sets - 1);
}
template<typename vec, typename VT>
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
@ -1152,7 +1157,7 @@ namespace {
CACHE_ALIGN VT b[vec::size()];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN VT expected_val[vec::size()];
blend_init(a, b);
blend_init(a, b, vec::size());
test_set<vec, VT>(expected_val, a, b, vec::size());
}
template<typename T>
@ -1218,7 +1223,7 @@ namespace {
// NOLINTNEXTLINE(bugprone-signed-char-misuse)
constexpr int min_val = std::numeric_limits<underlying>::min();
constexpr int max_val = std::numeric_limits<underlying>::max();
constexpr int el_count = vfloat::size();
const int el_count = vfloat::size();
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
CACHE_ALIGN float unit_float_vec[el_count];
// NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays)
@ -1566,7 +1571,7 @@ namespace {
using vec = TypeParam;
using VT = ValueType<TypeParam>;
constexpr auto R = 2LL; // residual
constexpr auto N = vec::size() + R;
const auto N = vec::size() + R;
CACHE_ALIGN VT x1[N];
CACHE_ALIGN VT x2[N];
CACHE_ALIGN VT x3[N];
@ -2130,7 +2135,7 @@ namespace {
ASSERT_TRUE(vec_pinf.has_inf_nan()) << "Test failed for positive Infinity\n";
ASSERT_TRUE(vec_ninf.has_inf_nan()) << "Test failed for negative Infinity\n";
}
#if !defined(CPU_CAPABILITY_SVE)
#if !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
template <typename vec, typename dst_t>
void test_convert_to(const char* dst_t_name) {
using src_t = ValueType<vec>;
@ -2213,13 +2218,13 @@ namespace {
TYPED_TEST(VecMaskTests, MaskedLoad) {
using vec = TypeParam;
using src_t = ValueType<TypeParam>;
constexpr auto size = vec::size();
const auto size = vec::size();
#define TEST_MASK_LOAD(dst_t, mask_t, mask_n) \
do { \
constexpr int dst_size = at::vec::Vectorized<dst_t>::size(); \
constexpr int dst_n = mask_n * size / dst_size; \
if constexpr(dst_n * dst_size >= mask_n * size) { \
int dst_size = at::vec::Vectorized<dst_t>::size(); \
int dst_n = mask_n * size / dst_size; \
if (dst_n * dst_size >= mask_n * size) { \
CACHE_ALIGN dst_t x[mask_n * size]; \
CACHE_ALIGN dst_t y[mask_n * size]; \
CACHE_ALIGN dst_t ref[mask_n * size]; \
@ -2230,9 +2235,47 @@ namespace {
x[i] = generator.get(); \
} \
auto vec_mask = generate_vec_mask<mask_t, mask_n>(seed); \
constexpr int rnd_n = (mask_n * size + dst_size - 1) / dst_size;\
auto x_vec = vec_mask.template loadu<dst_t, rnd_n>(x); \
x_vec.store(y); \
int rnd_n = (mask_n * size + dst_size - 1) / dst_size;\
switch (rnd_n) { \
case 1: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 1>(x); \
x_vec.store(y); \
break; \
} \
case 2: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 2>(x); \
x_vec.store(y); \
break; \
} \
case 3: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 3>(x); \
x_vec.store(y); \
break; \
} \
case 4: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 4>(x); \
x_vec.store(y); \
break; \
} \
case 8: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 8>(x); \
x_vec.store(y); \
break; \
} \
case 16: \
{ \
auto x_vec = vec_mask.template loadu<dst_t, 16>(x); \
x_vec.store(y); \
break; \
} \
default: \
throw std::out_of_range("Unexpected rnd_n call to vec_mask"); \
} \
for (const auto i : c10::irange(mask_n * size)) { \
if (vec_mask.is_masked(i)) { \
ref[i] = x[i]; \
@ -2269,7 +2312,7 @@ namespace {
#undef TEST_MASK_LOAD
#undef TEST_MASK_LOAD_N
}
#if !defined(CPU_CAPABILITY_SVE)
#if !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
TYPED_TEST(VecMaskTests, MaskedCheck) {
using VT = ValueType<TypeParam>;
using vec = TypeParam;
@ -2294,7 +2337,7 @@ namespace {
#undef TEST_MASK_CHECK_N
}
#endif
#if !defined(CPU_CAPABILITY_SVE)
#if !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
TYPED_TEST(VecMaskTests, ToFrom) {
using vec = TypeParam;
using VT = ValueType<TypeParam>;
@ -2321,7 +2364,7 @@ namespace {
}
}
#endif
#if !defined(CPU_CAPABILITY_SVE)
#if !defined(CPU_CAPABILITY_SVE256) && !defined(CPU_CAPABILITY_SVE)
TYPED_TEST(VecMaskTests, Cast) {
using vec = TypeParam;
using src_t = ValueType<TypeParam>;

View File

@ -56,7 +56,7 @@ CACHE_ALIGN #define
defined(CPU_CAPABILITY_AVX512) && (defined(__GNUC__) || defined(__GNUG__))
#undef CHECK_DEQUANT_WITH_LOW_PRECISION
#define CHECK_WITH_FMA 1
#elif defined(CPU_CAPABILITY_SVE)
#elif defined(CPU_CAPABILITY_SVE256)
#define CHECK_DEQUANT_WITH_LOW_PRECISION 1
#define CHECK_WITH_FMA 1
#elif !defined(CPU_CAPABILITY_VSX) && !defined(CPU_CAPABILITY_AVX2)
@ -136,7 +136,7 @@ template<typename T>
struct VecTypeHelper {
using holdType = typename T::value_type;
using memStorageType = typename T::value_type;
static constexpr int holdCount = T::size();
static inline int holdCount = T::size();
static constexpr int unitStorageCount = 1;
};
@ -399,9 +399,9 @@ T clamp_min(const T& a, const T& min) {
return a < min ? min : a;
}
template <class VT, size_t N>
void copy_interleave(VT(&vals)[N], VT(&interleaved)[N]) {
static_assert(N % 2 == 0, "should be even");
template <class VT>
void copy_interleave(VT * vals, VT * interleaved, size_t N) {
assert(N % 2 == 0);
auto ptr1 = vals;
auto ptr2 = vals + N / 2;
for (size_t i = 0; i < N; i += 2) {
@ -871,10 +871,10 @@ public:
using UVT = UvalueType<T>;
using BVT = BitType<UVT>;
UVT absErr = correctEpsilon(toleranceEps);
constexpr int sizeX = VecTypeHelper<T>::holdCount * VecTypeHelper<T>::unitStorageCount;
const int sizeX = VecTypeHelper<T>::holdCount * VecTypeHelper<T>::unitStorageCount;
constexpr int unitStorageCount = VecTypeHelper<T>::unitStorageCount;
CACHE_ALIGN UVT expArr[sizeX];
CACHE_ALIGN UVT actArr[sizeX];
UVT expArr[sizeX];
UVT actArr[sizeX];
exp.store(expArr);
act.store(actArr);
if (bitwise)
@ -942,7 +942,7 @@ void test_unary(
using vec_type = T;
using VT = ValueType<T>;
using UVT = UvalueType<T>;
constexpr int el_count = vec_type::size();
const int el_count = vec_type::size();
CACHE_ALIGN VT vals[el_count];
CACHE_ALIGN VT expected[el_count];
bool bitwise = testCase.isBitwise();
@ -1000,7 +1000,7 @@ void test_binary(
using vec_type = T;
using VT = ValueType<T>;
using UVT = UvalueType<T>;
constexpr int el_count = vec_type::size();
const int el_count = vec_type::size();
CACHE_ALIGN VT vals0[el_count];
CACHE_ALIGN VT vals1[el_count];
CACHE_ALIGN VT expected[el_count];
@ -1163,7 +1163,7 @@ void test_ternary(
using vec_type = T;
using VT = ValueType<T>;
using UVT = UvalueType<T>;
constexpr int el_count = vec_type::size();
const int el_count = vec_type::size();
CACHE_ALIGN VT vals0[el_count];
CACHE_ALIGN VT vals1[el_count];
CACHE_ALIGN VT vals2[el_count];
@ -1203,12 +1203,15 @@ void test_ternary(
auto input1 = vec_type::loadu(vals1);
auto input2 = vec_type::loadu(vals2);
auto actual = actualFunction(input0, input1, input2);
CACHE_ALIGN VT actual_[vec_type::size()];
actual.store(actual_);
auto vec_expected = vec_type::loadu(expected);
AssertVectorized<vec_type> vecAssert(
testNameInfo, seed, vec_expected, actual, input0, input1, input2);
if (vecAssert.check(
bitwise, dmn.CheckWithTolerance, dmn.ToleranceError))
return;
return;
} // trial
changeSeedBy += 1;
}
@ -1573,19 +1576,19 @@ double getDefaultTolerance() {
template<typename T, int N = 1>
at::vec::VecMask<T, N> create_vec_mask(uint64_t bitmask) {
constexpr auto size = at::vec::Vectorized<T>::size();
std::array<int, N * size> mask;
const auto size = at::vec::Vectorized<T>::size();
int mask[N * size];
for (int n = 0; n < N; n++) {
for (int i = 0; i < size; i++) {
mask[n * size + i] = (bitmask >> i) & 1;
}
}
return at::vec::VecMask<T, N>::from(mask.data());
return at::vec::VecMask<T, N>::from(mask);
}
template<typename T, int N = 1>
at::vec::VecMask<T, N> generate_vec_mask(int seed) {
constexpr auto size = at::vec::Vectorized<T>::size();
const auto size = at::vec::Vectorized<T>::size();
ValueGen<uint64_t> generator(0, (1ULL << size) - 1, seed);
auto bitmask = generator.get();
return create_vec_mask<T, N>(bitmask);

View File

@ -13,6 +13,7 @@ flaky_models = {
"gluon_inception_v3",
"detectron2_maskrcnn_r_101_c4",
"XGLMForCausalLM", # discovered in https://github.com/pytorch/pytorch/pull/128148
"detectron2_fcos_r_50_fpn",
}

View File

@ -346,7 +346,7 @@ vgg16,pass,0
vision_maskrcnn,fail_accuracy,30
vision_maskrcnn,fail_accuracy,29

View File

@ -1,32 +1,32 @@
add_loop_eager,compile_time_instruction_count,3070000000,0.10
add_loop_eager,compile_time_instruction_count,3070000000,0.1
add_loop_eager_dynamic,compile_time_instruction_count,4432000000,0.10
add_loop_eager_dynamic,compile_time_instruction_count,4432000000,0.1
add_loop_inductor,compile_time_instruction_count,30280000000,0.10
add_loop_inductor,compile_time_instruction_count,30280000000,0.1
add_loop_inductor_dynamic_gpu,compile_time_instruction_count,39910000000,0.10
add_loop_inductor_dynamic_gpu,compile_time_instruction_count,39910000000,0.1
add_loop_inductor_gpu,compile_time_instruction_count,26800000000,0.10
add_loop_inductor_gpu,compile_time_instruction_count,26800000000,0.1
basic_modules_ListOfLinears_eager,compile_time_instruction_count,969100000,0.10
basic_modules_ListOfLinears_eager,compile_time_instruction_count,969100000,0.1
basic_modules_ListOfLinears_inductor,compile_time_instruction_count,18030000000,0.10
basic_modules_ListOfLinears_inductor,compile_time_instruction_count,15240000000,0.1
basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17020000000,0.10
basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17020000000,0.1
@ -34,56 +34,56 @@ basic_modules_ListOfLinears_inductor_gpu,compile_time_instruction_count,11090000
update_hint_regression,compile_time_instruction_count,1719000000,0.10
update_hint_regression,compile_time_instruction_count,1719000000,0.1
sum_floordiv_regression,compile_time_instruction_count,966100000,0.10
sum_floordiv_regression,compile_time_instruction_count,966100000,0.1
symint_sum,compile_time_instruction_count,3237000000,0.10
symint_sum,compile_time_instruction_count,3237000000,0.1
symint_sum_loop,compile_time_instruction_count,4299000000,0.10
symint_sum_loop,compile_time_instruction_count,4299000000,0.1
aotdispatcher_inference_nosubclass_cpu,compile_time_instruction_count,2151000000,0.10
aotdispatcher_inference_nosubclass_cpu,compile_time_instruction_count,2151000000,0.1
aotdispatcher_inference_subclass_cpu,compile_time_instruction_count,6124000000,0.10
aotdispatcher_inference_subclass_cpu,compile_time_instruction_count,6124000000,0.1
aotdispatcher_partitioner_cpu,compile_time_instruction_count,9005000000,0.10
aotdispatcher_partitioner_cpu,compile_time_instruction_count,9005000000,0.1
aotdispatcher_partitioner_cpu2,compile_time_instruction_count,1989000000,0.10
aotdispatcher_partitioner_cpu2,compile_time_instruction_count,1989000000,0.1
aotdispatcher_training_nosubclass_cpu,compile_time_instruction_count,3959000000,0.10
aotdispatcher_training_nosubclass_cpu,compile_time_instruction_count,3959000000,0.1
aotdispatcher_training_subclass_cpu,compile_time_instruction_count,10650000000,0.10
aotdispatcher_training_subclass_cpu,compile_time_instruction_count,10650000000,0.1
mm_loop_inductor_gpu,compile_time_instruction_count,4461000000,0.10
mm_loop_inductor_gpu,compile_time_instruction_count,4461000000,0.1
mm_loop_inductor_dynamic_gpu,compile_time_instruction_count,8417000000,0.10
mm_loop_inductor_dynamic_gpu,compile_time_instruction_count,8417000000,0.1
basic_NestedModule_eager,compile_time_instruction_count,8348000000,0.10
basic_NestedModule_eager,compile_time_instruction_count,8348000000,0.1
basic_InlineMod_eager,compile_time_instruction_count,7464000000,0.10
basic_InlineMod_eager,compile_time_instruction_count,7464000000,0.1


View File

@ -944,6 +944,7 @@ def define_buck_targets(
[
("torch/csrc/api/include", "torch/**/*.h"),
("", "torch/csrc/**/*.h"),
("", "torch/nativert/**/*.h"),
("", "torch/headeronly/**/*.h"),
("", "torch/script.h"),
("", "torch/library.h"),

View File

@ -593,6 +593,7 @@ libtorch_core_jit_sources = sorted(jit_sources_full)
libtorch_nativert_sources = [
"torch/nativert/ModelRunner.cpp",
"torch/nativert/graph/Graph.cpp",
"torch/nativert/graph/GraphPasses.cpp",
"torch/nativert/graph/GraphSignature.cpp",
@ -864,6 +865,7 @@ libtorch_python_core_sources = [
"torch/csrc/QScheme.cpp",
"torch/csrc/Module.cpp",
"torch/csrc/PyInterpreter.cpp",
"torch/csrc/PyInterpreterHooks.cpp",
"torch/csrc/python_dimname.cpp",
"torch/csrc/Size.cpp",
"torch/csrc/Storage.cpp",
@ -986,6 +988,7 @@ libtorch_python_core_sources = [
"torch/csrc/utils/verbose.cpp",
"torch/csrc/cpu/Module.cpp",
"torch/csrc/instruction_counter/Module.cpp",
"torch/nativert/python/Bindings.cpp",
] + lazy_tensor_core_python_sources
libtorch_python_distributed_core_sources = [

View File

@ -0,0 +1,241 @@
#include <c10/core/AllocatorConfig.h>
#include <c10/core/DeviceType.h>
#include <c10/util/env.h>
namespace c10::CachingAllocator {
namespace {
constexpr size_t kRoundUpPowerOfTwoIntervals = 16;
constexpr size_t kMB = 1024 * 1024ul;
constexpr size_t kRoundUpPowerOfTwoStart = 1 * kMB; // 1MB
constexpr size_t kRoundUpPowerOfTwoEnd = 64 * 1024ul * kMB; // 64GB
} // anonymous namespace
AcceleratorAllocatorConfig& AcceleratorAllocatorConfig::instance() {
static AcceleratorAllocatorConfig instance;
#define C10_ALLOCATOR_CONFIG_PARSE_ENV(env, deprecated) \
auto env##_name = c10::utils::get_env(#env); \
if (env##_name.has_value()) { \
if (deprecated) { \
TORCH_WARN_ONCE(#env " is deprecated, use PYTORCH_ALLOC_CONF instead"); \
} \
instance.parseArgs(env##_name.value()); \
return true; \
}
static bool env_flag [[maybe_unused]] = []() {
C10_ALLOCATOR_CONFIG_PARSE_ENV(PYTORCH_ALLOC_CONF, false)
// Keep this for backwards compatibility
C10_ALLOCATOR_CONFIG_PARSE_ENV(PYTORCH_CUDA_ALLOC_CONF, /*deprecated=*/true)
C10_ALLOCATOR_CONFIG_PARSE_ENV(PYTORCH_HIP_ALLOC_CONF, /*deprecated=*/true)
return false;
}();
#undef C10_ALLOCATOR_CONFIG_PARSE_ENV
return instance;
}
AcceleratorAllocatorConfig::AcceleratorAllocatorConfig() {
roundup_power2_divisions_.assign(kRoundUpPowerOfTwoIntervals, 0);
}
size_t AcceleratorAllocatorConfig::roundup_power2_divisions(size_t size) {
size_t log_size = (63 - llvm::countLeadingZeros(size));
// Our intervals start at 1MB and end at 64GB
const size_t interval_start =
63 - llvm::countLeadingZeros(kRoundUpPowerOfTwoStart);
const size_t interval_end =
63 - llvm::countLeadingZeros(kRoundUpPowerOfTwoEnd);
TORCH_CHECK_VALUE(
interval_end - interval_start == kRoundUpPowerOfTwoIntervals,
"kRoundUpPowerOfTwoIntervals mismatch");
size_t index =
(log_size > interval_start) ? (log_size - interval_start) : 0ul;
index = std::min(index, kRoundUpPowerOfTwoIntervals - 1);
return instance().roundup_power2_divisions_[index];
}
size_t AcceleratorAllocatorConfig::parseMaxSplitSize(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
constexpr size_t min_allowed_split_size_mb = kLargeBuffer / kMB;
constexpr size_t max_allowed_split_size_mb =
std::numeric_limits<size_t>::max() / kMB;
size_t val_env = tokenizer.toSizeT(++i);
TORCH_CHECK_VALUE(
val_env >= min_allowed_split_size_mb,
"CachingAllocator option max_split_size_mb too small, must be >= ",
min_allowed_split_size_mb);
val_env = std::min(val_env, max_allowed_split_size_mb);
max_split_size_ = val_env * kMB;
return i;
}
size_t AcceleratorAllocatorConfig::parseMaxNonSplitRoundingSize(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
constexpr size_t min_allowed_split_size_mb = kLargeBuffer / kMB;
constexpr size_t max_allowed_split_size_mb =
std::numeric_limits<size_t>::max() / kMB;
size_t val_env = tokenizer.toSizeT(++i);
TORCH_CHECK_VALUE(
val_env >= min_allowed_split_size_mb,
"CachingAllocator option max_non_split_rounding_mb too small, must be >= ",
min_allowed_split_size_mb);
val_env = std::min(val_env, max_allowed_split_size_mb);
max_non_split_rounding_size_ = val_env * kMB;
return i;
}
size_t AcceleratorAllocatorConfig::parseGarbageCollectionThreshold(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
double val_env = tokenizer.toDouble(++i);
TORCH_CHECK_VALUE(
val_env > 0 && val_env < 1.0,
"garbage_collect_threshold is invalid, set it in (0.0, 1.0)");
garbage_collection_threshold_ = val_env;
return i;
}
size_t AcceleratorAllocatorConfig::parseRoundUpPower2Divisions(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
bool first_value = true;
if (tokenizer[++i] == "[") {
size_t last_index = 0;
// NOLINTNEXTLINE(bugprone-inc-dec-in-conditions)
while (++i < tokenizer.size() && tokenizer[i] != "]") {
size_t value_index = i;
tokenizer.checkToken(++i, ":");
size_t value = tokenizer.toSizeT(++i);
TORCH_CHECK_VALUE(
value == 0 || llvm::isPowerOf2_64(value),
"For roundups, the divisions has to be power of 2 or 0 to disable roundup ");
if (tokenizer[value_index] == ">") {
std::fill(
std::next(
roundup_power2_divisions_.begin(),
static_cast<std::vector<size_t>::difference_type>(
last_index + 1)),
roundup_power2_divisions_.end(),
value);
} else {
size_t boundary = tokenizer.toSizeT(value_index);
TORCH_CHECK_VALUE(
llvm::isPowerOf2_64(boundary),
"For roundups, the intervals have to be power of 2 ");
size_t index = 63 - llvm::countLeadingZeros(boundary);
index =
std::clamp(index, size_t{0}, roundup_power2_divisions_.size() - 1);
if (first_value) {
std::fill(
roundup_power2_divisions_.begin(),
std::next(
roundup_power2_divisions_.begin(),
static_cast<std::vector<size_t>::difference_type>(index)),
value);
first_value = false;
}
roundup_power2_divisions_[index] = value;
last_index = index;
}
if (tokenizer[i + 1] != "]") {
tokenizer.checkToken(++i, ",");
}
}
TORCH_INTERNAL_ASSERT(
i < tokenizer.size(),
"Expected closing bracket ']' in ConfigTokenizer but reached end of config");
} else { // Keep this for backwards compatibility
size_t value = tokenizer.toSizeT(i);
TORCH_CHECK_VALUE(
llvm::isPowerOf2_64(value),
"For roundups, the divisions has to be power of 2 ");
std::fill(
roundup_power2_divisions_.begin(),
roundup_power2_divisions_.end(),
value);
}
return i;
}
size_t AcceleratorAllocatorConfig::parseExpandableSegments(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
use_expandable_segments_ = tokenizer.toBool(++i);
return i;
}
size_t AcceleratorAllocatorConfig::parsePinnedUseBackgroundThreads(
const ConfigTokenizer& tokenizer,
size_t i) {
tokenizer.checkToken(++i, ":");
pinned_use_background_threads_ = tokenizer.toBool(++i);
return i;
}
void AcceleratorAllocatorConfig::parseArgs(const std::string& env) {
// The following option will be reset to its default value if not explicitly
// set each time.
max_split_size_ = std::numeric_limits<size_t>::max();
roundup_power2_divisions_.assign(kRoundUpPowerOfTwoIntervals, 0);
garbage_collection_threshold_ = 0;
{
std::lock_guard<std::mutex> lock(last_allocator_settings_mutex_);
last_allocator_settings_ = env;
}
ConfigTokenizer tokenizer(env);
for (size_t i = 0; i < tokenizer.size(); i++) {
const auto& key = tokenizer[i];
if (key == "max_split_size_mb") {
i = parseMaxSplitSize(tokenizer, i);
} else if (key == "max_non_split_rounding_mb") {
i = parseMaxNonSplitRoundingSize(tokenizer, i);
} else if (key == "garbage_collection_threshold") {
i = parseGarbageCollectionThreshold(tokenizer, i);
} else if (key == "roundup_power2_divisions") {
i = parseRoundUpPower2Divisions(tokenizer, i);
} else if (key == "expandable_segments") {
i = parseExpandableSegments(tokenizer, i);
} else if (key == "pinned_use_background_threads") {
i = parsePinnedUseBackgroundThreads(tokenizer, i);
} else {
// If a device-specific configuration parser hook is registered, it will
// check if the key is unrecognized.
if (device_config_parser_hook_) {
TORCH_CHECK(
keys_.find(key) != keys_.end(),
"Unrecognized key '",
key,
"' in Accelerator allocator config.");
}
i = tokenizer.skipKey(i);
}
if (i + 1 < tokenizer.size()) {
tokenizer.checkToken(++i, ",");
}
}
}
} // namespace c10::CachingAllocator
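The parsing above is driven by the ConfigTokenizer declared in the header that follows. A short sketch (illustrative only) of how it splits a config string into the tokens documented on the class:

#include <c10/core/AllocatorConfig.h>
#include <iostream>

int main() {
  c10::CachingAllocator::ConfigTokenizer tok("key1:val1, key2:[val2,val3]");
  // Prints: key1 : val1 , key2 : [ val2 , val3 ]   (whitespace is dropped)
  for (size_t i = 0; i < tok.size(); ++i) {
    std::cout << tok[i] << ' ';
  }
  std::cout << '\n';
  return 0;
}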

c10/core/AllocatorConfig.h Normal file
View File

@ -0,0 +1,372 @@
#pragma once
#include <c10/core/DeviceType.h>
#include <c10/util/Exception.h>
#include <c10/util/llvmMathExtras.h>
#include <atomic>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>
namespace c10::CachingAllocator {
// "large" allocations may be packed in 20 MiB blocks
const size_t kLargeBuffer = 20971520;
// A utility class for tokenizing allocator configuration strings into discrete
// parts. For example, the config string:
// "key1:val1,key2:[val2,val3]"
// is tokenized into:
// "key1", ":", "val1", ",", "key2", ":", "[", "val2", ",", "val3", "]",
//
// Tokens include keys, values, and special characters (':', ',', '[', ']').
// Whitespace is ignored.
class ConfigTokenizer {
public:
explicit ConfigTokenizer(const std::string& env) {
std::string buffer;
for (char ch : env) {
if (ch == ',' || ch == ':' || ch == '[' || ch == ']') {
if (!buffer.empty()) {
config_.emplace_back(std::move(buffer));
buffer.clear();
}
config_.emplace_back(1, ch);
} else if (!std::isspace(static_cast<unsigned char>(ch))) {
buffer += ch;
}
}
if (!buffer.empty()) {
config_.emplace_back(std::move(buffer));
}
}
const std::string& operator[](size_t i) const {
TORCH_INTERNAL_ASSERT(
i < config_.size(), "Index out of bounds in ConfigTokenizer");
return config_[i];
}
size_t size() const {
return config_.size();
}
bool checkToken(size_t i, const std::string& token) const {
checkIndex(i);
return config_[i] == token;
}
size_t toSizeT(size_t i) const {
checkIndex(i);
return std::stoull(config_[i]);
}
double toDouble(size_t i) const {
checkIndex(i);
return std::stod(config_[i]);
}
bool toBool(size_t i) const {
checkIndex(i);
const auto& token = config_[i];
if (token == "True") {
return true;
} else if (token == "False") {
return false;
} else {
TORCH_CHECK_VALUE(
false,
"Expected 'True' or 'False' at index ",
i,
" in ConfigTokenizer but got '",
token,
"'");
}
}
// Skips the current token group and returns the index of the value token.
// Assumes the current index `i` points to a key name in a key-value pair.
size_t skipKey(size_t i) const {
// Expect a colon after the key
checkToken(++i, ":");
++i; // Move to the value
checkIndex(i);
if (config_[i] != "[") {
// Value is a single token (not a list) -> return its index
return i;
}
// Skip tokens inside the list until matching ']'
// NOLINTNEXTLINE(bugprone-inc-dec-in-conditions)
while (++i < config_.size() && config_[i] != "]") {
}
TORCH_INTERNAL_ASSERT(
i < config_.size(),
"Expected closing bracket ']' in ConfigTokenizer but reached end of config");
return i; // Return the index of the closing ']'
}
private:
void checkIndex(size_t i) const {
TORCH_INTERNAL_ASSERT(
i < config_.size(), "Index out of bounds in ConfigTokenizer");
}
std::vector<std::string> config_;
};
/**
* Note [AcceleratorAllocatorConfig design]
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* This class configures memory allocation for both device and host memory. A
* single `AcceleratorAllocatorConfig` instance is shared across all accelerator
* backends, such as CUDA and XPU, under the assumption that relevant
* environment variables apply uniformly to all accelerators. Device-specific
* configuration extensions are supported via hooks (see
* `registerDeviceConfigParserHook`).
*
* Recommended design:
* - Place common configurations in `AcceleratorAllocatorConfig`.
* - Extend backend-specific configurations in corresponding device-specific
* classes, such as `CUDAAllocatorConfig`, etc.
*
* Scope:
* - Configuration options must be environment-variable driven.
*
* Naming Convention:
* - Public API names in `AcceleratorAllocatorConfig` should be device-generic.
* - Members prefixed with `pinned_` are specific to the host/pinned allocator.
* - Environment variable names should be generic across backends.
* - Comma-separated key-value pairs in the format: `key:value`. Use square
* brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
*
* Environment Variables:
* - The primary environment variable for configuration is `PYTORCH_ALLOC_CONF`.
* - For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` is also supported
* with lower priority.
*/
class C10_API AcceleratorAllocatorConfig {
public:
static AcceleratorAllocatorConfig& instance();
C10_DISABLE_COPY_AND_ASSIGN(AcceleratorAllocatorConfig);
AcceleratorAllocatorConfig(AcceleratorAllocatorConfig&&) = delete;
AcceleratorAllocatorConfig& operator=(AcceleratorAllocatorConfig&&) = delete;
~AcceleratorAllocatorConfig() = default;
/* Device allocator settings */
// Returns the maximum block size (in MB) that is allowed to be split. The
// default is unlimited (all blocks can be split).
static size_t max_split_size() {
return instance().max_split_size_;
}
// Returns the maximum block size (in MB) that is allowed to be rounded up
// without requiring splitting when searching for a free block. The default is
// 20 MiB.
static size_t max_non_split_rounding_size() {
return instance().max_non_split_rounding_size_;
}
// Returns the number of divisions used when rounding up allocation sizes (in
// MB) to the nearest power-of-2 boundary.
static size_t roundup_power2_divisions(size_t size);
// Returns the vector of division factors used for rounding up allocation
// sizes. These divisions apply to size intervals between 1MB and 64GB.
static const std::vector<size_t>& roundup_power2_divisions() {
return instance().roundup_power2_divisions_;
}
// Returns the threshold that triggers garbage collection when the ratio of
// used memory to maximum allowed memory exceeds this value. The default is 0,
// meaning no garbage collection is triggered. The value should be in the
// range (0.0, 1.0).
static double garbage_collection_threshold() {
return instance().garbage_collection_threshold_;
}
// Returns whether the expandable segment feature is enabled. This allows the
// allocator to start with one segment that grows as needed, rather than
// creating a new segment for each allocation. Default is false (expandable
// segments disabled).
static bool use_expandable_segments() {
return instance().use_expandable_segments_;
}
/* Host allocator settings */
// Returns whether the pinned host allocator uses background threads for
// processing events. This is useful for improving performance in scenarios
// where many small allocations are made. Default is false (background threads
// disabled).
static bool pinned_use_background_threads() {
return instance().pinned_use_background_threads_;
}
/* Settings for both device and host allocator */
// Returns the current allocator settings as a string. This string is useful
// to expand device-specific allocator configurations
static std::string last_allocator_settings() {
std::lock_guard<std::mutex> lock(instance().last_allocator_settings_mutex_);
return instance().last_allocator_settings_;
}
// Returns the set of valid keys for the allocator configuration.
// This set is used to validate the presence and correctness of keys in
// device-specific configuration parsers.
static const std::unordered_set<std::string>& getKeys() {
return keys_;
}
// Registers a device-specific configuration parser hook and its key. This
// allows backends to parse additional device-specific configuration options
// from the environment variable. The hook should be a function that takes a
// string (the environment variable value) and parses it to set
// device-specific configuration options. The hook will be called when the
// environment variable is parsed. If a hook is already registered, it will be
// replaced with the new one.
static void registerDeviceConfigParserHook(
std::function<void(const std::string&)>&& hook,
const std::unordered_set<std::string>& keys) {
device_config_parser_hook_ = std::move(hook);
for (auto& key : keys) {
TORCH_CHECK(
keys_.insert(key).second,
"Duplicated key '",
key,
"' found in device-specific configuration parser hook registration");
}
}
// Calls the registered device-specific configuration parser hook with the
// provided environment string. This allows backends to parse additional
// device-specific configuration options from the environment variable.
// If no hook is registered, this function does nothing.
static void callDeviceConfigParserHook(const std::string& env) {
if (device_config_parser_hook_) {
device_config_parser_hook_(env);
}
}
// Parses the environment variable `env` to update the allocator settings.
// If the environment variable is not set, it does nothing.
// The configuration string should be a comma-separated list of key-value
// pairs, where each key is a configuration option and the value is the
// corresponding setting. For example:
// "max_split_size_mb:100,max_non_split_rounding_mb:20,garbage_collection_threshold:0.5,roundup_power2_divisions:[64:8,256:4,1024:4,>:1],expandable_segments:true,pinned_use_background_threads:true"
void parseArgs(const std::string& env);
private:
AcceleratorAllocatorConfig();
/* Internal functions for device allocator */
// Parse `max_split_size_mb` from environment variable.
size_t parseMaxSplitSize(const ConfigTokenizer& tokenizer, size_t i);
// Parse `max_non_split_rounding_mb` from environment variable.
size_t parseMaxNonSplitRoundingSize(
const ConfigTokenizer& tokenizer,
size_t i);
// Parse `garbage_collection_threshold` from environment variable.
size_t parseGarbageCollectionThreshold(
const ConfigTokenizer& tokenizer,
size_t i);
// Parse `roundup_power2_divisions` from environment variable.
size_t parseRoundUpPower2Divisions(
const ConfigTokenizer& tokenizer,
size_t i);
// Parse `expandable_segments` from environment variable.
size_t parseExpandableSegments(const ConfigTokenizer& tokenizer, size_t i);
/* Internal functions for host allocator */
// Parse `pinned_use_background_threads` from environment variable.
size_t parsePinnedUseBackgroundThreads(
const ConfigTokenizer& tokenizer,
size_t i);
/* The following members are specifically used for the device allocator. */
// The maximum block size that is allowed to be split.
std::atomic<size_t> max_split_size_{std::numeric_limits<size_t>::max()};
// The maximum allowable extra size of a memory block without requiring
// splitting when searching for a free block.
std::atomic<size_t> max_non_split_rounding_size_{kLargeBuffer};
// Used to store how memory allocations of different sizes should be rounded
// up to the nearest power of 2 divisions.
std::vector<size_t> roundup_power2_divisions_;
// The threshold that triggers garbage collection when the ratio of used
// memory to maximum allowed memory exceeds this value.
std::atomic<double> garbage_collection_threshold_{0};
// A flag to enable expandable segments feature.
std::atomic<bool> use_expandable_segments_{false};
/* The following members are specifically used for the host allocator. */
// A flag to enable background thread for processing events.
std::atomic<bool> pinned_use_background_threads_{false};
/* The following members are used for both device and host allocator. */
// Record the last allocator config environment setting.
std::mutex last_allocator_settings_mutex_;
std::string last_allocator_settings_;
// Optional hook for parsing additional device-specific allocator settings.
// This allows backends (e.g., CUDA, XPU) to register a custom parser for
// their own environment configuration extensions.
inline static std::function<void(const std::string&)>
device_config_parser_hook_{nullptr};
// A set of valid configuration keys, including both common and
// device-specific options. This set is used to validate the presence and
// correctness of keys during parsing.
inline static std::unordered_set<std::string> keys_{
"max_split_size_mb",
"max_non_split_rounding_mb",
"garbage_collection_threshold",
"roundup_power2_divisions",
"expandable_segments",
"pinned_use_background_threads"};
};
C10_API inline void setAllocatorSettings(const std::string& env) {
AcceleratorAllocatorConfig::instance().parseArgs(env);
AcceleratorAllocatorConfig::callDeviceConfigParserHook(env);
}
C10_API inline std::string getAllocatorSettings() {
return AcceleratorAllocatorConfig::instance().last_allocator_settings();
}
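Illustrative usage sketch (not part of this diff): the helpers above can be driven programmatically with the same comma-separated key:value string documented on parseArgs(); the header path below is an assumption.
#include <c10/core/AllocatorConfig.h> // assumed header for these declarations
#include <iostream>
#include <string>

// Hypothetical helper: applies a config string and echoes back the recorded
// settings. Every key must be known to AcceleratorAllocatorConfig or to a
// registered device hook, otherwise parsing rejects it.
void exampleConfigureAcceleratorAllocator() {
  c10::CachingAllocator::setAllocatorSettings(
      "max_split_size_mb:100,"
      "garbage_collection_threshold:0.5,"
      "roundup_power2_divisions:[64:8,256:4,1024:4,>:1],"
      "expandable_segments:true,"
      "pinned_use_background_threads:true");
  std::cout << c10::CachingAllocator::getAllocatorSettings() << std::endl;
}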
struct DeviceConfigParserHookRegistry {
explicit DeviceConfigParserHookRegistry(
std::function<void(const std::string&)>&& hook,
const std::unordered_set<std::string>& keys) {
// Use static method to avoid static initialization order fiasco issues
AcceleratorAllocatorConfig::registerDeviceConfigParserHook(
std::move(hook), keys);
}
};
// Assume each config parser has `parseArgs` and `getKeys` methods
#define REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK(parser_cls) \
namespace { \
static at::CachingAllocator::DeviceConfigParserHookRegistry \
g_device_config_parse_hook_registry_instance( \
[](const std::string& env) { \
parser_cls::instance().parseArgs(env); \
}, \
parser_cls::getKeys()); \
}
} // namespace c10::CachingAllocator
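A minimal sketch of the registration macro in use, assuming a hypothetical backend config class; MyDeviceAllocatorConfig and my_device_pool_size_mb are illustrative names, and the only contract relied on is the instance()/parseArgs()/getKeys() surface the macro above expects.
#include <string>
#include <unordered_set>

class MyDeviceAllocatorConfig {
 public:
  static MyDeviceAllocatorConfig& instance() {
    static MyDeviceAllocatorConfig inst;
    return inst;
  }
  static std::unordered_set<std::string> getKeys() {
    return {"my_device_pool_size_mb"};
  }
  void parseArgs(const std::string& env) {
    // Tokenize `env` and pick out "my_device_pool_size_mb" here; keys owned
    // by AcceleratorAllocatorConfig can simply be skipped.
  }
};

// Instantiates a static DeviceConfigParserHookRegistry, so this parseArgs()
// runs whenever the shared allocator config string is (re)parsed.
REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK(MyDeviceAllocatorConfig)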

View File

@@ -240,24 +240,4 @@ struct C10_API PyInterpreter {
void disarm() noexcept;
};
// PyInterpreterStatus describes what the state of its interpreter tag
// is, relative to the thread currently holding the GIL.
enum class PyInterpreterStatus {
// We just allocated the Tensor, it hasn't escaped to other threads,
// we know that it definitely hasn't been tagged to be associated
// with an interpreter.
DEFINITELY_UNINITIALIZED,
// We queried the interpreter field and it looked uninitialized. But
// another thread may have raced with us to tag it with some other
// interpreter id. So we will have to do a CEX to make sure we can
// actually nab it.
MAYBE_UNINITIALIZED,
// We queried the interpreter field and it was tagged to belong to us.
// This means we have sole write access (as we hold the GIL for this
// interpreter)
TAGGED_BY_US,
// Someone else tagged this. We can't use this TensorImpl from Python.
TAGGED_BY_OTHER,
};
} // namespace c10::impl

View File

@@ -0,0 +1,32 @@
#include <c10/core/impl/PyInterpreterHooks.h>
namespace c10::impl {
// Define the registry
C10_DEFINE_REGISTRY(
PyInterpreterHooksRegistry,
PyInterpreterHooksInterface,
PyInterpreterHooksArgs)
const PyInterpreterHooksInterface& getPyInterpreterHooks() {
auto create_impl = [] {
#if !defined C10_MOBILE
auto hooks = PyInterpreterHooksRegistry()->Create(
"PyInterpreterHooks", PyInterpreterHooksArgs{});
if (hooks) {
return hooks;
}
#endif
// Return stub implementation that will throw errors when methods are called
return std::make_unique<PyInterpreterHooksInterface>();
};
static auto hooks = create_impl();
return *hooks;
}
// Main function to get global PyInterpreter
PyInterpreter* getGlobalPyInterpreter() {
return getPyInterpreterHooks().getPyInterpreter();
}
} // namespace c10::impl

View File

@@ -0,0 +1,39 @@
#pragma once
#include <c10/core/impl/PyInterpreter.h>
#include <c10/macros/Export.h>
#include <c10/util/Registry.h>
#include <memory>
namespace c10::impl {
// Minimal interface for PyInterpreter hooks
struct C10_API PyInterpreterHooksInterface {
virtual ~PyInterpreterHooksInterface() = default;
// Get the PyInterpreter instance
// Stub implementation throws error when Python is not available
virtual PyInterpreter* getPyInterpreter() const {
TORCH_CHECK(
false,
"PyTorch was compiled without Python support. "
"Cannot access Python interpreter from C++.");
}
};
struct C10_API PyInterpreterHooksArgs{};
C10_DECLARE_REGISTRY(
PyInterpreterHooksRegistry,
PyInterpreterHooksInterface,
PyInterpreterHooksArgs);
#define REGISTER_PYTHON_HOOKS(clsname) \
C10_REGISTER_CLASS(PyInterpreterHooksRegistry, clsname, clsname)
// Get the global PyInterpreter hooks instance
C10_API const PyInterpreterHooksInterface& getPyInterpreterHooks();
C10_API PyInterpreter* getGlobalPyInterpreter();
} // namespace c10::impl
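A small consumer-side sketch, not from the diff: with no hooks implementation registered (e.g. a Python-free build), the stub interface above makes getPyInterpreter() throw, so a caller that only wants to probe for an interpreter can catch that error.
#include <c10/core/impl/PyInterpreterHooks.h>
#include <c10/util/Exception.h>

// Hypothetical probe: returns the global interpreter if hooks are registered
// and Python is initialized, nullptr otherwise.
c10::impl::PyInterpreter* tryGetGlobalPyInterpreter() {
  try {
    return c10::impl::getGlobalPyInterpreter();
  } catch (const c10::Error&) {
    return nullptr; // built without Python support: the stub hooks threw
  }
}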

View File

@@ -34,29 +34,12 @@ PyObject* PyObjectSlot::_unchecked_untagged_pyobj() const {
reinterpret_cast<uintptr_t>(pyobj_) & ~0x1ULL);
}
void PyObjectSlot::unchecked_clear_pyobj(PyInterpreter* interpreter) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(interpreter == pyobj_interpreter_.load());
pyobj_ = nullptr;
}
PyInterpreter& PyObjectSlot::load_pyobj_interpreter() const {
auto interpreter = pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter) {
return *interpreter;
}
TORCH_CHECK(
false,
"cannot access PyObject for Tensor on interpreter ",
(*pyobj_interpreter_.load())->name());
}
bool PyObjectSlot::check_interpreter(PyInterpreter* interpreter) {
return interpreter == pyobj_interpreter();
}
bool PyObjectSlot::has_pyobj_nonhermetic() {
return check_pyobj(pyobj_interpreter(), /*ignore_hermetic_tls=*/true)
.has_value();
TORCH_CHECK(false, "cannot access PyObject for Tensor - no interpreter set");
}
bool PyObjectSlot::owns_pyobj() {

View File

@@ -2,6 +2,7 @@
#include <c10/core/impl/HermeticPyObjectTLS.h>
#include <c10/core/impl/PyInterpreter.h>
#include <c10/core/impl/PyInterpreterHooks.h>
#include <c10/util/python_stub.h>
#include <optional>
@@ -24,52 +25,9 @@ struct C10_API PyObjectSlot {
//
// NB: THIS FUNCTION CAN RAISE AN EXCEPTION. Make sure to clean up after
// PyObject if necessary!
void init_pyobj(
PyInterpreter* self_interpreter,
PyObject* pyobj,
PyInterpreterStatus status) {
impl::PyInterpreter* expected = nullptr;
switch (status) {
case impl::PyInterpreterStatus::DEFINITELY_UNINITIALIZED:
// caller guarantees there is no multithreaded access; if there is
// no data race OK to do a relaxed store
pyobj_interpreter_.store(self_interpreter, std::memory_order_relaxed);
break;
case impl::PyInterpreterStatus::TAGGED_BY_US:
// no tagging is necessary, the tag is already correct
break;
case impl::PyInterpreterStatus::MAYBE_UNINITIALIZED:
// attempt to claim this TensorImpl with the specified interpreter
// tag
if (pyobj_interpreter_.compare_exchange_strong(
expected, self_interpreter, std::memory_order_acq_rel)) {
break;
}
// test if, actually, it was already tagged by us! this situation can't
// be caused by a race, but it could be caused by a situation
// where someone conservatively tagged the tensor as MAYBE_UNINITIALIZED
// (because they didn't pre-check the tag) when actually it was
// owned by the interpreter
if (expected == self_interpreter) {
break;
}
// fallthrough, we lost the race. We are guaranteed not to lose the
// race with ourself, as calls to init_pyobj with the same interpreter
// ID must be sequentialized by the GIL
[[fallthrough]];
case impl::PyInterpreterStatus::TAGGED_BY_OTHER:
TORCH_CHECK(
false,
"cannot allocate PyObject for Tensor on interpreter ",
self_interpreter,
" that has already been used by another torch deploy interpreter ",
pyobj_interpreter_.load());
}
// we are the ONLY thread that can have gotten to this point. It is not
// possible to conflict with another zero interpreter as access is protected
// by GIL
// NB: owns_pyobj tag is initially false
void init_pyobj(PyObject* pyobj) {
pyobj_interpreter_.store(
getGlobalPyInterpreter(), std::memory_order_relaxed);
pyobj_ = pyobj;
}
@@ -94,49 +52,25 @@ struct C10_API PyObjectSlot {
//
// NB: this lives in header so that we can avoid actually creating the
// std::optional
std::optional<PyObject*> check_pyobj(
PyInterpreter* self_interpreter,
bool ignore_hermetic_tls = false) const {
// Note [Memory ordering on Python interpreter tag]
// @todo alban: I'm not too sure what's going on here, we can probably delete
// it but it's worthwhile making sure
std::optional<PyObject*> check_pyobj(bool ignore_hermetic_tls = false) const {
impl::PyInterpreter* interpreter =
pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter == nullptr) {
// NB: This never returns DEFINITELY_UNINITIALIZED because there is
// always the possibility that another thread races to initialize
// after we query here. The only time when we can conclude a tensor
// is definitely uninitialized is when we have just allocated it and
// it cannot have escaped to other threads yet
return std::nullopt;
} else if (interpreter == self_interpreter) {
// NB: pyobj_ could still be null!
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else {
return _unchecked_untagged_pyobj();
}
}
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else {
TORCH_CHECK(
false,
"cannot access PyObject for Tensor on interpreter ",
(*self_interpreter)->name(),
" that has already been used by another torch deploy interpreter ",
(*pyobj_interpreter_.load())->name());
return _unchecked_untagged_pyobj();
}
}
// Clear the PyObject field for an interpreter, in situations where we
// statically know the tensor is tagged with our interpreter.
void unchecked_clear_pyobj(PyInterpreter* interpreter);
PyInterpreter& load_pyobj_interpreter() const;
// Check if the PyObjectSlot's interpreter is the same as the specified
// interpreter
bool check_interpreter(PyInterpreter* interpreter);
// Check if the PyObjectSlot is holding a PyObject, owned or non-owned
bool has_pyobj_nonhermetic();
bool owns_pyobj();
void set_owns_pyobj(bool b);
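A minimal sketch (not in the diff) of the simplified API: after this change a slot is tagged with the single global interpreter, so callers pass only the PyObject; attachPyObjectSketch and its arguments are hypothetical.
#include <c10/core/impl/PyObjectSlot.h>
#include <c10/util/Exception.h>
#include <optional>

void attachPyObjectSketch(c10::impl::PyObjectSlot& slot, PyObject* obj) {
  slot.init_pyobj(obj); // tags the slot with getGlobalPyInterpreter()
  // Reading it back (ignoring hermetic TLS) yields the stored object.
  std::optional<PyObject*> maybe =
      slot.check_pyobj(/*ignore_hermetic_tls=*/true);
  TORCH_INTERNAL_ASSERT(maybe.has_value() && maybe.value() == obj);
}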

View File

@@ -1,389 +1,119 @@
#include <c10/cuda/CUDAAllocatorConfig.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/llvmMathExtras.h>
#if !defined(USE_ROCM) && defined(PYTORCH_C10_DRIVER_API_SUPPORTED)
#include <c10/cuda/driver_api.h>
#endif
#include <cuda_runtime_api.h>
namespace c10::cuda::CUDACachingAllocator {
constexpr size_t kRoundUpPowerOfTwoIntervals = 16;
CUDAAllocatorConfig::CUDAAllocatorConfig()
: m_max_split_size(std::numeric_limits<size_t>::max()),
m_max_non_split_rounding_size(kLargeBuffer),
m_garbage_collection_threshold(0),
m_pinned_num_register_threads(1),
m_expandable_segments(false),
#if CUDA_VERSION >= 12030
m_expandable_segments_handle_type(
Expandable_Segments_Handle_Type::UNSPECIFIED),
#else
m_expandable_segments_handle_type(
Expandable_Segments_Handle_Type::POSIX_FD),
#endif
m_release_lock_on_cudamalloc(false),
m_pinned_use_cuda_host_register(false),
m_pinned_use_background_threads(false) {
m_roundup_power2_divisions.assign(kRoundUpPowerOfTwoIntervals, 0);
}
size_t CUDAAllocatorConfig::roundup_power2_divisions(size_t size) {
size_t log_size = (63 - llvm::countLeadingZeros(size));
// Our intervals start at 1MB and end at 64GB
const size_t interval_start =
63 - llvm::countLeadingZeros(static_cast<size_t>(1048576));
const size_t interval_end =
63 - llvm::countLeadingZeros(static_cast<size_t>(68719476736));
TORCH_CHECK(
(interval_end - interval_start == kRoundUpPowerOfTwoIntervals),
"kRoundUpPowerOfTwoIntervals mismatch");
int index = static_cast<int>(log_size) - static_cast<int>(interval_start);
index = std::max(0, index);
index = std::min(index, static_cast<int>(kRoundUpPowerOfTwoIntervals) - 1);
return instance().m_roundup_power2_divisions[index];
}
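A worked sketch of the bucket-index math above, assuming gcc/clang for the count-leading-zeros builtin: the 16 intervals cover 1MB (2^20 bytes, index 0) through 64GB (2^36 bytes, index 15), and a size maps to floor(log2(size)) - 20, clamped to that range.
#include <algorithm>
#include <cstddef>

inline size_t roundupBucketIndexSketch(size_t size /* bytes, nonzero */) {
  const int kIntervals = 16;
  const int interval_start = 20; // log2(1MB)
  // __builtin_clzll stands in for llvm::countLeadingZeros here.
  const int log_size = 63 - __builtin_clzll(size); // floor(log2(size))
  const int index =
      std::min(std::max(log_size - interval_start, 0), kIntervals - 1);
  // e.g. 512KB -> 0, 1MB -> 0, 4MB -> 2, 64MB -> 6, >= 64GB -> 15
  return static_cast<size_t>(index);
}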
void CUDAAllocatorConfig::lexArgs(
const std::string& env,
std::vector<std::string>& config) {
std::vector<char> buf;
for (char ch : env) {
if (ch == ',' || ch == ':' || ch == '[' || ch == ']') {
if (!buf.empty()) {
config.emplace_back(buf.begin(), buf.end());
buf.clear();
}
config.emplace_back(1, ch);
} else if (ch != ' ') {
buf.emplace_back(ch);
}
}
if (!buf.empty()) {
config.emplace_back(buf.begin(), buf.end());
}
}
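A standalone sketch (not the class method) applying the same lexing rule to a sample string, to show how ',', ':', '[' and ']' become single-character tokens while spaces are dropped.
#include <iostream>
#include <string>
#include <vector>

int main() {
  const std::string env =
      "max_split_size_mb:100,roundup_power2_divisions:[64:8,>:1]";
  std::vector<std::string> tokens;
  std::string buf;
  for (char ch : env) {
    if (ch == ',' || ch == ':' || ch == '[' || ch == ']') {
      if (!buf.empty()) {
        tokens.push_back(buf);
        buf.clear();
      }
      tokens.emplace_back(1, ch);
    } else if (ch != ' ') {
      buf.push_back(ch);
    }
  }
  if (!buf.empty()) {
    tokens.push_back(buf);
  }
  // Prints: max_split_size_mb : 100 , roundup_power2_divisions : [ 64 : 8 , > : 1 ]
  for (const auto& t : tokens) {
    std::cout << t << ' ';
  }
  std::cout << '\n';
}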
void CUDAAllocatorConfig::consumeToken(
const std::vector<std::string>& config,
size_t i,
const char c) {
TORCH_CHECK(
i < config.size() && config[i] == std::string(1, c),
"Error parsing CachingAllocator settings, expected ",
c,
"");
}
size_t CUDAAllocatorConfig::parseMaxSplitSize(
const std::vector<std::string>& config,
size_t i) {
consumeToken(config, ++i, ':');
constexpr int mb = 1024 * 1024;
if (++i < config.size()) {
size_t val1 = stoi(config[i]);
TORCH_CHECK(
val1 > kLargeBuffer / mb,
"CachingAllocator option max_split_size_mb too small, must be > ",
kLargeBuffer / mb,
"");
val1 = std::max(val1, kLargeBuffer / mb);
val1 = std::min(val1, (std::numeric_limits<size_t>::max() / mb));
m_max_split_size = val1 * 1024 * 1024;
} else {
TORCH_CHECK(false, "Error, expecting max_split_size_mb value", "");
}
return i;
}
size_t CUDAAllocatorConfig::parseMaxNonSplitRoundingSize(
const std::vector<std::string>& config,
size_t i) {
consumeToken(config, ++i, ':');
constexpr int mb = 1024 * 1024;
if (++i < config.size()) {
size_t val1 = stoi(config[i]);
TORCH_CHECK(
val1 > kLargeBuffer / mb,
"CachingAllocator option max_non_split_rounding_mb too small, must be > ",
kLargeBuffer / mb,
"");
val1 = std::max(val1, kLargeBuffer / mb);
val1 = std::min(val1, (std::numeric_limits<size_t>::max() / mb));
m_max_non_split_rounding_size = val1 * 1024 * 1024;
} else {
TORCH_CHECK(false, "Error, expecting max_non_split_rounding_mb value", "");
}
return i;
}
size_t CUDAAllocatorConfig::parseGarbageCollectionThreshold(
const std::vector<std::string>& config,
size_t i) {
consumeToken(config, ++i, ':');
if (++i < config.size()) {
double val1 = stod(config[i]);
TORCH_CHECK(
val1 > 0, "garbage_collect_threshold too small, set it 0.0~1.0", "");
TORCH_CHECK(
val1 < 1.0, "garbage_collect_threshold too big, set it 0.0~1.0", "");
m_garbage_collection_threshold = val1;
} else {
TORCH_CHECK(
false, "Error, expecting garbage_collection_threshold value", "");
}
return i;
}
size_t CUDAAllocatorConfig::parseRoundUpPower2Divisions(
const std::vector<std::string>& config,
size_t i) {
consumeToken(config, ++i, ':');
bool first_value = true;
if (++i < config.size()) {
if (std::string_view(config[i]) == "[") {
size_t last_index = 0;
// NOLINTNEXTLINE(bugprone-inc-dec-in-conditions)
while (++i < config.size() && std::string_view(config[i]) != "]") {
const std::string& val1 = config[i];
size_t val2 = 0;
consumeToken(config, ++i, ':');
if (++i < config.size()) {
val2 = stoi(config[i]);
} else {
TORCH_CHECK(
false, "Error parsing roundup_power2_divisions value", "");
}
TORCH_CHECK(
val2 == 0 || llvm::isPowerOf2_64(val2),
"For roundups, the divisions has to be power of 2 or 0 to disable roundup ",
"");
if (std::string_view(val1) == ">") {
std::fill(
std::next(
m_roundup_power2_divisions.begin(),
static_cast<std::vector<unsigned long>::difference_type>(
last_index)),
m_roundup_power2_divisions.end(),
val2);
} else {
size_t val1_long = stoul(val1);
TORCH_CHECK(
llvm::isPowerOf2_64(val1_long),
"For roundups, the intervals have to be power of 2 ",
"");
size_t index = 63 - llvm::countLeadingZeros(val1_long);
index = std::max((size_t)0, index);
index = std::min(index, m_roundup_power2_divisions.size() - 1);
if (first_value) {
std::fill(
m_roundup_power2_divisions.begin(),
std::next(
m_roundup_power2_divisions.begin(),
static_cast<std::vector<unsigned long>::difference_type>(
index)),
val2);
first_value = false;
}
if (index < m_roundup_power2_divisions.size()) {
m_roundup_power2_divisions[index] = val2;
}
last_index = index;
}
if (std::string_view(config[i + 1]) != "]") {
consumeToken(config, ++i, ',');
}
}
} else { // Keep this for backwards compatibility
size_t val1 = stoi(config[i]);
TORCH_CHECK(
llvm::isPowerOf2_64(val1),
"For roundups, the divisions has to be power of 2 ",
"");
std::fill(
m_roundup_power2_divisions.begin(),
m_roundup_power2_divisions.end(),
val1);
}
} else {
TORCH_CHECK(false, "Error, expecting roundup_power2_divisions value", "");
}
return i;
}
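By my reading of the parser above, a setting like roundup_power2_divisions:[64:8,256:4,>:1] fills the 16 buckets (1MB through 64GB) as sketched below: the first entry back-fills everything below its bucket, skipped buckets keep the default 0, and ">" overwrites every bucket from the last explicit index onward.
#include <array>
#include <cstddef>

// Expected table for "[64:8,256:4,>:1]":
//   "64:8"  -> bucket log2(64) = 6, plus back-fill of buckets 0..5 with 8
//   "256:4" -> bucket 8 set to 4 (bucket 7 keeps its default 0)
//   ">:1"   -> buckets 8..15 set to 1 (note: this also overwrites bucket 8)
constexpr std::array<size_t, 16> kExpectedDivisionsSketch = {
    8, 8, 8, 8, 8, 8, 8, 0, 1, 1, 1, 1, 1, 1, 1, 1};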
size_t CUDAAllocatorConfig::parseAllocatorConfig(
const std::vector<std::string>& config,
size_t i,
bool& used_cudaMallocAsync) {
const c10::CachingAllocator::ConfigTokenizer& tokenizer,
size_t i) {
// For ease of maintenance and understanding, the CUDA and ROCm
// implementations of this function are separated. This avoids having many
// #ifdef's throughout.
#ifdef USE_ROCM
// Ease burden on ROCm users by allowing either cuda or hip tokens.
// cuda token is broken up to prevent hipify matching it.
#define PYTORCH_TOKEN1 \
"cud" \
"aMallocAsync"
#define PYTORCH_TOKEN2 "hipMallocAsync"
consumeToken(config, ++i, ':');
if (++i < config.size()) {
tokenizer.checkToken(++i, ":");
i++; // Move to the value after the colon
TORCH_CHECK_VALUE(
((tokenizer[i] == "native") || (tokenizer[i] == PYTORCH_TOKEN1) ||
(tokenizer[i] == PYTORCH_TOKEN2)),
"Unknown allocator backend, "
"options are native, " PYTORCH_TOKEN1 ", and " PYTORCH_TOKEN2);
if (m_is_allocator_loaded) {
bool aync_allocator_at_runtime = (tokenizer[i] != "native");
TORCH_CHECK(
((config[i] == "native") || (config[i] == PYTORCH_TOKEN1) ||
(config[i] == PYTORCH_TOKEN2)),
"Unknown allocator backend, "
"options are native, " PYTORCH_TOKEN1 ", and " PYTORCH_TOKEN2);
used_cudaMallocAsync =
(config[i] == PYTORCH_TOKEN1 || config[i] == PYTORCH_TOKEN2);
TORCH_INTERNAL_ASSERT(
config[i] == get()->name() ||
(config[i] == PYTORCH_TOKEN1 && get()->name() == PYTORCH_TOKEN2),
"Allocator backend parsed at runtime != "
"allocator backend parsed at load time, ",
config[i],
aync_allocator_at_runtime == m_use_async_allocator,
"Allocator async backend parsed at runtime != allocator async backend parsed at load time, ",
aync_allocator_at_runtime,
" != ",
get()->name());
} else {
TORCH_CHECK(false, "Error parsing backend value", "");
m_use_async_allocator);
}
m_use_async_allocator =
(tokenizer[i] == PYTORCH_TOKEN1 || tokenizer[i] == PYTORCH_TOKEN2);
// CUDA allocator is always loaded at the start of the program
m_is_allocator_loaded = true;
#if defined(CUDA_VERSION)
if (m_use_async_allocator) {
#if CUDA_VERSION >= 11040
int version = 0;
C10_CUDA_CHECK(cudaDriverGetVersion(&version));
TORCH_CHECK(
version >= 11040,
"backend:cudaMallocAsync requires CUDA runtime "
"11.4 or newer, but cudaDriverGetVersion returned ",
version);
#else
TORCH_CHECK(
false,
"backend:cudaMallocAsync requires PyTorch to be built with "
"CUDA 11.4 or newer, but CUDA_VERSION is ",
CUDA_VERSION);
#endif
}
#endif
return i;
#undef PYTORCH_TOKEN1
#undef PYTORCH_TOKEN2
#else // USE_ROCM
consumeToken(config, ++i, ':');
if (++i < config.size()) {
TORCH_CHECK(
((config[i] == "native") || (config[i] == "cudaMallocAsync")),
"Unknown allocator backend, "
"options are native and cudaMallocAsync");
used_cudaMallocAsync = (config[i] == "cudaMallocAsync");
if (used_cudaMallocAsync) {
#if CUDA_VERSION >= 11040
int version = 0;
C10_CUDA_CHECK(cudaDriverGetVersion(&version));
TORCH_CHECK(
version >= 11040,
"backend:cudaMallocAsync requires CUDA runtime "
"11.4 or newer, but cudaDriverGetVersion returned ",
version);
#else
TORCH_CHECK(
false,
"backend:cudaMallocAsync requires PyTorch to be built with "
"CUDA 11.4 or newer, but CUDA_VERSION is ",
CUDA_VERSION);
#endif
}
TORCH_INTERNAL_ASSERT(
config[i] == get()->name(),
"Allocator backend parsed at runtime != "
"allocator backend parsed at load time");
} else {
TORCH_CHECK(false, "Error parsing backend value", "");
}
return i;
#endif // USE_ROCM
}
void CUDAAllocatorConfig::parseArgs(const std::optional<std::string>& env) {
void CUDAAllocatorConfig::parseArgs(const std::string& env) {
// If empty, set the default values
m_max_split_size = std::numeric_limits<size_t>::max();
m_roundup_power2_divisions.assign(kRoundUpPowerOfTwoIntervals, 0);
m_garbage_collection_threshold = 0;
bool used_cudaMallocAsync = false;
bool used_native_specific_option = false;
if (!env.has_value()) {
return;
}
{
std::lock_guard<std::mutex> lock(m_last_allocator_settings_mutex);
m_last_allocator_settings = env.value();
}
std::vector<std::string> config;
lexArgs(env.value(), config);
for (size_t i = 0; i < config.size(); i++) {
std::string_view config_item_view(config[i]);
if (config_item_view == "max_split_size_mb") {
i = parseMaxSplitSize(config, i);
used_native_specific_option = true;
} else if (config_item_view == "max_non_split_rounding_mb") {
i = parseMaxNonSplitRoundingSize(config, i);
used_native_specific_option = true;
} else if (config_item_view == "garbage_collection_threshold") {
i = parseGarbageCollectionThreshold(config, i);
used_native_specific_option = true;
} else if (config_item_view == "roundup_power2_divisions") {
i = parseRoundUpPower2Divisions(config, i);
used_native_specific_option = true;
} else if (config_item_view == "backend") {
i = parseAllocatorConfig(config, i, used_cudaMallocAsync);
} else if (config_item_view == "expandable_segments") {
used_native_specific_option = true;
consumeToken(config, ++i, ':');
++i;
TORCH_CHECK(
i < config.size() &&
(std::string_view(config[i]) == "True" ||
std::string_view(config[i]) == "False"),
"Expected a single True/False argument for expandable_segments");
config_item_view = config[i];
m_expandable_segments = (config_item_view == "True");
c10::CachingAllocator::ConfigTokenizer tokenizer(env);
for (size_t i = 0; i < tokenizer.size(); i++) {
const auto& key = tokenizer[i];
if (key == "backend") {
i = parseAllocatorConfig(tokenizer, i);
} else if (
// ROCm build's hipify step will change "cuda" to "hip", but for ease of
// use, accept both. We must break up the string to prevent hipify here.
config_item_view == "release_lock_on_hipmalloc" ||
config_item_view ==
key == "release_lock_on_hipmalloc" ||
key ==
"release_lock_on_c"
"udamalloc") {
used_native_specific_option = true;
consumeToken(config, ++i, ':');
++i;
TORCH_CHECK(
i < config.size() &&
(std::string_view(config[i]) == "True" ||
std::string_view(config[i]) == "False"),
"Expected a single True/False argument for release_lock_on_cudamalloc");
config_item_view = config[i];
m_release_lock_on_cudamalloc = (config_item_view == "True");
tokenizer.checkToken(++i, ":");
m_release_lock_on_cudamalloc = tokenizer.toBool(++i);
} else if (
// ROCm build's hipify step will change "cuda" to "hip", but for ease of
// use, accept both. We must break up the string to prevent hipify here.
config_item_view == "pinned_use_hip_host_register" ||
config_item_view ==
key == "pinned_use_hip_host_register" ||
key ==
"pinned_use_c"
"uda_host_register") {
i = parsePinnedUseCudaHostRegister(config, i);
i = parsePinnedUseCudaHostRegister(tokenizer, i);
used_native_specific_option = true;
} else if (config_item_view == "pinned_num_register_threads") {
i = parsePinnedNumRegisterThreads(config, i);
used_native_specific_option = true;
} else if (config_item_view == "pinned_use_background_threads") {
i = parsePinnedUseBackgroundThreads(config, i);
} else if (key == "pinned_num_register_threads") {
i = parsePinnedNumRegisterThreads(tokenizer, i);
used_native_specific_option = true;
} else {
const auto& keys =
c10::CachingAllocator::AcceleratorAllocatorConfig::getKeys();
TORCH_CHECK(
false, "Unrecognized CachingAllocator option: ", config_item_view);
keys.find(key) != keys.end(),
"Unrecognized key '",
key,
"' in Accelerator allocator config.");
i = tokenizer.skipKey(i);
}
if (i + 1 < config.size()) {
consumeToken(config, ++i, ',');
if (i + 1 < tokenizer.size()) {
tokenizer.checkToken(++i, ",");
}
}
if (used_cudaMallocAsync && used_native_specific_option) {
if (m_use_async_allocator && used_native_specific_option) {
TORCH_WARN(
"backend:cudaMallocAsync ignores max_split_size_mb,"
"roundup_power2_divisions, and garbage_collect_threshold.");
@@ -391,64 +121,33 @@ void CUDAAllocatorConfig::parseArgs(const std::optional<std::string>& env) {
}
size_t CUDAAllocatorConfig::parsePinnedUseCudaHostRegister(
const std::vector<std::string>& config,
const c10::CachingAllocator::ConfigTokenizer& tokenizer,
size_t i) {
consumeToken(config, ++i, ':');
if (++i < config.size()) {
TORCH_CHECK(
(config[i] == "True" || config[i] == "False"),
"Expected a single True/False argument for pinned_use_cuda_host_register");
m_pinned_use_cuda_host_register = (config[i] == "True");
} else {
TORCH_CHECK(
false, "Error, expecting pinned_use_cuda_host_register value", "");
}
tokenizer.checkToken(++i, ":");
m_pinned_use_cuda_host_register = tokenizer.toBool(++i);
return i;
}
size_t CUDAAllocatorConfig::parsePinnedNumRegisterThreads(
const std::vector<std::string>& config,
const c10::CachingAllocator::ConfigTokenizer& tokenizer,
size_t i) {
consumeToken(config, ++i, ':');
if (++i < config.size()) {
size_t val2 = stoi(config[i]);
TORCH_CHECK(
llvm::isPowerOf2_64(val2),
"Number of register threads has to be power of 2 ",
"");
auto maxThreads = CUDAAllocatorConfig::pinned_max_register_threads();
TORCH_CHECK(
val2 <= maxThreads,
"Number of register threads should be less than or equal to " +
std::to_string(maxThreads),
"");
m_pinned_num_register_threads = val2;
} else {
TORCH_CHECK(
false, "Error, expecting pinned_num_register_threads value", "");
}
tokenizer.checkToken(++i, ":");
size_t val2 = tokenizer.toSizeT(++i);
TORCH_CHECK_VALUE(
llvm::isPowerOf2_64(val2),
"Number of register threads has to be power of 2 ",
"");
auto maxThreads = CUDAAllocatorConfig::pinned_max_register_threads();
TORCH_CHECK_VALUE(
val2 <= maxThreads,
"Number of register threads should be less than or equal to " +
std::to_string(maxThreads),
"");
m_pinned_num_register_threads = val2;
return i;
}
size_t CUDAAllocatorConfig::parsePinnedUseBackgroundThreads(
const std::vector<std::string>& config,
size_t i) {
consumeToken(config, ++i, ':');
if (++i < config.size()) {
TORCH_CHECK(
(config[i] == "True" || config[i] == "False"),
"Expected a single True/False argument for pinned_use_background_threads");
m_pinned_use_background_threads = (config[i] == "True");
} else {
TORCH_CHECK(
false, "Error, expecting pinned_use_background_threads value", "");
}
return i;
}
// General caching allocator utilities
void setAllocatorSettings(const std::string& env) {
CUDACachingAllocator::CUDAAllocatorConfig::instance().parseArgs(env.c_str());
}
REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK(CUDAAllocatorConfig)
} // namespace c10::cuda::CUDACachingAllocator
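Illustrative usage (not part of this diff): the same key:value string normally supplied through PYTORCH_CUDA_ALLOC_CONF can be applied at runtime through the setAllocatorSettings helper above; the example assumes the async backend was not selected at startup, since the backend key is checked for consistency against the loaded allocator.
#include <c10/cuda/CUDAAllocatorConfig.h>

void exampleConfigureCudaAllocator() {
  // Keys handled by CUDAAllocatorConfig itself; keys owned by the shared
  // AcceleratorAllocatorConfig would be validated and skipped here.
  c10::cuda::CUDACachingAllocator::setAllocatorSettings(
      "backend:native,pinned_num_register_threads:4");
}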
