Compare commits

...

69 Commits

Author SHA1 Message Date
39d1d0d9ce remove expected failure 2025-11-16 16:34:00 +04:00
2245d7d3b9 Improve char printing (#167899)
This PR outputs chars to stream without building temporary strings.
They were modified by (on fish)
```
sed  -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and revert some invalid changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167899
Approved by: https://github.com/Skylion007
2025-11-16 07:19:16 +00:00
98b94b90dd [pallas backend] implement gpu tiles/mask for power of 2 (#167584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167584
Approved by: https://github.com/jansel
2025-11-16 07:01:51 +00:00
5cdbda140c [vision hash update] update the pinned vision hash (#167890)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167890
Approved by: https://github.com/pytorchbot
2025-11-16 04:58:47 +00:00
0ec53beaeb Refactor TensorAccessor for headeronly. (#166855)
This PR moves the implementations of Tensor accessor classes to headeronly with the following modifications:
- Add ArrayRef and IndexBoundsCheck template parameters to refactor out the usages of `IntArrayRef` and `TORCH_CHECK_INDEX` from Tensor accessor implementations.
- Eliminate usage of `c10::irange` as it is not headeronly-compatible.
- Introduce `torch::headeronly::{TensorAccessorBase,TensorAccessor, GenericPackedTensorAccessorBase, GenericPackedTensorAccessor}` that are headeronly-equivalent to `at::{TensorAccessorBase,TensorAccessor, GenericPackedTensorAccessorBase, GenericPackedTensorAccessor}`. Both these sets of template classes use original implementations from `torch::headeronly::detail` that have new template parameters `ArrayRefCls` and `IndexBoundsCheck` to facilitate `at` and `torch::headeronly` implementations of ArrayRef and checking indices.

TODO:
- ~when https://github.com/pytorch/pytorch/pull/164991 lands, eliminate the placeholder class HeaderOnlyArrayRef~ UPDATE: done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166855
Approved by: https://github.com/janeyx99
2025-11-15 22:37:24 +00:00
79fc0a9141 [xpu][fix]Fall back deterministic index_copy to index_put on XPU (#167830)
A minor update has been made to the deterministic behavior checks in the `index_copy_out` implementation. This change ensures that deterministic  `index_copy` is dispatched to `index_put` not only for CUDA tensors but also for XPU tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167830
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-11-15 18:09:25 +00:00
d01a7b0241 Back out "MatMal - fix folding logic" (#167884)
Summary:
For sepcific hardware (A100), Autocast will generate a relatively large error on Transformer (torch.nn.TransformerEncoder) when using no_grad decorator on dim=256 (and larger presuably).

H100 seems fine, as does A100 with mig (so less than full SMs).

For now backing out, and revisting next week.

Test Plan:
failed jobs:
https://fburl.com/scuba/remote_execution_action/jzcmujgk

 {F1983543613}

Reviewed By: t-ivan-gr

Differential Revision: D87111518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167884
Approved by: https://github.com/malfet
2025-11-15 08:29:08 +00:00
deabb3e36d Use c10::filesystem (#167821)
This PR fixes code to use c10::filesystem functionality instead of manually implemented functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167821
Approved by: https://github.com/Skylion007
2025-11-15 06:01:01 +00:00
79d2397b6b Fix grammar issues in C++ frontend documentation (#167702)
Corrected minor grammatical errors in the documentation.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167702
Approved by: https://github.com/jerryzh168
2025-11-15 05:55:08 +00:00
6ef3a62c36 Fix typo in FP16 accumulation section (#167703)
Fix typo error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167703
Approved by: https://github.com/jerryzh168
2025-11-15 05:34:07 +00:00
530e782239 [codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/backends/coreml/objc/PTMCoreMLBackend.mm (#167604)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Differential Revision: D85813836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167604
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-11-15 05:17:48 +00:00
c66a6c432e [HOP][print] Add functionalization (make sure ordering) for print (#167016)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167016
Approved by: https://github.com/angelayi
2025-11-15 05:06:05 +00:00
3d7a8b7e61 MPS: Fix clamp scalar cache key to store floats in hex representation (#167777)
Fixes #167767.

Original issue was that using std::to_string(value) does not work intended here if the value is smaller than 1e-6. The caching keys ended up as `clamp_out_mps_min:0.000000_scalar::f32[1]` instead of `clamp_out_mps_min:0.0000001_scalar::f32[1]`. After the change the values are stored as the hex representation for the floating point number. So for min_value 1e-7 the key will be `impl_min:0x1.ad7f2ap-24_scalar::f32[1]` and for min_value 0.0 `clamp_out_mps_min:0x0p+0_scalar::f32[1]`

Output of the repro code before the change:

```
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
```

Output for the repro code after the change:

```
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
```
which matches the expected CPU reference.

Snippet to test with:
```
import torch

device='mps'
dtype=torch.float32
a = torch.zeros(1, device=device, dtype=dtype)

# the following line triggers the incorrect behavior, when commented, the remainder of the script appears to work as expected
a_clamped = a.clamp(min=0.0)

b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7)
print(c)

b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7, max=None)
print(c)

b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7, max=torch.inf)
print(c)

b = torch.zeros(1, device=device)
print(b)
c = b.clamp_min(1e-7)
print(c)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167777
Approved by: https://github.com/malfet
2025-11-15 03:26:38 +00:00
de0d69b2c4 Remove useless super() delegation (#167791)
This PR removes useless super() delegations detected by pylint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167791
Approved by: https://github.com/albanD
2025-11-15 02:50:51 +00:00
bc60b86066 Skip stable diffusion models in torchbench, get tests and benchmarks green (#167896)
Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167896
Approved by: https://github.com/aorenste, https://github.com/shunting314
ghstack dependencies: #167609
2025-11-15 02:44:36 +00:00
d7782ddde7 [ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance (#167834)
# Summary

This PR optimizes the CUDA kernels for `torch.nn.EmbeddingBag` by reducing GPU register pressure introduced by `CUDA_KERNEL_ASSERT`, which improves kernel occupancy and overall performance. The optimization separates input validation into a dedicated loop before the main processing loop, allowing the compiler to better optimize register allocation. By extensively testing on various GPUs and CUDA versions, `torch.nn.EmbeddingBag` performance improves by 29% to 111% with this PR.

# Performance Results

The following table shows the performance improvements on various input distributions and GPUs. All benchmarks use PyTorch 2.9.0 compiled with CUDA 12.8.

**Input Distribution Types (simulating recommendation system ID patterns):**
- **random id**: Randomly sampled embedding indices from the full vocabulary (uniform distribution)
- **one-hot**: One ID appears with very high frequency across all bags, simulating a popular item in recommendation systems
- **multi-hot**: Multiple IDs appear with high frequency across all bags, simulating multiple popular items in recommendation systems

**Test Configuration:**
- Embedding shape: `(5000000, 128)` (5M vocabulary size, 128-dimensional embeddings)
- Batch size: 2048 bags
- Average bag size: 150 indices per bag

| GPU  | Input Distribution | Before (µs) | After (µs) | Speedup |
| ---- | ------------------ | ----------- | ---------- | ------- |
| H100 | random id          | 162.4       | 105.9      | 1.53×   |
| H100 | one-hot            | 120.4       | 88.6       | 1.36×   |
| H100 | multi-hot          | 113.1       | 87.8       | 1.29×   |
| H20  | random id          | 278.6       | 132.2      | 2.11×   |
| H20  | one-hot            | 189.7       | 110.3      | 1.72×   |
| H20  | multi-hot          | 172.4       | 107.4      | 1.61×   |

# Motivation

The original implementation performed bounds checking using `CUDA_KERNEL_ASSERT` inline within the main processing loop, which increased register pressure and limited GPU occupancy. From NSight Compute analysis on H20, using PyTorch 2.9 compiled with CUDA 12.8, removing the `CUDA_KERNEL_ASSERT` from the main loop with this PR increases the overall occupancy from 50% to 75%(registers per thread 52->40).

By separating validation into a dedicated loop, we:

1. **Reduce register pressure in the main loop**: The validation loop uses minimal registers, allowing the compiler to optimize the main processing loop independently with better register allocation.
2. **Maintain correctness**: All input validation is still performed, but in a more register-efficient manner.

# Changes

## Modified Kernels

1. **`EmbeddingBag_updateOutputKernel_max`**: Added separate validation loop before main processing
2. **`EmbeddingBag_updateOutputKernel_sum_mean`**: Added separate validation loop before main processing

## Key Implementation Details

- **Separate validation loop**: Input indices are validated in a dedicated loop that checks all indices before processing begins
- **No early exit**: The validation loop intentionally avoids using `break` for early exit, as benchmarking showed that early exit degrades performance, possibly due to increased branch divergence and reduced instruction-level parallelism
- **Consistent error messages**: Improved error message clarity for invalid input indices
- **Design choice: validation loop vs. separate kernel**: We considered removing `CUDA_KERNEL_ASSERT` entirely and performing bounds checking in a separate GPU kernel, which would achieve even better performance (e.g., on H20 with random id distribution: 132.2 µs → 124.6 µs). However, this approach is harder to maintain as it requires coordinating two separate kernel launches and managing additional kernel launch overhead. Instead, we chose the current approach of using a separate validation loop within the same kernel, which provides a good balance between performance improvement and code maintainability.

## Code Changes

```cpp
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");

// Main processing loop (now with reduced register pressure)
for (int64_t emb = begin; emb < end; emb++) {
  // ... processing logic ...
}
```

# Testing & Compatibility

## Performance Testing

I conducted extensive performance testing across multiple configurations. All tests show significant performance improvements:

**Tested CUDA Versions:**
- CUDA 12.6, 12.8, 13.0

**Tested GPU Architectures:**
- A100, H20, H100

**Tested Input Configurations:**
- **Embedding shapes**: Various sizes including `[5000000, 128]` and `[128000, 4096]`
- **Embedding dtypes**: `torch.float32`, `torch.float16`
- **Input distributions**: Random indices, one-hot (high-frequency single ID), and multi-hot (high-frequency multiple IDs) patterns, simulating recommendation system workloads
- **Input sizes**: Average bag sizes of 150, 20, and 10 indices per bag

## Correctness Testing

-  Correctness tests pass for various embedding types (bfloat16, float32), shapes, and input distributions
-  Register usage reduction verified with NSight Compute
-  Linter passes

## Compatibility

-  No API/ABI changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167834
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-11-15 02:03:38 +00:00
eqy
fb04e9ad03 [CUDA][CUDA Graphs] Respect node-priority in cudaGraphInstantiate (#167346)
Needed for e.g., stream priority-based implementations of comm-compute overlap

CC @galv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167346
Approved by: https://github.com/ngimel
2025-11-15 01:59:36 +00:00
cfe799b4aa Revert "Ops convolution_backward optional flag bug (#165008)"
This reverts commit c429b1fc5c60a6819b041f1a881ab09735689fbe.

Reverted https://github.com/pytorch/pytorch/pull/165008 on behalf of https://github.com/clee2000 due to I think this broke some tests in the slow workflow? test/test_ops.py::TestCommonCUDA::test_compare_cpu_convolution_backward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/19375318020/job/55443680773) [HUD commit link](c429b1fc5c) ([comment](https://github.com/pytorch/pytorch/pull/165008#issuecomment-3535354672))
2025-11-15 01:50:09 +00:00
b7f52773e6 Add meta registration for scaled_mm_v2 and test (#167653)
Summary:

`torch._scaled_mm_v2` didn't have a valid meta registration, or
`FakeTensor` tests, so anything expecting inductor to work (like
torch.ao tests) would fail horribly.

Test Plan:

```
pytest -sv -k "scaled_mm_v2" test/test_ops.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167653
Approved by: https://github.com/drisspg
2025-11-15 01:21:04 +00:00
f6b54d8899 flight_recorder: move to torch.distributed (#167782)
Summary: This moves torchfrtrace to be under `torch.distributed.flight_recorder` instead of `tools.flight_recorder` as the `tools` package is not included in the torch wheels. This makes it so you can use fr trace analyze without using it from a source checkout

Test Plan:
```
buck run //caffe2/fb/flight_recorder:fr_trace
```

CI

Differential Revision: D87022129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167782
Approved by: https://github.com/fduwjj
2025-11-15 01:16:59 +00:00
da91bf5262 Fix incorrect attention example in ONNX exporter docstring (#167646)
Fixes #167627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167646
Approved by: https://github.com/malfet, https://github.com/titaiwangms

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-15 00:31:48 +00:00
1c1638297e Revert "distributed/debug: add an HTTP server for debugging running jobs (#167395)"
This reverts commit 4ed26f7382bc3e5217121f5085af070e57f2ef40.

Reverted https://github.com/pytorch/pytorch/pull/167395 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167395#issuecomment-3535150292))
2025-11-15 00:25:51 +00:00
ee0b5b4b1c Add new CI jobs to run dynamo tests on all python versions supported (#166978)
This PR adds 2 new CI jobs to run dynamo core (`test/dynamo/*`) and
`dynamo_wrapped` tests on Python 3.11/3.12.

**Selected Machine**
Tests are executed on `linux.c7i.2xlarge` without GPU. Which means all
cuda tests (if any) are skipped.

**Runtime**
- The core tests takes 30 minutes to run
- The `dynamo_wrapped` test is divided into three shards and each one
  takes around 1.5 hours to execute

**Schedule**
Tests are executed every day at 1:29 PDT or in the presence of
`ciflow/dynamo` label

Co-authored-by: Rob Timpe <rtimpe@openteams.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166978
Approved by: https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #167092
2025-11-15 00:15:36 +00:00
fcfb213c5a [inductor] layout constraint for weight-norm-bwd (#167667)
fix https://github.com/pytorch/pytorch/issues/165749

The weight_norm backward kernel requires its inputs to be contiguous. Add those constraints to the lowering/fallback rule.

A better fix is maybe add decomposition rule for the op. But since we already fallback, this fix does no harm and can fix the attached issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167667
Approved by: https://github.com/eellison
2025-11-14 23:59:24 +00:00
08042bbb9c [6/N] Use Python 3.10 typing (#167649)
This PR applies new Union typing syntax to some python files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167649
Approved by: https://github.com/albanD
2025-11-14 23:55:08 +00:00
e20ca3bc2e Remove python workaround for ContextDecorator (#167049)
This PR removes the import workaround for ContextDecorator because the import always succeeds in Py 3.10+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167049
Approved by: https://github.com/Skylion007
2025-11-14 23:54:52 +00:00
4ed26f7382 distributed/debug: add an HTTP server for debugging running jobs (#167395)
This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate flask process on rank 0 to have users connect to for debugging.

This can easily be improved to trigger profilers as well as visualize the data much better.

Initial handlers:
* pytorch profiler
* FlightRecorder data
* Python stacks

```
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server

enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

<img width="1499" height="1629" alt="20251107_17h10m47s_grim" src="https://github.com/user-attachments/assets/a8b9a0cb-3bbf-4558-be12-5253e418214e" />
<img width="1192" height="1337" alt="20251107_17h10m39s_grim" src="https://github.com/user-attachments/assets/ac5d7011-4acb-4401-bf2c-f9b22c1466bd" />

<img width="984" height="851" alt="20251107_18h35m38s_grim" src="https://github.com/user-attachments/assets/98b3eb31-ed01-4345-90dd-c79345cf82ce" />
<img width="2880" height="777" alt="20251107_18h35m31s_grim" src="https://github.com/user-attachments/assets/8de84b8b-9d06-4bc8-a1bf-280a2958315b" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167395
Approved by: https://github.com/fduwjj
2025-11-14 23:14:38 +00:00
4c79305b87 [targets2buck] Clean up get_pt_ops_deps (#167690)
Summary: I didn't understand what this macro was doing so I created a bit of a mess, mess be gone!

Test Plan: `buck2 ctargets fbcode//caffe2/... fbsource//xplat/caffe2/...`

Reviewed By: mzlee

Differential Revision: D86460608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167690
Approved by: https://github.com/seemethere
2025-11-14 23:05:24 +00:00
f4b8c4f907 backed size oblivious checks for expand() (#167689)
Summary:
Support semantics when using backed_size_oblivious, similar to https://github.com/pytorch/pytorch/pull/167232

We see errors in a model exported with dynamic shapes, like
```
RuntimeError: non-broadcasting semantics require s67 == 41

While executing %expand : [num_users=1] = call_method[target=expand](args = (%reshape_5, -1, -1, %getitem_9), kwargs = {})
```

Test Plan:
test_dynamic_shapes:
```
test_backed_size_oblivious_expand (test_dynamic_shapes.TestUbackedOps) ... I1112 14:07:54.724596 1386932 Logger.cpp:995] Dropping logs in unit tests.
ok
```

Differential Revision: D86902546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167689
Approved by: https://github.com/laithsakka
2025-11-14 22:31:28 +00:00
d629b7a459 Move CppTypeToScalarType to torch/headeronly (#167610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167610
Approved by: https://github.com/pearu, https://github.com/janeyx99
2025-11-14 22:21:45 +00:00
0922ba5f42 [BE] No need to pass const enum values by reference (#167868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167868
Approved by: https://github.com/slayton58
2025-11-14 21:56:19 +00:00
c87295c044 [precompile] Support captured global tensors. (#167846)
Summary:
In vllm we saw cases where user intialize a tensor in the global scope and reference it in the forward body. This should be supported by pruning the used globals in the scope and serialize them along the artifacts similar to how we handle closure.

Use case example: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gemma3n.py#L65

Test Plan:
test_aot_compile.py

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167846
Approved by: https://github.com/jamesjwu
2025-11-14 21:40:07 +00:00
7aa210d215 Revert "[CodeClean] Remove the Unused MACRO for AOT Inductor Runtime (#165139)"
This reverts commit fcd5f8c352b5b75bd32e57fa044ec5df095032da.

Reverted https://github.com/pytorch/pytorch/pull/165139 on behalf of https://github.com/jeanschmidt due to trying to hevert in the hopes it fixes internal errors, will land it back ([comment](https://github.com/pytorch/pytorch/pull/165139#issuecomment-3534662138))
2025-11-14 21:35:37 +00:00
5a368b8010 Revert "[CodeClean] Replace std::runtime_error with TORCH_CHECK (#165119)"
This reverts commit 398775a43e9808205f75c81d36f5087117d3f3f4.

Reverted https://github.com/pytorch/pytorch/pull/165119 on behalf of https://github.com/jeanschmidt due to trying to hevert in the hopes it fixes internal errors, will land it back ([comment](https://github.com/pytorch/pytorch/pull/165139#issuecomment-3534662138))
2025-11-14 21:35:37 +00:00
602102be50 Revert "Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined (#167496)"
This reverts commit bc09a84150eaadaadab8a8ecd76cd9afc60d8a19.

Reverted https://github.com/pytorch/pytorch/pull/167496 on behalf of https://github.com/jeanschmidt due to trying to revert 165139, my intention is to land it again, so, will land this once both are reverted ([comment](https://github.com/pytorch/pytorch/pull/167496#issuecomment-3534641209))
2025-11-14 21:33:02 +00:00
200156e385 DTensor: avoid unnecessary DTensorSpec creation in _ToTorchTensor.backward (#167588)
Looks like the check here is cheap and has a potentially large payoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167588
Approved by: https://github.com/ezyang
2025-11-14 21:08:12 +00:00
a2daf3fc86 [Inductor] Add support bound methods in pattern matcher (#167795)
Fixes: #167776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167795
Approved by: https://github.com/mlazos
2025-11-14 20:55:51 +00:00
52b45c16de Add reshape, view, flatten to torch/csrc/stable (#167600)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167600
Approved by: https://github.com/janeyx99
ghstack dependencies: #167592
2025-11-14 20:35:53 +00:00
2ef85bed5a Add empty to stable ops (#167592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167592
Approved by: https://github.com/janeyx99
2025-11-14 20:35:53 +00:00
d99c6bcf69 [export] Disable side effects on dynamo_graph_capture_for_export and warn user. (#167763)
Summary:
as title.

Test Plan:
test_dynamo_graph_capture_side_effects

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167763
Approved by: https://github.com/tugsbayasgalan
2025-11-14 20:35:22 +00:00
8378abda84 [torch.export] Fix for flaky test_annotate_on_assert (#167805)
Summary: test_annotate_on_assert become flaky with PR 166341 (Details in https://github.com/pytorch/pytorch/issues/167432). Torchdynamo related metadata can vary depending on the caller. Removing the those metadata before comparison.

Test Plan:
```
buck test mode/opt caffe2/test:test_export -- 'test_annotate_on_assert'
```
https://www.internalfb.com/intern/testinfra/testrun/7036874728749661

Differential Revision: D87036890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167805
Approved by: https://github.com/yushangdi
2025-11-14 19:56:51 +00:00
5b42a5d9a6 [doc] Add example for torch.is_storage (#161898)
Fixes #161858

### Summary:
Added comprehensive documentation examples for `torch.is_storage()` to help users understand how to check if an object is a PyTorch storage object.

### Impact:

- Enhances API Documentation
- Helps users distinguish between PyTorch storage objects and other types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161898
Approved by: https://github.com/isuruf, https://github.com/malfet
2025-11-14 19:45:54 +00:00
caca3f2eec Revert "Re-land "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace" (#167722)"
This reverts commit 40e6f090d91026947fbec92a42564ad492f37eae.

Reverted https://github.com/pytorch/pytorch/pull/167722 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167722#issuecomment-3534282212))
2025-11-14 19:38:22 +00:00
9e2bf129e1 [MPS] addmm complex fix (#167826)
Fixes #167727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167826
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-14 19:29:09 +00:00
c429b1fc5c Ops convolution_backward optional flag bug (#165008)
Fixes #89629

When using torch.ops.aten.convolution_backward, the optional argument bias_sizes was being used in the python function registration without checking whether it was defined.

## For the fix
there are two modes to consider with different results.

First @dynamo.optimize("inductor") is the most demanding.
We cannot be wrong about the size passed into the function. But we should not ignore what the user wants/thinks they are doing. For this case, we want to throw an error when the user is wrong. If the user passes in None, we calculate the expected size directly.

Second @dynamo.optimize("eager") is very lenient.
We really can provide any value we want here. If the user is wrong about bias shape in eager mode, the op will just reshape the bias to the proper size so no error is thrown here.

## For testing
An OpInfo was added for torch.ops.aten.convolution_backward.default.
For the CUDA test_noncontiguous_samples test, a slightly updated error tolerance was necessary for the compounded add multiply (for 2x2 kernel).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165008
Approved by: https://github.com/bdhirsh
2025-11-14 19:24:45 +00:00
1176b2b0b7 [BE]: Update NVTX submodule to 3.3.0 (#167751)
Update NVTX to 3.3.0. Mostly fixes some errors in the bindings, improve C++20 support, and improve C++ bindings to NVTX. Header only library upgrade so should be mostly safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167751
Approved by: https://github.com/albanD, https://github.com/eqy
2025-11-14 19:24:37 +00:00
dd37a1a434 Fix NaN gradients in atan2_backward when both inputs are zero (#166787)
Fixes #165427

## Description of Bug 🐛

As reported in #165427, When both the input of  `atan2` function is zero the gradient becomes `NaN`. During the forward pass, `atan2` successfully avoids division-by-zero issue, but during backpropagation gradients become `NaN`.

This is because the backward pass calculates `(self * self + other * other).reciprocal()`, which becomes `inf` at `(0, 0)`. The subsequent multiplication by zero `(0 * inf)` results in `NaN`.

## Changes
- Added an `at::where` condition to handle zero denominators in `atan2_backward`.
- If denom is zero return 0 for the reciprocal; otherwise, use the original value.

## Testing
- Added` test_atan2_zero_gradient` in `test/test_autograd.py` to verify `atan2` returns `0.0` gradients for `(0,0)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166787
Approved by: https://github.com/soulitzer
2025-11-14 19:23:33 +00:00
a74adcf80e [codemod][lowrisk] Remove unused exception parameter from caffe2/caffe2/serialize/inline_container.cc (#167612)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Differential Revision: D85813824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167612
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-11-14 19:11:01 +00:00
5eac46a011 add assume_32bit_indexing inductor config (#167784)
when we know all tensor and intermediate tensors fit in 32 bit but use unbacked DS
we want a way to assume that we can use 32 bit indexing(we will runtime assert on it).

It is not practical to torch check every possible intermediate tensor size ahead of time.

This is needed to enhance vLLM perf with unbacked,  since in vLLM all tensors and
intermediates assumed to fit in 32 bits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167784
Approved by: https://github.com/jansel
2025-11-14 19:04:22 +00:00
e0fff31ae3 [dynamo] Make global state guards and torch function stack guards droppable. (#167674)
Summary:
Prior to this PR we will always build global and torch funciton guards in all cases.

In this PR we did 2 changes to dynamo guards:
1. Created a new guard called "GLOBAL_STATE" which corresponds to the global state guard and can be filtered out using guard_filter_fn
2. Repurpose the existing "TORCH_FUNCTION_STATE" guard for checking torch function mode stack.

Also added a new helper `torch.compiler.skip_all_guards_unsafe` which can be useful for use cases like vllm

Test Plan:
CI

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167674
Approved by: https://github.com/anijain2305
2025-11-14 18:11:44 +00:00
7ede33b8e3 Tiling bug fix (#167771)
Fix for https://github.com/pytorch/pytorch/issues/166653.

Two fixes:
- We were inducing a split for broadcasted loads. e.g. (x // 16). While a split of 16 here will make the load coalesced in one of the tile vars, since the load is already in cache it's not worth splitting. And it would make the other tile var load from memory that isnt in cache.
- Add a slight term for uncoalesced memory. This prevents doing tiling for loads which are a small % of the overall kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167771
Approved by: https://github.com/v0i0
2025-11-14 17:32:42 +00:00
065176cd97 [export] Add pytree input check for dynamo_graph_capture_for_export (#167731)
Summary:
as title.

Test Plan:
pytest test/export/test_export.py -k test_invalid_pytree_dynamo_graph_capture

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167731
Approved by: https://github.com/tugsbayasgalan
2025-11-14 17:29:55 +00:00
eqy
02ee7dd7d3 [CUDA][Test] Add serialTest() to some largeTensorTest tests (#167471)
Try to prevent two big tests from overlapping in their memory usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167471
Approved by: https://github.com/soulitzer
2025-11-14 17:13:14 +00:00
99fdca8f4d [ROCm] Enable StaticCudaLauncher for ROCm (#166492)
This PR enables ROCm/HIP support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. The implementation has been tested on AMD MI300 and MI200 hardware.

**Changes**

**Python (torch/_inductor/runtime/)**
- static_cuda_launcher.py: Added ROCm detection, .hsaco binary support, and ROCm-specific scratch parameter handling
- triton_heuristics.py: Updated device type checks to support both cuda and hip

**C++ (torch/csrc/)**
- Module.cpp: Enabled StaticCudaLauncher for ROCm builds
- inductor/static_cuda_launcher.cpp: Added HIP API equivalents for all CUDA driver calls
- inductor/static_cuda_launcher.h: Updated header guard

**Tests (test/inductor/)**
- test_static_cuda_launcher.py: Removed @skipIfRocm decorators and updated binary file handling

**Enabled Unit Tests**
All tests in test/inductor/test_static_cuda_launcher.py now pass on ROCm:
1. test_basic
2. test_unsigned_integers
3. test_signed_integers
4. test_basic_1arg
5. test_constexpr
6. test_implied_constant
7. test_kernel_no_args
8. test_high_shared_mem
9. test_too_high_shared_mem
10. test_kernel_empty_tensor
11. test_kernel_many_args
12. test_basic_compile
13. test_incompatible_code
14. test_static_launch_user_defined_triton_kernels
15. test_empty_tensor
16. test_any
17. test_disable_static_cuda_launcher

In addition to this, the following tests from test/inductor/test_codecache.py also pass:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
2. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
3. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
4. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
5. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
6. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True

The following tests are skipped since triton bundling is necessary for StaticCudaLauncher:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
2. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166492
Approved by: https://github.com/jeffdaily
2025-11-14 17:11:45 +00:00
9d1a74cb0c Fix mvlgamma_ FPE crash on x86 with integer input (#164230)
Fixes #161871.

Behaviour on arm:

```
PyTorch version: 2.10.0a0+gitdef3b05
Architecture: arm64
Platform: Darwin
Processor: arm

Testing mvlgamma_ with integer tensor on arm64...
 Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```

and on x86:

```
PyTorch version: 2.10.0a0+git1310d6a
Architecture: x86_64
Platform: Linux
Processor: x86_64

Testing mvlgamma_ with integer tensor on x86_64...
 Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164230
Approved by: https://github.com/albanD
2025-11-14 17:09:10 +00:00
40e6f090d9 Re-land "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace" (#167722)
Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.
This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / https://github.com/pytorch/pytorch/pull/167248 (see comments for details)

Test Plan:
Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167722
Approved by: https://github.com/malfet
2025-11-14 16:16:35 +00:00
bfddfde50c Add basic spin config and linting commands (#167226)
This PR adds a basic spin configuration to allow for linting. It is designed as a drop-in replacement for the current Makefile based solution, i.e. it sets up and updates lintrunner based on the hashes of certain configuration files.

Lintrunner is called via Uv's `uvx` command, separating its environment from the general development environment in an effort to reduce instances of competing requirements breaking environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167226
Approved by: https://github.com/atalman, https://github.com/albanD
2025-11-14 15:35:42 +00:00
b6570615f8 [precompile] Integrate AOTI as a backend. (#167338)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167338
Approved by: https://github.com/jamesjwu
2025-11-14 15:33:11 +00:00
226850cc66 [ATen][CUDA] Add sm_121a flag for RowwiseScaledMM (#167734)
This PR add a sm_121a flag for row-wise scaled matmuls on DGX Spark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167734
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-11-14 08:44:04 +00:00
f8a2ce3b9a Fix inplace ops on Partial DTensors to preserve aliasing semantics (#164729)
Fixes #163374.

Here is the output from reproducible code:

```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
  aten::clamp_(dt: f32[][R], None, 2)
    redistribute_input(0, [P] -> [R])
      redistribute_input(t: f32[], [P] -> [R])
        _c10d_functional::all_reduce(t: f32[], sum, 0)
        _c10d_functional::wait_tensor(t: f32[])
    aten::clamp_(t: f32[], None, 2)
    aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```

The behavior is now matching what you were expecting in issue #163374:

Expected behavior (from the issue):
  1. Placement should change from Partial(sum) to Replicate()
  2. Value should be tensor(2.) instead of tensor(144.)

  Actual output from this build:
  1. (Replicate(),) - placement is correct
  2. tensor(2., device='cuda:0') - value is correct

so the inplace operation now properly redistributes the partial DTensor to replicate before performing the clamp snd maintains the correct aliasing semantics. It also produces the expected clamped value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164729
Approved by: https://github.com/ezyang
2025-11-14 07:46:35 +00:00
e2c6834584 Revert "deprecate check_is_size and guard_size_oblivious (#167198)"
This reverts commit 50bf1f0b819f0b1cc9acbb0646ac9555bb9d44b9.

Reverted https://github.com/pytorch/pytorch/pull/167198 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167198#issuecomment-3531149912))
2025-11-14 06:46:15 +00:00
0e7235ed73 [xpu][feature] [1/3] add fp8 scaled_mm implementation for XPU (#165978)
This PR implements `scaled_mm` for XPU. It enables the following data types:
1. TensorWise Scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise Scaling:  `fp8_e4m3` and `fp8_e5m2`

It leaves the BlockWise Scaling to next PR, so that it will have less reviewing efforts.

This is the first PR that only adds `scaled_mm_xpu` but does not registered. We separate this out for less reviewing efforts.

Secondly, there is a `scaled_mm_v2` API in #164141 . We will align with it once the v1 is cleaned up.

**Co-author:** @yuchengliu1, @carsonwang

## PR stack:

- -> https://github.com/pytorch/pytorch/pull/165978 : implementation of XPU scaled_mm and oneDNN kernel
- https://github.com/pytorch/pytorch/pull/167518 : implementation of XPU scaled_mm_v2
- https://github.com/pytorch/pytorch/pull/166056 : Op registration

## Test Status:

1. Relies on the changes in https://github.com/intel/torch-xpu-ops/pull/1746/, Otherwise the op will fallback to CPU.
2. This PR does not include tests, the tests are enabled in #166056.

## Credit:

This work is based on @yuchengliu1's work at #140972 . The purpose that we created a new PR is to align with the API / checks with CUDA, so there will be less porting efforts.

## FP8 Task tracker:
We will track all the scaled_mm related tasks in: https://github.com/pytorch/pytorch/issues/167170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
2025-11-14 06:41:18 +00:00
3522e0ce74 Revert "Fix different seq length (#167481)"
This reverts commit c78e64622e62eb93a03a9c3762df3290d6c65362.

Reverted https://github.com/pytorch/pytorch/pull/167481 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167481#issuecomment-3530992724))
2025-11-14 06:05:45 +00:00
50bf1f0b81 deprecate check_is_size and guard_size_oblivious (#167198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167198
Approved by: https://github.com/bobrenjc93
2025-11-14 05:35:29 +00:00
c78e64622e Fix different seq length (#167481)
Differential Revision: D86685546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167481
Approved by: https://github.com/eellison
2025-11-14 05:31:29 +00:00
5623628894 [SymmMem] op to get remote tensors (#167779)
To support use case in https://github.com/pytorch/helion/pull/1122, i.e.
```
@helion.kernel
def foo(
    x: Tensor,
    group_name: str
):
    x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name)
    for t in x_remotes:
        ...
````

Helion uses fake tensor to trace a program, thus we cannot use the following code in a Helion function:
```
hdl = rendezvous(tensor)
remote_tensors = tuple(
    hdl.get_remote_tensor(peer, ...) for peer in range(world_size)
)
```
The reason is that when `tensor` is fake, the returned `hdl` is None, thus any subsequent call on it will fail.

This PR wraps the above functionality as an op:
```
lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]")
```
so that things like `hdl` is not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167779
Approved by: https://github.com/yf225
2025-11-14 05:01:55 +00:00
2aba180114 Always track _local_scalar_dense output in tensorify_python_scalars. (#166573)
We need to track all symbols, we used to skip
u = item()
and fail with
```
 File "/home/lsakka/pytorch10/pytorch/torch/fx/passes/_tensorify_python_scalars.py", line 149, in _sympy_interp
    expr_to_sym_proxy[expr]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: u0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166573
Approved by: https://github.com/bobrenjc93
2025-11-14 03:51:43 +00:00
45b2c3d312 [OpenReg][Feat][Docs] Enrich OpenReg device management implementation and add focused documentation (#165897)
## Summary
This PR enriches OpenReg device management codes and adds focused documentation.

## Key Changes
- Introduced device management documentation in `device.md`.
- Updated `OpenRegFunctions.h` and `OpenRegFunctions.cpp` to use `DeviceIndex` and added error handling.
- Implemented `check_device_index` function for validating device indices.
- Enhanced Python bindings in `Module.cpp` for device management.
- Added tests for invalid device index handling in `test_device.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165897
Approved by: https://github.com/fffrog
2025-11-14 03:08:23 +00:00
5b1e112cf9 [Dynamo] Imporve-graph-break-skip-logs (#167067)
Fixes #150477

### Summary:

- Added frame information (function name, file, line number) to all graph break/skip messages
- Standardized message format: "torch.compile will skip tracing the frame <name> (<file> line <N>) and fall back to eager. Reason: <reason>"

### Impacts:
module: dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167067
Approved by: https://github.com/williamwen42
2025-11-14 03:06:37 +00:00
394 changed files with 6095 additions and 2401 deletions

View File

@ -100,337 +100,6 @@ def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:
)
def _compile_and_extract_symbols(
cpp_content: str, compile_flags: list[str], exclude_list: list[str] | None = None
) -> list[str]:
"""
Helper to compile a C++ file and extract all symbols.
Args:
cpp_content: C++ source code to compile
compile_flags: Compilation flags
exclude_list: List of symbol names to exclude. Defaults to ["main"].
Returns:
List of all symbols found in the object file (excluding those in exclude_list).
"""
import subprocess
import tempfile
if exclude_list is None:
exclude_list = ["main"]
with tempfile.TemporaryDirectory() as tmpdir:
tmppath = Path(tmpdir)
cpp_file = tmppath / "test.cpp"
obj_file = tmppath / "test.o"
cpp_file.write_text(cpp_content)
result = subprocess.run(
compile_flags + [str(cpp_file), "-o", str(obj_file)],
capture_output=True,
text=True,
timeout=60,
)
if result.returncode != 0:
raise RuntimeError(f"Compilation failed: {result.stderr}")
symbols = get_symbols(str(obj_file))
# Return all symbol names, excluding those in the exclude list
return [name for _addr, _stype, name in symbols if name not in exclude_list]
def check_stable_only_symbols(install_root: Path) -> None:
"""
Test TORCH_STABLE_ONLY and TORCH_TARGET_VERSION by compiling test code and comparing symbol counts.
This approach tests:
1. WITHOUT macros -> many torch symbols exposed
2. WITH TORCH_STABLE_ONLY -> zero torch symbols (all hidden)
3. WITH TORCH_TARGET_VERSION -> zero torch symbols (all hidden)
4. WITH both macros -> zero torch symbols (all hidden)
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
test_cpp_content = """
// Main torch C++ API headers
#include <torch/torch.h>
#include <torch/all.h>
// ATen tensor library
#include <ATen/ATen.h>
// Core c10 headers (commonly used)
#include <c10/core/Device.h>
#include <c10/core/DeviceType.h>
#include <c10/core/ScalarType.h>
#include <c10/core/TensorOptions.h>
#include <c10/util/Optional.h>
int main() { return 0; }
"""
base_compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c", # Compile only, don't link
]
# Compile WITHOUT any macros
symbols_without = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=base_compile_flags,
)
# We expect constexpr symbols, inline functions used by other headers etc.
# to produce symbols
num_symbols_without = len(symbols_without)
print(f"Found {num_symbols_without} symbols without any macros defined")
assert num_symbols_without != 0, (
"Expected a non-zero number of symbols without any macros"
)
# Compile WITH TORCH_STABLE_ONLY (expect 0 symbols)
compile_flags_with_stable_only = base_compile_flags + ["-DTORCH_STABLE_ONLY"]
symbols_with_stable_only = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_stable_only,
)
num_symbols_with_stable_only = len(symbols_with_stable_only)
assert num_symbols_with_stable_only == 0, (
f"Expected no symbols with TORCH_STABLE_ONLY macro, but found {num_symbols_with_stable_only}"
)
# Compile WITH TORCH_TARGET_VERSION (expect 0 symbols)
compile_flags_with_target_version = base_compile_flags + [
"-DTORCH_TARGET_VERSION=1"
]
symbols_with_target_version = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_target_version,
)
num_symbols_with_target_version = len(symbols_with_target_version)
assert num_symbols_with_target_version == 0, (
f"Expected no symbols with TORCH_TARGET_VERSION macro, but found {num_symbols_with_target_version}"
)
# Compile WITH both macros (expect 0 symbols)
compile_flags_with_both = base_compile_flags + [
"-DTORCH_STABLE_ONLY",
"-DTORCH_TARGET_VERSION=1",
]
symbols_with_both = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_both,
)
num_symbols_with_both = len(symbols_with_both)
assert num_symbols_with_both == 0, (
f"Expected no symbols with both macros, but found {num_symbols_with_both}"
)
def check_stable_api_symbols(install_root: Path) -> None:
"""
Test that stable API headers still expose symbols with TORCH_STABLE_ONLY.
The torch/csrc/stable/c/shim.h header is tested in check_stable_c_shim_symbols
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
stable_dir = include_dir / "torch" / "csrc" / "stable"
assert stable_dir.exists(), f"Expected {stable_dir} to be present"
stable_headers = list(stable_dir.rglob("*.h"))
if not stable_headers:
raise RuntimeError("Could not find any stable headers")
includes = []
for header in stable_headers:
rel_path = header.relative_to(include_dir)
includes.append(f"#include <{rel_path.as_posix()}>")
includes_str = "\n".join(includes)
test_stable_content = f"""
{includes_str}
int main() {{ return 0; }}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_stable = _compile_and_extract_symbols(
cpp_content=test_stable_content,
compile_flags=compile_flags,
)
num_symbols_stable = len(symbols_stable)
print(f"Found {num_symbols_stable} symbols in torch/csrc/stable")
assert num_symbols_stable > 0, (
f"Expected stable headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_stable} symbols"
)
def check_headeronly_symbols(install_root: Path) -> None:
"""
Test that header-only utility headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# Find all headers in torch/headeronly
headeronly_dir = include_dir / "torch" / "headeronly"
assert headeronly_dir.exists(), f"Expected {headeronly_dir} to be present"
headeronly_headers = list(headeronly_dir.rglob("*.h"))
if not headeronly_headers:
raise RuntimeError("Could not find any headeronly headers")
# Filter out platform-specific headers that may not compile everywhere
platform_specific_keywords = [
"cpu/vec",
]
filtered_headers = []
for header in headeronly_headers:
rel_path = header.relative_to(include_dir).as_posix()
if not any(
keyword in rel_path.lower() for keyword in platform_specific_keywords
):
filtered_headers.append(header)
includes = []
for header in filtered_headers:
rel_path = header.relative_to(include_dir)
includes.append(f"#include <{rel_path.as_posix()}>")
includes_str = "\n".join(includes)
test_headeronly_content = f"""
{includes_str}
int main() {{ return 0; }}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_headeronly = _compile_and_extract_symbols(
cpp_content=test_headeronly_content,
compile_flags=compile_flags,
)
num_symbols_headeronly = len(symbols_headeronly)
print(f"Found {num_symbols_headeronly} symbols in torch/headeronly")
assert num_symbols_headeronly > 0, (
f"Expected headeronly headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_headeronly} symbols"
)
def check_aoti_shim_symbols(install_root: Path) -> None:
"""
Test that AOTI shim headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# There are no constexpr symbols etc., so we need to actually use functions
# so that some symbols are found.
test_shim_content = """
#include <torch/csrc/inductor/aoti_torch/c/shim.h>
int main() {
int32_t (*fp1)() = &aoti_torch_device_type_cpu;
int32_t (*fp2)() = &aoti_torch_dtype_float32;
(void)fp1; (void)fp2;
return 0;
}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_shim = _compile_and_extract_symbols(
cpp_content=test_shim_content,
compile_flags=compile_flags,
)
num_symbols_shim = len(symbols_shim)
assert num_symbols_shim > 0, (
f"Expected shim headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_shim} symbols"
)
def check_stable_c_shim_symbols(install_root: Path) -> None:
"""
Test that stable C shim headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# Check if the stable C shim exists
stable_shim = include_dir / "torch" / "csrc" / "stable" / "c" / "shim.h"
if not stable_shim.exists():
raise RuntimeError("Could not find stable c shim")
# There are no constexpr symbols etc., so we need to actually use functions
# so that some symbols are found.
test_stable_shim_content = """
#include <torch/csrc/stable/c/shim.h>
int main() {
// Reference stable C API functions to create undefined symbols
AOTITorchError (*fp1)(const char*, uint32_t*, int32_t*) = &torch_parse_device_string;
AOTITorchError (*fp2)(uint32_t*) = &torch_get_num_threads;
(void)fp1; (void)fp2;
return 0;
}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_stable_shim = _compile_and_extract_symbols(
cpp_content=test_stable_shim_content,
compile_flags=compile_flags,
)
num_symbols_stable_shim = len(symbols_stable_shim)
assert num_symbols_stable_shim > 0, (
f"Expected stable C shim headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_stable_shim} symbols"
)
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
@ -460,13 +129,6 @@ def main() -> None:
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
# Check symbols when TORCH_STABLE_ONLY is defined
check_stable_only_symbols(install_root)
check_stable_api_symbols(install_root)
check_headeronly_symbols(install_root)
check_aoti_shim_symbols(install_root)
check_stable_c_shim_symbols(install_root)
if __name__ == "__main__":
main()

View File

@ -389,6 +389,13 @@ test_lazy_tensor_meta_reference_disabled() {
export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE
}
test_dynamo_core() {
time python test/run_test.py \
--include-dynamo-core-tests \
--verbose \
--upload-artifacts-while-running
assert_git_not_dirty
}
test_dynamo_wrapped_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
@ -1814,6 +1821,8 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
test_inductor_shard "${SHARD_NUMBER}"
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_core* ]]; then
test_dynamo_core
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
install_torchvision
test_dynamo_wrapped_shard "${SHARD_NUMBER}"

View File

@ -1 +1 @@
acccf86477759b2d3500f1ae1be065f7b1e409ec
2d82dc5caa336d179d9b46ac4a0fb8c43d84c5cc

View File

@ -7,6 +7,7 @@ ciflow_push_tags:
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/dynamo
- ciflow/h100
- ciflow/h100-cutlass-backend
- ciflow/h100-distributed

View File

@ -326,7 +326,7 @@ jobs:
SCCACHE_BUCKET: ${{ !contains(matrix.runner, 'b200') && 'ossci-compiler-cache-circleci-v2' || '' }}
SCCACHE_REGION: ${{ !contains(matrix.runner, 'b200') && 'us-east-1' || '' }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}

70
.github/workflows/dynamo-unittest.yml vendored Normal file
View File

@ -0,0 +1,70 @@
# Workflow: Dynamo Unit Test
# runs unit tests for dynamo.
name: dynamo-unittest
on:
push:
tags:
- ciflow/dynamo/*
workflow_call:
schedule:
- cron: 29 8 * * * # about 1:29am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
dynamo-build:
name: dynamo-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
strategy:
matrix:
python-version: ['3.11', '3.12']
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py${{ matrix.python-version }}-clang12
docker-image-name: ci-image:pytorch-linux-jammy-py${{ matrix.python-version }}-clang12
test-matrix: |
{ include: [
{ config: "dynamo_core", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
]}
secrets: inherit
dynamo-test:
name: dynamo-test
uses: ./.github/workflows/_linux-test.yml
needs: [get-label-type, dynamo-build]
strategy:
matrix:
python-version: ['3.11', '3.12']
with:
build-environment: linux-jammy-py${{ matrix.python-version }}-clang12
docker-image: ci-image:pytorch-linux-jammy-py${{ matrix.python-version }}-clang12
test-matrix: |
{ include: [
{ config: "dynamo_core", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.c7i.2xlarge" },
]}
secrets: inherit

330
.spin/cmds.py Normal file
View File

@ -0,0 +1,330 @@
import hashlib
import subprocess
import sys
from pathlib import Path
import click
import spin
def file_digest(file, algorithm: str):
try:
return hashlib.file_digest(file, algorithm)
except AttributeError:
pass # Fallback to manual implementation below
hash = hashlib.new(algorithm)
while chunk := file.read(8192):
hash.update(chunk)
return hash
def _hash_file(file):
with open(file, "rb") as f:
hash = file_digest(f, "sha256")
return hash.hexdigest()
def _hash_files(files):
hashes = {file: _hash_file(file) for file in files}
return hashes
def _read_hashes(hash_file: Path):
if not hash_file.exists():
return {}
with hash_file.open("r") as f:
lines = f.readlines()
hashes = {}
for line in lines:
hash = line[:64]
file = line[66:].strip()
hashes[file] = hash
return hashes
def _updated_hashes(hash_file, files_to_hash):
old_hashes = _read_hashes(hash_file)
new_hashes = _hash_files(files_to_hash)
if new_hashes != old_hashes:
return new_hashes
return None
@click.command()
def regenerate_version():
"""Regenerate version.py."""
cmd = [
sys.executable,
"-m",
"tools.generate_torch_version",
"--is-debug=false",
]
spin.util.run(cmd)
TYPE_STUBS = [
(
"Pytorch type stubs",
Path(".lintbin/.pytorch-type-stubs.sha256"),
[
"aten/src/ATen/native/native_functions.yaml",
"aten/src/ATen/native/tags.yaml",
"tools/autograd/deprecated.yaml",
],
[
sys.executable,
"-m",
"tools.pyi.gen_pyi",
"--native-functions-path",
"aten/src/ATen/native/native_functions.yaml",
"--tags-path",
"aten/src/ATen/native/tags.yaml",
"--deprecated-functions-path",
"tools/autograd/deprecated.yaml",
],
),
(
"Datapipes type stubs",
None,
[],
[
sys.executable,
"torch/utils/data/datapipes/gen_pyi.py",
],
),
]
@click.command()
def regenerate_type_stubs():
"""Regenerate type stubs."""
for name, hash_file, files_to_hash, cmd in TYPE_STUBS:
if hash_file:
if hashes := _updated_hashes(hash_file, files_to_hash):
click.echo(
f"Changes detected in type stub files for {name}. Regenerating..."
)
spin.util.run(cmd)
hash_file.parent.mkdir(parents=True, exist_ok=True)
with hash_file.open("w") as f:
for file, hash in hashes.items():
f.write(f"{hash} {file}\n")
click.echo("Type stubs and hashes updated.")
else:
click.echo(f"No changes detected in type stub files for {name}.")
else:
click.echo(f"No hash file for {name}. Regenerating...")
spin.util.run(cmd)
click.echo("Type stubs regenerated.")
@click.command()
def regenerate_clangtidy_files():
"""Regenerate clang-tidy files."""
cmd = [
sys.executable,
"-m",
"tools.linter.clang_tidy.generate_build_files",
]
spin.util.run(cmd)
#: These linters are expected to need less than 3s cpu time total
VERY_FAST_LINTERS = {
"ATEN_CPU_GPU_AGNOSTIC",
"BAZEL_LINTER",
"C10_NODISCARD",
"C10_UNUSED",
"CALL_ONCE",
"CMAKE_MINIMUM_REQUIRED",
"CONTEXT_DECORATOR",
"COPYRIGHT",
"CUBINCLUDE",
"DEPLOY_DETECTION",
"ERROR_PRONE_ISINSTANCE",
"EXEC",
"HEADER_ONLY_LINTER",
"IMPORT_LINTER",
"INCLUDE",
"LINTRUNNER_VERSION",
"MERGE_CONFLICTLESS_CSV",
"META_NO_CREATE_UNBACKED",
"NEWLINE",
"NOQA",
"NO_WORKFLOWS_ON_FORK",
"ONCE_FLAG",
"PYBIND11_INCLUDE",
"PYBIND11_SPECIALIZATION",
"PYPIDEP",
"PYPROJECT",
"RAWCUDA",
"RAWCUDADEVICE",
"ROOT_LOGGING",
"TABS",
"TESTOWNERS",
"TYPEIGNORE",
"TYPENOSKIP",
"WORKFLOWSYNC",
}
#: These linters are expected to take a few seconds, but less than 10s cpu time total
FAST_LINTERS = {
"CMAKE",
"DOCSTRING_LINTER",
"GHA",
"NATIVEFUNCTIONS",
"RUFF",
"SET_LINTER",
"SHELLCHECK",
"SPACES",
}
#: These linters are expected to take more than 10s cpu time total;
#: some need more than 1 hour.
SLOW_LINTERS = {
"ACTIONLINT",
"CLANGFORMAT",
"CLANGTIDY",
"CODESPELL",
"FLAKE8",
"GB_REGISTRY",
"PYFMT",
"PYREFLY",
"TEST_DEVICE_BIAS",
"TEST_HAS_MAIN",
}
ALL_LINTERS = VERY_FAST_LINTERS | FAST_LINTERS | SLOW_LINTERS
LINTRUNNER_CACHE_INFO = (
Path(".lintbin/.lintrunner.sha256"),
[
"requirements.txt",
"pyproject.toml",
".lintrunner.toml",
],
)
LINTRUNNER_BASE_CMD = [
"uvx",
"--python",
"3.10",
"lintrunner@0.12.7",
]
@click.command()
def setup_lint():
"""Set up lintrunner with current CI version."""
cmd = LINTRUNNER_BASE_CMD + ["init"]
subprocess.run(cmd, check=True, capture_output=True, text=True)
def _check_linters():
cmd = LINTRUNNER_BASE_CMD + ["list"]
ret = spin.util.run(cmd, output=False, stderr=subprocess.PIPE)
linters = {l.strip() for l in ret.stdout.decode().strip().split("\n")[1:]}
unknown_linters = linters - ALL_LINTERS
missing_linters = ALL_LINTERS - linters
if unknown_linters:
click.secho(
f"Unknown linters found; please add them to the correct category "
f"in .spin/cmds.py: {', '.join(unknown_linters)}",
fg="yellow",
)
if missing_linters:
click.secho(
f"Missing linters found; please update the corresponding category "
f"in .spin/cmds.py: {', '.join(missing_linters)}",
fg="yellow",
)
return unknown_linters, missing_linters
@spin.util.extend_command(
setup_lint,
doc=f"""
If configuration has changed, update lintrunner.
Compares the stored old hashes of configuration files with new ones and
performs setup via setup-lint if the hashes have changed.
Hashes are stored in {LINTRUNNER_CACHE_INFO[0]}; the following files are
considered: {", ".join(LINTRUNNER_CACHE_INFO[1])}.
""",
)
@click.pass_context
def lazy_setup_lint(ctx, parent_callback, **kwargs):
if hashes := _updated_hashes(*LINTRUNNER_CACHE_INFO):
click.echo(
"Changes detected in lint configuration files. Setting up linting tools..."
)
parent_callback(**kwargs)
hash_file = LINTRUNNER_CACHE_INFO[0]
hash_file.parent.mkdir(parents=True, exist_ok=True)
with hash_file.open("w") as f:
for file, hash in hashes.items():
f.write(f"{hash} {file}\n")
click.echo("Linting tools set up and hashes updated.")
else:
click.echo("No changes detected in lint configuration files. Skipping setup.")
click.echo("Regenerating version...")
ctx.invoke(regenerate_version)
click.echo("Regenerating type stubs...")
ctx.invoke(regenerate_type_stubs)
click.echo("Done.")
_check_linters()
@click.command()
@click.option("-a", "--apply-patches", is_flag=True)
@click.pass_context
def lint(ctx, apply_patches, **kwargs):
"""Lint all files."""
ctx.invoke(lazy_setup_lint)
all_files_linters = VERY_FAST_LINTERS | FAST_LINTERS
changed_files_linters = SLOW_LINTERS
cmd = LINTRUNNER_BASE_CMD
if apply_patches:
cmd += ["--apply-patches"]
all_files_cmd = cmd + [
"--take",
",".join(all_files_linters),
"--all-files",
]
spin.util.run(all_files_cmd)
changed_files_cmd = cmd + [
"--take",
",".join(changed_files_linters),
]
spin.util.run(changed_files_cmd)
@click.command()
@click.pass_context
def fixlint(ctx, **kwargs):
"""Autofix all files."""
ctx.invoke(lint, apply_patches=True)
@click.command()
@click.option("-a", "--apply-patches", is_flag=True)
@click.pass_context
def quicklint(ctx, apply_patches, **kwargs):
"""Lint changed files."""
ctx.invoke(lazy_setup_lint)
cmd = LINTRUNNER_BASE_CMD
if apply_patches:
cmd += ["--apply-patches"]
spin.util.run(cmd)
@click.command()
@click.pass_context
def quickfix(ctx, **kwargs):
"""Autofix changed files."""
ctx.invoke(quicklint, apply_patches=True)

View File

@ -144,7 +144,7 @@ inline std::bitset<kVmapNumLevels> createVmapLevelsBitset(BatchDimsRef bdims) {
}
inline std::ostream& operator<<(std::ostream& out, const BatchDim& bdim) {
out << "(lvl=" << bdim.level() << ", dim=" << bdim.dim() << ")";
out << "(lvl=" << bdim.level() << ", dim=" << bdim.dim() << ')';
return out;
}

View File

@ -9,7 +9,7 @@ namespace indexing {
const EllipsisIndexType Ellipsis = EllipsisIndexType();
std::ostream& operator<<(std::ostream& stream, const Slice& slice) {
stream << slice.start() << ":" << slice.stop() << ":" << slice.step();
stream << slice.start() << ':' << slice.stop() << ':' << slice.step();
return stream;
}
@ -31,12 +31,12 @@ std::ostream& operator<<(std::ostream& stream, const TensorIndex& tensor_index)
}
std::ostream& operator<<(std::ostream& stream, const std::vector<TensorIndex>& tensor_indices) {
stream << "(";
stream << '(';
for (const auto i : c10::irange(tensor_indices.size())) {
stream << tensor_indices[i];
if (i < tensor_indices.size() - 1) stream << ", ";
}
stream << ")";
stream << ')';
return stream;
}

View File

@ -113,7 +113,7 @@ void TensorNames::checkUnique(const char* op_name) const {
std::ostream& operator<<(std::ostream& out, const TensorName& tensorname) {
out << tensorname.name_ << " (index ";
out << tensorname.origin_idx_ << " of ";
out << tensorname.origin_ << ")";
out << tensorname.origin_ << ')';
return out;
}

View File

@ -13,9 +13,9 @@ std::ostream& operator<<(std::ostream & out, const TensorGeometryArg& t) {
if (t.pos == 0) {
// 0 is distinguished; it usually indicates 'self' or the return
// tensor
out << "'" << t.name << "'";
out << '\'' << t.name << '\'';
} else {
out << "argument #" << t.pos << " '" << t.name << "'";
out << "argument #" << t.pos << " '" << t.name << '\'';
}
return out;
}
@ -154,7 +154,7 @@ void checkSameGPU(CheckedFrom c, const TensorArg& t1, const TensorArg& t2) {
oss << "Tensor for " << t2 << " is on CPU, ";
}
oss << "but expected " << ((!t1->is_cpu() && !t2->is_cpu()) ? "them" : "it")
<< " to be on GPU (while checking arguments for " << c << ")";
<< " to be on GPU (while checking arguments for " << c << ')';
TORCH_CHECK(false, oss.str());
}
TORCH_CHECK(
@ -199,7 +199,7 @@ void checkScalarTypes(CheckedFrom c, const TensorArg& t,
i++;
}
oss << "; but got " << t->toString()
<< " instead (while checking arguments for " << c << ")";
<< " instead (while checking arguments for " << c << ')';
TORCH_CHECK(false, oss.str());
}
}

View File

@ -43,8 +43,8 @@ std::string get_mkldnn_version() {
// https://github.com/intel/ideep/issues/29
{
const dnnl_version_t* ver = dnnl_version();
ss << "Intel(R) MKL-DNN v" << ver->major << "." << ver->minor << "." << ver->patch
<< " (Git Hash " << ver->hash << ")";
ss << "Intel(R) MKL-DNN v" << ver->major << '.' << ver->minor << '.' << ver->patch
<< " (Git Hash " << ver->hash << ')';
}
#else
ss << "MKLDNN not found";
@ -81,7 +81,7 @@ std::string get_openmp_version() {
break;
}
if (ver_str) {
ss << " (a.k.a. OpenMP " << ver_str << ")";
ss << " (a.k.a. OpenMP " << ver_str << ')';
}
}
#else
@ -135,38 +135,38 @@ std::string show_config() {
#if defined(__GNUC__)
{
ss << " - GCC " << __GNUC__ << "." << __GNUC_MINOR__ << "\n";
ss << " - GCC " << __GNUC__ << '.' << __GNUC_MINOR__ << '\n';
}
#endif
#if defined(__cplusplus)
{
ss << " - C++ Version: " << __cplusplus << "\n";
ss << " - C++ Version: " << __cplusplus << '\n';
}
#endif
#if defined(__clang_major__)
{
ss << " - clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << "\n";
ss << " - clang " << __clang_major__ << '.' << __clang_minor__ << '.' << __clang_patchlevel__ << '\n';
}
#endif
#if defined(_MSC_VER)
{
ss << " - MSVC " << _MSC_FULL_VER << "\n";
ss << " - MSVC " << _MSC_FULL_VER << '\n';
}
#endif
#if AT_MKL_ENABLED()
ss << " - " << get_mkl_version() << "\n";
ss << " - " << get_mkl_version() << '\n';
#endif
#if AT_MKLDNN_ENABLED()
ss << " - " << get_mkldnn_version() << "\n";
ss << " - " << get_mkldnn_version() << '\n';
#endif
#ifdef _OPENMP
ss << " - " << get_openmp_version() << "\n";
ss << " - " << get_openmp_version() << '\n';
#endif
#if AT_BUILD_WITH_LAPACK()
@ -183,7 +183,7 @@ std::string show_config() {
ss << " - Cross compiling on MacOSX\n";
#endif
ss << " - "<< used_cpu_capability() << "\n";
ss << " - "<< used_cpu_capability() << '\n';
if (hasCUDA()) {
ss << detail::getCUDAHooks().showConfig();
@ -200,10 +200,10 @@ std::string show_config() {
ss << " - Build settings: ";
for (const auto& pair : caffe2::GetBuildOptions()) {
if (!pair.second.empty()) {
ss << pair.first << "=" << pair.second << ", ";
ss << pair.first << '=' << pair.second << ", ";
}
}
ss << "\n";
ss << '\n';
// TODO: do HIP
// TODO: do XLA

View File

@ -209,7 +209,7 @@ struct CodeTemplate {
// to indent correctly in the context.
void emitIndent(std::ostream& out, size_t indent) const {
for ([[maybe_unused]] const auto i : c10::irange(indent)) {
out << " ";
out << ' ';
}
}
void emitStringWithIndents(

View File

@ -10,7 +10,7 @@ std::ostream& operator<<(std::ostream& out, const Dimname& dimname) {
if (dimname.type() == NameType::WILDCARD) {
out << "None";
} else {
out << "'" << dimname.symbol().toUnqualString() << "'";
out << '\'' << dimname.symbol().toUnqualString() << '\'';
}
return out;
}

View File

@ -5,7 +5,7 @@
namespace at {
std::ostream& operator<<(std::ostream& out, const Range& range) {
out << "Range[" << range.begin << ", " << range.end << "]";
out << "Range[" << range.begin << ", " << range.end << ']';
return out;
}

View File

@ -71,7 +71,7 @@ void TensorBase::enforce_invariants() {
void TensorBase::print() const {
if (defined()) {
std::cerr << "[" << toString() << " " << sizes() << "]" << '\n';
std::cerr << '[' << toString() << ' ' << sizes() << ']' << '\n';
} else {
std::cerr << "[UndefinedTensor]" << '\n';
}

View File

@ -1,5 +1,6 @@
#pragma once
#include <torch/headeronly/core/TensorAccessor.h>
#include <c10/macros/Macros.h>
#include <c10/util/ArrayRef.h>
#include <c10/util/Deprecated.h>
@ -11,252 +12,37 @@
namespace at {
// The PtrTraits argument to the TensorAccessor/GenericPackedTensorAccessor
// is used to enable the __restrict__ keyword/modifier for the data
// passed to cuda.
template <typename T>
struct DefaultPtrTraits {
typedef T* PtrType;
};
using torch::headeronly::DefaultPtrTraits;
#if defined(__CUDACC__) || defined(__HIPCC__)
template <typename T>
struct RestrictPtrTraits {
typedef T* __restrict__ PtrType;
};
using torch::headeronly::RestrictPtrTraits;
#endif
// TensorAccessorBase and TensorAccessor are used for both CPU and CUDA tensors.
// For CUDA tensors it is used in device code (only). This means that we restrict ourselves
// to functions and types available there (e.g. IntArrayRef isn't).
// The PtrTraits argument is only relevant to cuda to support `__restrict__` pointers.
template<typename T, size_t N, template <typename U> class PtrTraits = DefaultPtrTraits, typename index_t = int64_t>
class TensorAccessorBase {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
using TensorAccessorBase = torch::headeronly::detail::TensorAccessorBase<c10::IntArrayRef, T, N, PtrTraits, index_t>;
C10_HOST_DEVICE TensorAccessorBase(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: data_(data_), sizes_(sizes_), strides_(strides_) {}
C10_HOST IntArrayRef sizes() const {
return IntArrayRef(sizes_,N);
}
C10_HOST IntArrayRef strides() const {
return IntArrayRef(strides_,N);
}
C10_HOST_DEVICE index_t stride(index_t i) const {
return strides_[i];
}
C10_HOST_DEVICE index_t size(index_t i) const {
return sizes_[i];
}
C10_HOST_DEVICE PtrType data() {
return data_;
}
C10_HOST_DEVICE const PtrType data() const {
return data_;
}
protected:
PtrType data_;
const index_t* sizes_;
const index_t* strides_;
};
// The `TensorAccessor` is typically instantiated for CPU `Tensor`s using
// `Tensor.accessor<T, N>()`.
// For CUDA `Tensor`s, `GenericPackedTensorAccessor` is used on the host and only
// indexing on the device uses `TensorAccessor`s.
template<typename T, size_t N, template <typename U> class PtrTraits = DefaultPtrTraits, typename index_t = int64_t>
class TensorAccessor : public TensorAccessorBase<T,N,PtrTraits,index_t> {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
using TensorAccessor = torch::headeronly::detail::TensorAccessor<c10::IntArrayRef, T, N, PtrTraits, index_t>;
C10_HOST_DEVICE TensorAccessor(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: TensorAccessorBase<T, N, PtrTraits, index_t>(data_,sizes_,strides_) {}
namespace detail {
C10_HOST_DEVICE TensorAccessor<T, N - 1, PtrTraits, index_t> operator[](index_t i) {
return TensorAccessor<T,N-1,PtrTraits,index_t>(this->data_ + this->strides_[0]*i,this->sizes_+1,this->strides_+1);
}
C10_HOST_DEVICE const TensorAccessor<T, N-1, PtrTraits, index_t> operator[](index_t i) const {
return TensorAccessor<T,N-1,PtrTraits,index_t>(this->data_ + this->strides_[0]*i,this->sizes_+1,this->strides_+1);
}
};
template<typename T, template <typename U> class PtrTraits, typename index_t>
class TensorAccessor<T,1,PtrTraits,index_t> : public TensorAccessorBase<T,1,PtrTraits,index_t> {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
C10_HOST_DEVICE TensorAccessor(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: TensorAccessorBase<T, 1, PtrTraits, index_t>(data_,sizes_,strides_) {}
C10_HOST_DEVICE T & operator[](index_t i) {
// NOLINTNEXTLINE(clang-analyzer-core.NullDereference)
return this->data_[this->strides_[0]*i];
}
C10_HOST_DEVICE const T & operator[](index_t i) const {
return this->data_[this->strides_[0]*i];
}
};
// GenericPackedTensorAccessorBase and GenericPackedTensorAccessor are used on for CUDA `Tensor`s on the host
// and as
// In contrast to `TensorAccessor`s, they copy the strides and sizes on instantiation (on the host)
// in order to transfer them on the device when calling kernels.
// On the device, indexing of multidimensional tensors gives to `TensorAccessor`s.
// Use RestrictPtrTraits as PtrTraits if you want the tensor's data pointer to be marked as __restrict__.
// Instantiation from data, sizes, strides is only needed on the host and std::copy isn't available
// on the device, so those functions are host only.
template<typename T, size_t N, template <typename U> class PtrTraits = DefaultPtrTraits, typename index_t = int64_t>
class GenericPackedTensorAccessorBase {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
C10_HOST GenericPackedTensorAccessorBase(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: data_(data_) {
std::copy(sizes_, sizes_ + N, std::begin(this->sizes_));
std::copy(strides_, strides_ + N, std::begin(this->strides_));
}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
C10_HOST GenericPackedTensorAccessorBase(
PtrType data_,
const source_index_t* sizes_,
const source_index_t* strides_)
: data_(data_) {
for (const auto i : c10::irange(N)) {
this->sizes_[i] = sizes_[i];
this->strides_[i] = strides_[i];
}
}
C10_HOST_DEVICE index_t stride(index_t i) const {
return strides_[i];
}
C10_HOST_DEVICE index_t size(index_t i) const {
return sizes_[i];
}
C10_HOST_DEVICE PtrType data() {
return data_;
}
C10_HOST_DEVICE const PtrType data() const {
return data_;
}
protected:
PtrType data_;
// NOLINTNEXTLINE(*c-arrays*)
index_t sizes_[N];
// NOLINTNEXTLINE(*c-arrays*)
index_t strides_[N];
C10_HOST void bounds_check_(index_t i) const {
TORCH_CHECK_INDEX(
template <size_t N, typename index_t>
struct IndexBoundsCheck {
IndexBoundsCheck(index_t i) {
TORCH_CHECK_INDEX(
0 <= i && i < index_t{N},
"Index ",
i,
" is not within bounds of a tensor of dimension ",
N);
}
}
};
} // namespace detail
template<typename T, size_t N, template <typename U> class PtrTraits = DefaultPtrTraits, typename index_t = int64_t>
class GenericPackedTensorAccessor : public GenericPackedTensorAccessorBase<T,N,PtrTraits,index_t> {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: GenericPackedTensorAccessorBase<T, N, PtrTraits, index_t>(data_, sizes_, strides_) {}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const source_index_t* sizes_,
const source_index_t* strides_)
: GenericPackedTensorAccessorBase<T, N, PtrTraits, index_t>(data_, sizes_, strides_) {}
C10_DEVICE TensorAccessor<T, N - 1, PtrTraits, index_t> operator[](index_t i) {
index_t* new_sizes = this->sizes_ + 1;
index_t* new_strides = this->strides_ + 1;
return TensorAccessor<T,N-1,PtrTraits,index_t>(this->data_ + this->strides_[0]*i, new_sizes, new_strides);
}
C10_DEVICE const TensorAccessor<T, N - 1, PtrTraits, index_t> operator[](index_t i) const {
const index_t* new_sizes = this->sizes_ + 1;
const index_t* new_strides = this->strides_ + 1;
return TensorAccessor<T,N-1,PtrTraits,index_t>(this->data_ + this->strides_[0]*i, new_sizes, new_strides);
}
/// Returns a PackedTensorAccessor of the same dimension after transposing the
/// two dimensions given. Does not actually move elements; transposition is
/// made by permuting the size/stride arrays. If the dimensions are not valid,
/// asserts.
C10_HOST GenericPackedTensorAccessor<T, N, PtrTraits, index_t> transpose(
index_t dim1,
index_t dim2) const {
this->bounds_check_(dim1);
this->bounds_check_(dim2);
GenericPackedTensorAccessor<T, N, PtrTraits, index_t> result(
this->data_, this->sizes_, this->strides_);
std::swap(result.strides_[dim1], result.strides_[dim2]);
std::swap(result.sizes_[dim1], result.sizes_[dim2]);
return result;
}
};
template<typename T, template <typename U> class PtrTraits, typename index_t>
class GenericPackedTensorAccessor<T,1,PtrTraits,index_t> : public GenericPackedTensorAccessorBase<T,1,PtrTraits,index_t> {
public:
typedef typename PtrTraits<T>::PtrType PtrType;
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const index_t* sizes_,
const index_t* strides_)
: GenericPackedTensorAccessorBase<T, 1, PtrTraits, index_t>(data_, sizes_, strides_) {}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const source_index_t* sizes_,
const source_index_t* strides_)
: GenericPackedTensorAccessorBase<T, 1, PtrTraits, index_t>(data_, sizes_, strides_) {}
C10_DEVICE T & operator[](index_t i) {
return this->data_[this->strides_[0] * i];
}
C10_DEVICE const T& operator[](index_t i) const {
return this->data_[this->strides_[0]*i];
}
// Same as in the general N-dimensional case, but note that in the
// 1-dimensional case the returned PackedTensorAccessor will always be an
// identical copy of the original
C10_HOST GenericPackedTensorAccessor<T, 1, PtrTraits, index_t> transpose(
index_t dim1,
index_t dim2) const {
this->bounds_check_(dim1);
this->bounds_check_(dim2);
return GenericPackedTensorAccessor<T, 1, PtrTraits, index_t>(
this->data_, this->sizes_, this->strides_);
}
};
using GenericPackedTensorAccessorBase = torch::headeronly::detail::GenericPackedTensorAccessorBase<detail::IndexBoundsCheck<N, index_t>, T, N, PtrTraits, index_t>;
template<typename T, size_t N, template <typename U> class PtrTraits = DefaultPtrTraits, typename index_t = int64_t>
using GenericPackedTensorAccessor = torch::headeronly::detail::GenericPackedTensorAccessor<TensorAccessor<T, N-1, PtrTraits, index_t>, detail::IndexBoundsCheck<N, index_t>, T, N, PtrTraits, index_t>;
// Can't put this directly into the macro function args because of commas
#define AT_X GenericPackedTensorAccessor<T, N, PtrTraits, index_t>

View File

@ -9,8 +9,8 @@ APIVitals VitalsAPI;
std::ostream& operator<<(std::ostream& os, TorchVital const& tv) {
for (const auto& m : tv.attrs) {
os << "[TORCH_VITAL] " << tv.name << "." << m.first << "\t\t "
<< m.second.value << "\n";
os << "[TORCH_VITAL] " << tv.name << '.' << m.first << "\t\t "
<< m.second.value << '\n';
}
return os;
}

View File

@ -100,18 +100,18 @@ inline bool operator==(const AliasInfo& lhs, const AliasInfo& rhs) {
// this does match the way things are represented in the schema
inline std::ostream& operator<<(std::ostream& out, const AliasInfo& aliasInfo) {
out << "(";
out << '(';
bool first = true;
for (const auto& set : aliasInfo.beforeSets()) {
if (first) {
first = false;
} else {
out << "|";
out << '|';
}
out << set.toUnqualString();
}
if (aliasInfo.isWrite()) {
out << "!";
out << '!';
}
if (aliasInfo.beforeSets() != aliasInfo.afterSets()) {
out << " -> ";
@ -120,12 +120,12 @@ inline std::ostream& operator<<(std::ostream& out, const AliasInfo& aliasInfo) {
if (first) {
first = false;
} else {
out << "|";
out << '|';
}
out << set.toUnqualString();
}
}
out << ")";
out << ')';
return out;
}
} // namespace c10

View File

@ -198,7 +198,7 @@ inline void swap(Blob& lhs, Blob& rhs) noexcept {
}
inline std::ostream& operator<<(std::ostream& out, const Blob& v) {
return out << "Blob[" << v.TypeName() << "]";
return out << "Blob[" << v.TypeName() << ']';
}
} // namespace caffe2

View File

@ -456,8 +456,8 @@ bool ClassType::isSubtypeOfExt(const Type& rhs, std::ostream* why_not) const {
*why_not << "Method on class '" << repr_str()
<< "' (1) is not compatible with interface '"
<< rhs.repr_str() << "' (2)\n"
<< " (1) " << self_method->getSchema() << "\n"
<< " (2) " << schema << "\n";
<< " (1) " << self_method->getSchema() << '\n'
<< " (2) " << schema << '\n';
}
return false;
}

View File

@ -100,7 +100,7 @@ struct TORCH_API ClassType : public NamedType {
std::string repr_str() const override {
std::stringstream ss;
ss << str()
<< " (of Python compilation unit at: " << compilation_unit().get() << ")";
<< " (of Python compilation unit at: " << compilation_unit().get() << ')';
return ss.str();
}

View File

@ -58,12 +58,12 @@ std::string DispatchKeyExtractor::dumpState() const {
std::ostringstream oss;
for (const auto i : c10::irange(c10::utils::bitset::NUM_BITS())) {
if (dispatch_arg_indices_reverse_.get(i)) {
oss << "1";
oss << '1';
} else {
oss << "0";
oss << '0';
}
}
oss << " " << nonFallthroughKeys_ << "\n";
oss << ' ' << nonFallthroughKeys_ << '\n';
return oss.str();
}

View File

@ -69,8 +69,8 @@ private:
void _print_dispatch_trace(const std::string& label, const std::string& op_name, const DispatchKeySet& dispatchKeySet) {
auto nesting_value = dispatch_trace_nesting_value();
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " ";
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl;
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << ' ';
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << ']' << std::endl;
}
} // namespace detail

View File

@ -570,7 +570,7 @@ void OperatorEntry::checkInvariants() const {
std::string OperatorEntry::listAllDispatchKeys() const {
std::ostringstream str;
str << "[";
str << '[';
bool has_kernels = false;
for (auto k : allDispatchKeysInFullSet()) {
@ -584,7 +584,7 @@ std::string OperatorEntry::listAllDispatchKeys() const {
str << k;
has_kernels = true;
}
str << "]";
str << ']';
return str.str();
}
@ -683,12 +683,12 @@ void OperatorEntry::setReportErrorCallback_(std::unique_ptr<c10::SafePyObject> c
// This WON'T report backend fallbacks.
std::string OperatorEntry::dumpState() const {
std::ostringstream oss;
oss << "name: " << name_ << "\n";
oss << "name: " << name_ << '\n';
if (schema_) {
oss << "schema: " << schema_->schema << "\n";
oss << "debug: " << schema_->debug << "\n";
oss << "schema: " << schema_->schema << '\n';
oss << "debug: " << schema_->debug << '\n';
oss << "alias analysis kind: " << toString(schema_->schema.aliasAnalysis())
<< (schema_->schema.isDefaultAliasAnalysisKind() ? " (default)" : "") << "\n";
<< (schema_->schema.isDefaultAliasAnalysisKind() ? " (default)" : "") << '\n';
} else {
oss << "schema: (none)\n";
}

View File

@ -7,7 +7,7 @@
namespace c10 {
void FunctionSchema::dump() const {
std::cout << *this << "\n";
std::cout << *this << '\n';
}
const std::vector<Argument>& FunctionSchema::getCorrectList(SchemaArgType type) const {
@ -210,9 +210,9 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
out << schema.name();
if (!schema.overload_name().empty()) {
out << "." << schema.overload_name();
out << '.' << schema.overload_name();
}
out << "(";
out << '(';
bool seen_kwarg_only = false;
for (const auto i : c10::irange(schema.arguments().size())) {
@ -273,7 +273,7 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
}
if (need_paren) {
out << "(";
out << '(';
}
for (const auto i : c10::irange(returns.size())) {
if (i > 0) {
@ -288,7 +288,7 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
out << "...";
}
if (need_paren) {
out << ")";
out << ')';
}
return out;
}
@ -471,7 +471,7 @@ bool FunctionSchema::isForwardCompatibleWith(
if (!arguments().at(i).isForwardCompatibleWith(old.arguments().at(i))) {
if (why_not) {
why_not
<< "'" << arguments().at(i).name() << "'"
<< '\'' << arguments().at(i).name() << '\''
<< " is not forward compatible with the older version of the schema";
}
return false;
@ -511,7 +511,7 @@ bool FunctionSchema::isForwardCompatibleWith(
.isForwardCompatibleWith(old.arguments().at(i))) {
if (why_not) {
why_not << "Out argument '"
<< "'" << arguments().at(i).name()
<< '\'' << arguments().at(i).name()
<< " is not FC with the older version of the schema";
}
return false;

View File

@ -571,7 +571,7 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
if (arg.N()) {
N = std::to_string(*arg.N());
}
out << "[" << N << "]";
out << '[' << N << ']';
} else {
out << unopt_type->str();
}
@ -582,15 +582,15 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
}
if (is_opt) {
out << "?";
out << '?';
}
if (!arg.name().empty()) {
out << " " << arg.name();
out << ' ' << arg.name();
}
if (arg.default_value()) {
out << "=";
out << '=';
if ((type->kind() == c10::TypeKind::StringType ||
unopt_type->kind() == c10::TypeKind::StringType) &&
arg.default_value().value().isString()) {

View File

@ -66,7 +66,7 @@ bool operator==(const ivalue::Tuple& lhs, const ivalue::Tuple& rhs) {
}
std::ostream& operator<<(std::ostream& out, const ivalue::EnumHolder& v) {
out << v.qualifiedClassName() << "." << v.name();
out << v.qualifiedClassName() << '.' << v.name();
return out;
}
@ -526,7 +526,7 @@ std::ostream& printMaybeAnnotatedList(
!elementTypeCanBeInferredFromMembers(list_elem_type)) {
out << "annotate(" << the_list.type<c10::Type>()->annotation_str() << ", ";
printList(out, the_list.toListRef(), "[", "]", formatter);
out << ")";
out << ')';
return out;
} else {
return printList(out, the_list.toListRef(), "[", "]", formatter);
@ -538,7 +538,7 @@ std::ostream& printDict(
std::ostream& out,
const Dict& v,
const IValueFormatter& formatter) {
out << "{";
out << '{';
bool first = true;
for (const auto& pair : v) {
@ -552,7 +552,7 @@ std::ostream& printDict(
first = false;
}
out << "}";
out << '}';
return out;
}
}
@ -565,8 +565,8 @@ static std::ostream& printMaybeAnnotatedDict(
auto value_type = the_dict.type()->castRaw<DictType>()->getValueType();
if (the_dict.toGenericDict().empty() ||
!elementTypeCanBeInferredFromMembers(value_type)) {
out << "annotate(" << the_dict.type<c10::Type>()->annotation_str() << ",";
printDict(out, the_dict.toGenericDict(), formatter) << ")";
out << "annotate(" << the_dict.type<c10::Type>()->annotation_str() << ',';
printDict(out, the_dict.toGenericDict(), formatter) << ')';
} else {
return printDict(out, the_dict.toGenericDict(), formatter);
}
@ -577,7 +577,7 @@ static std::ostream& printComplex(std::ostream & out, const IValue & v) {
c10::complex<double> d = v.toComplexDouble();
IValue real(d.real()), imag(std::abs(d.imag()));
auto sign = d.imag() >= 0 ? '+' : '-';
return out << real << sign << imag << "j";
return out << real << sign << imag << 'j';
}
std::ostream& IValue::repr(
@ -605,9 +605,9 @@ std::ostream& IValue::repr(
if (static_cast<double>(i) == d) {
// -0.0 (signed zero) needs to be parsed as -0.
if (i == 0 && std::signbit(d)) {
return out << "-" << i << ".";
return out << '-' << i << '.';
}
return out << i << ".";
return out << i << '.';
}
}
auto orig_prec = out.precision();
@ -643,20 +643,20 @@ std::ostream& IValue::repr(
device_stream << v.toDevice();
out << "torch.device(";
c10::printQuotedString(out, device_stream.str());
return out << ")";
return out << ')';
}
case IValue::Tag::Generator: {
auto generator = v.toGenerator();
out << "torch.Generator(device=";
c10::printQuotedString(out, generator.device().str());
out << ", seed=" << generator.current_seed() << ")";
out << ", seed=" << generator.current_seed() << ')';
return out;
}
case IValue::Tag::GenericDict:
return printMaybeAnnotatedDict(out, v, formatter);
case IValue::Tag::Enum: {
auto enum_holder = v.toEnumHolder();
return out << enum_holder->qualifiedClassName() << "." <<
return out << enum_holder->qualifiedClassName() << '.' <<
enum_holder->name();
}
case IValue::Tag::Object: {
@ -801,7 +801,7 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
if (c == FP_NORMAL || c == FP_ZERO) {
int64_t i = static_cast<int64_t>(d);
if (static_cast<double>(i) == d) {
return out << i << ".";
return out << i << '.';
}
}
auto orig_prec = out.precision();
@ -852,7 +852,7 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
return printDict(out, v.toGenericDict(), formatter);
case IValue::Tag::PyObject: {
auto py_obj = v.toPyObject();
return out << "<PyObject at" << py_obj << ">";
return out << "<PyObject at" << py_obj << '>';
}
case IValue::Tag::Generator:
return out << "Generator";
@ -862,22 +862,22 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
// TODO we should attempt to call __str__ if the object defines it.
auto obj = v.toObject();
// print this out the way python would do it
return out << "<" << obj->name() << " object at " << obj.get() << ">";
return out << '<' << obj->name() << " object at " << obj.get() << '>';
}
case IValue::Tag::Enum: {
auto enum_holder = v.toEnumHolder();
return out << "Enum<" << enum_holder->unqualifiedClassName() << "." <<
enum_holder->name() << ">";
return out << "Enum<" << enum_holder->unqualifiedClassName() << '.' <<
enum_holder->name() << '>';
}
}
return out << "<Invalid IValue tag=" << std::to_string(static_cast<uint32_t>(v.tag)) << ">";
return out << "<Invalid IValue tag=" << std::to_string(static_cast<uint32_t>(v.tag)) << '>';
}
#undef TORCH_FORALL_TAGS
void IValue::dump() const {
std::cout << *this << "\n";
std::cout << *this << '\n';
}
std::shared_ptr<ClassType> ivalue::Object::type() const {
@ -1050,7 +1050,7 @@ c10::intrusive_ptr<ivalue::Object> ivalue::Object::deepcopy(
std::stringstream err;
err << "Cannot serialize custom bound C++ class";
if (auto qualname = type()->name()) {
err << " " << qualname->qualifiedName();
err << ' ' << qualname->qualifiedName();
}
err << ". Please define serialization methods via def_pickle() for "
"this class.";

View File

@ -211,7 +211,7 @@ struct TORCH_API OptionalType : public UnionType {
std::string str() const override {
std::stringstream ss;
ss << getElementType()->str() << "?";
ss << getElementType()->str() << '?';
return ss.str();
}
@ -240,7 +240,7 @@ struct TORCH_API OptionalType : public UnionType {
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Optional[" << getElementType()->annotation_str(printer) << "]";
ss << "Optional[" << getElementType()->annotation_str(printer) << ']';
return ss.str();
}
};
@ -906,7 +906,7 @@ struct TORCH_API ListType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "List[" << getElementType()->annotation_str(printer) << "]";
ss << "List[" << getElementType()->annotation_str(printer) << ']';
return ss.str();
}
};
@ -946,7 +946,7 @@ struct TORCH_API DictType : public SharedType {
std::string str() const override {
std::stringstream ss;
ss << "Dict(" << getKeyType()->str() << ", " << getValueType()->str()
<< ")";
<< ')';
return ss.str();
}
@ -1018,7 +1018,7 @@ struct TORCH_API FutureType
std::string str() const override {
std::stringstream ss;
ss << "Future(" << getElementType()->str() << ")";
ss << "Future(" << getElementType()->str() << ')';
return ss.str();
}
TypePtr createWithContained(
@ -1041,7 +1041,7 @@ struct TORCH_API FutureType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Future[" << getElementType()->annotation_str(printer) << "]";
ss << "Future[" << getElementType()->annotation_str(printer) << ']';
return ss.str();
}
};
@ -1060,7 +1060,7 @@ struct TORCH_API AwaitType
std::string str() const override {
std::stringstream ss;
ss << "Await(" << getElementType()->str() << ")";
ss << "Await(" << getElementType()->str() << ')';
return ss.str();
}
TypePtr createWithContained(
@ -1083,7 +1083,7 @@ struct TORCH_API AwaitType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Await[" << getElementType()->annotation_str(printer) << "]";
ss << "Await[" << getElementType()->annotation_str(printer) << ']';
return ss.str();
}
};
@ -1102,7 +1102,7 @@ struct TORCH_API RRefType
std::string str() const override {
std::stringstream ss;
ss << "RRef(" << getElementType()->str() << ")";
ss << "RRef(" << getElementType()->str() << ')';
return ss.str();
}
TypePtr createWithContained(
@ -1115,7 +1115,7 @@ struct TORCH_API RRefType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "RRef[" << getElementType()->annotation_str(printer) << "]";
ss << "RRef[" << getElementType()->annotation_str(printer) << ']';
return ss.str();
}
};

View File

@ -11,7 +11,7 @@ std::string toString(const OperatorName& opName) {
std::ostream& operator<<(std::ostream& os, const OperatorName& opName) {
os << opName.name;
if (!opName.overload_name.empty()) {
os << "." << opName.overload_name;
os << '.' << opName.overload_name;
}
return os;
}

View File

@ -65,7 +65,7 @@ VaryingShape<T> VaryingShape<T>::merge(const VaryingShape<T>& other) const {
template <typename T>
std::ostream& operator<<(std::ostream& out, const VaryingShape<T>& vs) {
out << "(";
out << '(';
if (!vs.size()) {
out << "*)";
return out;
@ -79,10 +79,10 @@ std::ostream& operator<<(std::ostream& out, const VaryingShape<T>& vs) {
if (v.has_value()) {
out << v.value();
} else {
out << "*";
out << '*';
}
}
out << ")";
out << ')';
return out;
}
@ -105,7 +105,7 @@ std::ostream& operator<<(
}
auto sizes_opt = ss.sizes();
os << "(";
os << '(';
for (size_t i = 0; i < rank_opt.value(); i++) {
if (i > 0) {
os << ", ";
@ -113,10 +113,10 @@ std::ostream& operator<<(
if(sizes_opt.has_value() && sizes_opt.value()[i].is_static()) {
os << sizes_opt.value()[i];
} else {
os << "*";
os << '*';
}
}
os << ")";
os << ')';
return os;
}
@ -131,17 +131,17 @@ std::ostream& operator<<(std::ostream& os, const ShapeSymbol& s) {
}
std::ostream& operator<<(std::ostream& os, const Stride& s) {
os << "{";
os << '{';
if (s.stride_index_.has_value()) {
os << *s.stride_index_;
} else {
os << "*";
os << '*';
}
os << ":";
os << ':';
if (s.stride_.has_value()) {
os << *s.stride_;
} else {
os << "*";
os << '*';
}
os << '}';
return os;

View File

@ -67,7 +67,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
bool has_valid_strides_info = ndim > 0 &&
value->strides().isComplete() && value->strides().size() == ndim;
out << "(";
out << '(';
size_t i = 0;
bool symbolic = type_verbosity() == TypeVerbosity::Symbolic;
for (i = 0; i < *ndim; ++i) {
@ -79,7 +79,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
} else if (symbolic) {
out << value->symbolic_sizes().at(i);
} else {
out << "*";
out << '*';
}
}
if (has_valid_strides_info &&
@ -91,7 +91,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
}
out << value->strides()[i].value();
}
out << "]";
out << ']';
}
if (type_verbosity() >= TypeVerbosity::Full) {
if (value->requiresGrad()) {
@ -107,12 +107,12 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << "device=" << *value->device();
}
}
out << ")";
out << ')';
} else {
if (type_verbosity() >= TypeVerbosity::Full) {
size_t i = 0;
if (value->requiresGrad()) {
out << "("
out << '('
<< "requires_grad=" << *value->requiresGrad();
i++;
}
@ -120,7 +120,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << ((i++ > 0) ? ", " : "(") << "device=" << *value->device();
}
if (i > 0) {
out << ")";
out << ')';
}
}
}
@ -133,18 +133,18 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << *prim << "[]";
} else if (t.kind() == TypeKind::OptionalType) {
auto prim = t.castRaw<OptionalType>()->getElementType();
out << *prim << "?";
out << *prim << '?';
} else if(t.kind() == TypeKind::FutureType) {
auto elem = t.castRaw<FutureType>()->getElementType();
out << "Future[" << *elem << "]";
out << "Future[" << *elem << ']';
} else if(t.kind() == TypeKind::RRefType) {
auto elem = t.castRaw<RRefType>()->getElementType();
out << "RRef[" << *elem << "]";
out << "RRef[" << *elem << ']';
} else if(auto tup = t.cast<TupleType>()) {
if (tup->schema()) {
out << "NamedTuple";
}
out << "(";
out << '(';
for(size_t i = 0; i < tup->elements().size(); ++i) {
if(i > 0)
out << ", ";
@ -160,7 +160,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << *(tup->elements()[i]);
}
}
out << ")";
out << ')';
} else if (t.kind() == TypeKind::FunctionType) {
out << "Function";
} else {
@ -475,7 +475,7 @@ std::optional<TypePtr> unifyTypeList(
why_not << "Could not unify type list since element " << i << " of type "
<< elements.at(i)->repr_str()
<< " did not match the types before it ("
<< ret_type->repr_str() << ")";
<< ret_type->repr_str() << ')';
return std::nullopt;
}
ret_type = *maybe_unified;
@ -907,13 +907,13 @@ std::string TupleType::str() const {
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
ss << name()->qualifiedName();
} else {
ss << "(";
ss << '(';
for(size_t i = 0; i < elements().size(); ++i) {
if(i > 0)
ss << ", ";
ss << elements()[i]->str();
}
ss << ")";
ss << ')';
}
return ss.str();
}
@ -1003,8 +1003,8 @@ bool InterfaceType::isSubTypeImpl(
*why_not << "Method on interface '" << lhs.repr_str()
<< "' (1) is not compatible with interface '"
<< rhs.repr_str() << "' (2)\n"
<< " (1) " << *self_schema << "\n"
<< " (2) " << schema << "\n";
<< " (1) " << *self_schema << '\n'
<< " (2) " << schema << '\n';
return false;
}
return false;
@ -1078,7 +1078,7 @@ SymbolicShape SymbolicShape::merge(const SymbolicShape& other) const {
}
void SymbolicShape::dump() const {
std::cout << *this << "\n";
std::cout << *this << '\n';
}
bool EnumType::isSubtypeOfExt(const Type& rhs, std::ostream* why_not) const {

View File

@ -205,9 +205,9 @@ UnionType::UnionType(std::vector<TypePtr> reference, TypeKind kind) : SharedType
for (const auto i : c10::irange(reference.size())) {
msg << reference[i]->repr_str();
if (i > 0) {
msg << ",";
msg << ',';
}
msg << " ";
msg << ' ';
}
msg << "} has the single type " << types_[0]->repr_str()
<< ". Use the common supertype instead of creating a Union"

View File

@ -80,7 +80,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
stream << buf[i];
}
stream << "]";
stream << ']';
return stream;
}

View File

@ -55,7 +55,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
stream << buf[i];
}
stream << "]";
stream << ']';
return stream;
}

View File

@ -175,17 +175,24 @@ void CUDAGraph::instantiate() {
// Trailing NULL, NULL, 0 arguments were recommended by Cuda driver people,
// who prefer not to report error message through these arguments moving forward
// (they prefer return value, or errors on api calls internal to the capture)
#if (defined(CUDA_VERSION) && CUDA_VERSION >= 12000)
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, 0));
// ROCM appears to fail with HIP error: invalid argument
#if (defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && !defined(USE_ROCM)
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, cudaGraphInstantiateFlagUseNodePriority));
#else
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
#endif
//Since ROCm 6.2, we want to go down this path as hipGraphExecDestroy in the destructor will not immediately free the memory.
//It will wait for the next sync operation. cudaGraphInstantiateFlagAutoFreeOnLaunch will add async frees after graph launch.
} else {
#if !defined(USE_ROCM)
AT_CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec_,
graph_,
cudaGraphInstantiateFlagAutoFreeOnLaunch | cudaGraphInstantiateFlagUseNodePriority));
#else
AT_CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec_,
graph_,
cudaGraphInstantiateFlagAutoFreeOnLaunch));
#endif
}
has_graph_exec_ = true;
}

View File

@ -411,16 +411,16 @@ std::string CUDAHooks::showConfig() const {
// HIP_VERSION value format was changed after ROCm v4.2 to include the patch number
if(v < 500) {
// If major=xx, minor=yy then format -> xxyy
oss << (v / 100) << "." << (v % 10);
oss << (v / 100) << '.' << (v % 10);
}
else {
// If major=xx, minor=yy & patch=zzzzz then format -> xxyyzzzzz
oss << (v / 10000000) << "." << (v / 100000 % 100) << "." << (v % 100000);
oss << (v / 10000000) << '.' << (v / 100000 % 100) << '.' << (v % 100000);
}
#else
oss << (v / 1000) << "." << (v / 10 % 100);
oss << (v / 1000) << '.' << (v / 10 % 100);
if (v % 10 != 0) {
oss << "." << (v % 10);
oss << '.' << (v % 10);
}
#endif
};
@ -431,16 +431,16 @@ std::string CUDAHooks::showConfig() const {
oss << " - HIP Runtime ";
#endif
printCudaStyleVersion(runtimeVersion);
oss << "\n";
oss << '\n';
// TODO: Make HIPIFY understand CUDART_VERSION macro
#if !defined(USE_ROCM)
if (runtimeVersion != CUDART_VERSION) {
oss << " - Built with CUDA Runtime ";
printCudaStyleVersion(CUDART_VERSION);
oss << "\n";
oss << '\n';
}
oss << " - NVCC architecture flags: " << NVCC_FLAGS_EXTRA << "\n";
oss << " - NVCC architecture flags: " << NVCC_FLAGS_EXTRA << '\n';
#endif
#if !defined(USE_ROCM)
@ -448,9 +448,9 @@ std::string CUDAHooks::showConfig() const {
auto printCudnnStyleVersion = [&](size_t v) {
oss << (v / 1000) << "." << (v / 100 % 10);
oss << (v / 1000) << '.' << (v / 100 % 10);
if (v % 100 != 0) {
oss << "." << (v % 100);
oss << '.' << (v % 100);
}
};
@ -461,22 +461,22 @@ std::string CUDAHooks::showConfig() const {
if (cudnnCudartVersion != CUDART_VERSION) {
oss << " (built against CUDA ";
printCudaStyleVersion(cudnnCudartVersion);
oss << ")";
oss << ')';
}
oss << "\n";
oss << '\n';
if (cudnnVersion != CUDNN_VERSION) {
oss << " - Built with CuDNN ";
printCudnnStyleVersion(CUDNN_VERSION);
oss << "\n";
oss << '\n';
}
#endif
#else
// TODO: Check if miopen has the functions above and unify
oss << " - MIOpen " << MIOPEN_VERSION_MAJOR << "." << MIOPEN_VERSION_MINOR << "." << MIOPEN_VERSION_PATCH << "\n";
oss << " - MIOpen " << MIOPEN_VERSION_MAJOR << '.' << MIOPEN_VERSION_MINOR << '.' << MIOPEN_VERSION_PATCH << '\n';
#endif
#if AT_MAGMA_ENABLED()
oss << " - Magma " << MAGMA_VERSION_MAJOR << "." << MAGMA_VERSION_MINOR << "." << MAGMA_VERSION_MICRO << "\n";
oss << " - Magma " << MAGMA_VERSION_MAJOR << '.' << MAGMA_VERSION_MINOR << '.' << MAGMA_VERSION_MICRO << '\n';
#endif
return oss.str();

View File

@ -42,7 +42,7 @@ static inline void launch_jitted_vectorized_kernel_dynamic(
// The cache key includes all the parameters to generate_code + vec_size + dev_idx
std::stringstream ss;
ss << nInputs << "_" << nOutputs << f;
ss << nInputs << '_' << nOutputs << f;
ss << f_inputs_type_str << compute_type_str << result_type_str;
ss << static_cast<int>(at::cuda::jit::BinaryFuncVariant::NoScalar);
ss << extra_args_types;
@ -144,7 +144,7 @@ static inline void launch_jitted_unrolled_kernel_dynamic(
// The cache key includes all the parameters to generate_code + dev_idx
std::stringstream ss;
ss << nInputs << "_" << nOutputs << f;
ss << nInputs << '_' << nOutputs << f;
ss << f_inputs_type_str << compute_type_str << result_type_str;
ss << contiguous << dynamic_casting;
ss << static_cast<int>(at::cuda::jit::BinaryFuncVariant::NoScalar);

View File

@ -52,10 +52,10 @@ TuningContext* getTuningContext() {
std::ostream& operator<<(std::ostream& stream, const ResultEntry& entry) {
static const bool blaslog = c10::utils::get_env("PYTORCH_TUNABLEOP_BLAS_LOG") == "1";
if (!blaslog) {
return stream << entry.key_ << "," << entry.time_;
return stream << entry.key_ << ',' << entry.time_;
}
else {
return stream << entry.key_ << "," << entry.time_ << ",BLAS_PARAMS: " << entry.blas_sig_;
return stream << entry.key_ << ',' << entry.time_ << ",BLAS_PARAMS: " << entry.blas_sig_;
}
}
@ -156,10 +156,10 @@ void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std
if (isNew) {
static const bool blaslog = c10::utils::get_env("PYTORCH_TUNABLEOP_BLAS_LOG") == "1";
if (!blaslog) {
untuned_file << op_signature << "," << params_signature << std::endl;
untuned_file << op_signature << ',' << params_signature << std::endl;
}
else {
untuned_file << op_signature << "," << params_signature << ",BLAS_PARAMS: " << blas_signature << std::endl;
untuned_file << op_signature << ',' << params_signature << ",BLAS_PARAMS: " << blas_signature << std::endl;
}
TUNABLE_LOG3("Untuned,", op_signature, ",", params_signature);
}
@ -201,7 +201,7 @@ void TuningResultsManager::InitRealtimeAppend(const std::string& filename, const
if(!file_exists || file_empty) {
for(const auto& [key, val] : validators) {
(*realtime_out_) << "Validator," << key << "," << val << std::endl;
(*realtime_out_) << "Validator," << key << ',' << val << std::endl;
realtime_out_->flush();
}
validators_written_ = true;
@ -219,7 +219,7 @@ void TuningResultsManager::AppendResultLine(const std::string& op_sig, const std
return;
}
(*realtime_out_) << op_sig << "," << param_sig << "," << result << std::endl;
(*realtime_out_) << op_sig << ',' << param_sig << ',' << result << std::endl;
realtime_out_->flush(); //ensure immediate write to disk
TUNABLE_LOG3("Realtime append: ", op_sig, "(", param_sig, ") -> ", result);

View File

@ -93,31 +93,31 @@ std::string cudnnTypeToString(cudnnDataType_t dtype) {
return "CUDNN_DATA_UINT8x4";
default:
std::ostringstream oss;
oss << "(unknown data-type " << static_cast<int>(dtype) << ")";
oss << "(unknown data-type " << static_cast<int>(dtype) << ')';
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << "\n";
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << '\n';
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
int strideA[CUDNN_DIM_MAX];
cudnnDataType_t dtype{};
cudnnGetTensorNdDescriptor(d.desc(), CUDNN_DIM_MAX, &dtype, &nbDims, dimA, strideA);
out << " type = " << cudnnTypeToString(dtype) << "\n";
out << " nbDims = " << nbDims << "\n";
out << " type = " << cudnnTypeToString(dtype) << '\n';
out << " nbDims = " << nbDims << '\n';
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << "\n";
out << '\n';
out << " strideA = ";
for (auto i : ArrayRef<int>{strideA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << "\n";
out << '\n';
return out;
}
@ -168,27 +168,27 @@ std::string cudnnMemoryFormatToString(cudnnTensorFormat_t tformat) {
return "CUDNN_TENSOR_NHWC";
default:
std::ostringstream oss;
oss << "(unknown cudnn tensor format " << static_cast<int>(tformat) << ")";
oss << "(unknown cudnn tensor format " << static_cast<int>(tformat) << ')';
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const FilterDescriptor& d) {
out << "FilterDescriptor " << static_cast<void*>(d.desc()) << "\n";
out << "FilterDescriptor " << static_cast<void*>(d.desc()) << '\n';
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
cudnnDataType_t dtype{};
cudnnTensorFormat_t tformat{};
cudnnGetFilterNdDescriptor(d.desc(), CUDNN_DIM_MAX, &dtype, &tformat, &nbDims, dimA);
out << " type = " << cudnnTypeToString(dtype) << "\n";
out << " tensor_format = " << cudnnMemoryFormatToString(tformat) << "\n";
out << " nbDims = " << nbDims << "\n";
out << " type = " << cudnnTypeToString(dtype) << '\n';
out << " tensor_format = " << cudnnMemoryFormatToString(tformat) << '\n';
out << " nbDims = " << nbDims << '\n';
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << "\n";
out << '\n';
return out;
}

View File

@ -346,15 +346,15 @@ void foreachTensorInplaceWithFlag(std::vector<IValue>& args, int64_t begin, int6
}
std::ostream& operator<< (std::ostream& os, const DynamicLayer& layer) {
os << layer.layerId() << ":" << layer.key();
os << layer.layerId() << ':' << layer.key();
return os;
}
std::ostream& operator<< (std::ostream& os, const std::vector<DynamicLayer>& dls) {
os << "DynamicLayerStack[ ";
for (const auto& layer : dls) {
os << layer << " ";
os << layer << ' ';
}
os << "]";
os << ']';
return os;
}

View File

@ -22,7 +22,7 @@ void dumpTensor(std::ostream& ss, const Tensor& tensor) {
if (batched) {
ss << "Batched[lvl=" << batched->level() << " dim=" << batched->bdim() << ", ";
dumpTensor(ss, batched->value());
ss << "]";
ss << ']';
return;
}
ss << "Tensor" << tensor.sizes();
@ -36,7 +36,7 @@ void dumpTensor(std::ostream& ss, const Tensor& tensor) {
ss << "dead, ";
}
dumpTensor(ss, wrapped->value());
ss << "]";
ss << ']';
}
void TensorWrapper::refreshMetadata() {

View File

@ -73,32 +73,32 @@ std::string miopenTypeToString(miopenDataType_t dtype) {
return "miopenBFloat16";
default:
std::ostringstream oss;
oss << "(unknown data-type " << static_cast<int>(dtype) << ")";
oss << "(unknown data-type " << static_cast<int>(dtype) << ')';
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << "\n";
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << '\n';
int nbDims = 0;
int dimA[MIOPEN_DIM_MAX];
int strideA[MIOPEN_DIM_MAX];
miopenDataType_t dtype;
miopenGetTensorDescriptorSize(d.desc(), &nbDims);
miopenGetTensorDescriptor(d.desc(), &dtype, dimA, strideA);
out << " type = " << miopenTypeToString(dtype) << "\n";
out << " nbDims = " << nbDims << "\n";
out << " type = " << miopenTypeToString(dtype) << '\n';
out << " nbDims = " << nbDims << '\n';
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << "\n";
out << '\n';
out << " strideA = ";
for (auto i : ArrayRef<int>{strideA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << "\n";
out << '\n';
return out;
}

View File

@ -91,7 +91,7 @@ struct OperationInfo : BaseInfo {
std::stringstream kernelStr;
kernelStr << kernelName;
for (const Tensor& tensor : tensors) {
kernelStr << ":" << BaseInfo::buildTensorString(tensor, includeBufferId);
kernelStr << ':' << BaseInfo::buildTensorString(tensor, includeBufferId);
}
return kernelStr.str();
}

View File

@ -39,9 +39,9 @@ std::string BaseInfo::buildTensorString(const Tensor& tensor, bool includeBuffer
// see comments for INCLUDE_BUFFER_ID
if (includeBufferId && deviceType == at::kMPS) {
id<MTLBuffer> buffer = __builtin_bit_cast(id<MTLBuffer>, tensor.storage().data());
tensorStr << "(buf#" << (getIMPSAllocator()->getBufferId(buffer)) << ":" << buffer.retainCount << ")";
tensorStr << "(buf#" << (getIMPSAllocator()->getBufferId(buffer)) << ':' << buffer.retainCount << ')';
}
tensorStr << ":" << tensor.scalar_type() << tensor.sizes();
tensorStr << ':' << tensor.scalar_type() << tensor.sizes();
return tensorStr.str();
} else {
return "undefined";

View File

@ -167,7 +167,7 @@ static void check_args(CheckedFrom c, IntArrayRef args, size_t expected_size, co
std::stringstream ss;
ss << arg_name << " should be greater than zero but got (";
std::copy(args.begin(), args.end() - 1, std::ostream_iterator<int>(ss,", "));
ss << args.back() << ")" << " (while checking arguments for " << c << ")";
ss << args.back() << ")" << " (while checking arguments for " << c << ')';
TORCH_CHECK(false, ss.str());
}
}

View File

@ -639,7 +639,7 @@ static std::ostream& operator<<(std::ostream & out, const ConvParams<T>& params)
<< " deterministic = " << params.deterministic
<< " cudnn_enabled = " << params.cudnn_enabled
<< " allow_tf32 = " << params.allow_tf32
<< "}";
<< '}';
return out;
}

View File

@ -1936,7 +1936,7 @@ static bool should_fold(const Tensor& tensor1, const Tensor& tensor2, bool has_o
// We order the tensors. t1 will be the larger tensor
// We can always transpose tensor2 as the dimensions are always >= 1 (precondition from matmul)
// and tensor1_larger iff tensor2.dim() > tensor1.dim()
// and tensor1_larger iff tensor2.dim() > tensor1.dim(9
const auto t1 = tensor1_larger ? MaybeOwned<Tensor>::borrowed(tensor1)
: MaybeOwned<Tensor>::owned(tensor2.mT());
const int64_t dim_t1 = t1->dim();
@ -1948,11 +1948,20 @@ static bool should_fold(const Tensor& tensor1, const Tensor& tensor2, bool has_o
return false;
}
// If we require a gradient, we should fold to minimize backward memory usage - even if this
// leads to a copy in forward because is needed in backward,
// only time we avoid this strict pre-allocated memory usage (has_out = True)
bool requires_grad = tensor1.requires_grad() || tensor2.requires_grad();
if (requires_grad && !has_out) {
// In this case we *do* incur in an extra copy to avoid creating an unnecessary large tensor in the backward
// Suppose we don't fold here. Let t1.shape = [b, m, n] t2.shape = [n, k] like in a transformer
// t2 will be expanded to a tensor of shape [b, n, k] and then we do t1.bmm(t2_expanded)
// The issue appears in the backward.
// The output gradient g of this operation would have shape [b, m, k]
// The backward wrt. t2 of bmm would be given by t1.mH @ g, which has shape [b, n, k]
// Then, the backward of expand is simply `sum(0)`. As such, we are instantiating a tensor
// of shape [b, n, k] unnecessarily, which may cause a large memory footprint, and in the
// worst case, an OOM
bool t2_requires_grad = tensor1_larger ? tensor2.requires_grad() : tensor1.requires_grad();
if (t2_requires_grad && !has_out) {
// We should be checking !at::GradMode::is_enabled(), but apparently
// this regresses performance in some cases:
// https://github.com/pytorch/pytorch/issues/118548#issuecomment-1916022394
return true;
}

View File

@ -847,7 +847,7 @@ Tensor stft(const Tensor& self, const int64_t n_fft, const std::optional<int64_t
<< ", hop_length=" << hop_length << ", win_length=" << win_length \
<< ", window="; \
if (window.defined()) { \
SS << window.toString() << "{" << window.sizes() << "}"; \
SS << window.toString() << '{' << window.sizes() << '}'; \
} else { \
SS << "None"; \
} \
@ -1046,7 +1046,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const std::optional<int64_
<< ", hop_length=" << hop_length << ", win_length=" << win_length \
<< ", window="; \
if (window.defined()) { \
SS << window.toString() << "{" << window.sizes() << "}"; \
SS << window.toString() << '{' << window.sizes() << '}'; \
} else { \
SS << "None"; \
} \

View File

@ -1087,7 +1087,8 @@ TORCH_IMPL_FUNC(index_copy_out)
result.copy_(self);
// See Note [Enabling Deterministic Operations]
if (result.is_cuda() && globalContext().deterministicAlgorithms()) {
if ((result.is_cuda() || result.is_xpu()) &&
globalContext().deterministicAlgorithms()) {
torch::List<std::optional<Tensor>> indices;
indices.resize(dim + 1);
indices.set(dim, index);

View File

@ -523,7 +523,7 @@ Tensor _functional_assert_async_msg_cpu(
}
void _print(std::string_view s) {
std::cout << s << "\n";
std::cout << s << '\n';
}
// Sorting-based algorithm for isin(); used when the number of test elements is

View File

@ -904,19 +904,11 @@ Tensor mvlgamma(const Tensor& self, int64_t p) {
return args.lgamma_().sum(-1).add_(p2_sub_p * std::log(c10::pi<double>) * QUARTER);
}
// since mvlgamma_ has different signature from its
// out and functional variant, we explicitly
// define it (instead of using structured kernel).
Tensor& mvlgamma_(Tensor& self, int64_t p) {
mvlgamma_check(self, p);
Tensor args = native::arange(
-p *HALF + HALF,
HALF,
HALF,
optTypeMetaToScalarType(self.options().dtype_opt()),
self.options().layout_opt(),
self.options().device_opt(),
self.options().pinned_memory_opt());
args = args.add(self.unsqueeze(-1));
const auto p2_sub_p = static_cast<double>(p * (p - 1));
return self.copy_(args.lgamma_().sum(-1).add_(p2_sub_p * std::log(c10::pi<double>) * QUARTER));
return at::mvlgamma_out(self, self, p);
}
Tensor& mvlgamma_out(const Tensor& self, int64_t p, Tensor& result) {

View File

@ -78,9 +78,18 @@ __global__ void EmbeddingBag_updateOutputKernel_max(
scalar_t weightFeatMax = 0;
int64_t bag_size_ = 0;
int64_t maxWord = -1;
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
index_t input_idx = input[emb];
has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");
for (int64_t emb = begin; emb < end; emb++) {
bool pad = (input[emb] == padding_idx);
CUDA_KERNEL_ASSERT(input[emb] < numRows);
const int64_t weightRow = input[emb] * weight_stride0;
scalar_t weightValue = weightFeat[weightRow];
if (bag_size_ == 0 || weightValue > weightFeatMax) {
@ -129,10 +138,19 @@ __global__ void EmbeddingBag_updateOutputKernel_sum_mean(
CUDA_KERNEL_ASSERT(end >= begin);
accscalar_t weightFeatSum = 0;
int64_t bag_size_ = 0;
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
index_t input_idx = input[emb];
has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");
for (int64_t emb = begin; emb < end; emb++) {
index_t input_idx = input[emb];
bool pad = (input_idx == padding_idx);
CUDA_KERNEL_ASSERT(0 <= input_idx && input_idx < numRows);
const int64_t weightRow = input_idx * weight_stride0;
scalar_t weightValue = weightFeat[weightRow];
weightValue = pad ? static_cast<scalar_t>(0) : weightValue;

View File

@ -78,9 +78,9 @@ _mx8_mx8_bf16_grouped_mm_fbgemm(
const Tensor& mat_a,
const Tensor& mat_b,
const Tensor& scale_a,
const SwizzleType& swizzle_a,
const SwizzleType swizzle_a,
const Tensor& scale_b,
const SwizzleType& swizzle_b,
const SwizzleType swizzle_b,
const std::optional<at::Tensor>& offs,
Tensor& out) {
const bool a_is_2d = mat_a.dim() == 2;

View File

@ -11,7 +11,7 @@ static inline std::ostream& operator<<(std::ostream& out, dim3 dim) {
if (dim.y == 1 && dim.z == 1) {
out << dim.x;
} else {
out << "[" << dim.x << "," << dim.y << "," << dim.z << "]";
out << '[' << dim.x << ',' << dim.y << ',' << dim.z << ']';
}
return out;
}
@ -27,7 +27,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "input_mult=[";
for (int i = 0; i < 3; i++) {
if (i != 0) {
out << ",";
out << ',';
}
out << config.input_mult[i];
}
@ -35,7 +35,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "output_mult=[";
for (int i = 0; i < 2; i++) {
if (i != 0) {
out << ",";
out << ',';
}
out << config.output_mult[i];
}
@ -49,7 +49,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "block=" << config.block() << ", ";
out << "grid=" << config.grid() << ", ";
out << "global_memory_size=" << config.global_memory_size();
out << ")";
out << ')';
return out;
}

View File

@ -740,7 +740,12 @@ _scaled_rowwise_rowwise(
TORCH_CHECK_VALUE(scale_a.numel() == mat_a.size(0) && scale_a.scalar_type() == kFloat, "scale_a must have ", mat_a.size(0), " Float elements, got ", scale_a.numel())
TORCH_CHECK_VALUE(scale_b.numel() == mat_b.size(1) && scale_b.scalar_type() == kFloat, "scale_b must have ", mat_b.size(1), " Float elements, got ", scale_b.numel())
TORCH_CHECK_VALUE(scale_a.stride(1) == 1, "expected scale_a.stride(1) to be 1, but got ", scale_a.stride(1));
// if we have a scale of shape [256, 1] (say), then stride can be [1, 0] - handle this case
TORCH_CHECK_VALUE(
scale_a.stride(1) == 1 ||
scale_a.size(1) == 1,
"expected scale_a.stride(1) to be 1, but got ", scale_a.stride(1)
);
TORCH_CHECK_VALUE(scale_b.stride(1) == 1, "expected scale_b.stride(1) to be 1, but got ", scale_b.stride(1));
auto scaling_choice_a = ScalingType::RowWise;

View File

@ -364,9 +364,9 @@ void f8f8bf16_grouped_gemm_impl_sm90(
// reinterpret_cast<ProblemShape::UnderlyingProblemShape*>(
// stride_output_h + group_count);
// std::cout << "PTRS " << mat_a.data_ptr() << " " << mat_b.data_ptr() << "
// std::cout << "PTRS " << mat_a.data_ptr() << ' ' << mat_b.data_ptr() << "
// "
// << out.data_ptr() << " " << scale_a.data_ptr() << " "
// << out.data_ptr() << ' ' << scale_a.data_ptr() << ' '
// << scale_b.data_ptr() << "\n";
// for (int i = 0; i < group_count; i++) {
// std::cout << "A " << (void*)inputA_ptrs_h[i] << "\n";

View File

@ -1057,14 +1057,14 @@ std::string generate_code(
// TODO these arrays are potentially of the different types, use function
// traits to determine the types
declare_load_arrays << f_inputs_type << " arg" << std::to_string(i)
<< "[" << std::to_string(thread_work_size) << "];\n";
<< '[' << std::to_string(thread_work_size) << "];\n";
}
env.s("declare_load_arrays", declare_load_arrays.str());
std::stringstream declare_store_arrays;
for (int i = 0; i < nOutputs; i++) {
declare_store_arrays << result_type << " out" << std::to_string(i)
<< "[" << std::to_string(thread_work_size) << "];\n";
<< '[' << std::to_string(thread_work_size) << "];\n";
}
env.s("declare_store_arrays", declare_store_arrays.str());
@ -1217,7 +1217,7 @@ std::string generate_code(
for (const auto i : c10::irange(nInputs)){
auto i_string = std::to_string(i);
vector_inputs << "auto * input" << i_string <<
" = reinterpret_cast<const scalar_t*>(data[" << i_string << "+" << nOutputs << "])" <<
" = reinterpret_cast<const scalar_t*>(data[" << i_string << '+' << nOutputs << "])" <<
" + block_work_size * idx;\n";
}
env.s("vector_inputs", vector_inputs.str());
@ -1543,17 +1543,17 @@ NvrtcFunction jit_pwise_function(
// Constructs file path by appending constructed cubin name to cache path
std::stringstream ss;
ss << *cache_dir << "/";
ss << *cache_dir << '/';
ss << kernel_name;
#ifdef USE_ROCM
ss << "_arch" << prop->gcnArchName;
#else
ss << "_arch" << cuda_major << "." << cuda_minor;
ss << "_arch" << cuda_major << '.' << cuda_minor;
#endif
ss << "_nvrtc" << nvrtc_major << "." << nvrtc_minor;
ss << "_nvrtc" << nvrtc_major << '.' << nvrtc_minor;
ss << (compile_to_sass ? "_sass" : "_ptx");
ss << "_" << code.length();
ss << "_" << hash_code;
ss << '_' << code.length();
ss << '_' << hash_code;
file_path = ss.str();
std::ifstream readin{file_path, std::ios::in | std::ifstream::binary};

View File

@ -82,15 +82,15 @@ namespace native {
std::ostream& operator<<(std::ostream& out, const ConvolutionParams& params) {
out << "ConvolutionParams \n"
<< " memory_format = " << params.memory_format << "\n"
<< " data_type = " << cudnnTypeToString(params.dataType) << "\n"
<< " padding = " << ArrayRef<int>{params.padding} << "\n"
<< " stride = " << ArrayRef<int>{params.stride} << "\n"
<< " dilation = " << ArrayRef<int>{params.dilation} << "\n"
<< " groups = " << params.groups << "\n"
<< " memory_format = " << params.memory_format << '\n'
<< " data_type = " << cudnnTypeToString(params.dataType) << '\n'
<< " padding = " << ArrayRef<int>{params.padding} << '\n'
<< " stride = " << ArrayRef<int>{params.stride} << '\n'
<< " dilation = " << ArrayRef<int>{params.dilation} << '\n'
<< " groups = " << params.groups << '\n'
<< " deterministic = " << (params.deterministic ? "true" : "false")
<< "\n"
<< " allow_tf32 = " << (params.allow_tf32 ? "true" : "false") << "\n";
<< '\n'
<< " allow_tf32 = " << (params.allow_tf32 ? "true" : "false") << '\n';
return out;
}
@ -173,16 +173,16 @@ std::string repro_from_args(const ConvolutionParams& params) {
at::globalContext().float32Precision(
at::Float32Backend::CUDA, at::Float32Op::MATMUL) ==
at::Float32Precision::TF32)
<< "\n";
<< '\n';
ss << "torch.backends.cudnn.benchmark = "
<< pybool(at::globalContext().benchmarkCuDNN()) << "\n";
<< pybool(at::globalContext().benchmarkCuDNN()) << '\n';
ss << "torch.backends.cudnn.deterministic = " << pybool(params.deterministic)
<< "\n";
<< '\n';
ss << "torch.backends.cudnn.allow_tf32 = " << pybool(params.allow_tf32)
<< "\n";
<< '\n';
ss << "data = torch.randn(" << ArrayRef<int>(params.input_size, dim)
<< ", dtype=" << full_dtype << ", ";
ss << "device='cuda', requires_grad=True)" << to_channels_last << "\n";
ss << "device='cuda', requires_grad=True)" << to_channels_last << '\n';
ss << "net = torch.nn.Conv" << dim - 2 << "d(" << in_channels << ", "
<< out_channels << ", ";
ss << "kernel_size=" << ArrayRef<int>(&params.weight_size[2], dim - 2)
@ -192,7 +192,7 @@ std::string repro_from_args(const ConvolutionParams& params) {
ss << "dilation=" << ArrayRef<int>(params.dilation, dim - 2) << ", ";
ss << "groups=" << params.groups << ")\n";
ss << "net = net.cuda()." << partial_dtype << "()" << to_channels_last
<< "\n";
<< '\n';
ss << "out = net(data)\n";
ss << "out.backward(torch.randn_like(out))\n";
ss << "torch.cuda.synchronize()\n\n";

View File

@ -93,11 +93,10 @@ std::ostream& operator<<(std::ostream& out, const ConvolutionArgs& args) {
<< "input: " << args.idesc // already has a trailing newline
<< "output: " << args.odesc // already has a trailing newline
<< "weight: " << args.wdesc // already has a trailing newline
<< "Pointer addresses: "
<< "\n"
<< " input: " << args.input.const_data_ptr() << "\n"
<< " output: " << args.output.const_data_ptr() << "\n"
<< " weight: " << args.weight.const_data_ptr() << "\n";
<< "Pointer addresses: " << '\n'
<< " input: " << args.input.const_data_ptr() << '\n'
<< " output: " << args.output.const_data_ptr() << '\n'
<< " weight: " << args.weight.const_data_ptr() << '\n';
return out;
}

View File

@ -115,7 +115,7 @@ std::ostream& operator<<(
std::copy(
strides.begin(), strides.end() - 1, std::ostream_iterator<int>(oss, ","));
oss << sizes.back();
output << oss.str() << "}";
output << oss.str() << '}';
return output;
}

View File

@ -53,7 +53,7 @@ std::ostream& operator<<(std::ostream& out, const ConvParams& params) {
<< " transposed = " << params.transposed
<< " output_padding = " << IntArrayRef{params.output_padding}
<< " groups = " << params.groups << " benchmark = " << params.benchmark
<< " deterministic = " << params.deterministic << "}";
<< " deterministic = " << params.deterministic << '}';
return out;
}

View File

@ -0,0 +1,342 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/BlasBackend.h>
#include <ATen/WrapDimUtilsMulti.h>
#include <ATen/ceil_div.h>
#include <ATen/native/Resize.h>
#include <ATen/native/mkldnn/xpu/detail/oneDNN.h>
#include <ATen/native/xpu/Blas.h>
#include <torch/library.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/_addmm_activation_native.h>
#include <ATen/ops/_efficientzerotensor.h>
#include <ATen/ops/_scaled_mm_native.h>
#include <ATen/ops/_unsafe_view_native.h>
#include <ATen/ops/abs.h>
#include <ATen/ops/addmm_native.h>
#include <ATen/ops/addmv_native.h>
#include <ATen/ops/baddbmm_native.h>
#include <ATen/ops/bmm_native.h>
#include <ATen/ops/copy_native.h>
#include <ATen/ops/dot_native.h>
#include <ATen/ops/empty.h>
#include <ATen/ops/empty_strided.h>
#include <ATen/ops/gelu.h>
#include <ATen/ops/max.h>
#include <ATen/ops/mm_native.h>
#include <ATen/ops/mul.h>
#include <ATen/ops/ones.h>
#include <ATen/ops/relu.h>
#include <ATen/ops/scalar_tensor_native.h>
#include <ATen/ops/vdot_native.h>
#endif
namespace at::native {
using at::blas::ScalingType;
using at::blas::SwizzleType;
namespace {
/*
* Scaling Type Determination:
* ---------------------------
* Conditions and corresponding Scaling Types:
*
* - If scale tensor is `Float8_e8m0fnu` or `Float8_e4m3fn`:
* - Returns BlockWise (with additional size checks).
*
* - Else if scale.numel() == 1:
* - Returns TensorWise.
*
* - Else if scale.dim() == 2 && scale.size(0) == outer_dim && scale.size(1) ==
* 1:
* - Returns RowWise.
*
* - Otherwise:
* - Returns Error.
*/
bool is_tensorwise_scaling(const at::Tensor& t, const at::Tensor& scale) {
return at::isFloat8Type(t.scalar_type()) &&
scale.scalar_type() == at::kFloat && scale.numel() == 1;
}
bool is_rowwise_scaling(const at::Tensor& t, const at::Tensor& scale) {
return (
at::isFloat8Type(t.scalar_type()) && scale.scalar_type() == at::kFloat &&
scale.dim() == 2 && scale.size(0) == t.size(0) && scale.size(1) == 1 &&
scale.is_contiguous());
}
bool is_desired_scaling(
const at::Tensor& t,
const at::Tensor& scale,
ScalingType desired_scaling) {
auto result = desired_scaling == ScalingType::TensorWise
? is_tensorwise_scaling(t, scale)
: is_rowwise_scaling(t, scale);
return result;
}
std::pair<ScalingType, ScalingType> get_joint_scaling(
std::initializer_list<std::pair<ScalingType, ScalingType>> options,
const at::Tensor& a,
const at::Tensor& b,
const at::Tensor& scale_a,
const at::Tensor& scale_b) {
for (auto [lhs, rhs] : options) {
if (is_desired_scaling(a, scale_a, lhs) &&
is_desired_scaling(b.t(), scale_b.t(), rhs)) {
return {lhs, rhs};
}
}
TORCH_CHECK(
false,
"Invalid scaling configuration.\n"
"- For TensorWise scaling, a and b should be float8, scales should be float and singletons.\n"
"- For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (",
a.size(0),
", 1) and scale_b should be (1, ",
b.size(1),
"), and both should be contiguous.\n"
"Got a.dtype()=",
a.scalar_type(),
", scale_a.dtype()=",
scale_a.scalar_type(),
", scale_a.size()=",
scale_a.sizes(),
", scale_a.stride()=",
scale_a.strides(),
", ",
"b.dtype()=",
b.scalar_type(),
", scale_b.dtype()=",
scale_b.scalar_type(),
", scale_b.size()=",
scale_b.sizes(),
" and scale_b.stride()=",
scale_b.strides());
}
Tensor& _scaled_gemm(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& scale_a,
const Tensor& scale_b,
const ScalingType scaling_choice_a,
const ScalingType scaling_choice_b,
const std::optional<Tensor>& bias,
const bool use_fast_accum,
Tensor& out,
const std::optional<Tensor>& alpha = std::nullopt) {
// TODO: scale_result and alpha is not defined or used!
std::optional<Tensor> scaled_result = std::nullopt;
at::native::onednn::scaled_matmul(
mat1,
mat2,
out,
scale_a,
scale_b,
scaling_choice_a,
scaling_choice_b,
bias,
scaled_result,
use_fast_accum);
return out;
}
} // namespace
// Computes matrix multiply + bias while applying scaling to input and output
// matrices Scales are only applicable when matrices are of Float8 type and
// assumed to be equal to 1.0 by default. If output matrix type is 16 or 32-bit
// type, scale_result is not applied. Known limitations:
// - Only works if mat1 is row-major and mat2 is column-major
// - Only works if matrices sizes are divisible by 32
// - If 1-dimensional tensors are used then scale_a should be size =
// mat1.size(0)
// and scale_b should have size = to mat2.size(1)
// Arguments:
// - `mat1`: the first operand of the matrix multiply, can be type
// `torch.float8_e4m3fn` or `torch.float8_e5m2`
// - `mat2`: the second operand of the matrix multiply, can be type
// `torch.float8_e4m3fn` or `torch.float8_e5m2`
// - `bias`: the bias, can be type `torch.float16` or `torch.bfloat16`
// - `out_dtype`: the output dtype, can either be a float8 or a higher
// precision floating point type
// - `scale_a`: a tensor with the inverse scale of `mat1`, whose
// shape/strides/dtype depend on the scaling scheme
// - `scale_b`: a tensor with the inverse scale of `mat2`, whose
// shape/strides/dtype depend on the scaling scheme
// - `scale_result`: a scalar tensor with the scale of the output, only
// utilized if the output is a float8 type
// - `use_fast_accum`: Not applicable for XPU. For now, it should always be
// false.
// - `out`: a reference to the output tensor
Tensor& _scaled_mm_out_xpu(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& scale_a,
const Tensor& scale_b,
const std::optional<at::Tensor>& bias,
const std::optional<at::Tensor>& scale_result,
std::optional<c10::ScalarType> out_dtype,
bool use_fast_accum,
Tensor& out) {
// Note: fast_accum is not supported in XPU for now.
TORCH_CHECK(!use_fast_accum, "fast_accum is not supported in XPU for now.");
TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix");
TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix");
TORCH_CHECK(
mat1.sizes()[1] == mat2.sizes()[0],
"mat1 and mat2 shapes cannot be multiplied (",
mat1.sizes()[0],
"x",
mat1.sizes()[1],
" and ",
mat2.sizes()[0],
"x",
mat2.sizes()[1],
")");
// Check what type of scaling we are doing based on inputs. This list is
// sorted by decreasing priority.
// List of supported datatypes for XPU with oneDNN:
// https://uxlfoundation.github.io/oneDNN/dev_guide_matmul.html#data-types
auto [scaling_choice_a, scaling_choice_b] = get_joint_scaling(
{
std::make_pair(ScalingType::TensorWise, ScalingType::TensorWise),
std::make_pair(ScalingType::RowWise, ScalingType::RowWise),
},
mat1,
mat2,
scale_a,
scale_b);
TORCH_CHECK(
!scale_result ||
(scale_result->numel() == 1 && scale_result->scalar_type() == kFloat),
"scale_result must be a float scalar");
TORCH_CHECK(
!bias || bias->numel() == mat2.sizes()[1],
"Bias must be size ",
mat2.sizes()[1],
" but got ",
bias->numel());
TORCH_CHECK(
mat1.sizes()[1] % 16 == 0,
"Expected trailing dimension of mat1 to be divisible by 16 ",
"but got mat1 shape: (",
mat1.sizes()[0],
"x",
mat1.sizes()[1],
").");
TORCH_CHECK(
mat2.sizes()[0] % 16 == 0 && mat2.sizes()[1] % 16 == 0,
"mat2 shape (",
mat2.sizes()[0],
"x",
mat2.sizes()[1],
") must be divisible by 16");
// Check types
TORCH_CHECK(
!out_dtype || *out_dtype == out.scalar_type(),
"out_dtype must match output matrix type");
TORCH_CHECK(
at::isFloat8Type(mat1.scalar_type()),
"Expected mat1 to be Float8 matrix got ",
mat1.scalar_type());
TORCH_CHECK(
at::isFloat8Type(mat2.scalar_type()),
"Expected mat2 to be Float8 matrix got ",
mat2.scalar_type());
// TODO: oneDNN Currently only supports e4m3 with group scales on BMG. Not
// support 2D scales, only 1D. Needs to add more checks there.
if (bias) {
TORCH_CHECK(
bias->scalar_type() == kFloat ||
bias->scalar_type() == c10::ScalarType::BFloat16 ||
bias->scalar_type() == c10::ScalarType::Half,
"Bias must be Float32 or BFloat16 or Half, but got ",
bias->scalar_type());
}
{
auto bias_ = bias.value_or(Tensor());
auto scale_result_ = scale_result.value_or(Tensor());
// NOLINTNEXTLINE(*c-array*)
TensorArg targs[]{
{out, "out", 0},
{mat1, "mat1", 1},
{mat2, "mat2", 2},
{bias_, "bias", 3},
{scale_a, "scale_a", 4},
{scale_b, "scale_b", 5},
{scale_result_, "scale_result", 6}};
checkAllSameGPU(__func__, targs);
}
// Validation checks have passed lets resize the output to actual size
IntArrayRef mat1_sizes = mat1.sizes();
IntArrayRef mat2_sizes = mat2.sizes();
at::native::resize_output(out, {mat1_sizes[0], mat2_sizes[1]});
// If any of M, K, N is 0 - return early (the tensorwise/rowwise float8 gemm
// kernels do not support this case).
if (mat1_sizes[0] == 0 || mat1_sizes[1] == 0 || mat2_sizes[1] == 0) {
// `out` was created with `at::empty`. In the case where we are multiplying
// MxK by KxN and K is the zero dim, we need to initialize here to properly
// return a tensor of zeros.
if (mat1_sizes[1] == 0) {
out.zero_();
}
return out;
}
// TODO: Scale_result is not supported by now!!
return _scaled_gemm(
mat1,
mat2,
scale_a,
scale_b,
scaling_choice_a,
scaling_choice_b,
bias,
use_fast_accum,
out);
}
Tensor _scaled_mm_xpu(
const Tensor& mat_a,
const Tensor& mat_b,
const Tensor& scale_a,
const Tensor& scale_b,
const std::optional<at::Tensor>& bias,
const std::optional<at::Tensor>& scale_result,
std::optional<c10::ScalarType> out_dtype,
bool use_fast_accum) {
const auto out_dtype_ = out_dtype.value_or(mat_a.scalar_type());
Tensor out = at::empty({0}, mat_a.options().dtype(out_dtype_));
return _scaled_mm_out_xpu(
mat_a,
mat_b,
scale_a,
scale_b,
bias,
scale_result,
out_dtype,
use_fast_accum,
out);
}
} // namespace at::native

View File

@ -1,3 +1,4 @@
#include <ATen/BlasBackend.h>
#include <ATen/Tensor.h>
#include <ATen/core/Tensor.h>
#include <c10/core/ScalarType.h>
@ -8,7 +9,6 @@
#include <oneapi/dnnl/dnnl.hpp>
namespace at::native::onednn {
at::Tensor broadcast_bias2D(
at::Tensor& dst,
at::Tensor& bias,
@ -328,4 +328,236 @@ void quantized_matmul(
result.copy_(dst);
}
// Describes how to configure oneDNN scales for a given role/ScalingType
struct ScaleSpec {
// specifies the way scale values will be applied to an ARG tensor.
int mask;
// specifies how scales are grouped along dimensions where
// multiple scale factors are used.
dnnl::memory::dims groups;
// specifies data type for scale factors.
dnnl::memory::data_type dtype;
// Helper to compute expected number of elements for scale tensors
// arg_type: "src" for SRC (groups pattern {1, X}),
// "wei" for WEIGHTS (groups pattern {X, 1})
int64_t expected_numel(
int64_t outer_dim,
int64_t inner_dim,
const std::string& arg_type) const {
if (groups == dnnl::memory::dims{1, 1})
return 1; // tensorwise scaling
TORCH_CHECK(
arg_type == "src" || arg_type == "wei",
"Expected arg_type to be 'src' or 'wei', but got '",
arg_type,
"'");
// For rowwise: SRC groups={1, K}, WEI groups={K, 1}
TORCH_INTERNAL_ASSERT(
(groups == dnnl::memory::dims{1, inner_dim} ||
groups == dnnl::memory::dims{inner_dim, 1}),
"The groups must be either {1, inner_dim} or {inner_dim, 1}. But got ",
groups,
".");
return outer_dim;
}
// Normalize an incoming scale tensor to contiguous storage and appropriate
// dtype/view
at::Tensor normalize(const at::Tensor& scale) const {
TORCH_INTERNAL_ASSERT(
dtype == dnnl::memory::data_type::f32,
"tensor scale currently must be f32, but got scale dtype: ",
scale.scalar_type());
return scale.to(at::kFloat).contiguous();
}
};
// This function defines how to set scales mask and groups according to:
// https://github.com/uxlfoundation/oneDNN/blob/main/tests/benchdnn/doc/knobs_attr.md#--attr-scales
// The returned value will be used in
// `set_scales(arg, mask, groups, data_type)`.
inline ScaleSpec make_scale_spec(
at::blas::ScalingType scaling_type,
int64_t M,
int64_t K,
int64_t N,
const std::string& arg_type) {
TORCH_CHECK(
arg_type == "src" || arg_type == "wei",
"Expected arg_type to be 'src' or 'wei', but got '",
arg_type,
"'");
TORCH_INTERNAL_ASSERT(
(scaling_type == at::blas::ScalingType::TensorWise ||
scaling_type == at::blas::ScalingType::RowWise),
"Currently only support scaling_type for TensorWise or RowWise");
int64_t dim = K; // Currently only K is used for grouping
bool is_src = (arg_type == "src");
if (scaling_type == at::blas::ScalingType::TensorWise) {
// Scale tensorwise. The same as `--attr-scales=common`.
// mask=0 : scale whole tensor
// groups={1, 1}: indicates that there is only one group for scaling
return {0, {1, 1}, dnnl::memory::data_type::f32};
} else {
// (scaling_type == at::blas::ScalingType::RowWise)
// Scale RowWise. The same as `--attr-scales=per_dim_01`.
// mask={(1 << 0) | (1 << 1)}: Scale on both dim0 and dim1
// SRC: groups={1, K}, WEIGHTS: groups={K, 1}
return {
(1 << 0) | (1 << 1),
is_src ? dnnl::memory::dims{1, dim} : dnnl::memory::dims{dim, 1},
dnnl::memory::data_type::f32};
}
}
sycl::event scaled_matmul(
const Tensor& mat1,
const Tensor& mat2,
Tensor& result,
const Tensor& scale_a,
const Tensor& scale_b,
at::blas::ScalingType scaling_choice_a,
at::blas::ScalingType scaling_choice_b,
const std::optional<at::Tensor>& bias,
const std::optional<at::Tensor>& scale_result,
bool use_fast_accum) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
// This function will do steps with following steps
// 1. create memory descriptor
// 2. call write_to_dnnl_memory() to actually write memory
// 3. execute
const int64_t M = mat1.size(0);
const int64_t K = mat1.size(1);
const int64_t N = mat2.size(1);
// 1.1 Create memory descriptor
dnnl::memory::desc src_md = get_onednn_md(mat1);
dnnl::memory::desc weights_md = get_onednn_md(mat2);
dnnl::memory::desc dst_md = get_onednn_md(result);
// scale_a and scale_b has already be checked in `is_desired_scaling()` call.
// So we could directly get their memory desc and set later.
dnnl::memory::desc scale_a_md = get_onednn_md(scale_a);
dnnl::memory::desc scale_b_md = get_onednn_md(scale_b);
dnnl::memory::desc bias_md;
bool with_bias = bias.has_value();
at::Tensor possible_reshaped_bias = bias.value_or(at::Tensor());
if (with_bias) {
if (possible_reshaped_bias.dim() == 1) {
possible_reshaped_bias =
possible_reshaped_bias.reshape({1, possible_reshaped_bias.size(0)});
bias_md = get_onednn_md(possible_reshaped_bias);
} else {
bias_md = get_onednn_md(possible_reshaped_bias);
}
}
// 1.2 Create primitive descriptor and set scales mask
const ScaleSpec src_spec = make_scale_spec(scaling_choice_a, M, K, N, "src");
const ScaleSpec wei_spec = make_scale_spec(scaling_choice_b, M, K, N, "wei");
dnnl::primitive_attr op_attr = dnnl::primitive_attr();
#if ONEDNN_SUPPORT_DETERMINISTIC
if (at::globalContext().deterministicAlgorithms() ||
at::globalContext().deterministicMkldnn())
op_attr.set_deterministic(true);
#endif
std::vector<int64_t> default_groups;
op_attr.set_scales(
DNNL_ARG_SRC, src_spec.mask, src_spec.groups, src_spec.dtype);
op_attr.set_scales(
DNNL_ARG_WEIGHTS, wei_spec.mask, wei_spec.groups, wei_spec.dtype);
// scale_result tensor currently only supports scalar(TensorWise Scaling).
bool with_dst_scale = scale_result && scale_result->defined();
if (with_dst_scale) {
op_attr.set_scales(DNNL_ARG_DST, 0, {1}, dnnl::memory::data_type::f32);
}
op_attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
// 1.3 Create the matmul primitive descriptor
dnnl::matmul::primitive_desc matmul_pd = with_bias
? dnnl::matmul::primitive_desc(
engine, src_md, weights_md, bias_md, dst_md, op_attr)
: dnnl::matmul::primitive_desc(
engine, src_md, weights_md, dst_md, op_attr);
// 1.4 (Possible) Additional Checks
// TODO: In case there are memory desc does not align with the actual tensor,
// we might need to reorder weights similar to CPU's reorder_if_differ_in()
// call. For example, weights not the same as matmul_pd.weights_desc(),
// 2. Prepare memory
// Create memory
auto src_usr_m = make_onednn_memory(src_md, engine, mat1.data_ptr());
auto weights_usr_m = make_onednn_memory(weights_md, engine, mat2.data_ptr());
auto dst_usr_m = make_onednn_memory(dst_md, engine, result.data_ptr());
dnnl::memory b_usr_m;
if (with_bias) {
b_usr_m =
make_onednn_memory(bias_md, engine, possible_reshaped_bias.data_ptr());
}
// Prepare runtime scale memories (flat 1-D views) using the specs
auto make_scale_mem_from_spec = [&](const ScaleSpec& spec,
int64_t expected_numel,
const at::Tensor& scale_tensor) {
at::Tensor prepared = spec.normalize(scale_tensor);
TORCH_CHECK(
prepared.numel() == expected_numel,
"Scale buffer length mismatch. Expected ",
expected_numel,
", got ",
prepared.numel());
dnnl::memory::desc scale_md(
{prepared.numel()}, spec.dtype, dnnl::memory::format_tag::x);
return make_onednn_memory(scale_md, engine, prepared.data_ptr());
};
auto scratchpad =
make_onednn_memory(matmul_pd.scratchpad_desc(), engine, nullptr);
// 3. Setup Args for exec
std::unordered_map<int, dnnl::memory> args;
args.insert({DNNL_ARG_SRC, src_usr_m});
args.insert({DNNL_ARG_WEIGHTS, weights_usr_m});
args.insert({DNNL_ARG_DST, dst_usr_m});
args.insert({DNNL_ARG_SCRATCHPAD, scratchpad});
if (with_bias) {
args.insert({DNNL_ARG_BIAS, b_usr_m});
}
// Attach runtime scales using specs
auto src_sc_mem = make_scale_mem_from_spec(
src_spec, src_spec.expected_numel(M, K, "src"), scale_a);
auto wei_sc_mem = make_scale_mem_from_spec(
wei_spec, wei_spec.expected_numel(N, K, "wei"), scale_b);
args.insert({DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC, src_sc_mem});
args.insert({DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, wei_sc_mem});
if (with_dst_scale) {
// Bind single f32 scalar as DST scale
at::Tensor dst_scale_f32 = scale_result->to(at::kFloat).contiguous();
dnnl::memory::desc dst_sc_md(
{1}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::x);
auto dst_sc_mem =
make_onednn_memory(dst_sc_md, engine, dst_scale_f32.data_ptr());
args.insert({DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, dst_sc_mem});
}
dnnl::matmul matmul_p = dnnl::matmul(matmul_pd);
sycl::event matmul_fwd_event =
dnnl::sycl_interop::execute(matmul_p, stream, args);
return matmul_fwd_event;
}
} // namespace at::native::onednn

View File

@ -78,6 +78,10 @@ dnnl::memory::data_type get_onednn_dtype(
return dnnl::memory::data_type::f32;
case at::ScalarType::BFloat16:
return dnnl::memory::data_type::bf16;
case at::ScalarType::Float8_e4m3fn:
return dnnl::memory::data_type::f8_e4m3;
case at::ScalarType::Float8_e5m2:
return dnnl::memory::data_type::f8_e5m2;
default:
if (!allow_undef) {
TORCH_CHECK(

View File

@ -1,6 +1,7 @@
#pragma once
#include <ATen/ATen.h>
#include <ATen/BlasBackend.h>
#include <ATen/native/mkldnn/xpu/detail/Attr.h>
#include <ATen/native/mkldnn/xpu/detail/Utils.h>
#include <ATen/native/mkldnn/xpu/detail/oneDNNContext.h>
@ -202,4 +203,16 @@ void sdpa_backward(
Tensor& grad_query,
Tensor& grad_key,
Tensor& grad_value);
sycl::event scaled_matmul(
const Tensor& mat1,
const Tensor& mat2,
Tensor& result,
const Tensor& scale_a,
const Tensor& scale_b,
at::blas::ScalingType scaling_choice_a,
at::blas::ScalingType scaling_choice_b,
const std::optional<at::Tensor>& bias,
const std::optional<at::Tensor>& scale_result,
bool use_fast_accum);
} // namespace at::native::onednn

View File

@ -82,6 +82,7 @@ NSArray<NSNumber*>* getTensorAxes(const TensorBase& t);
NSArray<NSNumber*>* getTensorAxes(const IntArrayRef& sizes, at::OptionalIntArrayRef dim);
std::string getMPSShapeString(MPSShape* shape);
std::string getTensorsStringKey(const TensorList& tensors, bool short_dtype = true, bool exclude_shape = false);
std::string to_hex_key(float);
std::string getArrayRefString(const IntArrayRef s);
// use has_storage() on the returned tensor to determine if src actually is a view
Tensor gatherViewTensor(const Tensor& src, Tensor& dst);

View File

@ -301,6 +301,10 @@ std::string getArrayRefString(const IntArrayRef s) {
return fmt::to_string(fmt::join(s, ","));
}
std::string to_hex_key(float f) {
return fmt::format("{:a}", f);
}
std::string getTensorsStringKey(const TensorList& tensors, bool short_dtype, bool exclude_shape) {
fmt::basic_memory_buffer<char, 100> buffer;
auto buf_iterator = std::back_inserter(buffer);

View File

@ -96,7 +96,9 @@ kernel void addmm(
auto bias =
biasData[thread_id.y * strides[3].x + thread_id.x * strides[3].y];
outputData[thread_id.y * strides[2].x + thread_id.x * strides[2].y] =
static_cast<T>(alpha_beta[0] * sum + alpha_beta[1] * bias);
static_cast<T>(
c10::metal::mul(alpha_beta[0], sum) +
c10::metal::mul(alpha_beta[1], bias));
}
}

View File

@ -121,7 +121,7 @@ Tensor& do_metal_addmm(const Tensor& self,
const Scalar& alpha,
const Scalar& beta,
const Tensor& bias) {
if (beta.toDouble() == 0 && alpha.toDouble() == 1) {
if (beta.isFloatingPoint() && alpha.isFloatingPoint() && beta.toDouble() == 0 && alpha.toDouble() == 1) {
return do_metal_mm(self, other, output);
}
auto stream = getCurrentMPSStream();
@ -147,13 +147,15 @@ Tensor& do_metal_addmm(const Tensor& self,
std::array<int64_t, 2> i64;
std::array<int32_t, 2> i32;
std::array<float, 2> f32;
} alpha_beta;
std::array<c10::complex<float>, 2> c64;
} alpha_beta{};
if (output.scalar_type() == kLong) {
alpha_beta.i64 = {alpha.toLong(), beta.toLong()};
} else if (c10::isIntegralType(output.scalar_type(), true)) {
alpha_beta.i32 = {alpha.toInt(), beta.toInt()};
} else if (c10::isComplexType(output.scalar_type())) {
alpha_beta.c64 = {alpha.toComplexFloat(), beta.toComplexFloat()};
} else {
TORCH_INTERNAL_ASSERT(c10::isFloatingType(output.scalar_type()));
alpha_beta.f32 = {alpha.toFloat(), beta.toFloat()};
}
constexpr uint32_t TILE_DIM = 16; // fastest performance from tests on multiple macs

View File

@ -244,8 +244,8 @@ static void clamp_scalar_out_mps(const Tensor& input_t,
@autoreleasepool {
// the optional min/max refs could affect how we build the cached graph
std::string key = op_name + (has_min ? ("_min:" + std::to_string(min_scalar)) : "") +
(has_max ? ("_max:" + std::to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
std::string key = op_name + (has_min ? ("_min:" + to_hex_key(min_scalar)) : "") +
(has_max ? ("_max:" + to_hex_key(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
if (has_min)
newCachedGraph->minTensor = [mpsGraph constantWithScalar:min_scalar

View File

@ -301,12 +301,12 @@ class AvgPoolMicrokernelTester {
ASSERT_NEAR(
float(int32_t(y[i * yStride() + k])), yFP[i * kc() + k], 0.5001f)
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
}
}
@ -396,12 +396,12 @@ class AvgPoolMicrokernelTester {
ASSERT_NEAR(
float(int32_t(y[i * yStride() + k])), yFP[i * kc() + k], 0.5001f)
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
}
}

View File

@ -232,7 +232,7 @@ class MaxPoolMicrokernelTester {
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< "), kc = " << kc();
}
}

View File

@ -17,7 +17,7 @@ inline std::vector<T> _expand_param_if_needed(
std::ostringstream ss;
ss << "expected " << param_name << " to be a single integer value or a "
<< "list of " << expected_dim << " values to match the convolution "
<< "dimensions, but got " << param_name << "=" << list_param;
<< "dimensions, but got " << param_name << '=' << list_param;
TORCH_CHECK(false, ss.str());
} else {
return list_param.vec();

View File

@ -358,9 +358,9 @@ std::string Adapter::stringize() const {
std::string device_type = get_device_type_str(properties.deviceType);
VkPhysicalDeviceLimits limits = properties.limits;
ss << "{" << std::endl;
ss << '{' << std::endl;
ss << " Physical Device Info {" << std::endl;
ss << " apiVersion: " << v_major << "." << v_minor << std::endl;
ss << " apiVersion: " << v_major << '.' << v_minor << std::endl;
ss << " driverversion: " << properties.driverVersion << std::endl;
ss << " deviceType: " << device_type << std::endl;
ss << " deviceName: " << properties.deviceName << std::endl;
@ -371,7 +371,7 @@ std::string Adapter::stringize() const {
#define PRINT_LIMIT_PROP_VEC3(name) \
ss << " " << std::left << std::setw(36) << #name << limits.name[0] \
<< "," << limits.name[1] << "," << limits.name[2] << std::endl;
<< ',' << limits.name[1] << ',' << limits.name[2] << std::endl;
ss << " Physical Device Limits {" << std::endl;
PRINT_LIMIT_PROP(maxImageDimension1D);
@ -425,7 +425,7 @@ std::string Adapter::stringize() const {
;
}
ss << " ]" << std::endl;
ss << "}";
ss << '}';
return ss.str();
}

View File

@ -33,7 +33,7 @@ std::ostream& operator<<(std::ostream& out, const VkResult result) {
VK_RESULT_CASE(VK_ERROR_FORMAT_NOT_SUPPORTED)
VK_RESULT_CASE(VK_ERROR_FRAGMENTED_POOL)
default:
out << "VK_ERROR_UNKNOWN (VkResult " << result << ")";
out << "VK_ERROR_UNKNOWN (VkResult " << result << ')';
break;
}
return out;
@ -46,7 +46,7 @@ std::ostream& operator<<(std::ostream& out, const VkResult result) {
//
std::ostream& operator<<(std::ostream& out, const SourceLocation& loc) {
out << loc.function << " at " << loc.file << ":" << loc.line;
out << loc.function << " at " << loc.file << ':' << loc.line;
return out;
}
@ -66,7 +66,7 @@ Error::Error(SourceLocation source_location, const char* cond, std::string msg)
: msg_(std::move(msg)), source_location_{source_location} {
std::ostringstream oss;
oss << "Exception raised from " << source_location_ << ": ";
oss << "(" << cond << ") is false! ";
oss << '(' << cond << ") is false! ";
oss << msg_;
what_ = oss.str();
}

View File

@ -173,8 +173,8 @@ void QueryPool::extract_results() {
static std::string stringize(const VkExtent3D& extents) {
std::stringstream ss;
ss << "{" << extents.width << ", " << extents.height << ", " << extents.depth
<< "}";
ss << '{' << extents.width << ", " << extents.height << ", " << extents.depth
<< '}';
return ss.str();
}

View File

@ -149,7 +149,7 @@ VKAPI_ATTR VkBool32 VKAPI_CALL debug_report_callback_fn(
(void)flags;
std::stringstream stream;
stream << layer_prefix << " " << message_code << " " << message << std::endl;
stream << layer_prefix << ' ' << message_code << ' ' << message << std::endl;
const std::string log = stream.str();
std::cout << log;

View File

@ -253,7 +253,7 @@ using vec4 = vec<4u>;
// uvec3 is the type representing tensor extents. Useful for debugging.
inline std::ostream& operator<<(std::ostream& os, const uvec3& v) {
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ")";
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ')';
return os;
}

View File

@ -246,7 +246,7 @@ void TestToCFloat() {
void TestToString() {
Tensor b = ones({3, 7}) * .0000001f;
std::stringstream s;
s << b << "\n";
s << b << '\n';
std::string expect = "1e-07 *";
ASSERT_EQ_RESOLVED(s.str().substr(0, expect.size()), expect);
}

View File

@ -33,7 +33,7 @@ struct Foo {
static void apply(Tensor a, Tensor b) {
scalar_type s = 1;
std::stringstream ss;
ss << "hello, dispatch: " << a.toString() << s << "\n";
ss << "hello, dispatch: " << a.toString() << s << '\n';
auto data = (scalar_type*)a.data_ptr();
(void)data;
}
@ -73,8 +73,8 @@ TEST(TestScalar, TestScalar) {
Scalar bar = 3.0;
Half h = bar.toHalf();
Scalar h2 = h;
cout << "H2: " << h2.toDouble() << " " << what.toFloat() << " "
<< bar.toDouble() << " " << what.isIntegral(false) << "\n";
cout << "H2: " << h2.toDouble() << ' ' << what.toFloat() << ' '
<< bar.toDouble() << ' ' << what.isIntegral(false) << '\n';
auto gen = at::detail::getDefaultCPUGenerator();
{
// See Note [Acquire lock when using random generators]
@ -84,7 +84,7 @@ TEST(TestScalar, TestScalar) {
}
if (at::hasCUDA()) {
auto t2 = zeros({4, 4}, at::kCUDA);
cout << &t2 << "\n";
cout << &t2 << '\n';
}
auto t = ones({4, 4});
@ -129,7 +129,7 @@ TEST(TestScalar, TestScalar) {
std::stringstream ss;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto)
ASSERT_NO_THROW(
ss << "hello, dispatch" << x.toString() << s << "\n");
ss << "hello, dispatch" << x.toString() << s << '\n');
auto data = (scalar_t*)x.data_ptr();
(void)data;
});

View File

@ -1,5 +1,5 @@
#include <ATen/ATen.h>
int main() {
std::cout << at::ones({3,4}, at::CPU(at::kFloat)) << "\n";
std::cout << at::ones({3,4}, at::CPU(at::kFloat)) << '\n';
}

View File

@ -1828,9 +1828,9 @@ namespace {
#endif
EXPECT_EQ(u16, c10::detail::fp16_ieee_from_fp32_value(f32s[i]))
<< "Test failed for float to uint16 " << f32s[i] << "\n";
<< "Test failed for float to uint16 " << f32s[i] << '\n';
EXPECT_EQ(x, c10::detail::fp16_ieee_to_fp32_value(u16))
<< "Test failed for uint16 to float " << u16 << "\n";
<< "Test failed for uint16 to float " << u16 << '\n';
}
}
TEST(FP8E4M3Test, FP8E4M3ConversionFloat) {
@ -1848,10 +1848,10 @@ namespace {
EXPECT_TRUE(std::isnan(f32));
} else {
EXPECT_EQ(f32, c10::detail::fp8e4m3fn_to_fp32_value(input))
<< "Test failed for u8 to float " << input << "\n";
<< "Test failed for u8 to float " << input << '\n';
}
EXPECT_EQ(u8, c10::detail::fp8e4m3fn_from_fp32_value(f32))
<< "Test failed for float to u8 " << f32 << "\n";
<< "Test failed for float to u8 " << f32 << '\n';
}
}
TEST(FP8E4M3Test, FP8E4M3BinaryAdd) {
@ -2015,10 +2015,10 @@ namespace {
EXPECT_TRUE(std::isnan(f32));
} else {
EXPECT_EQ(f32, c10::detail::fp8e5m2_to_fp32_value(input))
<< "Test failed for u8 to float " << input << "\n";
<< "Test failed for u8 to float " << input << '\n';
}
EXPECT_EQ(u8, c10::detail::fp8e5m2_from_fp32_value(f32))
<< "Test failed for float to u8 " << f32 << "\n";
<< "Test failed for float to u8 " << f32 << '\n';
}
}
TEST(FP8E5M2Test, FP8E5M2BinaryAdd) {

View File

@ -19,7 +19,7 @@ TEST(Vitals, Basic) {
c10::utils::set_env("TORCH_VITAL", "1");
TORCH_VITAL_DEFINE(Testing);
TORCH_VITAL(Testing, Attribute0) << 1;
TORCH_VITAL(Testing, Attribute1) << "1";
TORCH_VITAL(Testing, Attribute1) << '1';
TORCH_VITAL(Testing, Attribute2) << 1.0f;
TORCH_VITAL(Testing, Attribute3) << 1.0;
auto t = at::ones({1, 1});

View File

@ -129,14 +129,14 @@ void showRtol(const at::Tensor& a, const at::Tensor& b) {
std::cout << "Max Diff allowed: " << maxDiff << std::endl;
if (diff.sizes().size() == 2) {
for (const auto y : c10::irange(diff.sizes()[0])) {
std::cout << y << ":";
std::cout << y << ':';
for (const auto x : c10::irange(diff.sizes()[1])) {
float diff_xy = diff[y][x].item<float>();
if (diff_xy > maxDiff) {
std::cout << std::setw(5) << x;
}
else {
std::cout << std::setw(5) << " ";
std::cout << std::setw(5) << ' ';
}
}
std::cout << std::endl;
@ -3276,7 +3276,7 @@ TEST_F(VulkanAPITest, masked_fill_invalidinputs_exceptions) {
void print_shape(const std::vector<int64_t>& shape) {
for (const auto& num : shape) {
std::cout << num << " ";
std::cout << num << ' ';
}
}
@ -3367,7 +3367,7 @@ void test_masked_fill_scalar(
print_shape(tmp_curr_input_shape);
std::cout << "], and mask of shape [";
print_shape(tmp_curr_mask_shape);
std::cout << "]" << std::endl;
std::cout << ']' << std::endl;
}
ASSERT_TRUE(check);
@ -4542,9 +4542,9 @@ void test_softmax(const at::IntArrayRef shape, bool log_softmax = false) {
if (!check) {
std::cout << "Softmax test failed on axis " << dim << "for tensor dims {";
for (uint32_t place = 0; place < shape.size() - 1; place++) {
std::cout << shape[place] << " ";
std::cout << shape[place] << ' ';
}
std::cout << shape.back() << "}" << std::endl;
std::cout << shape.back() << '}' << std::endl;
showRtol(out_cpu, out_vulkan.cpu());
}
ASSERT_TRUE(check);

View File

@ -95,7 +95,7 @@ void showRtol(
std::cout << "Max Diff found is: " << diff.max().item<double>() << std::endl;
if (diff.sizes().size() == 2) {
for (const auto y : c10::irange(diff.sizes()[0])) {
std::cout << y << ":";
std::cout << y << ':';
for (const auto x : c10::irange(diff.sizes()[1])) {
double diff_xy = diff[y][x].item<double>();
if (diff_xy > maxDiff) {
@ -109,7 +109,7 @@ void showRtol(
}
}
} else {
std::cout << std::setw(5) << " ";
std::cout << std::setw(5) << ' ';
}
}
std::cout << std::endl;
@ -148,19 +148,19 @@ using at::native::vulkan::api::utils::ivec4;
using at::native::vulkan::api::utils::vec4;
std::ostream& operator<<(std::ostream& os, const vec4& v) {
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ")";
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ')';
return os;
}
std::ostream& operator<<(std::ostream& os, const ivec3& v) {
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ")";
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ')';
return os;
}
std::ostream& operator<<(std::ostream& os, const ivec4& v) {
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ")";
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ')';
return os;
}
@ -3379,51 +3379,51 @@ bool _test_quantized_linear(
showRtol(out_cpu_dequant, out_vk_to_cpu_dequant);
}
if (xpos != -1 && ypos != -1) {
std::cout << "\nFailure caused on row/col: " << ypos << "/" << xpos
<< "\n";
std::cout << "\nFailure caused on row/col: " << ypos << '/' << xpos
<< '\n';
std::cout << "Input tensor scale: " << scale << " zerop: " << zero_point
<< "\n";
std::cout << "Input tensor row " << ypos << "\n";
<< '\n';
std::cout << "Input tensor row " << ypos << '\n';
for (int i = 0; i < input_cpu.sizes()[1]; i++) {
std::cout << input_cpu[ypos][i].item<double>() << ", ";
}
std::cout << "\n";
std::cout << '\n';
std::cout << "Weight tensor scale: " << w_scale
<< " zerop: " << w_zero_point << "\n";
std::cout << "Weight tensor col " << xpos << "\n";
<< " zerop: " << w_zero_point << '\n';
std::cout << "Weight tensor col " << xpos << '\n';
for (int i = 0; i < weight.sizes()[1]; i++) {
std::cout << weight[xpos][i].item<double>() << ", ";
}
std::cout << "\n";
std::cout << '\n';
std::cout << "Input tensor quantized row " << ypos << " with dtype "
<< (input_quant_dtype_int8 ? "QInt8" : "QUInt8") << "\n";
<< (input_quant_dtype_int8 ? "QInt8" : "QUInt8") << '\n';
for (int i = 0; i < input_cpu.sizes()[1]; i++) {
std::cout << input_cpu_quantized[ypos][i].item<double>() << ", ";
}
std::cout << "\n";
std::cout << '\n';
std::cout << "Weight tensor quantized col " << xpos << " with dtype "
<< (weight_quant_dtype_int8 ? "QInt8" : "QUInt8") << "\n";
<< (weight_quant_dtype_int8 ? "QInt8" : "QUInt8") << '\n';
for (int i = 0; i < weight.sizes()[1]; i++) {
std::cout << weight_cpu_quantized[xpos][i].item<double>() << ", ";
}
std::cout << "\n";
std::cout << '\n';
std::cout << "bias tensor\n";
for (int i = 0; i < bias.sizes()[0]; i++) {
std::cout << bias[i].item<double>() << ", ";
}
std::cout << "\n";
std::cout << '\n';
std::cout << "out_scale: " << out_scale
<< " out_zero_point: " << out_zero_point << "\n";
<< " out_zero_point: " << out_zero_point << '\n';
std::cout << "cpu unmatched output: "
<< out_cpu_dequant[ypos][xpos].item<double>() << "\n";
<< out_cpu_dequant[ypos][xpos].item<double>() << '\n';
std::cout << "vk unmatched output: "
<< out_vk_to_cpu_dequant[ypos][xpos].item<double>() << "\n";
<< out_vk_to_cpu_dequant[ypos][xpos].item<double>() << '\n';
}
}
return check;

View File

@ -189,6 +189,10 @@ skip:
- hf_Whisper
- hf_distil_whisper
- timm_vision_transformer_large
# https://github.com/pytorch/pytorch/issues/167895
- stable_diffusion
- stable_diffusion_text_encoder
- stable_diffusion_unet
device:
cpu:

View File

@ -2,6 +2,7 @@
# These load paths point to different files in internal and OSS environment
load("@bazel_skylib//lib:paths.bzl", "paths")
load("//tools/build_defs:cell_defs.bzl", "get_fbsource_cell")
load("//tools/build_defs:fb_native_wrapper.bzl", "fb_native")
load("//tools/build_defs:fb_xplat_cxx_library.bzl", "fb_xplat_cxx_library")
load("//tools/build_defs:fb_xplat_genrule.bzl", "fb_xplat_genrule")
@ -590,6 +591,9 @@ def pt_operator_query_codegen(
pt_allow_forced_schema_registration = True,
compatible_with = [],
apple_sdks = None):
if get_fbsource_cell() == "fbcode":
return
oplist_dir_name = name + "_pt_oplist"
# @lint-ignore BUCKLINT
@ -865,6 +869,9 @@ def define_buck_targets(
pt_xplat_cxx_library = fb_xplat_cxx_library,
c2_fbandroid_xplat_compiler_flags = [],
labels = []):
if get_fbsource_cell() == "fbcode":
return
# @lint-ignore BUCKLINT
fb_native.filegroup(
name = "metal_build_srcs",

View File

@ -176,7 +176,7 @@ std::ostream& operator<<(std::ostream& os, DispatchKeySet ts) {
os << k;
first = false;
}
os << ")";
os << ')';
return os;
}

View File

@ -34,20 +34,6 @@ namespace c10 {
// See [dtype Macros note] in torch/headeronly/core/ScalarType.h
// regarding macros.
template <typename T>
struct CppTypeToScalarType;
#define SPECIALIZE_CppTypeToScalarType(cpp_type, scalar_type) \
template <> \
struct CppTypeToScalarType<cpp_type> \
: std:: \
integral_constant<c10::ScalarType, c10::ScalarType::scalar_type> { \
};
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_AND_QINTS(SPECIALIZE_CppTypeToScalarType)
#undef SPECIALIZE_CppTypeToScalarType
#define DEFINE_CONSTANT(_, name) \
constexpr ScalarType k##name = ScalarType::name;

View File

@ -33,7 +33,7 @@ std::ostream& operator<<(std::ostream& stream, const TensorOptions& options) {
} else {
stream << "(nullopt)";
}
stream << ")";
stream << ')';
return stream;
}

View File

@ -136,7 +136,7 @@ std::string c10_retrieve_device_side_assertion_info() {
// Something failed, let's talk about that
oss << failures_found
<< " CUDA device-side assertion failures were found on GPU #"
<< device_num << "!" << std::endl;
<< device_num << '!' << std::endl;
if (assertion_data_for_device.assertion_count >
C10_CUDA_DSA_ASSERTION_COUNT) {
oss << "But at least " << assertion_data_for_device.assertion_count
@ -151,17 +151,17 @@ std::string c10_retrieve_device_side_assertion_info() {
oss << "Assertion failure " << i << std::endl;
oss << " GPU assertion failure message = " << self.assertion_msg
<< std::endl;
oss << " File containing assertion = " << self.filename << ":"
oss << " File containing assertion = " << self.filename << ':'
<< self.line_number << std::endl;
oss << " Device function containing assertion = " << self.function_name
<< std::endl;
oss << " Thread ID that failed assertion = [" << self.thread_id[0] << ","
<< self.thread_id[1] << "," << self.thread_id[2] << "]" << std::endl;
oss << " Block ID that failed assertion = [" << self.block_id[0] << ","
<< self.block_id[1] << "," << self.block_id[2] << "]" << std::endl;
oss << " Thread ID that failed assertion = [" << self.thread_id[0] << ','
<< self.thread_id[1] << ',' << self.thread_id[2] << ']' << std::endl;
oss << " Block ID that failed assertion = [" << self.block_id[0] << ','
<< self.block_id[1] << ',' << self.block_id[2] << ']' << std::endl;
if (launch_info.generation_number == self.caller) {
oss << " File containing kernel launch = "
<< launch_info.launch_filename << ":" << launch_info.launch_linenum
<< launch_info.launch_filename << ':' << launch_info.launch_linenum
<< std::endl;
oss << " Function containing kernel launch = "
<< launch_info.launch_function << std::endl;
@ -175,7 +175,7 @@ std::string c10_retrieve_device_side_assertion_info() {
if (launch_registry.gather_launch_stacktrace) {
oss << "Launch stacktracing disabled." << std::endl;
} else {
oss << "\n" << launch_info.launch_stacktrace << std::endl;
oss << '\n' << launch_info.launch_stacktrace << std::endl;
}
} else {
oss << " CPU launch site info: Unavailable, the circular queue wrapped around. Increase `CUDAKernelLaunchRegistry::max_size`."

View File

@ -435,7 +435,7 @@ TEST(DispatchKeySet, TestFunctionalityDispatchKeyToString) {
if (i > 0) {
ASSERT_TRUE(res.find("Unknown") == std::string::npos)
<< i << " (before is " << toString(static_cast<DispatchKey>(i - 1))
<< ")";
<< ')';
} else {
ASSERT_TRUE(res.find("Unknown") == std::string::npos) << i;
}

View File

@ -96,10 +96,10 @@ TEST(HalfConversionTest, TestPorableConversion) {
for (auto x : inputs) {
auto target = c10::detail::fp16_ieee_to_fp32_value(x);
EXPECT_EQ(halfbits2float(x), target)
<< "Test failed for uint16 to float " << x << "\n";
<< "Test failed for uint16 to float " << x << '\n';
EXPECT_EQ(
float2halfbits(target), c10::detail::fp16_ieee_from_fp32_value(target))
<< "Test failed for float to uint16" << target << "\n";
<< "Test failed for float to uint16" << target << '\n';
}
}

View File

@ -98,7 +98,7 @@ struct Noncopyable {
};
std::ostream& operator<<(std::ostream& out, const Noncopyable& nc) {
out << "Noncopyable(" << nc.x << ")";
out << "Noncopyable(" << nc.x << ')';
return out;
}
} // namespace

View File

@ -50,7 +50,13 @@ namespace c10 {
/// However, you should prefer to use ArrayRef when possible, because its use
/// of TORCH_CHECK will lead to better user-facing error messages.
template <typename T>
class ArrayRef final : public HeaderOnlyArrayRef<T> {
// ArrayRef cannot be derived from. Normally, we would use `final`
// specifier to force this constraint at compile time. However, Intel
// compiler does not recognize ArrayRef as a class template (which is
// required in the definition of at::TensorAccessor, for instance)
// when `final` specifier is used. So, we cannot define ArrayRef as
// final because of the Intel compiler issue.
class ArrayRef : public HeaderOnlyArrayRef<T> {
public:
/// @name Constructors, all inherited from HeaderOnlyArrayRef except for
/// SmallVector. As inherited constructors won't work with class template
@ -198,13 +204,13 @@ ArrayRef(const std::initializer_list<T>&) -> ArrayRef<T>;
template <typename T>
std::ostream& operator<<(std::ostream& out, ArrayRef<T> list) {
int i = 0;
out << "[";
out << '[';
for (const auto& e : list) {
if (i++ > 0)
out << ", ";
out << e;
}
out << "]";
out << ']';
return out;
}

View File

@ -106,8 +106,8 @@ class GetBacktraceImpl {
/*length*/ &length,
/*status*/ &status);
os << " frame #" << idx++ << "\t"
<< ((demangled != NULL && status == 0) ? demangled : symbol) << "["
os << " frame #" << idx++ << '\t'
<< ((demangled != NULL && status == 0) ? demangled : symbol) << '['
<< addr << "]\t" << std::endl;
}
free(demangled);
@ -274,7 +274,7 @@ class GetBacktraceImpl {
} else {
// In the edge-case where we couldn't parse the frame string, we can
// just use it directly (it may have a different format).
stream << symbols[frame_number] << "\n";
stream << symbols[frame_number] << '\n';
}
}
@ -413,8 +413,8 @@ class GetBacktraceImpl {
<< back_trace_[i_frame] << std::dec;
if (with_symbol) {
stream << std::setfill('0') << std::setw(16) << std::uppercase
<< std::hex << p_symbol->Address << std::dec << " " << module
<< "!" << p_symbol->Name;
<< std::hex << p_symbol->Address << std::dec << ' ' << module
<< '!' << p_symbol->Name;
} else {
stream << " <unknown symbol address> " << module << "!<unknown symbol>";
}
@ -424,7 +424,7 @@ class GetBacktraceImpl {
} else {
stream << "<unknown file> @ <unknown line number>";
}
stream << "]" << std::endl;
stream << ']' << std::endl;
}
return stream.str();

View File

@ -1,4 +1,5 @@
#include <c10/util/Exception.h>
#include <c10/util/FileSystem.h>
#include <c10/util/Logging.h>
#include <c10/util/Type.h>
@ -27,7 +28,7 @@ Error::Error(
const void* caller)
: Error(
str("[enforce fail at ",
detail::StripBasename(file),
c10::filesystem::path(file).filename(),
":",
line,
"] ",
@ -44,7 +45,7 @@ std::string Error::compute_what(bool include_backtrace) const {
if (context_.size() == 1) {
// Fold error and context in one line
oss << " (" << context_[0] << ")";
oss << " (" << context_[0] << ')';
} else {
for (const auto& c : context_) {
oss << "\n " << c;
@ -52,7 +53,7 @@ std::string Error::compute_what(bool include_backtrace) const {
}
if (include_backtrace && backtrace_) {
oss << "\n" << backtrace_->get();
oss << '\n' << backtrace_->get();
}
return oss.str();
@ -247,7 +248,7 @@ void WarningHandler::process(const Warning& warning) {
LOG_AT_FILE_LINE(
WARNING, warning.source_location().file, warning.source_location().line)
<< "Warning: " << warning.msg() << " (function "
<< warning.source_location().function << ")";
<< warning.source_location().function << ')';
}
std::string GetExceptionString(const std::exception& e) {

View File

@ -379,7 +379,11 @@ C10_API std::string GetExceptionString(const std::exception& e);
// ----------------------------------------------------------------------------
#ifdef STRIP_ERROR_MESSAGES
#define TORCH_RETHROW(e, ...) throw
#define TORCH_RETHROW(e, ...) \
do { \
(void)e; /* Suppress unused variable warning */ \
throw; \
} while (false)
#else
#define TORCH_RETHROW(e, ...) \
do { \

View File

@ -1,4 +1,5 @@
#include <c10/util/Backtrace.h>
#include <c10/util/FileSystem.h>
#include <c10/util/Flags.h>
#include <c10/util/Lazy.h>
#include <c10/util/Logging.h>
@ -473,13 +474,12 @@ MessageLogger::MessageLogger(
if (GLOBAL_RANK != -1) {
stream_ << "[rank" << GLOBAL_RANK << "]:";
}
stream_ << "[" << CAFFE2_SEVERITY_PREFIX[std::min(4, GLOG_FATAL - severity_)]
stream_ << '[' << CAFFE2_SEVERITY_PREFIX[std::min(4, GLOG_FATAL - severity_)]
<< (timeinfo->tm_mon + 1) * 100 + timeinfo->tm_mday
<< std::setfill('0') << " " << std::setw(2) << timeinfo->tm_hour
<< ":" << std::setw(2) << timeinfo->tm_min << ":" << std::setw(2)
<< timeinfo->tm_sec << "." << std::setw(9) << ns << " "
<< c10::detail::StripBasename(std::string(file)) << ":" << line
<< "] ";
<< std::setfill('0') << ' ' << std::setw(2) << timeinfo->tm_hour
<< ':' << std::setw(2) << timeinfo->tm_min << ':' << std::setw(2)
<< timeinfo->tm_sec << '.' << std::setw(9) << ns << ' '
<< c10::filesystem::path(file).filename() << ':' << line << "] ";
}
// Output the contents of the stream to the proper channel on destruction.
@ -488,7 +488,7 @@ MessageLogger::~MessageLogger() noexcept(false) {
// Nothing needs to be logged.
return;
}
stream_ << "\n";
stream_ << '\n';
#ifdef ANDROID
static const int android_log_levels[] = {
ANDROID_LOG_FATAL, // LOG_FATAL

Some files were not shown because too many files have changed in this diff Show More