Compare commits

...

37 Commits

Author SHA1 Message Date
51ba31dceb Fix memory leak test for SDPA op call. The test currently fails because it keeps committing more work without waiting for previous work to finish and clean up. 2025-11-17 14:39:36 -08:00
b7208877c8 Revert "deprecate check_is_size and guard_size_oblivious (#167198)"
This reverts commit f2e6f94081c02704adf027fe0c81bf25726828f5.

Reverted https://github.com/pytorch/pytorch/pull/167198 on behalf of https://github.com/yangw-dev due to synced with author, this breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/167198#issuecomment-3544065659))
2025-11-17 22:16:37 +00:00
f69815d77f [pallas backend] remove unnecessary mypy comment (#167954)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167954
Approved by: https://github.com/Skylion007
2025-11-17 22:14:08 +00:00
1c04a43959 Revert "Tiling bug fix (#167771)"
This reverts commit 7ede33b8e3cd5f068c6e88d678ed3f67f5249c64.

Reverted https://github.com/pytorch/pytorch/pull/167771 on behalf of https://github.com/eellison due to needs one fix ([comment](https://github.com/pytorch/pytorch/pull/167771#issuecomment-3543999822))
2025-11-17 21:54:56 +00:00
661fb53449 Revert "Remove old NVTX interface (#167637)"
This reverts commit 99117c1238c9adcd3fb2621e36c91f9d20ed2ff7.

Reverted https://github.com/pytorch/pytorch/pull/167637 on behalf of https://github.com/yangw-dev due to breaks internal build with torch/csrc/profiler/stubs/cuda.cpp:4:10: fatal error: 'nvtx3/nvtx3.hpp' file not found 4 | #include <nvtx3/nvtx3.hpp>, please find a meta fella to resolve this issue and try again, diff:[D87229660] ([comment](https://github.com/pytorch/pytorch/pull/167637#issuecomment-3543984021))
2025-11-17 21:51:04 +00:00
4e1b772103 Fix: Improve fallback behavior in deserialize_torch_artifact and relocate test into TestSaveLoad (#158247)
This is a follow-up to [#154333](https://github.com/pytorch/pytorch/pull/154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact.

In this revised PR:

Cleaned up commit history for clarity and reproducibility.

Relocated the test into the TestSaveLoad class in test_serialize.py.

The previous PR had inconsistencies due to local branch issues and was closed in favor of this cleaner submission.

Feedback is very welcome
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158247
Approved by: https://github.com/angelayi
2025-11-17 21:14:37 +00:00
bdd3c3a29c Support SymInt placeholder in wrapper fxir (#167757)
Summary:
add support for symint placeholders

added two test cases with dynamic reshape
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)

Test Plan:
test_reshape_dynamic_ph
test_reshape_dynamic_tmd

Differential Revision: D86984100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167757
Approved by: https://github.com/blaine-rister
2025-11-17 21:10:55 +00:00
9d8ceaa36f Revert "[ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)"
This reverts commit 53809f964083a9e89182c2db7638fd44f3a6e304.

Reverted https://github.com/pytorch/pytorch/pull/158250 on behalf of https://github.com/zou3519 due to reverting to see if it fixes inductor halide test failure ([comment](https://github.com/pytorch/pytorch/pull/158250#issuecomment-3543840277))
2025-11-17 21:06:26 +00:00
927899dc05 fixes a few issues with out_dtype overload for addmm/baddbmm (#167931)
Per title
1) allows `self` argument to have the same precision as output
2) fixes broadcasting of the `self` argument - it used to allocate an incorrectly sized output and resize it later, which caused a warning in addmm and an error in baddbmm (illustrated below)
3) fixes `out` handling for the baddbmm `out` overload, where the implementation used uninitialized memory in `out` instead of copying `self` into `out`
4) removes a couple of unneeded IIFE patterns
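
For orientation, a minimal sketch (not from the PR) of the `self` broadcasting semantics item 2 concerns, shown with the standard overloads; the `out_dtype` overloads are expected to follow the same rules:

```python
import torch

mat1 = torch.randn(2, 4)
mat2 = torch.randn(4, 3)
bias = torch.randn(3)              # broadcastable against the (2, 3) result of mat1 @ mat2

# `self` (here `bias`) broadcasts to the output shape; no resize should be needed.
y = torch.addmm(bias, mat1, mat2)
assert y.shape == (2, 3)

b1 = torch.randn(5, 2, 4)
b2 = torch.randn(5, 4, 3)
batch_bias = torch.randn(1, 2, 3)  # broadcastable to the (5, 2, 3) batched result
z = torch.baddbmm(batch_bias, b1, b2)
assert z.shape == (5, 2, 3)
```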

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
2025-11-17 20:50:30 +00:00
a892f76d06 [MPS] mm out sparse (#167908)
Enables mm out for sparse tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167908
Approved by: https://github.com/malfet
2025-11-17 20:44:58 +00:00
2ddcf53e1a Logaddexp complex inconsistent bw cpu and cuda (#163509)
Fixes #158429

Updated LogAddExpKernel.cu to allow complex numbers. Also updated the unit test to run test_logaddexp on CUDA with complex data types, and added a unit test in test_linalg.py to compare results between CUDA and CPU.
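
A minimal consistency check in the spirit of the added tests (separate from the PR's own test code), assuming a CUDA build with the complex support enabled by this change:

```python
import torch

a = torch.randn(4, dtype=torch.complex64)
b = torch.randn(4, dtype=torch.complex64)

cpu_out = torch.logaddexp(a, b)
if torch.cuda.is_available():
    cuda_out = torch.logaddexp(a.cuda(), b.cuda())
    # CPU and CUDA should now agree for complex inputs.
    torch.testing.assert_close(cpu_out, cuda_out.cpu())
```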

@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163509
Approved by: https://github.com/isuruf
2025-11-17 20:30:51 +00:00
689d731ece [inductor] fix the decision of inner reduction (#167697)
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with mixed-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167697
Approved by: https://github.com/jansel, https://github.com/v0i0
2025-11-17 20:17:20 +00:00
b288d0020b [inductor] unittest for run2run determinism (#167482)
Not sure if the paths are already properly set up so that 'benchmarks/dynamo/huggingface.py' can be called directly from a unit test. Let's see what CI says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167482
Approved by: https://github.com/v0i0, https://github.com/mlazos
2025-11-17 20:12:15 +00:00
4414e1bff0 Cleanup in inductor usage of nccl estimator after its fix (#167633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167633
Approved by: https://github.com/eellison
ghstack dependencies: #167827
2025-11-17 19:02:56 +00:00
694f9b943c Revert "[ROCm][CI] Upgrade ROCm CI to 7.1 (#166743)"
This reverts commit 77acc66df917a2b9f6305d089ac88b8975786552.

Reverted https://github.com/pytorch/pytorch/pull/166743 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166743#issuecomment-3543307333))
2025-11-17 18:25:59 +00:00
01deee228a Fix dataloader tests failing on python 3.14 (#167429)
The following tests are failing on Python 3.14 on Linux machines:

* TestSetAffinity::test_set_affinity_in_worker_init
    * Why? 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails (see the sketch after this list).
* TestIndividualWorkerQueue::test_ind_worker_queue
    * Why? The test was hitting a timeout. This is also related to the start method. I am increasing the timeout and reducing batch-size iterations to reduce total unit test time.
    * Fixes https://github.com/pytorch/pytorch/issues/68643
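
A minimal sketch (not the PyTorch test itself) of the underlying pickling limitation: with `forkserver`/`spawn`, the `worker_init_fn` must be pickled to reach the worker process, and locally defined functions cannot be pickled.

```python
import pickle


def module_level_init(worker_id):
    # Defined at module scope, so it can be pickled and shipped to a worker.
    pass


def main():
    def local_init(worker_id):
        # Defined inside another function, so pickling it fails.
        pass

    pickle.dumps(module_level_init)  # fine
    try:
        pickle.dumps(local_init)
    except (AttributeError, pickle.PicklingError) as exc:
        print(f"cannot send a local function to a forkserver worker: {exc}")


if __name__ == "__main__":
    main()
```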

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
2025-11-17 18:10:26 +00:00
1233be0923 [STABLE ABI] Add mutable_data_ptr() and const_data_ptr() methods to torch::stable::Tensor. (#161891)
This ghstack is a prerequisite for porting torchaudio C++ extensions to use torch stable ABI, see https://github.com/pytorch/audio/issues/4074, https://github.com/pytorch/audio/issues/4075, https://github.com/pytorch/audio/issues/4076, https://github.com/pytorch/audio/issues/4077, https://github.com/pytorch/audio/issues/4078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161891
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #167772
2025-11-17 18:05:36 +00:00
02b55c3f4a Move isQIntType to headeronly (#167772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167772
Approved by: https://github.com/janeyx99
2025-11-17 18:05:36 +00:00
ae3ce54f27 Revert "[ROCm] Enable StaticCudaLauncher for ROCm (#166492)"
This reverts commit 99fdca8f4d856cc52eb39d5e70be73dbd48228f8.

Reverted https://github.com/pytorch/pytorch/pull/166492 on behalf of https://github.com/jeanschmidt due to Internally we still depends on the old logic, so we need to find a way to maintain backwards compatibility, for now ([comment](https://github.com/pytorch/pytorch/pull/166492#issuecomment-3543198811))
2025-11-17 17:59:42 +00:00
2f3bb7482c Improve benchmarks/dynamo:check_perf_csv output and failure summary (#161728)
Resolves https://github.com/pytorch/pytorch/issues/161290

## Summary

Expands `dynamo/check_perf_csv.py` output capabilities with latency, compile time and memory information:

- Displays the measured speedup and the % delta from the target
- Adds clear messaging when all model tests pass and no regression is found
- Adds error handling if the CSV file is missing

### Example (Failing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
    hf_Bert
  - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.00
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161728
Approved by: https://github.com/isuruf
2025-11-17 17:54:29 +00:00
567dcdba75 Fix longstanding race condition around getAllOperatorsFor (#167860)
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167860
Approved by: https://github.com/zou3519
2025-11-17 17:37:02 +00:00
77acc66df9 [ROCm][CI] Upgrade ROCm CI to 7.1 (#166743)
Upgrade all the ROCm docker images to ROCm 7.1 release version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166743
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
2025-11-17 17:17:25 +00:00
95d1df7d4e Disable CUDA MXFP4 on non-B200 GPUs (#167857)
Summary:

MXFP4 unit tests pass on B200, fail on RTX 5090 - disable non-B200
cases.

Also fail with a not-implemented error on non-B200 GPUs to avoid unhelpful failure messages.

Test Plan:

```
pytest -sv -k "mxfp4" test/test_scaled_matmul_cuda.py
```

Reviewers:

@nWEIdia

Subscribers:

Tasks:

Fixes https://github.com/pytorch/pytorch/issues/167850

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167857
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-11-17 17:14:53 +00:00
094e529c64 [MPS] Fix repeat_interleave with slices (#167961)
Alas, one cannot use `repeat_interleave_common` for MPS tensors, as `data_offset` is not a valid pointer to `id<MTLTensor>`.
On the other hand, one does not need to use `AT_DISPATCH_INDEX_TYPES`, as dispatching happens on the shader side.
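
A reproducer-style sketch (assumed shape, not taken verbatim from issue #167924) of calling `repeat_interleave` on a sliced MPS tensor, i.e. a view with a nonzero storage offset:

```python
import torch

if torch.backends.mps.is_available():
    x = torch.arange(10, device="mps")
    view = x[2:]                     # view with a nonzero storage offset
    out = view.repeat_interleave(2)  # previously took an invalid data_offset path
    print(out.cpu())
```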

Fixes https://github.com/pytorch/pytorch/issues/167924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167961
Approved by: https://github.com/manuelcandales
2025-11-17 17:10:59 +00:00
a4c7bf7e8d Revert "Use c10::filesystem (#167821)"
This reverts commit deabb3e36de207aa497b035a8bdf6ec1b37d17fe.

Reverted https://github.com/pytorch/pytorch/pull/167821 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D87148810. @Skylion007 may you help the author to get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/167821#issuecomment-3542877623))
2025-11-17 16:48:57 +00:00
22ccd44d73 Revert "Improve char printing (#167899)"
This reverts commit 2245d7d3b90162ae2958929a22c140537cfc4b42.

Reverted https://github.com/pytorch/pytorch/pull/167899 on behalf of https://github.com/jeanschmidt due to need to revert in order to revert https://github.com/pytorch/pytorch/pull/167899 ([comment](https://github.com/pytorch/pytorch/pull/167899#issuecomment-3542869096))
2025-11-17 16:46:44 +00:00
39ebab1dd9 Revert "Remove python workaround for ContextDecorator (#167049)"
This reverts commit e20ca3bc2e6ef9935c782fe548348f81fabc5bd7.

Reverted https://github.com/pytorch/pytorch/pull/167049 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87120562, @Skylion007 please thelp the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/167049#issuecomment-3542847796))
2025-11-17 16:41:26 +00:00
4c152a71ad Revert "add device generalization support for distributed tests (#165067)"
This reverts commit 96a4c4b3d1c533b36cfa7259524b91a0eaf4254f.

Reverted https://github.com/pytorch/pytorch/pull/165067 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87036515, @albanD please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/165067#issuecomment-3542820651))
2025-11-17 16:37:07 +00:00
1b43d6cd4e [ROCm] enable fastSpecializedAtomicAdd for gfx950 (#167661)
Use standard HIP headers for unsafeAtomicAdd. Removes copy/paste of unsafeAtomicAdd as "preview" implementation for gfx942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167661
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-11-17 16:18:49 +00:00
2b69673bbf [CD] Add libopenblas to dep list for AArch64+CPU whl (#167841)
#166044 removed openblas from the wheel dependency list for the AArch64+CPU build, so this PR adds it back. This only affects the CPU build, since AArch64+CUDA uses NVPL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167841
Approved by: https://github.com/tinglvv, https://github.com/malfet
2025-11-17 16:11:39 +00:00
2f74916e36 Do not hardfail on use nccl estimations for non-nccl (#167827)
Previously we hard-failed if the pg was "gloo".
Now we fall back on hardcoded formulas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167827
Approved by: https://github.com/eellison
2025-11-17 16:06:26 +00:00
2b5eabc74b Rework PyObject preservation (v2) (#167564)
Make the PyObject preservation scheme thread-safe with free-threaded (nogil) Python. The general idea is:

* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object

This is implemented in `intrusive_ptr`:

* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle.

Other benefits:

* We can delete a lot of the copypasta from Python internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.
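
A minimal illustration of the user-visible weakref guarantee (the interesting cases involve objects kept alive only from C++, which a pure-Python snippet cannot exercise):

```python
import weakref

import torch

t = torch.randn(3)
w = weakref.ref(t)
assert w() is t     # the weakref resolves while the tensor is alive

del t
assert w() is None  # and clears once the last Python reference is gone
```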

Risks:

* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change

(Second attempt at https://github.com/pytorch/pytorch/pull/166342)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-11-17 14:52:02 +00:00
9ff95f6835 [inductor] Expose config for fx bucket all_reduces (#167634)
Exposes `_inductor.config.bucket_all_reduces_fx`, similar to the existing all_gathers and reduce_scatters options, with "all" as the only supported value.
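
A minimal usage sketch; the flag name is taken from this message, and "all" is described as the only supported value:

```python
import torch._inductor.config as inductor_config

# Bucket all_reduce collectives in the FX passes, mirroring the existing
# all_gathers / reduce_scatters style options mentioned above (assumed names).
inductor_config.bucket_all_reduces_fx = "all"
```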

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167634
Approved by: https://github.com/eellison
2025-11-17 13:10:36 +00:00
6fdb974f4a Update torch-xpu-ops commit pin (#167698)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1e69f4](1e69f40b3c), includes:

- Add PTL to the default AOT target list for both Windows and Linux
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167698
Approved by: https://github.com/EikanWang
2025-11-17 12:58:42 +00:00
661d1653aa [xla hash update] update the pinned xla hash (#167968)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167968
Approved by: https://github.com/pytorchbot
2025-11-17 12:20:32 +00:00
53809f9640 [ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)
Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch to boost LLM prefill & decode performance

**This change improves decode throughput by ~15% and reduces the memory required to run inference on the model by 50%**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results

| Metric                           | With `--compile`         | Without `--compile`      |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme              | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision                  | BF16                      | BF16                      |
| Number of Layers Quantized       | 32                        | 32                        |
| Average Compression Ratio        | 87.49%                    | 87.49%                    |
| Total Quantization Time (s)      | 9.62                      | 10.32                     |
| Compile Time (First) (s)         | 134.48                    | 1.69                      |
| Compile Time (Second) (s)        | 80.44                     | 1.60                      |
| Compile Time (Subsequent) (s)    | 0.19                      | 0.22                      |
| Prefill Tokens                   | 54                        | 54                        |
| Decoded Tokens                   | 33                        | 33                        |
| Prefill Time (s)                 | 0.19                      | 0.22                      |
| Decode Time (s)                  | 0.76                      | 1.38                      |
| E2E Generation Time (s)          | 0.95                      | 1.60                      |
| Prefill Throughput (tokens/s)    | 288.13                    | 249.91                    |
| Decode Throughput (tokens/s)     | 43.42                     | 23.83                     |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-17 12:06:33 +00:00
93ddd38ecd Re-land#2 "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace" (#167928)
Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.

Re-land context:

This is a re-land of https://github.com/pytorch/pytorch/pull/167248.

A few issues were addressed:
- fix for a bug in the fast path (premature return in getCurrentCUDABlasHandle)
- fix for test flakiness (https://github.com/pytorch/pytorch/pull/167884)

Test Plan:
1. regression tests:
buck2 test mode/opt //caffe2/test:test_transformers_cuda
https://www.internalfb.com/intern/testinfra/testrun/6192449759713581

2. Use a GPU OD, run multi-threaded tests with TSAN:

buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

Differential Revision: D87111985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167928
Approved by: https://github.com/Skylion007
2025-11-17 12:05:08 +00:00
324 changed files with 3359 additions and 2901 deletions

View File

@ -75,9 +75,11 @@ if [[ "$ARCH" == "aarch64" ]]; then
# ARM system libraries
DEPS_LIST+=(
"/usr/lib64/libgfortran.so.5"
"/opt/OpenBLAS/lib/libopenblas.so.0"
)
DEPS_SONAME+=(
"libgfortran.so.5"
"libopenblas.so.0"
)
fi

View File

@ -1 +1 @@
e4d25697f9dc5eedaf8f0a5bf085c62c5455a53a
94631807d22c09723dd006f7be5beb649d5f88d0

View File

@ -144,7 +144,7 @@ inline std::bitset<kVmapNumLevels> createVmapLevelsBitset(BatchDimsRef bdims) {
}
inline std::ostream& operator<<(std::ostream& out, const BatchDim& bdim) {
out << "(lvl=" << bdim.level() << ", dim=" << bdim.dim() << ')';
out << "(lvl=" << bdim.level() << ", dim=" << bdim.dim() << ")";
return out;
}

View File

@ -9,7 +9,7 @@ namespace indexing {
const EllipsisIndexType Ellipsis = EllipsisIndexType();
std::ostream& operator<<(std::ostream& stream, const Slice& slice) {
stream << slice.start() << ':' << slice.stop() << ':' << slice.step();
stream << slice.start() << ":" << slice.stop() << ":" << slice.step();
return stream;
}
@ -31,12 +31,12 @@ std::ostream& operator<<(std::ostream& stream, const TensorIndex& tensor_index)
}
std::ostream& operator<<(std::ostream& stream, const std::vector<TensorIndex>& tensor_indices) {
stream << '(';
stream << "(";
for (const auto i : c10::irange(tensor_indices.size())) {
stream << tensor_indices[i];
if (i < tensor_indices.size() - 1) stream << ", ";
}
stream << ')';
stream << ")";
return stream;
}

View File

@ -113,7 +113,7 @@ void TensorNames::checkUnique(const char* op_name) const {
std::ostream& operator<<(std::ostream& out, const TensorName& tensorname) {
out << tensorname.name_ << " (index ";
out << tensorname.origin_idx_ << " of ";
out << tensorname.origin_ << ')';
out << tensorname.origin_ << ")";
return out;
}

View File

@ -13,9 +13,9 @@ std::ostream& operator<<(std::ostream & out, const TensorGeometryArg& t) {
if (t.pos == 0) {
// 0 is distinguished; it usually indicates 'self' or the return
// tensor
out << '\'' << t.name << '\'';
out << "'" << t.name << "'";
} else {
out << "argument #" << t.pos << " '" << t.name << '\'';
out << "argument #" << t.pos << " '" << t.name << "'";
}
return out;
}
@ -154,7 +154,7 @@ void checkSameGPU(CheckedFrom c, const TensorArg& t1, const TensorArg& t2) {
oss << "Tensor for " << t2 << " is on CPU, ";
}
oss << "but expected " << ((!t1->is_cpu() && !t2->is_cpu()) ? "them" : "it")
<< " to be on GPU (while checking arguments for " << c << ')';
<< " to be on GPU (while checking arguments for " << c << ")";
TORCH_CHECK(false, oss.str());
}
TORCH_CHECK(
@ -199,7 +199,7 @@ void checkScalarTypes(CheckedFrom c, const TensorArg& t,
i++;
}
oss << "; but got " << t->toString()
<< " instead (while checking arguments for " << c << ')';
<< " instead (while checking arguments for " << c << ")";
TORCH_CHECK(false, oss.str());
}
}

View File

@ -43,8 +43,8 @@ std::string get_mkldnn_version() {
// https://github.com/intel/ideep/issues/29
{
const dnnl_version_t* ver = dnnl_version();
ss << "Intel(R) MKL-DNN v" << ver->major << '.' << ver->minor << '.' << ver->patch
<< " (Git Hash " << ver->hash << ')';
ss << "Intel(R) MKL-DNN v" << ver->major << "." << ver->minor << "." << ver->patch
<< " (Git Hash " << ver->hash << ")";
}
#else
ss << "MKLDNN not found";
@ -81,7 +81,7 @@ std::string get_openmp_version() {
break;
}
if (ver_str) {
ss << " (a.k.a. OpenMP " << ver_str << ')';
ss << " (a.k.a. OpenMP " << ver_str << ")";
}
}
#else
@ -135,38 +135,38 @@ std::string show_config() {
#if defined(__GNUC__)
{
ss << " - GCC " << __GNUC__ << '.' << __GNUC_MINOR__ << '\n';
ss << " - GCC " << __GNUC__ << "." << __GNUC_MINOR__ << "\n";
}
#endif
#if defined(__cplusplus)
{
ss << " - C++ Version: " << __cplusplus << '\n';
ss << " - C++ Version: " << __cplusplus << "\n";
}
#endif
#if defined(__clang_major__)
{
ss << " - clang " << __clang_major__ << '.' << __clang_minor__ << '.' << __clang_patchlevel__ << '\n';
ss << " - clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << "\n";
}
#endif
#if defined(_MSC_VER)
{
ss << " - MSVC " << _MSC_FULL_VER << '\n';
ss << " - MSVC " << _MSC_FULL_VER << "\n";
}
#endif
#if AT_MKL_ENABLED()
ss << " - " << get_mkl_version() << '\n';
ss << " - " << get_mkl_version() << "\n";
#endif
#if AT_MKLDNN_ENABLED()
ss << " - " << get_mkldnn_version() << '\n';
ss << " - " << get_mkldnn_version() << "\n";
#endif
#ifdef _OPENMP
ss << " - " << get_openmp_version() << '\n';
ss << " - " << get_openmp_version() << "\n";
#endif
#if AT_BUILD_WITH_LAPACK()
@ -183,7 +183,7 @@ std::string show_config() {
ss << " - Cross compiling on MacOSX\n";
#endif
ss << " - "<< used_cpu_capability() << '\n';
ss << " - "<< used_cpu_capability() << "\n";
if (hasCUDA()) {
ss << detail::getCUDAHooks().showConfig();
@ -200,10 +200,10 @@ std::string show_config() {
ss << " - Build settings: ";
for (const auto& pair : caffe2::GetBuildOptions()) {
if (!pair.second.empty()) {
ss << pair.first << '=' << pair.second << ", ";
ss << pair.first << "=" << pair.second << ", ";
}
}
ss << '\n';
ss << "\n";
// TODO: do HIP
// TODO: do XLA

View File

@ -209,7 +209,7 @@ struct CodeTemplate {
// to indent correctly in the context.
void emitIndent(std::ostream& out, size_t indent) const {
for ([[maybe_unused]] const auto i : c10::irange(indent)) {
out << ' ';
out << " ";
}
}
void emitStringWithIndents(

View File

@ -10,7 +10,7 @@ std::ostream& operator<<(std::ostream& out, const Dimname& dimname) {
if (dimname.type() == NameType::WILDCARD) {
out << "None";
} else {
out << '\'' << dimname.symbol().toUnqualString() << '\'';
out << "'" << dimname.symbol().toUnqualString() << "'";
}
return out;
}

View File

@ -5,7 +5,7 @@
namespace at {
std::ostream& operator<<(std::ostream& out, const Range& range) {
out << "Range[" << range.begin << ", " << range.end << ']';
out << "Range[" << range.begin << ", " << range.end << "]";
return out;
}

View File

@ -71,7 +71,7 @@ void TensorBase::enforce_invariants() {
void TensorBase::print() const {
if (defined()) {
std::cerr << '[' << toString() << ' ' << sizes() << ']' << '\n';
std::cerr << "[" << toString() << " " << sizes() << "]" << '\n';
} else {
std::cerr << "[UndefinedTensor]" << '\n';
}

View File

@ -245,6 +245,9 @@ class TORCH_API TensorBase {
size_t weak_use_count() const noexcept {
return impl_.weak_use_count();
}
bool is_uniquely_owned() const noexcept {
return impl_.is_uniquely_owned();
}
std::string toString() const;

View File

@ -9,8 +9,8 @@ APIVitals VitalsAPI;
std::ostream& operator<<(std::ostream& os, TorchVital const& tv) {
for (const auto& m : tv.attrs) {
os << "[TORCH_VITAL] " << tv.name << '.' << m.first << "\t\t "
<< m.second.value << '\n';
os << "[TORCH_VITAL] " << tv.name << "." << m.first << "\t\t "
<< m.second.value << "\n";
}
return os;
}

View File

@ -100,18 +100,18 @@ inline bool operator==(const AliasInfo& lhs, const AliasInfo& rhs) {
// this does match the way things are represented in the schema
inline std::ostream& operator<<(std::ostream& out, const AliasInfo& aliasInfo) {
out << '(';
out << "(";
bool first = true;
for (const auto& set : aliasInfo.beforeSets()) {
if (first) {
first = false;
} else {
out << '|';
out << "|";
}
out << set.toUnqualString();
}
if (aliasInfo.isWrite()) {
out << '!';
out << "!";
}
if (aliasInfo.beforeSets() != aliasInfo.afterSets()) {
out << " -> ";
@ -120,12 +120,12 @@ inline std::ostream& operator<<(std::ostream& out, const AliasInfo& aliasInfo) {
if (first) {
first = false;
} else {
out << '|';
out << "|";
}
out << set.toUnqualString();
}
}
out << ')';
out << ")";
return out;
}
} // namespace c10

View File

@ -198,7 +198,7 @@ inline void swap(Blob& lhs, Blob& rhs) noexcept {
}
inline std::ostream& operator<<(std::ostream& out, const Blob& v) {
return out << "Blob[" << v.TypeName() << ']';
return out << "Blob[" << v.TypeName() << "]";
}
} // namespace caffe2

View File

@ -456,8 +456,8 @@ bool ClassType::isSubtypeOfExt(const Type& rhs, std::ostream* why_not) const {
*why_not << "Method on class '" << repr_str()
<< "' (1) is not compatible with interface '"
<< rhs.repr_str() << "' (2)\n"
<< " (1) " << self_method->getSchema() << '\n'
<< " (2) " << schema << '\n';
<< " (1) " << self_method->getSchema() << "\n"
<< " (2) " << schema << "\n";
}
return false;
}

View File

@ -100,7 +100,7 @@ struct TORCH_API ClassType : public NamedType {
std::string repr_str() const override {
std::stringstream ss;
ss << str()
<< " (of Python compilation unit at: " << compilation_unit().get() << ')';
<< " (of Python compilation unit at: " << compilation_unit().get() << ")";
return ss.str();
}

View File

@ -58,12 +58,12 @@ std::string DispatchKeyExtractor::dumpState() const {
std::ostringstream oss;
for (const auto i : c10::irange(c10::utils::bitset::NUM_BITS())) {
if (dispatch_arg_indices_reverse_.get(i)) {
oss << '1';
oss << "1";
} else {
oss << '0';
oss << "0";
}
}
oss << ' ' << nonFallthroughKeys_ << '\n';
oss << " " << nonFallthroughKeys_ << "\n";
return oss.str();
}

View File

@ -69,8 +69,8 @@ private:
void _print_dispatch_trace(const std::string& label, const std::string& op_name, const DispatchKeySet& dispatchKeySet) {
auto nesting_value = dispatch_trace_nesting_value();
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << ' ';
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << ']' << std::endl;
for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " ";
std::cerr << label << " op=[" << op_name << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl;
}
} // namespace detail

View File

@ -570,7 +570,7 @@ void OperatorEntry::checkInvariants() const {
std::string OperatorEntry::listAllDispatchKeys() const {
std::ostringstream str;
str << '[';
str << "[";
bool has_kernels = false;
for (auto k : allDispatchKeysInFullSet()) {
@ -584,7 +584,7 @@ std::string OperatorEntry::listAllDispatchKeys() const {
str << k;
has_kernels = true;
}
str << ']';
str << "]";
return str.str();
}
@ -683,12 +683,12 @@ void OperatorEntry::setReportErrorCallback_(std::unique_ptr<c10::SafePyObject> c
// This WON'T report backend fallbacks.
std::string OperatorEntry::dumpState() const {
std::ostringstream oss;
oss << "name: " << name_ << '\n';
oss << "name: " << name_ << "\n";
if (schema_) {
oss << "schema: " << schema_->schema << '\n';
oss << "debug: " << schema_->debug << '\n';
oss << "schema: " << schema_->schema << "\n";
oss << "debug: " << schema_->debug << "\n";
oss << "alias analysis kind: " << toString(schema_->schema.aliasAnalysis())
<< (schema_->schema.isDefaultAliasAnalysisKind() ? " (default)" : "") << '\n';
<< (schema_->schema.isDefaultAliasAnalysisKind() ? " (default)" : "") << "\n";
} else {
oss << "schema: (none)\n";
}

View File

@ -7,7 +7,7 @@
namespace c10 {
void FunctionSchema::dump() const {
std::cout << *this << '\n';
std::cout << *this << "\n";
}
const std::vector<Argument>& FunctionSchema::getCorrectList(SchemaArgType type) const {
@ -210,9 +210,9 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
out << schema.name();
if (!schema.overload_name().empty()) {
out << '.' << schema.overload_name();
out << "." << schema.overload_name();
}
out << '(';
out << "(";
bool seen_kwarg_only = false;
for (const auto i : c10::irange(schema.arguments().size())) {
@ -273,7 +273,7 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
}
if (need_paren) {
out << '(';
out << "(";
}
for (const auto i : c10::irange(returns.size())) {
if (i > 0) {
@ -288,7 +288,7 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
out << "...";
}
if (need_paren) {
out << ')';
out << ")";
}
return out;
}
@ -471,7 +471,7 @@ bool FunctionSchema::isForwardCompatibleWith(
if (!arguments().at(i).isForwardCompatibleWith(old.arguments().at(i))) {
if (why_not) {
why_not
<< '\'' << arguments().at(i).name() << '\''
<< "'" << arguments().at(i).name() << "'"
<< " is not forward compatible with the older version of the schema";
}
return false;
@ -511,7 +511,7 @@ bool FunctionSchema::isForwardCompatibleWith(
.isForwardCompatibleWith(old.arguments().at(i))) {
if (why_not) {
why_not << "Out argument '"
<< '\'' << arguments().at(i).name()
<< "'" << arguments().at(i).name()
<< " is not FC with the older version of the schema";
}
return false;

View File

@ -571,7 +571,7 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
if (arg.N()) {
N = std::to_string(*arg.N());
}
out << '[' << N << ']';
out << "[" << N << "]";
} else {
out << unopt_type->str();
}
@ -582,15 +582,15 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) {
}
if (is_opt) {
out << '?';
out << "?";
}
if (!arg.name().empty()) {
out << ' ' << arg.name();
out << " " << arg.name();
}
if (arg.default_value()) {
out << '=';
out << "=";
if ((type->kind() == c10::TypeKind::StringType ||
unopt_type->kind() == c10::TypeKind::StringType) &&
arg.default_value().value().isString()) {

View File

@ -66,7 +66,7 @@ bool operator==(const ivalue::Tuple& lhs, const ivalue::Tuple& rhs) {
}
std::ostream& operator<<(std::ostream& out, const ivalue::EnumHolder& v) {
out << v.qualifiedClassName() << '.' << v.name();
out << v.qualifiedClassName() << "." << v.name();
return out;
}
@ -526,7 +526,7 @@ std::ostream& printMaybeAnnotatedList(
!elementTypeCanBeInferredFromMembers(list_elem_type)) {
out << "annotate(" << the_list.type<c10::Type>()->annotation_str() << ", ";
printList(out, the_list.toListRef(), "[", "]", formatter);
out << ')';
out << ")";
return out;
} else {
return printList(out, the_list.toListRef(), "[", "]", formatter);
@ -538,7 +538,7 @@ std::ostream& printDict(
std::ostream& out,
const Dict& v,
const IValueFormatter& formatter) {
out << '{';
out << "{";
bool first = true;
for (const auto& pair : v) {
@ -552,7 +552,7 @@ std::ostream& printDict(
first = false;
}
out << '}';
out << "}";
return out;
}
}
@ -565,8 +565,8 @@ static std::ostream& printMaybeAnnotatedDict(
auto value_type = the_dict.type()->castRaw<DictType>()->getValueType();
if (the_dict.toGenericDict().empty() ||
!elementTypeCanBeInferredFromMembers(value_type)) {
out << "annotate(" << the_dict.type<c10::Type>()->annotation_str() << ',';
printDict(out, the_dict.toGenericDict(), formatter) << ')';
out << "annotate(" << the_dict.type<c10::Type>()->annotation_str() << ",";
printDict(out, the_dict.toGenericDict(), formatter) << ")";
} else {
return printDict(out, the_dict.toGenericDict(), formatter);
}
@ -577,7 +577,7 @@ static std::ostream& printComplex(std::ostream & out, const IValue & v) {
c10::complex<double> d = v.toComplexDouble();
IValue real(d.real()), imag(std::abs(d.imag()));
auto sign = d.imag() >= 0 ? '+' : '-';
return out << real << sign << imag << 'j';
return out << real << sign << imag << "j";
}
std::ostream& IValue::repr(
@ -605,9 +605,9 @@ std::ostream& IValue::repr(
if (static_cast<double>(i) == d) {
// -0.0 (signed zero) needs to be parsed as -0.
if (i == 0 && std::signbit(d)) {
return out << '-' << i << '.';
return out << "-" << i << ".";
}
return out << i << '.';
return out << i << ".";
}
}
auto orig_prec = out.precision();
@ -643,20 +643,20 @@ std::ostream& IValue::repr(
device_stream << v.toDevice();
out << "torch.device(";
c10::printQuotedString(out, device_stream.str());
return out << ')';
return out << ")";
}
case IValue::Tag::Generator: {
auto generator = v.toGenerator();
out << "torch.Generator(device=";
c10::printQuotedString(out, generator.device().str());
out << ", seed=" << generator.current_seed() << ')';
out << ", seed=" << generator.current_seed() << ")";
return out;
}
case IValue::Tag::GenericDict:
return printMaybeAnnotatedDict(out, v, formatter);
case IValue::Tag::Enum: {
auto enum_holder = v.toEnumHolder();
return out << enum_holder->qualifiedClassName() << '.' <<
return out << enum_holder->qualifiedClassName() << "." <<
enum_holder->name();
}
case IValue::Tag::Object: {
@ -801,7 +801,7 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
if (c == FP_NORMAL || c == FP_ZERO) {
int64_t i = static_cast<int64_t>(d);
if (static_cast<double>(i) == d) {
return out << i << '.';
return out << i << ".";
}
}
auto orig_prec = out.precision();
@ -852,7 +852,7 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
return printDict(out, v.toGenericDict(), formatter);
case IValue::Tag::PyObject: {
auto py_obj = v.toPyObject();
return out << "<PyObject at" << py_obj << '>';
return out << "<PyObject at" << py_obj << ">";
}
case IValue::Tag::Generator:
return out << "Generator";
@ -862,22 +862,22 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) {
// TODO we should attempt to call __str__ if the object defines it.
auto obj = v.toObject();
// print this out the way python would do it
return out << '<' << obj->name() << " object at " << obj.get() << '>';
return out << "<" << obj->name() << " object at " << obj.get() << ">";
}
case IValue::Tag::Enum: {
auto enum_holder = v.toEnumHolder();
return out << "Enum<" << enum_holder->unqualifiedClassName() << '.' <<
enum_holder->name() << '>';
return out << "Enum<" << enum_holder->unqualifiedClassName() << "." <<
enum_holder->name() << ">";
}
}
return out << "<Invalid IValue tag=" << std::to_string(static_cast<uint32_t>(v.tag)) << '>';
return out << "<Invalid IValue tag=" << std::to_string(static_cast<uint32_t>(v.tag)) << ">";
}
#undef TORCH_FORALL_TAGS
void IValue::dump() const {
std::cout << *this << '\n';
std::cout << *this << "\n";
}
std::shared_ptr<ClassType> ivalue::Object::type() const {
@ -1050,7 +1050,7 @@ c10::intrusive_ptr<ivalue::Object> ivalue::Object::deepcopy(
std::stringstream err;
err << "Cannot serialize custom bound C++ class";
if (auto qualname = type()->name()) {
err << ' ' << qualname->qualifiedName();
err << " " << qualname->qualifiedName();
}
err << ". Please define serialization methods via def_pickle() for "
"this class.";

View File

@ -211,7 +211,7 @@ struct TORCH_API OptionalType : public UnionType {
std::string str() const override {
std::stringstream ss;
ss << getElementType()->str() << '?';
ss << getElementType()->str() << "?";
return ss.str();
}
@ -240,7 +240,7 @@ struct TORCH_API OptionalType : public UnionType {
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Optional[" << getElementType()->annotation_str(printer) << ']';
ss << "Optional[" << getElementType()->annotation_str(printer) << "]";
return ss.str();
}
};
@ -906,7 +906,7 @@ struct TORCH_API ListType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "List[" << getElementType()->annotation_str(printer) << ']';
ss << "List[" << getElementType()->annotation_str(printer) << "]";
return ss.str();
}
};
@ -946,7 +946,7 @@ struct TORCH_API DictType : public SharedType {
std::string str() const override {
std::stringstream ss;
ss << "Dict(" << getKeyType()->str() << ", " << getValueType()->str()
<< ')';
<< ")";
return ss.str();
}
@ -1018,7 +1018,7 @@ struct TORCH_API FutureType
std::string str() const override {
std::stringstream ss;
ss << "Future(" << getElementType()->str() << ')';
ss << "Future(" << getElementType()->str() << ")";
return ss.str();
}
TypePtr createWithContained(
@ -1041,7 +1041,7 @@ struct TORCH_API FutureType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Future[" << getElementType()->annotation_str(printer) << ']';
ss << "Future[" << getElementType()->annotation_str(printer) << "]";
return ss.str();
}
};
@ -1060,7 +1060,7 @@ struct TORCH_API AwaitType
std::string str() const override {
std::stringstream ss;
ss << "Await(" << getElementType()->str() << ')';
ss << "Await(" << getElementType()->str() << ")";
return ss.str();
}
TypePtr createWithContained(
@ -1083,7 +1083,7 @@ struct TORCH_API AwaitType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "Await[" << getElementType()->annotation_str(printer) << ']';
ss << "Await[" << getElementType()->annotation_str(printer) << "]";
return ss.str();
}
};
@ -1102,7 +1102,7 @@ struct TORCH_API RRefType
std::string str() const override {
std::stringstream ss;
ss << "RRef(" << getElementType()->str() << ')';
ss << "RRef(" << getElementType()->str() << ")";
return ss.str();
}
TypePtr createWithContained(
@ -1115,7 +1115,7 @@ struct TORCH_API RRefType
std::string annotation_str_impl(const TypePrinter& printer = nullptr) const override {
std::stringstream ss;
ss << "RRef[" << getElementType()->annotation_str(printer) << ']';
ss << "RRef[" << getElementType()->annotation_str(printer) << "]";
return ss.str();
}
};

View File

@ -11,7 +11,7 @@ std::string toString(const OperatorName& opName) {
std::ostream& operator<<(std::ostream& os, const OperatorName& opName) {
os << opName.name;
if (!opName.overload_name.empty()) {
os << '.' << opName.overload_name;
os << "." << opName.overload_name;
}
return os;
}

View File

@ -65,7 +65,7 @@ VaryingShape<T> VaryingShape<T>::merge(const VaryingShape<T>& other) const {
template <typename T>
std::ostream& operator<<(std::ostream& out, const VaryingShape<T>& vs) {
out << '(';
out << "(";
if (!vs.size()) {
out << "*)";
return out;
@ -79,10 +79,10 @@ std::ostream& operator<<(std::ostream& out, const VaryingShape<T>& vs) {
if (v.has_value()) {
out << v.value();
} else {
out << '*';
out << "*";
}
}
out << ')';
out << ")";
return out;
}
@ -105,7 +105,7 @@ std::ostream& operator<<(
}
auto sizes_opt = ss.sizes();
os << '(';
os << "(";
for (size_t i = 0; i < rank_opt.value(); i++) {
if (i > 0) {
os << ", ";
@ -113,10 +113,10 @@ std::ostream& operator<<(
if(sizes_opt.has_value() && sizes_opt.value()[i].is_static()) {
os << sizes_opt.value()[i];
} else {
os << '*';
os << "*";
}
}
os << ')';
os << ")";
return os;
}
@ -131,17 +131,17 @@ std::ostream& operator<<(std::ostream& os, const ShapeSymbol& s) {
}
std::ostream& operator<<(std::ostream& os, const Stride& s) {
os << '{';
os << "{";
if (s.stride_index_.has_value()) {
os << *s.stride_index_;
} else {
os << '*';
os << "*";
}
os << ':';
os << ":";
if (s.stride_.has_value()) {
os << *s.stride_;
} else {
os << '*';
os << "*";
}
os << '}';
return os;

View File

@ -67,7 +67,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
bool has_valid_strides_info = ndim > 0 &&
value->strides().isComplete() && value->strides().size() == ndim;
out << '(';
out << "(";
size_t i = 0;
bool symbolic = type_verbosity() == TypeVerbosity::Symbolic;
for (i = 0; i < *ndim; ++i) {
@ -79,7 +79,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
} else if (symbolic) {
out << value->symbolic_sizes().at(i);
} else {
out << '*';
out << "*";
}
}
if (has_valid_strides_info &&
@ -91,7 +91,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
}
out << value->strides()[i].value();
}
out << ']';
out << "]";
}
if (type_verbosity() >= TypeVerbosity::Full) {
if (value->requiresGrad()) {
@ -107,12 +107,12 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << "device=" << *value->device();
}
}
out << ')';
out << ")";
} else {
if (type_verbosity() >= TypeVerbosity::Full) {
size_t i = 0;
if (value->requiresGrad()) {
out << '('
out << "("
<< "requires_grad=" << *value->requiresGrad();
i++;
}
@ -120,7 +120,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << ((i++ > 0) ? ", " : "(") << "device=" << *value->device();
}
if (i > 0) {
out << ')';
out << ")";
}
}
}
@ -133,18 +133,18 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << *prim << "[]";
} else if (t.kind() == TypeKind::OptionalType) {
auto prim = t.castRaw<OptionalType>()->getElementType();
out << *prim << '?';
out << *prim << "?";
} else if(t.kind() == TypeKind::FutureType) {
auto elem = t.castRaw<FutureType>()->getElementType();
out << "Future[" << *elem << ']';
out << "Future[" << *elem << "]";
} else if(t.kind() == TypeKind::RRefType) {
auto elem = t.castRaw<RRefType>()->getElementType();
out << "RRef[" << *elem << ']';
out << "RRef[" << *elem << "]";
} else if(auto tup = t.cast<TupleType>()) {
if (tup->schema()) {
out << "NamedTuple";
}
out << '(';
out << "(";
for(size_t i = 0; i < tup->elements().size(); ++i) {
if(i > 0)
out << ", ";
@ -160,7 +160,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
out << *(tup->elements()[i]);
}
}
out << ')';
out << ")";
} else if (t.kind() == TypeKind::FunctionType) {
out << "Function";
} else {
@ -475,7 +475,7 @@ std::optional<TypePtr> unifyTypeList(
why_not << "Could not unify type list since element " << i << " of type "
<< elements.at(i)->repr_str()
<< " did not match the types before it ("
<< ret_type->repr_str() << ')';
<< ret_type->repr_str() << ")";
return std::nullopt;
}
ret_type = *maybe_unified;
@ -907,13 +907,13 @@ std::string TupleType::str() const {
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
ss << name()->qualifiedName();
} else {
ss << '(';
ss << "(";
for(size_t i = 0; i < elements().size(); ++i) {
if(i > 0)
ss << ", ";
ss << elements()[i]->str();
}
ss << ')';
ss << ")";
}
return ss.str();
}
@ -1003,8 +1003,8 @@ bool InterfaceType::isSubTypeImpl(
*why_not << "Method on interface '" << lhs.repr_str()
<< "' (1) is not compatible with interface '"
<< rhs.repr_str() << "' (2)\n"
<< " (1) " << *self_schema << '\n'
<< " (2) " << schema << '\n';
<< " (1) " << *self_schema << "\n"
<< " (2) " << schema << "\n";
return false;
}
return false;
@ -1078,7 +1078,7 @@ SymbolicShape SymbolicShape::merge(const SymbolicShape& other) const {
}
void SymbolicShape::dump() const {
std::cout << *this << '\n';
std::cout << *this << "\n";
}
bool EnumType::isSubtypeOfExt(const Type& rhs, std::ostream* why_not) const {

View File

@ -205,9 +205,9 @@ UnionType::UnionType(std::vector<TypePtr> reference, TypeKind kind) : SharedType
for (const auto i : c10::irange(reference.size())) {
msg << reference[i]->repr_str();
if (i > 0) {
msg << ',';
msg << ",";
}
msg << ' ';
msg << " ";
}
msg << "} has the single type " << types_[0]->repr_str()
<< ". Use the common supertype instead of creating a Union"

View File

@ -80,7 +80,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
stream << buf[i];
}
stream << ']';
stream << "]";
return stream;
}

View File

@ -55,7 +55,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
stream << buf[i];
}
stream << ']';
stream << "]";
return stream;
}

View File

@ -3,6 +3,7 @@
#include <cstdint>
#include <map>
#include <shared_mutex>
#include <cuda_runtime_api.h>
#include <cusparse.h>
@ -88,8 +89,13 @@ TORCH_CUDA_CPP_API cublasHandle_t getCurrentCUDABlasHandle();
TORCH_CUDA_CPP_API cublasLtHandle_t getCurrentCUDABlasLtHandle();
TORCH_CUDA_CPP_API void clearCublasWorkspaces();
TORCH_CUDA_CPP_API std::map<std::tuple<void *, void *>, at::DataPtr>& cublas_handle_stream_to_workspace();
TORCH_CUDA_CPP_API std::map<std::tuple<void *, void *>, at::DataPtr>& cublaslt_handle_stream_to_workspace();
struct WorkspaceMapWithMutex {
std::map<std::tuple<void*, void*>, at::DataPtr> map;
std::shared_mutex mutex;
};
TORCH_CUDA_CPP_API WorkspaceMapWithMutex& cublas_handle_stream_to_workspace();
TORCH_CUDA_CPP_API WorkspaceMapWithMutex& cublaslt_handle_stream_to_workspace();
TORCH_CUDA_CPP_API size_t getChosenWorkspaceSize();
TORCH_CUDA_CPP_API size_t getCUDABlasLtWorkspaceSize();
TORCH_CUDA_CPP_API void* getCUDABlasLtWorkspace();

View File

@ -99,7 +99,7 @@ void destroyCublasHandle(cublasHandle_t handle) {
// - Comments of @soumith copied from cuDNN handle pool implementation
#ifdef NO_CUDNN_DESTROY_HANDLE
#else
cublasDestroy(handle);
cublasDestroy(handle);
#endif
}
@ -107,19 +107,27 @@ using CuBlasPoolType = DeviceThreadHandlePool<cublasHandle_t, createCublasHandle
} // namespace
std::map<std::tuple<void *, void *>, at::DataPtr>& cublas_handle_stream_to_workspace() {
static auto& instance = *new std::map<std::tuple<void *, void *>, at::DataPtr>;
WorkspaceMapWithMutex& cublas_handle_stream_to_workspace() {
static auto& instance = *new WorkspaceMapWithMutex;
return instance;
}
std::map<std::tuple<void *, void *>, at::DataPtr>& cublaslt_handle_stream_to_workspace() {
static auto& instance = *new std::map<std::tuple<void *, void *>, at::DataPtr>;
WorkspaceMapWithMutex& cublaslt_handle_stream_to_workspace() {
static auto& instance = *new WorkspaceMapWithMutex;
return instance;
}
void clearCublasWorkspaces() {
cublas_handle_stream_to_workspace().clear();
cublaslt_handle_stream_to_workspace().clear();
{
auto& workspace = cublas_handle_stream_to_workspace();
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
workspace.map.clear();
}
{
auto& workspace = cublaslt_handle_stream_to_workspace();
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
workspace.map.clear();
}
}
size_t parseChosenWorkspaceSize() {
@ -233,6 +241,38 @@ at::DataPtr getNewCUDABlasLtWorkspace() {
return c10::cuda::CUDACachingAllocator::get()->allocate(getCUDABlasLtWorkspaceSize());
}
void setWorkspaceForHandle(cublasHandle_t handle, c10::cuda::CUDAStream stream) {
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto& workspace = cublas_handle_stream_to_workspace();
size_t workspace_size = getChosenWorkspaceSize();
// Fast path: check if workspace already exists
{
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it != workspace.map.end()) {
TORCH_CUDABLAS_CHECK(cublasSetWorkspace(
handle, workspace_it->second.get(), workspace_size));
return;
}
}
// Slow path: allocate workspace outside the lock
auto new_workspace = getNewWorkspace();
// Insert with lock (double-check in case another thread inserted while we
// were allocating)
{
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.try_emplace(key, std::move(new_workspace)).first;
TORCH_CUDABLAS_CHECK(
cublasSetWorkspace(handle, workspace_it->second.get(), workspace_size));
}
}
void* getCUDABlasLtWorkspace() {
#ifndef USE_ROCM
static bool unified = c10::utils::check_env(TORCH_CUBLASLT_UNIFIED_WORKSPACE) == true;
@ -241,8 +281,10 @@ void* getCUDABlasLtWorkspace() {
auto stream = c10::cuda::getCurrentCUDAStream();
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto workspace_it = at::cuda::cublas_handle_stream_to_workspace().find(key);
TORCH_INTERNAL_ASSERT(workspace_it != at::cuda::cublas_handle_stream_to_workspace().end());
auto& workspace = at::cuda::cublas_handle_stream_to_workspace();
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
TORCH_INTERNAL_ASSERT(workspace_it != workspace.map.end());
return workspace_it->second.mutable_get();
}
#endif
@ -250,11 +292,29 @@ void* getCUDABlasLtWorkspace() {
auto stream = c10::cuda::getCurrentCUDAStream();
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto workspace_it = cublaslt_handle_stream_to_workspace().find(key);
if (workspace_it == cublaslt_handle_stream_to_workspace().end()) {
workspace_it = cublaslt_handle_stream_to_workspace().insert(workspace_it, {key, getNewCUDABlasLtWorkspace()});
auto& workspace = cublaslt_handle_stream_to_workspace();
// Fast path: check if workspace already exists
{
std::shared_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it = workspace.map.find(key);
if (workspace_it != workspace.map.end()) {
return workspace_it->second.mutable_get();
}
}
// Slow path: allocate workspace outside the lock
auto new_workspace = getNewCUDABlasLtWorkspace();
// Insert with lock (double-check in case another thread inserted while we
// were allocating)
{
std::unique_lock<std::shared_mutex> lock(workspace.mutex);
auto workspace_it =
workspace.map.try_emplace(key, std::move(new_workspace)).first;
return workspace_it->second.mutable_get();
}
return workspace_it->second.mutable_get();
}
cublasHandle_t getCurrentCUDABlasHandle() {
@ -298,13 +358,8 @@ cublasHandle_t getCurrentCUDABlasHandle() {
// will allocate memory dynamically (even if they're cheap) outside
// PyTorch's CUDA caching allocator. It's possible that CCA used up
// all the memory and cublas's cudaMallocAsync will return OOM
cudaStream_t _stream = stream;
auto key = std::make_tuple(static_cast<void *>(handle), static_cast<void *>(_stream));
auto workspace_it = cublas_handle_stream_to_workspace().find(key);
if (workspace_it == cublas_handle_stream_to_workspace().end()) {
workspace_it = cublas_handle_stream_to_workspace().insert(workspace_it, {key, getNewWorkspace()});
}
TORCH_CUDABLAS_CHECK(cublasSetWorkspace(handle, workspace_it->second.get(), getChosenWorkspaceSize()));
setWorkspaceForHandle(handle, stream);
#if !defined(USE_ROCM)
// On CUDA >= 11, and architecture >= Ampere, cuBLAS can use TF32 to speedup
// FP32 data type calculations based on the value of the allow_tf32 flag.

View File

@ -411,16 +411,16 @@ std::string CUDAHooks::showConfig() const {
// HIP_VERSION value format was changed after ROCm v4.2 to include the patch number
if(v < 500) {
// If major=xx, minor=yy then format -> xxyy
oss << (v / 100) << '.' << (v % 10);
oss << (v / 100) << "." << (v % 10);
}
else {
// If major=xx, minor=yy & patch=zzzzz then format -> xxyyzzzzz
oss << (v / 10000000) << '.' << (v / 100000 % 100) << '.' << (v % 100000);
oss << (v / 10000000) << "." << (v / 100000 % 100) << "." << (v % 100000);
}
#else
oss << (v / 1000) << '.' << (v / 10 % 100);
oss << (v / 1000) << "." << (v / 10 % 100);
if (v % 10 != 0) {
oss << '.' << (v % 10);
oss << "." << (v % 10);
}
#endif
};
@ -431,16 +431,16 @@ std::string CUDAHooks::showConfig() const {
oss << " - HIP Runtime ";
#endif
printCudaStyleVersion(runtimeVersion);
oss << '\n';
oss << "\n";
// TODO: Make HIPIFY understand CUDART_VERSION macro
#if !defined(USE_ROCM)
if (runtimeVersion != CUDART_VERSION) {
oss << " - Built with CUDA Runtime ";
printCudaStyleVersion(CUDART_VERSION);
oss << '\n';
oss << "\n";
}
oss << " - NVCC architecture flags: " << NVCC_FLAGS_EXTRA << '\n';
oss << " - NVCC architecture flags: " << NVCC_FLAGS_EXTRA << "\n";
#endif
#if !defined(USE_ROCM)
@ -448,9 +448,9 @@ std::string CUDAHooks::showConfig() const {
auto printCudnnStyleVersion = [&](size_t v) {
oss << (v / 1000) << '.' << (v / 100 % 10);
oss << (v / 1000) << "." << (v / 100 % 10);
if (v % 100 != 0) {
oss << '.' << (v % 100);
oss << "." << (v % 100);
}
};
@ -461,22 +461,22 @@ std::string CUDAHooks::showConfig() const {
if (cudnnCudartVersion != CUDART_VERSION) {
oss << " (built against CUDA ";
printCudaStyleVersion(cudnnCudartVersion);
oss << ')';
oss << ")";
}
oss << '\n';
oss << "\n";
if (cudnnVersion != CUDNN_VERSION) {
oss << " - Built with CuDNN ";
printCudnnStyleVersion(CUDNN_VERSION);
oss << '\n';
oss << "\n";
}
#endif
#else
// TODO: Check if miopen has the functions above and unify
oss << " - MIOpen " << MIOPEN_VERSION_MAJOR << '.' << MIOPEN_VERSION_MINOR << '.' << MIOPEN_VERSION_PATCH << '\n';
oss << " - MIOpen " << MIOPEN_VERSION_MAJOR << "." << MIOPEN_VERSION_MINOR << "." << MIOPEN_VERSION_PATCH << "\n";
#endif
#if AT_MAGMA_ENABLED()
oss << " - Magma " << MAGMA_VERSION_MAJOR << '.' << MAGMA_VERSION_MINOR << '.' << MAGMA_VERSION_MICRO << '\n';
oss << " - Magma " << MAGMA_VERSION_MAJOR << "." << MAGMA_VERSION_MINOR << "." << MAGMA_VERSION_MICRO << "\n";
#endif
return oss.str();
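A quick decode of the arithmetic above: the CUDA branch assumes the usual 1000*major + 10*minor (+ patch digit) packing, so a runtime value of 12040 prints as "12.4"; the HIP branch follows its own comments, e.g. 402 prints as "4.2" under the old xxyy layout and 50300000 prints as "5.3.0" under the xxyyzzzzz layout (illustrative values, not verified HIP_VERSION constants).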

View File

@ -42,7 +42,7 @@ static inline void launch_jitted_vectorized_kernel_dynamic(
// The cache key includes all the parameters to generate_code + vec_size + dev_idx
std::stringstream ss;
ss << nInputs << '_' << nOutputs << f;
ss << nInputs << "_" << nOutputs << f;
ss << f_inputs_type_str << compute_type_str << result_type_str;
ss << static_cast<int>(at::cuda::jit::BinaryFuncVariant::NoScalar);
ss << extra_args_types;
@ -144,7 +144,7 @@ static inline void launch_jitted_unrolled_kernel_dynamic(
// The cache key includes all the parameters to generate_code + dev_idx
std::stringstream ss;
ss << nInputs << '_' << nOutputs << f;
ss << nInputs << "_" << nOutputs << f;
ss << f_inputs_type_str << compute_type_str << result_type_str;
ss << contiguous << dynamic_casting;
ss << static_cast<int>(at::cuda::jit::BinaryFuncVariant::NoScalar);

View File

@ -52,10 +52,10 @@ TuningContext* getTuningContext() {
std::ostream& operator<<(std::ostream& stream, const ResultEntry& entry) {
static const bool blaslog = c10::utils::get_env("PYTORCH_TUNABLEOP_BLAS_LOG") == "1";
if (!blaslog) {
return stream << entry.key_ << ',' << entry.time_;
return stream << entry.key_ << "," << entry.time_;
}
else {
return stream << entry.key_ << ',' << entry.time_ << ",BLAS_PARAMS: " << entry.blas_sig_;
return stream << entry.key_ << "," << entry.time_ << ",BLAS_PARAMS: " << entry.blas_sig_;
}
}
@ -156,10 +156,10 @@ void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std
if (isNew) {
static const bool blaslog = c10::utils::get_env("PYTORCH_TUNABLEOP_BLAS_LOG") == "1";
if (!blaslog) {
untuned_file << op_signature << ',' << params_signature << std::endl;
untuned_file << op_signature << "," << params_signature << std::endl;
}
else {
untuned_file << op_signature << ',' << params_signature << ",BLAS_PARAMS: " << blas_signature << std::endl;
untuned_file << op_signature << "," << params_signature << ",BLAS_PARAMS: " << blas_signature << std::endl;
}
TUNABLE_LOG3("Untuned,", op_signature, ",", params_signature);
}
@ -201,7 +201,7 @@ void TuningResultsManager::InitRealtimeAppend(const std::string& filename, const
if(!file_exists || file_empty) {
for(const auto& [key, val] : validators) {
(*realtime_out_) << "Validator," << key << ',' << val << std::endl;
(*realtime_out_) << "Validator," << key << "," << val << std::endl;
realtime_out_->flush();
}
validators_written_ = true;
@ -219,7 +219,7 @@ void TuningResultsManager::AppendResultLine(const std::string& op_sig, const std
return;
}
(*realtime_out_) << op_sig << ',' << param_sig << ',' << result << std::endl;
(*realtime_out_) << op_sig << "," << param_sig << "," << result << std::endl;
realtime_out_->flush(); //ensure immediate write to disk
TUNABLE_LOG3("Realtime append: ", op_sig, "(", param_sig, ") -> ", result);

View File

@ -93,31 +93,31 @@ std::string cudnnTypeToString(cudnnDataType_t dtype) {
return "CUDNN_DATA_UINT8x4";
default:
std::ostringstream oss;
oss << "(unknown data-type " << static_cast<int>(dtype) << ')';
oss << "(unknown data-type " << static_cast<int>(dtype) << ")";
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << '\n';
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << "\n";
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
int strideA[CUDNN_DIM_MAX];
cudnnDataType_t dtype{};
cudnnGetTensorNdDescriptor(d.desc(), CUDNN_DIM_MAX, &dtype, &nbDims, dimA, strideA);
out << " type = " << cudnnTypeToString(dtype) << '\n';
out << " nbDims = " << nbDims << '\n';
out << " type = " << cudnnTypeToString(dtype) << "\n";
out << " nbDims = " << nbDims << "\n";
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << '\n';
out << "\n";
out << " strideA = ";
for (auto i : ArrayRef<int>{strideA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << '\n';
out << "\n";
return out;
}
@ -168,27 +168,27 @@ std::string cudnnMemoryFormatToString(cudnnTensorFormat_t tformat) {
return "CUDNN_TENSOR_NHWC";
default:
std::ostringstream oss;
oss << "(unknown cudnn tensor format " << static_cast<int>(tformat) << ')';
oss << "(unknown cudnn tensor format " << static_cast<int>(tformat) << ")";
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const FilterDescriptor& d) {
out << "FilterDescriptor " << static_cast<void*>(d.desc()) << '\n';
out << "FilterDescriptor " << static_cast<void*>(d.desc()) << "\n";
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
cudnnDataType_t dtype{};
cudnnTensorFormat_t tformat{};
cudnnGetFilterNdDescriptor(d.desc(), CUDNN_DIM_MAX, &dtype, &tformat, &nbDims, dimA);
out << " type = " << cudnnTypeToString(dtype) << '\n';
out << " tensor_format = " << cudnnMemoryFormatToString(tformat) << '\n';
out << " nbDims = " << nbDims << '\n';
out << " type = " << cudnnTypeToString(dtype) << "\n";
out << " tensor_format = " << cudnnMemoryFormatToString(tformat) << "\n";
out << " nbDims = " << nbDims << "\n";
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << '\n';
out << "\n";
return out;
}

View File

@ -346,15 +346,15 @@ void foreachTensorInplaceWithFlag(std::vector<IValue>& args, int64_t begin, int6
}
std::ostream& operator<< (std::ostream& os, const DynamicLayer& layer) {
os << layer.layerId() << ':' << layer.key();
os << layer.layerId() << ":" << layer.key();
return os;
}
std::ostream& operator<< (std::ostream& os, const std::vector<DynamicLayer>& dls) {
os << "DynamicLayerStack[ ";
for (const auto& layer : dls) {
os << layer << ' ';
os << layer << " ";
}
os << ']';
os << "]";
return os;
}

View File

@ -22,7 +22,7 @@ void dumpTensor(std::ostream& ss, const Tensor& tensor) {
if (batched) {
ss << "Batched[lvl=" << batched->level() << " dim=" << batched->bdim() << ", ";
dumpTensor(ss, batched->value());
ss << ']';
ss << "]";
return;
}
ss << "Tensor" << tensor.sizes();
@ -36,7 +36,7 @@ void dumpTensor(std::ostream& ss, const Tensor& tensor) {
ss << "dead, ";
}
dumpTensor(ss, wrapped->value());
ss << ']';
ss << "]";
}
void TensorWrapper::refreshMetadata() {

View File

@ -73,32 +73,32 @@ std::string miopenTypeToString(miopenDataType_t dtype) {
return "miopenBFloat16";
default:
std::ostringstream oss;
oss << "(unknown data-type " << static_cast<int>(dtype) << ')';
oss << "(unknown data-type " << static_cast<int>(dtype) << ")";
return oss.str();
}
}
std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << '\n';
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << "\n";
int nbDims = 0;
int dimA[MIOPEN_DIM_MAX];
int strideA[MIOPEN_DIM_MAX];
miopenDataType_t dtype;
miopenGetTensorDescriptorSize(d.desc(), &nbDims);
miopenGetTensorDescriptor(d.desc(), &dtype, dimA, strideA);
out << " type = " << miopenTypeToString(dtype) << '\n';
out << " nbDims = " << nbDims << '\n';
out << " type = " << miopenTypeToString(dtype) << "\n";
out << " nbDims = " << nbDims << "\n";
// Read out only nbDims of the arrays!
out << " dimA = ";
for (auto i : ArrayRef<int>{dimA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << '\n';
out << "\n";
out << " strideA = ";
for (auto i : ArrayRef<int>{strideA, static_cast<size_t>(nbDims)}) {
out << i << ", ";
}
out << '\n';
out << "\n";
return out;
}

View File

@ -91,7 +91,7 @@ struct OperationInfo : BaseInfo {
std::stringstream kernelStr;
kernelStr << kernelName;
for (const Tensor& tensor : tensors) {
kernelStr << ':' << BaseInfo::buildTensorString(tensor, includeBufferId);
kernelStr << ":" << BaseInfo::buildTensorString(tensor, includeBufferId);
}
return kernelStr.str();
}

View File

@ -39,9 +39,9 @@ std::string BaseInfo::buildTensorString(const Tensor& tensor, bool includeBuffer
// see comments for INCLUDE_BUFFER_ID
if (includeBufferId && deviceType == at::kMPS) {
id<MTLBuffer> buffer = __builtin_bit_cast(id<MTLBuffer>, tensor.storage().data());
tensorStr << "(buf#" << (getIMPSAllocator()->getBufferId(buffer)) << ':' << buffer.retainCount << ')';
tensorStr << "(buf#" << (getIMPSAllocator()->getBufferId(buffer)) << ":" << buffer.retainCount << ")";
}
tensorStr << ':' << tensor.scalar_type() << tensor.sizes();
tensorStr << ":" << tensor.scalar_type() << tensor.sizes();
return tensorStr.str();
} else {
return "undefined";

View File

@ -167,7 +167,7 @@ static void check_args(CheckedFrom c, IntArrayRef args, size_t expected_size, co
std::stringstream ss;
ss << arg_name << " should be greater than zero but got (";
std::copy(args.begin(), args.end() - 1, std::ostream_iterator<int>(ss,", "));
ss << args.back() << ")" << " (while checking arguments for " << c << ')';
ss << args.back() << ")" << " (while checking arguments for " << c << ")";
TORCH_CHECK(false, ss.str());
}
}

View File

@ -639,7 +639,7 @@ static std::ostream& operator<<(std::ostream & out, const ConvParams<T>& params)
<< " deterministic = " << params.deterministic
<< " cudnn_enabled = " << params.cudnn_enabled
<< " allow_tf32 = " << params.allow_tf32
<< '}';
<< "}";
return out;
}

View File

@ -847,7 +847,7 @@ Tensor stft(const Tensor& self, const int64_t n_fft, const std::optional<int64_t
<< ", hop_length=" << hop_length << ", win_length=" << win_length \
<< ", window="; \
if (window.defined()) { \
SS << window.toString() << '{' << window.sizes() << '}'; \
SS << window.toString() << "{" << window.sizes() << "}"; \
} else { \
SS << "None"; \
} \
@ -1046,7 +1046,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const std::optional<int64_
<< ", hop_length=" << hop_length << ", win_length=" << win_length \
<< ", window="; \
if (window.defined()) { \
SS << window.toString() << '{' << window.sizes() << '}'; \
SS << window.toString() << "{" << window.sizes() << "}"; \
} else { \
SS << "None"; \
} \

View File

@ -523,7 +523,7 @@ Tensor _functional_assert_async_msg_cpu(
}
void _print(std::string_view s) {
std::cout << s << '\n';
std::cout << s << "\n";
}
// Sorting-based algorithm for isin(); used when the number of test elements is

View File

@ -296,7 +296,7 @@ template <typename scalar_t, typename res_scalar_t = scalar_t>
bool launchGemmAndBiasCublasLt(
// args contains result which is modified
cublasCommonArgs& args,
const Tensor& self,
const std::optional<Tensor>& self,
const Scalar& alpha,
Activation activation = Activation::None
) {
@ -304,12 +304,8 @@ bool launchGemmAndBiasCublasLt(
// or when it can be squeezed to 1D.
// self_ptr == nullptr implies ignore bias epilogue
// and use standard gemm-like API.
const auto* self_ptr = [&]() -> auto {
if (self.dim() == 1 || self.squeeze().dim() == 1) {
return self.const_data_ptr<scalar_t>();
}
return static_cast<const scalar_t*>(nullptr);
}();
const auto* self_ptr = self.has_value() ? self.value().const_data_ptr<scalar_t>() : static_cast<const scalar_t*>(nullptr);
const auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
@ -392,35 +388,30 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
bool disable_addmm_cuda_lt = persistent_disable_addmm_cuda_lt || disable_addmm_cuda_lt_override;
#ifdef USE_ROCM
// Conditioned on the device index, which is not persistent
disable_addmm_cuda_lt = isGloballyDisabledAddmmCudaLt(self.device()) || disable_addmm_cuda_lt;
disable_addmm_cuda_lt = disable_addmm_cuda_lt || isGloballyDisabledAddmmCudaLt(self.device());
#endif
// Condition on the input
disable_addmm_cuda_lt = !isInputCompliesAddmmCudaLt(result, self, mat1, mat2, beta, alpha, activation) || disable_addmm_cuda_lt;
// }
disable_addmm_cuda_lt = disable_addmm_cuda_lt || !isInputCompliesAddmmCudaLt(result, self, mat1, mat2, beta, alpha, activation);
at::ScalarType scalar_type = mat1.scalar_type();
bool is_float_output_with_half_input = (scalar_type == at::ScalarType::Half || scalar_type == at::ScalarType::BFloat16) && result.scalar_type() == at::ScalarType::Float;
#ifdef USE_ROCM
disable_addmm_cuda_lt = disable_addmm_cuda_lt || is_float_output_with_half_input;
#endif
bool use_bias_ptr_lt = (self.dim() == 1) && !disable_addmm_cuda_lt;
// for float output with half input cublasLT with bias produces wrong results
use_bias_ptr_lt &= !is_float_output_with_half_input;
// Handle result/self shapes
if (!result.is_same(self)) {
at::native::resize_output(result, {mat1.sizes()[0], mat2.sizes()[1]});
// We use bias ptr in the Lt path only when bias is 1D
const auto use_bias_ptr_lt = (self.dim() == 1) && !disable_addmm_cuda_lt;
const auto self_maybe_expanded = [&]() -> c10::MaybeOwned<Tensor> {
if (!use_bias_ptr_lt) {
// We do expand self even before
// check for beta != 0.0 to make sure that
// test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_errors_*
// runs green.
return expand_size(self, result.sizes(), "addmm");
}
return c10::MaybeOwned<Tensor>::borrowed(self);
}();
// We do not copy bias only when we need the bias ptr
if (beta.toComplexDouble() != 0.0 && !use_bias_ptr_lt) {
// NOTE: self should broadcast over result
at::native::copy_(result, *self_maybe_expanded);
at::native::copy_(result, *expand_size(self, result.sizes(), "addmm"));
}
}
@ -468,7 +459,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
scalar_type,
"addmm_cuda_lt",
[&] {
lt_success = launchGemmAndBiasCublasLt<scalar_t, float>(args, self, alpha, activation);
lt_success = launchGemmAndBiasCublasLt<scalar_t, float>(args, use_bias_ptr_lt ? std::make_optional(self) : std::nullopt, alpha, activation);
}
);
#endif
@ -480,7 +471,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
scalar_type,
"addmm_cuda_lt",
[&] {
lt_success = launchGemmAndBiasCublasLt<scalar_t>(args, self, alpha, activation);
lt_success = launchGemmAndBiasCublasLt<scalar_t>(args, use_bias_ptr_lt ? std::make_optional(self) : std::nullopt, alpha, activation);
}
);
} // end is_float_output_with_half_input
@ -936,7 +927,7 @@ Tensor _int_mm_cuda(const Tensor& self, const Tensor& mat2) {
return _int_mm_out_cuda(self, mat2, result);
}
static void baddbmm_bmm_out_dtype_checks(const Tensor& batch1, const Tensor& batch2, const Scalar& beta, const Scalar& alpha, const at::ScalarType out_dtype, bool is_bmm, const std::optional<Tensor>& self_baddbmm = std::nullopt) {
static void baddbmm_bmm_out_dtype_checks(const Tensor& batch1, const Tensor& batch2, const Scalar& beta, const Scalar& alpha, const at::ScalarType out_dtype, const std::optional<Tensor>& self_baddbmm = std::nullopt) {
// ref ATen/native/LinearAlgebra.cpp common_checks_baddbmm_bmm
TORCH_CHECK(batch1.dim() == 3, "batch1 must be a 3D tensor");
TORCH_CHECK(batch2.dim() == 3, "batch2 must be a 3D tensor");
@ -960,7 +951,7 @@ static void baddbmm_bmm_out_dtype_checks(const Tensor& batch1, const Tensor& bat
(out_dtype == at::ScalarType::Float && (batch1.scalar_type() == at::ScalarType::Half || batch1.scalar_type() == at::ScalarType::BFloat16)),
"out_dtype must be the same as input dtype or fp32 for fp16/bf16 inputs");
if (!is_bmm && self_baddbmm.has_value()) {
if (self_baddbmm.has_value()) {
const auto& self = self_baddbmm.value();
TORCH_CHECK(self.dim() == 3, "self must be a 3D tensor");
TORCH_CHECK(self.sizes() == output_size, "self must have the same shape as the output");
@ -968,15 +959,12 @@ static void baddbmm_bmm_out_dtype_checks(const Tensor& batch1, const Tensor& bat
}
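A quick shape example for the check above: with batch1 of shape (B, N, K) and batch2 of shape (B, K, M), the expected output — and any provided self for the baddbmm path — has shape (B, N, M), i.e. {batch1.size(0), batch1.size(1), batch2.size(2)}.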
Tensor _bmm_dtype_cuda(const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype) {
IntArrayRef batch1_sizes = batch1.sizes();
IntArrayRef batch2_sizes = batch2.sizes();
Tensor out = at::empty({batch1_sizes[0], batch1_sizes[1], batch2_sizes[2]}, batch1.options().dtype(out_dtype));
Tensor out = at::empty({batch1.size(0), batch1.size(1), batch2.size(2)}, batch1.options().dtype(out_dtype));
return _bmm_out_dtype_cuda(batch1, batch2, out_dtype, out);
}
Tensor& _bmm_out_dtype_cuda(const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype, Tensor &out) {
baddbmm_bmm_out_dtype_checks(batch1, batch2, 0.0, 1.0, out_dtype, true);
baddbmm_bmm_out_dtype_checks(batch1, batch2, 0.0, 1.0, out_dtype);
Scalar beta(0.0);
Scalar alpha(1.0);
{
@ -988,14 +976,16 @@ Tensor& _bmm_out_dtype_cuda(const Tensor& batch1, const Tensor& batch2, const at
}
Tensor _baddbmm_dtype_cuda(const Tensor& self, const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype, const Scalar& beta, const Scalar& alpha) {
// We need to copy the tensor
Tensor out = self.clone().to(self.options().dtype(out_dtype));
return _baddbmm_out_dtype_cuda(out, batch1, batch2, out_dtype, beta, alpha, out);
TORCH_CHECK(self.scalar_type() == out_dtype || self.scalar_type() == batch1.dtype(),
"self dtype must match either out_dtype or batch1 dtype");
Tensor out = at::empty({batch1.size(0), batch1.size(1), batch2.size(2)}, batch1.options().dtype(out_dtype));
return _baddbmm_out_dtype_cuda(self, batch1, batch2, out_dtype, beta, alpha, out);
}
Tensor& _baddbmm_out_dtype_cuda(const Tensor& self, const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype, const Scalar& beta, const Scalar& alpha, Tensor &out) {
baddbmm_bmm_out_dtype_checks(batch1, batch2, beta, alpha, out_dtype, false, self);
baddbmm_bmm_out_dtype_checks(batch1, batch2, beta, alpha, out_dtype, out);
// We need to copy the tensor
out.copy_(self);
{
NoNamesGuard guard;
baddbmm_out_cuda_impl(out, out, batch1, batch2, beta, alpha);
@ -1030,24 +1020,27 @@ Tensor& _mm_dtype_out_cuda(const Tensor& self, const Tensor& mat2, const at::Sca
}
Tensor _addmm_dtype_cuda(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const at::ScalarType out_dtype, const Scalar& beta, const Scalar& alpha) {
Tensor result = at::empty(self.sizes(), self.options().dtype(out_dtype));
TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor");
TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor");
Tensor result = at::empty({mat1.size(0), mat2.size(1)}, self.options().dtype(out_dtype));
return _addmm_dtype_out_cuda(self, mat1, mat2, out_dtype, beta, alpha, result);
}
Tensor& _addmm_dtype_out_cuda(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const at::ScalarType out_dtype, const Scalar& beta, const Scalar& alpha, Tensor &out) {
TORCH_CHECK(self.scalar_type() == mat2.scalar_type(), "self and mat2 must have the same dtype, but got ", self.scalar_type(), " and ", mat2.scalar_type());
TORCH_CHECK(mat1.scalar_type() == mat2.scalar_type(), "mat1 and mat2 must have the same dtype, but got ", mat1.scalar_type(), " and ", mat2.scalar_type());
// repeat dimensionality checks for direct calls to `out` overload
TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor");
TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor");
TORCH_CHECK(
mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (",
mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")");
TORCH_CHECK(mat1.scalar_type() == mat2.scalar_type(), "mat1 and mat2 must have the same dtype, but got ", mat1.scalar_type(), " and ", mat2.scalar_type());
TORCH_CHECK(out_dtype == mat1.scalar_type() ||
(out_dtype == at::ScalarType::Float && (mat1.scalar_type() == at::ScalarType::Half || mat1.scalar_type() == at::ScalarType::BFloat16)),
"out_dtype must be the same as input dtype or fp32 for fp16/bf16 inputs");
TORCH_CHECK(out_dtype == out.scalar_type(), "out_dtype must be the same as the dtype of the provided out tensor");
TORCH_CHECK(out_dtype == self.scalar_type() ||
(out_dtype == at::ScalarType::Float && (self.scalar_type() == at::ScalarType::Half || self.scalar_type() == at::ScalarType::BFloat16)),
"out_dtype must be the same as input dtype or fp32 for fp16/bf16 inputs");
TORCH_CHECK(out_dtype == out.scalar_type(), "out_dtype must be the same as the dtype of the provided out tensor");
TORCH_CHECK(out_dtype == self.scalar_type() || self.scalar_type() == mat1.scalar_type(),
"self dtype must match either out_dtype or mat1 dtype");
addmm_out_cuda_impl(out, self, mat1, mat2, beta, alpha);
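A compact restatement of the dtype rule these checks enforce (a sketch that only mirrors the TORCH_CHECKs above; the enum and helpers are illustrative, not a PyTorch API): out_dtype must equal the input dtype, or be float32 when the inputs are float16/bfloat16; the provided out tensor must already carry out_dtype; and self must match either out_dtype or the input dtype.
enum class DType { Half, BFloat16, Float, Double };  // illustrative subset
bool out_dtype_allowed(DType input, DType out) {
  // same dtype always allowed; fp32 accumulation allowed for fp16/bf16 inputs
  return out == input ||
         (out == DType::Float && (input == DType::Half || input == DType::BFloat16));
}
bool self_dtype_allowed(DType self, DType input, DType out) {
  return self == out || self == input;
}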

View File

@ -5,69 +5,11 @@
#include <cuda_bf16.h>
#endif
// ROCm 6.3 is planned to have these functions, but until then here they are.
#if defined(USE_ROCM)
#include <device_functions.h>
#include <hip/hip_fp16.h>
#include <hip/hip_bf16.h>
__device__ inline __hip_bfloat162 preview_unsafeAtomicAdd(__hip_bfloat162* address, __hip_bfloat162 value) {
#if (defined(__gfx942__)) && \
__has_builtin(__builtin_amdgcn_flat_atomic_fadd_v2bf16)
typedef unsigned short __attribute__((ext_vector_type(2))) vec_short2;
static_assert(sizeof(vec_short2) == sizeof(__hip_bfloat162_raw));
union {
__hip_bfloat162_raw bf162_raw;
vec_short2 vs2;
} u{static_cast<__hip_bfloat162_raw>(value)};
u.vs2 = __builtin_amdgcn_flat_atomic_fadd_v2bf16((vec_short2*)address, u.vs2);
return static_cast<__hip_bfloat162>(u.bf162_raw);
#else
static_assert(sizeof(unsigned int) == sizeof(__hip_bfloat162_raw));
union u_hold {
__hip_bfloat162_raw h2r;
unsigned int u32;
};
u_hold old_val, new_val;
old_val.u32 = __hip_atomic_load((unsigned int*)address, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT);
do {
new_val.h2r = __hadd2(old_val.h2r, value);
} while (!__hip_atomic_compare_exchange_strong(
(unsigned int*)address, &old_val.u32, new_val.u32,
__ATOMIC_RELAXED, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT));
return old_val.h2r;
#endif
}
__device__ inline __half2 preview_unsafeAtomicAdd(__half2* address, __half2 value) {
#if (defined(__gfx942__)) && \
__has_builtin(__builtin_amdgcn_flat_atomic_fadd_v2f16)
// The api expects an ext_vector_type of half
typedef _Float16 __attribute__((ext_vector_type(2))) vec_fp162;
static_assert(sizeof(vec_fp162) == sizeof(__half2_raw));
union {
__half2_raw h2r;
vec_fp162 fp16;
} u {static_cast<__half2_raw>(value)};
u.fp16 = __builtin_amdgcn_flat_atomic_fadd_v2f16((vec_fp162*)address, u.fp16);
return static_cast<__half2>(u.h2r);
#else
static_assert(sizeof(__half2_raw) == sizeof(unsigned int));
union u_hold {
__half2_raw h2r;
unsigned int u32;
};
u_hold old_val, new_val;
old_val.u32 = __hip_atomic_load((unsigned int*)address, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT);
do {
new_val.h2r = __hadd2(old_val.h2r, value);
} while (!__hip_atomic_compare_exchange_strong(
(unsigned int*)address, &old_val.u32, new_val.u32,
__ATOMIC_RELAXED, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT));
return old_val.h2r;
#endif
}
#define ATOMICADD preview_unsafeAtomicAdd
#define ATOMICADD unsafeAtomicAdd
#define NATIVE_ZERO_BF16 __float2bfloat16(0.0f)
#else
#define ATOMICADD atomicAdd
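The fallback being deleted above is the standard compare-and-swap retry loop for types without a native atomic add. A host-side C++ analogue of that pattern (illustration only, not the HIP device code; requires C++20 for std::bit_cast):
#include <atomic>
#include <bit>
#include <cstdint>
float atomic_add_float(std::atomic<uint32_t>& cell, float value) {
  uint32_t old_bits = cell.load(std::memory_order_relaxed);
  uint32_t new_bits;
  do {
    float updated = std::bit_cast<float>(old_bits) + value;  // add on a local copy
    new_bits = std::bit_cast<uint32_t>(updated);
  } while (!cell.compare_exchange_weak(old_bits, new_bits,
                                       std::memory_order_relaxed));
  return std::bit_cast<float>(old_bits);  // previous value, matching atomicAdd semantics
}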

View File

@ -2,18 +2,250 @@
#include <ATen/Dispatch.h>
#include <ATen/native/DispatchStub.h>
#include <ATen/native/cuda/Loops.cuh>
#include <ATen/native/cuda/JitLoops.cuh>
#include <ATen/native/cuda/jit_utils.h>
#include <ATen/native/cuda/ScanUtils.cuh>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/BinaryOps.h>
#include <ATen/OpMathType.h>
#include <c10/util/MathConstants.h>
#include <c10/util/complex.h>
#include <cmath>
#include <limits>
// NOTE: CUDA on Windows requires that the enclosing function
// of a __device__ lambda not have internal linkage.
namespace at::native {
// custom min and max to be used in logaddexp for complex arguments
template <typename scalar_t, bool min>
__host__ __device__ c10::complex<scalar_t> _logaddexp_minmax(const c10::complex<scalar_t>& x, const c10::complex<scalar_t>& y) {
scalar_t xr = std::real(x);
scalar_t yr = std::real(y);
if (::isnan(yr) || (::isnan(std::imag(y)))) {
return y;
} else if (::isnan(xr) || (::isnan(std::imag(x)))) {
return x;
} else if (min) { // min
return (xr < yr) ? x : y;
} else { // max
return (xr >= yr) ? x : y;
}
}
template <typename scalar_t>
__host__ __device__ scalar_t _log_add_exp_helper(const scalar_t& x, const scalar_t& y) {
// Reference : https://www.tensorflow.org/api_docs/python/tf/math/cumulative_logsumexp
// Using the original expression: `at::_isnan(y) ? y : std::min(x, y)` causes an error in ROCM
const auto isnan_x = at::_isnan(x);
const auto isnan_y = at::_isnan(y);
scalar_t min = isnan_y ? y : (isnan_x ? x : std::min(x, y));
scalar_t max = isnan_y ? y : (isnan_x ? x : std::max(x, y));
if (min != max || ::isfinite(min)) {
// nan will be propagated here
return ::log1p(std::exp(min - max)) + max;
} else {
// special case to correctly handle infinite cases
return x;
}
}
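The helper above is the usual numerically stable form of logaddexp(x, y) = log(e^x + e^y) = max(x, y) + log1p(exp(min(x, y) - max(x, y))). For example, logaddexp(1000, 1000) evaluates to 1000 + log(2) ≈ 1000.6931, whereas computing log(e^1000 + e^1000) directly would overflow; the `min != max || isfinite(min)` guard handles the equal-infinity cases (both +inf or both -inf) without evaluating exp of an indeterminate difference.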
template <typename scalar_t>
__host__ __device__ c10::complex<scalar_t> _fast_build_exp(const c10::complex<scalar_t>& x) {
// complex exponential function, but implemented manually to get fast compilation time
// this function only handles the case where the x is finite (not inf nor nan)
const auto xreal = std::real(x);
const auto ximag = std::imag(x);
const auto exp_x_abs = std::exp(xreal);
auto exp_x_real = exp_x_abs * std::cos(ximag);
auto exp_x_imag = exp_x_abs * std::sin(ximag);
return {exp_x_real, exp_x_imag};
}
template <typename scalar_t>
__host__ __device__ c10::complex<scalar_t> _fast_build_exp_inf(const c10::complex<scalar_t>& x) {
// complex exponential function, but implemented manually to get fast compilation time
// this function only handles the case where the real part of x is infinite
const auto ximag = std::imag(x);
constexpr auto exp_x_abs = std::numeric_limits<scalar_t>::infinity();
if (!::isfinite(ximag)) { // add this to make consistent with std::exp(x+yi)

return {exp_x_abs, std::numeric_limits<scalar_t>::quiet_NaN()};
}
const auto sin = std::sin(ximag);
const auto cos = std::cos(ximag);
// special case if the angle is exactly the multiple of pi/2
auto exp_x_real = (cos == 0) ? (scalar_t)0.0 : exp_x_abs * cos;
auto exp_x_imag = (sin == 0) ? (scalar_t)0.0 : exp_x_abs * sin;
return {exp_x_real, exp_x_imag};
}
template <typename scalar_t>
__host__ __device__ c10::complex<scalar_t> _log_add_exp_helper(const c10::complex<scalar_t>& x, const c10::complex<scalar_t>& y) {
c10::complex<scalar_t> min = _logaddexp_minmax<scalar_t, /*min=*/true>(x, y);
c10::complex<scalar_t> max = _logaddexp_minmax<scalar_t, /*min=*/false>(x, y);
scalar_t min_real = std::real(min);
scalar_t max_real = std::real(max);
if (::isnan(min_real) || ::isnan(std::imag(min))) {
// handling the "infectious" NaNs
return {std::numeric_limits<scalar_t>::quiet_NaN(), std::numeric_limits<scalar_t>::quiet_NaN()};
}
else if ((!::isfinite(min_real)) && (min_real == max_real)) {
if (min_real < 0) {
// handle the -inf case, the imaginary part here does not really matter as the exp(value)
// will be around 0.0 and the angle (i.e. the imaginary part) cannot be determined.
// It does not matter if we're taking the exp of this value
return min;
} else {
// handle the +inf case, we don't need the special precision for log1p for small values
// and to avoid producing nan in case of real(max) == real(min) == +inf
const auto exp_min = _fast_build_exp_inf(min);
const auto exp_max = _fast_build_exp_inf(max);
return ::log1p(exp_min + exp_max - 1); // log1p(x - 1) builds faster than log
}
} else {
const auto minmax = min - max;
c10::complex<scalar_t> exp_minmax;
if (!::isfinite(minmax.real())) {
exp_minmax = minmax.real() < 0 ? c10::complex<scalar_t>{0.0, 0.0} : _fast_build_exp_inf(minmax);
} else {
exp_minmax = _fast_build_exp(minmax);
}
return ::log1p(exp_minmax) + max;
}
}
// Complex logaddexp jiterator string
const auto logaddexp_complex_string = jiterator_stringify(
template<typename T>
std::complex<T> log1p(const std::complex<T>& z)
{
using complex_t = std::complex<T>;
T x = z.real();
T y = z.imag();
T zabs = abs(z);
T theta = atan2(y, x + T(1));
if (zabs < 0.5) {
T r = x * (T(2) + x) + y * y;
if (r == 0) { // handle underflow
return complex_t(x, theta);
}
return complex_t(T(0.5) * std::log1p(r), theta);
} else {
T z0 = std::hypot(x + 1, y);
return complex_t(log(z0), theta);
}
}
// separated _logaddexp_minmax into 2 different functions for jiterator_string
template <typename T>
std::complex<T> logaddexp_min(const std::complex<T>& x, const std::complex<T>& y) {
T xr = x.real();
T yr = y.real();
if (isnan(yr) || isnan(y.imag())) {
return y;
} else if (isnan(xr) || isnan(x.imag())) {
return x;
} else {
return (xr < yr) ? x : y;
}
}
template <typename T>
std::complex<T> logaddexp_max(const std::complex<T>& x, const std::complex<T>& y) {
T xr = x.real();
T yr = y.real();
if (isnan(yr) || isnan(y.imag())) {
return y;
} else if (isnan(xr) || isnan(x.imag())) {
return x;
} else {
return (xr >= yr) ? x : y;
}
}
template <typename T>
std::complex<T> fast_build_exp(const std::complex<T>& x) {
const auto xreal = x.real();
const auto ximag = x.imag();
const auto exp_x_abs = exp(xreal);
auto exp_x_real = exp_x_abs * cos(ximag);
auto exp_x_imag = exp_x_abs * sin(ximag);
return std::complex<T>(exp_x_real, exp_x_imag);
}
template <typename T>
std::complex<T> fast_build_exp_inf(const std::complex<T>& x) {
using complex_t = std::complex<T>;
const auto ximag = x.imag();
const T exp_x_abs = INFINITY;
if (!isfinite(ximag)) {
return complex_t(exp_x_abs, NAN);
}
const auto sin_val = sin(ximag);
const auto cos_val = cos(ximag);
auto exp_x_real = (cos_val == T(0)) ? T(0) : exp_x_abs * cos_val;
auto exp_x_imag = (sin_val == T(0)) ? T(0) : exp_x_abs * sin_val;
return complex_t(exp_x_real, exp_x_imag);
}
template <typename complex_t>
complex_t logaddexp_complex(complex_t x, complex_t y) {
using T = typename complex_t::value_type;
complex_t min_val = logaddexp_min(x, y);
complex_t max_val = logaddexp_max(x, y);
T min_real = min_val.real();
T max_real = max_val.real();
if (isnan(min_real) || isnan(min_val.imag())) {
return complex_t(NAN, NAN);
}
else if ((!isfinite(min_real)) && (min_real == max_real)) {
if (min_real < T(0)) {
return min_val;
} else {
const auto exp_min = fast_build_exp_inf<T>(min_val);
const auto exp_max = fast_build_exp_inf<T>(max_val);
return log1p(exp_min + exp_max - complex_t(1, 0));
}
} else {
const auto minmax = min_val - max_val;
complex_t exp_minmax;
if (!isfinite(minmax.real())) {
exp_minmax = (minmax.real() < T(0)) ? complex_t(0, 0) : fast_build_exp_inf<T>(minmax);
} else {
exp_minmax = fast_build_exp<T>(minmax);
}
return log1p(exp_minmax) + max_val;
}
}
);
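Reader's note on the jiterator string above: for complex z = x + iy, log1p uses the identity log(1 + z) = 0.5*log1p(x*(2 + x) + y*y) + i*atan2(y, 1 + x) near the origin (|z| < 0.5), which follows from |1 + z|^2 = (1 + x)^2 + y^2 = 1 + x*(2 + x) + y^2, and falls back to log(hypot(1 + x, y)) for the real part otherwise. Reimplementing exp/log1p inside the string mirrors the "fast compilation time" rationale stated in the host-side helpers above.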
constexpr char logaddexp_complex_name[] = "logaddexp_complex";
void logaddexp_kernel_cuda(TensorIteratorBase& iter) {
AT_DISPATCH_FLOATING_TYPES_AND2(
if (at::isComplexType(iter.dtype())) {
#if AT_USE_JITERATOR()
AT_DISPATCH_COMPLEX_TYPES_AND(at::ScalarType::ComplexHalf, iter.dtype(), "logaddexp_cuda", [&]() {
jitted_gpu_kernel<
/*name=*/logaddexp_complex_name,
/*return_dtype=*/scalar_t,
/*common_dtype=*/scalar_t,
/*arity=*/2>(iter, logaddexp_complex_string);
});
#else
AT_DISPATCH_COMPLEX_TYPES_AND(at::ScalarType::ComplexHalf, iter.dtype(), "logaddexp_cuda", [&]() {
using opmath_t = at::opmath_type<scalar_t>;
gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a_, scalar_t b_) -> scalar_t {
const auto a = static_cast<opmath_t>(a_);
const auto b = static_cast<opmath_t>(b_);
return static_cast<scalar_t>(_log_add_exp_helper(a, b));
});
});
#endif
} else {
AT_DISPATCH_FLOATING_TYPES_AND2(
ScalarType::BFloat16, ScalarType::Half,
iter.dtype(), "logaddexp_cuda",
[&]() {
@ -29,6 +261,7 @@ void logaddexp_kernel_cuda(TensorIteratorBase& iter) {
}
});
});
}
}
void logaddexp2_kernel_cuda(TensorIteratorBase& iter) {

View File

@ -11,7 +11,7 @@ static inline std::ostream& operator<<(std::ostream& out, dim3 dim) {
if (dim.y == 1 && dim.z == 1) {
out << dim.x;
} else {
out << '[' << dim.x << ',' << dim.y << ',' << dim.z << ']';
out << "[" << dim.x << "," << dim.y << "," << dim.z << "]";
}
return out;
}
@ -27,7 +27,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "input_mult=[";
for (int i = 0; i < 3; i++) {
if (i != 0) {
out << ',';
out << ",";
}
out << config.input_mult[i];
}
@ -35,7 +35,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "output_mult=[";
for (int i = 0; i < 2; i++) {
if (i != 0) {
out << ',';
out << ",";
}
out << config.output_mult[i];
}
@ -49,7 +49,7 @@ std::ostream& operator<<(std::ostream& out, const ReduceConfig& config) {
out << "block=" << config.block() << ", ";
out << "grid=" << config.grid() << ", ";
out << "global_memory_size=" << config.global_memory_size();
out << ')';
out << ")";
return out;
}

View File

@ -1101,6 +1101,19 @@ _scaled_mxfp8_mxfp8(
return _scaled_gemm(mat_a, mat_b, scale_a, scale_b, scaling_choice_a, scaling_choice_b, bias, false /* use_fast_accum */, out);
}
void
_check_mxfp4_support() {
#ifndef USE_ROCM
auto dprops = at::cuda::getCurrentDeviceProperties();
// Only on B200 GPUs
TORCH_CHECK_NOT_IMPLEMENTED(
// B200 = 10.0, B300 = 10.3
dprops->major == 10,
"MXFP4 scaling only supported in CUDA for B200/B300"
);
#endif
}
Tensor&
_scaled_mxfp4_mxfp4(
@ -1113,6 +1126,7 @@ _scaled_mxfp4_mxfp4(
#if defined(_WIN32) || (!defined(USE_ROCM) && !defined(USE_FBGEMM_GENAI))
TORCH_CHECK_NOT_IMPLEMENTED(false, "MXFP4 scaling supported on ROCM and CUDA+FBGEMM_GENAI only");
#else
_check_mxfp4_support();
// Restrictions:
// A, B are FP4, scales are e8m0, A: shape K//32, B: K, N//32
TORCH_CHECK_VALUE(mat_a.scalar_type() == at::kFloat4_e2m1fn_x2 && mat_b.scalar_type() == at::kFloat4_e2m1fn_x2, "mat_a and mat_b must be fp4 types, got: ",

View File

@ -364,9 +364,9 @@ void f8f8bf16_grouped_gemm_impl_sm90(
// reinterpret_cast<ProblemShape::UnderlyingProblemShape*>(
// stride_output_h + group_count);
// std::cout << "PTRS " << mat_a.data_ptr() << ' ' << mat_b.data_ptr() << "
// std::cout << "PTRS " << mat_a.data_ptr() << " " << mat_b.data_ptr() << "
// "
// << out.data_ptr() << ' ' << scale_a.data_ptr() << ' '
// << out.data_ptr() << " " << scale_a.data_ptr() << " "
// << scale_b.data_ptr() << "\n";
// for (int i = 0; i < group_count; i++) {
// std::cout << "A " << (void*)inputA_ptrs_h[i] << "\n";

View File

@ -1057,14 +1057,14 @@ std::string generate_code(
// TODO these arrays are potentially of the different types, use function
// traits to determine the types
declare_load_arrays << f_inputs_type << " arg" << std::to_string(i)
<< '[' << std::to_string(thread_work_size) << "];\n";
<< "[" << std::to_string(thread_work_size) << "];\n";
}
env.s("declare_load_arrays", declare_load_arrays.str());
std::stringstream declare_store_arrays;
for (int i = 0; i < nOutputs; i++) {
declare_store_arrays << result_type << " out" << std::to_string(i)
<< '[' << std::to_string(thread_work_size) << "];\n";
<< "[" << std::to_string(thread_work_size) << "];\n";
}
env.s("declare_store_arrays", declare_store_arrays.str());
@ -1217,7 +1217,7 @@ std::string generate_code(
for (const auto i : c10::irange(nInputs)){
auto i_string = std::to_string(i);
vector_inputs << "auto * input" << i_string <<
" = reinterpret_cast<const scalar_t*>(data[" << i_string << '+' << nOutputs << "])" <<
" = reinterpret_cast<const scalar_t*>(data[" << i_string << "+" << nOutputs << "])" <<
" + block_work_size * idx;\n";
}
env.s("vector_inputs", vector_inputs.str());
@ -1543,17 +1543,17 @@ NvrtcFunction jit_pwise_function(
// Constructs file path by appending constructed cubin name to cache path
std::stringstream ss;
ss << *cache_dir << '/';
ss << *cache_dir << "/";
ss << kernel_name;
#ifdef USE_ROCM
ss << "_arch" << prop->gcnArchName;
#else
ss << "_arch" << cuda_major << '.' << cuda_minor;
ss << "_arch" << cuda_major << "." << cuda_minor;
#endif
ss << "_nvrtc" << nvrtc_major << '.' << nvrtc_minor;
ss << "_nvrtc" << nvrtc_major << "." << nvrtc_minor;
ss << (compile_to_sass ? "_sass" : "_ptx");
ss << '_' << code.length();
ss << '_' << hash_code;
ss << "_" << code.length();
ss << "_" << hash_code;
file_path = ss.str();
std::ifstream readin{file_path, std::ios::in | std::ifstream::binary};

View File

@ -82,15 +82,15 @@ namespace native {
std::ostream& operator<<(std::ostream& out, const ConvolutionParams& params) {
out << "ConvolutionParams \n"
<< " memory_format = " << params.memory_format << '\n'
<< " data_type = " << cudnnTypeToString(params.dataType) << '\n'
<< " padding = " << ArrayRef<int>{params.padding} << '\n'
<< " stride = " << ArrayRef<int>{params.stride} << '\n'
<< " dilation = " << ArrayRef<int>{params.dilation} << '\n'
<< " groups = " << params.groups << '\n'
<< " memory_format = " << params.memory_format << "\n"
<< " data_type = " << cudnnTypeToString(params.dataType) << "\n"
<< " padding = " << ArrayRef<int>{params.padding} << "\n"
<< " stride = " << ArrayRef<int>{params.stride} << "\n"
<< " dilation = " << ArrayRef<int>{params.dilation} << "\n"
<< " groups = " << params.groups << "\n"
<< " deterministic = " << (params.deterministic ? "true" : "false")
<< '\n'
<< " allow_tf32 = " << (params.allow_tf32 ? "true" : "false") << '\n';
<< "\n"
<< " allow_tf32 = " << (params.allow_tf32 ? "true" : "false") << "\n";
return out;
}
@ -173,16 +173,16 @@ std::string repro_from_args(const ConvolutionParams& params) {
at::globalContext().float32Precision(
at::Float32Backend::CUDA, at::Float32Op::MATMUL) ==
at::Float32Precision::TF32)
<< '\n';
<< "\n";
ss << "torch.backends.cudnn.benchmark = "
<< pybool(at::globalContext().benchmarkCuDNN()) << '\n';
<< pybool(at::globalContext().benchmarkCuDNN()) << "\n";
ss << "torch.backends.cudnn.deterministic = " << pybool(params.deterministic)
<< '\n';
<< "\n";
ss << "torch.backends.cudnn.allow_tf32 = " << pybool(params.allow_tf32)
<< '\n';
<< "\n";
ss << "data = torch.randn(" << ArrayRef<int>(params.input_size, dim)
<< ", dtype=" << full_dtype << ", ";
ss << "device='cuda', requires_grad=True)" << to_channels_last << '\n';
ss << "device='cuda', requires_grad=True)" << to_channels_last << "\n";
ss << "net = torch.nn.Conv" << dim - 2 << "d(" << in_channels << ", "
<< out_channels << ", ";
ss << "kernel_size=" << ArrayRef<int>(&params.weight_size[2], dim - 2)
@ -192,7 +192,7 @@ std::string repro_from_args(const ConvolutionParams& params) {
ss << "dilation=" << ArrayRef<int>(params.dilation, dim - 2) << ", ";
ss << "groups=" << params.groups << ")\n";
ss << "net = net.cuda()." << partial_dtype << "()" << to_channels_last
<< '\n';
<< "\n";
ss << "out = net(data)\n";
ss << "out.backward(torch.randn_like(out))\n";
ss << "torch.cuda.synchronize()\n\n";

View File

@ -93,10 +93,11 @@ std::ostream& operator<<(std::ostream& out, const ConvolutionArgs& args) {
<< "input: " << args.idesc // already has a trailing newline
<< "output: " << args.odesc // already has a trailing newline
<< "weight: " << args.wdesc // already has a trailing newline
<< "Pointer addresses: " << '\n'
<< " input: " << args.input.const_data_ptr() << '\n'
<< " output: " << args.output.const_data_ptr() << '\n'
<< " weight: " << args.weight.const_data_ptr() << '\n';
<< "Pointer addresses: "
<< "\n"
<< " input: " << args.input.const_data_ptr() << "\n"
<< " output: " << args.output.const_data_ptr() << "\n"
<< " weight: " << args.weight.const_data_ptr() << "\n";
return out;
}

View File

@ -115,7 +115,7 @@ std::ostream& operator<<(
std::copy(
strides.begin(), strides.end() - 1, std::ostream_iterator<int>(oss, ","));
oss << sizes.back();
output << oss.str() << '}';
output << oss.str() << "}";
return output;
}

View File

@ -53,7 +53,7 @@ std::ostream& operator<<(std::ostream& out, const ConvParams& params) {
<< " transposed = " << params.transposed
<< " output_padding = " << IntArrayRef{params.output_padding}
<< " groups = " << params.groups << " benchmark = " << params.benchmark
<< " deterministic = " << params.deterministic << '}';
<< " deterministic = " << params.deterministic << "}";
return out;
}

View File

@ -91,25 +91,30 @@ static auto& lib = mps::MetalShaderLibrary::getBundledLibrary();
#include <ATen/native/mps/Repeat_metallib.h>
#endif
template <typename index_t>
void computeRepeatIndices(const index_t* repeat_ptr,
const int64_t* cumsum_ptr,
index_t* result_ptr,
int64_t size,
int64_t result_size) {
id<MTLBuffer> repeatBuffer = reinterpret_cast<id<MTLBuffer>>(repeat_ptr);
id<MTLBuffer> cumsumBuffer = reinterpret_cast<id<MTLBuffer>>(cumsum_ptr);
id<MTLBuffer> resultBuffer = reinterpret_cast<id<MTLBuffer>>(result_ptr);
TORCH_CHECK(repeatBuffer && cumsumBuffer && resultBuffer);
Tensor repeat_interleave_mps(const Tensor& repeat, std::optional<int64_t> output_size) {
TORCH_CHECK(repeat.dim() == 1, "repeat_interleave only accept 1D vector as repeat");
std::string scalar_type;
if constexpr (std::is_same_v<index_t, int32_t>) {
if (repeat.scalar_type() == kInt) {
scalar_type = "int32_t";
} else if constexpr (std::is_same_v<index_t, int64_t>) {
} else if (repeat.scalar_type() == kLong) {
scalar_type = "int64_t";
} else {
TORCH_CHECK(false, "repeat_interleave: unsupported indexing data type");
TORCH_CHECK(false, "repeats has to be Long or Int tensor");
}
if (repeat.size(0) == 0) {
return at::empty_like(repeat, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
}
Tensor repeat_ = repeat.contiguous();
Tensor cumsum = repeat.cumsum(0);
int64_t total = 0;
if (output_size.has_value()) {
total = output_size.value();
} else {
total = cumsum[-1].item<int64_t>();
TORCH_CHECK((repeat >= 0).all().item<uint8_t>(), "repeats can not be negative");
}
auto result = at::empty({total}, repeat.options());
MPSStream* mpsStream = getCurrentMPSStream();
dispatch_sync(mpsStream->queue(), ^() {
@ -121,20 +126,13 @@ void computeRepeatIndices(const index_t* repeat_ptr,
getMPSProfiler().beginProfileKernel(pipelineState, "repeat_interleave:" + scalar_type, false);
[computeEncoder setComputePipelineState:pipelineState];
mps::mtl_setArgs(computeEncoder, repeatBuffer, cumsumBuffer, resultBuffer, size);
mps::mtl_dispatch1DJob(computeEncoder, pipelineState, size);
mps::mtl_setArgs(computeEncoder, repeat_, cumsum, result, repeat.size(0));
mps::mtl_dispatch1DJob(computeEncoder, pipelineState, repeat.size(0));
getMPSProfiler().endProfileKernel(pipelineState);
}
});
}
Tensor repeat_interleave_mps(const Tensor& repeat, std::optional<int64_t> output_size) {
Tensor output;
AT_DISPATCH_INDEX_TYPES(repeat.scalar_type(), "repeat_interleave_mps", [&]() {
output = repeat_interleave_common<index_t, computeRepeatIndices<index_t>>(repeat, output_size);
});
return output;
return result;
}
} // namespace at::native
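A plain-CPU reference for the semantics the Metal kernel computes (a sketch for illustration, not the MPS code path): element i of the repeats tensor contributes the index i to the output repeats[i] times, so the output length equals the last entry of the cumulative sum, e.g. {2, 0, 3} -> {0, 0, 2, 2, 2}. Negative counts are rejected upstream by the check above.
#include <cstdint>
#include <vector>
std::vector<int64_t> repeat_interleave_ref(const std::vector<int64_t>& repeats) {
  std::vector<int64_t> out;
  for (int64_t i = 0; i < static_cast<int64_t>(repeats.size()); ++i) {
    out.insert(out.end(), repeats[i], i);  // emit index i, repeats[i] times
  }
  return out;
}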

View File

@ -4225,7 +4225,7 @@
MTIA: mm_out_mtia
MPS: mm_out_mps
XPU: mm_out_xpu
SparseCPU, SparseCUDA: _sparse_mm_out
SparseCPU, SparseCUDA, SparseMPS: _sparse_mm_out
SparseCsrCPU, SparseCsrCUDA, SparseCsrMeta: _sparse_csr_mm_out
- func: mm.dtype(Tensor self, Tensor mat2, ScalarType out_dtype) -> Tensor

View File

@ -301,12 +301,12 @@ class AvgPoolMicrokernelTester {
ASSERT_NEAR(
float(int32_t(y[i * yStride() + k])), yFP[i * kc() + k], 0.5001f)
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
}
}
@ -396,12 +396,12 @@ class AvgPoolMicrokernelTester {
ASSERT_NEAR(
float(int32_t(y[i * yStride() + k])), yFP[i * kc() + k], 0.5001f)
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< "), kc = " << kc() << ", acc = " << yAcc[i * kc() + k];
}
}

View File

@ -232,7 +232,7 @@ class MaxPoolMicrokernelTester {
ASSERT_EQ(
uint32_t(yRef[i * kc() + k]), uint32_t(y[i * yStride() + k]))
<< "at pixel " << i << ", channel " << k << ", n = " << n()
<< ", ks = " << kh() << 'x' << kw() << " (" << ks()
<< ", ks = " << kh() << "x" << kw() << " (" << ks()
<< "), kc = " << kc();
}
}

View File

@ -17,7 +17,7 @@ inline std::vector<T> _expand_param_if_needed(
std::ostringstream ss;
ss << "expected " << param_name << " to be a single integer value or a "
<< "list of " << expected_dim << " values to match the convolution "
<< "dimensions, but got " << param_name << '=' << list_param;
<< "dimensions, but got " << param_name << "=" << list_param;
TORCH_CHECK(false, ss.str());
} else {
return list_param.vec();

View File

@ -358,9 +358,9 @@ std::string Adapter::stringize() const {
std::string device_type = get_device_type_str(properties.deviceType);
VkPhysicalDeviceLimits limits = properties.limits;
ss << '{' << std::endl;
ss << "{" << std::endl;
ss << " Physical Device Info {" << std::endl;
ss << " apiVersion: " << v_major << '.' << v_minor << std::endl;
ss << " apiVersion: " << v_major << "." << v_minor << std::endl;
ss << " driverversion: " << properties.driverVersion << std::endl;
ss << " deviceType: " << device_type << std::endl;
ss << " deviceName: " << properties.deviceName << std::endl;
@ -371,7 +371,7 @@ std::string Adapter::stringize() const {
#define PRINT_LIMIT_PROP_VEC3(name) \
ss << " " << std::left << std::setw(36) << #name << limits.name[0] \
<< ',' << limits.name[1] << ',' << limits.name[2] << std::endl;
<< "," << limits.name[1] << "," << limits.name[2] << std::endl;
ss << " Physical Device Limits {" << std::endl;
PRINT_LIMIT_PROP(maxImageDimension1D);
@ -425,7 +425,7 @@ std::string Adapter::stringize() const {
;
}
ss << " ]" << std::endl;
ss << '}';
ss << "}";
return ss.str();
}

View File

@ -33,7 +33,7 @@ std::ostream& operator<<(std::ostream& out, const VkResult result) {
VK_RESULT_CASE(VK_ERROR_FORMAT_NOT_SUPPORTED)
VK_RESULT_CASE(VK_ERROR_FRAGMENTED_POOL)
default:
out << "VK_ERROR_UNKNOWN (VkResult " << result << ')';
out << "VK_ERROR_UNKNOWN (VkResult " << result << ")";
break;
}
return out;
@ -46,7 +46,7 @@ std::ostream& operator<<(std::ostream& out, const VkResult result) {
//
std::ostream& operator<<(std::ostream& out, const SourceLocation& loc) {
out << loc.function << " at " << loc.file << ':' << loc.line;
out << loc.function << " at " << loc.file << ":" << loc.line;
return out;
}
@ -66,7 +66,7 @@ Error::Error(SourceLocation source_location, const char* cond, std::string msg)
: msg_(std::move(msg)), source_location_{source_location} {
std::ostringstream oss;
oss << "Exception raised from " << source_location_ << ": ";
oss << '(' << cond << ") is false! ";
oss << "(" << cond << ") is false! ";
oss << msg_;
what_ = oss.str();
}

View File

@ -173,8 +173,8 @@ void QueryPool::extract_results() {
static std::string stringize(const VkExtent3D& extents) {
std::stringstream ss;
ss << '{' << extents.width << ", " << extents.height << ", " << extents.depth
<< '}';
ss << "{" << extents.width << ", " << extents.height << ", " << extents.depth
<< "}";
return ss.str();
}

View File

@ -149,7 +149,7 @@ VKAPI_ATTR VkBool32 VKAPI_CALL debug_report_callback_fn(
(void)flags;
std::stringstream stream;
stream << layer_prefix << ' ' << message_code << ' ' << message << std::endl;
stream << layer_prefix << " " << message_code << " " << message << std::endl;
const std::string log = stream.str();
std::cout << log;

View File

@ -253,7 +253,7 @@ using vec4 = vec<4u>;
// uvec3 is the type representing tensor extents. Useful for debugging.
inline std::ostream& operator<<(std::ostream& os, const uvec3& v) {
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ')';
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ")";
return os;
}

View File

@ -61,6 +61,7 @@ list(APPEND ATen_CUDA_TEST_SRCS
${CMAKE_CURRENT_SOURCE_DIR}/cuda_complex_math_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_complex_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_cub_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_cublas_handle_pool_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/cuda_device_test.cpp
${CMAKE_CURRENT_SOURCE_DIR}/cuda_distributions_test.cu
${CMAKE_CURRENT_SOURCE_DIR}/cuda_dlconvertor_test.cpp

View File

@ -246,7 +246,7 @@ void TestToCFloat() {
void TestToString() {
Tensor b = ones({3, 7}) * .0000001f;
std::stringstream s;
s << b << '\n';
s << b << "\n";
std::string expect = "1e-07 *";
ASSERT_EQ_RESOLVED(s.str().substr(0, expect.size()), expect);
}

View File

@ -0,0 +1,77 @@
#include <gtest/gtest.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAGuard.h>
#include <atomic>
#include <thread>
#include <vector>
// Test concurrent access to getCurrentCUDABlasHandle and getCUDABlasLtWorkspace
// to verify that the data race fix is working correctly
TEST(CUDABlasHandlePoolTest, ConcurrentGetAndClearWorkspaces) {
if (!at::cuda::is_available()) {
return;
}
constexpr int num_accessor_threads = 15;
constexpr int num_clear_threads = 5;
constexpr int iterations_per_thread = 50;
std::atomic<bool> stop{false};
std::atomic<int> error_count{0};
std::vector<std::thread> threads;
threads.reserve(num_accessor_threads + num_clear_threads);
// Launch accessor threads
for (int i = 0; i < num_accessor_threads; ++i) {
threads.emplace_back([&stop, &error_count]() {
try {
at::cuda::CUDAGuard device_guard(0);
while (!stop.load(std::memory_order_relaxed)) {
const auto handle = at::cuda::getCurrentCUDABlasHandle();
const auto workspace = at::cuda::getCUDABlasLtWorkspace();
if (handle == nullptr || workspace == nullptr) {
error_count++;
}
}
} catch (const std::exception& e) {
error_count++;
}
});
}
// Launch threads that clear workspaces
for (int i = 0; i < num_clear_threads; ++i) {
threads.emplace_back([&error_count]() {
try {
for (int j = 0; j < iterations_per_thread; ++j) {
at::cuda::clearCublasWorkspaces();
std::this_thread::yield();
}
} catch (const std::exception& e) {
error_count++;
}
});
}
// Let them run for a bit
std::this_thread::sleep_for(std::chrono::milliseconds(100));
stop.store(true, std::memory_order_relaxed);
for (auto& thread : threads) {
thread.join();
}
EXPECT_EQ(error_count.load(), 0);
}
int main(int argc, char* argv[]) {
::testing::InitGoogleTest(&argc, argv);
c10::cuda::CUDACachingAllocator::init(1);
return RUN_ALL_TESTS();
}

View File

@ -33,7 +33,7 @@ struct Foo {
static void apply(Tensor a, Tensor b) {
scalar_type s = 1;
std::stringstream ss;
ss << "hello, dispatch: " << a.toString() << s << '\n';
ss << "hello, dispatch: " << a.toString() << s << "\n";
auto data = (scalar_type*)a.data_ptr();
(void)data;
}
@ -73,8 +73,8 @@ TEST(TestScalar, TestScalar) {
Scalar bar = 3.0;
Half h = bar.toHalf();
Scalar h2 = h;
cout << "H2: " << h2.toDouble() << ' ' << what.toFloat() << ' '
<< bar.toDouble() << ' ' << what.isIntegral(false) << '\n';
cout << "H2: " << h2.toDouble() << " " << what.toFloat() << " "
<< bar.toDouble() << " " << what.isIntegral(false) << "\n";
auto gen = at::detail::getDefaultCPUGenerator();
{
// See Note [Acquire lock when using random generators]
@ -84,7 +84,7 @@ TEST(TestScalar, TestScalar) {
}
if (at::hasCUDA()) {
auto t2 = zeros({4, 4}, at::kCUDA);
cout << &t2 << '\n';
cout << &t2 << "\n";
}
auto t = ones({4, 4});
@ -129,7 +129,7 @@ TEST(TestScalar, TestScalar) {
std::stringstream ss;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto)
ASSERT_NO_THROW(
ss << "hello, dispatch" << x.toString() << s << '\n');
ss << "hello, dispatch" << x.toString() << s << "\n");
auto data = (scalar_t*)x.data_ptr();
(void)data;
});

View File

@ -1,5 +1,5 @@
#include <ATen/ATen.h>
int main() {
std::cout << at::ones({3,4}, at::CPU(at::kFloat)) << '\n';
std::cout << at::ones({3,4}, at::CPU(at::kFloat)) << "\n";
}

View File

@ -1828,9 +1828,9 @@ namespace {
#endif
EXPECT_EQ(u16, c10::detail::fp16_ieee_from_fp32_value(f32s[i]))
<< "Test failed for float to uint16 " << f32s[i] << '\n';
<< "Test failed for float to uint16 " << f32s[i] << "\n";
EXPECT_EQ(x, c10::detail::fp16_ieee_to_fp32_value(u16))
<< "Test failed for uint16 to float " << u16 << '\n';
<< "Test failed for uint16 to float " << u16 << "\n";
}
}
TEST(FP8E4M3Test, FP8E4M3ConversionFloat) {
@ -1848,10 +1848,10 @@ namespace {
EXPECT_TRUE(std::isnan(f32));
} else {
EXPECT_EQ(f32, c10::detail::fp8e4m3fn_to_fp32_value(input))
<< "Test failed for u8 to float " << input << '\n';
<< "Test failed for u8 to float " << input << "\n";
}
EXPECT_EQ(u8, c10::detail::fp8e4m3fn_from_fp32_value(f32))
<< "Test failed for float to u8 " << f32 << '\n';
<< "Test failed for float to u8 " << f32 << "\n";
}
}
TEST(FP8E4M3Test, FP8E4M3BinaryAdd) {
@ -2015,10 +2015,10 @@ namespace {
EXPECT_TRUE(std::isnan(f32));
} else {
EXPECT_EQ(f32, c10::detail::fp8e5m2_to_fp32_value(input))
<< "Test failed for u8 to float " << input << '\n';
<< "Test failed for u8 to float " << input << "\n";
}
EXPECT_EQ(u8, c10::detail::fp8e5m2_from_fp32_value(f32))
<< "Test failed for float to u8 " << f32 << '\n';
<< "Test failed for float to u8 " << f32 << "\n";
}
}
TEST(FP8E5M2Test, FP8E5M2BinaryAdd) {

View File

@ -19,7 +19,7 @@ TEST(Vitals, Basic) {
c10::utils::set_env("TORCH_VITAL", "1");
TORCH_VITAL_DEFINE(Testing);
TORCH_VITAL(Testing, Attribute0) << 1;
TORCH_VITAL(Testing, Attribute1) << '1';
TORCH_VITAL(Testing, Attribute1) << "1";
TORCH_VITAL(Testing, Attribute2) << 1.0f;
TORCH_VITAL(Testing, Attribute3) << 1.0;
auto t = at::ones({1, 1});

View File

@ -129,14 +129,14 @@ void showRtol(const at::Tensor& a, const at::Tensor& b) {
std::cout << "Max Diff allowed: " << maxDiff << std::endl;
if (diff.sizes().size() == 2) {
for (const auto y : c10::irange(diff.sizes()[0])) {
std::cout << y << ':';
std::cout << y << ":";
for (const auto x : c10::irange(diff.sizes()[1])) {
float diff_xy = diff[y][x].item<float>();
if (diff_xy > maxDiff) {
std::cout << std::setw(5) << x;
}
else {
std::cout << std::setw(5) << ' ';
std::cout << std::setw(5) << " ";
}
}
std::cout << std::endl;
@ -3276,7 +3276,7 @@ TEST_F(VulkanAPITest, masked_fill_invalidinputs_exceptions) {
void print_shape(const std::vector<int64_t>& shape) {
for (const auto& num : shape) {
std::cout << num << ' ';
std::cout << num << " ";
}
}
@ -3367,7 +3367,7 @@ void test_masked_fill_scalar(
print_shape(tmp_curr_input_shape);
std::cout << "], and mask of shape [";
print_shape(tmp_curr_mask_shape);
std::cout << ']' << std::endl;
std::cout << "]" << std::endl;
}
ASSERT_TRUE(check);
@ -4542,9 +4542,9 @@ void test_softmax(const at::IntArrayRef shape, bool log_softmax = false) {
if (!check) {
std::cout << "Softmax test failed on axis " << dim << "for tensor dims {";
for (uint32_t place = 0; place < shape.size() - 1; place++) {
std::cout << shape[place] << ' ';
std::cout << shape[place] << " ";
}
std::cout << shape.back() << '}' << std::endl;
std::cout << shape.back() << "}" << std::endl;
showRtol(out_cpu, out_vulkan.cpu());
}
ASSERT_TRUE(check);

View File

@ -95,7 +95,7 @@ void showRtol(
std::cout << "Max Diff found is: " << diff.max().item<double>() << std::endl;
if (diff.sizes().size() == 2) {
for (const auto y : c10::irange(diff.sizes()[0])) {
std::cout << y << ':';
std::cout << y << ":";
for (const auto x : c10::irange(diff.sizes()[1])) {
double diff_xy = diff[y][x].item<double>();
if (diff_xy > maxDiff) {
@ -109,7 +109,7 @@ void showRtol(
}
}
} else {
std::cout << std::setw(5) << ' ';
std::cout << std::setw(5) << " ";
}
}
std::cout << std::endl;
@ -148,19 +148,19 @@ using at::native::vulkan::api::utils::ivec4;
using at::native::vulkan::api::utils::vec4;
std::ostream& operator<<(std::ostream& os, const vec4& v) {
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ')';
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ")";
return os;
}
std::ostream& operator<<(std::ostream& os, const ivec3& v) {
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ')';
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ")";
return os;
}
std::ostream& operator<<(std::ostream& os, const ivec4& v) {
os << '(' << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ')';
os << "(" << v.data[0u] << ", " << v.data[1u] << ", " << v.data[2u] << ", "
<< v.data[3u] << ")";
return os;
}
@ -3379,51 +3379,51 @@ bool _test_quantized_linear(
showRtol(out_cpu_dequant, out_vk_to_cpu_dequant);
}
if (xpos != -1 && ypos != -1) {
std::cout << "\nFailure caused on row/col: " << ypos << '/' << xpos
<< '\n';
std::cout << "\nFailure caused on row/col: " << ypos << "/" << xpos
<< "\n";
std::cout << "Input tensor scale: " << scale << " zerop: " << zero_point
<< '\n';
std::cout << "Input tensor row " << ypos << '\n';
<< "\n";
std::cout << "Input tensor row " << ypos << "\n";
for (int i = 0; i < input_cpu.sizes()[1]; i++) {
std::cout << input_cpu[ypos][i].item<double>() << ", ";
}
std::cout << '\n';
std::cout << "\n";
std::cout << "Weight tensor scale: " << w_scale
<< " zerop: " << w_zero_point << '\n';
std::cout << "Weight tensor col " << xpos << '\n';
<< " zerop: " << w_zero_point << "\n";
std::cout << "Weight tensor col " << xpos << "\n";
for (int i = 0; i < weight.sizes()[1]; i++) {
std::cout << weight[xpos][i].item<double>() << ", ";
}
std::cout << '\n';
std::cout << "\n";
std::cout << "Input tensor quantized row " << ypos << " with dtype "
<< (input_quant_dtype_int8 ? "QInt8" : "QUInt8") << '\n';
<< (input_quant_dtype_int8 ? "QInt8" : "QUInt8") << "\n";
for (int i = 0; i < input_cpu.sizes()[1]; i++) {
std::cout << input_cpu_quantized[ypos][i].item<double>() << ", ";
}
std::cout << '\n';
std::cout << "\n";
std::cout << "Weight tensor quantized col " << xpos << " with dtype "
<< (weight_quant_dtype_int8 ? "QInt8" : "QUInt8") << '\n';
<< (weight_quant_dtype_int8 ? "QInt8" : "QUInt8") << "\n";
for (int i = 0; i < weight.sizes()[1]; i++) {
std::cout << weight_cpu_quantized[xpos][i].item<double>() << ", ";
}
std::cout << '\n';
std::cout << "\n";
std::cout << "bias tensor\n";
for (int i = 0; i < bias.sizes()[0]; i++) {
std::cout << bias[i].item<double>() << ", ";
}
std::cout << '\n';
std::cout << "\n";
std::cout << "out_scale: " << out_scale
<< " out_zero_point: " << out_zero_point << '\n';
<< " out_zero_point: " << out_zero_point << "\n";
std::cout << "cpu unmatched output: "
<< out_cpu_dequant[ypos][xpos].item<double>() << '\n';
<< out_cpu_dequant[ypos][xpos].item<double>() << "\n";
std::cout << "vk unmatched output: "
<< out_vk_to_cpu_dequant[ypos][xpos].item<double>() << '\n';
<< out_vk_to_cpu_dequant[ypos][xpos].item<double>() << "\n";
}
}
return check;

View File

@@ -10,6 +10,13 @@
...
}
{
ignore_empty_generic_uninitialised_conditional_jump
Memcheck:Cond
fun:_ZN2at6detail13empty_genericEN3c108ArrayRefIlEEPNS1_9AllocatorENS1_14DispatchKeySetENS1_10ScalarTypeESt8optionalINS1_12MemoryFormatEE
...
}
{
Cond_cuda
Memcheck:Cond

View File

@@ -9,28 +9,61 @@ def check_perf_csv(filename, threshold, threshold_scale):
"""
Basic performance checking.
"""
try:
df = pd.read_csv(filename)
except FileNotFoundError:
print(f"Error: File {filename} not found")
sys.exit(1)
df = pd.read_csv(filename)
effective_threshold = threshold * threshold_scale
print(f"Checking {filename} (speedup threshold >= {effective_threshold:.2f}x)\n")
failed = []
for _, row in df.iterrows():
model_name = row["name"]
speedup = row["speedup"]
if speedup < threshold * threshold_scale:
failed.append(model_name)
speedup = float(row["speedup"])
abs_latency = float(row["abs_latency"])
compilation_latency = float(row["compilation_latency"])
compression_ratio = float(row["compression_ratio"])
eager_peak_mem = float(row["eager_peak_mem"])
dynamo_peak_mem = float(row["dynamo_peak_mem"])
print(f"{model_name:34} {speedup}")
perf_summary = f"{model_name:34} speedup={speedup:.3f}x"
if pd.notna(abs_latency):
perf_summary += f", latency={abs_latency:.1f} ms/iter"
if pd.notna(compilation_latency):
perf_summary += f", compile={compilation_latency:.3f}s"
if pd.notna(compression_ratio):
perf_summary += f", mem_ratio={1 / compression_ratio:.2f}x"
if pd.notna(eager_peak_mem) and pd.notna(dynamo_peak_mem):
perf_summary += (
f" (eager={eager_peak_mem:.1f} GB, dynamo={dynamo_peak_mem:.1f} GB)"
)
if speedup < effective_threshold:
failed.append((model_name, speedup))
print(perf_summary)
if failed:
print(
textwrap.dedent(
f"""
Error {len(failed)} models performance regressed
{" ".join(failed)}
Error {len(failed)} model(s) performance regressed
{" ".join([name for name, _ in failed])}
"""
)
)
for name, sp in sorted(failed, key=lambda x: x[1]):
pct_from_target = (sp / effective_threshold - 1.0) * 100.0
print(
f" - {name}: {sp:.3f}x (< {effective_threshold:.2f}x; {pct_from_target:.1f}% from target)"
)
sys.exit(1)
else:
print(
f"\nAll {len(df)} model(s) passed threshold check (>= {effective_threshold:.2f}x)"
)
if __name__ == "__main__":
@@ -44,7 +77,7 @@ if __name__ == "__main__":
"-s",
type=float,
default=1.0,
help="multiple threshold by this value to relax the check",
help="multiply threshold by this value to relax the check",
)
args = parser.parse_args()
check_perf_csv(args.file, args.threshold, args.threshold_scale)
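
For orientation, a worked example with assumed numbers (not taken from this diff): a threshold of 1.10 combined with -s 0.9 gives an effective threshold of 1.10 * 0.9 = 0.99x. A model whose measured speedup is 0.95x then lands in the failure list as "0.950x (< 0.99x; -4.0% from target)" and the script exits with status 1; if every model clears 0.99x, the "All N model(s) passed threshold check" summary is printed instead.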

View File

@@ -2379,7 +2379,9 @@ class BenchmarkRunner:
print(
f"Load model outputs from {self.args.compare_model_outputs_with} to compare"
)
saved_result = torch.load(self.args.compare_model_outputs_with)
saved_result = torch.load(
self.args.compare_model_outputs_with, weights_only=False
)
is_bitwise_same = bitwise_same(saved_result, new_result)
if not is_bitwise_same:
print(

View File

@@ -176,7 +176,7 @@ std::ostream& operator<<(std::ostream& os, DispatchKeySet ts) {
os << k;
first = false;
}
os << ')';
os << ")";
return os;
}

View File

@@ -44,7 +44,7 @@ struct C10_API SafePyObject {
(*other.pyinterpreter_)->incref(other.data_);
}
if (data_ != nullptr) {
(*pyinterpreter_)->decref(data_, /*has_pyobj_slot*/ false);
(*pyinterpreter_)->decref(data_);
}
data_ = other.data_;
pyinterpreter_ = other.pyinterpreter_;
@@ -53,7 +53,7 @@ struct C10_API SafePyObject {
~SafePyObject() {
if (data_ != nullptr) {
(*pyinterpreter_)->decref(data_, /*has_pyobj_slot*/ false);
(*pyinterpreter_)->decref(data_);
}
}

View File

@@ -92,13 +92,6 @@ inline bool isComplexType(ScalarType t) {
t == ScalarType::ComplexDouble);
}
inline bool isQIntType(ScalarType t) {
// Don't forget to extend this when adding new QInt types
return t == ScalarType::QInt8 || t == ScalarType::QUInt8 ||
t == ScalarType::QInt32 || t == ScalarType::QUInt4x2 ||
t == ScalarType::QUInt2x4;
}
inline bool isBitsType(ScalarType t) {
return t == ScalarType::Bits1x8 || t == ScalarType::Bits2x4 ||
t == ScalarType::Bits4x2 || t == ScalarType::Bits8 ||

View File

@@ -48,6 +48,30 @@ void warnDeprecatedDataPtr() {
TORCH_CHECK(false, "Cannot access data pointer of Storage that is invalid.");
}
void StorageImpl::incref_pyobject() const {
// Because intrusive_ptr incref uses relaxed memory order, we need to
// do an acquire fence to ensure that the kHasPyObject bit was
// observed before the load of the PyObject* below.
// NB: This is a no-op on x86/x86-64
std::atomic_thread_fence(std::memory_order_acquire);
PyObject* obj = pyobj_slot_.load_pyobj();
(*pyobj_slot_.pyobj_interpreter())->incref(obj);
}
void StorageImpl::decref_pyobject() const {
PyObject* obj = pyobj_slot_.load_pyobj();
(*pyobj_slot_.pyobj_interpreter())->decref(obj);
}
bool StorageImpl::try_incref_pyobject() const {
c10::impl::PyInterpreter* interp = pyobj_slot_.pyobj_interpreter();
if (C10_UNLIKELY(!interp)) {
return false;
}
return (*interp)->try_incref(pyobj_slot_);
}
void SetStorageImplCreate(DeviceType t, StorageImplCreateHelper fptr) {
// Allowlist verification.
// Only if the devicetype is in the allowlist,
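
The acquire-fence comment above is the subtle part of this change: the "has a PyObject" bit is observed through a relaxed refcount operation, so an explicit std::atomic_thread_fence is what orders the later pointer load. A minimal standalone sketch of that pattern (generic C++, invented names, not PyTorch code):

#include <atomic>
#include <cassert>
#include <thread>

struct Payload { int value = 0; };

std::atomic<Payload*> g_payload{nullptr};
std::atomic<bool> g_has_payload{false};  // stands in for the kHasPyObject bit

void writer() {
  static Payload p;
  p.value = 42;
  g_payload.store(&p, std::memory_order_relaxed);
  // Release store: everything above becomes visible to a thread that
  // observes the flag and then issues an acquire fence (or acquire load).
  g_has_payload.store(true, std::memory_order_release);
}

void reader() {
  if (g_has_payload.load(std::memory_order_relaxed)) {
    // Upgrade the relaxed observation: after this fence, the stores made
    // before the release store above are guaranteed to be visible.
    // (The fence is a no-op instruction-wise on x86/x86-64.)
    std::atomic_thread_fence(std::memory_order_acquire);
    Payload* p = g_payload.load(std::memory_order_relaxed);
    assert(p != nullptr && p->value == 42);
  }
}

int main() {
  std::thread t1(writer);
  std::thread t2(reader);
  t1.join();
  t2.join();
  return 0;
}

On x86/x86-64 the acquire fence costs nothing, matching the NB in the diff; on weakly ordered targets such as ARM it is what keeps a reader from seeing the flag without also seeing the pointer it guards.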

View File

@@ -105,6 +105,12 @@ struct C10_API StorageImpl : public c10::intrusive_ptr_target {
data_ptr_.clear();
}
void incref_pyobject() const override final;
void decref_pyobject() const override final;
bool try_incref_pyobject() const override final;
size_t nbytes() const {
// OK to do this instead of maybe_as_int as nbytes is guaranteed positive
TORCH_CHECK(!size_bytes_is_heap_allocated_);
@@ -370,4 +376,18 @@ C10_API c10::intrusive_ptr<c10::StorageImpl> make_storage_impl(
bool resizable,
std::optional<at::Device> device_opt);
namespace detail {
#ifndef C10_MOBILE
template <class T>
struct TargetTraits<
T,
std::enable_if_t<
std::is_base_of_v<c10::StorageImpl, std::remove_cv_t<T>>>> {
static constexpr bool can_have_pyobject = true;
};
#endif
} // namespace detail
} // namespace c10
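
The TargetTraits addition above is the standard "primary template with a default, plus an enable_if_t partial specialization keyed on a base class" idiom. A self-contained sketch of the same mechanism with invented names:

#include <type_traits>

struct PyCapableBase {};               // stand-in for the real impl base class
struct DerivedImpl : PyCapableBase {};
struct UnrelatedImpl {};

// Primary template: the conservative default.
template <class T, class Enable = void>
struct TargetTraitsSketch {
  static constexpr bool can_have_pyobject = false;
};

// Partial specialization selected for anything derived from PyCapableBase,
// ignoring const/volatile qualifiers, mirroring the diff above.
template <class T>
struct TargetTraitsSketch<
    T,
    std::enable_if_t<
        std::is_base_of_v<PyCapableBase, std::remove_cv_t<T>>>> {
  static constexpr bool can_have_pyobject = true;
};

static_assert(TargetTraitsSketch<DerivedImpl>::can_have_pyobject);
static_assert(TargetTraitsSketch<const DerivedImpl>::can_have_pyobject);
static_assert(!TargetTraitsSketch<UnrelatedImpl>::can_have_pyobject);

int main() { return 0; }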

View File

@@ -277,7 +277,6 @@ void TensorImpl::release_resources() {
if (storage_) {
storage_ = {};
}
pyobj_slot_.maybe_destroy_pyobj();
}
#ifndef C10_DISABLE_TENSORIMPL_EXTENSIBILITY
@@ -989,6 +988,30 @@ void TensorImpl::empty_tensor_restride_symint(MemoryFormat memory_format) {
}
}
void TensorImpl::incref_pyobject() const {
// Because intrusive_ptr incref uses relaxed memory order, we need to
// do an acquire fence to ensure that the kHasPyObject bit was
// observed before the load of the PyObject* below.
// NB: This is a no-op on x86/x86-64
std::atomic_thread_fence(std::memory_order_acquire);
PyObject* obj = pyobj_slot_.load_pyobj();
(*pyobj_slot_.pyobj_interpreter())->incref(obj);
}
void TensorImpl::decref_pyobject() const {
PyObject* obj = pyobj_slot_.load_pyobj();
(*pyobj_slot_.pyobj_interpreter())->decref(obj);
}
bool TensorImpl::try_incref_pyobject() const {
c10::impl::PyInterpreter* interp = pyobj_slot_.pyobj_interpreter();
if (C10_UNLIKELY(!interp)) {
return false;
}
return (*interp)->try_incref(pyobj_slot_);
}
namespace impl {
namespace {

View File

@@ -2178,6 +2178,12 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target {
return &pyobj_slot_;
}
void incref_pyobject() const override final;
void decref_pyobject() const override final;
bool try_incref_pyobject() const override final;
private:
// See NOTE [std::optional operator usage in CUDA]
// We probably don't want to expose this publicly until
@@ -3079,6 +3085,19 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target {
friend class C10_TensorImpl_Size_Check_Dummy_Class;
};
namespace detail {
#ifndef C10_MOBILE
template <class T>
struct TargetTraits<
T,
std::enable_if_t<std::is_base_of_v<c10::TensorImpl, std::remove_cv_t<T>>>> {
static constexpr bool can_have_pyobject = true;
};
#endif
} // namespace detail
// Note [TensorImpl size constraints]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// Changed the size of TensorImpl? If the size went down, good for

View File

@@ -33,7 +33,7 @@ std::ostream& operator<<(std::ostream& stream, const TensorOptions& options) {
} else {
stream << "(nullopt)";
}
stream << ')';
stream << ")";
return stream;
}

View File

@@ -11,8 +11,11 @@ struct NoopPyInterpreterVTable final : public PyInterpreterVTable {
void incref(PyObject* pyobj) const override {} // do nothing
void decref(PyObject* pyobj, bool has_pyobj_slot) const override {
} // do nothing
void decref(PyObject* pyobj) const override {} // do nothing
bool try_incref(const c10::impl::PyObjectSlot& pyobj_slot) const override {
return false;
}
#define PANIC(m) \
TORCH_INTERNAL_ASSERT( \
@@ -20,6 +23,10 @@ struct NoopPyInterpreterVTable final : public PyInterpreterVTable {
"attempted to call " #m \
" on a Tensor with nontrivial PyObject after corresponding interpreter died")
size_t refcnt(PyObject* pyobj) const override {
PANIC(refcnt);
}
c10::intrusive_ptr<TensorImpl> detach(const TensorImpl* self) const override {
PANIC(detach);
}

View File

@@ -18,6 +18,9 @@ namespace c10 {
struct IValue;
class OperatorHandle;
struct TensorImpl;
namespace impl {
struct PyObjectSlot;
} // namespace impl
} // namespace c10
namespace torch::jit {
@@ -126,9 +129,12 @@ struct C10_API PyInterpreterVTable {
// Run Py_INCREF on a PyObject.
virtual void incref(PyObject* pyobj) const = 0;
// Run Py_DECREF on a PyObject. We DO NOT assume the GIL is held on call
// See NOTE [PyInterpreter::decref takes a `has_pyobj_slot` arg]
virtual void decref(PyObject* pyobj, bool has_pyobj_slot) const = 0;
// Run Py_DECREF on a PyObject. We DO NOT assume the GIL is held on call.
virtual void decref(PyObject* pyobj) const = 0;
// Run PyUnstable_TryIncRef on a PyObject if it's not NULL.
virtual bool try_incref(const c10::impl::PyObjectSlot& pyobj_slot) const = 0;
// Run Py_REFCNT on a PyObject.
virtual size_t refcnt(PyObject* pyobj) const = 0;
// Perform a detach by deferring to the __torch_dispatch__ implementation of
// detach, which will also arrange for the PyObject to get copied in this
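
As a rough illustration of the reshaped interface (a toy only, not the real torch vtable; FakePyObject, FakeSlot, and the class names are invented), an implementer now supplies a one-argument decref together with try_incref and refcnt:

#include <cstddef>

struct FakePyObject { long refs = 1; };  // stand-in for PyObject in this sketch
struct FakeSlot { FakePyObject* obj = nullptr; };

struct RefcountVTableSketch {
  virtual ~RefcountVTableSketch() = default;
  virtual void incref(FakePyObject* obj) const = 0;
  // Single-argument decref, matching the signature change in the diff.
  virtual void decref(FakePyObject* obj) const = 0;
  // Reports whether the object held by the slot could be revived.
  virtual bool try_incref(const FakeSlot& slot) const = 0;
  virtual size_t refcnt(FakePyObject* obj) const = 0;
};

// A trivial in-process implementation, just to show the override set.
struct ToyVTable final : RefcountVTableSketch {
  void incref(FakePyObject* obj) const override { ++obj->refs; }
  void decref(FakePyObject* obj) const override { --obj->refs; }
  bool try_incref(const FakeSlot& slot) const override {
    if (slot.obj == nullptr) return false;
    ++slot.obj->refs;
    return true;
  }
  size_t refcnt(FakePyObject* obj) const override {
    return static_cast<size_t>(obj->refs);
  }
};

int main() {
  FakePyObject o;
  FakeSlot s{&o};
  ToyVTable vt;
  vt.incref(&o);
  vt.decref(&o);
  return vt.try_incref(s) && vt.refcnt(&o) == 2 ? 0 : 1;
}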

View File

@@ -1,56 +0,0 @@
#include <c10/core/impl/PyObjectSlot.h>
namespace c10::impl {
PyObjectSlot::PyObjectSlot() : pyobj_interpreter_(nullptr), pyobj_(nullptr) {}
PyObjectSlot::~PyObjectSlot() {
maybe_destroy_pyobj();
}
void PyObjectSlot::maybe_destroy_pyobj() {
if (owns_pyobj()) {
TORCH_INTERNAL_ASSERT(pyobj_interpreter_ != nullptr);
TORCH_INTERNAL_ASSERT(pyobj_ != nullptr);
(*pyobj_interpreter_.load(std::memory_order_acquire))
->decref(_unchecked_untagged_pyobj(), /*has_pyobj_slot*/ true);
// NB: this destructor can only be entered when there are no
// references to this C++ object (obviously), NOR any references
// to the PyObject (if there are references to the PyObject,
// then the PyObject holds an owning reference to the tensor).
// So it is OK to clear pyobj_ here as it is impossible for it to
// be used again (modulo weak reference races)
pyobj_ = nullptr; // for safety
}
}
PyInterpreter* PyObjectSlot::pyobj_interpreter() {
return pyobj_interpreter_.load(std::memory_order_acquire);
}
PyObject* PyObjectSlot::_unchecked_untagged_pyobj() const {
// NOLINTNEXTLINE(performance-no-int-to-ptr)
return reinterpret_cast<PyObject*>(
reinterpret_cast<uintptr_t>(pyobj_) & ~0x1ULL);
}
PyInterpreter& PyObjectSlot::load_pyobj_interpreter() const {
auto interpreter = pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter) {
return *interpreter;
}
TORCH_CHECK(false, "cannot access PyObject for Tensor - no interpreter set");
}
bool PyObjectSlot::owns_pyobj() {
// NOLINTNEXTLINE(performance-no-int-to-ptr)
return reinterpret_cast<uintptr_t>(pyobj_) & 1;
}
void PyObjectSlot::set_owns_pyobj(bool b) {
// NOLINTNEXTLINE(performance-no-int-to-ptr)
pyobj_ = reinterpret_cast<PyObject*>(
reinterpret_cast<uintptr_t>(_unchecked_untagged_pyobj()) | b);
}
} // namespace c10::impl
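
The deleted file above implemented the old ownership scheme by hiding an "owns the PyObject" flag in bit 0 of the pointer. A minimal sketch of that low-bit tagging trick in isolation (ordinary C++ on an aligned object, not PyTorch code):

#include <cassert>
#include <cstdint>

struct Obj { int v; };  // alignment >= 4 bytes, so bit 0 is free for a tag

static Obj* tag(Obj* p, bool owns) {
  // Set bit 0 when the C++ side owns the Python object.
  return reinterpret_cast<Obj*>(
      reinterpret_cast<uintptr_t>(p) | static_cast<uintptr_t>(owns));
}
static Obj* untag(Obj* p) {
  // Clear bit 0 to recover the real pointer.
  return reinterpret_cast<Obj*>(reinterpret_cast<uintptr_t>(p) & ~uintptr_t{1});
}
static bool owns(Obj* p) {
  return (reinterpret_cast<uintptr_t>(p) & 1) != 0;
}

int main() {
  Obj o{7};
  Obj* tagged = tag(&o, /*owns=*/true);
  assert(owns(tagged));
  assert(untag(tagged)->v == 7);
  return 0;
}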

View File

@@ -8,117 +8,58 @@
#include <atomic>
namespace torch::utils {
class PyObjectPreservation;
}
namespace c10::impl {
struct C10_API PyObjectSlot {
public:
PyObjectSlot();
~PyObjectSlot();
void maybe_destroy_pyobj();
// Associate the TensorImpl with the specified PyObject, and, if necessary,
// also tag the interpreter.
//
// NB: This lives in a header so that we can inline away the switch on status
//
// NB: THIS FUNCTION CAN RAISE AN EXCEPTION. Make sure to clean up after
// PyObject if necessary!
void init_pyobj(PyObject* pyobj) {
pyobj_interpreter_.store(
getGlobalPyInterpreter(), std::memory_order_relaxed);
pyobj_ = pyobj;
}
PyObjectSlot() : pyobj_interpreter_(nullptr), pyobj_(nullptr) {}
// Query the PyObject interpreter. This may return null if there is no
// interpreter. This is racy!
PyInterpreter* pyobj_interpreter();
PyObject* _unchecked_untagged_pyobj() const;
// Test the interpreter tag. If tagged for the current interpreter, return
// a non-nullopt (but possibly null) PyObject. If (possibly) untagged,
// returns a nullopt. If it is definitely invalid, raises an error.
//
// If `ignore_hermetic_tls` is false and this function is called from a
// hermetic context (ie, `HermeticPyObjectTLS::get_state()` is true), then
// nullopt is returned. If `ignore_hermetic_tls` is true, then the hermetic
// context is ignored, allowing you to check the interpreter tag of a
// nonhermetic PyObject from within a hermetic context. This is necessary
// because there are some cases where the deallocator function of a
// nonhermetic PyObject is called from within a hermetic context, so it must
// be properly treated as a nonhermetic PyObject.
//
// NB: this lives in header so that we can avoid actually creating the
// std::optional
// @todo alban: I'm not too sure what's going on here, we can probably delete
// it but it's worthwhile making sure
std::optional<PyObject*> check_pyobj(bool ignore_hermetic_tls = false) const {
impl::PyInterpreter* interpreter =
pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter == nullptr) {
return std::nullopt;
}
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else {
return _unchecked_untagged_pyobj();
}
// interpreter.
PyInterpreter* pyobj_interpreter() const {
return pyobj_interpreter_.load(std::memory_order_acquire);
}
PyInterpreter& load_pyobj_interpreter() const;
PyInterpreter& load_pyobj_interpreter() const {
auto interpreter = pyobj_interpreter_.load(std::memory_order_acquire);
TORCH_INTERNAL_ASSERT(
interpreter, "cannot access PyObject for Tensor - no interpreter set");
return *interpreter;
}
bool owns_pyobj();
PyObject* load_pyobj() const {
return pyobj_.load(std::memory_order_acquire);
}
void set_owns_pyobj(bool b);
void store_pyobj(PyObject* obj) {
pyobj_.store(obj, std::memory_order_release);
}
bool has_unique_reference() const {
PyObject* pyobj = load_pyobj();
return pyobj != nullptr && load_pyobj_interpreter()->refcnt(pyobj) == 1;
}
void clear() {
pyobj_.store(nullptr, std::memory_order_relaxed);
pyobj_interpreter_.store(nullptr, std::memory_order_relaxed);
}
private:
// This field contains the interpreter tag for this object. See
// Note [Python interpreter tag] for general context
//
// Note [Memory ordering on Python interpreter tag]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// What memory_order do we need when accessing this atomic? We don't
// need a single total modification order (as provided by
// memory_order_seq_cst) as pyobj_interpreter_ is monotonic: it can only
// transition from -1 to some positive integer and never changes afterwards.
// Because there is only one modification, it trivially already has a total
// modification order (e.g., we don't need fences or locked instructions on
// x86)
//
// In fact, one could make a reasonable argument that relaxed reads are OK,
// due to the presence of external locking (GIL) to ensure that interactions
// with other data structures are still correctly synchronized, so that
// we fall in the "Single-Location Data Structures" case as described in
// http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf
// However, on x86, it doesn't matter if I use acquire or relaxed on the load
// as I get the same assembly in both cases. So I just use the more
// conservative acquire (which will impede compiler optimizations but I don't
// care)
// This is now always the global interpreter if the PyObject is set.
// Maybe we can remove this field some day...
std::atomic<PyInterpreter*> pyobj_interpreter_;
// This field contains a reference to a PyObject representing this Tensor.
// If pyobj is nullptr, when we transfer Tensor to Python, we allocate a new
// PyObject for it and set this field. This field does not have to be
// protected by an atomic as it is only allowed to be accessed when you hold
// the GIL, or during destruction of the tensor.
//
// When a PyObject dies, you are obligated to clear this field
// (otherwise, you will try to use-after-free the pyobj); this currently
// occurs in THPVariable_clear in torch/csrc/autograd/python_variable.cpp
//
// NB: Ordinarily, this should not be a strong reference, as if the
// PyObject owns the Tensor, this would create a reference cycle.
// However, sometimes this ownership flips. To track who owns
// who, this has a single pointer tag indicating whether or not the
// C++ object owns the PyObject (the common case, zero, means PyObject
// owns the C++ object); see _unchecked_untagged_pyobj for raw access
// or check_pyobj for checked access. See references to PyObject
// resurrection in torch/csrc/autograd/python_variable.cpp
PyObject* pyobj_;
// The PyObject representing this Tensor or nullptr. Ownership is managed
// by intrusive_ptr. By the time the PyObjectSlot is destroyed, this
// reference is already dead.
std::atomic<PyObject*> pyobj_;
friend class torch::utils::PyObjectPreservation;
};
} // namespace c10::impl
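
Stripped of the interpreter bookkeeping, the rewritten slot above is essentially an atomic pointer that is published with release and read with acquire. A reduced sketch of that shape (illustrative only; the real class also tracks the PyInterpreter and is driven by torch::utils::PyObjectPreservation):

#include <atomic>
#include <cstdio>

template <typename T>
class AtomicSlot {
 public:
  void store(T* obj) {
    // Release: any initialization of *obj done before store() is visible
    // to a thread that observes the pointer via load().
    ptr_.store(obj, std::memory_order_release);
  }
  T* load() const {
    return ptr_.load(std::memory_order_acquire);
  }
  void clear() {
    ptr_.store(nullptr, std::memory_order_relaxed);
  }

 private:
  std::atomic<T*> ptr_{nullptr};
};

int main() {
  AtomicSlot<int> slot;
  int x = 7;
  slot.store(&x);
  if (int* p = slot.load()) {
    std::printf("%d\n", *p);  // prints 7
  }
  slot.clear();
  return 0;
}

The release/acquire pair plays the same role as the fence in the incref_pyobject sketch earlier: whoever loads a non-null pointer also sees everything done before it was stored.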

View File

@@ -136,7 +136,7 @@ std::string c10_retrieve_device_side_assertion_info() {
// Something failed, let's talk about that
oss << failures_found
<< " CUDA device-side assertion failures were found on GPU #"
<< device_num << '!' << std::endl;
<< device_num << "!" << std::endl;
if (assertion_data_for_device.assertion_count >
C10_CUDA_DSA_ASSERTION_COUNT) {
oss << "But at least " << assertion_data_for_device.assertion_count
@@ -151,17 +151,17 @@ std::string c10_retrieve_device_side_assertion_info() {
oss << "Assertion failure " << i << std::endl;
oss << " GPU assertion failure message = " << self.assertion_msg
<< std::endl;
oss << " File containing assertion = " << self.filename << ':'
oss << " File containing assertion = " << self.filename << ":"
<< self.line_number << std::endl;
oss << " Device function containing assertion = " << self.function_name
<< std::endl;
oss << " Thread ID that failed assertion = [" << self.thread_id[0] << ','
<< self.thread_id[1] << ',' << self.thread_id[2] << ']' << std::endl;
oss << " Block ID that failed assertion = [" << self.block_id[0] << ','
<< self.block_id[1] << ',' << self.block_id[2] << ']' << std::endl;
oss << " Thread ID that failed assertion = [" << self.thread_id[0] << ","
<< self.thread_id[1] << "," << self.thread_id[2] << "]" << std::endl;
oss << " Block ID that failed assertion = [" << self.block_id[0] << ","
<< self.block_id[1] << "," << self.block_id[2] << "]" << std::endl;
if (launch_info.generation_number == self.caller) {
oss << " File containing kernel launch = "
<< launch_info.launch_filename << ':' << launch_info.launch_linenum
<< launch_info.launch_filename << ":" << launch_info.launch_linenum
<< std::endl;
oss << " Function containing kernel launch = "
<< launch_info.launch_function << std::endl;
@@ -175,7 +175,7 @@ std::string c10_retrieve_device_side_assertion_info() {
if (launch_registry.gather_launch_stacktrace) {
oss << "Launch stacktracing disabled." << std::endl;
} else {
oss << '\n' << launch_info.launch_stacktrace << std::endl;
oss << "\n" << launch_info.launch_stacktrace << std::endl;
}
} else {
oss << " CPU launch site info: Unavailable, the circular queue wrapped around. Increase `CUDAKernelLaunchRegistry::max_size`."

View File

@@ -435,7 +435,7 @@ TEST(DispatchKeySet, TestFunctionalityDispatchKeyToString) {
if (i > 0) {
ASSERT_TRUE(res.find("Unknown") == std::string::npos)
<< i << " (before is " << toString(static_cast<DispatchKey>(i - 1))
<< ')';
<< ")";
} else {
ASSERT_TRUE(res.find("Unknown") == std::string::npos) << i;
}

View File

@@ -96,10 +96,10 @@ TEST(HalfConversionTest, TestPorableConversion) {
for (auto x : inputs) {
auto target = c10::detail::fp16_ieee_to_fp32_value(x);
EXPECT_EQ(halfbits2float(x), target)
<< "Test failed for uint16 to float " << x << '\n';
<< "Test failed for uint16 to float " << x << "\n";
EXPECT_EQ(
float2halfbits(target), c10::detail::fp16_ieee_from_fp32_value(target))
<< "Test failed for float to uint16" << target << '\n';
<< "Test failed for float to uint16" << target << "\n";
}
}

View File

@@ -98,7 +98,7 @@ struct Noncopyable {
};
std::ostream& operator<<(std::ostream& out, const Noncopyable& nc) {
out << "Noncopyable(" << nc.x << ')';
out << "Noncopyable(" << nc.x << ")";
return out;
}
} // namespace

View File

@@ -204,13 +204,13 @@ ArrayRef(const std::initializer_list<T>&) -> ArrayRef<T>;
template <typename T>
std::ostream& operator<<(std::ostream& out, ArrayRef<T> list) {
int i = 0;
out << '[';
out << "[";
for (const auto& e : list) {
if (i++ > 0)
out << ", ";
out << e;
}
out << ']';
out << "]";
return out;
}

View File

@@ -106,8 +106,8 @@ class GetBacktraceImpl {
/*length*/ &length,
/*status*/ &status);
os << " frame #" << idx++ << '\t'
<< ((demangled != NULL && status == 0) ? demangled : symbol) << '['
os << " frame #" << idx++ << "\t"
<< ((demangled != NULL && status == 0) ? demangled : symbol) << "["
<< addr << "]\t" << std::endl;
}
free(demangled);
@@ -274,7 +274,7 @@ class GetBacktraceImpl {
} else {
// In the edge-case where we couldn't parse the frame string, we can
// just use it directly (it may have a different format).
stream << symbols[frame_number] << '\n';
stream << symbols[frame_number] << "\n";
}
}
@@ -413,8 +413,8 @@ class GetBacktraceImpl {
<< back_trace_[i_frame] << std::dec;
if (with_symbol) {
stream << std::setfill('0') << std::setw(16) << std::uppercase
<< std::hex << p_symbol->Address << std::dec << ' ' << module
<< '!' << p_symbol->Name;
<< std::hex << p_symbol->Address << std::dec << " " << module
<< "!" << p_symbol->Name;
} else {
stream << " <unknown symbol address> " << module << "!<unknown symbol>";
}
@@ -424,7 +424,7 @@ class GetBacktraceImpl {
} else {
stream << "<unknown file> @ <unknown line number>";
}
stream << ']' << std::endl;
stream << "]" << std::endl;
}
return stream.str();

View File

@@ -1,5 +1,4 @@
#include <c10/util/Exception.h>
#include <c10/util/FileSystem.h>
#include <c10/util/Logging.h>
#include <c10/util/Type.h>
@@ -28,7 +27,7 @@ Error::Error(
const void* caller)
: Error(
str("[enforce fail at ",
c10::filesystem::path(file).filename(),
detail::StripBasename(file),
":",
line,
"] ",
@@ -45,7 +44,7 @@ std::string Error::compute_what(bool include_backtrace) const {
if (context_.size() == 1) {
// Fold error and context in one line
oss << " (" << context_[0] << ')';
oss << " (" << context_[0] << ")";
} else {
for (const auto& c : context_) {
oss << "\n " << c;
@@ -53,7 +52,7 @@ std::string Error::compute_what(bool include_backtrace) const {
}
if (include_backtrace && backtrace_) {
oss << '\n' << backtrace_->get();
oss << "\n" << backtrace_->get();
}
return oss.str();
@@ -248,7 +247,7 @@ void WarningHandler::process(const Warning& warning) {
LOG_AT_FILE_LINE(
WARNING, warning.source_location().file, warning.source_location().line)
<< "Warning: " << warning.msg() << " (function "
<< warning.source_location().function << ')';
<< warning.source_location().function << ")";
}
std::string GetExceptionString(const std::exception& e) {

View File

@@ -1,5 +1,4 @@
#include <c10/util/Backtrace.h>
#include <c10/util/FileSystem.h>
#include <c10/util/Flags.h>
#include <c10/util/Lazy.h>
#include <c10/util/Logging.h>
@@ -474,12 +473,13 @@ MessageLogger::MessageLogger(
if (GLOBAL_RANK != -1) {
stream_ << "[rank" << GLOBAL_RANK << "]:";
}
stream_ << '[' << CAFFE2_SEVERITY_PREFIX[std::min(4, GLOG_FATAL - severity_)]
stream_ << "[" << CAFFE2_SEVERITY_PREFIX[std::min(4, GLOG_FATAL - severity_)]
<< (timeinfo->tm_mon + 1) * 100 + timeinfo->tm_mday
<< std::setfill('0') << ' ' << std::setw(2) << timeinfo->tm_hour
<< ':' << std::setw(2) << timeinfo->tm_min << ':' << std::setw(2)
<< timeinfo->tm_sec << '.' << std::setw(9) << ns << ' '
<< c10::filesystem::path(file).filename() << ':' << line << "] ";
<< std::setfill('0') << " " << std::setw(2) << timeinfo->tm_hour
<< ":" << std::setw(2) << timeinfo->tm_min << ":" << std::setw(2)
<< timeinfo->tm_sec << "." << std::setw(9) << ns << " "
<< c10::detail::StripBasename(std::string(file)) << ":" << line
<< "] ";
}
// Output the contents of the stream to the proper channel on destruction.
@@ -488,7 +488,7 @@ MessageLogger::~MessageLogger() noexcept(false) {
// Nothing needs to be logged.
return;
}
stream_ << '\n';
stream_ << "\n";
#ifdef ANDROID
static const int android_log_levels[] = {
ANDROID_LOG_FATAL, // LOG_FATAL

View File

@@ -1412,13 +1412,13 @@ inline size_t capacity_in_bytes(const SmallVector<T, N>& X) {
template <typename T, unsigned N>
std::ostream& operator<<(std::ostream& out, const SmallVector<T, N>& list) {
int i = 0;
out << '[';
out << "[";
for (auto e : list) {
if (i++ > 0)
out << ", ";
out << e;
}
out << ']';
out << "]";
return out;
}

View File

@@ -79,7 +79,7 @@ std::ostream& _str(std::ostream& ss, const std::wstring& wString) {
} // namespace detail
std::ostream& operator<<(std::ostream& out, const SourceLocation& loc) {
out << loc.function << " at " << loc.file << ':' << loc.line;
out << loc.function << " at " << loc.file << ":" << loc.line;
return out;
}

Some files were not shown because too many files have changed in this diff.