35 Commits

Author SHA1 Message Date
3806e9767b Refactor out headeronly ArrayRef (#164991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164991
Approved by: https://github.com/swolchok
2025-10-17 18:32:39 +00:00
ca8bd5dbed Move toString(ScalarType) and ScalarType ostream operator to headeronly (#164405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164405
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #164350, #164354
2025-10-16 00:55:43 +00:00
26f3803433 Remove workaround to old CUDA bug (#164354)
As in the title.

A check for https://github.com/pytorch/pytorch/issues/164348 to see if the workaround can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164354
Approved by: https://github.com/janeyx99, https://github.com/ngimel, https://github.com/malfet, https://github.com/jeffdaily
ghstack dependencies: #164350
2025-10-16 00:55:43 +00:00
48064acf37 Move AT_FORALL_... macros and ScalarTypeToCPPTypeT to headeronly (#164350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164350
Approved by: https://github.com/janeyx99
2025-10-16 00:55:42 +00:00
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR adds unused parameter name comments in C++ declarations to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
f37a6523ef Move version.h to torch/headeronly (#164381)
Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381
Approved by: https://github.com/janeyx99
2025-10-07 17:47:30 +00:00
7f3dc45300 Migrate DeviceType to torch/headeronly (#163999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999
Approved by: https://github.com/mikaylagawarecki
2025-09-30 23:13:27 +00:00
e900a274e5 Add CUDA_KERNEL_ASSERT_PRINTF, a more flexible CUDA_KERNEL_ASSERT_MSG (#160129)
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).

# Benchmarks

* I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after  https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf
This is the version of the code that appears in this diff:

https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:
Running this minimal test case:

```
import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).

fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...[
```

Rollback Plan:

Reviewed By: mradmila

Differential Revision: D79310684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
2025-09-16 00:23:48 +00:00
bc4db2c27f CUDA 13 -- sm_120 -- Nvidia 5090 -- ptxas warning : Value of threads … (#161380)
bug fix:

i have opened a issue ( https://github.com/pytorch/pytorch/issues/161376 ) and i suggest this bug fix.

In this metod compile fine.

Fixes #161376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161380
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-09-02 13:27:57 +00:00
1690c0c3a0 [Reland] Migrate ScalarType to headeronly (#159911)
The non ghstack version of #159416, to make sure we don't get reverted again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911
Approved by: https://github.com/mikaylagawarecki
2025-08-06 07:36:37 +00:00
3ddfd46bd2 Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-08-06 00:29:56 +00:00
7e8197e34d Revert "Migrate ScalarType to headeronly (#159416)"
This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449.

Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))
2025-08-04 19:55:09 +00:00
1371a98b0e Migrate ScalarType to headeronly (#159416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416
Approved by: https://github.com/albanD
ghstack dependencies: #159415, #159411
2025-08-01 16:07:01 +00:00
b95cf5c91d Move complex to headeronly (#159411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159411
Approved by: https://github.com/albanD
ghstack dependencies: #159415
2025-07-31 22:05:43 +00:00
5e2ef2a465 Move Float8 variations to headeronly (#159415)
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).

Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.

Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
2025-07-31 22:05:43 +00:00
c57382a493 Move BFloat16.h to headeronly (#159412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412
Approved by: https://github.com/desertfire
2025-07-31 15:29:17 +00:00
259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
b268f22ab2 Move Float4 to headeronly (#159414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414
Approved by: https://github.com/desertfire
2025-07-30 15:34:01 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
96ac64d00c Migrate easy q(u)int/bits stuff to torch/headeronly (#159302)
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.

It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
2025-07-30 03:41:27 +00:00
222fa451a2 Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-29 05:43:53 +00:00
751285cb22 Revert "Move some of vec into headeronly in preparation for Half.h (#158976)"
This reverts commit 5564f2ca2e0836d75c4ee45899b1b981582c3e2d.

Reverted https://github.com/pytorch/pytorch/pull/158976 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78924504 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158976#issuecomment-3115198443))
2025-07-24 22:31:49 +00:00
5564f2ca2e Move some of vec into headeronly in preparation for Half.h (#158976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-24 20:32:33 +00:00
78aa3bd6b6 Added Emscripten __assert_fail declaration to Macros.h (#158580)
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D78500790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
2025-07-24 17:10:29 +00:00
6100ed457c [ROCm] Improve Type Safety of C10_WARP_SIZE (#158271)
# Background

The `C10_WARP_SIZE`, although always be `32` on CUDA platform, varies across different AMD GPUs.
Therefore, to correctly refer this value, the host code must be a variable instead of a literal defined by macro, or a `constexpr int`.

This PR may cause more compiler errors for third party code on AMD GPU, which is intentional. Having a fixed `C10_WARP_SIZE` value on host code for AMD GPU only defers compile time error to runtime.

This PR is recommended to be included as part of Release Notes to describe an API change for whoever uses this macro.

Users are recommended to use `C10_WARP_SIZE` directly, which adapts for various scenarios, or define a macro to use `C10_WARP_SIZE`. Assignment of this macro to symbols shared by host/device code causes problems on ROCM platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example)

# Behaviors

* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
  + Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
  + Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
    - `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
  + Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`

# `constexpr` variant for host code

For host-compilation cases where a `constexpr` value is needed for warp size (eg. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-07-22 23:19:38 +00:00
e882c761dd Add STD_TORCH_CHECK to headeronly (#158377)
Differential Revision: [D78366519](https://our.internmc.facebook.com/intern/diff/D78366519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158377
Approved by: https://github.com/albanD
2025-07-18 14:35:20 +00:00
09db3a22e8 [BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453)
BUILD_SPLIT_CUDA logic has been removed for a while

Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453
Approved by: https://github.com/albanD
ghstack dependencies: #158358, #158365
2025-07-17 06:47:10 +00:00
2b0f9b1f61 Move c10/macros/Macros.h to headeronly (#158365)
^

Differential Revision: [D78361893](https://our.internmc.facebook.com/intern/diff/D78361893/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158365
Approved by: https://github.com/swolchok
ghstack dependencies: #158358
2025-07-16 18:46:52 +00:00
b40f48d191 Move the rest of c10/macros/Export.h (#158358)
Differential Revision: [D78356975](https://our.internmc.facebook.com/intern/diff/D78356975/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158358
Approved by: https://github.com/swolchok
2025-07-16 18:46:52 +00:00
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above, also changes a bunch of the build files to be better

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
317520bf6e Add an ovrsource target for torch/headeronly (#157912)
Summary: no idea how this works

Test Plan:
will things just pass?

Rollback Plan:

Differential Revision: D77965219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157912
Approved by: https://github.com/albanD
2025-07-09 19:32:03 +00:00
fee2377f9e Reapply D77381084 / #156964: Rename torch::standalone to headeronly (#157251)
Was reverted due to internal failure which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week.

Original summary:

headeronly is more clear, let's change the name before anyone depends on standalone

Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire
2025-06-30 23:25:30 +00:00
e290a4c645 Revert "Rename torch::standalone to headeronly (#156964)"
This reverts commit 7e54c02a35b905e758497b856a1953eb009ba836.

Reverted https://github.com/pytorch/pytorch/pull/156964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156964#issuecomment-3011136947))
2025-06-27 02:20:33 +00:00
7e54c02a35 Rename torch::standalone to headeronly (#156964)
Summary: headeronly is more clear, let's change the name before anyone depends on standalone

Test Plan:
CI should pass!

Rollback Plan:

Differential Revision: D77381084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156964
Approved by: https://github.com/swolchok, https://github.com/albanD, https://github.com/desertfire
2025-06-27 01:00:14 +00:00