pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Jane Xu	3806e9767b	Refactor out headeronly ArrayRef (#164991 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164991 Approved by: https://github.com/swolchok	2025-10-17 18:32:39 +00:00
Pearu Peterson	ca8bd5dbed	Move toString(ScalarType) and ScalarType ostream operator to headeronly (#164405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164405 Approved by: https://github.com/Skylion007, https://github.com/janeyx99 ghstack dependencies: #164350, #164354	2025-10-16 00:55:43 +00:00
Pearu Peterson	26f3803433	Remove workaround to old CUDA bug (#164354 ) As in the title. A check for https://github.com/pytorch/pytorch/issues/164348 to see if the workaround can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164354 Approved by: https://github.com/janeyx99, https://github.com/ngimel, https://github.com/malfet, https://github.com/jeffdaily ghstack dependencies: #164350	2025-10-16 00:55:43 +00:00
Pearu Peterson	48064acf37	Move AT_FORALL_... macros and ScalarTypeToCPPTypeT to headeronly (#164350 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164350 Approved by: https://github.com/janeyx99	2025-10-16 00:55:42 +00:00
Yuanyuan Chen	f231be25c6	Mark unused parameters in C++ code (#164912 ) This PR adds unused parameter name comments in C++ declarations to improve code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912 Approved by: https://github.com/Skylion007	2025-10-09 06:23:25 +00:00
Mikayla Gawarecki	f37a6523ef	Move version.h to torch/headeronly (#164381 ) Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381 Approved by: https://github.com/janeyx99	2025-10-07 17:47:30 +00:00
Jane Xu	7f3dc45300	Migrate DeviceType to torch/headeronly (#163999 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999 Approved by: https://github.com/mikaylagawarecki	2025-09-30 23:13:27 +00:00
Michael Kelly	e900a274e5	Add `CUDA_KERNEL_ASSERT_PRINTF`, a more flexible `CUDA_KERNEL_ASSERT_MSG` (#160129 ) This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows. We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message. I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side). # Alternatives We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity. * If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first. # Risks/Problems * We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU. * Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). 
However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below). # Benchmarks * I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f * Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431 # Change in generated PTX This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu): ``` buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda # then use the printed .so file like this: ~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so ``` ## with printf This is the version of the code that appears in this diff: https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a ## without printf I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with: ``` CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]); ``` https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d (Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.) Test Plan: Running this minimal test case: ``` import torch def main(): x = torch.ones(10, dtype=torch.int64, device="cuda:0") torch.repeat_interleave(x, x, output_size=0) ``` Now we see the new message (from printf) alongside the assert failure: ``` $ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors [...] [CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! 
In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10). fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed. [...] ``` Rollback Plan: Reviewed By: mradmila Differential Revision: D79310684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129 Approved by: https://github.com/ngimel	2025-09-16 00:23:48 +00:00
DrStone71	bc4db2c27f	CUDA 13 -- sm_120 -- Nvidia 5090 -- ptxas warning : Value of threads … (#161380 ) Bug fix: I have opened an issue (https://github.com/pytorch/pytorch/issues/161376) and I suggest this bug fix. With this change it compiles fine. Fixes #161376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161380 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>	2025-09-02 13:27:57 +00:00
Jane Xu	1690c0c3a0	[Reland] Migrate ScalarType to headeronly (#159911 ) The non ghstack version of #159416, to make sure we don't get reverted again Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911 Approved by: https://github.com/mikaylagawarecki	2025-08-06 07:36:37 +00:00
Jane Xu	3ddfd46bd2	Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-08-06 00:29:56 +00:00
PyTorch MergeBot	7e8197e34d	Revert "Migrate ScalarType to headeronly (#159416 )" This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449. Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))	2025-08-04 19:55:09 +00:00
Jane Xu	1371a98b0e	Migrate ScalarType to headeronly (#159416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416 Approved by: https://github.com/albanD ghstack dependencies: #159415, #159411	2025-08-01 16:07:01 +00:00
Jane Xu	b95cf5c91d	Move complex to headeronly (#159411 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159411 Approved by: https://github.com/albanD ghstack dependencies: #159415	2025-07-31 22:05:43 +00:00
Jane Xu	5e2ef2a465	Move Float8 variations to headeronly (#159415 ) This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!). Why is it not a clean copy paste? - For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs - Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that. Outside of the copy paste, I: - changed the tests to call torch::headeronly instead of c10 - updated header_only_apis.txt - added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415 Approved by: https://github.com/albanD	2025-07-31 22:05:43 +00:00
Jane Xu	c57382a493	Move BFloat16.h to headeronly (#159412 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159412 Approved by: https://github.com/desertfire	2025-07-31 15:29:17 +00:00
Jane Xu	259e79e3ff	Move Half to headeronly (#159172 ) Essence of this copypasta: - combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h - Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy - Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly. - Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-30 16:11:58 +00:00
Jane Xu	b268f22ab2	Move Float4 to headeronly (#159414 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159414 Approved by: https://github.com/desertfire	2025-07-30 15:34:01 +00:00
PyTorch MergeBot	eaadd1282c	Revert "Move Half to headeronly (#159172 )" This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa. Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](`6d0f4566e2`). Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))	2025-07-30 15:10:26 +00:00
Jane Xu	6d0f4566e2	Move Half to headeronly (#159172 ) Essence of this copypasta: - combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h - Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy - Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly. - Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-30 05:02:13 +00:00
Jane Xu	96ac64d00c	Migrate easy q(u)int/bits stuff to torch/headeronly (#159302 ) Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly. It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302 Approved by: https://github.com/malfet, https://github.com/albanD	2025-07-30 03:41:27 +00:00
Jane Xu	222fa451a2	Move some of vec into headeronly in preparation for Half.h (#158976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-29 05:43:53 +00:00
PyTorch MergeBot	751285cb22	Revert "Move some of vec into headeronly in preparation for Half.h (#158976 )" This reverts commit 5564f2ca2e0836d75c4ee45899b1b981582c3e2d. Reverted https://github.com/pytorch/pytorch/pull/158976 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78924504 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158976#issuecomment-3115198443))	2025-07-24 22:31:49 +00:00
Jane Xu	5564f2ca2e	Move some of vec into headeronly in preparation for Half.h (#158976 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158976 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-07-24 20:32:33 +00:00
Conan Truong	78aa3bd6b6	Added Emscripten __assert_fail declaration to Macros.h (#158580 ) Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten. Test Plan: N/A Rollback Plan: Differential Revision: D78500790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580 Approved by: https://github.com/JacobSzwejbka	2025-07-24 17:10:29 +00:00
Xinya Zhang	6100ed457c	[ROCm] Improve Type Safety of C10_WARP_SIZE (#158271 ) # Background `C10_WARP_SIZE` is always `32` on the CUDA platform but varies across AMD GPUs. Therefore, to refer to this value correctly, host code must treat it as a variable rather than as a macro-defined literal or a `constexpr int`. This PR may cause more compiler errors for third-party code on AMD GPUs, which is intentional: a fixed `C10_WARP_SIZE` value in host code for AMD GPUs only defers a compile-time error to runtime. This PR should be mentioned in the Release Notes as an API change for anyone who uses this macro. Users are recommended to use `C10_WARP_SIZE` directly, which adapts to various scenarios, or to define a macro that uses `C10_WARP_SIZE`. Assigning this macro to symbols shared by host and device code causes problems on the ROCm platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example.) # Behaviors * If compiling with HIPCC (i.e. `defined(__HIPCC__)`): + Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for the host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3) + Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for the device-compilation pass - `__GFX8__` is also 64 but we do not support any GFX8 GPU. * If not compiling with HIPCC: + Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()` # `constexpr` variant for host code For host-compilation cases where a `constexpr` value is needed for warp size (e.g. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. 
This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2025-07-22 23:19:38 +00:00
Jane Xu	e882c761dd	Add STD_TORCH_CHECK to headeronly (#158377 ) Differential Revision: [D78366519](https://our.internmc.facebook.com/intern/diff/D78366519/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158377 Approved by: https://github.com/albanD	2025-07-18 14:35:20 +00:00
Jane Xu	09db3a22e8	[BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453 ) BUILD_SPLIT_CUDA logic has been removed for a while Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453 Approved by: https://github.com/albanD ghstack dependencies: #158358, #158365	2025-07-17 06:47:10 +00:00
Jane Xu	2b0f9b1f61	Move c10/macros/Macros.h to headeronly (#158365 ) ^ Differential Revision: [D78361893](https://our.internmc.facebook.com/intern/diff/D78361893/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158365 Approved by: https://github.com/swolchok ghstack dependencies: #158358	2025-07-16 18:46:52 +00:00
Jane Xu	b40f48d191	Move the rest of c10/macros/Export.h (#158358 ) Differential Revision: [D78356975](https://our.internmc.facebook.com/intern/diff/D78356975/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158358 Approved by: https://github.com/swolchok	2025-07-16 18:46:52 +00:00
Jane Xu	30587195d3	Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035 ) Summary: As above, also changes a bunch of the build files to be better Test Plan: internal and external CI did run buck2 build fbcode//caffe2:torch and it succeeded Rollback Plan: Reviewed By: swolchok Differential Revision: D78016591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035 Approved by: https://github.com/swolchok	2025-07-15 19:52:59 +00:00
Jane Xu	317520bf6e	Add an ovrsource target for torch/headeronly (#157912 ) Summary: no idea how this works Test Plan: will things just pass? Rollback Plan: Differential Revision: D77965219 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157912 Approved by: https://github.com/albanD	2025-07-09 19:32:03 +00:00
Scott Wolchok	fee2377f9e	Reapply D77381084 / #156964 : Rename torch::standalone to headeronly (#157251 ) Was reverted due to internal failure which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week. Original summary: headeronly is more clear, let's change the name before anyone depends on standalone Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251 Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire	2025-06-30 23:25:30 +00:00
PyTorch MergeBot	e290a4c645	Revert "Rename torch::standalone to headeronly (#156964 )" This reverts commit 7e54c02a35b905e758497b856a1953eb009ba836. Reverted https://github.com/pytorch/pytorch/pull/156964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156964#issuecomment-3011136947))	2025-06-27 02:20:33 +00:00
Jane Xu	7e54c02a35	Rename torch::standalone to headeronly (#156964 ) Summary: headeronly is more clear, let's change the name before anyone depends on standalone Test Plan: CI should pass! Rollback Plan: Differential Revision: D77381084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156964 Approved by: https://github.com/swolchok, https://github.com/albanD, https://github.com/desertfire	2025-06-27 01:00:14 +00:00

35 Commits
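
The message for commit e900a274e5 in the log above describes an assertion helper that bundles a printf call with the assertion and reports file, line, function, and failing condition along with a user message. The block below is only a host-side C++ sketch of that idea, not the macro added in that commit: the names `SKETCH_ASSERT_PRINTF`, `sketch_format_failure`, and `checked_sum` are hypothetical, and the message layout only loosely imitates the output quoted in the commit message. As a further assumption, the sketch passes the file name as a `%s` argument instead of splicing it into the format string, which sidesteps the format-string risk the commit message raises; the log does not say whether the real macro does the same.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper (not a PyTorch API): builds the failure message.
// file, func and cond are passed as %s arguments, so a '%' inside a path
// is printed literally and never interpreted as a format directive.
inline std::string sketch_format_failure(const char* file, int line,
                                         const char* func, const char* cond,
                                         const char* fmt, ...) {
  char user[256];
  va_list args;
  va_start(args, fmt);
  std::vsnprintf(user, sizeof(user), fmt, args);  // format the user message
  va_end(args);

  char out[512];
  std::snprintf(out, sizeof(out),
                "[SKETCH_ASSERT] %s:%d: %s: Assertion failed: `%s`: %s",
                file, line, func, cond, user);
  return std::string(out);
}

// Hypothetical macro: on failure, print the formatted message to stderr
// and abort. Requires at least a format string after the condition.
#define SKETCH_ASSERT_PRINTF(cond, ...)                                   \
  do {                                                                    \
    if (!(cond)) {                                                        \
      std::fputs(sketch_format_failure(__FILE__, __LINE__, __func__,      \
                                       #cond, __VA_ARGS__).c_str(),       \
                 stderr);                                                 \
      std::fputc('\n', stderr);                                           \
      std::abort();                                                       \
    }                                                                     \
  } while (0)

// Illustrative caller: checks that output_size equals the sum of repeats,
// loosely mirroring the repeat_interleave check quoted in the commit message.
inline int checked_sum(const int* repeats, int n, int output_size) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += repeats[i];
  }
  SKETCH_ASSERT_PRINTF(output_size == sum,
                       "output_size (%d) must equal the sum of repeats (%d)",
                       output_size, sum);
  return sum;
}
```

Calling `checked_sum` with a mismatched `output_size` would print one line containing both values and then abort, which is the debuggability benefit the commit message argues for.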