Summary: As above; this also cleans up a number of the build files.
Test Plan:
internal and external CI
Ran `buck2 build fbcode//caffe2:torch` and it succeeded.
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of
**`CUDA_KERNEL_ASSERT2`**
A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.
Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible; a simpler design that holds only a single message may well be all that is necessary.
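As a rough picture of what such a record might look like in managed memory, here is a minimal sketch; the struct names, field names, and sizes are hypothetical and are not the layout used by this diff:
```
// Hypothetical sketch of one assertion record and of the multi-message
// buffer living in UVM. Fixed-size char arrays are used because device code
// cannot allocate host-visible memory at failure time.
#include <cstdint>

struct DeviceAssertionRecord {
  char message[512];           // stringified failing condition
  char file[512];              // __FILE__ at the assertion site
  uint32_t line;               // __LINE__ at the assertion site
  uint32_t caller_generation;  // generation number of the launch (described below)
  uint32_t block_id[3];        // blockIdx of the failing thread
  uint32_t thread_id[3];       // threadIdx of the failing thread
};

struct DeviceAssertionsBuffer {
  uint32_t failure_count;             // atomically incremented by failing threads
  DeviceAssertionRecord records[16];  // holds the first few failures
};
```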
**`TORCH_DSA_KERNEL_ARGS`**
This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.
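For illustration, an instrumented kernel might look like the sketch below; the kernel itself, the include path, and the exact macro expansion are assumptions rather than code from this diff:
```
// Sketch only: the kernel declares the standardized assertion parameters via
// TORCH_DSA_KERNEL_ARGS instead of spelling them out, and CUDA_KERNEL_ASSERT2
// refers to those names. The launch-side macro (below) supplies the values.
#include <c10/cuda/CUDADeviceAssertion.h>  // assumed header for the macros

__global__ void fill_kernel(float* out, int64_t n, float value,
                            TORCH_DSA_KERNEL_ARGS) {
  const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    // On failure, the message, file, and line are copied into the UVM buffer
    // before the ordinary device-side assert fires.
    CUDA_KERNEL_ASSERT2(!isnan(value));
    out[i] = value;
  }
}
```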
**`c10::cuda::get_global_cuda_kernel_launch_registry()`**
This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory, the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).
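A minimal sketch of that bookkeeping, with hypothetical names, might look like this:
```
// Hypothetical sketch of the host-side bookkeeping: a fixed-size circular
// buffer of launch records plus a monotonically increasing generation number
// that lets a device-side failure be matched to the launch that produced it
// even after the buffer has wrapped.
#include <array>
#include <cstdint>
#include <mutex>
#include <string>

struct LaunchRecord {
  uint64_t generation = 0;  // which launch this slot currently describes
  std::string traceback;    // host stack trace captured at launch time
};

class KernelLaunchRegistry {
 public:
  // Record a launch and return its generation number; the same number is
  // passed down to the kernel so a failing assertion can report it back.
  uint64_t insert(std::string traceback) {
    std::lock_guard<std::mutex> guard(mutex_);
    const uint64_t gen = next_generation_++;
    auto& slot = records_[gen % records_.size()];
    slot.generation = gen;
    slot.traceback = std::move(traceback);
    return gen;
  }

  // Look up a launch by generation; returns nullptr if the circular buffer
  // has already wrapped past it.
  const LaunchRecord* find(uint64_t generation) const {
    std::lock_guard<std::mutex> guard(mutex_);
    const auto& slot = records_[generation % records_.size()];
    return slot.generation == generation ? &slot : nullptr;
  }

 private:
  mutable std::mutex mutex_;
  uint64_t next_generation_ = 1;  // 0 marks an empty slot
  std::array<LaunchRecord, 1024> records_{};  // circular buffer of recent launches
};
```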
**`TORCH_DSA_KERNEL_LAUNCH`**
This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks that the kernel launch succeeded. This abstraction on kernel launches can be modified to provide additional safety/logging.
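One plausible shape of the wrapper, assuming the kernel is passed as the first macro argument and that the registry exposes hypothetical `insert()` and `assertion_buffer()` helpers:
```
// Hypothetical sketch of what the launch wrapper does; this is not the
// macro's actual body. It records the launch in the registry, appends the
// UVM buffer pointer and generation number to the kernel's argument list,
// and then checks that the launch itself succeeded.
#define TORCH_DSA_KERNEL_LAUNCH(kernel, blocks, threads, shmem, stream, ...)  \
  do {                                                                        \
    auto& registry = c10::cuda::get_global_cuda_kernel_launch_registry();     \
    const uint64_t generation =                                               \
        registry.insert(/* file, function, line, traceback, ... */);          \
    (kernel)<<<(blocks), (threads), (shmem), (stream)>>>(                     \
        __VA_ARGS__, registry.assertion_buffer(), generation);                \
    C10_CUDA_KERNEL_LAUNCH_CHECK();                                           \
  } while (0)
```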
**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, whether any kernel assertions have occurred. If one has, it raises an exception (see the usage sketch after this list) that reports:
1. Information (file, line number) of what kernel was launched.
2. Information (file, line number, message) about the device-side assertion.
3. Information (file, line number) about where the failure was detected.
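A rough usage sketch follows; in practice this check is wired into `C10_CUDA_CHECK` (see below), so calling it directly should rarely be necessary, and the header path and surrounding handling here are assumptions:
```
#include <cuda_runtime.h>
#include <c10/cuda/CUDADeviceAssertionHost.h>  // assumed header location

void synchronize_and_report() {
  // A device-side assert typically surfaces as a failing CUDA API call.
  const cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    // If any kernel wrote an assertion record into the UVM buffer, this
    // raises an exception combining the launch site, the device-side
    // assertion (file, line, message), and the detection site.
    c10::cuda::c10_retrieve_device_side_assertion_info();
  }
}
```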
**Checking for device-side assertions**
Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered
Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
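Roughly, and with a hypothetical parameter list for the implementation function, the macro now just forwards the call-site context:
```
// Sketch only; the real macro and the exact signature of
// c10_cuda_check_implementation may differ. The macro captures the
// preprocessor context, and the shared logic (error formatting plus the
// call to c10_retrieve_device_side_assertion_info) lives in an ordinary
// function that is compiled once instead of expanded at every call site.
#define C10_CUDA_CHECK(EXPR)                                        \
  do {                                                              \
    const cudaError_t __err = (EXPR);                               \
    c10::cuda::c10_cuda_check_implementation(                       \
        static_cast<int32_t>(__err), __FILE__, __func__, __LINE__,  \
        /*include_device_assertions=*/true);                        \
  } while (0)
```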
# Notes on special cases
* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two processes are using the same GPU and one of them fails with a device-side assertion, the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should all be shown upon exit, but we've been unable to generate a test that produces this condition
Differential Revision: D37621532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76173
We need this facility temporarily to sequence some changes without
breakage. This is generally not a good idea since the main purpose of
this effort is to replicate builds in OSS Bazel.
ghstack-source-id: 155215491
Test Plan: Manual test and rely on CI.
Reviewed By: dreiss
Differential Revision: D35815290
fbshipit-source-id: 89bacda373e7ba03d6a3fcbcaa5af42ae5eac154
(cherry picked from commit 1b808bbc94c939da1fd410d81b22d43bdfe1cda0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74470
Use it internally as well.
ghstack-source-id: 152438657
Test Plan: Rely on CI to validate.
Reviewed By: malfet
Differential Revision: D35011144
fbshipit-source-id: fb7247470df579ae23fcbc74bd2f8d6cc55cf657
(cherry picked from commit d9b476e2507807097a59c0b0a5ddf029d8dc0ab3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74465
This requires adding py_library and its PyPI dependency provider
"requirement".
ghstack-source-id: 152438643
Test Plan: Rely on CI to validate.
Reviewed By: malfet
Differential Revision: D35009795
fbshipit-source-id: 424c4968474b3c2fb37d2c7dba932b37605a63f7
(cherry picked from commit 91e442c3bf0e204b0fb6c98405aaaa7308011511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71907
This allows us to refactor the c10 tests without anything downstream
needing to be concerned about it.
ghstack-source-id: 150235098
Test Plan: This ought to be a no-op, rely on CI to validate.
Reviewed By: malfet
Differential Revision: D33815403
fbshipit-source-id: d358d6e8b1b45b62cef73bdbfd9c7709a7075c42
(cherry picked from commit a554dbe55a28516c8db2287552194860be87f2f0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71411
This library is now mostly the same externally and internally. The one difference is that internally at Meta we never include cuda in this library, so the select unconditionally resolves to false there.
ghstack-source-id: 150235103
Test Plan: This ought to be a no-op, rely on CI.
Reviewed By: malfet
Differential Revision: D33635739
fbshipit-source-id: a4d3c7e30995c0e43ecd4c69ad0abb23498ee098
(cherry picked from commit c574a123615588adbe42cc51a713fccfa1b2cac0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70928
ghstack-source-id: 148159366
Test Plan: Ensured that the same number of tests are found and run.
Reviewed By: malfet
Differential Revision: D33455272
fbshipit-source-id: fba1e3409b14794be3e6fe4445c56dd5361cfe9d
(cherry picked from commit b45fce500aa9c3f69915bf0857144ba6d268e649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70863
ghstack-source-id: 148159368
Test Plan: Ought to be a no-op: rely on CI to validate.
Reviewed By: malfet
Differential Revision: D33367290
fbshipit-source-id: cb550538b9eafaa0117f94077ebd4cb920688881
(cherry picked from commit 077d9578bcbf5e41e806c6acb7a8f7c622f66fe9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70854
We can't do the entire package since parts of it depend on //c10/core.
ghstack-source-id: 147170901
Test Plan: Rely on CI.
Reviewed By: malfet
Differential Revision: D33321821
fbshipit-source-id: 6d634da872a382a60548e2eea37a0f9f93c6f080
(cherry picked from commit 0afa808367ff92b6011b61dcbb398a2a32e5e90d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70852
This is the first change that uses a common build file, build.bzl, to
hold most of the build logic.
ghstack-source-id: 147170895
Test Plan: Relying on internal and external CI.
Reviewed By: malfet
Differential Revision: D33299331
fbshipit-source-id: a66afffba6deec76b758dfb39bdf61d747b5bd99
(cherry picked from commit d9163c56f55cfc97c20f5a6d505474d5b8839201)