Summary:
What: Enables CUDA support for int8_mm woq optimization pattern by:
- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA
Why: Extend WOQ to more device types
Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:
Reviewed By: jerryzh168
Differential Revision: D80882442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161680
Approved by: https://github.com/jerryzh168
Note that this works only in a limited case, where you *don't* change the seed, but change only the offset of the philox generator. This captures the main use case we're interested in: Rewinding the RNG to a previous state. This is done by torch.utils.checkpoint.checkpoint in particular.
Calls to increase() change only the offset, not the seed. Thus, we allow for "no-op" calls to set_seed where the new seed is the same as the old seed. If a user does happen to try to change the seed during stream capture, they will receive an error.
Fixes#162504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162505
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/eellison, https://github.com/eee4017, https://github.com/cyyever
Reported by compute-sanitizer, otherwise it looks like `block_y_reduce` and `block_x_reduce` both use `shared_memory` for temporaries without synchronization between them
reproduces in e.g.,
`compute-sanitizer --tool=racecheck python test/test_matmul_cuda.py -k test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_128_rhs_block_1_cuda` (note that this test requires H100 to run unless only the non-emulated (cuBLAS impl.) is commented out)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162995
Approved by: https://github.com/msaroufim
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.
We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.
I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).
# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.
# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).
# Benchmarks
* I ran the following benchmarks a several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here -- I couldn't find a significant difference before or after https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431
# Change in generated PTX
This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```
## with printf
This is the version of the code that appears in this diff:
https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a
## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d
(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)
Test Plan:
Running this minimal test case:
```
import torch
def main():
x = torch.ones(10, dtype=torch.int64, device="cuda:0")
torch.repeat_interleave(x, x, output_size=0)
```
Now we see the new message (from printf) alongside the assert failure:
```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).
fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...[
```
Rollback Plan:
Reviewed By: mradmila
Differential Revision: D79310684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.
Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.]])
```
The new implementation guards against overflow and produces correct results for all valid input values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
## Summary
- pytorch is not built for *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)
## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a
## Test plan
Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9
export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```
Check build logs:
```
CMake Warning at CMakeLists.txt:901 (message):
Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```
Run unit tests:
- `pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg