Commit Graph

174 Commits

Author SHA1 Message Date
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
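
For reference, a minimal sketch of the patterns each new rule flags (illustrative code, not taken from the PR):
```python
from enum import Enum

class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796: enum contains duplicate value `1`

for i in range(0, 10):  # PIE808: the `0` start argument is unnecessary
    pass

for i in range(10):  # equivalent form that ruff's autofix produces
    pass
```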

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
3154482072 [CUDA][cuBLAS] Only xFail addmm with reduced precision reductions on non-RTX SKUs (#165379)
RTX Blackwells don't behave quite like their datacenter counterparts here
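
A hedged sketch of how such a SKU-conditional xfail can be expressed (the helper name and the device-name heuristic are assumptions for illustration, not the PR's actual code):
```python
import unittest
import torch

def is_rtx_sku() -> bool:
    # Assumed heuristic: consumer RTX Blackwell boards report "GeForce RTX"
    # in the device name, unlike datacenter B100/B200 parts.
    return torch.cuda.is_available() and "GeForce RTX" in torch.cuda.get_device_name()

# Apply expectedFailure only on non-RTX (datacenter) SKUs; RTX parts run normally.
xfail_on_datacenter = (lambda fn: fn) if is_rtx_sku() else unittest.expectedFailure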

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165379
Approved by: https://github.com/Skylion007
2025-10-17 02:45:07 +00:00
b42fe389b9 ROCm unit tests enablement (#165366)
Enables:
test_cuda.py::TestCuda::test_streaming_backwards_multiple_streams
test_cuda.py::TestCuda::test_graph_make_graphed_callables_with_amp_cache_disabled_allow_unused_input
test_cuda.py::TestCuda::test_graph_make_graphed_callables_without_amp_allow_unused_input
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165366
Approved by: https://github.com/jeffdaily
2025-10-15 22:35:03 +00:00
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
f7082e92b3 [cuBLAS] update cuBLAS determinism docs, remove workspace requirement checks (#161749)
Since CUDA 11.x (the docs need updating for this; the current PR says 12.2, which is incorrect) we've been allocating cuBLAS workspaces explicitly per handle/stream combination: https://github.com/pytorch/pytorch/pull/85447

According to the cuBLAS documentation, this appears to be sufficient for determinism without any explicit workspace requirements to e.g., `:4096:8` or `:16:8` as was previously expressed in PyTorch docs https://docs.nvidia.com/cuda/cublas/#results-reproducibility

Planning to add an explicit determinism test as well...
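
For context, a minimal sketch of the workspace pinning the old docs required (and which this PR argues is no longer necessary given per handle/stream workspace allocation):
```python
import os
import torch

# Old guidance: pin the cuBLAS workspace before CUDA initialization to get
# deterministic cuBLAS results, alongside PyTorch's deterministic mode.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")  # or ":16:8"
torch.use_deterministic_algorithms(True)
```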

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161749
Approved by: https://github.com/ngimel
2025-10-03 00:09:47 +00:00
70d1043bdf Fix non-TMA loads in grouped MM Triton kernel (#163895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163895
Approved by: https://github.com/lezcano
2025-10-01 12:21:13 +00:00
7a9119948e Split scaled-mm tests into separate file (#164266)
Summary:

* Split scaled-mm-specific tests into `test/test_scaled_matmul_cuda.py`

Test Plan:

```
pytest test/test_matmul_cuda.py
pytest test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164266
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-01 02:23:21 +00:00
1cce6efdb8 Fix silent incorrectness for bmm/baddbmm out_dtype overload (#164095)
Add input checks, mirroring the meta functions for the standard ops in `ATen/native/LinearAlgebra.cpp`, to the `out_dtype` variants. Fixes the silent incorrectness in https://github.com/pytorch/pytorch/issues/163816
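
The actual checks live in C++, but a Python sketch of the kind of validation the meta functions perform for `baddbmm`-style shapes might look like this (names are illustrative):
```python
import torch

def check_baddbmm_inputs(batch1: torch.Tensor, batch2: torch.Tensor) -> None:
    # Mirror the meta-function checks: both operands must be 3D with matching
    # batch and inner dimensions, instead of silently producing wrong results.
    torch._check(batch1.dim() == 3, lambda: "batch1 must be a 3D tensor")
    torch._check(batch2.dim() == 3, lambda: "batch2 must be a 3D tensor")
    torch._check(
        batch1.size(0) == batch2.size(0) and batch1.size(2) == batch2.size(1),
        lambda: f"expected compatible shapes, got {batch1.shape} and {batch2.shape}",
    )
```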

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164095
Approved by: https://github.com/ngimel
2025-09-30 20:13:13 +00:00
b7419b920d [ROCm][CI] Upgrade ROCm to 7.0 (#163140)
Upgrade all the ROCm docker image to ROCm 7.0 release version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163140
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 02:23:26 +00:00
e2c894c97d [Inductor][ATen][FP8] Relax stride check for block-wise scaling when scaling dimension is 1 (#163829)
Summary: Relax the stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, which replaces it with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. We therefore relax the check when the effective stride is 1, to allow cases in which `K <= 128` and `N <= 128`.
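
The relaxed rule reduces to something like this Python sketch (an illustration of the idea under the assumption described above, not the C++ code):
```python
import torch

def stride_matches(scale: torch.Tensor, expected: tuple[int, int]) -> bool:
    # A stride is only meaningful for dimensions of size > 1, so size-1
    # dimensions (where PyTorch defaults the stride to 1) always match.
    return all(
        size == 1 or actual == want
        for size, actual, want in zip(scale.shape, scale.stride(), expected)
    )

# A (1, 1) scale tensor gets default strides (1, 1) regardless of layout,
# so it passes under the relaxed rule even when K <= 128 and N <= 128:
assert stride_matches(torch.rand(1, 1), (128, 1))
```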

Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda   2>&1 | tee ~/personal/stride_check.log
```

Differential Revision: D83023706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
2025-09-29 17:28:26 +00:00
b3cf5c79dd Skip on sm100 for now since tests are non-deterministic (#163552)
This is tracked in https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think this is overkill, but feel free to disagree. Also, the one test I ran that failed earlier up in the stack failed with a 1-ULP difference, so I think this is kind of an edge case in how we do testing (will write up an issue with my thoughts later).
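
For reference, a rough sketch of a bfloat16 ULP-distance helper in the spirit of `assert_close_with_ulp` (that helper lives in the author's `transformer_nuggets` repo; this is a reconstruction of the idea, not its actual code):
```python
import torch

def ulp_distance_bf16(expected: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Reinterpret bfloat16 bits as int16 and diff: for same-sign finite values
    # this counts representable steps between them; sign flips produce the
    # huge +/-32k distances visible in the failure table below.
    e = expected.view(torch.int16).to(torch.int32)
    a = actual.view(torch.int16).to(torch.int32)
    return a - e
```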

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Okay, the bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
02da4753f5 Triton template IMA reads on B200 (#163460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163460
Approved by: https://github.com/eqy, https://github.com/alexsamardzic
2025-09-22 20:34:39 +00:00
e37b600007 [CUDA][cuBLAS][FP8] Forward-fix #162022 (#163354)
@ngimel is right, `ciflow/h100` doesn't actually appear to test the PR :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163354
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-09-21 00:55:12 +00:00
f8f230a801 [FP8][cuBLAS][H100] only test fp32 outputs for rowwise _scaled_mm on H100 (#162022)
Only cuBLAS supports float32 output, and cuBLAS only supports rowwise scaling on SM 9.0.

Intended to land after #161305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162022
Approved by: https://github.com/ngimel
2025-09-19 15:18:13 +00:00
264e7f68a0 [ROCm] Fix mx fp8 and fp4 code after scaling refactor changes. (#163127)
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function changes in Blas.cpp, along with the test_matmul_cuda changes, triggered failures.
This patch corrects the `is_blockwise_1x32_scaling` function.

2. Fixes the m, n, k dimensions for the ROCm mx case.

3. Modifies FP4E2M1FN_LARGEST_POW2 (the largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This results in a higher SQNR value for the mx fp4 test.
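
SQNR here is the standard signal-to-quantization-noise ratio; a minimal sketch:
```python
import torch

def sqnr(ref: torch.Tensor, approx: torch.Tensor) -> torch.Tensor:
    # 10 * log10(signal power / quantization-noise power), in dB;
    # larger is better, so the FP4E2M1FN_LARGEST_POW2 fix improves this metric.
    ref, approx = ref.float(), approx.float()
    noise = ref - approx
    return 10 * torch.log10(ref.pow(2).mean() / noise.pow(2).mean())
```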

Testing result on gfx950 w/ ROCm 7.0:

```
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 452 tests in 22.698s
OK (111 passed)
```

This is the same as before (when PR 151360 was merged).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-19 12:29:52 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix the ROCm GEMM helper to set the output type (C/D) based on the `C_Dtype` template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables the bf16 group gemm alternative path as described in #161366.
The fast path will be enabled in the future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
5eb35d2ab8 [CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387
Approved by: https://github.com/Skylion007
2025-09-09 17:04:06 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as-is; this PR just removes the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm 7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
b6d0a9ea90 MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209)
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
    - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
    - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
    - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
    - Bump FBGEMM third party submodule to include:
          - https://github.com/pytorch/FBGEMM/pull/4816
          - https://github.com/pytorch/FBGEMM/pull/4820
          - https://github.com/pytorch/FBGEMM/pull/4821
          - https://github.com/pytorch/FBGEMM/pull/4823

#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`

## Test plan

#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)

#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...

test/test_matmul_cuda.py .........                                                                                                                        [100%]

============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
2025-09-06 15:25:30 +00:00
c2a3024617 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905, I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated, and this would cause `float8` tests to fail. Also, it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so that check is removed here as well...
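
A hedged sketch of the call shape this unblocks (illustrative sizes; the layout conventions, row-major lhs, column-major rhs, and float32 rowwise scales, are assumptions based on `_scaled_mm`'s documented requirements):
```python
import torch

M, K, N = 256, 512, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)    # row-major e4m3
b = torch.randn(N, K, device="cuda").to(torch.float8_e5m2).t()  # column-major e5m2
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)  # one scale per row of a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)  # one scale per column of b
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```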

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-05 16:55:09 +00:00
73eb4511fb [B200][NVFP4] Fix argument passing in test_blockwise_mxfp8_nvfp4_mxfp4_numerics_ (#162185)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-05 01:24:59 +00:00
c7e41071a0 [B200][MXFP8] Fix regex in test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda (#162180)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia
2025-09-04 23:29:10 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as-is; this PR just removes the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm 7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
9eadb37cdd enable float32 and float16 in torch._grouped_mm fallback (#162059)
Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.
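
A hedged sketch of a 2d-3d call in one of the newly enabled dtypes (the `offs` convention, cumulative row boundaries per group, is an assumption about this private API):
```python
import torch

K, N, total_M = 64, 32, 96
offs = torch.tensor([16, 40, 96], device="cuda", dtype=torch.int32)  # group boundaries in mat_a
mat_a = torch.randn(total_M, K, device="cuda", dtype=torch.float16)
mat_b = torch.randn(offs.numel(), K, N, device="cuda", dtype=torch.float16)
# float16 inputs now take the fallback; all-bfloat16 keeps the fast path.
out = torch._grouped_mm(mat_a, mat_b, offs=offs, out_dtype=torch.float16)
```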

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
2025-09-04 17:48:52 +00:00
61fb632cfb move _grouped_mm fallback to composite explicit autograd (#161717)
Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
2025-09-04 17:48:52 +00:00
8a736fa1ea create torch._grouped_mm fallback path with for loops / bmm (#161407)
Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).
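
A minimal sketch of the naive approach for the 2d-3d case (a reconstruction of the idea, not the PR's actual code):
```python
import torch

def grouped_mm_fallback_2d_3d(mat_a, mat_b, offs):
    # Slice mat_a by cumulative group offsets and run one plain matmul per group.
    outs, start = [], 0
    for g in range(mat_b.shape[0]):
        end = int(offs[g])
        outs.append(mat_a[start:end] @ mat_b[g])
        start = end
    return torch.cat(outs, dim=0)
```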

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-09-04 17:48:44 +00:00
a8d6943d36 ROCm: Enable overload tests from test_matmul_cuda (#161540)
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812.
rocBLAS backend tests are not enabled yet (WIP).

Test commands:
```
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
2025-09-02 16:27:42 +00:00
21fae99c18 Revert "[cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)"
This reverts commit 55c289d5c104c4959cc125c0fb4fb50c9fc71102.

Reverted https://github.com/pytorch/pytorch/pull/161305 on behalf of https://github.com/atalman due to Broke test_matmul_cuda.py::TestFP8MatmulCUDA::test_float8_error_messages_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17309011599/job/49140215634) [HUD commit link](1190b7f73e) ([comment](https://github.com/pytorch/pytorch/pull/161305#issuecomment-3242652672))
2025-09-01 14:56:47 +00:00
55c289d5c1 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905, I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated, and this would cause `float8` tests to fail. Also, it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so that check is removed here as well...

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-28 17:04:25 +00:00
543896fcf3 test_matmul_cuda: Refine MX test skipping (#161009)
Replace `return unittest.skip(...)` with `raise unittest.SkipTest(...)` to ensure that the test suite correctly reports skipped tests.
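
The distinction, sketched (the capability check is a hypothetical stand-in for the real skip condition):
```python
import unittest

def mx_supported() -> bool:
    # Hypothetical capability check standing in for the real skip condition.
    return False

class MXTests(unittest.TestCase):
    def test_blockwise(self):
        if not mx_supported():
            # Old pattern: `return unittest.skip("...")` returns a decorator
            # object, so the test body ends and the test silently "passes".
            # Raising SkipTest makes the runner record a real skip.
            raise unittest.SkipTest("MX formats not supported on this platform")
```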

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161009
Approved by: https://github.com/jeffdaily
2025-08-20 00:47:45 +00:00
e389a08dcd AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360)
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.

  This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.

- Validation (**ROCm 7.0** + **gfx950** required):

  `111 relevant tests passing.`

  > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

  Co-author: @jagadish-amd —  Thank you for the efforts leading validation on gfx950 with ROCm 7.0.

-----------------------------------

This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.

### Support for new scalar types and scaling methods:
* `aten/src/ATen/cuda/CUDABlas.cpp`: Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`.

* `aten/src/ATen/native/cuda/Blas.cpp`: Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions.

### Updates to scalar type mappings:
* `aten/src/ATen/cuda/CUDADataType.h`: Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.

* `aten/src/ATen/cuda/tunable/GemmHipblaslt.h`: Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.

### Enhancements to testing (@jagadish-amd):
* `test/test_matmul_cuda.py`: Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions.

These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
2025-08-18 16:43:09 +00:00
bf3ebd7ad4 Fix grouped MM load along K when TMA loads are not used (#159485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159485
Approved by: https://github.com/ngimel
2025-07-31 17:58:02 +00:00
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is currently supported on Nvidia (see `aten/src/ATen/native/cuda/Blas.cpp` at commit `9faef3d17c`, L1614); this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950) (#157950).

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported; we plan to add support for MI350X in the near future. For data types, we support FP8 e4m3.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G | M | N | K | Runtime (ms) | GB/s | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
26f4dd5160 Scaled MM Fix NVfp4 (#159170)
Fixes mm on B200:
Before:
```Shell
    def _addmm_nvfp4_dispatch(
        a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
        The only difference is whether bias is None or not.
        """
        assert a._data.is_contiguous()
        assert b._data.t().is_contiguous()
        assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
        assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"

        M, K = a.shape[0], a.shape[1]
        N = b.shape[1]

        # Swizzle Dizzle
        if a._is_swizzled_scales:
            a_scale_blocked = a._scale_e4m3  # Already swizzled
        else:
            a_scale = a._scale_e4m3.view(M, K // a._block_size)
            a_scale_blocked = to_blocked(a_scale)

        if b._is_swizzled_scales:
            b_scale_blocked = b._scale_e4m3  # Already swizzled
        else:
            b_scale = b._scale_e4m3.view(N, K // b._block_size)
            b_scale_blocked = to_blocked(b_scale)

        # Merge double quant scales into 1 scale for Scale_In^D
        if a._per_tensor_scale is not None:
            assert b._per_tensor_scale is not None
            scale_result = a._per_tensor_scale * b._per_tensor_scale
        else:
            assert b._per_tensor_scale is None and a._per_tensor_scale is None
            scale_result = None

        # THIS IS A WORKAROUND:
        # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
        # When we have per-tensor scaling, we need to apply it before bias
        # since bias is not quantized
        should_add_bias_separately = (scale_result is not None) and (bias is not None)
        # should_add_bias_separately = bias is not None

>       result = torch._scaled_mm(
            a._data.view(torch.float4_e2m1fn_x2),
            b._data.view(torch.float4_e2m1fn_x2),
            a_scale_blocked.view(torch.float8_e4m3fn),
            b_scale_blocked.view(torch.float8_e4m3fn),
            bias=None if should_add_bias_separately else bias,
            out_dtype=a._orig_dtype,
            # scale_result=scale_result,  # Not supported yet
        )
E       RuntimeError: Invalid scaling configuration.
E       - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E       - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E       - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E       - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E       - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E       - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E       Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
2025-07-25 23:34:03 +00:00
5ab0eb28f7 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.
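
A sketch of what that shape-based detection looks like from the caller's side (illustrative sizes; the 1x128 lhs / 128x128 rhs tiling and layouts shown are assumptions consistent with the error-message formats quoted elsewhere in this log):
```python
import torch

M, K, N = 256, 512, 384
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()
scale_a = torch.rand(M, K // 128, device="cuda")         # one scale per 1x128 tile of a
scale_b = torch.rand(K // 128, N // 128, device="cuda")  # one scale per 128x128 tile of b
# No explicit "recipe" argument: the blockwise format is inferred from these
# shapes (scale strides must also satisfy the layout checks, omitted here).
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```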

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-24 20:10:51 +00:00
55ff4f85e9 [FP8][CUTLASS] xFail honor_sm_carveout on sm100 (#152378)
CUTLASS only supports SM carveout via green contexts on `sm100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152378
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/nWEIdia
2025-07-22 18:39:50 +00:00
32aade9d8d Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit 39ac189808c61588f3594dbc2fc1d69bb6194c47.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/jithunnair-amd due to Ignored ROCm failures while ROCm was unstable, but HUD clearly shows this PR introduced failures on trunk ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3087982975))
2025-07-18 07:47:46 +00:00
39ac189808 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-17 08:26:27 +00:00
9513b9d03f Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit bc65253369933160a2da3fc786d027a572faf6b7.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/lw due to OSX failures are real ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3079042171))
2025-07-16 15:04:10 +00:00
bc65253369 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-16 13:54:09 +00:00