Commit Graph

174 Commits

Author SHA1 Message Date
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
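
For reference, a minimal sketch of the patterns each new rule flags (illustrative code, not taken from the PR):
```python
from enum import Enum

class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796: enum contains duplicate value `1`

for i in range(0, 10):  # PIE808: the `0` start argument is unnecessary
    pass

for i in range(10):  # equivalent form that ruff's autofix produces
    pass
```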

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
3154482072 [CUDA][cuBLAS] Only xFail addmm with reduced precision reductions on non-RTX SKUs (#165379)
RTX Blackwells don't behave quite like their datacenter counterparts here
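
A hedged sketch of how such a SKU-conditional xfail can be expressed (the helper name and the device-name heuristic are assumptions for illustration, not the PR's actual code):
```python
import unittest
import torch

def is_rtx_sku() -> bool:
    # Assumed heuristic: consumer RTX Blackwell boards report "GeForce RTX"
    # in the device name, unlike datacenter B100/B200 parts.
    return torch.cuda.is_available() and "GeForce RTX" in torch.cuda.get_device_name()

# Apply expectedFailure only on non-RTX (datacenter) SKUs; RTX parts run normally.
xfail_on_datacenter = (lambda fn: fn) if is_rtx_sku() else unittest.expectedFailure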

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165379
Approved by: https://github.com/Skylion007
2025-10-17 02:45:07 +00:00
b42fe389b9 ROCm unit tests enablement (#165366)
Enables:
test_cuda.py::TestCuda::test_streaming_backwards_multiple_streams
test_cuda.py::TestCuda::test_graph_make_graphed_callables_with_amp_cache_disabled_allow_unused_input
test_cuda.py::TestCuda::test_graph_make_graphed_callables_without_amp_allow_unused_input
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165366
Approved by: https://github.com/jeffdaily
2025-10-15 22:35:03 +00:00
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
f7082e92b3 [cuBLAS] update cuBLAS determinism docs, remove workspace requirement checks (#161749)
Since CUDA 11.x (the docs need updating for this; the current PR says 12.2, which is incorrect) we've been allocating cuBLAS workspaces explicitly per handle/stream combination: https://github.com/pytorch/pytorch/pull/85447

According to the cuBLAS documentation, this appears to be sufficient for determinism without any explicit workspace requirements to e.g., `:4096:8` or `:16:8` as was previously expressed in PyTorch docs https://docs.nvidia.com/cuda/cublas/#results-reproducibility

Planning to add an explicit determinism test as well...
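
For context, a minimal sketch of the workspace pinning the old docs required (and which this PR argues is no longer necessary given per handle/stream workspace allocation):
```python
import os
import torch

# Old guidance: pin the cuBLAS workspace before CUDA initialization to get
# deterministic cuBLAS results, alongside PyTorch's deterministic mode.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")  # or ":16:8"
torch.use_deterministic_algorithms(True)
```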

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161749
Approved by: https://github.com/ngimel
2025-10-03 00:09:47 +00:00
70d1043bdf Fix non-TMA loads in grouped MM Triton kernel (#163895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163895
Approved by: https://github.com/lezcano
2025-10-01 12:21:13 +00:00
7a9119948e Split scaled-mm tests into separate file (#164266)
Summary:

* Split scaled-mm-specific tests into `test/test_scaled_matmul_cuda.py`

Test Plan:

```
pytest test/test_matmul_cuda.py
pytest test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164266
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-01 02:23:21 +00:00
1cce6efdb8 Fix silent incorrectness for bmm/baddbmm out_dtype overload (#164095)
Add input checks, mirroring the meta functions for the standard ops in `ATen/native/LinearAlgebra.cpp`, to the `out_dtype` variants. Fixes the silent incorrectness in https://github.com/pytorch/pytorch/issues/163816
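
The actual checks live in C++, but a Python sketch of the kind of validation the meta functions perform for `baddbmm`-style shapes might look like this (names are illustrative):
```python
import torch

def check_baddbmm_inputs(batch1: torch.Tensor, batch2: torch.Tensor) -> None:
    # Mirror the meta-function checks: both operands must be 3D with matching
    # batch and inner dimensions, instead of silently producing wrong results.
    torch._check(batch1.dim() == 3, lambda: "batch1 must be a 3D tensor")
    torch._check(batch2.dim() == 3, lambda: "batch2 must be a 3D tensor")
    torch._check(
        batch1.size(0) == batch2.size(0) and batch1.size(2) == batch2.size(1),
        lambda: f"expected compatible shapes, got {batch1.shape} and {batch2.shape}",
    )
```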

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164095
Approved by: https://github.com/ngimel
2025-09-30 20:13:13 +00:00
b7419b920d [ROCm][CI] Upgrade ROCm to 7.0 (#163140)
Upgrade all the ROCm docker image to ROCm 7.0 release version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163140
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 02:23:26 +00:00
e2c894c97d [Inductor][ATen][FP8] Relax stride check for block-wise scaling when scaling dimension is 1 (#163829)
Summary: Relax the stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, which replaces it with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. We therefore relax the check when the effective stride is 1, to allow cases in which `K <= 128` and `N <= 128`.
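
The relaxed rule reduces to something like this Python sketch (an illustration of the idea under the assumption described above, not the C++ code):
```python
import torch

def stride_matches(scale: torch.Tensor, expected: tuple[int, int]) -> bool:
    # A stride is only meaningful for dimensions of size > 1, so size-1
    # dimensions (where PyTorch defaults the stride to 1) always match.
    return all(
        size == 1 or actual == want
        for size, actual, want in zip(scale.shape, scale.stride(), expected)
    )

# A (1, 1) scale tensor gets default strides (1, 1) regardless of layout,
# so it passes under the relaxed rule even when K <= 128 and N <= 128:
assert stride_matches(torch.rand(1, 1), (128, 1))
```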

Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda   2>&1 | tee ~/personal/stride_check.log
```

Differential Revision: D83023706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
2025-09-29 17:28:26 +00:00
b3cf5c79dd Skip on sm100 for now since tests are non-deterministic (#163552)
This is tracked in https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think this is overkill, but feel free to disagree. Also, the one test I ran that failed earlier up in the stack failed with a 1-ULP difference, so I think this is kind of an edge case in how we do testing (will write up an issue with my thoughts later).
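
For reference, a rough sketch of a bfloat16 ULP-distance helper in the spirit of `assert_close_with_ulp` (that helper lives in the author's `transformer_nuggets` repo; this is a reconstruction of the idea, not its actual code):
```python
import torch

def ulp_distance_bf16(expected: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Reinterpret bfloat16 bits as int16 and diff: for same-sign finite values
    # this counts representable steps between them; sign flips produce the
    # huge +/-32k distances visible in the failure table below.
    e = expected.view(torch.int16).to(torch.int32)
    a = actual.view(torch.int16).to(torch.int32)
    return a - e
```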

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Okay, the bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
02da4753f5 Triton template IMA reads on B200 (#163460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163460
Approved by: https://github.com/eqy, https://github.com/alexsamardzic
2025-09-22 20:34:39 +00:00
e37b600007 [CUDA][cuBLAS][FP8] Forward-fix #162022 (#163354)
@ngimel is right, `ciflow/h100` doesn't actually appear to test the PR :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163354
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-09-21 00:55:12 +00:00
f8f230a801 [FP8][cuBLAS][H100] only test fp32 outputs for rowwise _scaled_mm on H100 (#162022)
Only cuBLAS supports float32 output, and cuBLAS only supports rowwise scaling on SM 9.0.

Intended to land after #161305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162022
Approved by: https://github.com/ngimel
2025-09-19 15:18:13 +00:00
264e7f68a0 [ROCm] Fix mx fp8 and fp4 code after scaling refactor changes. (#163127)
PR #151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function changes in Blas.cpp, along with the test_matmul_cuda changes, triggered failures.
This patch corrects the `is_blockwise_1x32_scaling` function.

2. Fixes the m, n, k dimensions for the ROCm mx case.

3. Modifies FP4E2M1FN_LARGEST_POW2 (the largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This results in a higher SQNR value for the mx fp4 test.
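
SQNR here is the standard signal-to-quantization-noise ratio; a minimal sketch:
```python
import torch

def sqnr(ref: torch.Tensor, approx: torch.Tensor) -> torch.Tensor:
    # 10 * log10(signal power / quantization-noise power), in dB;
    # larger is better, so the FP4E2M1FN_LARGEST_POW2 fix improves this metric.
    ref, approx = ref.float(), approx.float()
    noise = ref - approx
    return 10 * torch.log10(ref.pow(2).mean() / noise.pow(2).mean())
```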

Testing result on gfx950 w/ ROCm 7.0:

```
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 452 tests in 22.698s
OK (111 passed)
```

This is the same as before (when PR 151360 was merged).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163127
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-19 12:29:52 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix the ROCm GEMM helper to set the output type (C/D) based on the `C_Dtype` template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables the bf16 group gemm alternative path as described in #161366.
The fast path will be enabled in the future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
5eb35d2ab8 [CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387
Approved by: https://github.com/Skylion007
2025-09-09 17:04:06 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as-is; this PR just removes the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm 7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
b6d0a9ea90 MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209)
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
    - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
    - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
    - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
    - Bump FBGEMM third party submodule to include:
          - https://github.com/pytorch/FBGEMM/pull/4816
          - https://github.com/pytorch/FBGEMM/pull/4820
          - https://github.com/pytorch/FBGEMM/pull/4821
          - https://github.com/pytorch/FBGEMM/pull/4823

#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`

## Test plan

#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)

#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...

test/test_matmul_cuda.py .........                                                                                                                        [100%]

============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
2025-09-06 15:25:30 +00:00
c2a3024617 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905, I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated, and this would cause `float8` tests to fail. Also, it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so that check is removed here as well...
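
A hedged sketch of the call shape this unblocks (illustrative sizes; the layout conventions, row-major lhs, column-major rhs, and float32 rowwise scales, are assumptions based on `_scaled_mm`'s documented requirements):
```python
import torch

M, K, N = 256, 512, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)    # row-major e4m3
b = torch.randn(N, K, device="cuda").to(torch.float8_e5m2).t()  # column-major e5m2
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)  # one scale per row of a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)  # one scale per column of b
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```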

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-05 16:55:09 +00:00
73eb4511fb [B200][NVFP4] Fix argument passing in test_blockwise_mxfp8_nvfp4_mxfp4_numerics_ (#162185)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-05 01:24:59 +00:00
c7e41071a0 [B200][MXFP8] Fix regex in test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda (#162180)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia
2025-09-04 23:29:10 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as-is; this PR just removes the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm 7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
9eadb37cdd enable float32 and float16 in torch._grouped_mm fallback (#162059)
Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.
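
A hedged sketch of a 2d-3d call in one of the newly enabled dtypes (the `offs` convention, cumulative row boundaries per group, is an assumption about this private API):
```python
import torch

K, N, total_M = 64, 32, 96
offs = torch.tensor([16, 40, 96], device="cuda", dtype=torch.int32)  # group boundaries in mat_a
mat_a = torch.randn(total_M, K, device="cuda", dtype=torch.float16)
mat_b = torch.randn(offs.numel(), K, N, device="cuda", dtype=torch.float16)
# float16 inputs now take the fallback; all-bfloat16 keeps the fast path.
out = torch._grouped_mm(mat_a, mat_b, offs=offs, out_dtype=torch.float16)
```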

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
2025-09-04 17:48:52 +00:00
61fb632cfb move _grouped_mm fallback to composite explicit autograd (#161717)
Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
2025-09-04 17:48:52 +00:00
8a736fa1ea create torch._grouped_mm fallback path with for loops / bmm (#161407)
Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).
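
A minimal sketch of the naive approach for the 2d-3d case (a reconstruction of the idea, not the PR's actual code):
```python
import torch

def grouped_mm_fallback_2d_3d(mat_a, mat_b, offs):
    # Slice mat_a by cumulative group offsets and run one plain matmul per group.
    outs, start = [], 0
    for g in range(mat_b.shape[0]):
        end = int(offs[g])
        outs.append(mat_a[start:end] @ mat_b[g])
        start = end
    return torch.cat(outs, dim=0)
```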

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-09-04 17:48:44 +00:00
a8d6943d36 ROCm: Enable overload tests from test_matmul_cuda (#161540)
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812.
rocBLAS backend tests are not enabled yet (WIP).

Test commands:
```
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
2025-09-02 16:27:42 +00:00
21fae99c18 Revert "[cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)"
This reverts commit 55c289d5c104c4959cc125c0fb4fb50c9fc71102.

Reverted https://github.com/pytorch/pytorch/pull/161305 on behalf of https://github.com/atalman due to Broke test_matmul_cuda.py::TestFP8MatmulCUDA::test_float8_error_messages_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17309011599/job/49140215634) [HUD commit link](1190b7f73e) ([comment](https://github.com/pytorch/pytorch/pull/161305#issuecomment-3242652672))
2025-09-01 14:56:47 +00:00
55c289d5c1 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905, I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated, and this would cause `float8` tests to fail. Also, it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so that check is removed here as well...

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-28 17:04:25 +00:00
543896fcf3 test_matmul_cuda: Refine MX test skipping (#161009)
Replace `return unittest.skip(...)` with `raise unittest.SkipTest(...)` to ensure that the test suite correctly reports skipped tests.
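
The distinction, sketched (the capability check is a hypothetical stand-in for the real skip condition):
```python
import unittest

def mx_supported() -> bool:
    # Hypothetical capability check standing in for the real skip condition.
    return False

class MXTests(unittest.TestCase):
    def test_blockwise(self):
        if not mx_supported():
            # Old pattern: `return unittest.skip("...")` returns a decorator
            # object, so the test body ends and the test silently "passes".
            # Raising SkipTest makes the runner record a real skip.
            raise unittest.SkipTest("MX formats not supported on this platform")
```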

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161009
Approved by: https://github.com/jeffdaily
2025-08-20 00:47:45 +00:00
e389a08dcd AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360)
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.

  This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.

- Validation (**ROCm 7.0** + **gfx950** required):

  `111 relevant tests passing.`

  > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

  Co-author: @jagadish-amd —  Thank you for the efforts leading validation on gfx950 with ROCm 7.0.

-----------------------------------

This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.

### Support for new scalar types and scaling methods:
* `aten/src/ATen/cuda/CUDABlas.cpp`: Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`.

* `aten/src/ATen/native/cuda/Blas.cpp`: Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions.

### Updates to scalar type mappings:
* `aten/src/ATen/cuda/CUDADataType.h`: Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.

* `aten/src/ATen/cuda/tunable/GemmHipblaslt.h`: Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.

### Enhancements to testing (@jagadish-amd):
* `test/test_matmul_cuda.py`: Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions.

These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
2025-08-18 16:43:09 +00:00
bf3ebd7ad4 Fix grouped MM load along K when TMA loads are not used (#159485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159485
Approved by: https://github.com/ngimel
2025-07-31 17:58:02 +00:00
c400c8e2e0 [ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075)
Summary:

In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is currently supported on Nvidia (see `aten/src/ATen/native/cuda/Blas.cpp` at commit `9faef3d17c`, L1614); this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950) (#157950).

The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported; we plan to add support for MI350X in the near future. For data types, we support FP8 e4m3.

The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.

Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.

Test Plan:

**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

**Performance Sample**
| G | M | N | K | Runtime (ms) | GB/s | TFLOPS |
| --  | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`

**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After:  118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).

Reviewers: @drisspg @ngimel @jwfromm @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
26f4dd5160 Scaled MM Fix NVfp4 (#159170)
Fixes mm on B200:
Before:
```Shell
    def _addmm_nvfp4_dispatch(
        a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
        The only difference is whether bias is None or not.
        """
        assert a._data.is_contiguous()
        assert b._data.t().is_contiguous()
        assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
        assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"

        M, K = a.shape[0], a.shape[1]
        N = b.shape[1]

        # Swizzle Dizzle
        if a._is_swizzled_scales:
            a_scale_blocked = a._scale_e4m3  # Already swizzled
        else:
            a_scale = a._scale_e4m3.view(M, K // a._block_size)
            a_scale_blocked = to_blocked(a_scale)

        if b._is_swizzled_scales:
            b_scale_blocked = b._scale_e4m3  # Already swizzled
        else:
            b_scale = b._scale_e4m3.view(N, K // b._block_size)
            b_scale_blocked = to_blocked(b_scale)

        # Merge double quant scales into 1 scale for Scale_In^D
        if a._per_tensor_scale is not None:
            assert b._per_tensor_scale is not None
            scale_result = a._per_tensor_scale * b._per_tensor_scale
        else:
            assert b._per_tensor_scale is None and a._per_tensor_scale is None
            scale_result = None

        # THIS IS A WORKAROUND:
        # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
        # When we have per-tensor scaling, we need to apply it before bias
        # since bias is not quantized
        should_add_bias_separately = (scale_result is not None) and (bias is not None)
        # should_add_bias_separately = bias is not None

>       result = torch._scaled_mm(
            a._data.view(torch.float4_e2m1fn_x2),
            b._data.view(torch.float4_e2m1fn_x2),
            a_scale_blocked.view(torch.float8_e4m3fn),
            b_scale_blocked.view(torch.float8_e4m3fn),
            bias=None if should_add_bias_separately else bias,
            out_dtype=a._orig_dtype,
            # scale_result=scale_result,  # Not supported yet
        )
E       RuntimeError: Invalid scaling configuration.
E       - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E       - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E       - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E       - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E       - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E       - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E       Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
2025-07-25 23:34:03 +00:00
5ab0eb28f7 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.
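
A sketch of what that shape-based detection looks like from the caller's side (illustrative sizes; the 1x128 lhs / 128x128 rhs tiling and layouts shown are assumptions consistent with the error-message formats quoted elsewhere in this log):
```python
import torch

M, K, N = 256, 512, 384
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()
scale_a = torch.rand(M, K // 128, device="cuda")         # one scale per 1x128 tile of a
scale_b = torch.rand(K // 128, N // 128, device="cuda")  # one scale per 128x128 tile of b
# No explicit "recipe" argument: the blockwise format is inferred from these
# shapes (scale strides must also satisfy the layout checks, omitted here).
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```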

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-24 20:10:51 +00:00
55ff4f85e9 [FP8][CUTLASS] xFail honor_sm_carveout on sm100 (#152378)
CUTLASS only supports SM carveout via green contexts on `sm100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152378
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/nWEIdia
2025-07-22 18:39:50 +00:00
32aade9d8d Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit 39ac189808c61588f3594dbc2fc1d69bb6194c47.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/jithunnair-amd due to Ignored ROCm failures while ROCm was unstable, but HUD clearly shows this PR introduced failures on trunk ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3087982975))
2025-07-18 07:47:46 +00:00
39ac189808 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-17 08:26:27 +00:00
9513b9d03f Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit bc65253369933160a2da3fc786d027a572faf6b7.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/lw due to OSX failures are real ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3079042171))
2025-07-16 15:04:10 +00:00
bc65253369 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-16 13:54:09 +00:00