Recently there has been work in an experimental repo to start implementing the intrinsics necessary to handle F8 workloads (see: https://github.com/pytorch-labs/float8_experimental).
A recent PR was submitted to add support for AMD F8 types (fnuz). That PR uncovered a bug in the ROCm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes the bug by swapping `abs_()` with `abs()`: the former takes the elementwise absolute value of the tensor in-place, so the tensor ends up containing only positive values and the final assertion fails.
Important to note: this fix is part of a workaround, as hipBLASLt does not yet support amax (`HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER`). That functionality has been implemented internally and is going through the proper channels to propagate to the community.
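As a generic illustration of the in-place vs. out-of-place difference (not the actual test code), the in-place variant overwrites its input, so any later check that depends on the original signed values breaks:
```python
import torch

x = torch.randn(8)

amax = x.abs().max()                  # out-of-place: x keeps its signed values
has_negatives = bool((x < 0).any())   # still True in general

y = x.clone()
amax_inplace = y.abs_().max()         # in-place: y is overwritten with |y|
all_positive = bool((y >= 0).all())   # True, so a sign-sensitive assertion on y would now fail
```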
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
Adds `scaled_gemm` for ROCm using hipBLASLt. As of ROCm 6.0, `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` is not supported. A work-around is provided that performs the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.
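A rough sketch of the work-around described above (illustrative only; the real logic lives in the C++ `scaled_gemm` path): since the library cannot return amax of the output directly, it is derived from the already-written output buffer, which has been rounded to the output dtype and therefore loses some accuracy.
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the GEMM result written by hipBLASLt (names are illustrative).
out = torch.randn(128, 128, device=device, dtype=torch.bfloat16)

# Fallback absmax computed from the (already rounded) output buffer rather
# than from the higher-precision accumulator inside the library.
amax = out.abs().max().to(torch.float32)
```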
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
CC @malfet @ptrblck
~~We've been seeing a lot of noise from Ampere and later devices due to reduced precision reductions, so preemptively disabling them for addmm tests.~~
Breaks the addmm tests out into separate variants with and without reduced precision reductions.
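For reference, these are the backend flags the split is about; a sketch of toggling one of them around a matmul (not the test code itself):
```python
import torch

# Run a half-precision matmul with reduced-precision reductions disabled,
# then restore the previous setting.
prev = torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
try:
    a = torch.randn(64, 64, device="cuda", dtype=torch.half)
    b = torch.randn(64, 64, device="cuda", dtype=torch.half)
    c = a @ b
finally:
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = prev
```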
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112545
Approved by: https://github.com/malfet
Fixes #68972
Relands #107246
To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assertion is only performed if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
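A minimal sketch of the opt-in pattern described above (a hypothetical simplification, not the actual `torch.testing._internal` code):
```python
import unittest
import torch

class TestCase(unittest.TestCase):
    # Flipped to True from the `if __name__ == "__main__":` block of opted-in test files.
    _default_dtype_check_enabled = False

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float
```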
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after CUDA 11.8.
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix, can also be passed to the function.
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# Imports assumed by this sketch.
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    # Upcast to fp32, apply whichever scales were provided, cast back to the
    # requested dtype, and return a placeholder amax alongside the result.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work); a usage sketch follows below
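A hedged usage sketch that respects both limitations; the keyword names follow the decomposition sketch above and are assumptions, and they, along with the two-value return, may differ between PyTorch versions:
```python
import torch

# Dimensions divisible by 16; x is row-major, y.t() supplies the column-major operand.
x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
y = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
out, amax = torch._scaled_mm(x, y.t(), dtype=torch.bfloat16)
```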
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix, can also be passed to the function (an example appears after the known limitations below).
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
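For the `bias` rows in the table above, a hedged example; the keyword names are assumptions mirroring the decomposition sketch and may differ between PyTorch versions:
```python
import torch

a = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
bias = torch.randn(16, device="cuda", dtype=torch.bfloat16)  # added to each row of the result
out, amax = torch._scaled_mm(a, b.t(), bias=bias, dtype=torch.bfloat16)
```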
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~
According to the `cuBLAS` docs the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
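A small illustrative helper (a Python stand-in for the C++ check; the name is hypothetical) for computing the largest power-of-two alignment of an operand's data pointer, capped at the 256-byte default:
```python
import torch

def max_alignment_bytes(t: torch.Tensor, cap: int = 256) -> int:
    """Largest power-of-two alignment of t's data pointer, capped at `cap` bytes."""
    ptr = t.data_ptr()
    align = 1
    while align < cap and ptr % (align * 2) == 0:
        align *= 2
    return align
```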
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best guess of how `torch.version.cuda` would behave in ROCm builds, so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.
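Roughly, the guard pattern in question, assuming (as the note above guesses) that `torch.version.cuda` is `None` in ROCm builds while `torch.version.hip` is set; the real test suite derives `TEST_WITH_ROCM` differently, from an environment variable:
```python
import torch

# Before: a "CUDA >= 11" version gate, which likely excluded ROCm implicitly
# because torch.version.cuda is None there. After dropping the version flag,
# the ROCm exclusion has to be spelled out explicitly.
is_rocm = torch.version.hip is not None
run_cuda_only_case = torch.cuda.is_available() and not is_rocm
```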
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214
Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.
Interestingly, the use case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and some are only 2-byte aligned is not. This behavior suggests that the `cuBlasLt` heuristics are incorrect, as the heuristic function has visibility into the raw pointer values via the descriptors when it is called.
We will follow up with `cuBlasLt`, but this fix is needed for now to prevent unnecessary crashes.
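For intuition, a sketch of how a merely 2-byte-aligned `Half` operand can arise from an otherwise well-aligned allocation (illustrative, not the forum repro):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
buf = torch.randn(1024, device=device, dtype=torch.half)  # allocator returns a well-aligned base pointer
operand = buf[1:65].view(8, 8)                            # offset by one fp16 element = 2 bytes
print(operand.data_ptr() % 4)                             # typically 2: only 2-byte aligned
```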
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel