Recently there has been work in an experimental repo to start implementing the intrinsics necessary to handle F8 workloads (see: https://github.com/pytorch-labs/float8_experimental).
A recent PR was submitted to add support for AMD F8 types (fnuz). That PR uncovered a bug in the ROCm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes the bug by swapping `abs_()` with `abs()`: the former takes the elementwise absolute value of the tensor in-place, so the tensor ends up containing only positive values and the final assertion fails.
Important to note: this fix is part of a workaround, as hipBLASLt does not yet support amax (`HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER`). That functionality has been implemented internally and is going through the proper channels to propagate to the community.
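As a generic illustration of the in-place vs. out-of-place difference (not the actual test code), the in-place variant overwrites its input, so any later check that depends on the original signed values breaks:
```python
import torch

x = torch.randn(8)

amax = x.abs().max()                  # out-of-place: x keeps its signed values
has_negatives = bool((x < 0).any())   # still True in general

y = x.clone()
amax_inplace = y.abs_().max()         # in-place: y is overwritten with |y|
all_positive = bool((y >= 0).all())   # True, so a sign-sensitive assertion on y would now fail
```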
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
Adds `scaled_gemm` for ROCm using hipBLASLt. As of ROCm 6.0, `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` is not supported. A work-around is provided that performs the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.
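A rough sketch of the work-around described above (illustrative only; the real logic lives in the C++ `scaled_gemm` path): since the library cannot return amax of the output directly, it is derived from the already-written output buffer, which has been rounded to the output dtype and therefore loses some accuracy.
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the GEMM result written by hipBLASLt (names are illustrative).
out = torch.randn(128, 128, device=device, dtype=torch.bfloat16)

# Fallback absmax computed from the (already rounded) output buffer rather
# than from the higher-precision accumulator inside the library.
amax = out.abs().max().to(torch.float32)
```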
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
CC @malfet @ptrblck
~~We've been seeing a lot of noise from Ampere and later devices due to reduced precision reductions, so preemptively disabling them for addmm tests.~~
Breaks the addmm tests out into separate variants with and without reduced precision reductions.
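For reference, these are the backend flags the split is about; a sketch of toggling one of them around a matmul (not the test code itself):
```python
import torch

# Run a half-precision matmul with reduced-precision reductions disabled,
# then restore the previous setting.
prev = torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
try:
    a = torch.randn(64, 64, device="cuda", dtype=torch.half)
    b = torch.randn(64, 64, device="cuda", dtype=torch.half)
    c = a @ b
finally:
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = prev
```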
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112545
Approved by: https://github.com/malfet
Fixes #68972
Relands #107246
To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assertion is only performed if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
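A minimal sketch of the opt-in pattern described above (a hypothetical simplification, not the actual `torch.testing._internal` code):
```python
import unittest
import torch

class TestCase(unittest.TestCase):
    # Flipped to True from the `if __name__ == "__main__":` block of opted-in test files.
    _default_dtype_check_enabled = False

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float
```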
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after CUDA 11.8.
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix, can also be passed to the function.
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# Imports assumed by this sketch.
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    # Upcast to fp32, apply whichever scales were provided, cast back to the
    # requested dtype, and return a placeholder amax alongside the result.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work); a usage sketch follows below
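A hedged usage sketch that respects both limitations; the keyword names follow the decomposition sketch above and are assumptions, and they, along with the two-value return, may differ between PyTorch versions:
```python
import torch

# Dimensions divisible by 16; x is row-major, y.t() supplies the column-major operand.
x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
y = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
out, amax = torch._scaled_mm(x, y.t(), dtype=torch.bfloat16)
```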
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix, can also be passed to the function (an example appears after the known limitations below).
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
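For the `bias` rows in the table above, a hedged example; the keyword names are assumptions mirroring the decomposition sketch and may differ between PyTorch versions:
```python
import torch

a = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
bias = torch.randn(16, device="cuda", dtype=torch.bfloat16)  # added to each row of the result
out, amax = torch._scaled_mm(a, b.t(), bias=bias, dtype=torch.bfloat16)
```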
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~
According to the `cuBLAS` docs the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
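A small illustrative helper (a Python stand-in for the C++ check; the name is hypothetical) for computing the largest power-of-two alignment of an operand's data pointer, capped at the 256-byte default:
```python
import torch

def max_alignment_bytes(t: torch.Tensor, cap: int = 256) -> int:
    """Largest power-of-two alignment of t's data pointer, capped at `cap` bytes."""
    ptr = t.data_ptr()
    align = 1
    while align < cap and ptr % (align * 2) == 0:
        align *= 2
    return align
```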
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best guess of how `torch.version.cuda` would behave in ROCm builds, so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.
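Roughly, the guard pattern in question, assuming (as the note above guesses) that `torch.version.cuda` is `None` in ROCm builds while `torch.version.hip` is set; the real test suite derives `TEST_WITH_ROCM` differently, from an environment variable:
```python
import torch

# Before: a "CUDA >= 11" version gate, which likely excluded ROCm implicitly
# because torch.version.cuda is None there. After dropping the version flag,
# the ROCm exclusion has to be spelled out explicitly.
is_rocm = torch.version.hip is not None
run_cuda_only_case = torch.cuda.is_available() and not is_rocm
```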
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214
Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.
Interestingly, the use case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and some are only 2-byte aligned is not. This behavior suggests that the `cuBlasLt` heuristics are incorrect, as the heuristic function has visibility into the raw pointer values via the descriptors when it is called.
We will follow up with `cuBlasLt`, but this fix is needed for now to prevent unnecessary crashes.
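For intuition, a sketch of how a merely 2-byte-aligned `Half` operand can arise from an otherwise well-aligned allocation (illustrative, not the forum repro):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
buf = torch.randn(1024, device=device, dtype=torch.half)  # allocator returns a well-aligned base pointer
operand = buf[1:65].view(8, 8)                            # offset by one fp16 element = 2 bytes
print(operand.data_ptr() % 4)                             # typically 2: only 2-byte aligned
```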
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel