This PR includes additional enhancements to TF32 support in TunableOp; a usage sketch follows the list below.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.
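A minimal online-tuning sketch, assuming a CUDA/ROCm build with TunableOp available (shapes are arbitrary):
```python
import torch

# Opt in to TF32 for matmuls and turn on TunableOp tuning.
torch.backends.cuda.matmul.allow_tf32 = True
torch.cuda.tunable.enable(True)

a = torch.randn(1024, 1024, device="cuda")  # float32 inputs
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # this GEMM is tuned and recorded under a tf32 OpSignature, not float32
```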
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fix for non-ROCm builds:
```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
raise rte
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
result = test(self, **param_kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
return fn(slf, *args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
fn(*args, **kwargs)
File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
torch.backends.cuda.preferred_blas_library('ck')
File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.
To execute this test, run the following from the base repo dir:
python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.346s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
This PR adds support for rowwise scaling, in addition to tensorwise scaling, on scaled GEMM; a sketch contrasting the two modes follows the list below.
There are a few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM
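For context, a hedged sketch contrasting the two scaling modes via `torch._scaled_mm` (scale values and shapes are illustrative; requires a GPU with FP8 support, and the FP8 dtype varies by platform, e.g. some ROCm GPUs use `torch.float8_e4m3fnuz`):
```python
import torch

M, K, N = 64, 128, 32
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# b must be column-major for torch._scaled_mm
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# Tensorwise: a single scale per tensor.
out_tw = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0, device="cuda"),
    scale_b=torch.tensor(1.0, device="cuda"),
    out_dtype=torch.bfloat16,
)

# Rowwise: one scale per row of a and per column of b.
out_rw = torch._scaled_mm(
    a, b,
    scale_a=torch.ones(M, 1, device="cuda"),
    scale_b=torch.ones(1, N, device="cuda"),
    out_dtype=torch.bfloat16,
)
```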
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
A recent PR #143049 attempted to increase tolerances to make the test pass. However, we are still seeing errors like:
```
Traceback (most recent call last):
File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!
Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.
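A hedged illustration of that effect (the sizes and synthetic near-low-rank input are assumptions, not the test's exact setup):
```python
import torch

torch.manual_seed(0)
# An approximately rank-50 matrix, similar in spirit to the test's input.
A = torch.randn(1000, 50) @ torch.randn(50, 1000) + 1e-3 * torch.randn(1000, 1000)
for niter in (2, 4, 8):
    # More subspace iterations tighten the low-rank approximation.
    U, S, V = torch.svd_lowrank(A, q=50, niter=niter)
    err = (A - U @ torch.diag(S) @ V.mT).abs().max().item()
    print(f"niter={niter}: max abs reconstruction error {err:.3e}")
```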
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
TL;DR: Follow-up to / builds on top of https://github.com/pytorch/pytorch/pull/144476; adds OCP FP8 support for gfx950.
Refer to https://github.com/pytorch/ao/pull/1677.
This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.
### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)
### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.
### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)
These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.
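For reference, a hedged illustration of the OCP float8 formats involved (not code from this PR):
```python
import torch

# The OCP FP8 formats; on supported GPUs these map to native hardware types.
x = torch.randn(4, 4).to(torch.float8_e4m3fn)  # OCP E4M3
y = torch.randn(4, 4).to(torch.float8_e5m2)    # OCP E5M2
print(x.dtype, y.dtype)
```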
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR has a UT speed-up and some refactoring of tests.
A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed the `matmul_small_brute_force_tunableop` test for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect of significantly increasing the execution time for the FP32 and FP64 data types. This PR *reduces* the execution time by 20+ minutes.
We also move a hipBLASLt version check to a different TunableOp UT for simplicity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, an assertion is hit for batch size 2 and above because the non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.
### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it is contiguous in the last dimension, and remove the assertion that requires a contiguous activation.
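A hedged sketch of the newly supported case (assuming a CPU build where `torch._weight_int8pack_mm` is available; shapes are illustrative):
```python
import torch

M, K, N = 4, 64, 32
# A row-strided view: non-contiguous overall but contiguous in the last
# dimension, as produced by slicing the final hidden state before the LM head.
A = torch.randn(2 * M, K, dtype=torch.bfloat16)[::2]
assert not A.is_contiguous() and A.stride(-1) == 1

B = torch.randint(-128, 127, (N, K), dtype=torch.int8)  # int8 weights
scales = torch.rand(N, dtype=torch.bfloat16)            # per-channel scales
out = torch._weight_int8pack_mm(A, B, scales)           # previously hit the assertion
```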
### Alternative solutions considered
We could modify the LLaMA model in the transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause a regression for other users of that code.
Another aspect to this issue: is latency always lower if we make an activation tensor contiguous before `linear` or `torch._weight_int8pack_mm` is called on CPU? We would need some data points to analyze that, although I think the performance should be good enough with this patch, since the first cache lines of the rows of A are explicitly prefetched in the existing code (and this approach also avoids the copy that a `contiguous` call would make).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
**Summary**
This is part of the task to enable max-autotune with the GEMM template for WoQ INT4 GEMM on CPU.
This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle.
The new op is not registered in
- `aten`, because that would require changing `native_functions.yaml`, which is not recommended.
- `quantized_decomposed`, because it will only have a Python implementation, which cannot be used with Inductor's cpp wrapper.
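A hedged sketch of the wrapper pattern (the op name and signature here are hypothetical, not the op actually registered by this PR):
```python
import torch
from torch.library import Library, impl

# Hypothetical names throughout; illustrates passing only tensor arguments.
lib = Library("quantized", "FRAGMENT")  # extend the existing namespace
lib.define(
    "int4mm_tensor_args(Tensor A, Tensor B, Tensor group_size, "
    "Tensor scales_and_zeros) -> Tensor"
)

@impl(lib, "int4mm_tensor_args", "CPU")
def int4mm_tensor_args(A, B, group_size, scales_and_zeros):
    # Unpack the scalar from its tensor container and forward to the
    # existing aten op, which takes the group size as a plain int.
    return torch.ops.aten._weight_int4pack_mm_for_cpu(
        A, B, int(group_size.item()), scales_and_zeros
    )
```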
**Test plan**
```
python test/test_linalg.py -k test__int4_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the `quantized_weights`, `scales`, and `groupsize`.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use `torch.ops.aten._dyn_quant_pack_4bit_weight` to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include `in_features` and `out_features` (the same as the Linear layer's corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use `torch.ops.aten._dyn_quant_matmul_4bit` to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include the input tensor, `packed_weights`, `groupsize`, and the `in_features` and `out_features`. An end-to-end sketch of steps 1-4 follows.
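Putting the steps together, a hedged end-to-end sketch (the toy symmetric quantizer, shapes, dtypes, and layout details are illustrative assumptions, not the reference recipe; requires a build that provides these aten ops):
```python
import torch

in_features, out_features, groupsize = 64, 32, 32
weight = torch.randn(out_features, in_features)

# Step 1: symmetric group-wise quantization to 4 bits (values in [-8, 7]).
w_groups = weight.reshape(out_features, -1, groupsize)
scales = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0
q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)

# Step 1 (cont.): pack two 4-bit weights into one uint8 container.
q = (q + 8).reshape(out_features, -1).to(torch.uint8)  # shift into [0, 15]
packed_4bit = (q[:, ::2] << 4) | q[:, 1::2]

# Steps 2-3: pack quantized weights, scales, and (optional) bias.
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    packed_4bit, scales.reshape(out_features, -1), None,
    groupsize, in_features, out_features)

# Step 4: dynamic-quantized matmul against a float activation.
x = torch.randn(2, in_features)
out = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features)
```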
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Fixes #141652
This PR contains:
- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to `_checking_tuning_assertions` to enable TunableOp if it is disabled; this check was also moved into the concurrent futures initializer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
Fixes #137936
The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`; a sketch of the env-var-free pattern follows this list
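A minimal sketch of the env-var-free pattern, assuming the `torch.cuda.tunable` Python API:
```python
import torch

try:
    # Configure through the Python API instead of environment variables,
    # so no state leaks between tests.
    torch.cuda.tunable.enable(True)
    torch.cuda.tunable.set_max_tuning_iterations(1)
    a = torch.randn(32, 32, device="cuda")
    torch.mm(a, a)  # triggers tuning of this GEMM
finally:
    torch.cuda.tunable.enable(False)
```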
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value; a usage sketch follows the list below.
Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
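A hedged usage sketch, assuming the new accessors live in `torch.cuda.tunable` alongside the existing TunableOp API:
```python
import torch

torch.cuda.tunable.enable(True)
# Set the rotating buffer size in MB; after this PR, negative values fall
# back to the default, matching the environment variable's behavior.
torch.cuda.tunable.set_rotating_buffer_size(64)
print(torch.cuda.tunable.get_rotating_buffer_size())
```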
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
Fixes #141652
This PR fixes (at least in part) the unit test failure. However, we may also need a separate flush of the untuned results; if this test continues to be flaky, another PR will be needed to add that flush.
Tested locally and it seems to be working.
Also fixes code in the unit test that was accidentally commented out in the prior multi-GPU offline tuning PR https://github.com/pytorch/pytorch/pull/139673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
This PR enhances offline tuning to support multiple GPUs.
High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename
Also adds support for `GemmAndBias` and `ScaledGemm`; a sketch of the resulting workflow follows.
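A hedged sketch of the workflow, assuming untuned GEMMs were already recorded to per-process `tunableop_untuned<ordinal>.csv` files:
```python
import torch

# Tune the recorded GEMMs across 8 GPUs; '?' stands in for the ordinal in
# the per-process filenames. Duplicate GEMMs are eliminated first, and the
# gathered results land in a single file whose name contains "_full".
torch.cuda.tunable.mgpu_tune_gemm_in_file("tunableop_untuned?.csv", 8)
```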
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiability/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:
- mode='reduced' (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode='complete': Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode='r': Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.
(in particular, the happy paths are added)
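A hedged sketch of the three cases:
```python
import torch

A = torch.randn(5, 3, requires_grad=True)  # tall matrix: m=5 > n=3

Q, R = torch.linalg.qr(A, mode="reduced")   # always differentiable
(Q.sum() + R.sum()).backward()

Q, R = torch.linalg.qr(A, mode="complete")  # differentiable only for m <= n;
# backward through this Q would raise here, since m > n

R = torch.linalg.qr(A.detach(), mode="r").R  # never differentiable
```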
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:
```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```
For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097
The test tried to catch and ignore this, but this is Windows. So, the fix is to:
1. Ignore the error if these files can't be removed
2. Write them to a temp directory instead; otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy (a sketch of the pattern follows)
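A minimal sketch of the pattern, with an illustrative file name:
```python
import contextlib
import os
import tempfile

# Keep TunableOp artifacts out of the source tree...
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "tunableop_untuned0.csv")

# ...and tolerate cleanup failures: on Windows the file may still be
# held open by another process.
with contextlib.suppress(PermissionError, FileNotFoundError):
    os.remove(filename)
```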
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman