This PR includes additional enhancements to TF32 support in TunableOp; a usage sketch follows the list below.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.
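A minimal online-tuning sketch, assuming a CUDA/ROCm build with TunableOp available (shapes are arbitrary):
```python
import torch

# Opt in to TF32 for matmuls and turn on TunableOp tuning.
torch.backends.cuda.matmul.allow_tf32 = True
torch.cuda.tunable.enable(True)

a = torch.randn(1024, 1024, device="cuda")  # float32 inputs
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # this GEMM is tuned and recorded under a tf32 OpSignature, not float32
```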
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fix for non-ROCm builds:
```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
raise rte
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
result = test(self, **param_kwargs)
File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
return fn(slf, *args, **kwargs)
File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
fn(*args, **kwargs)
File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
torch.backends.cuda.preferred_blas_library('ck')
File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.
To execute this test, run the following from the base repo dir:
python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.346s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
This PR adds support for rowwise scaling, in addition to tensorwise scaling, on scaled GEMM; a sketch contrasting the two modes follows the list below.
There are a few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM
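For context, a hedged sketch contrasting the two scaling modes via `torch._scaled_mm` (scale values and shapes are illustrative; requires a GPU with FP8 support, and the FP8 dtype varies by platform, e.g. some ROCm GPUs use `torch.float8_e4m3fnuz`):
```python
import torch

M, K, N = 64, 128, 32
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# b must be column-major for torch._scaled_mm
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# Tensorwise: a single scale per tensor.
out_tw = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0, device="cuda"),
    scale_b=torch.tensor(1.0, device="cuda"),
    out_dtype=torch.bfloat16,
)

# Rowwise: one scale per row of a and per column of b.
out_rw = torch._scaled_mm(
    a, b,
    scale_a=torch.ones(M, 1, device="cuda"),
    scale_b=torch.ones(1, N, device="cuda"),
    out_dtype=torch.bfloat16,
)
```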
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
A recent PR #143049 attempted to increase tolerances to make the test pass. However, we are still seeing errors like:
```
Traceback (most recent call last):
File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!
Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.
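A hedged illustration of that effect (the sizes and synthetic near-low-rank input are assumptions, not the test's exact setup):
```python
import torch

torch.manual_seed(0)
# An approximately rank-50 matrix, similar in spirit to the test's input.
A = torch.randn(1000, 50) @ torch.randn(50, 1000) + 1e-3 * torch.randn(1000, 1000)
for niter in (2, 4, 8):
    # More subspace iterations tighten the low-rank approximation.
    U, S, V = torch.svd_lowrank(A, q=50, niter=niter)
    err = (A - U @ torch.diag(S) @ V.mT).abs().max().item()
    print(f"niter={niter}: max abs reconstruction error {err:.3e}")
```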
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
TL;DR: Follow-up to / builds on top of https://github.com/pytorch/pytorch/pull/144476; adds OCP FP8 support for gfx950.
Refer to https://github.com/pytorch/ao/pull/1677.
This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.
### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)
### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.
### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)
These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.
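For reference, a hedged illustration of the OCP float8 formats involved (not code from this PR):
```python
import torch

# The OCP FP8 formats; on supported GPUs these map to native hardware types.
x = torch.randn(4, 4).to(torch.float8_e4m3fn)  # OCP E4M3
y = torch.randn(4, 4).to(torch.float8_e5m2)    # OCP E5M2
print(x.dtype, y.dtype)
```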
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR has a UT speed-up and some refactoring of tests.
A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed the `matmul_small_brute_force_tunableop` test for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect of significantly increasing the execution time for the FP32 and FP64 data types. This PR *reduces* the execution time by 20+ minutes.
We also move a hipBLASLt version check to a different TunableOp UT for simplicity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, an assertion is hit for batch size 2 and above because the non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.
### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it is contiguous in the last dimension, and remove the assertion that requires a contiguous activation.
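A hedged sketch of the newly supported case (assuming a CPU build where `torch._weight_int8pack_mm` is available; shapes are illustrative):
```python
import torch

M, K, N = 4, 64, 32
# A row-strided view: non-contiguous overall but contiguous in the last
# dimension, as produced by slicing the final hidden state before the LM head.
A = torch.randn(2 * M, K, dtype=torch.bfloat16)[::2]
assert not A.is_contiguous() and A.stride(-1) == 1

B = torch.randint(-128, 127, (N, K), dtype=torch.int8)  # int8 weights
scales = torch.rand(N, dtype=torch.bfloat16)            # per-channel scales
out = torch._weight_int8pack_mm(A, B, scales)           # previously hit the assertion
```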
### Alternative solutions considered
We could modify the LLaMA model in the transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause a regression for other users of that code.
Another aspect to this issue: is latency always lower if we make an activation tensor contiguous before `linear` or `torch._weight_int8pack_mm` is called on CPU? We would need some data points to analyze that, although I think the performance should be good enough with this patch, since the first cache lines of the rows of A are explicitly prefetched in the existing code (and this approach also avoids the copy that a `contiguous` call would make).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
**Summary**
This is part of the task to enable max-autotune with the GEMM template for WoQ INT4 GEMM on CPU.
This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle.
The new op is not registered in
- `aten`, because that would require changing `native_functions.yaml`, which is not recommended.
- `quantized_decomposed`, because it will only have a Python implementation, which cannot be used with Inductor's cpp wrapper.
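A hedged sketch of the wrapper pattern (the op name and signature here are hypothetical, not the op actually registered by this PR):
```python
import torch
from torch.library import Library, impl

# Hypothetical names throughout; illustrates passing only tensor arguments.
lib = Library("quantized", "FRAGMENT")  # extend the existing namespace
lib.define(
    "int4mm_tensor_args(Tensor A, Tensor B, Tensor group_size, "
    "Tensor scales_and_zeros) -> Tensor"
)

@impl(lib, "int4mm_tensor_args", "CPU")
def int4mm_tensor_args(A, B, group_size, scales_and_zeros):
    # Unpack the scalar from its tensor container and forward to the
    # existing aten op, which takes the group size as a plain int.
    return torch.ops.aten._weight_int4pack_mm_for_cpu(
        A, B, int(group_size.item()), scales_and_zeros
    )
```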
**Test plan**
```
python test/test_linalg.py -k test__int4_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the `quantized_weights`, `scales`, and `groupsize`.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use `torch.ops.aten._dyn_quant_pack_4bit_weight` to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include `in_features` and `out_features` (the same as the Linear layer's corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use `torch.ops.aten._dyn_quant_matmul_4bit` to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include the input tensor, `packed_weights`, `groupsize`, and the `in_features` and `out_features`. An end-to-end sketch of steps 1-4 follows.
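Putting the steps together, a hedged end-to-end sketch (the toy symmetric quantizer, shapes, dtypes, and layout details are illustrative assumptions, not the reference recipe; requires a build that provides these aten ops):
```python
import torch

in_features, out_features, groupsize = 64, 32, 32
weight = torch.randn(out_features, in_features)

# Step 1: symmetric group-wise quantization to 4 bits (values in [-8, 7]).
w_groups = weight.reshape(out_features, -1, groupsize)
scales = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0
q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)

# Step 1 (cont.): pack two 4-bit weights into one uint8 container.
q = (q + 8).reshape(out_features, -1).to(torch.uint8)  # shift into [0, 15]
packed_4bit = (q[:, ::2] << 4) | q[:, 1::2]

# Steps 2-3: pack quantized weights, scales, and (optional) bias.
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    packed_4bit, scales.reshape(out_features, -1), None,
    groupsize, in_features, out_features)

# Step 4: dynamic-quantized matmul against a float activation.
x = torch.randn(2, in_features)
out = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features)
```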
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Fixes #141652
This PR contains:
- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to `_checking_tuning_assertions` to enable TunableOp if it is disabled; this check was also moved into the concurrent futures initializer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
Fixes #137936
The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`; a sketch of the env-var-free pattern follows this list
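A minimal sketch of the env-var-free pattern, assuming the `torch.cuda.tunable` Python API:
```python
import torch

try:
    # Configure through the Python API instead of environment variables,
    # so no state leaks between tests.
    torch.cuda.tunable.enable(True)
    torch.cuda.tunable.set_max_tuning_iterations(1)
    a = torch.randn(32, 32, device="cuda")
    torch.mm(a, a)  # triggers tuning of this GEMM
finally:
    torch.cuda.tunable.enable(False)
```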
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value; a usage sketch follows the list below.
Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
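A hedged usage sketch, assuming the new accessors live in `torch.cuda.tunable` alongside the existing TunableOp API:
```python
import torch

torch.cuda.tunable.enable(True)
# Set the rotating buffer size in MB; after this PR, negative values fall
# back to the default, matching the environment variable's behavior.
torch.cuda.tunable.set_rotating_buffer_size(64)
print(torch.cuda.tunable.get_rotating_buffer_size())
```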
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
Fixes #141652
This PR fixes (at least in part) the unit test failure. However, we may also need a separate flush of the untuned results; if this test continues to be flaky, another PR will be needed to add that flush.
Tested locally and it seems to be working.
Also fixes code in the unit test that was accidentally commented out in the prior multi-GPU offline tuning PR https://github.com/pytorch/pytorch/pull/139673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
This PR enhances offline tuning to support multiple GPUs.
High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename
Also adds support for `GemmAndBias` and `ScaledGemm`; a sketch of the resulting workflow follows.
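A hedged sketch of the workflow, assuming untuned GEMMs were already recorded to per-process `tunableop_untuned<ordinal>.csv` files:
```python
import torch

# Tune the recorded GEMMs across 8 GPUs; '?' stands in for the ordinal in
# the per-process filenames. Duplicate GEMMs are eliminated first, and the
# gathered results land in a single file whose name contains "_full".
torch.cuda.tunable.mgpu_tune_gemm_in_file("tunableop_untuned?.csv", 8)
```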
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiability/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:
- mode='reduced' (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode='complete': Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode='r': Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.
(in particular, the happy paths are added)
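A hedged sketch of the three cases:
```python
import torch

A = torch.randn(5, 3, requires_grad=True)  # tall matrix: m=5 > n=3

Q, R = torch.linalg.qr(A, mode="reduced")   # always differentiable
(Q.sum() + R.sum()).backward()

Q, R = torch.linalg.qr(A, mode="complete")  # differentiable only for m <= n;
# backward through this Q would raise here, since m > n

R = torch.linalg.qr(A.detach(), mode="r").R  # never differentiable
```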
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:
```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```
For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097
The test tried to catch and ignore this, but this is Windows. So, the fix is to:
1. Ignore the error if these files can't be removed
2. Write them to a temp directory instead; otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy (a sketch of the pattern follows)
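A minimal sketch of the pattern, with an illustrative file name:
```python
import contextlib
import os
import tempfile

# Keep TunableOp artifacts out of the source tree...
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "tunableop_untuned0.csv")

# ...and tolerate cleanup failures: on Windows the file may still be
# held open by another process.
with contextlib.suppress(PermissionError, FileNotFoundError):
    os.remove(filename)
```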
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman