Commit Graph

23 Commits

Author SHA1 Message Date
66ea76ec44 [ROCm][tunableop] Improvements to tunableop Numerical Check (#163079)
Modified the flag PYTORCH_TUNABLEOP_NUMERICAL_CHECK, so that it accepts the numerical tolerances in the format atol_rtol as compared to the previous 0 and 1. Retains previous functionality with default values as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163079
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 22:26:47 +00:00
7f9b745494 [ROCm][tunableop] Modified Online Tuning Mode to add Instant Logging (#163965)
- Added instant logging in online tuning mode, so that each tuned GEMM is instantly written
- Allows us to have saved tuning configs, in cases of crashes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163965
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 20:02:31 +00:00
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2ce37f98e01a6469a15b69f15995c1f9.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
4cc8b60d1b [BE][1/16] fix typos in torch/ (#156311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311
Approved by: https://github.com/albanD
2025-07-09 11:02:22 +00:00
3fd84a8592 [BE][PYFMT] migrate PYFMT for torch/[a-c]*/ to ruff format (#144554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144554
Approved by: https://github.com/soulitzer
2025-07-03 18:56:07 +00:00
f6c1cf04b5 [ROCm][TunableOp] Support submatrices in offline tuning (#151138)
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs to cover submatrices. Submatrices for strided batch API is not part of this PR and will be done seperately.

There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, swapped definition of transA and transB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
2025-04-19 04:14:27 +00:00
ca2ffc23ab [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142)
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 04:12:08 +00:00
01b1d1f91b [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677)
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
2025-03-22 02:22:13 +00:00
11d4438a5f [ROCm][TunableOp] More TF32 support. (#149088)
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 00:26:20 +00:00
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
2d150ad29f [ROCm] Fix unit test: matmul_offline_mgpu_tunableop (#143507)
Fixes #141652

This PR contains:

- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
2024-12-19 19:48:20 +00:00
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
2fc8bac091 [ROCm] Fix unit test: matmul_offline_mgpu_gpu_tunableop (#142269)
Fixes #141652

This PR fixes (at least in part) the unit test failure. However, we may also need to do a separate flush of the untuned results-- if this test continues to be flaky, another PR would be needed to flush the untuned results as well.

Tested locally and it seems to be working.

Also fixing code that was accidentally commented out code in the unit test from the prior multi-gpu offline tuning PR https://github.com/pytorch/pytorch/pull/139673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
2024-12-08 02:18:00 +00:00
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multi-GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

Also adding support for GemmAndBias and ScaledGemm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When enable tunableop, It is easy to have OOM since APP usually needs large video memory size, such as running a LLM for inference.  So we need a offline mode to tune the GEMMs. This PR provide an offline mode for tunableOp:

- record untuned GEMMs to file.

- a python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
62bcdc0ac9 Flip default value for mypy disallow_untyped_defs [4/11] (#127841)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841
Approved by: https://github.com/oulgen
2024-06-08 18:36:48 +00:00
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00