[Docs][TunableOp] TunableOp documentation update (#148384)

This PR aligns the documentation with what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Author: Nichols A. Romero
Date: 2025-03-07 21:02:44 +00:00
Committed by: PyTorch MergeBot
parent bb94b65da7
commit 08baaa7d63
2 changed files with 75 additions and 11 deletions


@ -3,9 +3,6 @@
TunableOp
=========
.. note::
This is a prototype feature, which means it is at an early stage
for feedback and testing, and its components are subject to change.
Overview
--------


@ -46,8 +46,8 @@ like so::
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
GemmTunableOp_float_NT,nt_25088_4096_64,Gemm_Hipblaslt_1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,Gemm_Rocblas_1216,0.033
Note the "Validator" lines. If you change a library version, or ROCm version, or
PyTorch version, TunableOp will detect this and reject the tunings file because
@ -73,13 +73,18 @@ completely silent, besides file output, unless there is a warning or error
during its use. The verbose option is only available by setting the environment
variable PYTORCH_TUNABLEOP_VERBOSE=1.
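If you want to inspect the validators and tuning results from Python rather than by opening the CSV file, a minimal sketch follows; it assumes the ``get_validators()`` and ``get_results()`` helpers in ``torch.cuda.tunable``.
.. code-block:: python
import torch.cuda.tunable as tunable
# Print the validator entries (PyTorch/ROCm/library versions) and any
# tuning results currently held by the tuning context.
print(tunable.get_validators())
print(tunable.get_results())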
A Note on Tuning Behavior
=========================
A Note on Tuning Behavior, Warmup, and Cache Effects
====================================================
Tuning an operator consists of iterating through the list of registered
implementations and profiling each one. The profile is established by running a
single implementation in a loop multiple times and taking the average execution
time.
time. There is also an optional warmup phase prior to tuning that can help the
hardware reach stable power states. During tuning of a workload, the various
hardware caches are more likely to produce hits than when not tuning. There are
options to flush the instruction cache and rotate the input tensors, which might
help produce a more faithful profile of the tuned operator, as if the operator
were run within a larger workload instead of in a tight, repetitive loop.
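A small sketch of enabling these options through the environment, assuming the warmup, instruction-cache-flush, and rotating-buffer environment variables described in the README (the exact names and units should be checked against your PyTorch build):
.. code-block:: python
import os
# Must be set before TunableOp reads its configuration for the first time,
# since environment variables are cached once read.
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS"] = "10"    # warm up before profiling
os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"      # flush the instruction cache during tuning
os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = "256"    # rotate inputs through a 256 MB buffer
import torch  # import after the environment is configured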
By default, each possible solution for a given operator will be run for either
100 iterations or as many iterations as can be run within 30ms, whichever is
@ -102,14 +107,76 @@ or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
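For orientation, here is a minimal online-tuning sketch using the ``torch.cuda.tunable`` Python API; the matrix shapes are arbitrary and the setter calls simply restate the defaults mentioned above:
.. code-block:: python
import torch
import torch.cuda.tunable as tunable
tunable.enable(True)                    # turn TunableOp on
tunable.set_max_tuning_iterations(100)  # per-solution iteration cap
tunable.set_max_tuning_duration(30)     # per-solution time cap, in milliseconds
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b             # this GEMM is routed through TunableOp and tuned on first use
tunable.write_file()  # write the results file explicitly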
Offline Tuning
==============
Motivation
----------
There are several use cases for offline tuning.
One use case involves a workload with high memory utilization, where regular tuning might lead to running out of memory.
Another use case is for compute-intensive workloads. In such cases, it is more resource-efficient to collect
the GEMMs for the workload once and then tune repeatedly with different tuning parameters or libraries.
Workflow
--------
There are two basic steps:
1) Set the environment variables to collect the untuned GEMMs; running the workload will then generate ``tunableop_untuned0.csv`` (a runnable sketch of this collection step appears after step 2):
.. code-block:: python
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
...
2) Run a Python script that reads the ``tunableop_untuned0.csv`` and generates the ``tunableop_results0.csv``, like this:
.. code-block:: python
import torch.cuda.tunable as tunable
import os
os.putenv('PYTORCH_TUNABLEOP_ENABLED', '1')
os.putenv('PYTORCH_TUNABLEOP_TUNING', '1')
os.putenv('PYTORCH_TUNABLEOP_RECORD_UNTUNED', '0')
tunable.tune_gemm_in_file("tunableop_untuned0.csv")
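For completeness, here is a sketch of step 1 applied to a toy workload; the matrix shapes are placeholders, and the environment variables are the ones listed in step 1, set before they are first read by TunableOp:
.. code-block:: python
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"         # TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "0"          # do not tune during this run
os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"  # record GEMMs to tunableop_untuned0.csv
import torch
# Placeholder workload; in practice this is your real model or training step.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b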
It is also possible to take multiple untuned files and distribute the GEMMs for tuning to multiple GPUs
within a single node. In the first step, the GEMMs are gathered and duplicate GEMMs are eliminated.
Next, the GEMMs are distributed to different GPUs for tuning. After all GEMMs are tuned, the results from
all the GPUs are then gathered into a single file whose base filename has ``_full0`` appended to it
(for example ``tunableop_results_full0.csv``). Finally, this new file, containing the gathered results, is
duplicated N times, once for each GPU, as a convenience for the user, who will run the workload with the
tuned configuration on N GPUs.
.. code-block:: python
import torch.cuda.tunable as tunable
if __name__ == "__main__":
    num_gpus = 8 # number of GPUs that will be used during the tuning process
    tunable.mgpu_tune_gemm_in_file("tunableop_untuned?.csv", num_gpus)
Note that the usage of the ``mgpu_tune_gemm_in_file`` API differs from its single-GPU counterpart
(``tune_gemm_in_file``). The body of the Python script that calls the API must be wrapped in an
``if __name__ == "__main__":`` block as shown, due to the use of the ``concurrent.futures`` module. The
argument to ``mgpu_tune_gemm_in_file`` must contain a wildcard expression (``?`` or ``*``) to generate the
list of untuned files containing the GEMMs to be processed. The ``num_gpus`` value must be between 1 and the
total number of GPUs available.
Tuning Context
==============
The behavior of TunableOp is currently manipulated through environment
variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
torch.cuda.tunable python interfaces that wrap the C++ TuningContext. The
environment variables take precedence over any setting you manipulate using the
C++ or Python APIs.
torch.cuda.tunable python interfaces. The environment variables take precedence
over any setting you manipulate using the C++ or Python APIs.
Environment Variable Interface
------------------------------
Environment variables are cached the first time they are read. You cannot use the
environment variable interface programmatically since the settings become fixed.
Use the C++ or Python APIs instead.
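As an illustration of the programmatic route, a short sketch using the ``torch.cuda.tunable`` wrappers; the particular getters and setters shown should be verified against the ``torch.cuda.tunable`` module documentation, and the filename is a placeholder:
.. code-block:: python
import torch.cuda.tunable as tunable
tunable.enable(True)                     # counterpart of PYTORCH_TUNABLEOP_ENABLED=1
tunable.tuning_enable(True)              # counterpart of PYTORCH_TUNABLEOP_TUNING=1
tunable.set_filename("my_results.csv")   # placeholder name; counterpart of PYTORCH_TUNABLEOP_FILENAME
print(tunable.is_enabled(), tunable.tuning_is_enabled())
print(tunable.get_max_tuning_iterations(), tunable.get_max_tuning_duration())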
"""
import concurrent.futures