[Docs][TunableOp] TunableOp documentation update (#148384)

This PR aligns the documentation with what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Author: Nichols A. Romero
Date: 2025-03-07 21:02:44 +00:00
Committed by: PyTorch MergeBot
parent bb94b65da7
commit 08baaa7d63
2 changed files with 75 additions and 11 deletions


@ -3,9 +3,6 @@
TunableOp
=========
.. note::
This is a prototype feature, which means it is at an early stage
for feedback and testing, and its components are subject to change.
Overview
--------


@ -46,8 +46,8 @@ like so::
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
GemmTunableOp_float_NT,nt_25088_4096_64,Gemm_Hipblaslt_1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,Gemm_Rocblas_1216,0.033
Note the "Validator" lines. If you change a library version, or ROCm version, or
PyTorch version, TunableOp will detect this and reject the tunings file because
@ -73,13 +73,18 @@ completely silent, besides file output, unless there is a warning or error
during its use. The verbose option is only available by setting the environment
variable PYTORCH_TUNABLEOP_VERBOSE=1.
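If you want to inspect the validators and tuning results from Python rather than by opening the CSV file, a minimal sketch follows; it assumes the ``get_validators()`` and ``get_results()`` helpers in ``torch.cuda.tunable``.
.. code-block:: python
import torch.cuda.tunable as tunable
# Print the validator entries (PyTorch/ROCm/library versions) and any
# tuning results currently held by the tuning context.
print(tunable.get_validators())
print(tunable.get_results())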
A Note on Tuning Behavior
=========================
A Note on Tuning Behavior, Warmup, and Cache Effects
====================================================
Tuning an operator consists of iterating through the list of registered
implementations and profiling each one. The profile is established by running a
single implementation in a loop multiple times and taking the average execution
time.
time. There is also an optional warmup phase prior to tuning that can help the
hardware reach stable power states. During tuning of a workload, the various
hardware caches are more likely to produce hits than when not tuning. There are
options to flush the instruction cache and rotate the input tensors, which might
help produce a more faithful profile of the tuned operator, as if the operator
were run within a larger workload instead of in a tight, repetitive loop.
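A small sketch of enabling these options through the environment, assuming the warmup, instruction-cache-flush, and rotating-buffer environment variables described in the README (the exact names and units should be checked against your PyTorch build):
.. code-block:: python
import os
# Must be set before TunableOp reads its configuration for the first time,
# since environment variables are cached once read.
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS"] = "10"    # warm up before profiling
os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"      # flush the instruction cache during tuning
os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = "256"    # rotate inputs through a 256 MB buffer
import torch  # import after the environment is configured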
By default, each possible solution for a given operator will be run for either
100 iterations or as many iterations as can be run within 30ms, whichever is
@ -102,14 +107,76 @@ or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
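For orientation, here is a minimal online-tuning sketch using the ``torch.cuda.tunable`` Python API; the matrix shapes are arbitrary and the setter calls simply restate the defaults mentioned above:
.. code-block:: python
import torch
import torch.cuda.tunable as tunable
tunable.enable(True)                    # turn TunableOp on
tunable.set_max_tuning_iterations(100)  # per-solution iteration cap
tunable.set_max_tuning_duration(30)     # per-solution time cap, in milliseconds
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = a @ b             # this GEMM is routed through TunableOp and tuned on first use
tunable.write_file()  # write the results file explicitly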
Offline Tuning
==============
Motivation
----------
There are several use cases for offline tuning.
One use case involves a workload with high memory utilization, where regular tuning might lead to running out of memory.
Another use case is for compute-intensive workloads. In such cases, it is more resource-efficient to collect
the GEMMs for the workload once and then tune repeatedly with different tuning parameters or libraries.
Workflow
--------
There are two basic steps:
1) Set the environment variables to collect the untuned GEMMs; running the workload will then generate ``tunableop_untuned0.csv`` (a runnable sketch of this collection step appears after step 2):
.. code-block:: python
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
...
2) Run a Python script that reads the ``tunableop_untuned0.csv`` and generates the ``tunableop_results0.csv``, like this:
.. code-block:: python
import torch.cuda.tunable as tunable
import os
os.putenv('PYTORCH_TUNABLEOP_ENABLED', '1')
os.putenv('PYTORCH_TUNABLEOP_TUNING', '1')
os.putenv('PYTORCH_TUNABLEOP_RECORD_UNTUNED', '0')
tunable.tune_gemm_in_file("tunableop_untuned0.csv")
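For completeness, here is a sketch of step 1 applied to a toy workload; the matrix shapes are placeholders, and the environment variables are the ones listed in step 1, set before they are first read by TunableOp:
.. code-block:: python
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"         # TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "0"          # do not tune during this run
os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"  # record GEMMs to tunableop_untuned0.csv
import torch
# Placeholder workload; in practice this is your real model or training step.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b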
It is also possible to take multiple untuned files and distribute the GEMMs for tuning to multiple GPUs
within a single node. In the first step, the GEMMs are gathered and duplicate GEMMs are eliminated.
Next, the GEMMs are distributed to different GPUs for tuning. After all GEMMs are tuned, the results from
all the GPUs are then gathered into a single file whose base filename has ``_full0`` appended to it
(for example ``tunableop_results_full0.csv``). Finally, this new file, containing the gathered results, is
duplicated N times, once for each GPU, as a convenience for the user, who will run the workload with the
tuned configuration on N GPUs.
.. code-block:: python
import torch.cuda.tunable as tunable
if __name__ == "__main__":
    num_gpus = 8 # number of GPUs that will be used during the tuning process
    tunable.mgpu_tune_gemm_in_file("tunableop_untuned?.csv", num_gpus)
Note that the usage of the ``mgpu_tune_gemm_in_file`` API differs from its single-GPU counterpart
(``tune_gemm_in_file``). The body of the Python script that calls the API must be wrapped in an
``if __name__ == "__main__":`` block as shown, due to the use of the ``concurrent.futures`` module. The
argument to ``mgpu_tune_gemm_in_file`` must contain a wildcard expression (``?`` or ``*``) to generate the
list of untuned files containing the GEMMs to be processed. The ``num_gpus`` value must be between 1 and the
total number of GPUs available.
Tuning Context
==============
The behavior of TunableOp is currently manipulated through environment
variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
torch.cuda.tunable python interfaces that wrap the C++ TuningContext. The
environment variables take precedence over any setting you manipulate using the
C++ or Python APIs.
torch.cuda.tunable python interfaces. The environment variables take precedence
over any setting you manipulate using the C++ or Python APIs.
Environment Variable Interface
------------------------------
Environment variables are cached the first time they are read. You cannot use the
environment variable interface programmatically since the settings become fixed.
Use the C++ or Python APIs instead.
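As an illustration of the programmatic route, a short sketch using the ``torch.cuda.tunable`` wrappers; the particular getters and setters shown should be verified against the ``torch.cuda.tunable`` module documentation, and the filename is a placeholder:
.. code-block:: python
import torch.cuda.tunable as tunable
tunable.enable(True)                     # counterpart of PYTORCH_TUNABLEOP_ENABLED=1
tunable.tuning_enable(True)              # counterpart of PYTORCH_TUNABLEOP_TUNING=1
tunable.set_filename("my_results.csv")   # placeholder name; counterpart of PYTORCH_TUNABLEOP_FILENAME
print(tunable.is_enabled(), tunable.tuning_is_enabled())
print(tunable.get_max_tuning_iterations(), tunable.get_max_tuning_duration())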
"""
import concurrent.futures