[Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
commit 08baaa7d63
parent bb94b65da7
committed by PyTorch MergeBot
@@ -3,9 +3,6 @@
 TunableOp
 =========
 
-.. note::
-    This is a prototype feature, which means it is at an early stage
-    for feedback and testing, and its components are subject to change.
 
 Overview
 --------
@@ -46,8 +46,8 @@ like so::
   Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
   Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
   Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
-  GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
-  GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
+  GemmTunableOp_float_NT,nt_25088_4096_64,Gemm_Hipblaslt_1219,1.262
+  GemmTunableOp_float_NT,nt_4096_4096_64,Gemm_Rocblas_1216,0.033
 
 Note the "Validator" lines. If you change a library version, or ROCm version, or
 PyTorch version, TunableOp will detect this and reject the tunings file because
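For context on the file format shown above, a minimal sketch of producing and inspecting such a tunings file through the Python API (assuming a ROCm build with hipBLASLt/rocBLAS available; the matrix shapes are illustrative):

.. code-block:: python

    import torch
    import torch.cuda.tunable as tunable

    tunable.enable(True)         # turn TunableOp on
    tunable.tuning_enable(True)  # allow new solutions to be tuned

    # An eligible GEMM routed through TunableOp triggers tuning; a GEMM of
    # this shape would produce an entry like GemmTunableOp_float_NT,nt_25088_4096_64.
    a = torch.randn(25088, 64, device="cuda")
    b = torch.randn(4096, 64, device="cuda")
    c = a @ b.t()

    tunable.write_file()             # flush results (also written on process exit by default)
    print(tunable.get_validators())  # the "Validator" lines shown above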
@@ -73,13 +73,18 @@ completely silent, besides file output, unless there is a warning or error
 during its use. The verbose option is only available by setting the environment
 variable PYTORCH_TUNABLEOP_VERBOSE=1.
 
-A Note on Tuning Behavior
-=========================
+A Note on Tuning Behavior, Warmup, and Cache Effects
+====================================================
 
 Tuning an operator consists of iterating through the list of registered
 implementations and profiling each one. The profile is established by running a
 single implementation in a loop multiple times and taking the average execution
-time.
+time. There is also an optional warmup phase prior to tuning that can help the
+hardware reach stable power states. During tuning of a workload, the various
+hardware caches are more likely to produce hits than when not tuning. There are
+options for flushing the instruction cache and rotating the input tensors, which
+might help produce a more faithful profile of the tuned operator, as if the
+operator were run within a larger workload instead of in a tight, repetitive loop.
 
 By default, each possible solution for a given operator will be run for either
 100 iterations or as many iterations as can be run within 30ms, whichever is
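A sketch of adjusting the per-solution profiling budget from Python, with the warmup and cache-behavior knobs set through environment variables (variable names taken from the TunableOp README linked in the commit message; they must be set before TunableOp first reads them):

.. code-block:: python

    import os

    # Optional warmup and cache-behavior knobs (names per the README); these
    # are read once, so set them before the first TunableOp use.
    os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS"] = "100"
    os.environ["PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED"] = "1"

    import torch.cuda.tunable as tunable

    # Per-solution profiling budget: whichever limit is reached first wins.
    tunable.set_max_tuning_iterations(100)
    tunable.set_max_tuning_duration(30)  # milliseconds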
@@ -102,14 +107,76 @@ or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
 given set of input arguments (transa, transb, m, n, k) will attempt to use the
 fastest available implementation across both rocblas and hipblaslt.
 
+Offline Tuning
+==============
+
+Motivation
+----------
+There are several use cases for offline tuning.
+
+One use case involves a workload with high memory utilization, where regular tuning might lead to running out of memory.
+
+Another use case is for compute-intensive workloads. In such cases, it is more resource-efficient to collect
+the GEMMs for the workload once and then tune repeatedly with different tuning parameters or libraries.
+
+Workflow
+--------
+There are two steps:
+
+1) Set the environment variables to collect the untuned GEMMs; this will generate ``tunableop_untuned0.csv``:
+
+.. code-block:: bash
+
+    PYTORCH_TUNABLEOP_ENABLED=1
+    PYTORCH_TUNABLEOP_TUNING=0
+    PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
+    ...
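Equivalently, step 1 can be driven from inside the workload script itself; a minimal sketch (assuming the variables are set before TunableOp first reads them, and before any GEMM executes):

.. code-block:: python

    import os

    # These must be set before TunableOp reads them for the first time.
    os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    os.environ["PYTORCH_TUNABLEOP_TUNING"] = "0"
    os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"

    import torch  # imported after the variables are set

    # ... run the workload; untuned GEMMs encountered are recorded to
    # tunableop_untuned0.csv rather than tuned inline.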
+2) Run a Python script that reads ``tunableop_untuned0.csv`` and generates ``tunableop_results0.csv``, like this:
+
+.. code-block:: python
+
+    import torch.cuda.tunable as tunable
+    import os
+
+    os.putenv('PYTORCH_TUNABLEOP_ENABLED', '1')
+    os.putenv('PYTORCH_TUNABLEOP_TUNING', '1')
+    os.putenv('PYTORCH_TUNABLEOP_RECORD_UNTUNED', '0')
+    tunable.tune_gemm_in_file("tunableop_untuned0.csv")
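As a sanity check after step 2, the tuned entries and their validators can be inspected in the same session before the CSV is shipped; a sketch, with ``tunable`` imported as above:

.. code-block:: python

    # get_results() returns the tuned entries; get_validators() returns the
    # environment fingerprint (the "Validator" lines) they are valid for.
    for entry in tunable.get_results():
        print(entry)
    for validator in tunable.get_validators():
        print(validator)

In the production run, the generated ``tunableop_results0.csv`` is then consumed by setting ``PYTORCH_TUNABLEOP_ENABLED=1`` and ``PYTORCH_TUNABLEOP_TUNING=0``, so the tuned solutions are used without further tuning.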
+
+It is also possible to take multiple untuned files and distribute the GEMMs for tuning to multiple GPUs
+within a single node. First, the GEMMs are gathered and duplicate GEMMs are eliminated.
+Next, the GEMMs are distributed to different GPUs for tuning. After all GEMMs are tuned, the results from
+all the GPUs are gathered into a single file whose base filename has ``_full0`` appended to it
+(for example ``tunableop_results_full0.csv``). Finally, this new file, containing the gathered results, is
+duplicated N times, once for each GPU, as a convenience for the user who will run the workload with the
+tuned configuration on N GPUs.
+
+.. code-block:: python
+
+    import torch.cuda.tunable as tunable
+
+    if __name__ == "__main__":
+        num_gpus = 8  # number of GPUs that will be used during the tuning process
+        tunable.mgpu_tune_gemm_in_file("tunableop_untuned?.csv", num_gpus)
+
+Note that the usage of the ``mgpu_tune_gemm_in_file`` API is different from its single-GPU counterpart
+(``tune_gemm_in_file``). The body of the Python script that calls the API must be wrapped in an
+``if __name__ == "__main__":`` guard as shown, due to the use of the ``concurrent.futures`` module. The
+argument to ``mgpu_tune_gemm_in_file`` must contain a wildcard expression (``?`` or ``*``) to generate the
+list of untuned files containing the GEMMs to be processed. ``num_gpus`` must be between 1 and the total
+number of GPUs available.
 
 Tuning Context
 ==============
 
 The behavior of TunableOp is currently manipulated through environment
 variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
-torch.cuda.tunable python interfaces that wrap the C++ TuningContext. The
-environment variables take precedence over any setting you manipulate using the
-C++ or Python APIs.
+torch.cuda.tunable python interfaces. The environment variables take precedence
+over any setting you manipulate using the C++ or Python APIs.
+
+Environment Variable Interface
+------------------------------
+Environment variables are cached the first time they are read. You cannot use the
+environment variable interface programmatically since the settings become fixed.
+Use the C++ or Python APIs instead.
"""
|
||||
import concurrent.futures
|
||||
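Since the environment variable interface cannot be used programmatically, a sketch of the equivalent control flow through the Python wrappers that the note above recommends (the filename is illustrative):

.. code-block:: python

    import torch.cuda.tunable as tunable

    tunable.enable(True)                    # ~ PYTORCH_TUNABLEOP_ENABLED=1
    tunable.tuning_enable(True)             # ~ PYTORCH_TUNABLEOP_TUNING=1
    tunable.set_filename("my_results.csv")  # ~ PYTORCH_TUNABLEOP_FILENAME
    assert tunable.is_enabled()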