**RFC: Problem statement**

Intel oneMKL and oneDNN are used to accelerate performance on Intel platforms. Both libraries provide a verbose mode that dumps detailed operator execution information together with execution times. These verbose messages are very helpful for performance profiling. However, the verbose mode applies to the entire execution. In many scenarios we only want to profile part of the execution process. This feature exposes PyTorch API functions to control oneDNN and oneMKL verbose functionality at runtime.

**Additional context**

The most common performance profiling flow is shown in the following code snippet:

```
def inference(model, inputs):
    # step0 (optional): jit
    model = torch.jit.trace(model, inputs)
    # step1: warmup
    for _ in range(100):
        model(inputs)
    # step2: performance profiling. We only care about the profiling result,
    # as well as the oneDNN and oneMKL verbose messages, of this step
    model(inputs)
    # step3 (optional): benchmarking
    t0 = time.time()
    for _ in range(100):
        model(inputs)
    t1 = time.time()
    print('dur: {}'.format((t1 - t0) / 100))
    return model(inputs)
```

Since the environment variables `MKL_VERBOSE` and `DNNL_VERBOSE` affect the entire process, we get a large number of verbose messages for all 101 iterations (if step3 is not involved), while we only care about the verbose messages dumped in step2. Filtering the unnecessary messages out is very difficult in complicated usage scenarios, and jit trace adds further undesired verbose messages. Furthermore, there are more complicated topologies or usages, such as cascaded topologies:

```
model1 = Model1()
model2 = Model2()
model3 = Model3()
x1 = inference(model1, x)
x2 = inference(model2, x1)
y = inference(model3, x2)
```

In many cases it is very hard to split these child topologies out, so it is not possible to investigate the performance of each individual topology with `DNNL_VERBOSE` and `MKL_VERBOSE` alone. To solve this issue, oneDNN and oneMKL provide API functions that make it possible to control the verbose functionality at runtime:

```
int mkl_verbose(int enable)
status dnnl::set_verbose(int level)
```

oneDNN and oneMKL print verbose messages to stdout when oneMKL or oneDNN ops are executed. Sample verbose messages:

```
MKL_VERBOSE SGEMM(t,n,768,2048,3072,0x7fff64115800,0x7fa1aca58040,3072,0x1041f5c0,3072,0x7fff64115820,0x981f0c0,768) 8.52ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:44
dnnl_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_training,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb16ic768oc768,0.0839844
```

**Design and implementation**

The design is to provide Python-facing wrapper functions that invoke `mkl_verbose` and `dnnl::set_verbose`.

**Design concern**

- Wrapper C++ functions for `mkl_verbose` and `dnnl::set_verbose` need to be added in torch/csrc and aten/csrc.
- Python API functions will be added to the device-specific backends:
  - `with torch.backends.mkl.verbose(1):`
  - `with torch.backends.mkldnn.verbose(1):`

**Use cases**

```
def inference(model, inputs):
    # step0 (optional): jit
    model = torch.jit.trace(model, inputs)
    # step1: warmup
    for _ in range(100):
        model(inputs)
    # step2: performance profiling
    with torch.backends.mkl.verbose(1), torch.backends.mkldnn.verbose(1):
        model(inputs)
    # step3 (optional): benchmarking
    t0 = time.time()
    for _ in range(100):
        model(inputs)
    t1 = time.time()
    print('dur: {}'.format((t1 - t0) / 100))
    return model(inputs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63212
Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet
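
For illustration, below is a minimal sketch of what such a scoped verbose control could look like on the Python side. It assumes a binding named `torch._C._verbose.mkl_set_verbose` for the C++ wrapper around `mkl_verbose`; that name is an assumption made for this sketch, not a symbol confirmed by the PR text.

```
import torch

VERBOSE_OFF = 0
VERBOSE_ON = 1

class verbose:
    """Scoped oneMKL verbose control (sketch, not the shipped implementation)."""

    def __init__(self, enable):
        self.enable = enable

    def __enter__(self):
        if self.enable != VERBOSE_OFF:
            # Assumed binding to the C++ wrapper of mkl_verbose().
            torch._C._verbose.mkl_set_verbose(self.enable)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Restore the quiet default when leaving the scope.
        torch._C._verbose.mkl_set_verbose(VERBOSE_OFF)
        return False
```

The oneDNN counterpart (`torch.backends.mkldnn.verbose`) would be analogous, forwarding its level to the wrapper around `dnnl::set_verbose`. The rendered documentation for both context managers appears in the `torch.backends` page below.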
.. role:: hidden
    :class: hidden-section

torch.backends
==============
.. automodule:: torch.backends

`torch.backends` controls the behavior of various backends that PyTorch supports.

These backends include:

- ``torch.backends.cuda``
- ``torch.backends.cudnn``
- ``torch.backends.mkl``
- ``torch.backends.mkldnn``
- ``torch.backends.openmp``

torch.backends.cuda
^^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.cuda

.. autofunction:: torch.backends.cuda.is_built

.. attribute:: torch.backends.cuda.matmul.allow_tf32

    A :class:`bool` that controls whether TensorFloat-32 tensor cores may be used in matrix
    multiplications on Ampere or newer GPUs. See :ref:`tf32_on_ampere`.

.. attribute:: torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction

    A :class:`bool` that controls whether reduced precision reductions (e.g., with fp16 accumulation type) are allowed with fp16 GEMMs.

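For illustration, a short example of toggling the matmul TF32 flag described above (it only takes effect on Ampere or newer GPUs; the values shown are illustrative):

.. code-block:: python

    import torch

    # Permit TF32 tensor cores for float32 matrix multiplications.
    torch.backends.cuda.matmul.allow_tf32 = True

    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b  # may execute on TF32 tensor cores
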
.. attribute:: torch.backends.cuda.cufft_plan_cache

    ``cufft_plan_cache`` caches the cuFFT plans.

    .. attribute:: size

        A readonly :class:`int` that shows the number of plans currently in the cuFFT plan cache.

    .. attribute:: max_size

        A :class:`int` that controls the capacity of the cuFFT plan cache.

    .. method:: clear()

        Clears the cuFFT plan cache.

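An illustrative use of the plan-cache attributes documented above (the capacity value is arbitrary):

.. code-block:: python

    import torch

    cache = torch.backends.cuda.cufft_plan_cache
    cache.max_size = 32                 # cap the number of cached cuFFT plans
    x = torch.randn(64, 1024, device="cuda")
    torch.fft.fft(x)                    # creates (or reuses) a cached plan
    print(cache.size)                   # plans currently in the cache
    cache.clear()                       # drop all cached plans
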
.. autofunction:: torch.backends.cuda.preferred_linalg_library

torch.backends.cudnn
^^^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.cudnn

.. autofunction:: torch.backends.cudnn.version

.. autofunction:: torch.backends.cudnn.is_available

.. attribute:: torch.backends.cudnn.enabled

    A :class:`bool` that controls whether cuDNN is enabled.

.. attribute:: torch.backends.cudnn.allow_tf32

    A :class:`bool` that controls whether TensorFloat-32 tensor cores may be used in cuDNN
    convolutions on Ampere or newer GPUs. See :ref:`tf32_on_ampere`.

.. attribute:: torch.backends.cudnn.deterministic

    A :class:`bool` that, if True, causes cuDNN to only use deterministic convolution algorithms.
    See also :func:`torch.are_deterministic_algorithms_enabled` and
    :func:`torch.use_deterministic_algorithms`.

.. attribute:: torch.backends.cudnn.benchmark

    A :class:`bool` that, if True, causes cuDNN to benchmark multiple convolution algorithms
    and select the fastest.

.. attribute:: torch.backends.cudnn.benchmark_limit

    A :class:`int` that specifies the maximum number of cuDNN convolution algorithms to try when
    `torch.backends.cudnn.benchmark` is True. Set `benchmark_limit` to zero to try every
    available algorithm. Note that this setting only affects convolutions dispatched via the
    cuDNN v8 API.

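An illustrative combination of the cuDNN flags documented above (the values shown are examples, not recommendations):

.. code-block:: python

    import torch

    # Benchmark several convolution algorithms and keep the fastest one.
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.benchmark_limit = 10   # only affects the cuDNN v8 API

    # For reproducible results, prefer the deterministic path instead:
    # torch.backends.cudnn.benchmark = False
    # torch.backends.cudnn.deterministic = True

    conv = torch.nn.Conv2d(3, 16, kernel_size=3).cuda()
    out = conv(torch.randn(8, 3, 224, 224, device="cuda"))
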
torch.backends.mps
^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.mps

.. autofunction:: torch.backends.mps.is_available

.. autofunction:: torch.backends.mps.is_built

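A minimal availability check using the two functions above:

.. code-block:: python

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.ones(2, 2, device=device)
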
torch.backends.mkl
^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.mkl

.. autofunction:: torch.backends.mkl.is_available

.. autoclass:: torch.backends.mkl.verbose

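An illustrative use of the ``verbose`` context manager, following the use case from the RFC above (1 enables oneMKL verbose output for the enclosed region, 0 leaves it off):

.. code-block:: python

    import torch

    a = torch.randn(256, 256)
    b = torch.randn(256, 256)
    # Only this region emits oneMKL verbose messages (when PyTorch is built with MKL).
    with torch.backends.mkl.verbose(1):
        torch.mm(a, b)
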
torch.backends.mkldnn
^^^^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.mkldnn

.. autofunction:: torch.backends.mkldnn.is_available

.. autoclass:: torch.backends.mkldnn.verbose

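The oneDNN counterpart works the same way; per the RFC, the level is forwarded to ``dnnl::set_verbose`` (in oneDNN, level 1 reports primitive execution and level 2 additionally reports primitive creation):

.. code-block:: python

    import torch

    conv = torch.nn.Conv2d(3, 16, kernel_size=3)
    x = torch.randn(1, 3, 64, 64)
    # Dump oneDNN execution messages only for this call
    # (when the convolution dispatches to oneDNN on this build).
    with torch.backends.mkldnn.verbose(1):
        conv(x)
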
torch.backends.openmp
^^^^^^^^^^^^^^^^^^^^^
.. automodule:: torch.backends.openmp

.. autofunction:: torch.backends.openmp.is_available

.. Docs for other backends need to be added here.
.. Automodules are just here to ensure checks run but they don't actually
.. add anything to the rendered page for now.
.. py:module:: torch.backends.quantized
.. py:module:: torch.backends.xnnpack