.. _torch.compiler_get_started:

Getting Started
===============

Before you read this section, make sure to read the :ref:`torch.compiler_overview`.

Let's start by looking at a simple ``torch.compile`` example that demonstrates
how to use ``torch.compile`` for inference. The example uses ``torch.cos()``
and ``torch.sin()``, which are examples of pointwise operators: they operate
element by element on a vector. This example might not show significant
performance gains, but it should help you form an intuitive understanding of
how you can use ``torch.compile`` in your own programs.

.. note::
   To run this script, you need to have at least one GPU on your machine.
   If you do not have a GPU, you can remove the ``.to(device="cuda:0")`` code
   in the snippet below and it will run on CPU. You can also set the device to
   ``xpu:0`` to run on Intel® GPUs.

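If you want a script that picks a device automatically, the following is a
small sketch of that check (it assumes a PyTorch build where ``torch.xpu`` is
available for Intel® GPUs; adapt it to your setup):

.. code-block:: python

   import torch

   # Pick an accelerator if one is available, otherwise fall back to CPU.
   if torch.cuda.is_available():
       device = "cuda:0"
   elif hasattr(torch, "xpu") and torch.xpu.is_available():
       device = "xpu:0"
   else:
       device = "cpu"
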
.. code:: python

   import torch

   def fn(x):
       a = torch.cos(x)
       b = torch.sin(a)
       return b

   new_fn = torch.compile(fn, backend="inductor")
   input_tensor = torch.randn(10000).to(device="cuda:0")
   a = new_fn(input_tensor)

A more famous pointwise operator you might want to use would be something like
``torch.relu()``. Pointwise ops in eager mode are suboptimal because each one
would need to read a tensor from memory, make some changes, and then write
those changes back. The single most important optimization that inductor
performs is fusion. In the example above we can turn 2 reads (``x``, ``a``) and
2 writes (``a``, ``b``) into 1 read (``x``) and 1 write (``b``), which is
crucial especially for newer GPUs where the bottleneck is memory bandwidth
(how quickly you can send data to a GPU) rather than compute (how quickly your
GPU can crunch floating point operations).

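To see the effect of fusion on a memory-bound function, you can time the eager
and compiled versions of ``fn``. The snippet below is a minimal sketch using
``torch.utils.benchmark``; the tensor size is arbitrary and the exact speedup
will vary with your GPU.

.. code-block:: python

   import torch
   import torch.utils.benchmark as benchmark

   def fn(x):
       a = torch.cos(x)
       b = torch.sin(a)
       return b

   compiled_fn = torch.compile(fn, backend="inductor")

   x = torch.randn(10_000_000, device="cuda:0")
   compiled_fn(x)  # warm up once so compilation time is not measured

   # Timer handles CUDA synchronization, so the numbers reflect kernel time.
   eager_t = benchmark.Timer(stmt="f(x)", globals={"f": fn, "x": x}).blocked_autorange()
   compiled_t = benchmark.Timer(stmt="f(x)", globals={"f": compiled_fn, "x": x}).blocked_autorange()
   print(eager_t)
   print(compiled_t)
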
Another major optimization that inductor provides is automatic support for
CUDA graphs. CUDA graphs help eliminate the overhead of launching individual
kernels from a Python program, which is especially relevant for newer GPUs.

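You do not have to manage CUDA graphs yourself: compiling with
``mode="reduce-overhead"`` asks inductor to reduce launch overhead, using CUDA
graphs where possible. A minimal sketch:

.. code-block:: python

   import torch

   def fn(x):
       return torch.sin(torch.cos(x))

   # "reduce-overhead" trades a little extra memory for lower Python and
   # kernel-launch overhead, backed by CUDA graphs on CUDA devices.
   opt_fn = torch.compile(fn, mode="reduce-overhead")

   x = torch.randn(10000, device="cuda:0")
   opt_fn(x)  # the first calls compile and warm up
   opt_fn(x)  # later calls replay the captured work with minimal launch overhead
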
TorchDynamo supports many different backends, but TorchInductor specifically works
by generating `Triton <https://github.com/openai/triton>`__ kernels. Let's save
our example above into a file called ``example.py``. We can inspect the
generated Triton kernel code by running ``TORCH_COMPILE_DEBUG=1 python example.py``.
As the script executes, you should see ``DEBUG`` messages printed to the
terminal. Closer to the end of the log, you should see a path to a folder
that contains ``torchinductor_<your_username>``. In that folder, you can find
the ``output_code.py`` file that contains the generated kernel code similar to
the following:

.. code-block:: python

   @pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {'in_ptr0': '*fp32', 'out_ptr0': '*fp32', 'xnumel': 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
   @triton.jit
   def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
       xnumel = 10000
       xoffset = tl.program_id(0) * XBLOCK
       xindex = xoffset + tl.arange(0, XBLOCK)[:]
       xmask = xindex < xnumel
       x0 = xindex
       tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
       tmp1 = tl.cos(tmp0)
       tmp2 = tl.sin(tmp1)
       tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

.. note:: The above code snippet is an example. Depending on your hardware,
   you might see different code generated.

You can verify that fusing ``cos`` and ``sin`` did actually occur, because both
operations appear within a single Triton kernel and the temporary variables are
held in registers with very fast access.

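``TORCH_COMPILE_DEBUG`` is not the only way to look at the generated code. A
lighter-weight option is to enable the ``output_code`` log artifact, which
prints the generated kernels to the terminal (the sketch below assumes the
``torch._logging`` API in your build accepts ``output_code``; it is equivalent
to running the script with ``TORCH_LOGS="output_code"``):

.. code-block:: python

   import torch

   # Print inductor's generated output code to the terminal as it is compiled.
   torch._logging.set_logs(output_code=True)

   def fn(x):
       return torch.sin(torch.cos(x))

   opt_fn = torch.compile(fn, backend="inductor")
   opt_fn(torch.randn(10000, device="cuda:0"))
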
Read more on Triton's performance
`here <https://openai.com/blog/triton/>`__. Because the code is written
in Python, it's fairly easy to understand even if you have not written all that
many CUDA kernels.

Next, let's try a real model like resnet50 from the PyTorch hub.

.. code-block:: python

   import torch
   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
   opt_model = torch.compile(model, backend="inductor")
   opt_model(torch.randn(1,3,64,64))

Inductor is not the only available backend; you can run
``torch.compiler.list_backends()`` in a REPL to see all the available backends.
Try out the ``cudagraphs`` backend next as inspiration.

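For example (a small sketch; the exact list of backends depends on your
installation, and the ``cudagraphs`` backend needs a CUDA device):

.. code-block:: python

   import torch

   # Show which compiler backends are registered in this installation.
   print(torch.compiler.list_backends())

   def fn(x):
       return torch.sin(torch.cos(x))

   # Try the cudagraphs backend mentioned above.
   opt_fn = torch.compile(fn, backend="cudagraphs")
   opt_fn(torch.randn(10000, device="cuda:0"))
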
Using a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch users frequently leverage pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__, and one of
the design goals of TorchDynamo and TorchInductor is to work out of the box with
any model that people would like to author.

Let's download a pretrained model directly from the HuggingFace hub and optimize
it:

.. code-block:: python

   import torch
   from transformers import BertTokenizer, BertModel
   # Copy pasted from here https://huggingface.co/bert-base-uncased
   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
   model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
   model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
   text = "Replace me by any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
   output = model(**encoded_input)

If you remove the ``to(device="cuda:0")`` from the model and
``encoded_input``, then TorchInductor will generate C++ kernels that are
optimized for running on your CPU. You can inspect both the Triton and the C++
kernels for BERT. They are more complex than the trigonometry example we tried
above, but you can similarly skim through them and see if you understand how
PyTorch works.

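For reference, here is a sketch of the same example with the two
``.to(device="cuda:0")`` calls removed, so that inductor targets the CPU
instead:

.. code-block:: python

   import torch
   from transformers import BertTokenizer, BertModel

   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
   model = BertModel.from_pretrained("bert-base-uncased")  # stays on CPU
   model = torch.compile(model, backend="inductor")        # inductor emits C++ kernels for CPU

   text = "Replace me by any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)
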
Similarly, let's try out a TIMM example:

.. code-block:: python

   import timm
   import torch
   model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
   opt_model = torch.compile(model, backend="inductor")
   opt_model(torch.randn(64,3,7,7))

Next Steps
~~~~~~~~~~

In this section, we have reviewed a few inference examples and developed a
basic understanding of how ``torch.compile`` works. Here is what you can check
out next:

- `torch.compile tutorial on training <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- :ref:`torch.compiler_api`
- :ref:`torchdynamo_fine_grain_tracing`