Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied to the same module at the same time. The `PrepareModuleInputOutput` in this PR initializes two variables, `prepare_module_input` and `prepare_module_output`, and uses them to process the module / inputs / outputs. I had another implementation that put all the code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit from the monolithic `PrepareModuleInputOutput`, but it is (1) less clean and (2) conceptually an abuse of inheritance, because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol

.. role:: hidden
    :class: hidden-section

Tensor Parallelism - torch.distributed.tensor.parallel
======================================================

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor
(`DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md>`__)
and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

.. warning::
    Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your ``nn.Module`` using Tensor Parallelism is:

.. automodule:: torch.distributed.tensor.parallel

.. currentmodule:: torch.distributed.tensor.parallel

.. autofunction:: parallelize_module
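
As an illustration, below is a minimal sketch of calling ``parallelize_module``
on a toy model. The mesh size, device type, and the module name used as the
plan key are assumptions for this example, and it is meant to be launched with
``torchrun`` so that a default process group is available.

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

    # assumes the script is launched with torchrun so the default process
    # group can be initialized, e.g.:
    #   torchrun --standalone --nproc_per_node=2 example.py
    tp_mesh = init_device_mesh("cuda", (2,))

    model = nn.Sequential(nn.Linear(16, 32))

    # the plan keys are fully qualified submodule names inside ``model``;
    # "0" is the single Linear layer of the Sequential above
    parallelize_module(model, tp_mesh, {"0": ColwiseParallel()})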

Tensor Parallelism supports the following parallel styles:

.. autoclass:: torch.distributed.tensor.parallel.ColwiseParallel
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.RowwiseParallel
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.SequenceParallel
  :members:
  :undoc-members:
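
To illustrate how these styles combine, below is a hedged sketch that applies
them to a toy norm + MLP block (the module class, submodule names, and mesh
size are assumptions for this example): ``SequenceParallel`` keeps the norm's
activations sharded on the sequence dimension, ``ColwiseParallel`` shards
``w1``'s weight on the output-feature dimension, and ``RowwiseParallel``
shards ``w2``'s weight on the input-feature dimension.

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        SequenceParallel,
        parallelize_module,
    )


    class NormMLP(nn.Module):
        """Toy norm + MLP block used only for this sketch."""

        def __init__(self, dim: int = 64, hidden: int = 256) -> None:
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.w1 = nn.Linear(dim, hidden)
            self.w2 = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.w2(self.w1(self.norm(x)).relu())


    # launched via torchrun; activations are assumed to be (batch, seq, hidden)
    tp_mesh = init_device_mesh("cuda", (2,))
    block = NormMLP()

    parallelize_module(
        block,
        tp_mesh,
        {
            # norm parameters stay replicated; its activations are treated as
            # sharded on the sequence dimension (dim 1)
            "norm": SequenceParallel(),
            # w1: weight sharded column-wise; its input arrives sequence-sharded
            # from SequenceParallel, hence input_layouts=Shard(1)
            "w1": ColwiseParallel(input_layouts=Shard(1)),
            # w2: weight sharded row-wise; shard the output back onto the
            # sequence dimension to keep the activations sequence-parallel
            "w2": RowwiseParallel(output_layouts=Shard(1)),
        },
    )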

To simply configure the nn.Module's inputs and outputs with DTensor layouts
and perform the necessary layout redistributions, without distributing the module
parameters to DTensors, the following ``ParallelStyle`` s can be used in
the ``parallelize_plan`` when calling ``parallelize_module``:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInput
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleOutput
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInputOutput
  :members:
  :undoc-members:
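
Below is a hedged sketch of these styles inside a ``parallelize_plan``. The
model structure, submodule names, layouts, and mesh size are assumptions for
this example, and the keyword arguments of ``PrepareModuleInputOutput`` are
assumed to mirror those of ``PrepareModuleInput`` and ``PrepareModuleOutput``
combined.

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Replicate, Shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        PrepareModuleInputOutput,
        RowwiseParallel,
        parallelize_module,
    )


    class Model(nn.Module):
        """Toy model whose ``mlp`` submodule gets its I/O layouts annotated."""

        def __init__(self, dim: int = 64, hidden: int = 256) -> None:
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
            )

        def forward(self, x):
            return self.mlp(x)


    # launched via torchrun; mesh and layer sizes are assumptions
    tp_mesh = init_device_mesh("cuda", (2,))
    model = Model()

    parallelize_module(
        model,
        tp_mesh,
        {
            # a plan key can carry only one style, so preparing both the inputs
            # and the outputs of ``mlp`` goes through the combined class
            "mlp": PrepareModuleInputOutput(
                input_layouts=Shard(1),            # activations arrive sequence-sharded
                desired_input_layouts=Replicate(),
                output_layouts=Replicate(),        # mlp.2 below emits replicated outputs
                desired_output_layouts=Shard(1),   # hand back sequence-sharded activations
            ),
            # these two styles actually shard the Linear parameters onto the mesh
            "mlp.0": ColwiseParallel(),
            "mlp.2": RowwiseParallel(),
        },
    )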

.. note:: When using ``Shard(dim)`` as the input/output layouts for the above
  ``ParallelStyle`` s, we assume the input/output activation tensors are evenly sharded on
  the tensor dimension ``dim`` on the ``DeviceMesh`` that TP operates on. For instance,
  since ``RowwiseParallel`` accepts input that is sharded on the last dimension, it assumes
  the input tensor has already been evenly sharded on the last dimension. For the case of
  unevenly sharded activation tensors, one could pass DTensors directly into the partitioned
  modules and use ``use_local_output=False`` to return DTensors after each ``ParallelStyle``,
  so that the DTensors can track the uneven sharding information.
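
For instance, below is a minimal hedged sketch of the ``use_local_output=False``
knob (the module layout and sizes are assumptions for this example); the last
layer then hands a DTensor, together with its sharding metadata, to whatever
consumes its output.

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    # launched via torchrun; mesh and layer sizes are assumptions
    tp_mesh = init_device_mesh("cuda", (2,))
    mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))

    parallelize_module(
        mlp,
        tp_mesh,
        {
            "0": ColwiseParallel(),
            # keep the output as a DTensor (sharded on dim 1 here) instead of a
            # plain local tensor, so the sharding information stays attached
            "2": RowwiseParallel(output_layouts=Shard(1), use_local_output=False),
        },
    )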

For models like Transformer, we recommend users use ``ColwiseParallel``
and ``RowwiseParallel`` together in the ``parallelize_plan`` to achieve the desired
sharding for the entire model (i.e. Attention and MLP).
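
For example, below is a hedged sketch of such a plan for a single Transformer
block; the submodule names follow a common Llama-style layout and are
assumptions about the model, not part of the API.

.. code-block:: python

    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

    # column-wise followed by row-wise pairs keep the intermediate activations
    # sharded and only communicate at the end of attention and of the MLP
    block_plan = {
        # self-attention: q/k/v projections column-wise, output projection row-wise
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        # MLP: up/gate projections column-wise, down projection row-wise
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
    }
    # applied with: parallelize_module(transformer_block, tp_mesh, block_plan)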

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager:

.. autofunction:: torch.distributed.tensor.parallel.loss_parallel
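
Below is a hedged usage sketch; the mesh, batch, and vocabulary sizes are
assumptions for this example, and in a real model the class-sharded logits
would typically come from a ``ColwiseParallel`` output projection configured
with ``use_local_output=False``.

.. code-block:: python

    import torch
    import torch.nn.functional as F
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor
    from torch.distributed.tensor.parallel import loss_parallel

    # launched via torchrun on 8 ranks (an assumption for this example)
    tp_mesh = init_device_mesh("cuda", (8,))

    # the logits must be a DTensor sharded on the class dimension
    logits = torch.randn(4, 16, device="cuda", requires_grad=True)
    dist_logits = distribute_tensor(logits, tp_mesh, placements=[Shard(1)])
    target = torch.randint(16, (4,), device="cuda")

    with loss_parallel():
        # cross-entropy is computed without gathering the full class dimension
        loss = F.cross_entropy(dist_logits, target, reduction="mean")
        loss.backward()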

.. warning::
    The ``loss_parallel`` API is experimental and subject to change.