Files
pytorch/docs/source/distributed.tensor.parallel.md
ggsmith842 799443605b Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst (#155297)
Fixes #155019

## Description
Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst

## Checklist
- [X] dlpack.rst converted to dlpack.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/dlpack.html)
- [X] distributions.rst converted to distributions.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributions.html)
- [X] distributed.tensor.rst converted to distributed.tensor.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.html)
- [X] distributed.tensor.parallel.rst converted to distributed.tensor.parallel.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.parallel.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155297
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:08:37 +00:00

2.8 KiB

:::{role} hidden :class: hidden-section :::

Tensor Parallelism - torch.distributed.tensor.parallel

Tensor Parallelism(TP) is built on top of the PyTorch DistributedTensor (DTensor)[https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md] and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

:::{warning} Tensor Parallelism APIs are experimental and subject to change. :::

The entrypoint to parallelize your nn.Module using Tensor Parallelism is:

.. automodule:: torch.distributed.tensor.parallel
.. currentmodule:: torch.distributed.tensor.parallel
.. autofunction::  parallelize_module

Tensor Parallelism supports the following parallel styles:

.. autoclass:: torch.distributed.tensor.parallel.ColwiseParallel
  :members:
  :undoc-members:
.. autoclass:: torch.distributed.tensor.parallel.RowwiseParallel
  :members:
  :undoc-members:
.. autoclass:: torch.distributed.tensor.parallel.SequenceParallel
  :members:
  :undoc-members:

To simply configure the nn.Module's inputs and outputs with DTensor layouts and perform necessary layout redistributions, without distribute the module parameters to DTensors, the following ParallelStyle s can be used in the parallelize_plan when calling parallelize_module:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInput
  :members:
  :undoc-members:
.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleOutput
  :members:
  :undoc-members:
.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInputOutput
  :members:
  :undoc-members:

:::{note} when using the Shard(dim) as the input/output layouts for the above ParallelStyle s, we assume the input/output activation tensors are evenly sharded on the tensor dimension dim on the DeviceMesh that TP operates on. For instance, since RowwiseParallel accepts input that is sharded on the last dimension, it assumes the input tensor has already been evenly sharded on the last dimension. For the case of uneven sharded activation tensors, one could pass in DTensor directly to the partitioned modules, and use use_local_output=False to return DTensor after each ParallelStyle, where DTensor could track the uneven sharding information. :::

For models like Transformer, we recommend users to use ColwiseParallel and RowwiseParallel together in the parallelize_plan for achieve the desired sharding for the entire model (i.e. Attention and MLP).

Parallelized cross-entropy loss computation (loss parallelism), is supported via the following context manager:

.. autofunction:: torch.distributed.tensor.parallel.loss_parallel

:::{warning} The loss_parallel API is experimental and subject to change. :::