
Distributed inference examples with PiPPy

This repo contains a variety of tutorials for using the PiPPy pipeline parallelism library with accelerate. You will find examples covering:

  1. How to trace the model using accelerate.prepare_pippy (see the sketch after this list)
  2. How to specify inputs based on what the model expects (when to use kwargs, args, and such)
  3. How to gather the results at the end
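
The scripts all follow the same basic flow. Below is a condensed sketch in the spirit of bert.py; treat it as a sketch rather than a verbatim copy: the split_points, example_args, and gather_output arguments follow the prepare_pippy signature in recent accelerate releases, and the exact input handling differs per script.

import torch
from transformers import AutoModelForMaskedLM
from accelerate import PartialState, prepare_pippy

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Example inputs drive the trace; their shapes should match what you
# plan to feed at inference time.
example_input = torch.randint(0, model.config.vocab_size, (2, 512), dtype=torch.int64)

# Trace the model and split it into pipeline stages across the available GPUs.
model = prepare_pippy(model, split_points="auto", example_args=(example_input,))

with torch.no_grad():
    output = model(example_input)

# By default the output only materializes on the last process; pass
# gather_output=True to prepare_pippy to have it sent to all processes.
if PartialState().is_last_process:
    print(output)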

Installation

This requires the main branch of accelerate (or accelerate v0.27.0 or later), torchpippy 0.2.0 or later, and Python 3.9 or later. Install with pip install . to pull the dependencies from the setup.py in this repo, or install them manually:

pip install 'accelerate>=0.27.0' 'torchpippy>=0.2.0'
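
To confirm which versions ended up installed:

pip show accelerate torchpippy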

Running code

You can run each script either with torchrun or, the recommended way, with accelerate launch (no accelerate config step is needed):

accelerate launch bert.py

Or:

accelerate launch --num_processes {NUM_GPUS} bert.py

Or:

torchrun --nproc-per-node {NUM_GPUS} bert.py
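
For example, to pipeline a model across both GPUs of a 2-GPU machine (the configuration used for the benchmarks below):

accelerate launch --num_processes 2 bert.py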

General speedups

One can expect PiPPy to outperform native, sequential model parallelism by a multiplicative factor, since all GPUs are busy with inputs at all times, rather than one input passing through one GPU at a time while the other GPUs sit idle waiting for the prior stage to finish.
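As a rough idealized model (an assumption for intuition, not something measured from these scripts): if the model is split into n stages of equal latency t and a batch is split into m micro-batches, then

\[
T_{\mathrm{sequential}} = m \, n \, t,
\qquad
T_{\mathrm{pipeline}} = (n + m - 1)\, t
\]

so steady-state throughput approaches n times the sequential rate as m grows, while the first chunks pay a one-time pipeline-fill (and tracing) cost. This matches the pattern in the numbers below: a slower first batch, then a faster steady state.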

Below are some benchmarks we measured with the accelerate + PiPPy integration for a few models, running on 2x RTX 4090s:

Bert

                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2137s                 0.3119s
Average of 5 batches   0.0099s                 0.0062s

GPT2

                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.1959s                 0.4189s
Average of 5 batches   0.0205s                 0.0126s

T5

                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2789s                 0.3809s
Average of 5 batches   0.0198s                 0.0166s