Compare commits

..

62 Commits

Author SHA1 Message Date
eebeb59a36 Fix accelerate tests command (#528) 2022-07-18 14:47:34 +02:00
be4b74f42f Release: v0.11.0 2022-07-18 08:27:58 -04:00
c93b3eb5d7 FSDP integration enhancements and fixes (#522)
* FSDP integration enhancements and fixes

* bug fixes

1. fix circular dependency
2. Add model print statement in FSDP example
3. minor fixes

* removing `always_wrap` as it is rarely useful

* removing comment

* resolving comments

* fsdp fp16 mp uses ShardedGradScaler

* fix import

* fix check

* add exception when class to wrap not found in model

* adding `FSDP_BACKWARD_PREFETCH`

* fix
2022-07-18 17:45:58 +05:30
3eea8ceee0 Warn user if no trackers are installed (#524) 2022-07-15 18:16:00 +02:00
7abc708be2 Fixup all example CI tests and properly fail (#517)
* Clean and make all tests pass
2022-07-15 18:15:45 +02:00
bb78b04cce fixing deepspeed multi-node launcher (#514)
* fixing deepspeed multi-node launcher

* minor fixes

* handling env variables for accelerate to correctly work

* resolving comments
2022-07-14 18:40:48 +05:30
7e6593756f Add special Parameters modules support (#519)
* Meta init/tensor_to_device logic for Int8 Parameters.

* add 8 bit support

* add special modules support

Co-authored-by: timdettmers <timdettmers@users.noreply.github.com>

* bad formatting

* bad formatting

* restoring the poor lines that were alone!

Co-authored-by: Tim Dettmers <tim.dettmers@gmail.com>
Co-authored-by: timdettmers <timdettmers@users.noreply.github.com>
2022-07-13 12:46:36 -04:00
960fd9d86a Don't unwrap in save_state() (#489) 2022-07-13 12:46:21 -04:00
70ca65a9a1 Fix a bug when reduce a tensor. (#513)
* return reduced result

* update doc for Accelerator.reduce

* update doc in Accelerator.reduce

* fix reduce behavior for gpu devices
2022-07-13 09:19:01 -04:00
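A minimal sketch of the fixed behavior described in the commit above (it assumes the `reduction="sum"` keyword; the metric value is illustrative): after this fix, `Accelerator.reduce` returns the reduced tensor on every process instead of returning nothing.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Each process holds a local value; reduce combines it across processes and,
# per this fix, the reduced tensor is returned to every process.
local_correct = torch.tensor(42.0, device=accelerator.device)
total_correct = accelerator.reduce(local_correct, reduction="sum")
print(total_correct.item())
```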
ea0d5368bd Add benchmarks (#506)
* Add benchmarks

* Oops! Forgot one file

* Update benchmarks/README.md

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-07-12 15:16:45 -04:00
78357f44b3 Add gradient accumulation doc (#511)
* Gradient accumulation doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-12 17:36:45 +02:00
c7526e9483 Make gradient accumulation work with dispatched dataloaders (#510)
* Make grad accum work with dispatch dl

* Split print over multiple lines
2022-07-12 17:12:39 +02:00
f5ef120e77 Fix DispatchDataLoader length when split_batches=True (#509) 2022-07-12 10:35:35 -04:00
3c1f97c386 SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging (#504)
* SageMaker DP and MP Support

* fix 😅

* removing SageMaker MP option

* adding support for custom image_uri, data location and metrics
2022-07-12 13:25:52 +05:30
a0514dd809 SageMaker DP Support (#494)
* SageMaker DP and MP Support

* fix 😅

* removing SageMaker MP option
2022-07-09 00:14:57 +05:30
b20f90ab17 Fix scheduler in gradient accumulation example (#500)
* Fix scheduler in gradient accumulation example

* Phrase better how the scheduler is stepped during grad accum

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-08 13:41:43 -04:00
cfb2a3e239 update dataloader wrappers to have total_batch_size attribute (#493)
* update dataloader wrappers to have `total_batch_size` attribute

* fix

* resolving comments

* Update src/accelerate/data_loader.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* add docstrings

* Update src/accelerate/data_loader.py

Co-authored-by: Zachary Mueller <muellerzr@gmail.com>

* docstrings iter 2 + quality + minor change in doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
2022-07-08 21:16:31 +05:30
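A minimal sketch of the new attribute added in the commit above (assuming `total_batch_size` is the per-device batch size multiplied by the number of processes when `split_batches=False`):

```python
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
dataloader = DataLoader(list(range(128)), batch_size=16)
dataloader = accelerator.prepare(dataloader)
# The prepared wrapper now exposes the effective batch size across all processes.
print(dataloader.total_batch_size)
```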
86ce737d7f Introduce automatic gradient accumulation wrapper + fix a few test issues (#484)
* Have accelerator handle gradient accumulation

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-07-05 15:49:36 -04:00
deffaba8d6 add use_distributed property (#487)
* add distributed property in accelerate_state

* ensure num_process > 1
2022-07-05 09:19:44 -04:00
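A minimal sketch of the property added above (assuming it lives on `AcceleratorState`, as the commit notes, and is `True` only when more than one process is launched):

```python
from accelerate import Accelerator
from accelerate.state import AcceleratorState

accelerator = Accelerator()
state = AcceleratorState()
# True only when num_processes > 1, independent of the specific backend.
if state.use_distributed:
    print(f"distributed run with {state.num_processes} processes")
else:
    print("single-process run")
```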
6ebddcd5e0 fixing fsdp autowrap functionality (#475)
* fixing fsdp autowrap functionality

* updating version requirements

* update version to latest torch stable version

* quality
2022-07-01 10:00:47 +05:30
4a7bc3bcb7 Use datasets 2.2.0 for now (#481) 2022-06-28 12:31:41 -04:00
1f96f3cf85 Rm gradient accumulation on TPU (#479)
* Rm gradient accumulation on TPU for now
2022-06-28 12:29:58 -04:00
bbca2700c7 Revert "Pin datasets for now (#477)" (#478)
This reverts commit a8eca60d57e8294e666b765b5331770aa0c58893.
2022-06-28 10:09:11 -04:00
a8eca60d57 Pin datasets for now (#477) 2022-06-28 09:47:39 -04:00
329209871f Some typos and cosmetic fixes (#472) 2022-06-27 05:40:07 -07:00
619ef04f09 Dev version 2022-06-24 16:41:09 -04:00
9d8ed50f7b Fix when TPU device check is ran (#469) 2022-06-24 12:07:38 -04:00
196856f357 Refactor Utility Documentation (#467)
* Add a utilities doc
2022-06-23 16:34:01 -04:00
3a5490b066 Add docbuilder to quality (#468) 2022-06-23 14:36:16 -04:00
24be733d84 Expose some is_*_available utils in docs (#466) 2022-06-23 10:34:45 -04:00
775bc790e7 Cleanup CI Warnings (#465)
* Fix named tuple warning

* Use torch AdamW instead of transformers

* Make regex string instead of literal
2022-06-23 10:06:19 -04:00
799fa935e9 Link CI slow runners to the commit (#464)
* Tweak trigger logic to link actions together
2022-06-23 08:56:01 -04:00
3ccbd9f7a0 Fix subtle bug in BF16 (#463)
* mixed precision bugfix

* Use is_tpu_available
2022-06-23 08:55:13 -04:00
f13c59f91e Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 (#462)
* Support bf16 on TPU, CPU, and GPU in Accelerator directly
2022-06-22 17:53:42 -04:00
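With the change above, bf16 can be requested directly from the `Accelerator`; a minimal sketch (the device check against CUDA/TPU/CPU support happens internally):

```python
from accelerate import Accelerator

# Request bf16 mixed precision; per this commit the Accelerator verifies whether
# the current device (CUDA, TPU or CPU) actually supports bf16.
accelerator = Accelerator(mixed_precision="bf16")
print(accelerator.device)
```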
d39c57c11f Handle bfloat16 weights in disk offload without adding memory overhead (#460) (#461) 2022-06-22 09:13:23 -04:00
e2a968c66d Handle bfloat16 weights in disk offload (#460)
* Handle bfloat16 weights in disk offload

* Address review comments
2022-06-21 18:06:57 -04:00
dc243c0db1 Raise a clear warning if a user tries to modify the AcceleratorState (#458)
* Reinitialize warning
2022-06-21 16:42:35 -04:00
97f4c9de61 Right step point (#459) 2022-06-21 15:11:03 -04:00
73a596593e Better checks for if a TPU device exists (#456)
* Check if a TPU device actually exists
2022-06-21 12:12:00 -04:00
eeaba598f4 Offload and modules with unused submodules (#442)
* Offload and modules with unused submodules

* Renaming

* Update src/accelerate/hooks.py

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* Address review comment

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
2022-06-17 20:04:39 -04:00
3d92caa241 Release: v0.10.0 2022-06-15 13:58:22 -04:00
fa17f207b5 Fix docstring (#447) 2022-06-15 13:54:04 -04:00
873dcc63a4 Migrate HFDeepSpeedConfig from trfrs to accelerate (#432)
* Migrate HFDeepSpeedConfig from trfrs to accelerate

* update state.py to resolve comments

1. Adds static method to have a simple API for integrating deepspeed config in transformers trainer.

* reverting changes and addressing comments

* Marking DeepSpeed and FSDP as experimental in accelerate
2022-06-15 20:56:39 +05:30
40b6fe1784 Add psutil as dependency (#445) 2022-06-15 11:03:52 -04:00
29eef234c9 Revamp TPU internals to be more efficient + enable mixed precision types (#441)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-14 17:41:20 -04:00
3f0876ac03 fix fsdp torch version dependency (#437) 2022-06-11 00:36:44 +05:30
450d51ce01 Create Gradient Accumulation Example (#431)
* Gradient accumulation example
2022-06-08 14:46:04 -04:00
1b2da6c6a5 init (#429) 2022-06-08 14:07:10 -04:00
1424a8e00d Introduce no_sync context wrapper + clean up some more warnings for DDP (#428) 2022-06-08 12:56:21 -04:00
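A minimal, self-contained sketch of the `no_sync` wrapper introduced above (model, data and accumulation interval are placeholders): gradients are accumulated locally on accumulation steps and only all-reduced on the step that updates the weights.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
model, optimizer, dataloader = accelerator.prepare(model, optimizer, DataLoader(dataset, batch_size=8))

loss_fn = torch.nn.MSELoss()
for step, (x, y) in enumerate(dataloader):
    if (step + 1) % 4 != 0:
        # Accumulation step: skip the DDP gradient all-reduce for this backward pass.
        with accelerator.no_sync(model):
            accelerator.backward(loss_fn(model(x), y))
    else:
        # Sync step: gradients are all-reduced here, then we update the weights.
        accelerator.backward(loss_fn(model(x), y))
        optimizer.step()
        optimizer.zero_grad()
```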
b2afd4e8da updating tests to resolve runner failures wrt deepspeed revamp (#427)
* deepspeed revamp

* Update dataclasses.py

* Update deepspeed.py

* quality

* fixing code

* quality

* FIx imports

* saving 16bit model in zero stage 3

1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig

* quality

* adding test and fixing bugs

* update makefile for deepspeed tests

* Update test.yml

* adding `deepspeed` as requirement for tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* addressing comments

* add example and minor updates

1. Add example to show the usage of config file with revamped deepspeed support.
2. update required deepspeed version to 0.6.5
3. reverting `reinit` change as it is not required
4. raising Exception when using `clip_grad_value` with DeepSpeed/FSDP.

* Documentation and Zero-3 Inference Support

1. Changes to support ZeRO Stage-3 inference.
2. minor bug fixes.
3. Documentation.

* doc fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* addressing comments

* update doc to address comments and bug fixes

1. update tests and add new one testing autofill functionality of `prepare` method.
2. fix bug related to zero-3 init related to HFDeepSpeedConfig
3. Update documentation addressing comments.

* removing image and hosting it on `documentation-images` dataset

* check for hidden_size for zero_opt heuristics

* updating tests to resolve runner failures

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-07 16:21:26 +05:30
2130205626 Fix secrets in Docker workflow (#426)
* Fix secrets
2022-06-07 06:47:09 -04:00
1703b79a79 DeepSpeed Revamp (#405)
* deepspeed revamp

* Update dataclasses.py

* Update deepspeed.py

* quality

* fixing code

* quality

* FIx imports

* saving 16bit model in zero stage 3

1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig

* quality

* adding test and fixing bugs

* update makefile for deepspeed tests

* Update test.yml

* adding `deepspeed` as requirement for tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* quality

* addressing comments

* add example and minor updates

1. Add example to show the usage of config file with revamped deepspeed support.
2. update required deepspeed version to 0.6.5
3. reverting `reinit` change as it is not required
4. raising Exception when using `clip_grad_value` with DeepSpeed/FSDP.

* Documentation and Zero-3 Inference Support

1. Changes to support ZeRO Stage-3 inference.
2. minor bug fixes.
3. Documentation.

* doc fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* addressing comments

* update doc to address comments and bug fixes

1. update tests and add new one testing autofill functionality of `prepare` method.
2. fix bug related to zero-3 init related to HFDeepSpeedConfig
3. Update documentation addressing comments.

* removing image and hosting it on `documentation-images` dataset

* check for hidden_size for zero_opt heuristics

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-07 00:52:18 +05:30
05c641bc0c Introduce a Dependency Checker to trigger new Docker Builds on main (#424)
* Introduce warning + auto build

* Trigger only on merge to main
2022-06-06 07:30:39 -04:00
da78e296ba Enable slow tests nightly (#421) 2022-06-01 20:28:31 -04:00
9e0fff9291 Push out python 3.6 + fix all tests related to the upgrade (#420)
* Update Docker for py 3.7

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-06-01 16:49:27 -04:00
938b8f358d Speedup main CI (#419)
* Speed up workflow
2022-06-01 10:59:01 -04:00
d04e8e2baa Switch to evaluate for metrics (#417)
* Switch to evaluate for metrics

* Why the heck?

* Fix syntax error

* Install from github

* Is this the culprit?

* Upgrade Python

* Protobuf 💩

* Install from git not necessary now

* Sneaky last tensorboard

* Let's try this way

* Forgot to add all files :-/
2022-06-01 09:57:57 -04:00
8db128498c Create an issue template for Accelerate (#415) 2022-06-01 09:15:23 -04:00
114707449b Introduce post-merge runners (#416)
* Introduce post-merge runners
2022-05-31 15:11:29 -04:00
3b51d6e9ad Fix debug_launcher issues (#413)
* change to require_cpu only
2022-05-31 14:59:28 -04:00
174eb3af1d Use main egg (#414) 2022-05-31 14:58:38 -04:00
d176b552c9 Introduce nightly runners (#410)
* Introduce nightly builds
* Fixup docker images slightly
* Make device-count specific test use `torch.cuda.device_count()` rather than `Accelerator.num_processes` to avoid bug.
2022-05-31 14:14:02 -04:00
95 changed files with 5794 additions and 867 deletions

59
.github/ISSUE_TEMPLATE/bug-report.yml

@ -0,0 +1,59 @@
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve Accelerate
labels: [ "bug" ]
body:
- type: textarea
id: system-info
attributes:
label: System Info
description: Please share your accelerate configuration with us. You can run the command `accelerate env` and copy-paste its outputs below
render: Shell
placeholder: accelerate version, OS, python version, numpy version, torch version, and accelerate's configuration
validations:
required: true
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information
description: 'The problem arises when using:'
options:
- label: "The official example scripts"
- label: "My own modified scripts"
- type: checkboxes
id: information-tasks
attributes:
label: Tasks
description: "The tasks I am working on are:"
options:
- label: "One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)"
- label: "My own task or dataset (give details below)"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction
description: |
Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
If you have code snippets, error messages, stack traces please provide them here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
placeholder: |
Steps to reproduce the behavior:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior
description: "A clear and concise description of what you would expect to happen."
render: Shell

View File

@ -1,7 +1,8 @@
name: Build Docker images (scheduled)
on:
repository_dispatch:
workflow_dispatch:
workflow_call:
schedule:
- cron: "0 1 * * *"

View File

@ -0,0 +1,45 @@
name: Trigger docker images and run slow tests
on:
push:
branches:
- main
workflow_dispatch:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
check-for-setup:
runs-on: ubuntu-latest
name: Check if setup was changed
outputs:
changed: ${{ steps.was_changed.outputs.changed }}
steps:
- uses: actions/checkout@v3
with:
fetch-depth: "2"
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v22.2
- name: Was setup changed
id: was_changed
run: |
for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
if [ `basename "${file}"` = "setup.py" ]; then
echo ::set-output name=changed::"1"
fi
done
build-docker-containers:
needs: check-for-setup
if: (github.event_name == 'push') && (needs.check-for-setup.outputs.changed == '1')
uses: ./.github/workflows/build-docker-images.yml
secrets: inherit
run-tests:
needs: build-docker-containers
if: always()
uses: ./.github/workflows/on-merge.yml

69
.github/workflows/nightly.yml

@ -0,0 +1,69 @@
name: Self-hosted runner (scheduled)
on:
workflow_dispatch:
schedule:
- cron: "0 2 * * *"
env:
RUN_SLOW: "yes"
jobs:
run_all_tests_single_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone & pip install
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e . --no-deps
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples
run_all_tests_multi_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0,1"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e . --no-deps
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples

66
.github/workflows/on-merge.yml

@ -0,0 +1,66 @@
name: Self-hosted runner tests (push to "main")
on:
workflow_call:
workflow_dispatch:
env:
TESTING_MOCKED_DATALOADERS: "1"
jobs:
run_all_tests_single_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
env:
CUDA_VISIBLE_DEVICES: "0"
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone & pip install
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e .[test,test_trackers]
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples
run_all_tests_multi_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
container:
image: huggingface/accelerate-gpu:latest
options: --gpus all --shm-size "16gb"
defaults:
run:
working-directory: accelerate/
shell: bash
steps:
- name: Update clone
run: |
source activate accelerate
git config --global --add safe.directory '*'
git fetch && git checkout ${{ github.sha }}
pip install -e .[test,test_trackers]
- name: Run test on GPUs
run: |
source activate accelerate
make test
- name: Run examples on GPUs
run: |
source activate accelerate
pip uninstall comet_ml -y
make test_examples

View File

@ -7,10 +7,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
- name: Set up Python 3.7
uses: actions/setup-python@v3
with:
python-version: 3.6
python-version: 3.7
- name: Install Python dependencies
run: pip install -e .[quality]
- name: Run Quality check

View File

@ -2,29 +2,44 @@ name: Run Tests
on: [pull_request]
env:
HF_HOME: ~/hf_cache
TESTING_MOCKED_DATALOADERS: "1"
jobs:
test:
run-tests:
runs-on: ubuntu-latest
strategy:
matrix:
test-kind: [
test,
test_deepspeed,
test_example_differences,
test_checkpoint_step,
test_checkpoint_epoch,
test_rest
]
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
- uses: actions/checkout@v3
- name: Set up python 3.7
uses: actions/setup-python@v3
with:
python-version: 3.6
- name: Install Python dependencies
run: pip install setuptools==59.5.0; pip install -e .[test,test_trackers]
- name: Run Tests
run: make test_cpu
test_examples:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
uses: actions/setup-python@v2
python-version: 3.7
- name: Activate python cache
uses: actions/cache@v3
with:
python-version: 3.6
- name: Install Python dependencies
run: pip install setuptools==59.5.0; pip install -e .[test] tensorboard
path: |
${{ env.pythonLocation }}
${{ env.HF_HOME }}
key: ${{ env.pythonLocation }}-${{ matrix.test-kind }}-${{ hashFiles('setup.py') }}
- name: Install the library
run: |
pip install --upgrade pip
pip install -e .[test,test_trackers]
if [ ${{ matrix.test-kind }} = test_rest ]; then pip uninstall comet_ml -y; fi
- name: Run Tests
run: make test_examples
run: |
make ${{ matrix.test-kind }}

View File

@ -1,6 +1,6 @@
.PHONY: quality style test docs
check_dirs := tests src examples
check_dirs := tests src examples benchmarks
# Check that source code meets quality standards
@ -24,12 +24,24 @@ style:
python utils/style_doc.py src/accelerate docs/source --max_len 119
# Run tests for the library
test_cpu:
test:
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py
test_cuda:
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py --ignore=./tests/test_scheduler.py --ignore=./tests/test_cpu.py
python -m pytest -s -v ./tests/test_cpu.py ./tests/test_scheduler.py
test_deepspeed:
python -m pytest -s -v ./tests/deepspeed
test_examples:
python -m pytest -s -v ./tests/test_examples.py
# Broken down example tests for the CI runners
test_example_differences:
python -m pytest -s -v ./tests/test_examples.py::ExampleDifferenceTests
test_checkpoint_epoch:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "by_epoch"
test_checkpoint_step:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "by_step"
test_rest:
python -m pytest -s -v ./tests/test_examples.py::FeatureExamplesTests -k "not by_step and not by_epoch"

View File

@ -241,4 +241,5 @@ pip install accelerate
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap)
- DeepSpeed support (experimental)
- DeepSpeed support (Experimental)
- PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental)

46
benchmarks/README.md

@ -0,0 +1,46 @@
# Big model inference benchmarks
Running inference with Accelerate on big models.
## Setup
These benchmarks use the `transformers` library:
```bash
pip install transformers
```
To reproduce or test a new setup, run
```py
python inference_acc.py model_name
```
This script supports `gpt-j-6b`, `gpt-neox`, `opt` (30B version) and `T0pp` out of the box, but you can specify any valid checkpoint for `model_name`.
To force a different `torch_dtype` than the one in the config: `--torch_dtype xxx`.
If you get an error linked to disk offload, you need to add the option `--disk_offload`
## Results
On a setup with two Titan RTX GPUs (24GB of RAM each) and 32GB of CPU RAM, we get the following benchmarks (T0pp does not run in float16, which is why it's not included).
| Model | Model load time | Generation time | dtype | GPU 0 use | GPU 1 use | CPU use | Disk offload |
|:-----:|:---------------:|:---------------:|:-----:|:---------:|:---------:|:-------:|:------------:|
| GPT-J-6B | 8.7s | 0.05s per token | float16 | 11.7GB | 0GB | 0GB | no |
| GPT-J-6B | 12.4s | 0.06s per token | float32 | 21.9GB | 1.5GB | 0GB | no |
| GPT-Neo-X-20B | 30.9s | 0.08s per token | float16 | 21.5GB | 18GB | 0GB | no |
| GPT-Neo-X-20B | 78.2s | 10.72s per token | float32 | 20.3GB | 22.7 GB | 24.4GB | yes |
| T0pp (11B) | 29.4s | 0.05s per token | float32 | 21.1GB | 21.3GB | 0GB | no |
| OPT-30B | 34.5s | 2.37s per token | float16 | 20.7GB | 22.3GB | 14.1GB | no |
| OPT-30B | 112.3s | 33.9s per token | float32 | 20.2GB | 21.2GB | 23.5GB | yes |
Note on the results:
- using two GPUs instead of one does not slow down generation
- using CPU offload slows down a bit (see OPT-30b)
- using disk offload slows down a lot (need to implement prefetching)
You will also note that Accelerate does not use any more GPU and CPU RAM than necessary:
- peak GPU memory is exactly the size of the model put on a given GPU
- peak CPU memory is either the size of the biggest checkpoint shard or the part of the model offloaded on CPU, whichever is bigger.

View File

@ -0,0 +1,143 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import torch
import transformers
from accelerate.utils import compute_module_sizes
from measures_util import end_measure, log_measures, start_measure
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer
DEFAULT_MODELS = {
"gpt-j-6b": {"is_causal": True, "model": "sgugger/sharded-gpt-j-6B", "tokenizer": "EleutherAI/gpt-j-6B"},
"gpt-neox": {"is_causal": True, "model": "EleutherAI/gpt-neox-20b"},
"opt": {"is_causal": True, "model": "facebook/opt-30b"},
"T0pp": {"is_causal": False, "model": "bigscience/T0pp", "model_revision": "sharded"},
}
PROMPTS = [
"Hello, my name is",
"Are unicorns real? Unicorns are",
"For the first time in several years,",
"My name is Julien and I am",
"The goal of life is",
"Whenever I'm sad, I like to",
]
def parse_args():
parser = argparse.ArgumentParser(description="Run and time generations on a big model using Accelerate.")
parser.add_argument("model_name", type=str, default=None, help="The name of the model to try.")
parser.add_argument(
"--tokenizer_name", type=str, default=None, help="The name of the tokenizer (if different from the model."
)
parser.add_argument("--is_causal", type=bool, default=None, help="Whether or not the model is causal.")
parser.add_argument(
"--model_revision", type=str, default=None, help="The revision to use for the model checkpoint."
)
parser.add_argument("--torch_dtype", type=str, default=None, help="The dtype for the model.")
parser.add_argument("--disk_offload", action="store_true")
args = parser.parse_args()
# Sanitize args
if args.model_name in DEFAULT_MODELS:
defaults = DEFAULT_MODELS[args.model_name]
args.model_name = defaults["model"]
if args.tokenizer_name is None:
args.tokenizer_name = defaults.get("tokenizer", args.model_name)
if args.is_causal is None:
args.is_causal = defaults["is_causal"]
if args.model_revision is None:
args.model_revision = defaults.get("model_revision", "main")
if args.is_causal is None:
raise ValueError("Could not infer the default for `--is_causal`, pass either True or False for it.")
if args.tokenizer_name is None:
args.tokenizer_name = args.model_name
if args.model_revision is None:
args.model_revision = "main"
return args
def main():
transformers.utils.logging.set_verbosity_error()
args = parse_args()
if args.torch_dtype is None:
config = AutoConfig.from_pretrained(args.model_name)
torch_dtype = getattr(config, "torch_dtype", torch.float32)
else:
torch_dtype = getattr(torch, args.torch_dtype)
model_cls = AutoModelForCausalLM if args.is_causal else AutoModelForSeq2SeqLM
kwargs = {
"torch_dtype": torch_dtype,
"revision": args.model_revision,
}
if args.disk_offload:
kwargs["offload_folder"] = "tmp_offload"
kwargs["offload_state_dict"] = True
start_measures = start_measure()
model = model_cls.from_pretrained(args.model_name, device_map="auto", **kwargs)
end_measures = end_measure(start_measures)
log_measures(end_measures, "Model loading")
module_sizes = compute_module_sizes(model)
device_size = {v: 0 for v in model.hf_device_map.values()}
for module, device in model.hf_device_map.items():
device_size[device] += module_sizes[module]
message = "\n".join([f"- {device}: {size // 2**20}MiB" for device, size in device_size.items()])
print(f"\nTheoretical use:\n{message}")
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)
start_measures = start_measure()
generation_times = []
gen_tokens = []
texts_outs = []
for prompt in PROMPTS:
inputs = tokenizer(prompt, return_tensors="pt").to(0)
tokens = inputs["input_ids"][0].tolist()
before_generate = time.time()
outputs = model.generate(inputs["input_ids"])
after_generate = time.time()
outputs = outputs[0].tolist()
num_gen_tokens = len(outputs) if outputs[: len(tokens)] != tokens else len(outputs) - len(tokens)
generation_time = after_generate - before_generate
text_out = tokenizer.decode(outputs, skip_special_tokens=True)
texts_outs.append(text_out)
generation_times.append(generation_time)
gen_tokens.append(num_gen_tokens)
print(f"Prompt: {prompt}\nGeneration {text_out}\nIn {generation_time:.2f}s for {num_gen_tokens} tokens\n")
end_measures = end_measure(start_measures)
log_measures(end_measures, "Model generation")
generation_times_per_token = [gen / tok for gen, tok in zip(generation_times, gen_tokens)]
avg_gen = sum(generation_times_per_token) / len(generation_times)
print(f"Average time of generation per token: {avg_gen:.2f}s")
print(f"First generation (avg time per token): {generation_times_per_token[0]:.2f}s")
avg_gen = sum(generation_times_per_token[1:]) / (len(generation_times_per_token) - 1)
print(f"Average time of generation per token (excluding the first): {avg_gen:.2f}s")
if __name__ == "__main__":
main()

View File

@ -0,0 +1,86 @@
import gc
import threading
import time
import torch
import psutil
class PeakCPUMemory:
def __init__(self):
self.process = psutil.Process()
self.peak_monitoring = False
def peak_monitor(self):
self.cpu_memory_peak = -1
while True:
self.cpu_memory_peak = max(self.process.memory_info().rss, self.cpu_memory_peak)
# can't sleep or will not catch the peak right (this comment is here on purpose)
if not self.peak_monitoring:
break
def start(self):
self.peak_monitoring = True
self.thread = threading.Thread(target=self.peak_monitor)
self.thread.daemon = True
self.thread.start()
def stop(self):
self.peak_monitoring = False
self.thread.join()
return self.cpu_memory_peak
cpu_peak_tracker = PeakCPUMemory()
def start_measure():
# Time
measures = {"time": time.time()}
gc.collect()
torch.cuda.empty_cache()
# CPU mem
measures["cpu"] = psutil.Process().memory_info().rss
cpu_peak_tracker.start()
# GPU mem
for i in range(torch.cuda.device_count()):
measures[str(i)] = torch.cuda.memory_allocated(i)
torch.cuda.reset_peak_memory_stats()
return measures
def end_measure(start_measures):
# Time
measures = {"time": time.time() - start_measures["time"]}
gc.collect()
torch.cuda.empty_cache()
# CPU mem
measures["cpu"] = (psutil.Process().memory_info().rss - start_measures["cpu"]) / 2**20
measures["cpu-peak"] = (cpu_peak_tracker.stop() - start_measures["cpu"]) / 2**20
# GPU mem
for i in range(torch.cuda.device_count()):
measures[str(i)] = (torch.cuda.memory_allocated(i) - start_measures[str(i)]) / 2**20
measures[f"{i}-peak"] = (torch.cuda.max_memory_allocated(i) - start_measures[str(i)]) / 2**20
return measures
def log_measures(measures, description):
print(f"{description}:")
print(f"- Time: {measures['time']:.2f}s")
for i in range(torch.cuda.device_count()):
print(f"- GPU {i} allocated: {measures[str(i)]:.2f}MiB")
peak = measures[f"{i}-peak"]
print(f"- GPU {i} peak: {peak:.2f}MiB")
print(f"- CPU RAM allocated: {measures['cpu']:.2f}MiB")
print(f"- CPU RAM peak: {measures['cpu-peak']:.2f}MiB")

View File

@ -1,7 +1,7 @@
# Builds CPU-only Docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
FROM python:3.6-slim as compile-image
FROM python:3.7-slim as compile-image
ARG DEBIAN_FRONTEND=noninteractive
@ -21,17 +21,14 @@ WORKDIR /workspace
RUN python3 -m pip install --upgrade --no-cache-dir pip
RUN python3 -m pip install --no-cache-dir \
jupyter \
torch --extra-index-url https://download.pytorch.org/whl/cpu \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
git+https://github.com/huggingface/accelerate#egg=accelerate[test,test_trackers] \
--extra-index-url https://download.pytorch.org/whl/cpu
# Stage 2
FROM python:3.6-slim AS build-image
FROM python:3.7-slim AS build-image
COPY --from=compile-image /opt/venv /opt/venv
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists*
RUN useradd -ms /bin/bash user
USER user
# Make sure we use the virtualenv
ENV PATH="/opt/venv/bin:$PATH"

View File

@ -4,7 +4,7 @@
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.6
ENV PYTHON_VERSION=3.7.3
# Install apt libs
RUN apt-get update && \
apt-get install -y curl git wget && \
@ -22,8 +22,8 @@ SHELL ["/bin/bash", "-c"]
# Activate the conda env and install torch + accelerate
RUN source activate accelerate && \
python3 -m pip install --no-cache-dir \
torch --extra-index-url https://download.pytorch.org/whl/cu113 \
git+https://github.com/huggingface/accelerate#egg=accelerate[dev]
git+https://github.com/huggingface/accelerate#egg=accelerate[test,test_trackers] \
--extra-index-url https://download.pytorch.org/whl/cu113
# Stage 2
FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 AS build-image

View File

@ -9,6 +9,8 @@
- sections:
- local: big_modeling
title: Handling big models
- local: gradient_accumulation
title: Gradient accumulation
- local: sagemaker
title: Amazon SageMaker
title: Guides
@ -19,14 +21,18 @@
title: Notebook Launcher
- local: kwargs
title: Kwargs Handlers
- local: internal
title: Internals
- local: checkpoint
title: Checkpointing
- local: internal
title: Internals
- local: tracking
title: Experiment Tracking
- local: fsdp
title: Fully Sharded Data Parallel
- local: memory
title: Memory Utilities
- local: deepspeed
title: DeepSpeed
- local: utilities
title: General Utilities
title: API Reference

View File

@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
# Accelerator
The [`Accelerator`] is the main class provided by 🤗 Accelerate. It serves at the main entrypoint for
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate juste:
the API. To quickly adapt your script to work on any kind of setup with 🤗 Accelerate just:
1. Initialize an [`Accelerator`] object (that we will call `accelerator` in the rest of this
page) as early as possible in your script.
@ -21,10 +21,10 @@ the API. To quickly adapt your script to work on any kind of setup with 🤗 Acc
3. (Optional but best practice) Remove all the `.cuda()` or `.to(device)` in your code and let the
`accelerator` handle device placement for you.
4. Replace the `loss.backward()` in your code by `accelerator.backward(loss)`.
5. (Optional, when using distributed evaluation) Gather your predictions and labelsbefore storing them or using them
for metric computation using [`~Accelerator.gather`].
5. (Optional, when using distributed evaluation) Gather your predictions and labels before storing them or using
them for metric computation using [`~Accelerator.gather`].
This is all what is needed in most cases. For more advanced case or a nicer experience here are the functions you
This is all that is needed in most cases. For more advanced cases or a nicer experience here are the functions you
should search for and replace by the corresponding methods of your `accelerator`:
- `print` statements should be replaced by [`~Accelerator.print`] to be only printed once per
@ -38,4 +38,27 @@ should search for and replace by the corresponding methods of your `accelerator`
- Use [`~Accelerator.clip_grad_norm_`] instead of `torch.nn.utils.clip_grad_norm_` and
[`~Accelerator.clip_grad_value_`] instead of `torch.nn.utils.clip_grad_value_`.
To perform gradient accumulation use [`~Accelerator.accumulate`] and specify a `gradient_accumulation_steps`.
This will also automatically ensure the gradients are synced or unsynced when on multi-device training, check if the step should
actually be performed, and auto-scale the loss:
```python
accelerator = Accelerator(gradient_accumulation_steps=2)
for input, label in training_dataloader:
with accelerator.accumulate(model):
predictions = model(input)
loss = loss_function(predictions, label)
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```
<Tip warning={true}>
Using this with `dispatch_batches=True` (which is the default for iterable datasets) is currently not supported.
</Tip>
[[autodoc]] Accelerator
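As a minimal sketch of the five steps above applied to a bare training loop (model, data and loss names are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                                   # step 1
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # step 2

loss_fn = torch.nn.CrossEntropyLoss()
for inputs, labels in dataloader:                             # step 3: no manual .to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    accelerator.backward(loss)                                # step 4
    optimizer.step()

predictions = model(inputs).argmax(dim=-1)
all_predictions = accelerator.gather(predictions)             # step 5: gather for evaluation
```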

View File

@ -27,7 +27,7 @@ In plain English, those steps are:
2. Load the model weights (in a dictionary usually called a state dict) from the disk
3. Load those weights inside the model
While this works very well for regularly sized models, this workflow has some clear limitation when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billions parameters, this needs you will need 24GB of RAM for each copy of the model, so 48GB in total (half of it to load the model in FP16).
While this works very well for regularly sized models, this workflow has some clear limitations when we deal with a huge model: in step 1, we load a full version of the model in RAM, and spend some time randomly initializing the weights (which will be discarded in step 3). In step 2, we load another full version of the model in RAM, with the pretrained weights. If you're loading a model with 6 billion parameters, this means you will need 24GB of RAM for each copy of the model, so 48GB in total (half of it to load the model in FP16).
<Tip warning={true}>
@ -37,7 +37,7 @@ This API is quite new and still in its experimental stage. While we strive to pr
## Instantiating an empty model
The first tool Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
The first tool 🤗 Accelerate introduces to help with big models is a context manager [`init_empty_weights`] that helps you initialize a model without using any RAM, so that step 1 can be done on models of any size. Here is how it works:
```py
from accelerate import init_empty_weights
@ -65,7 +65,7 @@ You can't move a model initialized like this on CPU or another device directly,
It's possible your model is so big that even a single copy won't fit in RAM. That doesn't mean it can't be loaded: if you have one or several GPUs, this is more memory available to store your model. In this case, it's better if your checkpoint is split in several smaller files that we call checkpoint shards.
Accelerate will handle sharded checkpoints as long as you follow the following format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in the JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance we could have a folder containing:
🤗 Accelerate will handle sharded checkpoints as long as you follow the following format: your checkpoint should be in a folder, with several files containing the partial state dicts, and there should be an index in the JSON format that contains a dictionary mapping parameter names to the file containing their weights. For instance we could have a folder containing:
```bash
first_state_dict.bin
@ -88,7 +88,7 @@ and `first_state_dict.bin` containing the weights for `"linear1.weight"` and `"l
## Loading weights
The second tool Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
The second tool 🤗 Accelerate introduces is a function [`load_checkpoint_and_dispatch`], that will allow you to load a checkpoint inside your empty model. This supports full checkpoints (a single file containing the whole state dict) as well as sharded checkpoints. It will also automatically dispatch those weights across the devices you have available (GPUs, CPU RAM), so if you are loading a sharded checkpoint, the maximum RAM usage will be the size of the biggest shard.
Here is how we can use this to load the [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) model. You clone the sharded version of this model with:
@ -122,14 +122,14 @@ model = load_checkpoint_and_dispatch(
)
```
By passing `device_map="auto"`, we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources:
By passing `device_map="auto"`, we tell 🤗 Accelerate to determine automatically where to put each layer of the model depending on the available resources:
- first we use the maximum space available on the GPU(s)
- if we still need space, we store the remaining weights on the CPU
- if there is not enough RAM, we store the remaining weights on the hard drive as memory-mapped tensors
`no_split_module_classes=["GPTJBlock"]` indicates that the modules that are `GPTJBlock` should not be split on different devices. You should set here all blocks that include a residual connection of some kind.
You can see the `device_map` that Accelerate picked by accessing the `hf_device_map` attribute of your model:
You can see the `device_map` that 🤗 Accelerate picked by accessing the `hf_device_map` attribute of your model:
```py
model.hf_device_map
@ -190,7 +190,7 @@ output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())
```
Behind the scenes, Accelerate added hooks to the model, so that:
Behind the scenes, 🤗 Accelerate added hooks to the model, so that:
- at each layer, the inputs are put on the right device (so even if your model is spread across several GPUs, it works)
- for the weights offloaded on the CPU, they are put on a GPU just before the forward pass, and cleaned up just after
- for the weights offloaded on the hard drive, they are loaded in RAM then put on a GPU just before the forward pass, and cleaned up just after
@ -207,7 +207,7 @@ This only supports inference of your model, not training. Most of the computatio
We are aware of the current limitations in the API:
- While this could theoretically work just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
- While this could theoretically work on just one CPU with potential disk offload, you need at least one GPU to run this API. This will be fixed in further development.
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) tries to maximize GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to lack of RAM.
- [`infer_auto_device_map`] (or `device_map="auto"` in [`load_checkpoint_and_dispatch`]) attributes devices sequentially (to avoid moving things back and forth) so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/Disk.
- [`load_checkpoint_and_dispatch`] and [`load_checkpoint_in_model`] do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
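Putting the two tools together, a minimal sketch (the local checkpoint folder name is a placeholder; it assumes the sharded GPT-J checkpoint discussed above has been cloned there):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
with init_empty_weights():
    # Instantiated on the meta device: no RAM is used for the weights yet.
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    "sharded-gpt-j-6B",                      # placeholder: folder with shards + index
    device_map="auto",
    no_split_module_classes=["GPTJBlock"],   # keep residual blocks on a single device
)
```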

View File

@ -12,8 +12,8 @@ specific language governing permissions and limitations under the License.
# Checkpointing
When training a PyTorch model with Accelerate, you may often want to save and continue a state of training. Doing so requires
saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside Accelerate are two convience functions to achieve this quickly:
When training a PyTorch model with 🤗 Accelerate, you may often want to save and continue a state of training. Doing so requires
saving and loading the model, optimizer, RNG generators, and the GradScaler. Inside 🤗 Accelerate are two convenience functions to achieve this quickly:
- Use [`~Accelerator.save_state`] for saving everything mentioned above to a folder location
- Use [`~Accelerator.load_state`] for loading everything stored from an earlier `save_state`
@ -57,4 +57,4 @@ for epoch in range(num_epochs):
# Restore previous state
accelerate.load_state("my/save/path")
```
```
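A minimal, self-contained sketch of the two convenience functions above (the save path is a placeholder):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.save_state("my/save/path")   # saves model, optimizer, RNG states, GradScaler
accelerator.load_state("my/save/path")   # restores everything saved above
```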

508
docs/source/deepspeed.mdx

@ -0,0 +1,508 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DeepSpeed
[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for:
1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
3. Parameter partitioning (ZeRO stage 3)
4. Custom mixed precision training handling
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and Disk/NVMe
ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
won't be possible on a single GPU.
🤗 Accelerate integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
1. Integration of the DeepSpeed features via `deepspeed config file` specification in `accelerate config`. You just supply your custom config file or use our template. Most of
this document is focused on this feature. This supports all the core features of DeepSpeed and gives the user a lot of flexibility.
The user may have to change a few lines of code depending on the config.
2. Integration via `deepspeed_plugin`. This supports a subset of the DeepSpeed features and uses default options for the rest of the configurations.
The user need not change any code; this is good for those who are fine with most of the default settings of DeepSpeed.
## What is integrated?
Training:
1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 as well as CPU/Disk offload of optimizer states, gradients and parameters.
Below is a short description of Data Parallelism using ZeRO - Zero Redundancy Optimizer along with diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
![ZeRO Data Parallelism](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png)
(Source: [link](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/))
a. **Stage 1** : Shards optimizer states across data parallel workers/GPUs
b. **Stage 2** : Shards optimizer states + gradients across data parallel workers/GPUs
c. **Stage 3**: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs
d. **Optimizer Offload**: Offloads the gradients + optimizer states to CPU/Disk building on top of ZeRO Stage 2
e. **Param Offload**: Offloads the model parameters to CPU/Disk building on top of ZeRO Stage 3
<u>Note</u>: With respect to Disk Offload, the disk should be an NVMe for decent speed, but it technically works on any disk
Inference:
1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see:
[deepspeed-zero-inference](#deepspeed-zero-inference).
## How it works
**Pre-Requisites**: Install DeepSpeed version >=0.6.5. Please refer to the [DeepSpeed installation details](https://github.com/microsoft/DeepSpeed#installation)
for more information.
We will first look at the easy-to-use integration via `accelerate config`,
followed by the more flexible and feature-rich `deepspeed config file` integration.
### Accelerate DeepSpeed Plugin
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed to which you should answer no. Then answer the following questions to generate a basic DeepSpeed config.
This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with DeepSpeed Plugin:
**ZeRO Stage-2 DeepSpeed Plugin Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
```bash
accelerate launch examples/nlp_example.py --mixed_precision fp16
```
**ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
```bash
accelerate launch examples/nlp_example.py --mixed_precision fp16
```
Currently, `Accelerate` supports the following config options through the CLI:
```bash
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
```
To be able to tweak more options, you will need to use a DeepSpeed config file.
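The same plugin settings can also be constructed in code rather than through `accelerate config`; a hedged sketch, assuming the `DeepSpeedPlugin` dataclass accepts the fields listed above:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Field names mirror the CLI options above; adjust if your accelerate version differs.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
```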
### DeepSpeed Config File
On your machine(s) just run:
```bash
accelerate config
```
and answer the questions asked. It will ask whether you want to use a config file for deepspeed to which you answer yes
and provide the path to the deepspeed config file.
This will generate a config file that will be used automatically to properly set the
default options when doing
```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example `examples/by_feature/deepspeed_with_config_support.py` (from the root of the repo) with DeepSpeed Config File:
**ZeRO Stage-2 DeepSpeed Config File Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage2_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
with the contents of `zero_stage2_config.json` being:
```json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage2_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
```
**ZeRO Stage-3 with CPU offload DeepSpeed Config File Example**
```bash
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage3_offload_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
```
with the contents of `zero_stage3_offload_config.json` being:
```json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
```bash
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
```
**Important code changes when using DeepSpeed Config File**
1. DeepSpeed Optimizers and Schedulers. For more information on these,
see the [DeepSpeed Optimizers](https://deepspeed.readthedocs.io/en/latest/optimizers.html) and [DeepSpeed Schedulers](https://deepspeed.readthedocs.io/en/latest/schedulers.html) documentation.
We will look at the changes needed in the code when using these.
a. DS Optim + DS Scheduler: The case when both `optimizer` and `scheduler` keys are present in the DeepSpeed config file.
In this situation, those will be used and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom optimizers and schedulers in their code.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
# Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
else DummyOptim
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)
# Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
):
lr_scheduler = get_scheduler(
name=args.lr_scheduler_type,
optimizer=optimizer,
num_warmup_steps=args.num_warmup_steps,
num_training_steps=args.max_train_steps,
)
else:
lr_scheduler = DummyScheduler(
optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
```
b. Custom Optim + Custom Scheduler: The case when both `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
In this situation, no code changes are needed from the user; this is the same situation as when using the integration via the DeepSpeed Plugin.
In the above example we can see that the code remains unchanged if the `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.
c. Custom Optim + DS Scheduler: The case when only `scheduler` key is present in the DeepSpeed config file.
In this situation, the user has to use `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom scheduler in their code (see the sketch after this list).
d. DS Optim + Custom Scheduler: The case when only `optimizer` key is present in the DeepSpeed config file.
This will result in an error because one can only use DS Scheduler when using DS Optim.
2. Notice the `auto` values in the above example DeepSpeed config files. These are automatically handled by the `prepare` method
based on the model, dataloaders, dummy optimizer and dummy scheduler provided to it.
Only the `auto` fields specified in the above examples are handled by the `prepare` method; the rest have to be specified explicitly by the user.
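To make case (c) above concrete, below is a minimal sketch (not taken from the example script) assuming `model`, `train_dataloader`, an `accelerator` object and an `args` namespace with the usual fields already exist; the DeepSpeed config file defines a `scheduler` entry but no `optimizer` entry, so a regular PyTorch optimizer is paired with `DummyScheduler`:
```python
import torch
from accelerate.utils import DummyScheduler

# Custom Optim + DS Scheduler: only the `scheduler` key is present in the DeepSpeed config file.
optimizer = torch.optim.AdamW(model.parameters(), lr=args.learning_rate)
lr_scheduler = DummyScheduler(
    optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
```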
## Saving and loading
1. Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.
2. Under ZeRO Stage-3, `state_dict` contains just the placeholders since the model weights are partitioned across multiple GPUs.
ZeRO Stage-3 has 2 options:
a. Saving the entire 16bit model weights to directly load later on using `model.load_state_dict(torch.load("pytorch_model.bin"))`.
For this, either set `zero_optimization.stage3_gather_16bit_weights_on_model_save` to True in the DeepSpeed Config file or set
`zero3_save_16bit_model` to True in the DeepSpeed Plugin.
**Note that this option requires consolidation of the weights on one GPU; it can be slow and memory demanding, so only use this feature when needed.**
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
unwrapped_model = accelerator.unwrap_model(model)
# New Code #
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The model name saved is `pytorch_model.bin`
unwrapped_model.save_pretrained(
args.output_dir,
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
state_dict=accelerator.get_state_dict(model),
)
```
b. To get 32bit weights, first save the model using `model.save_checkpoint()`.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
success = model.save_checkpoint(PATH, ckpt_id, checkpoint_state_dict)
status_msg = "checkpointing: PATH={}, ckpt_id={}".format(PATH, ckpt_id)
if success:
logging.info(f"Success {status_msg}")
else:
logging.warning(f"Failure {status_msg}")
```
This will create ZeRO model and optimizer partitions along with the `zero_to_fp32.py` script in the checkpoint directory.
One can use this script to do offline consolidation.
It requires no configuration files or GPUs. Here is an example of its usage:
```bash
$ cd /path/to/checkpoint_dir
$ ./zero_to_fp32.py . pytorch_model.bin
Processing zero checkpoint at global_step1
Detected checkpoint of type zero stage 3, world_size: 2
Saving fp32 state dict to pytorch_model.bin (total_numel=60506624)
```
To get the 32bit model for saving/inference, one can do the following:
```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
unwrapped_model = accelerator.unwrap_model(model)
fp32_model = load_state_dict_from_zero_checkpoint(unwrapped_model, checkpoint_dir)
```
If one is only interested in the `state_dict`, one can do the following:
```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
```
Note that all these functions require ~2x the memory (general RAM) of the size of the final checkpoint.
## ZeRO Inference
DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity.
It uses the same ZeRO protocol as training, but it doesn't use an optimizer or an LR scheduler, and only stage 3 is relevant.
With the 🤗 Accelerate integration, you just have to prepare the model and dataloader as shown below:
```python
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
```
## A few caveats to be aware of
1. The current integration doesn't support Pipeline Parallelism of DeepSpeed.
2. The current integration doesn't support `mpu`, limiting the tensor parallelism which is supported in Megatron-LM.
3. The current integration doesn't support multiple models for a given `accelerator` object.
## Internals
[[autodoc]] utils.DeepSpeedPlugin
[[autodoc]] utils.DummyOptim
[[autodoc]] utils.DummyScheduler
[[autodoc]] utils.DeepSpeedEngineWrapper
[[autodoc]] utils.DeepSpeedOptimizerWrapper
[[autodoc]] utils.DeepSpeedSchedulerWrapper
## Main DeepSpeed Resources
- [Project's github](https://github.com/microsoft/deepspeed)
- [Usage docs](https://www.deepspeed.ai/getting-started/)
- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
Papers:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
Finally, please remember that 🤗 `Accelerate` only integrates DeepSpeed; therefore, if you
have any problems or questions regarding DeepSpeed usage, please file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).

View File

@ -18,7 +18,7 @@ To read more about it and the benefits, check out the [Fully Sharded Data Parall
We have integrated the latest PyTorch's Fully Sharded Data Parallel (FSDP) training feature.
All you need to do is enable it through the config.
## How it works out the box
## How it works out of the box
On your machine(s) just run:
@ -57,7 +57,7 @@ use_cpu: false
accelerate launch examples/nlp_example.py
```
Currently, `Accelerate` supports following config through the CLI:
Currently, `Accelerate` supports the following config through the CLI:
```bash
`Sharding Strategy`: [1] FULL_SHARD, [2] SHARD_GRAD_OP
@ -65,11 +65,11 @@ Currently, `Accelerate` supports following config through the CLI:
`Offload Params`: Decides Whether to offload parameters and gradients to CPU.
```
## Few caveats to be aware of
## A few caveats to be aware of
- PyTorch FSDP auto wraps sub-modules, flattens the parameters and shards the parameters in place.
Due to this, any optimizer created before model wrapping gets broken and occupies more memory.
Hence, it is highly recommended and efficient to prepare model before creating optimizer.
Hence, it is highly recommended and efficient to prepare the model before creating the optimizer.
`Accelerate` will automatically wrap the model and create an optimizer for you in case of single model with a warning message.
> FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
@ -91,14 +91,14 @@ optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
```
- In case of a single model, if you have created optimizer with multiple parameter groups and called prepare with them together,
- In case of a single model, if you have created the optimizer with multiple parameter groups and called prepare with them together,
then the parameter groups will be lost and the following warning is displayed:
> FSDP Warning: When using FSDP, several parameter groups will be conflated into
> a single one due to nested module wrapping and parameter flattening.
This is because parameter groups created before wrapping will have no meaning post wrapping due parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
For instance, below are the named parameters of FSDP model on GPU 0 (When using 2 GPUs. Around 55M (110M/2) params in 1D arrays as this will have the 1st shard of the parameters).
Here, if one has applied no weight decay for [bias, LayerNorm.weight] named parameters of unwrapped BERT model,
This is because parameter groups created before wrapping will have no meaning post wrapping due to parameter flattening of nested FSDP modules into 1D arrays (which can consume many layers).
For instance, below are the named parameters of an FSDP model on GPU 0 (When using 2 GPUs. Around 55M (110M/2) params in 1D arrays as this will have the 1st shard of the parameters).
Here, if one has applied no weight decay for [bias, LayerNorm.weight] the named parameters of an unwrapped BERT model,
it can't be applied to the below FSDP wrapped model as there are no named parameters with either of those strings and
the parameters of those layers are concatenated with parameters of various other layers.
```
@ -110,7 +110,7 @@ optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
```
- In case of multiple models, it is necessary to prepare the models before creating optimizers else it will throw an error.
- In case of multiple models, it is necessary to prepare the models before creating optimizers or else it will throw an error.
- Mixed precision is currently not supported with FSDP.
For more control, users can leverage the `FullyShardedDataParallelPlugin` wherein they can specify `auto_wrap_policy`, `backward_prefetch` and `ignored_modules`.

View File

@ -0,0 +1,126 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Performing gradient accumulation with 🤗 Accelerate
Gradient accumulation is a technique where you can train on bigger batch sizes than
your machine would normally be able to fit into memory. This is done by accumulating gradients over
several batches, and only stepping the optimizer after a certain number of batches have been performed.
While technically standard gradient accumulation code would work fine in a distributed setup, it is not the most efficient
method for doing so and you may experience considerable slowdowns!
In this tutorial you will see how to quickly set up gradient accumulation and perform it with the utilities provided in 🤗 Accelerate,
which can amount to adding just one new line of code!
This example will use a very simplistic PyTorch training loop that performs gradient accumulation every two batches:
```python
device = "cuda"
model.to(device)
gradient_accumulation_steps = 2
for index, batch in enumerate(training_dataloader):
optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss = loss / gradient_accumulation_steps
loss.backward()
if (index + 1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
## Converting it to 🤗 Accelerate
First, the code shown earlier will be converted to utilize 🤗 Accelerate without the special gradient accumulation helper:
```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+ model, optimizer, training_dataloader, scheduler
+ )
for index, batch in enumerate(training_dataloader):
optimizer.zero_grad()
inputs, targets = batch
- inputs = inputs.to(device)
- targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss = loss / gradient_accumulation_steps
+ accelerator.backward(loss)
if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
<Tip warning={true}>
In its current state, this code is not going to perform gradient accumulation efficiently due to a process called gradient synchronization.
</Tip>
## Letting 🤗 Accelerate handle gradient accumulation
All that is left now is to let 🤗 Accelerate handle the gradient accumulation for us. To do so, you should pass a `gradient_accumulation_steps` parameter to [`Accelerator`], dictating the number
of steps to perform before each call to `step()` and how to automatically adjust the loss during the call to [`Accelerator.backward`]:
```diff
from accelerate import Accelerator
- accelerator = Accelerator()
+ accelerator = Accelerator(gradient_accumulation_steps=2)
```
From here you can use the [`Accelerator.accumulate`] context manager from inside your training loop to automatically perform the gradient accumulation for you!
You just wrap it around the entire training part of your code:
```diff
- for index, batch in enumerate(training_dataloader):
+ for batch in training_dataloader:
+ with accelerator.accumulate(model):
optimizer.zero_grad()
inputs, targets = batch
outputs = model(inputs)
```
and you can remove all the special checks for the step number and the loss adjustment:
```diff
- loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
- if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
```
As you can see, the [`Accelerator`] is able to keep track of the batch number you are on, and it will automatically know whether to step the prepared optimizer and how to adjust the loss.
## The finished code
Below is the finished implementation for performing gradient accumulation with 🤗 Accelerate:
```python
for batch in training_dataloader:
with accelerator.accumulate(model):
optimizer.zero_grad()
inputs, targets = batch
outputs = model(inputs)
loss = loss_function(outputs, targets)
accelerator.backward(loss)
optimizer.step()
scheduler.step()
```
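For reference, here is the same finished loop with the [`Accelerator`] setup from the diffs above spelled out; this is only a sketch and assumes `model`, `optimizer`, `training_dataloader`, `scheduler` and `loss_function` are already defined:
```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    # `accumulate` handles gradient synchronization and the loss adjustment for you.
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
```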

View File

@ -12,16 +12,16 @@ specific language governing permissions and limitations under the License.
# Accelerate
Run your *raw* PyTorch training script on any kind of device
Run your *raw* PyTorch training script on any kind of device.
## Features
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then runs
- 🤗 Accelerate provides an easy API to make your scripts run with mixed precision and in any kind of distributed
setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code can then run
seamlessly on your local machine for debugging or your training environment.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment then
launch the scripts.
- 🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment and
then launch the scripts.
## Easy to integrate

View File

@ -57,7 +57,7 @@ pip install git+https://github.com/huggingface/accelerate
Note that this will install not the latest released version, but the bleeding edge `main` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been yet rolled out.
While we strive to keep `main` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day and and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
While we strive to keep `main` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/accelerate/issues) and this way, things will get fixed even sooner.
Again, you can run:
@ -85,7 +85,7 @@ now this editable install will reside where you clone the folder to, e.g. `~/acc
Do note that you have to keep that `accelerate` folder around and not delete it to continue using the 🤗 Accelerate library.
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature has been just committed into `main`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature just has been committed into `main`. If you have already performed all the steps above, to update your accelerate repo to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
```bash
cd ~/accelerate/

View File

@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
# Internals
## Gradient Accumulation states
[[autodoc]] state.GradientState
## Optimizer
[[autodoc]] optimizer.AcceleratedOptimizer
@ -22,7 +26,7 @@ The main work on your PyTorch `DataLoader` is done by the following function:
[[autodoc]] data_loader.prepare_data_loader
### BatchSamplerShard
### DataLoaderShard
[[autodoc]] data_loader.DataLoaderShard
@ -44,28 +48,6 @@ The main work on your PyTorch `DataLoader` is done by the following function:
[[autodoc]] state.AcceleratorState
### DistributedType
[[autodoc]] state.DistributedType
## Tracking
[[autodoc]] tracking.GeneralTracker
## Utilities
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.gather
[[autodoc]] utils.send_to_device
[[autodoc]] utils.set_seed
[[autodoc]] utils.synchronize_rng_state
[[autodoc]] utils.synchronize_rng_states
[[autodoc]] utils.wait_for_everyone
[[autodoc]] utils.write_basic_config

View File

@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# Quick tour
Let's have a look at a look at 🤗 Accelerate main features and traps to avoid.
Let's have a look at the 🤗 Accelerate main features and traps to avoid.
## Main use
@ -54,7 +54,7 @@ model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
)
```
In particular, your training dataloader will be sharded accross all GPUs/TPU cores available so that each one sees a
In particular, your training dataloader will be sharded across all GPUs/TPU cores available so that each one sees a
different portion of the training dataset. Also, the random states of all processes will be synchronized at the
beginning of each iteration through your dataloader, to make sure the data is shuffled the same way (if you decided to
use `shuffle=True` or any kind of random sampler).
@ -118,7 +118,7 @@ method:
validation_dataloader = accelerator.prepare(validation_dataloader)
```
Like for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
As for your training dataloader, it will mean that (should you run your script on multiple devices) each device will
only see part of the evaluation data. This means you will need to group your predictions together. This is very easy to
do with the [`~Accelerator.gather`] method.
@ -134,8 +134,8 @@ for inputs, targets in validation_dataloader:
<Tip warning={true}>
Like for the training dataloader, passing your validation dataloader through
[`~Accelerator.prepare`] may change its: if you run on X GPUs, it will have its length divided by X
As for the training dataloader, passing your validation dataloader through
[`~Accelerator.prepare`] may change it: if you run on X GPUs, it will have its length divided by X
(since your actual batch size will be multiplied by X), unless you set `split_batches=True`.
Any instruction using your training dataloader length (for instance if you need the number of total training steps
@ -159,7 +159,7 @@ PyTorch), they are fully compatible with 🤗 Accelerate. The only caveat here i
to determine all useful information, so `torch.distributed.launch` should be used with the flag `--use_env`.
🤗 Accelerate also provides a CLI tool that unifies all launcher, so you only have to remember one command. To use it,
just run
just run:
```bash
accelerate config
@ -175,7 +175,7 @@ on your machine and reply to the questions asked. This will save a *default_conf
You can also specify with the flag `--config_file` the location of the file you want to save.
Once this is done, you can test everything is going well on your setup by running
Once this is done, you can test everything is going well on your setup by running:
```bash
accelerate test
@ -235,14 +235,14 @@ step). This is why your first step of training will always be very long as build
optimizations takes some time.
The good news is that this compilation will be cached so the second step and all the following will be much faster. The
bas news is that it only applies if all of your steps do exactly the same operations, which implies:
bad news is that it only applies if all of your steps do exactly the same operations, which implies:
- having all tensors of the same length in all your lengths
- having static code (i.e., not a for loop of length that could change from step to step)
Having any of the things above change between two steps will trigger a new compilation which will, once again, take a
lot of time. In practice, that means you must take special care to have all your tensors in your inputs of the same
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layer with for loops that
shape (so no dynamic padding for instance if you are in an NLP problem) and should not use layers with for loops that
have different lengths depending on the inputs (such as an LSTM) or the training will be excruciatingly slow.
To introduce special behavior in your script for TPUs you can check the `distributed_type` of your
@ -257,10 +257,10 @@ else:
# go crazy and be dynamic
```
The [NLP example](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py) shows an example in
The [NLP example](https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py) shows an example in a
situation with dynamic padding.
One last thing to pay close attnetion to: if your model has tied weights (such as language models which tie the weights
One last thing to pay close attention to: if your model has tied weights (such as language models which tie the weights
of the embedding matrix with the weights of the decoder), moving this model to the TPU (either yourself or after you
passed your model to [`~Accelerator.prepare`]) will break the tying. You will need to retie the weights
after. You can find an example of this in the [run_clm_no_trainer](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) script in
@ -317,8 +317,8 @@ following line in your code:
accelerator.wait_for_everyone()
```
This instruction will block all the processes that arrive them first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this wont' do anything).
This instruction will block all the processes that arrive first until all the other processes have reached that
point (if you run your script on just one GPU or CPU, this won't do anything).
### Saving/loading a model
@ -338,7 +338,7 @@ unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), filename)
```
If your script contains a logic to load checkpoint, we also recommend you load your weights in the unwrapped model
If your script contains logic to load a checkpoint, we also recommend you load your weights in the unwrapped model
(this is only useful if you use the load function after making your model go through
[`~Accelerator.prepare`]). Here is an example:
@ -368,7 +368,7 @@ and `accelerator.clip_grad_value_` respectively.
### Mixed Precision training
If you are running your training in Mixed Precision with Accelerate, you will get the best result with your loss being
If you are running your training in Mixed Precision with 🤗 Accelerate, you will get the best result with your loss being
computed inside your model (like in Transformer models for instance). Every computation outside of the model will be
executed in full precision (which is generally what you want for loss computation, expecially if it involves a
softmax). However you might want to put your loss computation inside the *accelerator.autocast* context manager:
@ -438,14 +438,14 @@ The random number generator synchronization will by default synchronize:
- the main random number generator in PyTorch <=1.5.1
You can choose which random number generator(s) to synchronize with the `rng_types` argument of the main
[`Accelerator`]. In PyTorch >= 1.6, it is recommended to rely on local `generator` to avoid
[`Accelerator`]. In PyTorch >= 1.6, it is recommended to rely on a local `generator` to avoid
setting the same seed in the main random number generator in all processes.
<Tip warning={true}>
Synchronization the main torch (or CUDA or XLA) random number generator will affect any other potential random
artifacts you could have in your dataset (like random data augmentation) in the sense all processes will get the
same random numbers from the torch random modules (so will apply the same random data augmentation if it's
Synchronization of the main torch (or CUDA or XLA) random number generator will affect any other potential random
artifacts you could have in your dataset (like random data augmentation) in the sense that all processes will get
the same random numbers from the torch random modules (so will apply the same random data augmentation if it's
controlled by torch).
</Tip>
@ -457,4 +457,4 @@ The randomization part of your custom sampler, batch sampler or iterable dataset
</Tip>
See more details about the internal in the [Internals page](internal).
For more details about the internals, see the [Internals page](internal).

View File

@ -23,7 +23,7 @@ make it easier than ever to train Hugging Face Transformer models in [Amazon Sag
Before you can run your 🤗 Accelerate scripts on Amazon SageMaker you need to sign up for an AWS account. If you do not
have an AWS account yet learn more [here](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html).
After you have your AWS Account you need to install the `sagemaker` sdk for 🤗 Accelerate with.
After you have your AWS Account you need to install the `sagemaker` sdk for 🤗 Accelerate with:
```bash
pip install "accelerate[sagemaker]" --upgrade
@ -31,7 +31,7 @@ pip install "accelerate[sagemaker]" --upgrade
🤗 Accelerate currently uses the 🤗 DLCs, with `transformers`, `datasets` and `tokenizers` pre-installed. 🤗
Accelerate is not in the DLC yet (will soon be added!) so to use it within Amazon SageMaker you need to create a
`requirements.txt` in the same directory where your training script is located and add it as dependency.
`requirements.txt` in the same directory where your training script is located and add it as dependency:
```
accelerate
@ -43,7 +43,7 @@ You should also add any other dependencies you have to this `requirements.txt`.
### Configure 🤗 Accelerate
You can configure the launch configuration for Amazon SageMaker the same as you do for non SageMaker training jobs with
the 🤗 Accelerate CLI.
the 🤗 Accelerate CLI:
```bash
accelerate config
@ -62,7 +62,7 @@ accelerate config
The training script is very similar to a training script you might run outside of SageMaker, but to save your model
after training you need to specify either `/opt/ml/model` or use `os.environ["SM_MODEL_DIR"]` as your save
directory. After training, artifacts in this directory are uploaded to S3.
directory. After training, artifacts in this directory are uploaded to S3:
```diff
@ -79,7 +79,7 @@ specify type as bool in your script and provide an explicit True or False value
### Launch Training
You can launch your training with 🤗 Accelerate CLI with
You can launch your training with 🤗 Accelerate CLI with:
```
accelerate launch path_to_script.py --args_to_the_script

View File

@ -13,7 +13,7 @@ specific language governing permissions and limitations under the License.
# Tracking
There are a large number of experiment tracking API's available, however getting them all to work with in a multi-processing environment can oftentimes be complex.
Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`]
🤗 Accelerate provides a general tracking API that can be used to log useful items during your script through [`~Accelerator.log`]
## Integrated Trackers

docs/source/utilities.mdx Normal file
View File

@ -0,0 +1,91 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Helpful Utilities
Below are a variety of utility functions that 🤗 Accelerate provides, broken down by use-case.
## Data Classes
These are basic dataclasses used throughout 🤗 Accelerate and they can be passed in as parameters.
[[autodoc]] utils.DistributedType
[[autodoc]] utils.LoggerType
[[autodoc]] utils.PrecisionType
## Data Manipulation and Operations
These include data operations that mimic the same `torch` ops but can be used on distributed processes.
[[autodoc]] utils.broadcast
[[autodoc]] utils.concatenate
[[autodoc]] utils.gather
[[autodoc]] utils.pad_across_processes
[[autodoc]] utils.reduce
[[autodoc]] utils.send_to_device
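As a small sketch of how these can be used together (assuming the script is launched on several processes with `accelerate launch`):
```python
import torch

from accelerate import Accelerator
from accelerate.utils import gather, send_to_device

accelerator = Accelerator()
# Each process builds a tensor on its own device...
local_values = torch.tensor([accelerator.process_index], device=accelerator.device)
# ...and `gather` collects the values from every process onto each process.
all_values = gather(local_values)
# `send_to_device` recursively moves tensors (or nested containers of tensors) to a given device.
all_values_cpu = send_to_device(all_values, "cpu")
```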
## Environment Checks
These functionalities check the state of the current working environment including information about the operating system itself, what it can support, and if particular dependencies are installed.
[[autodoc]] utils.get_max_memory
[[autodoc]] utils.is_bf16_available
[[autodoc]] utils.is_torch_version
[[autodoc]] utils.is_tpu_available
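For instance, a minimal sketch combining two of these checks (assuming their default arguments):
```python
from accelerate.utils import is_bf16_available, is_tpu_available

# Pick a mixed precision mode based on what the current environment supports.
mixed_precision = "bf16" if is_bf16_available() else "fp16"
print(f"TPU available: {is_tpu_available()}, chosen mixed precision: {mixed_precision}")
```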
## Environment Configuration
[[autodoc]] utils.write_basic_config
When setting up 🤗 Accelerate for the first time, rather than running `accelerate config`, [`~utils.write_basic_config`] can be used as an alternative for quick configuration.
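A minimal sketch (assuming the default arguments are acceptable for your machine):
```python
from accelerate.utils import write_basic_config

# Writes a basic config file for the current machine so that `accelerate launch`
# can be used without going through the interactive `accelerate config` questionnaire.
write_basic_config()
```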
## Modeling
These utilities relate to interacting with PyTorch models
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.get_max_layer_size
[[autodoc]] utils.offload_state_dict
## Parallel
These include general utilities that should be used when working in parallel.
[[autodoc]] utils.extract_model_from_parallel
[[autodoc]] utils.save
[[autodoc]] utils.wait_for_everyone
## Random
These utilities relate to setting and synchronizing all the random states.
[[autodoc]] utils.set_seed
[[autodoc]] utils.synchronize_rng_state
[[autodoc]] utils.synchronize_rng_states
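For instance (a minimal sketch):
```python
from accelerate.utils import set_seed

# Sets the seeds of the relevant random number generators in one call, for reproducible runs.
set_seed(42)
```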

View File

@ -23,7 +23,7 @@ The [nlp_example.py](./nlp_example.py) script is a simple example to train a Ber
Prior to running it you should install 🤗 Dataset and 🤗 Transformers:
```bash
pip install datasets transformers
pip install datasets evaluate transformers
```
The same script can be run in any of the following configurations:

View File

@ -42,6 +42,18 @@ These arguments should be added at the end of any method for starting the python
accelerate launch ./checkpointing.py --checkpointing_steps epoch output_dir "checkpointing_tutorial" --resume_from_checkpoint "checkpointing_tutorial/epoch_0"
```
### Cross Validation (`cross_validation.py`)
- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
- Arguments available:
- `num_folds`, the number of folds the training dataset should be split into.
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./cross_validation.py --num_folds 2
```
### Experiment Tracking (`tracking.py`)
- Shows how to use `Accelerate.init_trackers` and `Accelerator.log`
@ -55,14 +67,14 @@ These arguments should be added at the end of any method for starting the python
accelerate launch ./tracking.py --with_tracking
```
### Cross Validation (`cross_validation.py`)
### Gradient Accumulation (`gradient_accumulation.py`)
- Shows how to use `Accelerator.free_memory` and run cross validation efficiently with `datasets`.
- Shows how to use `Accelerator.no_sync` to prevent gradient averaging in a distributed setup.
- Arguments available:
- `num_folds`, the number of folds the training dataset should be split into.
- `gradient_accumulation_steps`, the number of steps to perform before the gradients are accumulated and the optimizer and scheduler are stepped + zero_grad
These arguments should be added at the end of any method for starting the python script (such as `python`, `accelerate launch`, `python -m torch.distributed.launch`), such as:
```bash
accelerate launch ./cross_validation.py --num_folds 2
```
accelerate launch ./gradient_accumulation.py --gradient_accumulation_steps 5
```

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -116,7 +112,6 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
@ -137,11 +132,11 @@ def training_function(config, args):
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -154,7 +149,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -296,7 +291,7 @@ def main():
help="If the training should continue from a checkpoint folder.",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -17,21 +17,17 @@ from typing import List
import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import DatasetDict, load_dataset, load_metric
from datasets import DatasetDict, load_dataset
# New Code #
# We'll be using StratifiedKFold for this example
from sklearn.model_selection import StratifiedKFold
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -129,7 +125,6 @@ def get_fold_dataloaders(
def training_function(config, args):
# New Code #
test_labels = None
test_predictions = []
# Download the dataset
datasets = load_dataset("glue", "mrpc")
@ -140,15 +135,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -157,17 +151,15 @@ def training_function(config, args):
# New Code #
# Create our folds:
folds = kfold.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])
test_references = []
# Iterate over them
for train_idxs, valid_idxs in folds:
for i, (train_idxs, valid_idxs) in enumerate(folds):
train_dataloader, eval_dataloader, test_dataloader = get_fold_dataloaders(
accelerator,
datasets,
train_idxs,
valid_idxs,
)
if test_labels is None:
test_labels = datasets["validation"]["label"]
# Instantiate the model (we build the model here so that the seed also control new weights initialization)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
@ -177,7 +169,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -236,19 +228,18 @@ def training_function(config, args):
predictions = outputs.logits
predictions, references = accelerator.gather((predictions, batch["labels"]))
fold_predictions.append(predictions.cpu())
metric.add_batch(
predictions=predictions.argmax(dim=-1),
references=references,
)
test_metric = metric.compute()
if i == 0:
# We need all of the test predictions
test_references.append(references.cpu())
# Use accelerator.print to print only on the main process.
test_predictions.append(torch.cat(fold_predictions, dim=0))
# We now need to release all our memory and get rid of the current model, optimizer, etc
accelerator.free_memory()
# New Code #
# Finally we check the accuracy of our folded results:
preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(config["n_splits"])).argmax(dim=-1)
test_metric = metric.compute(predictions=preds, references=test_labels)
test_references = torch.cat(test_references, dim=0)
preds = torch.stack(test_predictions, dim=0).sum(dim=0).div(int(args.num_folds)).argmax(dim=-1)
test_metric = metric.compute(predictions=preds, references=test_references)
accelerator.print("Average test metrics from all folds:", test_metric)
@ -267,7 +258,7 @@ def main():
# New Code #
parser.add_argument("--num_folds", type=int, default=3, help="The number of splits to perform across the dataset")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -0,0 +1,736 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...)
on a text file or a dataset without using HuggingFace Trainer.
Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=text-generation
"""
# You can also adapt this script on your own causal language modeling task. Pointers for this are left as comments.
import argparse
import json
import logging
import math
import os
import random
from itertools import chain
from pathlib import Path
import torch
from torch.utils.data import DataLoader
import datasets
import transformers
from accelerate import Accelerator, DistributedType
from accelerate.logging import get_logger
from accelerate.utils import DummyOptim, DummyScheduler, set_seed
from datasets import load_dataset
from huggingface_hub import Repository
from tqdm.auto import tqdm
from transformers import (
CONFIG_MAPPING,
MODEL_MAPPING,
AutoConfig,
AutoModelForCausalLM,
AutoTokenizer,
SchedulerType,
default_data_collator,
get_scheduler,
)
from transformers.utils import get_full_repo_name
from transformers.utils.versions import require_version
logger = get_logger(__name__)
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
def parse_args():
parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task")
parser.add_argument(
"--dataset_name",
type=str,
default=None,
help="The name of the dataset to use (via the datasets library).",
)
parser.add_argument(
"--dataset_config_name",
type=str,
default=None,
help="The configuration name of the dataset to use (via the datasets library).",
)
parser.add_argument(
"--train_file", type=str, default=None, help="A csv or a json file containing the training data."
)
parser.add_argument(
"--validation_file", type=str, default=None, help="A csv or a json file containing the validation data."
)
parser.add_argument(
"--validation_split_percentage",
default=5,
help="The percentage of the train set used as validation set in case there's no validation split",
)
parser.add_argument(
"--model_name_or_path",
type=str,
help="Path to pretrained model or model identifier from huggingface.co/models.",
required=False,
)
parser.add_argument(
"--config_name",
type=str,
default=None,
help="Pretrained config name or path if not the same as model_name",
)
parser.add_argument(
"--tokenizer_name",
type=str,
default=None,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",
type=int,
default=8,
help="Batch size (per device) for the training dataloader.",
)
parser.add_argument(
"--per_device_eval_batch_size",
type=int,
default=8,
help="Batch size (per device) for the evaluation dataloader.",
)
parser.add_argument(
"--learning_rate",
type=float,
default=5e-5,
help="Initial learning rate (after the potential warmup period) to use.",
)
parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
parser.add_argument(
"--max_train_steps",
type=int,
default=None,
help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument(
"--lr_scheduler_type",
type=SchedulerType,
default="linear",
help="The scheduler type to use.",
choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
)
parser.add_argument(
"--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
)
parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
parser.add_argument(
"--model_type",
type=str,
default=None,
help="Model type to use if training from scratch.",
choices=MODEL_TYPES,
)
parser.add_argument(
"--block_size",
type=int,
default=None,
help=(
"Optional input sequence length after tokenization. The training dataset will be truncated in block of"
" this size for training. Default to the model max input length for single sentence inputs (take into"
" account special tokens)."
),
)
parser.add_argument(
"--preprocessing_num_workers",
type=int,
default=None,
help="The number of processes to use for the preprocessing.",
)
parser.add_argument(
"--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets"
)
parser.add_argument(
"--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files."
)
parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")
parser.add_argument(
"--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`."
)
parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.")
parser.add_argument(
"--checkpointing_steps",
type=str,
default=None,
help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
)
parser.add_argument(
"--resume_from_checkpoint",
type=str,
default=None,
help="If the training should continue from a checkpoint folder.",
)
# New Code #
# Whether to load the best model at the end of training
parser.add_argument(
"--load_best_model",
action="store_true",
help="Whether to load the best model at the end of training",
)
parser.add_argument(
"--with_tracking",
action="store_true",
help="Whether to enable experiment trackers for logging.",
)
parser.add_argument(
"--report_to",
type=str,
default="all",
help=(
'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,'
' `"wandb"` and `"comet_ml"`. Use `"all"` (default) to report to all integrations.'
"Only applicable when `--with_tracking` is passed."
),
)
args = parser.parse_args()
# Sanity checks
if args.dataset_name is None and args.train_file is None and args.validation_file is None:
raise ValueError("Need either a dataset name or a training/validation file.")
else:
if args.train_file is not None:
extension = args.train_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file."
if args.validation_file is not None:
extension = args.validation_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file."
if args.push_to_hub:
assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed."
return args
# New Code #
def checkpoint_model(checkpoint_folder, ckpt_id, model, epoch, last_global_step, **kwargs):
"""Utility function for checkpointing model + optimizer dictionaries
The main purpose for this is to be able to resume training from that instant again
"""
checkpoint_state_dict = {
"epoch": epoch,
"last_global_step": last_global_step,
}
# Add extra kwargs too
checkpoint_state_dict.update(kwargs)
success = model.save_checkpoint(checkpoint_folder, ckpt_id, checkpoint_state_dict)
status_msg = f"checkpointing: checkpoint_folder={checkpoint_folder}, ckpt_id={ckpt_id}"
if success:
logging.info(f"Success {status_msg}")
else:
logging.warning(f"Failure {status_msg}")
return
# New Code #
def load_training_checkpoint(model, load_dir, tag=None, **kwargs):
"""Utility function for checkpointing model + optimizer dictionaries
The main purpose for this is to be able to resume training from that instant again
"""
_, checkpoint_state_dict = model.load_checkpoint(load_dir, tag=tag, **kwargs)
epoch = checkpoint_state_dict["epoch"]
last_global_step = checkpoint_state_dict["last_global_step"]
del checkpoint_state_dict
return (epoch, last_global_step)
# New Code #
def evaluate(args, model, eval_dataloader, accelerator, eval_dataset):
model.eval()
losses = []
for step, batch in enumerate(eval_dataloader):
with torch.no_grad():
outputs = model(**batch)
loss = outputs.loss
losses.append(accelerator.gather(loss.repeat(args.per_device_eval_batch_size)))
losses = torch.cat(losses)
losses = losses[: len(eval_dataset)]
try:
eval_loss = torch.mean(losses)
perplexity = math.exp(eval_loss)
except OverflowError:
perplexity = float("inf")
return perplexity, eval_loss
def main():
args = parse_args()
# Initialize the accelerator. We will let the accelerator handle device placement for us in this example.
# If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers
# in the environment
accelerator = (
Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator()
)
# Make one log on every process with the configuration for debugging.
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger.info(accelerator.state, main_process_only=False)
if accelerator.is_local_main_process:
datasets.utils.logging.set_verbosity_warning()
transformers.utils.logging.set_verbosity_info()
else:
datasets.utils.logging.set_verbosity_error()
transformers.utils.logging.set_verbosity_error()
# If passed along, set the training seed now.
if args.seed is not None:
set_seed(args.seed)
# Handle the repository creation
if accelerator.is_main_process:
if args.push_to_hub:
if args.hub_model_id is None:
repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)
else:
repo_name = args.hub_model_id
repo = Repository(args.output_dir, clone_from=repo_name)
with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore:
if "step_*" not in gitignore:
gitignore.write("step_*\n")
if "epoch_*" not in gitignore:
gitignore.write("epoch_*\n")
elif args.output_dir is not None:
os.makedirs(args.output_dir, exist_ok=True)
accelerator.wait_for_everyone()
# Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
#
# For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
# 'text' is found. You can easily tweak this behavior (see below).
#
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
if args.dataset_name is not None:
# Downloading and loading a dataset from the hub.
raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
if "validation" not in raw_datasets.keys():
raw_datasets["validation"] = load_dataset(
args.dataset_name,
args.dataset_config_name,
split=f"train[:{args.validation_split_percentage}%]",
)
raw_datasets["train"] = load_dataset(
args.dataset_name,
args.dataset_config_name,
split=f"train[{args.validation_split_percentage}%:]",
)
else:
data_files = {}
dataset_args = {}
if args.train_file is not None:
data_files["train"] = args.train_file
if args.validation_file is not None:
data_files["validation"] = args.validation_file
extension = args.train_file.split(".")[-1]
if extension == "txt":
extension = "text"
dataset_args["keep_linebreaks"] = not args.no_keep_linebreaks
raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args)
# If no validation data is there, validation_split_percentage will be used to divide the dataset.
if "validation" not in raw_datasets.keys():
raw_datasets["validation"] = load_dataset(
extension,
data_files=data_files,
split=f"train[:{args.validation_split_percentage}%]",
**dataset_args,
)
raw_datasets["train"] = load_dataset(
extension,
data_files=data_files,
split=f"train[{args.validation_split_percentage}%:]",
**dataset_args,
)
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
#
# In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
if args.config_name:
config = AutoConfig.from_pretrained(args.config_name)
elif args.model_name_or_path:
config = AutoConfig.from_pretrained(args.model_name_or_path)
else:
config = CONFIG_MAPPING[args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
if args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer)
elif args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
"You can do it from another script, save it, and load it from here, using --tokenizer_name."
)
if args.model_name_or_path:
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
)
else:
logger.info("Training new model from scratch")
model = AutoModelForCausalLM.from_config(config)
model.resize_token_embeddings(len(tokenizer))
# Preprocessing the datasets.
# First we tokenize all the texts.
column_names = raw_datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]
def tokenize_function(examples):
return tokenizer(examples[text_column_name])
with accelerator.main_process_first():
tokenized_datasets = raw_datasets.map(
tokenize_function,
batched=True,
num_proc=args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not args.overwrite_cache,
desc="Running tokenizer on dataset",
)
if args.block_size is None:
block_size = tokenizer.model_max_length
if block_size > 1024:
logger.warning(
f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
"Picking 1024 instead. You can change that default value by passing --block_size xxx."
)
block_size = 1024
else:
if args.block_size > tokenizer.model_max_length:
logger.warning(
f"The block_size passed ({args.block_size}) is larger than the maximum length for the model"
f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
)
block_size = min(args.block_size, tokenizer.model_max_length)
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder; we could add padding instead if the model supported it. You can
# customize this part to your needs.
if total_length >= block_size:
total_length = (total_length // block_size) * block_size
# Split by chunks of max_len.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder
# for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower
# to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
with accelerator.main_process_first():
lm_datasets = tokenized_datasets.map(
group_texts,
batched=True,
num_proc=args.preprocessing_num_workers,
load_from_cache_file=not args.overwrite_cache,
desc=f"Grouping texts in chunks of {block_size}",
)
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]
# Log a few random samples from the training set:
for index in random.sample(range(len(train_dataset)), 3):
logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")
# DataLoaders creation:
train_dataloader = DataLoader(
train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=args.per_device_train_batch_size
)
eval_dataloader = DataLoader(
eval_dataset, collate_fn=default_data_collator, batch_size=args.per_device_eval_batch_size
)
# Optimizer
# Split weights in two groups, one with weight decay and the other not.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{
"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
# New Code #
# Creates a Dummy Optimizer if `optimizer` was specified in the config file, else creates an AdamW optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
else DummyOptim
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)
# On TPU, the tied weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()
# Scheduler and math around the number of training steps.
# New Code #
# Get gradient accumulation steps from deepspeed config if available
if accelerator.state.deepspeed_plugin is not None:
args.gradient_accumulation_steps = accelerator.state.deepspeed_plugin.deepspeed_config[
"gradient_accumulation_steps"
]
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
# New Code #
# Creates a Dummy Scheduler if `scheduler` was specified in the config file, else creates a scheduler of type `args.lr_scheduler_type`
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
):
lr_scheduler = get_scheduler(
name=args.lr_scheduler_type,
optimizer=optimizer,
num_warmup_steps=args.num_warmup_steps,
num_training_steps=args.max_train_steps,
)
else:
lr_scheduler = DummyScheduler(
optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
)
# Prepare everything with our `accelerator`.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
# Figure out how often we should save the Accelerator states
if hasattr(args.checkpointing_steps, "isdigit"):
checkpointing_steps = args.checkpointing_steps
if args.checkpointing_steps.isdigit():
checkpointing_steps = int(args.checkpointing_steps)
else:
checkpointing_steps = None
# We need to initialize the trackers we use, and also store our configuration.
# We initialize the trackers only on main process because `accelerator.log`
# only logs on main process and we don't want empty logs/runs on other processes.
if args.with_tracking:
if accelerator.is_main_process:
experiment_config = vars(args)
# TensorBoard cannot log Enums, need the raw value
experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value
accelerator.init_trackers("clm_no_trainer", experiment_config)
# Train!
total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
logger.info("***** Running training *****")
logger.info(f" Num examples = {len(train_dataset)}")
logger.info(f" Num Epochs = {args.num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {args.max_train_steps}")
# Only show the progress bar once on each machine.
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
completed_steps = 0
starting_epoch = 0
best_metric = None
best_metric_checkpoint = None
# Potentially load in the weights and states from a previous save
if args.resume_from_checkpoint:
# New Code #
# Loads the DeepSpeed checkpoint from the specified path
_, last_global_step = load_training_checkpoint(
model,
args.resume_from_checkpoint,
**{"load_optimizer_states": True, "load_lr_scheduler_states": True},
)
accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
resume_step = last_global_step
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
for epoch in range(starting_epoch, args.num_train_epochs):
model.train()
if args.with_tracking:
total_loss = 0
for step, batch in enumerate(train_dataloader):
# We need to skip steps until we reach the resumed step
if args.resume_from_checkpoint and epoch == starting_epoch:
if resume_step is not None and step < resume_step:
completed_steps += 1
continue
outputs = model(**batch)
loss = outputs.loss
# We keep track of the loss at each epoch
if args.with_tracking:
total_loss += loss.detach().float()
loss = loss / args.gradient_accumulation_steps
accelerator.backward(loss)
if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
completed_steps += 1
if isinstance(checkpointing_steps, int):
if completed_steps % checkpointing_steps == 0:
output_dir = f"step_{completed_steps }"
if args.output_dir is not None:
output_dir = os.path.join(args.output_dir, output_dir)
accelerator.save_state(output_dir)
if completed_steps >= args.max_train_steps:
break
perplexity, eval_loss = evaluate(args, model, eval_dataloader, accelerator, eval_dataset)
logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}")
if args.with_tracking:
accelerator.log(
{
"perplexity": perplexity,
"eval_loss": eval_loss,
"train_loss": total_loss.item() / len(train_dataloader),
"epoch": epoch,
"step": completed_steps,
},
step=completed_steps,
)
# New Code #
# Save the DeepSpeed checkpoint to the specified path
checkpoint_model(args.output_dir, epoch, model, epoch, completed_steps)
# New Code #
# Tracks the best checkpoint and best metric
if best_metric is None or best_metric > perplexity:
best_metric = perplexity
best_metric_checkpoint = os.path.join(args.output_dir, str(epoch))
accelerator.print(f"New best metric: {best_metric} at epoch {epoch}")
accelerator.print(f"best_metric_checkpoint: {best_metric_checkpoint}")
# New Code #
# Loads the best checkpoint after the training is finished
if args.load_best_model:
_, last_global_step = load_training_checkpoint(
model,
"/".join(best_metric_checkpoint.split("/")[:-1]),
tag=best_metric_checkpoint.split("/")[-1],
**{"load_optimizer_states": True, "load_lr_scheduler_states": True},
)
# New Code #
# Evaluates using the best checkpoint
perplexity, eval_loss = evaluate(args, model, eval_dataloader, accelerator, eval_dataset)
logger.info(f"Best model metrics: perplexity: {perplexity} eval_loss: {eval_loss}")
if perplexity != best_metric:
raise AssertionError(
f"Best metric {best_metric} does not match the metric {perplexity} of the loaded best model."
)
if args.output_dir is not None:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
# New Code #
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The saved model file is named `pytorch_model.bin`.
unwrapped_model.save_pretrained(
args.output_dir,
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
state_dict=accelerator.get_state_dict(model),
)
if accelerator.is_main_process:
tokenizer.save_pretrained(args.output_dir)
if args.push_to_hub:
repo.push_to_hub(commit_message="End of training", auto_lfs_prune=True)
with open(os.path.join(args.output_dir, "all_results.json"), "w") as f:
json.dump({"perplexity": perplexity, "eval_loss": eval_loss.item()}, f)
if __name__ == "__main__":
main()
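
Note: `evaluate(...)`, `checkpoint_model(...)` and `load_training_checkpoint(...)` are helpers defined earlier in this file, outside the portion shown above. A minimal sketch of what the two checkpoint helpers amount to, assuming they thinly wrap the DeepSpeed engine's `save_checkpoint`/`load_checkpoint` (the names and the client-state fields below are illustrative, not a verbatim copy of the script):

def checkpoint_model(checkpoint_folder, ckpt_id, model, epoch, last_global_step, **kwargs):
    """Save a DeepSpeed checkpoint together with a small client-state dict."""
    client_state = {"epoch": epoch, "last_global_step": last_global_step}
    # `model` here is the DeepSpeed engine returned by `accelerator.prepare`, so it exposes `save_checkpoint`.
    model.save_checkpoint(checkpoint_folder, ckpt_id, client_state=client_state)

def load_training_checkpoint(model, load_dir, tag=None, **kwargs):
    """Load a DeepSpeed checkpoint and return (epoch, last_global_step) from its client state."""
    _, client_state = model.load_checkpoint(load_dir, tag=tag, **kwargs)
    return client_state["epoch"], client_state["last_global_step"]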

View File

@ -19,8 +19,9 @@ import os
import torch
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
@ -111,15 +112,12 @@ def training_function(config, args):
# We need to initialize the trackers we use, and also store our configuration
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.print(run)
accelerator.init_trackers(run, config)
experiment_config = vars(args)
accelerator.init_trackers("fsdp_glue_no_trainer", experiment_config)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
@ -139,7 +137,7 @@ def training_function(config, args):
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -164,6 +162,7 @@ def training_function(config, args):
# New Code #
# For FSDP feature, it is highly recommended and efficient to prepare the model before creating optimizer
model = accelerator.prepare(model)
accelerator.print(model)
# Instantiate optimizer
# New Code #
@ -282,7 +281,7 @@ def training_function(config, args):
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
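For context, the `samples_seen` counter referenced in this hunk is maintained inside the evaluation loop of the example script; a sketch of the surrounding logic (not shown in the hunk, variable names assumed from the script):

samples_seen = 0
for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    predictions, references = accelerator.gather((predictions, batch["labels"]))
    if accelerator.use_distributed:
        # The distributed sampler pads the last batch with duplicated samples, so trim them
        # before feeding the metric; otherwise just count how many real samples we have seen.
        if step == len(eval_dataloader) - 1:
            predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
            references = references[: len(eval_dataloader.dataset) - samples_seen]
        else:
            samples_seen += references.shape[0]
    metric.add_batch(predictions=predictions, references=references)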

View File

@ -0,0 +1,210 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
# This is a fully working simple example to use Accelerate
# and perform gradient accumulation
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - (multi) TPUs
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 32
def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
"""
Creates a set of `DataLoader`s for the `glue` dataset,
using "bert-base-cased" as the tokenizer.
Args:
accelerator (`Accelerator`):
An `Accelerator` object
batch_size (`int`, *optional*):
The batch size for the train and validation DataLoaders.
"""
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
return outputs
# Apply the method we just defined to all the examples in all the splits of the dataset
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
remove_columns=["idx", "sentence1", "sentence2"],
)
# We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
def collate_fn(examples):
# On TPU it's best to pad everything to the same length or training will be very slow.
if accelerator.distributed_type == DistributedType.TPU:
return tokenizer.pad(examples, padding="max_length", max_length=128, return_tensors="pt")
return tokenizer.pad(examples, padding="longest", return_tensors="pt")
# Instantiate dataloaders.
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
return train_dataloader, eval_dataloader
# For testing only
if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
from accelerate.test_utils.training import mocked_dataloaders
get_dataloaders = mocked_dataloaders # noqa: F811
def training_function(config, args):
# New Code #
gradient_accumulation_steps = int(args.gradient_accumulation_steps)
# Initialize accelerator
accelerator = Accelerator(
cpu=args.cpu, mixed_precision=args.mixed_precision, gradient_accumulation_steps=gradient_accumulation_steps
)
if accelerator.distributed_type == DistributedType.TPU and gradient_accumulation_steps > 1:
raise NotImplementedError(
"Gradient accumulation on TPUs is currently not supported. Pass `gradient_accumulation_steps=1`"
)
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = evaluate.load("glue", "mrpc")
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
# Instantiate the model (we build the model here so that the seed also controls the initialization of new weights)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
# We could avoid this line since the accelerator is set with `device_placement=True` (default value).
# Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
# creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=100,
num_training_steps=(len(train_dataloader) * num_epochs),
)
# Prepare everything
# There is no specific order to remember; we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
# Now we train the model
for epoch in range(num_epochs):
model.train()
for step, batch in enumerate(train_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
# New code #
# We use the new `accumulate` context manager to perform gradient accumulation
# We also do not currently support TPUs, nor do we advise using them, as bugs were found on the XLA side when running our tests.
with accelerator.accumulate(model):
output = model(**batch)
loss = output.loss
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
model.eval()
for step, batch in enumerate(eval_dataloader):
# We could avoid this line since we set the accelerator with `device_placement=True`.
batch.to(accelerator.device)
with torch.no_grad():
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["labels"]))
metric.add_batch(
predictions=predictions,
references=references,
)
eval_metric = metric.compute()
# Use accelerator.print to print only on the main process.
accelerator.print(f"epoch {epoch}:", eval_metric)
def main():
parser = argparse.ArgumentParser(description="Simple example of training script.")
parser.add_argument(
"--mixed_precision",
type=str,
default="no",
choices=["no", "fp16", "bf16"],
help="Whether to use mixed precision. Choose"
"between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
"and an Nvidia Ampere GPU.",
)
# New Code #
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="The number of minibatches to be ran before gradients are accumulated.",
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)
if __name__ == "__main__":
main()
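
A small, hedged addition to the training loop above: the new `Accelerator.sync_gradients` property (introduced in the accelerator.py changes further down) reports whether the current step actually synchronized gradients, which is handy for counting real optimizer steps or gating logging. A minimal sketch, assuming the same objects as in the script:

completed_steps = 0
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    # True only on the steps where gradients were synchronized, i.e. once every
    # `gradient_accumulation_steps` batches and on the last batch of the dataloader.
    if accelerator.sync_gradients:
        completed_steps += 1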

View File

@ -15,20 +15,15 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator, DistributedType
# New Code #
import evaluate
from accelerate import Accelerator, DistributedType
from accelerate.utils import find_executable_batch_size
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -117,15 +112,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -139,7 +133,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# New Code #
# We now can define an inner training loop function. It should take a batch size as the only parameter,
@ -218,7 +212,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -118,15 +114,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -141,7 +136,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -183,7 +178,7 @@ def training_function(config, args):
predictions, references = accelerator.gather((predictions, batch["labels"]))
# New Code #
# First we check if it's a distributed system
if accelerator.num_processes > 1:
if accelerator.use_distributed:
# Then see if we're on the last batch of our eval dataloader
if step == len(eval_dataloader) - 1:
# Last batch needs to be truncated on distributed systems as it contains additional samples
@ -215,7 +210,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -126,17 +122,16 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
set_seed(seed)
train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -149,7 +144,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -170,8 +165,6 @@ def training_function(config, args):
if args.with_tracking:
if accelerator.is_main_process:
run = os.path.split(__file__)[-1].split(".")[0]
if args.logging_dir:
run = os.path.join(args.logging_dir, run)
accelerator.init_trackers(run, config)
# Now we train the model
@ -259,7 +252,7 @@ def main():
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -242,7 +242,7 @@ def training_function(config, args):
outputs = model(inputs)
predictions = outputs.argmax(dim=-1)
predictions, references = accelerator.gather((predictions, batch["label"]))
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader) - samples_seen]
references = references[: len(eval_dataloader) - samples_seen]

View File

@ -16,17 +16,13 @@ import argparse
import os
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -75,7 +71,6 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
@ -89,7 +84,7 @@ def training_function(config, args):
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
def tokenize_function(examples):
# max_length=None => use the model max length (it's actually the default)
@ -109,7 +104,7 @@ def training_function(config, args):
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -138,7 +133,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -227,7 +222,7 @@ def training_function(config, args):
predictions, references = accelerator.gather(
(predictions, batch["labels"])
) # If we are in a multiprocess environment, the last batch has duplicates
if accelerator.num_processes > 1:
if accelerator.use_distributed:
if step == len(eval_dataloader) - 1:
predictions = predictions[: len(eval_dataloader.dataset) - samples_seen]
references = references[: len(eval_dataloader.dataset) - samples_seen]
@ -304,7 +299,7 @@ def main():
help="Location on where to store experiment tracking logs`",
)
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -0,0 +1,43 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
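
Because this template defines both an `optimizer` and a `scheduler`, a script using it has to hand placeholder objects to `accelerator.prepare` instead of real ones, exactly as the CLM example above does. A minimal hedged sketch (the model, dataloader and step counts are assumed to exist already):

from accelerate.utils import DummyOptim, DummyScheduler

# The placeholders carry the hyper-parameters that fill the "auto" fields of the config.
optimizer = DummyOptim(model.parameters(), lr=5e-5, weight_decay=0.0)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=max_train_steps, warmup_num_steps=100)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)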

View File

@ -0,0 +1,43 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,47 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,44 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,52 @@
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
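
The "auto" entries in these templates are not valid DeepSpeed values by themselves; accelerate replaces them with concrete numbers derived from the objects passed to `prepare` (see the `config_kwargs` handling in the accelerator.py hunks further down). A rough, purely illustrative sketch of that substitution, with a hypothetical filename:

import json

# Hypothetical filename for the template above.
with open("zero_stage3_offload_config.json") as f:
    ds_config = json.load(f)

# Normally filled in by `accelerator.prepare()`; spelled out here only to make the
# mapping from "auto" to concrete values explicit.
ds_config["train_micro_batch_size_per_gpu"] = 16
ds_config["train_batch_size"] = 16 * 1 * 8  # micro batch size * gradient accumulation steps * processes
ds_config["gradient_clipping"] = 1.0
ds_config["scheduler"]["params"]["warmup_num_steps"] = 100
ds_config["scheduler"]["params"]["total_num_steps"] = 1000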

View File

@ -15,17 +15,13 @@
import argparse
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
import evaluate
from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
AdamW,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
set_seed,
)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
########################################################################
@ -102,15 +98,14 @@ def training_function(config, args):
# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
correct_bias = config["correct_bias"]
seed = int(config["seed"])
batch_size = int(config["batch_size"])
metric = load_metric("glue", "mrpc")
metric = evaluate.load("glue", "mrpc")
# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
batch_size = MAX_GPU_BATCH_SIZE
@ -125,7 +120,7 @@ def training_function(config, args):
model = model.to(accelerator.device)
# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr, correct_bias=correct_bias)
optimizer = AdamW(params=model.parameters(), lr=lr)
# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
@ -187,7 +182,7 @@ def main():
)
parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
args = parser.parse_args()
config = {"lr": 2e-5, "num_epochs": 3, "correct_bias": True, "seed": 42, "batch_size": 16}
config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
training_function(config, args)

View File

@ -1 +1,3 @@
accelerate # used to be installed in Amazon SageMaker environment
accelerate # used to be installed in Amazon SageMaker environment
evaluate
datasets==2.3.2

View File

@ -16,19 +16,22 @@ from setuptools import setup
from setuptools import find_packages
extras = {}
extras["quality"] = ["black ~= 22.0", "isort >= 5.5.4", "flake8 >= 3.8.3"]
extras["quality"] = ["black ~= 22.0", "isort >= 5.5.4", "flake8 >= 3.8.3", "hf-doc-builder >= 0.3.0"]
extras["docs"] = []
extras["test"] = [
"psutil",
"pytest",
"pytest-xdist",
"pytest-subtests",
"datasets",
"datasets<=2.2.2",
"evaluate",
"transformers",
"scipy",
"sklearn"
"sklearn",
"parameterized",
"deepspeed",
]
extras["test_trackers"] = ["wandb", "comet-ml", "tensorflow>=2.6.2", "tensorboard"]
extras["test_trackers"] = ["wandb", "comet-ml", "tensorboard"]
extras["dev"] = extras["quality"] + extras["test"]
extras["sagemaker"] = [
@ -37,7 +40,7 @@ extras["sagemaker"] = [
setup(
name="accelerate",
version="0.10.0.dev0",
version="0.11.0",
description="Accelerate",
long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown",
@ -55,8 +58,8 @@ setup(
"accelerate-launch=accelerate.commands.launch:main",
]
},
python_requires=">=3.6.0",
install_requires=["numpy>=1.17", "packaging>=20.0", "pyyaml", "torch>=1.4.0"],
python_requires=">=3.7.0",
install_requires=["numpy>=1.17", "packaging>=20.0", "psutil", "pyyaml", "torch>=1.4.0"],
extras_require=extras,
classifiers=[
"Development Status :: 5 - Production/Stable",
@ -66,7 +69,6 @@ setup(
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],

View File

@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "0.10.0.dev0"
__version__ = "0.11.0"
from .accelerator import Accelerator
from .big_modeling import cpu_offload, disk_offload, dispatch_model, init_empty_weights, load_checkpoint_and_dispatch

View File

@ -12,7 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import gc
import math
import os
import sys
import warnings
@ -26,7 +28,7 @@ from .data_loader import prepare_data_loader
from .logging import get_logger
from .optimizer import AcceleratedOptimizer
from .scheduler import AcceleratedScheduler
from .state import AcceleratorState
from .state import AcceleratorState, GradientState
from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
from .utils import (
DeepSpeedPlugin,
@ -39,12 +41,15 @@ from .utils import (
LoggerType,
PrecisionType,
RNGType,
compare_versions,
convert_outputs_to_fp32,
extract_model_from_parallel,
gather,
get_pretty_name,
is_bf16_available,
is_deepspeed_available,
is_torch_version,
is_tpu_available,
pad_across_processes,
reduce,
save,
@ -55,7 +60,16 @@ from .utils import (
if is_deepspeed_available():
import deepspeed
from .utils import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
from .utils import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
)
if is_tpu_available(check_device=False):
import torch_xla.distributed.xla_multiprocessing as xmp
logger = get_logger(__name__)
@ -78,6 +92,9 @@ class Accelerator:
default to the value in the environment variable `MIXED_PRECISION`, which will use the default value in the
accelerate config of the current system or the flag passed with the `accelerate.launch` command. 'fp16'
requires pytorch 1.6 or higher. 'bf16' requires pytorch 1.10 or higher.
gradient_accumulation_steps (`int`, *optional*, default to 1):
The number of steps that should pass before gradients are accumulated. A number > 1 should be combined with
`Accelerator.accumulate`.
cpu (`bool`, *optional*):
Whether or not to force the script to execute on CPU. Will ignore GPU available if set to `True` and force
the execution on one process only.
@ -132,6 +149,7 @@ class Accelerator:
split_batches: bool = False,
fp16: bool = None,
mixed_precision: Union[PrecisionType, str] = None,
gradient_accumulation_steps: int = 1,
cpu: bool = False,
deepspeed_plugin: DeepSpeedPlugin = None,
fsdp_plugin: FullyShardedDataParallelPlugin = None,
@ -143,7 +161,10 @@ class Accelerator:
kwargs_handlers: Optional[List[KwargsHandler]] = None,
):
self.logging_dir = logging_dir
self.log_with = filter_trackers(log_with, self.logging_dir)
trackers = filter_trackers(log_with, self.logging_dir)
if len(trackers) < 1 and log_with is not None:
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
self.log_with = trackers
if mixed_precision is not None:
mixed_precision = str(mixed_precision)
@ -163,6 +184,19 @@ class Accelerator:
deepspeed_plugin, DeepSpeedPlugin
), "`deepspeed_plugin` must be a DeepSpeedPlugin object."
os.environ["USE_DEEPSPEED"] = "true" # use DeepSpeed if plugin is provided
if deepspeed_plugin:
if not is_deepspeed_available():
raise ImportError("DeepSpeed is not installed => run `pip install deepspeed` or build it from source.")
if compare_versions("deepspeed", "<", "0.6.5"):
raise ImportError("DeepSpeed version must be >= 0.6.5. Please update DeepSpeed.")
mixed_precision = os.environ.get("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
deepspeed_plugin.set_mixed_precision(mixed_precision)
deepspeed_plugin.set_deepspeed_weakref()
if os.environ.get("USE_FSDP", "false") == "true" or isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):
if is_torch_version("<", "1.12.0"):
raise ValueError("FSDP requires PyTorch >= 1.12.0")
if fsdp_plugin is None: # init from env variables
fsdp_plugin = FullyShardedDataParallelPlugin() if os.environ.get("USE_FSDP", "false") == "true" else None
@ -171,10 +205,6 @@ class Accelerator:
raise TypeError("`fsdp_plugin` must be a FullyShardedDataParallelPlugin object.")
os.environ["USE_FSDP"] = "true" # use FSDP if plugin is provided
if os.environ.get("USE_FSDP", "false") == "true":
if is_torch_version("<", "1.12.0.dev20220418+cu113"):
raise ValueError("FSDP requires PyTorch >= 1.12.0.dev20220418+cu113")
# Kwargs handlers
self.ddp_handler = None
self.scaler_handler = None
@ -208,6 +238,13 @@ class Accelerator:
**kwargs,
)
if gradient_accumulation_steps > 1:
if self.state.distributed_type == DistributedType.TPU:
raise NotImplementedError(
"Gradient accumulation on TPU is not supported. Pass in `gradient_accumulation_steps=1`"
)
self.gradient_accumulation_steps = gradient_accumulation_steps
self.device_placement = device_placement
self.split_batches = split_batches
self.dispatch_batches = dispatch_batches
@ -220,20 +257,33 @@ class Accelerator:
# Mixed precision attributes
self.scaler = None
self.native_amp = False
err = "{mode} mixed precision requires {requirement}"
if self.state.mixed_precision == "fp16":
self.native_amp = is_torch_version(">=", "1.6")
if not self.native_amp:
raise ValueError("fp16 mixed precision requires PyTorch >= 1.6")
raise ValueError(err.format(mode="fp16", requirement="PyTorch >= 1.6"))
if not torch.cuda.is_available():
raise ValueError(err.format(mode="fp16", requirement="a GPU"))
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
elif self.state.mixed_precision == "bf16":
self.native_amp = is_torch_version(">=", "1.10")
if mixed_precision == "bf16" and not self.native_amp:
raise ValueError("bf16 mixed precision requires PyTorch >= 1.10")
if self.distributed_type == DistributedType.FSDP:
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
self.scaler = ShardedGradScaler(**kwargs)
else:
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
elif self.state.mixed_precision == "bf16" and self.distributed_type != DistributedType.FSDP:
self.native_amp = is_bf16_available(True)
if mixed_precision == "bf16" and not self.native_amp and not is_tpu_available():
raise ValueError(err.format(mode="bf16", requirement="PyTorch >= 1.10 and a supported device."))
# Only on the GPU do we care about scaling the gradients
if torch.cuda.is_available():
kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
# Start of internal step tracking
self.step = 0
self.gradient_state = GradientState()
# Internal references to the training objects
self._optimizers = []
@ -246,6 +296,10 @@ class Accelerator:
if self.rng_types is None:
self.rng_types = ["torch"] if is_torch_version("<=", "1.5.1") else ["generator"]
@property
def use_distributed(self):
return self.distributed_type != DistributedType.NO and self.num_processes > 1
@property
def distributed_type(self):
return self.state.distributed_type
@ -321,6 +375,56 @@ class Accelerator:
if is_main:
self.wait_for_everyone()
@contextmanager
def no_sync(self, model):
"""
A context manager to disable gradient synchronizations across DDP processes by calling
`torch.nn.parallel.DistributedDataParallel.no_sync`.
If `model` is not in DDP, this context manager does nothing
Args:
model (`torch.nn.Module`):
PyTorch Module that was prepared with `Accelerator.prepare`
"""
context = contextlib.nullcontext
if self.use_distributed:
context = getattr(model, "no_sync", context)
with context():
yield
def _do_sync(self):
"Sets the right `sync_gradients` context and either resets or increases `self.step`"
if self.gradient_state.end_of_dataloader:
self.step = 0
self.gradient_state._set_sync_gradients(True)
else:
self.step += 1
self.gradient_state._set_sync_gradients((self.step % self.gradient_accumulation_steps) == 0)
@property
def sync_gradients(self):
return self.gradient_state.sync_gradients
@contextmanager
def accumulate(self, model):
"""
A context manager that will lightly wrap around and perform gradient accumulation automatically
Args:
model (`torch.nn.Module`):
PyTorch Module that was prepared with `Accelerator.prepare`
"""
self._do_sync()
if self.sync_gradients:
context = contextlib.nullcontext
else:
context = self.no_sync
with context(model):
yield
def print(self, *args, **kwargs):
"""
Use in replacement of `print()` to only print once per server.
@ -470,6 +574,7 @@ class Accelerator:
# Check if the model is already a FSDP model due to `Manual Wrapping` and if so,
# don't wrap it again
if type(model) != FSDP:
self.state.fsdp_plugin.set_auto_wrap_policy(model)
fsdp_plugin = self.state.fsdp_plugin
model = FSDP(
model,
@ -477,6 +582,7 @@ class Accelerator:
cpu_offload=fsdp_plugin.cpu_offload,
auto_wrap_policy=fsdp_plugin.auto_wrap_policy,
backward_prefetch=fsdp_plugin.backward_prefetch,
mixed_precision=fsdp_plugin.mixed_precision_policy,
ignored_modules=fsdp_plugin.ignored_modules,
)
if not fsdp_plugin.cpu_offload.offload_params:
@ -487,19 +593,28 @@ class Accelerator:
if self.native_amp:
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
model.forward = torch.cuda.amp.autocast(dtype=torch.float16)(model.forward)
elif self.mixed_precision == "bf16":
model.forward = torch.cuda.amp.autocast(dtype=torch.bfloat16)(model.forward)
elif self.mixed_precision == "bf16" and self.distributed_type != DistributedType.TPU:
device_type = "cuda" if torch.cuda.is_available() else "cpu"
model.forward = torch.autocast(device_type=device_type, dtype=torch.bfloat16)(model.forward)
else:
model.forward = torch.cuda.amp.autocast()(model.forward)
model.forward = convert_outputs_to_fp32(model.forward)
if self.distributed_type == DistributedType.TPU and self.state.fork_launched:
model = xmp.MpModelWrapper(model).to(self.device)
return model
def _prepare_deepspeed(self, *args):
deepspeed_plugin = self.state.deepspeed_plugin
self.deepspeed_config = deepspeed_plugin.deepspeed_config
result = [
self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
for obj in args
]
batch_sizes = [obj.batch_size for obj in args if hasattr(obj, "batch_size")]
if self.split_batches:
batch_sizes = [batch_size // self.num_processes for batch_size in batch_sizes]
if len(batch_sizes) == 0:
raise ValueError(
"You must specify a training or evaluation dataloader in `accelerate.prepare()` when using DeepSpeed."
@ -508,73 +623,141 @@ class Accelerator:
batch_size_per_device = min(batch_sizes) if deepspeed_plugin.is_train_batch_min else max(batch_sizes)
if len(batch_sizes) > 1:
logger.info(
f"Since you passed both train and evaluation dataloader, `is_train_batch_min` (here \
{deepspeed_plugin.is_train_batch_min} will decide the `train_batch_size` ({batch_size_per_device})."
"Since you passed both train and evaluation dataloader, `is_train_batch_min` (here "
f"{deepspeed_plugin.is_train_batch_min} will decide the `train_batch_size` ({batch_size_per_device})."
)
self.deepspeed_config["train_batch_size"] = (
batch_size_per_device * deepspeed_plugin.gradient_accumulation_steps * self.num_processes
)
result = [
self._prepare_one(obj, first_pass=True) if isinstance(obj, torch.utils.data.DataLoader) else obj
for obj in args
]
config_kwargs = {
"train_micro_batch_size_per_gpu": batch_size_per_device,
"train_batch_size": batch_size_per_device
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* self.num_processes,
"gradient_clipping": 1.0,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
model = None
optimizer = None
scheduler = None
for obj in result:
if isinstance(obj, torch.nn.Module):
model = obj
elif isinstance(obj, (torch.optim.Optimizer, dict)):
elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
optimizer = obj
elif (isinstance(obj, (torch.optim.lr_scheduler._LRScheduler, DummyScheduler))) or (
type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
):
scheduler = obj
if deepspeed_plugin.auto_opt_mapping:
is_adam = isinstance(optimizer, torch.optim.Adam)
is_adamw = isinstance(optimizer, torch.optim.AdamW)
if (is_adam or is_adamw) and deepspeed_plugin.offload_optimizer_device == "cpu":
defaults = optimizer.defaults
params = []
for group in optimizer.param_groups:
params.extend(group["params"])
optimizer = deepspeed.ops.adam.DeepSpeedCPUAdam(
params,
lr=defaults["lr"],
bias_correction=True,
betas=defaults["betas"],
eps=defaults["eps"],
weight_decay=defaults["weight_decay"],
amsgrad=defaults["amsgrad"],
adamw_mode=is_adamw,
if optimizer is not None:
if "optimizer" in deepspeed_plugin.deepspeed_config and not isinstance(optimizer, (DummyOptim)):
raise ValueError(
"You cannot specify an optimizer in the config file and in the code at the same time. "
"Please remove the optimizer from the config file or "
"create `accelerate.utils.DummyOptim` in the code."
)
elif "optimizer" not in deepspeed_plugin.deepspeed_config and isinstance(optimizer, (DummyOptim)):
raise ValueError(
"You cannot create a `DummyOptim` without specifying an optimizer in the config file."
)
if isinstance(optimizer, (torch.optim.Optimizer)):
deepspeed_plugin.deepspeed_config["zero_allow_untested_optimizer"] = True
if scheduler is not None:
if "scheduler" in deepspeed_plugin.deepspeed_config and not isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You cannot specify a scheduler in the config file and in the code at the same time. "
"Please remove the scheduler from the config file or "
"create `accelerate.utils.DummyScheduler` in the code."
)
elif "scheduler" not in deepspeed_plugin.deepspeed_config and isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You cannot create a `DummyScheduler` without specifying a scheduler in the config file."
)
if optimizer is not None and scheduler is not None:
if isinstance(optimizer, (DummyOptim)) and not isinstance(scheduler, (DummyScheduler)):
raise ValueError(
"You can only specify `accelerate.utils.DummyScheduler` in the code when using "
"`accelerate.utils.DummyOptim`."
)
# useful when only eval_dataloader is given into `accelerator.prepare()`
if model is not None:
engine = DeepSpeedEngineWrapper(
args=None,
model=model,
optimizer=optimizer,
config_params=self.deepspeed_config,
dist_init_required=False,
)
if hasattr(model, "config") and hasattr(model.config, "hidden_size"):
hidden_size = model.config.hidden_size
config_kwargs.update(
{
"zero_optimization.reduce_bucket_size": hidden_size * hidden_size,
"zero_optimization.stage3_prefetch_bucket_size": 0.9 * hidden_size * hidden_size,
"zero_optimization.stage3_param_persistence_threshold": 10 * hidden_size,
}
)
if isinstance(optimizer, (DummyOptim)):
config_kwargs.update(
{"optimizer.params.lr": optimizer.lr, "optimizer.params.weight_decay": optimizer.weight_decay}
)
if isinstance(scheduler, (DummyScheduler)):
config_kwargs.update(
{
"scheduler.params.warmup_min_lr": 0,
"scheduler.params.warmup_max_lr": scheduler.optimizer.lr,
"scheduler.params.warmup_num_steps": scheduler.warmup_num_steps,
}
)
if scheduler.total_num_steps is not None:
config_kwargs["scheduler.params.total_num_steps"] = (
math.ceil(scheduler.total_num_steps / self.num_processes)
if not self.split_batches
else scheduler.total_num_steps
)
deepspeed_plugin.deepspeed_config_process(must_match=False, **config_kwargs)
self.deepspeed_config = deepspeed_plugin.deepspeed_config
kwargs = dict(model=model, config_params=self.deepspeed_config)
if optimizer is not None:
if isinstance(optimizer, (DummyOptim)):
kwargs["model_parameters"] = optimizer.params
else:
kwargs["optimizer"] = optimizer
if scheduler is not None:
if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES:
kwargs["lr_scheduler"] = scheduler
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
if optimizer is not None:
optimizer = DeepSpeedOptimizerWrapper(optimizer)
if scheduler is not None:
if lr_scheduler is None:
scheduler = AcceleratedScheduler(
scheduler,
optimizer,
step_with_optimizer=self.step_scheduler_with_optimizer,
split_batches=self.split_batches,
)
else:
scheduler = DeepSpeedSchedulerWrapper(lr_scheduler, optimizer)
for i in range(len(result)):
if isinstance(result[i], torch.nn.Module):
result[i] = engine
elif isinstance(result[i], torch.optim.Optimizer):
result[i] = DeepSpeedOptimizerWrapper(engine.optimizer, engine)
self.deepspeed_engine = engine # pointing for deepspeed_engine.backward()
elif isinstance(result[i], (torch.optim.Optimizer, DummyOptim)):
result[i] = optimizer
elif (isinstance(result[i], (torch.optim.lr_scheduler._LRScheduler, DummyScheduler))) or (
type(result[i]).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
):
result[i] = scheduler
# pointing for deepspeed_engine_wrapped.backward()
self.deepspeed_engine_wrapped = DeepSpeedEngineWrapper(engine)
self._models.append(engine)
self._optimizers.append(engine.optimizer)
assert (
len(self._models) == 1
), "You can't use same `Accelerator()` instance with 2 models when using DeepSpeed"
if self.distributed_type == DistributedType.DEEPSPEED:
assert hasattr(
self, "deepspeed_engine"
), "You need to pass the model along the optimizer when using Deepspeed."
if optimizer is not None:
self._optimizers.append(optimizer)
if scheduler is not None:
self._schedulers.append(scheduler)
if len(self._models) > 1:
raise AssertionError(
"You can't use same `Accelerator()` instance with multiple models when using DeepSpeed"
)
return tuple(result)
def prepare_data_loader(self, data_loader):
@ -584,7 +767,7 @@ class Accelerator:
num_processes=self.num_processes,
process_index=self.process_index,
split_batches=self.split_batches,
put_on_device=self.device_placement,
put_on_device=self.device_placement if self.distributed_type != DistributedType.TPU else False,
rng_types=self.rng_types.copy(),
dispatch_batches=self.dispatch_batches,
)
@ -611,8 +794,9 @@ class Accelerator:
"""
Use `accelerator.backward(loss)` in lieu of `loss.backward()`.
"""
loss /= self.gradient_accumulation_steps
if self.distributed_type == DistributedType.DEEPSPEED:
self.deepspeed_engine.backward(loss, **kwargs)
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
elif self.scaler is not None:
self.scaler.scale(loss).backward(**kwargs)
else:
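As a usage sketch (not in the diff), the scaled `backward` above pairs with the gradient accumulation support released here; the model, optimizer and data below are toy stand-ins.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
dataloader = DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)  # in lieu of loss.backward()
        optimizer.step()
        optimizer.zero_grad()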
@ -643,11 +827,15 @@ class Accelerator:
Should be used in place of `torch.nn.utils.clip_grad_norm_`.
"""
if self.distributed_type == DistributedType.FSDP:
self.unscale_gradients()
parameters = [p for p in parameters]
for model in self._models:
if parameters == [p for p in model.parameters()]:
model.clip_grad_norm_(max_norm, norm_type)
return
elif self.distributed_type == DistributedType.DEEPSPEED:
# `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
return
self.unscale_gradients()
torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
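A short sketch (not from the diff) of the intended call site, so the FSDP and DeepSpeed branches above are exercised through the Accelerator instead of calling `torch.nn.utils.clip_grad_norm_` directly; the toy model and values are illustrative.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

loss = model(torch.randn(2, 4, device=accelerator.device)).sum()
accelerator.backward(loss)
accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)  # instead of torch.nn.utils.clip_grad_norm_
optimizer.step()
optimizer.zero_grad()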
@ -655,6 +843,8 @@ class Accelerator:
"""
Should be used in place of `torch.nn.utils.clip_grad_value_`.
"""
if self.distributed_type in [DistributedType.DEEPSPEED, DistributedType.FSDP]:
raise Exception("DeepSpeed and FSDP do not support `clip_grad_value_`. Use `clip_grad_norm_` instead.")
self.unscale_gradients()
torch.nn.utils.clip_grad_value_(parameters, clip_value)
@ -676,17 +866,23 @@ class Accelerator:
"""
return gather(tensor)
def reduce(self, tensor: torch.Tensor, reduction="sum"):
def reduce(self, tensor, reduction="sum"):
"""
Reduce the values in *tensor* across all processes based on *reduction*.
Note:
All processes get the reduced value.
Args:
tensor (`torch.Tensor`):
tensor (`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`):
The tensors to reduce across all processes.
reduction (`str`, *optional*, defaults to "sum"):
A reduction type, can be one of 'sum', 'mean', or 'none'. If 'none', will not perform any operation.
Returns:
`torch.Tensor`, or a nested tuple/list/dictionary of `torch.Tensor`: The reduced tensor(s).
"""
reduce(tensor, reduction)
return reduce(tensor, reduction)
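A sketch (not in the diff) of the fixed behaviour: every process now receives the reduced value instead of `None`.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
local_metric = torch.tensor(float(accelerator.process_index), device=accelerator.device)
mean_metric = accelerator.reduce(local_metric, reduction="mean")  # identical on every process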
def pad_across_processes(self, tensor, dim=0, pad_index=0, pad_first=False):
"""
@ -794,7 +990,7 @@ class Accelerator:
output_dir = os.path.expanduser(output_dir)
os.makedirs(output_dir, exist_ok=True)
logger.info(f"Saving current state to {output_dir}")
weights = [self.get_state_dict(m) for m in self._models]
weights = [self.get_state_dict(m, unwrap=False) for m in self._models]
save_location = save_accelerator_state(
output_dir, weights, self._optimizers, self._schedulers, self.state.process_index, self.scaler
)
@ -837,7 +1033,7 @@ class Accelerator:
self._schedulers = []
self._optimizers = []
self._models = []
self.deepspeed_engine = None
self.deepspeed_engine_wrapped = None
gc.collect()
torch.cuda.empty_cache()
@ -873,16 +1069,24 @@ class Accelerator:
break
return (model_device, optimizer_device)
def get_state_dict(self, model):
def get_state_dict(self, model, unwrap=True):
is_zero_3 = False
if is_deepspeed_available():
if isinstance(model, DeepSpeedEngineWrapper) and self.distributed_type == DistributedType.DEEPSPEED:
is_zero_3 = self.state.deepspeed_plugin.zero_stage == 3
if self.distributed_type == DistributedType.DEEPSPEED:
is_zero_3 = self.deepspeed_config["zero_optimization"]["stage"] == 3
if is_zero_3:
state_dict = model._zero3_consolidated_16bit_state_dict()
if model.zero_gather_16bit_weights_on_model_save():
state_dict = model._zero3_consolidated_16bit_state_dict()
else:
raise ValueError(
"Cannot get 16bit model weights because `stage3_gather_16bit_weights_on_model_save` in DeepSpeed config is False. "
"To save the model weights in 16bit, set `stage3_gather_16bit_weights_on_model_save` to True in DeepSpeed config file or "
"set `zero3_save_16bit_model` to True when using `accelerate config`. "
"To save the full checkpoint, run `model.save_checkpoint(save_dir)` and use `zero_to_fp32.py` to recover weights."
)
else:
model = self.unwrap_model(model)
if unwrap:
model = self.unwrap_model(model)
state_dict = model.state_dict()
if state_dict is not None:
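A hedged sketch (not part of the diff) of how the ZeRO Stage-3 branch above is typically used when saving; it assumes `accelerator` and `model` came out of a DeepSpeed ZeRO-3 `prepare()` call with `stage3_gather_16bit_weights_on_model_save` enabled.

def save_zero3_weights(accelerator, model, path="pytorch_model.bin"):
    # Consolidates the sharded ZeRO-3 parameters into a single 16-bit state dict.
    state_dict = accelerator.get_state_dict(model)
    if accelerator.is_main_process:
        accelerator.save(state_dict, path)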
@ -925,8 +1129,10 @@ class Accelerator:
if self.native_amp:
if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
autocast_context = torch.cuda.amp.autocast(dtype=torch.float16)
elif self.mixed_precision == "bf16":
autocast_context = torch.cuda.amp.autocast(dtype=torch.bfloat16)
elif self.mixed_precision == "bf16" and is_bf16_available():
if self.distributed_type in [DistributedType.NO, DistributedType.MULTI_CPU, DistributedType.MULTI_GPU]:
device_type = "cpu" if not torch.cuda.is_available() else "cuda"
autocast_context = torch.autocast(dtype=torch.bfloat16, device_type=device_type)
else:
autocast_context = torch.cuda.amp.autocast()
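Sketch (not from the diff): the context selected above is what `accelerator.autocast()` yields, so bf16 autocast can now run on CPU as well as GPU; this assumes the host actually supports bf16.

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")
model = accelerator.prepare(torch.nn.Linear(4, 4))
with accelerator.autocast():
    outputs = model(torch.randn(2, 4, device=accelerator.device))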

View File

@ -65,7 +65,9 @@ def init_empty_weights(include_buffers: bool = False):
def register_empty_parameter(module, name, param):
old_register_parameter(module, name, param)
if param is not None:
module._parameters[name] = nn.Parameter(module._parameters[name].to(torch.device("meta")))
param_cls = type(module._parameters[name])
kwargs = module._parameters[name].__dict__
module._parameters[name] = param_cls(module._parameters[name].to(torch.device("meta")), **kwargs)
def register_empty_buffer(module, name, buffer):
old_register_buffer(module, name, buffer)
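Sketch (not part of the diff): `init_empty_weights` is the public entry point for the patched registration above; with the change, parameters keep their concrete `Parameter` subclass (for example 8-bit parameters) while living on the `meta` device.

import torch.nn as nn
from accelerate import init_empty_weights

with init_empty_weights():
    # Instantiated without allocating any real memory for the weights.
    big_model = nn.Sequential(nn.Linear(10_000, 10_000), nn.Linear(10_000, 10_000))

print(next(big_model.parameters()).device)  # meta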
@ -88,6 +90,7 @@ def cpu_offload(
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
state_dict: Optional[Dict[str, torch.Tensor]] = None,
preload_module_classes: Optional[List[str]] = None,
):
"""
Activates full CPU offload for a model. As a result, all parameters of the model will be offloaded and only one
@ -104,13 +107,23 @@ def cpu_offload(
Whether or not to offload the buffers with the model parameters.
state_dict (`Dict[str, torch.Tensor]`, *optional*):
The state dict of the model that will be kept on CPU.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if execution_device is None:
execution_device = next(iter(model.parameters())).device
if state_dict is None:
state_dict = {n: p.to("cpu") for n, p in model.state_dict().items()}
attach_align_device_hook(
model, execution_device=execution_device, offload=True, offload_buffers=offload_buffers, weights_map=state_dict
model,
execution_device=execution_device,
offload=True,
offload_buffers=offload_buffers,
weights_map=state_dict,
preload_module_classes=preload_module_classes,
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
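A sketch (not in the diff) of calling `cpu_offload` with the new `preload_module_classes` argument; the class name passed here is purely illustrative, since a plain `Linear` does not actually need preloading.

import torch
from accelerate import cpu_offload

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
execution_device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = cpu_offload(
    model,
    execution_device=execution_device,
    offload_buffers=False,
    preload_module_classes=["Linear"],  # illustrative only
)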
@ -121,6 +134,7 @@ def disk_offload(
offload_dir: Union[str, os.PathLike],
execution_device: Optional[torch.device] = None,
offload_buffers: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Activates full disk offload for a model. As a result, all parameters of the model will be offloaded as
@ -136,6 +150,11 @@ def disk_offload(
model's first parameter device.
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if not os.path.isdir(offload_dir) or not os.path.isfile(os.path.join(offload_dir, "index.json")):
offload_state_dict(offload_dir, model.state_dict())
@ -148,6 +167,7 @@ def disk_offload(
offload=True,
offload_buffers=offload_buffers,
weights_map=weights_map,
preload_module_classes=preload_module_classes,
)
add_hook_to_module(model, AlignDevicesHook(io_same_device=True))
return model
@ -160,6 +180,7 @@ def dispatch_model(
state_dict: Optional[Dict[str, torch.Tensor]] = None,
offload_dir: Union[str, os.PathLike] = None,
offload_buffers: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
@ -180,6 +201,11 @@ def dispatch_model(
The folder in which to offload the model weights (or where the model weights are already offloaded).
offload_buffers (`bool`, *optional*, defaults to `False`):
Whether or not to offload the buffers with the model parameters.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# Error early if the device map is incomplete.
check_device_map(model, device_map)
@ -219,6 +245,7 @@ def dispatch_model(
offload=offload,
offload_buffers=offload_buffers,
weights_map=weights_map,
preload_module_classes=preload_module_classes,
)
model.hf_device_map = device_map
return model
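Sketch (not in the diff) of `dispatch_model` with an explicit `device_map`; it assumes at least one CUDA device, and the submodule names and device assignments are hypothetical.

import torch
from accelerate import dispatch_model

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
# Keys are the submodule names of the Sequential; the second layer's weights stay on CPU
# and are moved to GPU 0 on the fly by the attached hooks at forward time.
device_map = {"0": 0, "1": "cpu"}
model = dispatch_model(model, device_map=device_map)
print(model.hf_device_map)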
@ -234,6 +261,7 @@ def load_checkpoint_and_dispatch(
offload_buffers: bool = False,
dtype: Optional[Union[str, torch.dtype]] = None,
offload_state_dict: bool = False,
preload_module_classes: Optional[List[str]] = None,
):
"""
Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are
@ -267,6 +295,11 @@ def load_checkpoint_and_dispatch(
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM if
the weight of the CPU state dict + the biggest shard does not fit.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if device_map == "auto":
device_map = infer_auto_device_map(
@ -282,4 +315,10 @@ def load_checkpoint_and_dispatch(
)
if device_map is None:
return model
return dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_buffers=offload_buffers)
return dispatch_model(
model,
device_map=device_map,
offload_dir=offload_folder,
offload_buffers=offload_buffers,
preload_module_classes=preload_module_classes,
)
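And a sketch (not in the diff) of the reformatted `load_checkpoint_and_dispatch` call; a tiny single-file checkpoint is written first so the example is self-contained, and the "auto" map assumes a GPU is available (real use would usually point at a sharded checkpoint folder with an `index.json`).

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Write a tiny single-file checkpoint so the sketch can run end to end.
torch.save(torch.nn.Linear(8, 8).state_dict(), "tiny_checkpoint.bin")

with init_empty_weights():
    model = torch.nn.Linear(8, 8)

model = load_checkpoint_and_dispatch(
    model,
    "tiny_checkpoint.bin",
    device_map="auto",
    preload_module_classes=None,
)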

View File

@ -33,7 +33,7 @@ from .utils import (
)
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
from .logging import get_logger
@ -116,7 +116,7 @@ def load_accelerator_state(input_dir, models, optimizers, schedulers, process_in
Args:
input_dir (`str` or `os.PathLike`):
The name of the folder to load all relevant weights and states.
model_stmodelsates (`List[torch.nn.Module]`):
models (`List[torch.nn.Module]`):
A list of model instances
optimizers (`List[torch.optim.Optimizer]`):
A list of optimizer instances

View File

@ -14,7 +14,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available
from ...utils import ComputeEnvironment, DistributedType, is_deepspeed_available, is_transformers_available
from ...utils.constants import (
DEEPSPEED_MULTINODE_LAUNCHERS,
FSDP_AUTO_WRAP_POLICY,
FSDP_BACKWARD_PREFETCH,
FSDP_SHARDING_STRATEGY,
)
from .config_args import ClusterConfig
from .config_utils import _ask_field, _convert_distributed_mode, _convert_yes_no_to_bool
@ -77,24 +83,117 @@ def get_cluster_input():
), "DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source"
if distributed_type == DistributedType.DEEPSPEED:
deepspeed_config["zero_stage"] = _ask_field(
"What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
lambda x: int(x),
default=2,
use_deepspeed_config = _ask_field(
"Do you want to specify a json file to a DeepSpeed config? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if deepspeed_config["zero_stage"] >= 2:
deepspeed_config["offload_optimizer_device"] = _ask_field(
"Where to offload optimizer states? [NONE/cpu/nvme]: ",
if use_deepspeed_config:
deepspeed_config["deepspeed_config_file"] = _ask_field(
"Please enter the path to the json DeepSpeed config file: ",
lambda x: str(x),
default="none",
)
else:
deepspeed_config["zero_stage"] = _ask_field(
"What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: ",
lambda x: int(x),
default=2,
)
deepspeed_config["gradient_accumulation_steps"] = _ask_field(
"How many gradient accumulation steps you're passing in your script? [1]: ",
lambda x: int(x),
default=1,
if deepspeed_config["zero_stage"] >= 2:
deepspeed_config["offload_optimizer_device"] = _ask_field(
"Where to offload optimizer states? [none/cpu/nvme]: ",
lambda x: str(x),
default="none",
)
deepspeed_config["offload_param_device"] = _ask_field(
"Where to offload parameters? [none/cpu/nvme]: ",
lambda x: str(x),
default="none",
)
deepspeed_config["gradient_accumulation_steps"] = _ask_field(
"How many gradient accumulation steps you're passing in your script? [1]: ",
lambda x: int(x),
default=1,
)
use_gradient_clipping = _ask_field(
"Do you want to use gradient clipping? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if use_gradient_clipping:
deepspeed_config["gradient_clipping"] = _ask_field(
"What is the gradient clipping value? [1.0]: ",
lambda x: float(x),
default=1.0,
)
if deepspeed_config["zero_stage"] == 3:
deepspeed_config["zero3_save_16bit_model"] = _ask_field(
"Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
deepspeed_config["zero3_init_flag"] = _ask_field(
"Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if deepspeed_config["zero3_init_flag"]:
if not is_transformers_available():
raise Exception(
"When `zero3_init_flag` is set, it requires Transformers to be installed. "
"Please run `pip3 install transformers`."
)
if num_machines > 1:
launcher_query = "Which Type of launcher do you want to use "
for i, launcher in enumerate(DEEPSPEED_MULTINODE_LAUNCHERS):
launcher_query += f"[{i}] {launcher}, "
launcher_query = launcher_query[:-2] + ")? [0]: "
deepspeed_config["deepspeed_multinode_launcher"] = _ask_field(
launcher_query,
lambda x: DEEPSPEED_MULTINODE_LAUNCHERS[int(x)],
default=DEEPSPEED_MULTINODE_LAUNCHERS[0],
)
if deepspeed_config["deepspeed_multinode_launcher"] != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
deepspeed_config["deepspeed_hostfile"] = _ask_field(
"DeepSpeed configures multi-node compute resources with hostfile. "
"Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; "
"for more information please refer official [documentation]"
"(https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). "
"Please specify the location of hostfile: ",
lambda x: str(x),
)
is_exclusion_filter = _ask_field(
"Do you want to specify exclusion filter string? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if is_exclusion_filter:
deepspeed_config["deepspeed_exclusion_filter"] = _ask_field(
"DeepSpeed exclusion filter string: ",
lambda x: str(x),
)
is_inclusion_filter = _ask_field(
"Do you want to specify inclusion filter string? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
if is_inclusion_filter:
deepspeed_config["deepspeed_inclusion_filter"] = _ask_field(
"DeepSpeed inclusion filter string: ",
lambda x: str(x),
)
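The questionnaire above maps onto `DeepSpeedPlugin`; as a hedged sketch (not part of the diff), the same choices can be made in code, with every value below purely illustrative and a DeepSpeed install plus a distributed launch assumed.

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Illustrative values mirroring the questionnaire keys above.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
    zero3_init_flag=False,
    zero3_save_16bit_model=False,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)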
fsdp_config = {}
if distributed_type in [DistributedType.MULTI_GPU]:
@ -107,8 +206,12 @@ def get_cluster_input():
if use_fsdp:
distributed_type = DistributedType.FSDP
if distributed_type == DistributedType.FSDP:
sharding_strategy_query = "What should be your sharding strategy ("
for i, strategy in enumerate(FSDP_SHARDING_STRATEGY):
sharding_strategy_query += f"[{i+1}] {strategy}, "
sharding_strategy_query = sharding_strategy_query[:-2] + ")? [1]: "
fsdp_config["sharding_strategy"] = _ask_field(
"What should be your sharding strategy ([1] FULL_SHARD, [2] SHARD_GRAD_OP)? [1]: ",
sharding_strategy_query,
lambda x: int(x),
default=1,
)
@ -118,10 +221,34 @@ def get_cluster_input():
default=False,
error_message="Please enter yes or no.",
)
fsdp_config["min_num_params"] = _ask_field(
"What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
lambda x: int(x),
default=1e8,
fsdp_wrap_query = "What should be your auto wrap policy ("
for i, wrap_policy in enumerate(FSDP_AUTO_WRAP_POLICY):
fsdp_wrap_query += f"[{i}] {wrap_policy}, "
fsdp_wrap_query = fsdp_wrap_query[:-2] + ")? [0]: "
fsdp_config["fsdp_auto_wrap_policy"] = _ask_field(
fsdp_wrap_query,
lambda x: FSDP_AUTO_WRAP_POLICY[int(x)],
default=FSDP_AUTO_WRAP_POLICY[0],
)
if fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[0]:
fsdp_config["transformer_layer_cls_to_wrap"] = _ask_field(
"What is the transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` ...? : ",
lambda x: str(x),
)
elif fsdp_config["fsdp_auto_wrap_policy"] == FSDP_AUTO_WRAP_POLICY[1]:
fsdp_config["min_num_params"] = _ask_field(
"What should be your FSDP's minimum number of parameters for Default Auto Wrapping Policy? [1e8]: ",
lambda x: int(x),
default=1e8,
)
fsdp_backward_prefetch_query = "What should be your FSDP's backward prefetch policy ("
for i, backward_prefetch_policy in enumerate(FSDP_BACKWARD_PREFETCH):
fsdp_backward_prefetch_query += f"[{i}] {backward_prefetch_policy}, "
fsdp_backward_prefetch_query = fsdp_backward_prefetch_query[:-2] + ")? [0]: "
fsdp_config["fsdp_backward_prefetch_policy"] = _ask_field(
fsdp_backward_prefetch_query,
lambda x: FSDP_BACKWARD_PREFETCH[int(x)],
default=FSDP_BACKWARD_PREFETCH[0],
)
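Like the DeepSpeed answers, the FSDP answers are forwarded as environment variables (see the launcher changes further down); a sketch (not in the diff) of what a launched process might see, with the policy names illustrative rather than authoritative.

import os

# Hypothetical values mirroring the questionnaire above; the launcher normally sets these.
os.environ["USE_FSDP"] = "true"
os.environ["FSDP_SHARDING_STRATEGY"] = "1"                      # 1 = FULL_SHARD
os.environ["FSDP_AUTO_WRAP_POLICY"] = "TRANSFORMER_BASED_WRAP"  # illustrative policy name
os.environ["FSDP_TRANSFORMER_CLS_TO_WRAP"] = "BertLayer"
os.environ["FSDP_BACKWARD_PREFETCH"] = "BACKWARD_PRE"           # illustrative policy name
os.environ["FSDP_OFFLOAD_PARAMS"] = "false"
os.environ["FSDP_MIN_NUM_PARAMS"] = "100000000"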
if distributed_type == DistributedType.TPU:
@ -155,11 +282,14 @@ def get_cluster_input():
num_processes = 1
if distributed_type != DistributedType.TPU:
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: ",
lambda x: str(x).lower(),
default="no",
)
if distributed_type == DistributedType.DEEPSPEED and use_deepspeed_config:
mixed_precision = "no"
else:
mixed_precision = _ask_field(
"Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: ",
lambda x: str(x).lower(),
default="no",
)
else:
mixed_precision = "no"

View File

@ -23,6 +23,7 @@ from typing import Optional, Union
import yaml
from ...utils import ComputeEnvironment, DistributedType, SageMakerDistributedType
from ...utils.constants import SAGEMAKER_PYTHON_VERSION, SAGEMAKER_PYTORCH_VERSION, SAGEMAKER_TRANSFORMERS_VERSION
hf_cache_home = os.path.expanduser(
@ -123,7 +124,10 @@ class BaseConfig:
if isinstance(self.compute_environment, str):
self.compute_environment = ComputeEnvironment(self.compute_environment)
if isinstance(self.distributed_type, str):
self.distributed_type = DistributedType(self.distributed_type)
if self.compute_environment == ComputeEnvironment.AMAZON_SAGEMAKER:
self.distributed_type = SageMakerDistributedType(self.distributed_type)
else:
self.distributed_type = DistributedType(self.distributed_type)
@dataclass
@ -152,9 +156,13 @@ class ClusterConfig(BaseConfig):
class SageMakerConfig(BaseConfig):
ec2_instance_type: str
iam_role_name: str
image_uri: str
profile: Optional[str] = None
region: str = "us-east-1"
num_machines: int = 1
base_job_name: str = f"accelerate-sagemaker-{num_machines}"
pytorch_version: str = "1.6"
transformers_version: str = "4.4"
pytorch_version: str = SAGEMAKER_PYTORCH_VERSION
transformers_version: str = SAGEMAKER_TRANSFORMERS_VERSION
py_version: str = SAGEMAKER_PYTHON_VERSION
sagemaker_inputs_file: str = None
sagemaker_metrics_file: str = None

View File

@ -16,10 +16,11 @@
import json
import os
from ...utils.constants import SAGEMAKER_PARALLEL_EC2_INSTANCES
from ...utils.dataclasses import ComputeEnvironment, SageMakerDistributedType
from ...utils.imports import is_boto3_available
from .config_args import SageMakerConfig
from .config_utils import _ask_field, _convert_sagemaker_distributed_mode
from .config_utils import _ask_field, _convert_sagemaker_distributed_mode, _convert_yes_no_to_bool
if is_boto3_available():
@ -119,24 +120,68 @@ def get_sagemaker_input():
print(f'Accelerate will create an iam role "{iam_role_name}" using the provided credentials')
_create_iam_role_for_sagemaker(iam_role_name)
is_custom_docker_image = _ask_field(
"Do you want to use custom Docker image? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
docker_image = None
if is_custom_docker_image:
docker_image = _ask_field("Enter your Docker image: ", lambda x: str(x).lower())
is_sagemaker_inputs_enabled = _ask_field(
"Do you want to provide SageMaker input channels with data locations? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
sagemaker_inputs_file = None
if is_sagemaker_inputs_enabled:
sagemaker_inputs_file = _ask_field(
"Enter the path to the SageMaker inputs TSV file with columns (channel_name, data_location): ",
lambda x: str(x).lower(),
)
is_sagemaker_metrics_enabled = _ask_field(
"Do you want to enable SageMaker metrics? [yes/NO]: ",
_convert_yes_no_to_bool,
default=False,
error_message="Please enter yes or no.",
)
sagemaker_metrics_file = None
if is_sagemaker_metrics_enabled:
sagemaker_metrics_file = _ask_field(
"Enter the path to the SageMaker metrics TSV file with columns (metric_name, metric_regex): ",
lambda x: str(x).lower(),
)
distributed_type = _ask_field(
"Which type of machine are you using? ([0] No distributed training, [1] data parallelism, [2] model parallelism): ",
"Which type of machine are you using? ([0] No distributed training, [1] data parallelism): ",
_convert_sagemaker_distributed_mode,
error_message="Please enter 0, 1 or 2",
error_message="Please enter 0 or 1",
)
# using the best two instances for single-gpu training or multi-gpu -> can turn into question to make it more diverse
ec2_instance_type = "ml.p3.2xlarge" if distributed_type == SageMakerDistributedType.NO else "ml.p3dn.24xlarge"
ec2_instance_query = "Which EC2 instance type you want to use for your training "
if distributed_type != SageMakerDistributedType.NO:
ec2_instance_query += "("
for i, instance_type in enumerate(SAGEMAKER_PARALLEL_EC2_INSTANCES):
ec2_instance_query += f"[{i}] {instance_type}, "
ec2_instance_query = ec2_instance_query[:-2] + ")? [0]: "
ec2_instance_type = _ask_field(ec2_instance_query, lambda x: SAGEMAKER_PARALLEL_EC2_INSTANCES[int(x)])
else:
ec2_instance_query += "? [ml.p3.2xlarge]:"
ec2_instance_type = _ask_field(ec2_instance_query, lambda x: str(x).lower(), default="ml.p3.2xlarge")
num_machines = 1
if (
distributed_type == SageMakerDistributedType.DATA_PARALLEL
or distributed_type == SageMakerDistributedType.MODEL_PARALLEL
):
raise NotImplementedError("Model or Data Parallelism is not implemented yet. We are working on it")
num_machines = _ask_field(
"How many machines do you want use? [2]: ",
"How many machines do you want use? [1]: ",
lambda x: int(x),
default=2,
default=1,
)
mixed_precision = _ask_field(
@ -146,12 +191,16 @@ def get_sagemaker_input():
)
return SageMakerConfig(
image_uri=docker_image,
compute_environment=ComputeEnvironment.AMAZON_SAGEMAKER,
distributed_type=distributed_type,
use_cpu=False,
ec2_instance_type=ec2_instance_type,
profile=aws_profile,
region=aws_region,
iam_role_name=iam_role_name,
mixed_precision=mixed_precision,
num_machines=num_machines,
sagemaker_inputs_file=sagemaker_inputs_file,
sagemaker_metrics_file=sagemaker_metrics_file,
)

View File

@ -31,9 +31,12 @@ from accelerate.utils import (
DistributedType,
PrecisionType,
PrepareForLaunch,
get_launch_prefix,
is_deepspeed_available,
is_sagemaker_available,
)
from accelerate.utils.versions import is_torch_version
from accelerate.utils.constants import DEEPSPEED_MULTINODE_LAUNCHERS
from accelerate.utils.dataclasses import SageMakerDistributedType
def launch_command_parser(subparsers=None):
@ -57,6 +60,80 @@ def launch_command_parser(subparsers=None):
action="store_true",
help="Whether to use deepspeed.",
)
parser.add_argument(
"--deepspeed_config_file",
default=None,
type=str,
help="DeepSpeed config file.",
)
parser.add_argument(
"--zero_stage",
default=None,
type=int,
help="DeepSpeed's ZeRO optimization stage (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_optimizer_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload optimizer states (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_param_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload parameters (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_accumulation_steps",
default=None,
type=int,
help="No of gradient_accumulation_steps used in your training script (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_clipping",
default=None,
type=float,
help="gradient clipping value used in your training script (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--zero3_init_flag",
default=None,
type=str,
help="Decides Whether (true|false) to enable `deepspeed.zero.Init` for constructing massive models. "
"Only applicable with DeepSpeed ZeRO Stage-3.",
)
parser.add_argument(
"--zero3_save_16bit_model",
default=None,
type=str,
help="Decides Whether (true|false) to save 16-bit model weights when using ZeRO Stage-3. "
"Only applicable with DeepSpeed ZeRO Stage-3.",
)
parser.add_argument(
"--deepspeed_hostfile",
default=None,
type=str,
help="DeepSpeed hostfile for configuring multi-node compute resources.",
)
parser.add_argument(
"--deepspeed_exclusion_filter",
default=None,
type=str,
help="DeepSpeed exclusion filter string when using mutli-node setup.",
)
parser.add_argument(
"--deepspeed_inclusion_filter",
default=None,
type=str,
help="DeepSpeed inclusion filter string when using mutli-node setup.",
)
parser.add_argument(
"--deepspeed_multinode_launcher",
default=None,
type=str,
help="DeepSpeed multi-node launcher to use.",
)
parser.add_argument(
"--use_fsdp",
default=False,
@ -81,6 +158,25 @@ def launch_command_parser(subparsers=None):
default=1,
help="FSDP's Sharding Strategy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--fsdp_auto_wrap_policy",
type=str,
default=None,
help="FSDP's auto wrap policy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--transformer_layer_cls_to_wrap",
default=None,
type=str,
help="Transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` .... "
"(useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--fsdp_backward_prefetch_policy",
default=None,
type=str,
help="FSDP's backward prefetch policy. (useful only when `use_fsdp` flag is passed).",
)
parser.add_argument(
"--tpu", default=False, action="store_true", help="Whether or not this should launch a TPU training."
)
@ -158,24 +254,6 @@ def launch_command_parser(subparsers=None):
"script."
),
)
parser.add_argument(
"--zero_stage",
default=None,
type=int,
help="DeepSpeed's ZeRO optimization stage (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--offload_optimizer_device",
default=None,
type=str,
help="Decides where (none|cpu|nvme) to offload optimizer states (useful only when `use_deepspeed` flag is passed).",
)
parser.add_argument(
"--gradient_accumulation_steps",
default=None,
type=int,
help="No of gradient_accumulation_steps used in your training script (useful only when `use_deepspeed` flag is passed).",
)
# Other arguments of the training scripts
parser.add_argument("training_script_args", nargs=argparse.REMAINDER, help="Arguments of the training script.")
@ -218,12 +296,7 @@ def simple_launcher(args):
def multi_gpu_launcher(args):
if is_torch_version(">=", "1.10.0"):
cmd = ["torchrun"]
elif is_torch_version(">=", "1.9.0"):
cmd = [sys.executable, "-m", "torch.distributed.run"]
else:
cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
cmd = get_launch_prefix()
if args.num_machines > 1:
cmd.extend(
[
@ -268,9 +341,12 @@ def multi_gpu_launcher(args):
current_env["MIXED_PRECISION"] = str(mixed_precision)
if args.use_fsdp:
current_env["USE_FSDP"] = "true"
current_env["FSDP_AUTO_WRAP_POLICY"] = str(args.fsdp_auto_wrap_policy)
current_env["FSDP_TRANSFORMER_CLS_TO_WRAP"] = str(args.transformer_layer_cls_to_wrap)
current_env["FSDP_OFFLOAD_PARAMS"] = str(args.offload_params).lower()
current_env["FSDP_MIN_NUM_PARAMS"] = str(args.min_num_params)
current_env["FSDP_SHARDING_STRATEGY"] = str(args.sharding_strategy)
current_env["FSDP_BACKWARD_PREFETCH"] = str(args.fsdp_backward_prefetch_policy)
current_env["OMP_NUM_THREADS"] = str(args.num_cpu_threads_per_process)
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -279,22 +355,46 @@ def multi_gpu_launcher(args):
def deepspeed_launcher(args):
if not is_deepspeed_available():
raise ImportError("DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source.")
cmd = ["deepspeed", "--no_local_rank"]
if args.num_machines > 1:
cmd.extend(
[
"--num_gpus",
str(args.num_processes // args.num_machines),
"--num_nodes",
str(args.num_machines),
"--node_rank",
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--master_port",
str(args.main_process_port),
]
)
if args.deepspeed_multinode_launcher == DEEPSPEED_MULTINODE_LAUNCHERS[1]:
cmd = get_launch_prefix()
cmd.extend(
[
"--nproc_per_node",
str(args.num_processes // args.num_machines),
"--nnodes",
str(args.num_machines),
"--node_rank",
str(args.machine_rank),
"--master_addr",
args.main_process_ip,
"--master_port",
str(args.main_process_port),
]
)
else:
cmd.extend(
["--hostfile", str(args.deepspeed_hostfile), "--launcher", str(args.deepspeed_multinode_launcher)]
)
if args.deepspeed_exclusion_filter is not None:
cmd.extend(
[
"--exclude",
str(args.deepspeed_exclusion_filter),
]
)
elif args.deepspeed_inclusion_filter is not None:
cmd.extend(
[
"--include",
str(args.deepspeed_inclusion_filter),
]
)
else:
cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])
else:
cmd.extend(["--num_gpus", str(args.num_processes)])
@ -319,11 +419,24 @@ def deepspeed_launcher(args):
warnings.warn('--fp16 flag is deprecated. Use "--mixed_precision fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"
current_env["PYTHONPATH"] = sys.executable
current_env["MIXED_PRECISION"] = str(mixed_precision)
current_env["USE_DEEPSPEED"] = "true"
current_env["DEEPSPEED_ZERO_STAGE"] = str(args.zero_stage)
current_env["GRADIENT_ACCUMULATION_STEPS"] = str(args.gradient_accumulation_steps)
current_env["GRADIENT_CLIPPING"] = str(args.gradient_clipping).lower()
current_env["DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE"] = str(args.offload_optimizer_device).lower()
current_env["DEEPSPEED_OFFLOAD_PARAM_DEVICE"] = str(args.offload_param_device).lower()
current_env["DEEPSPEED_ZERO3_INIT"] = str(args.zero3_init_flag).lower()
current_env["DEEPSPEED_ZERO3_SAVE_16BIT_MODEL"] = str(args.zero3_save_16bit_model).lower()
current_env["DEEPSPEED_CONFIG_FILE"] = str(args.deepspeed_config_file).lower()
if args.num_machines > 1 and args.deepspeed_multinode_launcher != DEEPSPEED_MULTINODE_LAUNCHERS[1]:
with open(".deepspeed_env", "a") as f:
for key, value in current_env.items():
if ";" in value or " " in value:
continue
f.write(f"{key}={value}\n")
process = subprocess.Popen(cmd, env=current_env)
process.wait()
@ -451,19 +564,56 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
mixed_precision = "fp16"
# Environment variables to be set for use during training job
environment = {"MIXED_PRECISION": str(mixed_precision)}
environment = {
"USE_SAGEMAKER": "true",
"MIXED_PRECISION": str(mixed_precision),
"SAGEMAKER_DISTRIBUTED_TYPE": sagemaker_config.distributed_type.value,
}
# configure distribution set up
distribution = None # TODO: not yet implemented
distribution = None
if sagemaker_config.distributed_type == SageMakerDistributedType.DATA_PARALLEL:
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
# configure sagemaker inputs
sagemaker_inputs = None
if sagemaker_config.sagemaker_inputs_file is not None:
print(f"Loading SageMaker Inputs from {sagemaker_config.sagemaker_inputs_file} file")
sagemaker_inputs = {}
with open(sagemaker_config.sagemaker_inputs_file) as file:
for i, line in enumerate(file):
if i == 0:
continue
l = line.split("\t")
sagemaker_inputs[l[0]] = l[1].strip()
print(f"Loaded SageMaker Inputs: {sagemaker_inputs}")
# configure sagemaker metrics
sagemaker_metrics = None
if sagemaker_config.sagemaker_metrics_file is not None:
print(f"Loading SageMaker Metrics from {sagemaker_config.sagemaker_metrics_file} file")
sagemaker_metrics = []
with open(sagemaker_config.sagemaker_metrics_file) as file:
for i, line in enumerate(file):
if i == 0:
continue
l = line.split("\t")
metric_dict = {
"Name": l[0],
"Regex": l[1].strip(),
}
sagemaker_metrics.append(metric_dict)
print(f"Loaded SageMaker Metrics: {sagemaker_metrics}")
# configure session
print("Creating Estimator")
huggingface_estimator = HuggingFace(
image_uri=sagemaker_config.image_uri,
entry_point=entry_point,
source_dir=source_dir,
role=sagemaker_config.iam_role_name,
transformers_version="4.4",
pytorch_version="1.6",
py_version="py36",
transformers_version=sagemaker_config.transformers_version,
pytorch_version=sagemaker_config.pytorch_version,
py_version=sagemaker_config.py_version,
base_job_name=sagemaker_config.base_job_name,
instance_count=sagemaker_config.num_machines,
instance_type=sagemaker_config.ec2_instance_type,
@ -471,9 +621,10 @@ def sagemaker_launcher(sagemaker_config: SageMakerConfig, args):
distribution=distribution,
hyperparameters=hyperparameters,
environment=environment,
metric_definitions=sagemaker_metrics,
)
huggingface_estimator.fit()
huggingface_estimator.fit(inputs=sagemaker_inputs)
print(f"You can find your model data at: {huggingface_estimator.model_data}")

View File

@ -43,7 +43,7 @@ def test_command_parser(subparsers=None):
def test_command(args):
script_name = os.path.sep.join(__file__.split(os.path.sep)[:-2] + ["test_utils", "test_script.py"])
script_name = os.path.sep.join(__file__.split(os.path.sep)[:-2] + ["test_utils", "scripts", "test_script.py"])
test_args = f"""
--config_file={args.config_file} {script_name}

View File

@ -18,7 +18,8 @@ from typing import List, Optional, Union
import torch
from torch.utils.data import BatchSampler, DataLoader, IterableDataset
from .state import AcceleratorState, DistributedType, is_tpu_available
from .logging import get_logger
from .state import AcceleratorState, DistributedType, GradientState, is_tpu_available
from .utils import (
RNGType,
broadcast,
@ -34,9 +35,26 @@ from .utils import (
)
if is_tpu_available():
import torch_xla.core.xla_model as xm
if is_tpu_available(check_device=False):
import torch_xla.distributed.parallel_loader as xpl
class MpDeviceLoaderWrapper(xpl.MpDeviceLoader):
"""
Wrapper for the xpl.MpDeviceLoader class that knows the total batch size.
**Available attributes:**
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
@property
def total_batch_size(self):
return self._loader.total_batch_size
logger = get_logger(__name__)
# kwargs of the DataLoader in min version 1.4.0.
_PYTORCH_DATALOADER_KWARGS = {
@ -287,6 +305,12 @@ class DataLoaderShard(DataLoader):
A random number generator to keep synchronized across processes.
kwargs:
All other keyword arguments to pass to the regular `DataLoader` initialization.
**Available attributes:**
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
def __init__(self, dataset, device=None, rng_types=None, generator=None, **kwargs):
@ -294,34 +318,58 @@ class DataLoaderShard(DataLoader):
self.device = device
self.rng_types = rng_types
self.generator = generator
self.gradient_state = GradientState()
def __iter__(self):
if self.rng_types is not None:
synchronize_rng_states(self.rng_types, self.generator)
state = AcceleratorState()
for batch in super().__iter__():
if state.distributed_type == DistributedType.TPU:
xm.mark_step()
yield batch if self.device is None else send_to_device(batch, self.device)
self.gradient_state._set_end_of_dataloader(False)
dataloader_iter = super().__iter__()
# We iterate one batch ahead to check when we are at the end
try:
current_batch = next(dataloader_iter)
except StopIteration:
yield
while True:
try:
# But we still move it to the device so it is done before `StopIteration` is reached
if self.device is not None:
current_batch = send_to_device(current_batch, self.device)
next_batch = next(dataloader_iter)
yield current_batch
current_batch = next_batch
except StopIteration:
self.gradient_state._set_end_of_dataloader(True)
yield current_batch
break
@property
def total_batch_size(self):
return (
self.batch_sampler.batch_size
if self.batch_sampler.split_batches
else (self.batch_sampler.batch_size * self.batch_sampler.num_processes)
)
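Sketch (not part of the diff) showing the new `total_batch_size` attribute; it assumes the script is started through `accelerate launch` with more than one process, since the sharded batch sampler is only installed in that case.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataloader = DataLoader(TensorDataset(torch.arange(64, dtype=torch.float32)), batch_size=8)
dataloader = accelerator.prepare(dataloader)
# With N processes and split_batches=False this prints 8 * N; with split_batches=True it stays 8.
print(dataloader.total_batch_size)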
class DataLoaderDispatcher(DataLoader):
"""
Args:
Subclass of a PyTorch `DataLoader` that will iterate and preprocess on process 0 only, then dispatch on each
process their part of the batch.
Args:
split_batches (`bool`, *optional*, defaults to `False`):
Whether the resulting `DataLoader` should split the batches of the original data loader across devices or
yield full batches (in which case it will yield batches starting at the `process_index`-th and advancing of
`num_processes` batches at each iteration).
`num_processes` batches at each iteration). Another way to see this is that the observed batch size will be
the same as the initial `dataloader` if this option is set to `True`, the batch size of the initial
`dataloader` multiplied by `num_processes` otherwise. Setting this option to `True` requires that the batch
size of the `dataloader` is a round multiple of `batch_size`.
Another way to see this is that the observed batch size will be the same as the initial `dataloader` if
this option is set to `True`, the batch size of the initial `dataloader` multiplied by `num_processes`
otherwise.
**Available attributes:**
Setting this option to `True` requires that the batch size of the `dataloader` is a round multiple of
`batch_size`.
- **total_batch_size** (`int`) -- Total batch size of the dataloader across all processes.
Equal to the original batch size when `split_batches=True`; otherwise the original batch size * the total
number of processes
"""
def __init__(self, dataset, split_batches: bool = False, **kwargs):
@ -341,85 +389,105 @@ class DataLoaderDispatcher(DataLoader):
if shuffle:
torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle)
self.gradient_state = GradientState()
self.state = AcceleratorState()
def _fetch_batches(self, iterator):
batches, batch = None, None
# On process 0, we gather the batch to dispatch.
if self.state.process_index == 0:
try:
if self.split_batches:
# One batch of the main iterator is dispatched and split.
batch = next(iterator)
else:
# num_processes batches of the main iterator are concatenated then dispatched and split.
# We add the batches one by one so we have the remainder available when drop_last=False.
batches = []
for _ in range(self.state.num_processes):
batches.append(next(iterator))
batch = concatenate(batches, dim=0)
# In both cases, we need to get the structure of the batch that we will broadcast on other
# processes to initialize the tensors with the right shape.
# data_structure, stop_iteration
batch_info = [get_data_structure(batch), False]
except StopIteration:
batch_info = [None, True]
else:
batch_info = [None, self._stop_iteration]
# This is inplace, so after this instruction, every process has the same `batch_info` as process 0.
broadcast_object_list(batch_info)
self._stop_iteration = batch_info[1]
if self._stop_iteration:
# If drop_last is False and split_batches is False, we may have a remainder to take care of.
if not self.split_batches and not self.drop_last:
if self.state.process_index == 0 and len(batches) > 0:
batch = concatenate(batches, dim=0)
batch_info = [get_data_structure(batch), False]
else:
batch_info = [None, True]
broadcast_object_list(batch_info)
if batch_info[1]:
return batch, batch_info, True
else:
return batch, batch_info, True
return batch, batch_info, False
def __iter__(self):
state = AcceleratorState()
if state.process_index == 0:
self.gradient_state._set_end_of_dataloader(False)
main_iterator = None
if self.state.process_index == 0:
# We only iterate through the DataLoader on process 0.
main_iterator = super().__iter__()
stop_iteration = False
self._stop_iteration = False
first_batch = None
while not stop_iteration:
# On process 0, we gather the batch to dispatch.
if state.process_index == 0:
try:
if self.split_batches:
# One batch of the main iterator is dispatched and split.
batch = next(main_iterator)
else:
# num_processes batches of the main iterator are concatenated then dispatched and split.
# We add the batches one by one so we have the remainder available when drop_last=False.
batches = []
for _ in range(state.num_processes):
batches.append(next(main_iterator))
batch = concatenate(batches, dim=0)
# In both cases, we need to get the structure of the batch that we will broadcast on other
# processes to initialize the tensors with the right shape.
# data_structure, stop_iteration
batch_info = [get_data_structure(batch), False]
except StopIteration:
batch_info = [None, True]
else:
batch_info = [None, stop_iteration]
# This is inplace, so after this instruction, every process has the same `batch_info` as process 0.
broadcast_object_list(batch_info)
stop_iteration = batch_info[1]
if stop_iteration:
# If drop_last is False and split_batches is False, we may have a remainder to take care of.
if not self.split_batches and not self.drop_last:
if state.process_index == 0 and len(batches) > 0:
batch = concatenate(batches, dim=0)
batch_info = [get_data_structure(batch), False]
else:
batch_info = [None, True]
broadcast_object_list(batch_info)
if batch_info[1]:
continue
else:
continue
if state.process_index != 0:
batch, batch_info, skip = self._fetch_batches(main_iterator)
while True:
if skip:
continue
if self.state.process_index != 0:
# Initialize tensors on other processes than process 0.
batch = initialize_tensors(batch_info[0])
batch = send_to_device(batch, state.device)
batch = send_to_device(batch, self.state.device)
# Broadcast the batch before splitting it.
batch = broadcast(batch, from_process=0)
if not self.drop_last and first_batch is None:
# We keep at least num processes elements of the first batch to be able to complete the last batch
first_batch = slice_tensors(batch, slice(0, state.num_processes))
first_batch = slice_tensors(batch, slice(0, self.state.num_processes))
observed_batch_size = find_batch_size(batch)
batch_size = observed_batch_size // state.num_processes
batch_size = observed_batch_size // self.state.num_processes
if not self.drop_last and stop_iteration and observed_batch_size % state.num_processes != 0:
if not self.drop_last and self._stop_iteration and observed_batch_size % self.state.num_processes != 0:
# If the last batch is not complete, let's add the first batch to it.
batch = concatenate([batch, first_batch], dim=0)
batch_size += 1
data_slice = slice(state.process_index * batch_size, (state.process_index + 1) * batch_size)
if state.distributed_type == DistributedType.TPU:
xm.mark_step()
yield slice_tensors(batch, data_slice)
data_slice = slice(self.state.process_index * batch_size, (self.state.process_index + 1) * batch_size)
next_batch, next_batch_info, next_skip = self._fetch_batches(main_iterator)
if not self._stop_iteration:
yield slice_tensors(batch, data_slice)
batch, batch_info, skip = next_batch, next_batch_info, next_skip
else:
self.gradient_state._set_end_of_dataloader(True)
yield slice_tensors(batch, data_slice)
break
def __len__(self):
state = AcceleratorState()
whole_length = super().__len__()
if self.drop_last:
return whole_length // state.num_processes
if self.split_batches:
return whole_length
elif self.drop_last:
return whole_length // self.state.num_processes
else:
return math.ceil(whole_length / state.num_processes)
return math.ceil(whole_length / self.state.num_processes)
@property
def total_batch_size(self):
return (
self.dataset.batch_size if self.split_batches else (self.dataset.batch_size * self.dataset.num_processes)
)
def prepare_data_loader(
@ -565,15 +633,22 @@ def prepare_data_loader(
kwargs["batch_size"] = dataloader.batch_size // num_processes if split_batches else dataloader.batch_size
if dispatch_batches:
return DataLoaderDispatcher(
new_dataset, split_batches=split_batches, batch_sampler=new_batch_sampler, **kwargs
dataloader = DataLoaderDispatcher(
new_dataset,
split_batches=split_batches,
batch_sampler=new_batch_sampler,
**kwargs,
)
else:
dataloader = DataLoaderShard(
new_dataset,
device=device if put_on_device and state.distributed_type != DistributedType.TPU else None,
batch_sampler=new_batch_sampler,
rng_types=rng_types,
generator=generator,
**kwargs,
)
return DataLoaderShard(
new_dataset,
device=device if put_on_device else None,
batch_sampler=new_batch_sampler,
rng_types=rng_types,
generator=generator,
**kwargs,
)
if state.distributed_type == DistributedType.TPU:
return MpDeviceLoaderWrapper(dataloader, device)
return dataloader

View File

@ -13,7 +13,7 @@
# limitations under the License.
import functools
from typing import Dict, Mapping, Optional, Union
from typing import Dict, List, Mapping, Optional, Union
import torch
import torch.nn as nn
@ -270,7 +270,11 @@ class AlignDevicesHook(ModelHook):
set_module_tensor_to_device(module, name, device, value=self.weights_map.get(name, None))
def attach_execution_device_hook(module: torch.nn.Module, execution_device: Union[int, str, torch.device]):
def attach_execution_device_hook(
module: torch.nn.Module,
execution_device: Union[int, str, torch.device],
preload_module_classes: Optional[List[str]] = None,
):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model to make sure they have the right
execution device
@ -280,10 +284,19 @@ def attach_execution_device_hook(module: torch.nn.Module, execution_device: Unio
The module where we want to attach the hooks.
execution_device (`int`, `str` or `torch.device`):
The device on which inputs and model weights should be placed before the forward pass.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
add_hook_to_module(module, AlignDevicesHook(execution_device))
# Break the recursion if we get to a preload module.
if preload_module_classes is not None and module.__class__.__name__ in preload_module_classes:
return
for child in module.children():
attach_execution_device_hook(child, execution_device)
@ -295,6 +308,7 @@ def attach_align_device_hook(
weights_map: Optional[Mapping] = None,
offload_buffers: bool = False,
module_name: str = "",
preload_module_classes: Optional[List[str]] = None,
):
"""
Recursively attaches `AlignDevicesHook` to all submodules of a given model that have direct parameters and/or
@ -313,10 +327,19 @@ def attach_align_device_hook(
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# Attach the hook on this module if it has any direct tensor.
directs = named_module_tensors(module)
if len(list(directs)) > 0:
full_offload = (
offload and preload_module_classes is not None and module.__class__.__name__ in preload_module_classes
)
if len(list(directs)) > 0 or full_offload:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
@ -327,9 +350,14 @@ def attach_align_device_hook(
offload=offload,
weights_map=prefixed_weights_map,
offload_buffers=offload_buffers,
place_submodules=full_offload,
)
add_hook_to_module(module, hook)
# We stop the recursion in case we hit the full offload.
if full_offload:
return
# Recurse on all children of the module.
for child_name, child in module.named_children():
child_name = f"{module_name}.{child_name}" if len(module_name) > 0 else child_name
@ -340,6 +368,7 @@ def attach_align_device_hook(
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
preload_module_classes=preload_module_classes,
)
@ -362,6 +391,7 @@ def attach_align_device_hook_on_blocks(
weights_map: Mapping = None,
offload_buffers: bool = False,
module_name: str = "",
preload_module_classes: Optional[List[str]] = None,
):
"""
Attaches `AlignDevicesHook` to all blocks of a given model as needed.
@ -381,6 +411,11 @@ def attach_align_device_hook_on_blocks(
Whether or not to include the associated module's buffers when offloading.
module_name (`str`, *optional*, defaults to `""`):
The name of the module.
preload_module_classes (`List[str]`, *optional*):
A list of classes whose instances should load all their weights (even in the submodules) at the beginning
of the forward. This should only be used for classes that have submodules which are registered but not
called directly during the forward, for instance if a `dense` linear layer is registered, but at forward,
`dense.weight` and `dense.bias` are used in some operations instead of calling `dense` directly.
"""
# If one device and one offload, we've got one hook.
if not isinstance(execution_device, Mapping) and not isinstance(offload, dict):
@ -399,7 +434,7 @@ def attach_align_device_hook_on_blocks(
return
if not isinstance(execution_device, Mapping):
execution_device = {key: offload for key in offload.keys()}
execution_device = {key: execution_device for key in offload.keys()}
if not isinstance(offload, Mapping):
offload = {key: offload for key in execution_device.keys()}
@ -413,21 +448,21 @@ def attach_align_device_hook_on_blocks(
add_hook_to_module(module, hook)
attach_execution_device_hook(module, execution_device[module_name])
elif module_name in execution_device:
if weights_map is not None:
prefix = f"{module_name}." if len(module_name) > 0 else ""
prefixed_weights_map = PrefixedDataset(weights_map, prefix)
else:
prefixed_weights_map = None
hook = AlignDevicesHook(
attach_align_device_hook(
module,
execution_device=execution_device[module_name],
offload=True,
weights_map=prefixed_weights_map,
weights_map=weights_map,
offload_buffers=offload_buffers,
io_same_device=(module_name == ""),
place_submodules=True,
module_name=module_name,
preload_module_classes=preload_module_classes,
)
if not hasattr(module, "_hf_hook"):
hook = AlignDevicesHook(execution_device=execution_device[module_name], io_same_device=(module_name == ""))
add_hook_to_module(module, hook)
attach_execution_device_hook(
module, execution_device[module_name], preload_module_classes=preload_module_classes
)
attach_execution_device_hook(module, execution_device[module_name])
add_hook_to_module(module, hook)
elif module_name == "":
hook = AlignDevicesHook(io_same_device=True)
add_hook_to_module(module, hook)
@ -441,4 +476,5 @@ def attach_align_device_hook_on_blocks(
weights_map=weights_map,
offload_buffers=offload_buffers,
module_name=child_name,
preload_module_classes=preload_module_classes,
)

View File

@ -50,6 +50,13 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
else:
in_colab_or_kaggle = False
try:
mixed_precision = PrecisionType(mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if in_colab_or_kaggle:
if os.environ.get("TPU_NAME", None) is not None:
# TPU launch
@ -72,7 +79,7 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
if torch.cuda.is_available():
print("Launching training on one GPU.")
else:
print("Launching training on CPU.")
print("Launching training on one CPU.")
function(*args)
else:
@ -105,13 +112,6 @@ def notebook_launcher(function, args=(), num_processes=None, use_fp16=False, mix
"function."
)
try:
mixed_precision = PrecisionType(mixed_precision.lower())
except ValueError:
raise ValueError(
f"Unknown mixed_precision mode: {args.mixed_precision.lower()}. Choose between {PrecisionType.list()}."
)
if use_fp16:
warnings.warn('use_fp16=True is deprecated. Use mixed_precision="fp16" instead.', DeprecationWarning)
mixed_precision = "fp16"

View File

@ -17,11 +17,11 @@ import warnings
import torch
from .state import AcceleratorState
from .state import AcceleratorState, GradientState
from .utils import DistributedType, honor_type, is_torch_version, is_tpu_available
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -39,6 +39,9 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
"""
Internal wrapper around a torch optimizer.
Conditionally will perform `step` and `zero_grad` if gradients should be synchronized when performing gradient
accumulation.
Args:
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
@ -53,6 +56,7 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
self.optimizer = optimizer
self.scaler = scaler
self.accelerator_state = AcceleratorState()
self.gradient_state = GradientState()
self.device_placement = device_placement
self._is_overflow = False
@ -101,37 +105,39 @@ class AcceleratedOptimizer(torch.optim.Optimizer):
return self.optimizer.state_dict()
def zero_grad(self, set_to_none=None):
if is_torch_version("<", "1.7.0"):
if set_to_none is not None:
raise ValueError(
"`set_to_none` for Optimizer.zero_grad` was introduced in PyTorch 1.7.0 and can't be used for "
f"earlier versions (found version {torch.__version__})."
)
self.optimizer.zero_grad()
else:
accept_arg = "set_to_none" in inspect.signature(self.optimizer.zero_grad).parameters
if accept_arg:
if set_to_none is None:
set_to_none = False
self.optimizer.zero_grad(set_to_none=set_to_none)
else:
if self.gradient_state.sync_gradients:
if is_torch_version("<", "1.7.0"):
if set_to_none is not None:
raise ValueError("`set_to_none` for Optimizer.zero_grad` is not supported by this optimizer.")
raise ValueError(
"`set_to_none` for Optimizer.zero_grad` was introduced in PyTorch 1.7.0 and can't be used for "
f"earlier versions (found version {torch.__version__})."
)
self.optimizer.zero_grad()
else:
accept_arg = "set_to_none" in inspect.signature(self.optimizer.zero_grad).parameters
if accept_arg:
if set_to_none is None:
set_to_none = False
self.optimizer.zero_grad(set_to_none=set_to_none)
else:
if set_to_none is not None:
raise ValueError("`set_to_none` for Optimizer.zero_grad` is not supported by this optimizer.")
self.optimizer.zero_grad()
def step(self, closure=None):
if self.accelerator_state.distributed_type == DistributedType.TPU:
optimizer_args = {"closure": closure} if closure is not None else {}
xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)
elif self.scaler is not None:
scale_before = self.scaler.get_scale()
self.scaler.step(self.optimizer, closure)
self.scaler.update()
scale_after = self.scaler.get_scale()
# If we reduced the loss scale, it means the optimizer step was skipped because of gradient overflow.
self._is_overflow = scale_after < scale_before
else:
self.optimizer.step(closure)
if self.gradient_state.sync_gradients:
if self.accelerator_state.distributed_type == DistributedType.TPU:
optimizer_args = {"closure": closure} if closure is not None else {}
xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)
elif self.scaler is not None:
scale_before = self.scaler.get_scale()
self.scaler.step(self.optimizer, closure)
self.scaler.update()
scale_after = self.scaler.get_scale()
# If we reduced the loss scale, it means the optimizer step was skipped because of gradient overflow.
self._is_overflow = scale_after < scale_before
else:
self.optimizer.step(closure)
def _switch_parameters(self, parameters_map):
for param_group in self.optimizer.param_groups:
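A compressed sketch of the gating logic added above, assuming a shared-state flag similar to the GradientState introduced in this release; illustrative only, not the library implementation.

import torch

class _GradientState:
    # simplified stand-in for accelerate.state.GradientState (all instances share state)
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        self.__dict__.setdefault("sync_gradients", True)

class _GatedOptimizer:
    # step/zero_grad become no-ops while gradients are still being accumulated
    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.gradient_state = _GradientState()

    def step(self):
        if self.gradient_state.sync_gradients:
            self.optimizer.step()

    def zero_grad(self):
        if self.gradient_state.sync_gradients:
            self.optimizer.zero_grad()

opt = _GatedOptimizer(torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1))
_GradientState().sync_gradients = False
opt.step()  # skipped: gradients are being accumulated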

View File

@ -12,16 +12,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# We ignore warnings about stepping the scheduler since we step it ourselves during gradient accumulation
import warnings
from .state import AcceleratorState
warnings.filterwarnings("ignore", category=UserWarning, module="torch.optim.lr_scheduler")
class AcceleratedScheduler:
"""
A wrapper around a learning rate scheduler that will only step when the optimizer(s) have a training step. Useful
to avoid making a scheduler step too fast when:
to avoid making a scheduler step too fast when gradients went overflow and there was no training step (in mixed
precision training)
- gradients overflowed and there was no training step (in mixed precision training)
- a step was skipped because of gradient accumulation
When performing gradient accumulation, the scheduler lengths should not be changed accordingly; Accelerate will
always step the scheduler to account for it.
Args:
scheduler (`torch.optim.lr_scheduler._LRScheduler`):
@ -52,7 +60,6 @@ class AcceleratedScheduler:
for opt in self.optimizers:
if opt.step_was_skipped:
return
if self.split_batches:
# Split batches -> the training dataloader batch size is not changed so one step per training step
self.scheduler.step(*args, **kwargs)
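A hedged sketch of the stepping rule described in the docstring above: with split_batches the scheduler advances once per training step, otherwise once per process so the total number of scheduler steps matches a single-process run. This mirrors the documented behaviour and is not the library implementation itself.

def step_scheduler(scheduler, split_batches, num_processes):
    if split_batches:
        # dataloader batch size is unchanged, so one scheduler step per training step
        scheduler.step()
    else:
        # each process consumes its own shard, so step once per process
        for _ in range(num_processes):
            scheduler.step()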

View File

@ -18,9 +18,10 @@ from distutils.util import strtobool
import torch
from .utils import DistributedType, is_ccl_available, is_deepspeed_available, is_tpu_available
from .utils.dataclasses import SageMakerDistributedType
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -52,6 +53,7 @@ class AcceleratorState:
Attributes:
- **device** (`torch.device`) -- The device to use.
- **sync_gradients** (`bool`) -- Whether to sync the gradients or not
- **distributed_type** (`~accelerate.state.DistributedType`) -- The type of distributed environment currently
in use.
- **num_processes** (`int`) -- The number of processes currently launched in parallel.
@ -75,22 +77,46 @@ class AcceleratorState:
self.__dict__ = self._shared_state
if parse_flag_from_env("USE_CPU"):
cpu = True
self._check_initialized(mixed_precision, cpu)
self.fork_launched = parse_flag_from_env("FORK_LAUNCHED", 0)
if not getattr(self, "initialized", False):
self.backend = None
self.deepspeed_plugin = None
mixed_precision = mixed_precision.lower() if mixed_precision else None
mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision.lower()
)
if not _from_accelerator:
raise ValueError(
"Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` "
"before using any functionality from the `accelerate` library."
)
if (
os.environ.get("USE_SAGEMAKER", "false") == "true"
and os.environ.get("SAGEMAKER_DISTRIBUTED_TYPE") != SageMakerDistributedType.NO
and not cpu
):
if os.environ.get("SAGEMAKER_DISTRIBUTED_TYPE") == SageMakerDistributedType.DATA_PARALLEL:
self.distributed_type = DistributedType.MULTI_GPU
import smdistributed.dataparallel.torch.torch_smddp # noqa
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(backend="smddp")
self.backend = "smddp"
self.num_processes = torch.distributed.get_world_size()
self.process_index = torch.distributed.get_rank()
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = mixed_precision
elif is_tpu_available() and not cpu:
self.distributed_type = DistributedType.TPU
self.num_processes = xm.xrt_world_size()
self.process_index = xm.get_ordinal()
self.local_process_index = xm.get_local_ordinal()
self.device = xm.xla_device()
self.mixed_precision = "no"
if mixed_precision == "bf16":
os.environ["XLA_USE_BF16"] = str(1)
self.mixed_precision = mixed_precision
elif os.environ.get("USE_DEEPSPEED", "false") == "true" and not cpu:
assert (
is_deepspeed_available()
@ -105,13 +131,6 @@ class AcceleratorState:
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = "no" # deepspeed handles mixed_precision using deepspeed_config
mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
if mixed_precision == "fp16":
deepspeed_plugin.deepspeed_config.update({"fp16": {"enabled": True}})
elif mixed_precision == "bf16":
deepspeed_plugin.deepspeed_config.update({"bfloat16": {"enabled": True}})
self.deepspeed_plugin = deepspeed_plugin
elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:
self.distributed_type = DistributedType.MULTI_GPU
@ -123,15 +142,11 @@ class AcceleratorState:
self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
torch.cuda.set_device(self.device)
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
self.mixed_precision = mixed_precision
if os.environ.get("USE_FSDP", "false") == "true":
self.distributed_type = DistributedType.FSDP
if self.mixed_precision != "no":
raise ValueError(
"Mixed precision is currently not supported for FSDP. Please set `mixed_precision` to `no`."
)
fsdp_plugin.set_mixed_precision(self.mixed_precision)
self.fsdp_plugin = fsdp_plugin
elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_SIZE", "WORLD_SIZE"], 1) > 1:
self.distributed_type = DistributedType.MULTI_CPU
@ -169,15 +184,13 @@ class AcceleratorState:
self.process_index = torch.distributed.get_rank()
self.local_process_index = local_rank
self.device = torch.device("cpu")
self.mixed_precision = "no"
self.mixed_precision = mixed_precision
else:
self.distributed_type = DistributedType.NO
self.num_processes = 1
self.process_index = self.local_process_index = 0
self.device = torch.device("cuda" if torch.cuda.is_available() and not cpu else "cpu")
self.mixed_precision = (
parse_choice_from_env("MIXED_PRECISION", "no") if mixed_precision is None else mixed_precision
)
self.mixed_precision = mixed_precision
self.initialized = True
def __repr__(self):
@ -189,13 +202,61 @@ class AcceleratorState:
f"Process index: {self.process_index}\n"
f"Local process index: {self.local_process_index}\n"
f"Device: {self.device}\n"
f"Mixed precision type: {mixed_precision}\n"
)
if self.distributed_type == DistributedType.DEEPSPEED:
repr += f"ds_config: {self.deepspeed_plugin.deepspeed_config}\n"
else:
f"Mixed precision type: {mixed_precision}\n"
return repr
# For backward compatibility
@property
def use_fp16(self):
return self.mixed_precision != "no"
@staticmethod
def _reset_state():
"Resets `_shared_state`, is used internally and should not be called"
AcceleratorState._shared_state = {}
def _check_initialized(self, mixed_precision=None, cpu=None):
"Checks if a modification is trying to be made and the `AcceleratorState` has already been initialized"
if getattr(self, "initialized", False):
err = "AcceleratorState has already been initialized and cannot be changed, restart your runtime completely and pass `{flag}` to `Accelerator()`."
if cpu and self.device.type != "cpu":
raise ValueError(err.format(flag="cpu=True"))
if mixed_precision is not None and mixed_precision != self.mixed_precision:
raise ValueError(err.format(flag=f"mixed_precision='{mixed_precision}'"))
class GradientState:
"""
This is a variation of a [singleton class](https://en.wikipedia.org/wiki/Singleton_pattern) in the sense that all
instances of `GradientState` share the same state, which is initialized on the first instantiation.
This specific state revolves around whether gradients should be synced and if we have reached the end of a prepared
dataloader.
Attributes:
- **sync_gradients** (`bool`) -- Whether the gradients should be synced
- **end_of_dataloader** (`bool`) -- Whether we have reached the end of the current dataloader
"""
_shared_state = {}
def __init__(self):
self.__dict__ = self._shared_state
if not getattr(self, "initialized", False):
self.sync_gradients = True
self.end_of_dataloader = False
self.initialized = True
def __repr__(self):
return f"Sync Gradients: {self.sync_gradients}\n" f"At end of current dataloader: {self.end_of_dataloader}\n"
def _set_sync_gradients(self, sync_gradients):
"Private function that sets whether gradients should be synchronized. Users should not have to call this."
self.sync_gradients = sync_gradients
def _set_end_of_dataloader(self, end_of_dataloader):
"Private function that sets whether the end of the current dataloader has been reached. Users should not have to call this."
self.end_of_dataloader = end_of_dataloader
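A quick illustration of the shared-state behaviour documented above, using the class as added in this diff: every GradientState() instance reads and writes the same underlying dictionary.

from accelerate.state import GradientState

state_a = GradientState()
state_b = GradientState()
state_a._set_sync_gradients(False)
assert state_b.sync_gradients is False  # both instances observe the same flag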

View File

@ -2,5 +2,17 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
from .testing import are_the_same_tensors, execute_subprocess_async, require_cuda, require_multi_gpu, require_tpu, slow
from .testing import (
are_the_same_tensors,
execute_subprocess_async,
require_cpu,
require_cuda,
require_multi_gpu,
require_single_gpu,
require_tpu,
slow,
)
from .training import RegressionDataset, RegressionModel
from .scripts import test_script, test_sync # isort:skip

View File

@ -21,7 +21,14 @@ from accelerate import Accelerator
from accelerate.data_loader import prepare_data_loader
from accelerate.state import AcceleratorState
from accelerate.test_utils import RegressionDataset, RegressionModel, are_the_same_tensors
from accelerate.utils import DistributedType, gather, is_torch_version, set_seed, synchronize_rng_states
from accelerate.utils import (
DistributedType,
gather,
is_bf16_available,
is_torch_version,
set_seed,
synchronize_rng_states,
)
def init_state_check():
@ -245,71 +252,77 @@ def training_check():
accelerator.print("Training yielded the same results on one CPU or distributed setup with batch split.")
# Mostly a test that FP16 doesn't crash as the operation inside the model is not converted to FP16
print("FP16 training check.")
accelerator = Accelerator(mixed_precision="fp16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
if torch.cuda.is_available():
# Mostly a test that FP16 doesn't crash as the operation inside the model is not converted to FP16
print("FP16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="fp16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# TEST that previous fp16 flag still works
print("Legacy FP16 training check.")
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# TEST that previous fp16 flag still works
print("Legacy FP16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(fp16=True)
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
# Mostly a test that BF16 doesn't crash as the operation inside the model is not converted to BF16
print("BF16 training check.")
accelerator = Accelerator(mixed_precision="bf16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# BF16 support is only for CPU + TPU, and some GPU
if is_bf16_available():
# Mostly a test that BF16 doesn't crash as the operation inside the model is not converted to BF16
print("BF16 training check.")
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="bf16")
train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=generator)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
train_dl, model, optimizer = accelerator.prepare(train_dl, model, optimizer)
set_seed(42)
generator.manual_seed(42)
for _ in range(3):
for batch in train_dl:
model.zero_grad()
output = model(batch["x"])
loss = torch.nn.functional.mse_loss(output, batch["y"])
accelerator.backward(loss)
optimizer.step()
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
model = accelerator.unwrap_model(model).cpu()
assert torch.allclose(old_model.a, model.a), "Did not obtain the same model on CPU or distributed training."
assert torch.allclose(old_model.b, model.b), "Did not obtain the same model on CPU or distributed training."
def main():
@ -326,7 +339,8 @@ def main():
if state.local_process_index == 0:
print("\n**DataLoader integration test**")
dl_preparation_check()
central_dl_preparation_check()
if state.distributed_type != DistributedType.TPU:
central_dl_preparation_check()
# Trainings are not exactly the same in DeepSpeed and CPU mode
if state.distributed_type == DistributedType.DEEPSPEED:
@ -337,5 +351,10 @@ def main():
training_check()
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@ -0,0 +1,274 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import deepcopy
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.test_utils import RegressionDataset, RegressionModel
from accelerate.utils import DistributedType, set_seed
def check_model_parameters(model_a, model_b, did_step, iteration):
for param, grad_param in zip(model_a.parameters(), model_b.parameters()):
if not param.requires_grad:
continue
if not did_step:
# Grads should not be in sync
assert (
torch.allclose(param.grad, grad_param.grad) is False
), f"Gradients in sync when they should not be at iteration {iteration}:\nmodel_a grad ({param.grad}) == model_b grad ({grad_param.grad})"
else:
# Grads should be in sync
assert (
torch.allclose(param.grad, grad_param.grad) is True
), f"Gradients not in sync when they should be at iteration {iteration}:\nmodel_a grad ({param.grad}) != model_b grad ({grad_param.grad})"
def step_model(model, input, target, accelerator, do_backward=True):
model.train()
output = model(input)
loss = F.mse_loss(output, target.to(output.device))
if not do_backward:
loss /= accelerator.gradient_accumulation_steps
loss.backward()
else:
accelerator.backward(loss)
def get_training_setup(accelerator, sched=False):
"Returns everything needed to perform basic training"
set_seed(42)
model = RegressionModel()
ddp_model = deepcopy(model)
dset = RegressionDataset(length=80)
dataloader = DataLoader(dset, batch_size=16)
model.to(accelerator.device)
if sched:
opt = AdamW(params=model.parameters(), lr=1e-3)
ddp_opt = AdamW(params=ddp_model.parameters(), lr=1e-3)
sched = LambdaLR(opt, lr_lambda=lambda epoch: epoch**0.65)
ddp_sched = LambdaLR(ddp_opt, lr_lambda=lambda epoch: epoch**0.65)
# Make a copy of `model`
if sched:
ddp_model, ddp_opt, ddp_sched, dataloader = accelerator.prepare(ddp_model, ddp_opt, ddp_sched, dataloader)
else:
ddp_model, dataloader = accelerator.prepare(ddp_model, dataloader)
if sched:
return (model, opt, sched, dataloader, ddp_model, ddp_opt, ddp_sched)
return model, ddp_model, dataloader
def test_noop_sync(accelerator):
# Test when on a single CPU or GPU that the context manager does nothing
model, ddp_model, dataloader = get_training_setup(accelerator)
# Use a single batch
ddp_input, ddp_target = next(iter(dataloader)).values()
for iteration in range(3):
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator)
# Do "gradient accumulation" (noop)
if iteration % 2 == 0:
# Accumulate grads locally
with accelerator.no_sync(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
else:
# Sync grads
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# Since `no_sync` is a noop, `ddp_model` and `model` grads should always be in sync
check_model_parameters(model, ddp_model, True, iteration)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
assert torch.allclose(
param.grad, ddp_param.grad
), f"Gradients not in sync when they should be:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_distributed_sync(accelerator):
# Test on distributed setup that context manager behaves properly
model, ddp_model, dataloader = get_training_setup(accelerator)
# Use a single batch
ddp_input, ddp_target = next(iter(dataloader)).values()
for iteration in range(3):
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator)
# Do "gradient accumulation" (noop)
if iteration % 2 == 0:
# Accumulate grads locally
with accelerator.no_sync(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
else:
# Sync grads
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# DDP model and model should only be in sync when not (iteration % 2 == 0)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
if iteration % 2 == 0:
# Grads should not be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is False
), f"Gradients in sync when they should not be:\nModel grad ({param.grad}) == DDP grad ({ddp_param.grad})"
else:
# Grads should be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is True
), f"Gradients not in sync when they should be:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_gradient_accumulation(split_batches=False, dispatch_batches=False):
accelerator = Accelerator(
gradient_accumulation_steps=2, split_batches=split_batches, dispatch_batches=dispatch_batches
)
# Test that context manager behaves properly
model, ddp_model, dataloader = get_training_setup(accelerator)
for iteration, batch in enumerate(dataloader):
ddp_input, ddp_target = batch.values()
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
step_model(model, input, target, accelerator, False)
# Do "gradient accumulation" (noop)
with accelerator.accumulate(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
# DDP model and model should only be in sync when not (iteration % 2 == 0)
for param, ddp_param in zip(model.parameters(), ddp_model.parameters()):
if not param.requires_grad:
continue
if ((iteration + 1) % 2 == 0) or (iteration == len(dataloader) - 1):
# Grads should be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is True
), f"Gradients not in sync when they should be at iteration {iteration}:\nModel grad ({param.grad}) != DDP grad ({ddp_param.grad})"
else:
# Grads should not be in sync
assert (
torch.allclose(param.grad, ddp_param.grad) is False
), f"Gradients in sync when they should not be at iteration {iteration}:\nModel grad ({param.grad}) == DDP grad ({ddp_param.grad})"
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
ddp_input = ddp_input[torch.randperm(len(ddp_input))]
def test_gradient_accumulation_with_opt_and_scheduler(split_batches=False, dispatch_batches=False):
accelerator = Accelerator(
gradient_accumulation_steps=2, split_batches=split_batches, dispatch_batches=dispatch_batches
)
# Test that context manager behaves properly
model, opt, sched, dataloader, ddp_model, ddp_opt, ddp_sched = get_training_setup(accelerator, True)
for iteration, batch in enumerate(dataloader):
ddp_input, ddp_target = batch.values()
# Gather the distributed inputs and targs for the base model
input, target = accelerator.gather((ddp_input, ddp_target))
input, target = input.to(accelerator.device), target.to(accelerator.device)
# Perform our initial ground truth step in non "DDP"
model.train()
ddp_model.train()
step_model(model, input, target, accelerator, False)
opt.step()
if split_batches:
sched.step()
else:
for _ in range(accelerator.num_processes):
sched.step()
opt.zero_grad()
# Perform gradient accumulation under wrapper
with accelerator.accumulate(ddp_model):
step_model(ddp_model, ddp_input, ddp_target, accelerator)
ddp_opt.step()
ddp_sched.step()
ddp_opt.zero_grad()
# Learning rates should be the same
assert (
opt.param_groups[0]["lr"] == ddp_opt.param_groups[0]["lr"]
), f'Learning rates found in each optimizer did not align\nopt: {opt.param_groups[0]["lr"]}\nDDP opt: {ddp_opt.param_groups[0]["lr"]}\n'
did_step = (((iteration + 1) % 2) == 0) or ((iteration + 1) == len(dataloader))
if accelerator.num_processes > 1:
check_model_parameters(model, ddp_model, did_step, iteration)
# Shuffle ddp_input on each iteration
torch.manual_seed(1337 + iteration)
def main():
accelerator = Accelerator()
state = accelerator.state
if state.distributed_type == DistributedType.NO:
if state.local_process_index == 0:
print("**Test NOOP `no_sync` context manager**")
test_noop_sync(accelerator)
if state.distributed_type in (DistributedType.MULTI_GPU, DistributedType.MULTI_CPU):
if state.local_process_index == 0:
print("**Test Distributed `no_sync` context manager**")
test_distributed_sync(accelerator)
if state.distributed_type == DistributedType.MULTI_GPU:
for split_batch in [True, False]:
for dispatch_batches in [True, False]:
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation, ",
f"`split_batches={split_batch}` and `dispatch_batches={dispatch_batches}`**",
)
test_gradient_accumulation(split_batch, dispatch_batches)
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation with optimizer and scheduler, ",
"`split_batches=False`, `dispatch_batches=False`**",
)
test_gradient_accumulation_with_opt_and_scheduler()
if state.distributed_type == DistributedType.MULTI_GPU:
for split_batch in [True, False]:
for dispatch_batches in [True, False]:
if not split_batch and not dispatch_batches:
continue
if state.local_process_index == 0:
print(
"**Test `accumulate` gradient accumulation with optimizer and scheduler, ",
f"`split_batches={split_batch}` and `dispatch_batches={dispatch_batches}`**",
)
test_gradient_accumulation_with_opt_and_scheduler(split_batch, dispatch_batches)
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()
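For context, the user-facing pattern these checks exercise looks roughly like the following single-process sketch, built on the RegressionModel/RegressionDataset helpers imported above; it is illustrative, not part of the test script.

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.test_utils import RegressionDataset, RegressionModel

accelerator = Accelerator(gradient_accumulation_steps=2)
model = RegressionModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = DataLoader(RegressionDataset(length=32), batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # gradients are only synchronized and applied every `gradient_accumulation_steps` batches
    with accelerator.accumulate(model):
        output = model(batch["x"])
        loss = torch.nn.functional.mse_loss(output, batch["y"])
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()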

View File

@ -15,6 +15,7 @@
import asyncio
import os
import shutil
import subprocess
import sys
import tempfile
import unittest
@ -26,7 +27,14 @@ from unittest import mock
import torch
from ..state import AcceleratorState
from ..utils import gather, is_comet_ml_available, is_tensorflow_available, is_tpu_available, is_wandb_available
from ..utils import (
gather,
is_comet_ml_available,
is_deepspeed_available,
is_tensorboard_available,
is_tpu_available,
is_wandb_available,
)
def parse_flag_from_env(key, default=False):
@ -56,6 +64,13 @@ def slow(test_case):
return unittest.skipUnless(_run_slow_tests, "test is slow")(test_case)
def require_cpu(test_case):
"""
Decorator marking a test that must be run only on the CPU. These tests are skipped when a GPU is available.
"""
return unittest.skipUnless(not torch.cuda.is_available(), "test requires only a CPU")(test_case)
def require_cuda(test_case):
"""
Decorator marking a test that requires CUDA. These tests are skipped when there are no GPUs available.
@ -70,6 +85,14 @@ def require_tpu(test_case):
return unittest.skipUnless(is_tpu_available(), "test requires TPU")(test_case)
def require_single_gpu(test_case):
"""
Decorator marking a test that requires CUDA on a single GPU. These tests are skipped when there are no GPUs
available or when more than one GPU is available.
"""
return unittest.skipUnless(torch.cuda.device_count() == 1, "test requires a GPU")(test_case)
def require_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup. These tests are skipped on a machine without multiple
@ -78,12 +101,19 @@ def require_multi_gpu(test_case):
return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)
def require_tensorflow(test_case):
def require_deepspeed(test_case):
"""
Decorator marking a test that requires TensorFlow installed. These tests are skipped when TensorFlow isn't
Decorator marking a test that requires DeepSpeed installed. These tests are skipped when DeepSpeed isn't installed
"""
return unittest.skipUnless(is_deepspeed_available(), "test requires DeepSpeed")(test_case)
def require_tensorboard(test_case):
"""
Decorator marking a test that requires tensorboard installed. These tests are skipped when tensorboard isn't
installed
"""
return unittest.skipUnless(is_tensorflow_available(), "test requires TensorFlow")(test_case)
return unittest.skipUnless(is_tensorboard_available(), "test requires Tensorboard")(test_case)
def require_wandb(test_case):
@ -100,6 +130,22 @@ def require_comet_ml(test_case):
return unittest.skipUnless(is_comet_ml_available(), "test requires comet_ml")(test_case)
_atleast_one_tracker_available = (
any([is_wandb_available(), is_tensorboard_available()]) and not is_comet_ml_available()
)
def require_trackers(test_case):
"""
Decorator marking that a test requires at least one tracking library installed. These tests are skipped when none
are installed
"""
return unittest.skipUnless(
_atleast_one_tracker_available,
"test requires at least one tracker to be available and for `comet_ml` to not be installed",
)(test_case)
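A hedged usage sketch of the new decorators, with imports taken from the test_utils export list shown earlier in this diff; the test class and method names are illustrative.

import unittest

import torch

from accelerate.test_utils import require_cpu, require_single_gpu

class DeviceTests(unittest.TestCase):
    @require_cpu
    def test_cpu_only_path(self):
        # skipped automatically whenever a GPU is visible
        self.assertFalse(torch.cuda.is_available())

    @require_single_gpu
    def test_single_gpu_path(self):
        # skipped unless exactly one GPU is visible
        self.assertEqual(torch.cuda.device_count(), 1)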
class TempDirTestCase(unittest.TestCase):
"""
A TestCase class that keeps a single `tempfile.TemporaryDirectory` open for the duration of the class, wipes its
@ -250,3 +296,24 @@ def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False
)
return result
class SubprocessCallException(Exception):
pass
def run_command(command: List[str], return_stdout=False):
"""
Runs `command` with `subprocess.check_output` and will potentially return the `stdout`. Will also properly capture
if an error occurred while running `command`
"""
try:
output = subprocess.check_output(command, stderr=subprocess.STDOUT)
if return_stdout:
if hasattr(output, "decode"):
output = output.decode("utf-8")
return output
except subprocess.CalledProcessError as e:
raise SubprocessCallException(
f"Command `{' '.join(command)}` failed with the following error:\n\n{e.output.decode()}"
) from e
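A short, hedged example of how the helper above can be used; the command itself is illustrative.

import sys

stdout = run_command([sys.executable, "-c", "print('accelerate')"], return_stdout=True)
assert "accelerate" in stdout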

View File

@ -20,14 +20,15 @@ from .dataclasses import (
)
from .imports import (
is_apex_available,
is_bf16_available,
is_boto3_available,
is_ccl_available,
is_comet_ml_available,
is_deepspeed_available,
is_sagemaker_available,
is_tensorboard_available,
is_tensorflow_available,
is_tpu_available,
is_transformers_available,
is_wandb_available,
)
from .modeling import (
@ -48,6 +49,7 @@ from .offload import (
OffloadedWeightsLoader,
PrefixedDataset,
extract_submodules_state_dict,
load_offloaded_weight,
offload_state_dict,
offload_weight,
save_offload_index,
@ -77,9 +79,16 @@ from .versions import compare_versions, is_torch_version
if is_deepspeed_available():
from .deepspeed import DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper
from .deepspeed import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
HfDeepSpeedConfig,
)
from .launch import PrepareForLaunch
from .launch import PrepareForLaunch, get_launch_prefix
from .memory import find_executable_batch_size
from .other import (
extract_model_from_parallel,

View File

@ -20,5 +20,13 @@ MODEL_NAME = "pytorch_model"
RNG_STATE_NAME = "random_states"
OPTIMIZER_NAME = "optimizer"
SCHEDULER_NAME = "scheduler"
SAGEMAKER_PYTORCH_VERSION = "1.10.2"
SAGEMAKER_PYTHON_VERSION = "py38"
SAGEMAKER_TRANSFORMERS_VERSION = "4.17.0"
SAGEMAKER_PARALLEL_EC2_INSTANCES = ["ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4dn.24xlarge"]
FSDP_SHARDING_STRATEGY = ["FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD"]
FSDP_AUTO_WRAP_POLICY = ["TRANSFORMER_BASED_WRAP", "SIZE_BASED_WRAP", "NO_WRAP"]
FSDP_BACKWARD_PREFETCH = ["BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"]
DEEPSPEED_MULTINODE_LAUNCHERS = ["pdsh", "standard", "openmpi", "mvapich"]
STR_OPERATION_TO_FUNC = {">": op.gt, ">=": op.ge, "==": op.eq, "!=": op.ne, "<=": op.le, "<": op.lt}
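For illustration, this is roughly how the FSDP_BACKWARD_PREFETCH list above gets mapped onto torch's 1-indexed BackwardPrefetch enum further down in this diff; a sketch, assuming a recent PyTorch with FSDP available.

import os

from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch

FSDP_BACKWARD_PREFETCH = ["BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"]

choice = os.environ.get("FSDP_BACKWARD_PREFETCH", "NO_PREFETCH")
backward_prefetch = None
if choice != "NO_PREFETCH":
    # BackwardPrefetch members are 1-indexed, so the list position maps directly
    backward_prefetch = BackwardPrefetch(FSDP_BACKWARD_PREFETCH.index(choice) + 1)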

View File

@ -21,12 +21,15 @@ import enum
import functools
import os
import typing
import warnings
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, Iterable, Optional
from typing import Any, Callable, Iterable, Optional
import torch
from .constants import FSDP_AUTO_WRAP_POLICY, FSDP_BACKWARD_PREFETCH
class KwargsHandler:
"""
@ -178,6 +181,16 @@ class BaseEnum(enum.Enum, metaclass=EnumWithContains):
class LoggerType(BaseEnum):
"""Represents a type of supported experiment tracker
Values:
- **ALL** -- all available trackers in the environment that are supported
- **TENSORBOARD** -- TensorBoard as an experiment tracker
- **WANDB** -- wandb as an experiment tracker
- **COMETML** -- comet_ml as an experiment tracker
"""
ALL = "all"
TENSORBOARD = "tensorboard"
WANDB = "wandb"
@ -185,6 +198,15 @@ class LoggerType(BaseEnum):
class PrecisionType(BaseEnum):
"""Represents a type of precision used on floating point values
Values:
- **NO** -- using full precision (FP32)
- **FP16** -- using half precision
- **BF16** -- using brain floating point precision
"""
NO = "no"
FP16 = "fp16"
BF16 = "bf16"
@ -208,10 +230,20 @@ class TensorInformation:
@dataclass
class DeepSpeedPlugin:
"""
This plugin is used to integrate DeepSpeed.
"""
hf_ds_config: Any = field(
default=None,
metadata={
"help": "path to DeepSpeed config file or dict or an object of class `accelerate.utils.deepspeed.HfDeepSpeedConfig`."
},
)
gradient_accumulation_steps: int = field(
default=None, metadata={"help": "Number of steps to accumulate gradients before updating optimizer states"}
)
gradient_clipping: float = field(default=None, metadata={"help": "Enable gradient clipping with value"})
zero_stage: int = field(
default=None,
metadata={"help": "Possible options are 0,1,2,3; Default will be taken from environment variable"},
@ -220,37 +252,164 @@ class DeepSpeedPlugin:
default=True,
metadata={"help": "If both train & eval dataloaders are specified, this will decide the train_batch_size"},
)
auto_opt_mapping: bool = field(
default=True,
metadata={"help": "whether to map torch.adam to deepspeed optimizer version of adam based on config"},
offload_optimizer_device: bool = field(
default=None,
metadata={"help": "Possible options are none|cpu|nvme. Only applicable with ZeRO Stages 2 and 3."},
)
offload_param_device: bool = field(
default=None,
metadata={"help": "Possible options are none|cpu|nvme. Only applicable with ZeRO Stage 3."},
)
zero3_init_flag: bool = field(
default=None,
metadata={
"help": "Flag to indicate whether to enable `deepspeed.zero.Init` for constructing massive models."
"Only applicable with ZeRO Stage-3."
},
)
zero3_save_16bit_model: bool = field(
default=None,
metadata={"help": "Flag to indicate whether to save 16-bit model. Only applicable with ZeRO Stage-3."},
)
offload_optimizer_device: bool = field(default=None, metadata={"help": "Possible options are none|cpu|nvme"})
def __post_init__(self):
from .deepspeed import HfDeepSpeedConfig
if self.gradient_accumulation_steps is None:
self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
if self.hf_ds_config is None:
self.hf_ds_config = os.environ.get("DEEPSPEED_CONFIG_FILE", "none")
if (
isinstance(self.hf_ds_config, dict)
or (isinstance(self.hf_ds_config, str) and self.hf_ds_config != "none")
or isinstance(self.hf_ds_config, HfDeepSpeedConfig)
):
if not isinstance(self.hf_ds_config, HfDeepSpeedConfig):
self.hf_ds_config = HfDeepSpeedConfig(self.hf_ds_config)
if "gradient_accumulation_steps" not in self.hf_ds_config.config:
self.hf_ds_config.config["gradient_accumulation_steps"] = 1
elif self.hf_ds_config.config["gradient_accumulation_steps"] == "auto":
raise ValueError("gradient_accumulation_steps cannot be set to 'auto' in the DeepSpeed config.")
if "zero_optimization" not in self.hf_ds_config.config:
raise ValueError("Please specify the ZeRO optimization config in the DeepSpeed config.")
else:
if self.gradient_accumulation_steps is None:
self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
if self.zero_stage is None:
self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
if self.gradient_clipping is None:
gradient_clipping = os.environ.get("GRADIENT_CLIPPING", "none")
if gradient_clipping != "none":
self.gradient_clipping = float(gradient_clipping)
if self.offload_optimizer_device is None:
self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
if self.zero_stage is None:
self.zero_stage = int(os.environ.get("DEEPSPEED_ZERO_STAGE", 2))
self.deepspeed_config = {
"train_batch_size": None,
"gradient_accumulation_steps": self.gradient_accumulation_steps,
"zero_optimization": {
"stage": self.zero_stage,
"offload_optimizer": {
"device": self.offload_optimizer_device,
if self.offload_optimizer_device is None:
self.offload_optimizer_device = os.environ.get("DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE", "none")
if self.offload_param_device is None:
self.offload_param_device = os.environ.get("DEEPSPEED_OFFLOAD_PARAM_DEVICE", "none")
if self.zero3_save_16bit_model is None:
self.zero3_save_16bit_model = os.environ.get("DEEPSPEED_ZERO3_SAVE_16BIT_MODEL", "false") == "true"
config = {
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": self.gradient_accumulation_steps,
"zero_optimization": {
"stage": self.zero_stage,
"offload_optimizer": {
"device": self.offload_optimizer_device,
},
"offload_param": {
"device": self.offload_param_device,
},
"stage3_gather_16bit_weights_on_model_save": self.zero3_save_16bit_model,
},
},
"steps_per_print": float("inf"), # this will stop deepspeed from logging @ stdout
"zero_allow_untested_optimizer": True,
}
}
if self.gradient_clipping:
config["gradient_clipping"] = self.gradient_clipping
self.hf_ds_config = HfDeepSpeedConfig(config)
self.deepspeed_config = self.hf_ds_config.config
self.deepspeed_config["steps_per_print"] = float("inf") # this will stop deepspeed from logging @ stdout
if self.zero3_init_flag is None:
self.zero3_init_flag = os.environ.get("DEEPSPEED_ZERO3_INIT", "false") == "true"
if self.zero3_init_flag and not self.hf_ds_config.is_zero3():
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
self.zero3_init_flag = False
def fill_match(self, ds_key_long, mismatches, must_match=True, **kwargs):
config, ds_key = self.hf_ds_config.find_config_node(ds_key_long)
if config is None:
return
if config.get(ds_key) == "auto":
if ds_key_long in kwargs:
config[ds_key] = kwargs[ds_key_long]
return
else:
raise ValueError(
f"`{ds_key_long}` not found in kwargs. "
f"Please specify `{ds_key_long}` without `auto`(set to correct value) in the DeepSpeed config file or "
"pass it in kwargs."
)
if not must_match:
return
ds_val = config.get(ds_key)
if ds_val is not None and ds_key_long in kwargs:
if ds_val != kwargs[ds_key_long]:
mismatches.append(f"- ds {ds_key_long}={ds_val} vs arg {ds_key_long}={kwargs[ds_key_long]}")
def deepspeed_config_process(self, prefix="", mismatches=None, config=None, must_match=True, **kwargs):
"""Process the DeepSpeed config with the values from the kwargs."""
mismatches = [] if mismatches is None else mismatches
if config is None:
config = self.deepspeed_config
for key, value in config.items():
if isinstance(value, dict):
self.deepspeed_config_process(
prefix=prefix + key + ".", mismatches=mismatches, config=value, must_match=must_match, **kwargs
)
else:
self.fill_match(prefix + key, mismatches, must_match=must_match, **kwargs)
if len(mismatches) > 0 and prefix == "":
mismatches_msg = "\n".join(mismatches)
raise ValueError(
"Please correct the following DeepSpeed config values that mismatch kwargs "
f" values:\n{mismatches_msg}\nThe easiest method is to set these DeepSpeed config values to 'auto'."
)
def set_mixed_precision(self, mixed_precision):
ds_config = self.deepspeed_config
if mixed_precision == "fp16" and "fp16" not in ds_config and "bf16" not in ds_config:
ds_config.update({"fp16": {"enabled": True}})
elif mixed_precision == "bf16" and "fp16" not in ds_config and "bf16" not in ds_config:
ds_config.update({"bf16": {"enabled": True}})
def set_deepspeed_weakref(self):
from .imports import is_transformers_available
if self.zero3_init_flag:
if not is_transformers_available():
raise Exception(
"When `zero3_init_flag` is set, it requires Transformers to be installed. "
"Please run `pip install transformers`."
)
ds_config = copy.deepcopy(self.deepspeed_config)
if "gradient_accumulation_steps" not in ds_config or ds_config["gradient_accumulation_steps"] == "auto":
ds_config["gradient_accumulation_steps"] = 1
if (
"train_micro_batch_size_per_gpu" not in ds_config
or ds_config["train_micro_batch_size_per_gpu"] == "auto"
):
ds_config["train_micro_batch_size_per_gpu"] = 1
if ds_config["train_batch_size"] == "auto":
del ds_config["train_batch_size"]
from transformers.deepspeed import HfDeepSpeedConfig
self.dschf = HfDeepSpeedConfig(ds_config) # keep this object alive # noqa
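A hedged end-to-end sketch of the `auto` handling implemented above; the config dict is minimal and illustrative, a real DeepSpeed config would carry more fields.

from accelerate.utils.dataclasses import DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": 2,
    "zero_optimization": {"stage": 2},
}
plugin = DeepSpeedPlugin(hf_ds_config=ds_config)
# kwargs fill in any values that were left as "auto" and flag mismatches otherwise
plugin.deepspeed_config_process(train_micro_batch_size_per_gpu=16)
assert plugin.deepspeed_config["train_micro_batch_size_per_gpu"] == 16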
@dataclass
@ -261,22 +420,35 @@ class FullyShardedDataParallelPlugin:
sharding_strategy: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] FULL_SHARD, [2] SHARD_GRAD_OP"},
metadata={
"help": "FSDP Sharding Strategy of type `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`"
},
)
backward_prefetch: "typing.Any" = field(
default=None,
metadata={"help": "Possible options are [1] BACKWARD_PRE, [2] BACKWARD_POST"},
metadata={
"help": "FSDP Backward Prefetch of type `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`"
},
)
auto_wrap_policy: "typing.Any" = field(
mixed_precision_policy: "typing.Any" = field(
default=None,
metadata={
"help": "A config to enable mixed precision training with FullyShardedDataParallel. "
"The 3 flags that are set are `param_dtype`, `reduce_dtype`, `buffer_dtype`. "
"Each flag expects `torch.dtype` as the value. "
"It is of type `torch.distributed.fsdp.fully_sharded_data_parallel.MixedPrecision`."
},
)
auto_wrap_policy: Optional[Callable] = field(
default=None,
metadata={"help": "A callable specifying a policy to recursively wrap layers with FSDP"},
)
cpu_offload: Optional[Callable] = field(
cpu_offload: "typing.Any" = field(
default=None,
metadata={"help": "Decides Whether to offload parameters and gradients to CPU."},
)
min_num_params: int = field(
default=None, metadata={"help": "FSDP's minimum number of parameters for Default Auto Wrapping."}
metadata={
"help": "Decides whether to offload parameters and gradients to CPU. "
"It is of type `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffload`."
},
)
ignored_modules: Optional[Iterable[torch.nn.Module]] = field(
default=None,
@ -284,8 +456,7 @@ class FullyShardedDataParallelPlugin:
)
def __post_init__(self):
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload, ShardingStrategy
from torch.distributed.fsdp.wrap import default_auto_wrap_policy
from torch.distributed.fsdp.fully_sharded_data_parallel import BackwardPrefetch, CPUOffload, ShardingStrategy
if self.sharding_strategy is None:
self.sharding_strategy = ShardingStrategy(int(os.environ.get("FSDP_SHARDING_STRATEGY", 1)))
@ -296,9 +467,63 @@ class FullyShardedDataParallelPlugin:
else:
self.cpu_offload = CPUOffload(offload_params=False)
if self.min_num_params is None:
self.min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
if self.backward_prefetch is None:
prefetch_policy = os.environ.get("FSDP_BACKWARD_PREFETCH", FSDP_BACKWARD_PREFETCH[-1])
if prefetch_policy != FSDP_BACKWARD_PREFETCH[-1]:
self.backward_prefetch = BackwardPrefetch(FSDP_BACKWARD_PREFETCH.index(prefetch_policy) + 1)
@staticmethod
def get_module_class_from_name(module, name):
"""
Gets a class from a module by its name.
Args:
module (`torch.nn.Module`): The module to get the class from.
name (`str`): The name of the class.
"""
modules_children = list(module.children())
if module.__class__.__name__ == name:
return module.__class__
elif len(modules_children) == 0:
return
else:
for child_module in modules_children:
module_class = FullyShardedDataParallelPlugin.get_module_class_from_name(child_module, name)
if module_class is not None:
return module_class
def set_auto_wrap_policy(self, model):
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy
if self.auto_wrap_policy is None:
if self.min_num_params > 0:
self.auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=self.min_num_params)
auto_wrap_policy = os.environ.get("FSDP_AUTO_WRAP_POLICY", FSDP_AUTO_WRAP_POLICY[-1])
if auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[0]:
transformer_cls_to_wrap = os.environ.get("FSDP_TRANSFORMER_CLS_TO_WRAP", "")
transformer_cls_to_wrap = FullyShardedDataParallelPlugin.get_module_class_from_name(
model, transformer_cls_to_wrap
)
if transformer_cls_to_wrap is None:
raise Exception("Could not find the transformer layer class to wrap in the model.")
self.auto_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
# Transformer layer class to wrap
transformer_layer_cls={transformer_cls_to_wrap},
)
elif auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[1]:
min_num_params = int(os.environ.get("FSDP_MIN_NUM_PARAMS", 0))
if min_num_params > 0:
self.auto_wrap_policy = functools.partial(
size_based_auto_wrap_policy, min_num_params=min_num_params
)
def set_mixed_precision(self, mixed_precision):
if mixed_precision == "fp16":
dtype = torch.float16
elif mixed_precision == "bf16":
dtype = torch.bfloat16
else:
raise ValueError(f"Unknown mixed precision value: {mixed_precision}")
from torch.distributed.fsdp.fully_sharded_data_parallel import MixedPrecision
if self.mixed_precision_policy is None:
self.mixed_precision_policy = MixedPrecision(param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype)
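The static helper added above can be exercised on its own; a small sketch (importing the plugin class does not trigger its __post_init__, so no FSDP environment is needed just to call the helper).

import torch.nn as nn

from accelerate.utils.dataclasses import FullyShardedDataParallelPlugin

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
layer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, "Linear")
assert layer_cls is nn.Linear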

View File

@ -12,58 +12,157 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import json
from copy import deepcopy
from ..optimizer import AcceleratedOptimizer
from .imports import is_apex_available, is_deepspeed_available
from ..scheduler import AcceleratedScheduler
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_apex_available():
from apex import amp
class DeepSpeedEngineWrapper(DeepSpeedEngine):
class HfDeepSpeedConfig:
"""
Wrapper over deepspeed.DeepSpeedEngine object
This object contains a DeepSpeed configuration dictionary and can be quickly queried for things like zero stage.
A `weakref` of this object is stored in the module's globals to be able to access the config from areas where
things like the Trainer object is not available (e.g. `from_pretrained` and `_get_resized_embeddings`). Therefore
it's important that this object remains alive while the program is still running.
[`Trainer`] uses the `HfTrainerDeepSpeedConfig` subclass instead. That subclass has logic to sync the configuration
with values of [`TrainingArguments`] by replacing special placeholder values: `"auto"`. Without this special logic
the DeepSpeed configuration is not modified in any way.
Args:
config_file_or_dict (`Union[str, Dict]`): path to DeepSpeed config file or dict.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def __init__(self, config_file_or_dict):
# overwriting micro_steps for user's gradient_accumulation
self.micro_steps = -1
if isinstance(config_file_or_dict, dict):
# Don't modify user's data should they want to reuse it (e.g. in tests), because once we
# modified it, it will not be accepted here again, since `auto` values would have been overridden
config = deepcopy(config_file_or_dict)
elif isinstance(config_file_or_dict, str):
with io.open(config_file_or_dict, "r", encoding="utf-8") as f:
config = json.load(f)
else:
raise ValueError("expecting either a path to a DeepSpeed config file or a pre-populated dict")
self.config = config
def step(self, lr_kwargs=None):
"""DeepSpeedEngine.step() without `micro_steps` update & no profiling"""
if self.is_gradient_accumulation_boundary(): # it shouldn't matter whether we keep this line or not
if self.progressive_layer_drop:
self.progressive_layer_drop.update_state(self.global_steps)
# zero stage - this is done as early as possible, before model is created, to allow
# ``is_deepspeed_zero3_enabled`` query and getting to the early deepspeed config object
# during ``zero.Init()`` which needs to know the dtype, and some other hparams.
self._stage = self.get_value("zero_optimization.stage", -1)
self._take_model_step(lr_kwargs)
# offload
self._offload = False
if self.is_zero2() or self.is_zero3():
offload_devices_valid = set(["cpu", "nvme"])
offload_devices = set(
[
self.get_value("zero_optimization.offload_optimizer.device"),
self.get_value("zero_optimization.offload_param.device"),
]
)
if len(offload_devices & offload_devices_valid) > 0:
self._offload = True
def find_config_node(self, ds_key_long):
config = self.config
# find the config node of interest if it exists
nodes = ds_key_long.split(".")
ds_key = nodes.pop()
for node in nodes:
config = config.get(node)
if config is None:
return None, ds_key
return config, ds_key
def get_value(self, ds_key_long, default=None):
"""
Returns the set value or `default` if no value is set
"""
config, ds_key = self.find_config_node(ds_key_long)
if config is None:
return default
return config.get(ds_key, default)
def del_config_sub_tree(self, ds_key_long, must_exist=False):
"""
Deletes a sub-section of the config file if it's found.
Unless `must_exist` is `True` the section doesn't have to exist.
"""
config = self.config
# find the config node of interest if it exists
nodes = ds_key_long.split(".")
for node in nodes:
parent_config = config
config = config.get(node)
if config is None:
if must_exist:
raise ValueError(f"Can't find {ds_key_long} entry in the config: {self.config}")
else:
return
# if found remove it
if parent_config is not None:
parent_config.pop(node)
def is_true(self, ds_key_long):
"""
Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
specific question of whether the value is set to `True` (and it's not set to `False` or isn't set).
"""
value = self.get_value(ds_key_long)
return False if value is None else bool(value)
def is_false(self, ds_key_long):
"""
Returns `True`/`False` only if the value is set, always `False` otherwise. So use this method to ask the very
specific question of whether the value is set to `False` (and it's not set to `True` or isn't set).
"""
value = self.get_value(ds_key_long)
return False if value is None else not bool(value)
def is_zero2(self):
return self._stage == 2
def is_zero3(self):
return self._stage == 3
def is_offload(self):
return self._offload
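A small usage sketch of the query helpers defined above; the config dict is illustrative.

ds_config = {
    "gradient_accumulation_steps": 2,
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
}
hf_config = HfDeepSpeedConfig(ds_config)
assert hf_config.get_value("zero_optimization.stage") == 3
assert hf_config.is_zero3()
assert hf_config.is_offload()  # offload_param.device is "cpu"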
class DeepSpeedEngineWrapper:
"""
Internal wrapper for deepspeed.runtime.engine.DeepSpeedEngine. This is used to follow conventional training loop.
Args:
engine (deepspeed.runtime.engine.DeepSpeedEngine): deepspeed engine to wrap
"""
def __init__(self, engine):
self.engine = engine
def backward(self, loss):
"""DeepSpeedEngine.backward() with with no loss scaling; no profiling but with `micro_steps` update"""
# runs backpropagation and handles mixed precision
self.engine.backward(loss)
if self.zero_optimization():
self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumulation_boundary()
self.optimizer.backward(loss)
elif self.amp_enabled():
# AMP requires delaying unscale when inside gradient accumulation boundaries
# https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-iterations
delay_unscale = not self.is_gradient_accumulation_boundary()
with amp.scale_loss(loss, self.optimizer, delay_unscale=delay_unscale) as scaled_loss:
scaled_loss.backward()
elif self.fp16_enabled():
self.optimizer.backward(loss)
else:
loss.backward()
if self.enable_backward_allreduce:
self.allreduce_gradients()
# this will ensure deepspeed gradient_accumulation matches user's accumulation
self.micro_steps += 1
# deepspeed `engine.step` performs following operations:
# gradient accumulation check
# gradient clipping
# optimizer step
# zero grad
# checking overflow
# lr_scheduler step
self.engine.step()
class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
@ -75,22 +174,79 @@ class DeepSpeedOptimizerWrapper(AcceleratedOptimizer):
The optimizer to wrap.
"""
def __init__(self, optimizer, model: DeepSpeedEngineWrapper):
def __init__(self, optimizer):
super().__init__(optimizer, device_placement=False, scaler=None)
self.model = model
def zero_grad(self, set_to_none=None):
pass # `model.step()` is doing that automatically. Therefore, its implementation is not needed
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
def step(self):
"""This will handle optimizer.step() & optimizer.zero_grad() with gradient_accumulation"""
self.model.step()
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
@property
def is_overflow(self):
def step_was_skipped(self):
"""Whether or not the optimizer step was done, or skipped because of gradient overflow."""
overflow = False
if hasattr(self.optimizer, "overflow"):
overflow = self.optimizer.overflow
return overflow
return self.optimizer.overflow
class DeepSpeedSchedulerWrapper(AcceleratedScheduler):
"""
Internal wrapper around a deepspeed scheduler.
Args:
scheduler (`torch.optim.lr_scheduler.LambdaLR`):
The scheduler to wrap.
optimizers (one or a list of `torch.optim.Optimizer`):
"""
def __init__(self, scheduler, optimizers):
super().__init__(scheduler, optimizers)
def step(self):
pass # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
class DummyOptim:
"""
Dummy optimizer that presents model parameters or param groups; it is primarily used to follow the conventional
training loop when the optimizer config is specified in the DeepSpeed config file.
Args:
lr (float):
Learning rate.
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
weight_decay (float):
Weight decay.
**kwargs:
Other arguments.
"""
def __init__(self, params, lr=0.001, weight_decay=0, **kwargs):
self.params = params
self.lr = lr
self.weight_decay = weight_decay
self.kwargs = kwargs
class DummyScheduler:
"""
Dummy scheduler; it is primarily used to follow the conventional training loop when the scheduler config is
specified in the DeepSpeed config file.
Args:
optimizer (`torch.optim.optimizer.Optimizer`):
The optimizer to wrap.
total_num_steps (int):
Total number of steps.
warmup_num_steps (int):
Number of steps for warmup.
**kwargs:
Other arguments.
"""
def __init__(self, optimizer, total_num_steps=None, warmup_num_steps=0, **kwargs):
self.optimizer = optimizer
self.total_num_steps = total_num_steps
self.warmup_num_steps = warmup_num_steps
self.kwargs = kwargs
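A hedged usage sketch for the two dummies (DeepSpeed must be installed and the script launched via `accelerate launch`; the hyperparameters are placeholders): when the DeepSpeed config file defines both an `optimizer` and a `scheduler` block, the code-side objects are just these placeholders and `prepare` swaps in the real DeepSpeed ones, as the tests further down exercise.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils.deepspeed import DummyOptim, DummyScheduler

accelerator = Accelerator()  # assumes a DeepSpeed config with "optimizer" and "scheduler" sections
model = torch.nn.Linear(4, 2)
dataloader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 2)), batch_size=8)
optimizer = DummyOptim(params=model.parameters(), lr=5e-5, weight_decay=0.0)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=1000, warmup_num_steps=100)
model, optimizer, dataloader, lr_scheduler = accelerator.prepare(model, optimizer, dataloader, lr_scheduler)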

View File

@ -15,6 +15,10 @@
import importlib
import sys
import torch
from .versions import is_torch_version
# The package importlib_metadata is in a different place, depending on the Python version.
if sys.version_info < (3, 8):
@ -47,7 +51,15 @@ def is_apex_available():
return importlib.util.find_spec("apex") is not None
def is_tpu_available():
def is_tpu_available(check_device=True):
"Checks if `torch_xla` is installed and potentially if a TPU is in the environment"
if _tpu_available and check_device:
try:
# Will raise a RuntimeError if no XLA configuration is found
_ = xm.xla_device()
return True
except RuntimeError:
return False
return _tpu_available
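The new `check_device` flag separates "`torch_xla` is importable" from "an XLA device is actually reachable". The modules below use it exactly like this minimal sketch:
from accelerate.utils.imports import is_tpu_available

# import torch_xla only when it is installed, without requiring a live TPU at import time
if is_tpu_available(check_device=False):
    import torch_xla.core.xla_model as xm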
@ -63,8 +75,19 @@ def is_deepspeed_available():
return False
def is_tensorflow_available():
return importlib.util.find_spec("tensorflow") is not None
def is_bf16_available(ignore_tpu=False):
"Checks if bf16 is supported, optionally ignoring the TPU"
if is_tpu_available():
return not ignore_tpu
if is_torch_version(">=", "1.10"):
if torch.cuda.is_available():
return torch.cuda.is_bf16_supported()
return True
return False
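A small sketch of how the helper could be used to pick a mixed-precision mode; the fp16 fallback is an assumption for illustration, not something this diff prescribes:
from accelerate import Accelerator
from accelerate.utils.imports import is_bf16_available

mixed_precision = "bf16" if is_bf16_available() else "fp16"
accelerator = Accelerator(mixed_precision=mixed_precision)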
def is_transformers_available():
return importlib.util.find_spec("transformers") is not None
def is_tensorboard_available():

View File

@ -13,12 +13,28 @@
# limitations under the License.
import os
import sys
import torch
from ..utils import is_torch_version
from .dataclasses import DistributedType
def get_launch_prefix():
"""
Grabs the correct launcher for starting a distributed command, e.g. `torchrun` or `python -m
torch.distributed.run`, depending on the installed torch version.
"""
if is_torch_version(">=", "1.10.0"):
cmd = ["torchrun"]
elif is_torch_version(">=", "1.9.0"):
cmd = [sys.executable, "-m", "torch.distributed.run"]
else:
cmd = [sys.executable, "-m", "torch.distributed.launch", "--use_env"]
return cmd
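For illustration, a hedged sketch of using the prefix to launch a script (`train.py` and the process count are placeholders; `tests/test_grad_sync.py` below builds its command the same way):
import subprocess
from accelerate.utils import get_launch_prefix

# resolves to ["torchrun", ...] on torch >= 1.10, with older fallbacks otherwise
cmd = get_launch_prefix() + ["--nproc_per_node=2", "train.py"]
subprocess.run(cmd, check=True)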
class PrepareForLaunch:
"""
Prepare a function that will be launched in a distributed setup.
@ -52,4 +68,5 @@ class PrepareForLaunch:
os.environ["LOCAL_RANK"] = str(index)
os.environ["RANK"] = str(index)
os.environ["FORK_LAUNCHED"] = str(1)
self.launcher(*args)

View File

@ -78,7 +78,7 @@ def dtype_byte_size(dtype: torch.dtype):
"""
if dtype == torch.bool:
return 1 / 8
bit_search = re.search("[^\d](\d+)$", str(dtype))
bit_search = re.search(r"[^\d](\d+)$", str(dtype))
if bit_search is None:
raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
bit_size = int(bit_search.groups()[0])
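Concretely, the raw string only silences the invalid-escape deprecation warning on newer Pythons; the match itself is unchanged:
import re
import torch

bit_search = re.search(r"[^\d](\d+)$", str(torch.float16))  # str(torch.float16) == "torch.float16"
assert int(bit_search.groups()[0]) / 8 == 2  # 2 bytes per fp16 element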

View File

@ -22,10 +22,18 @@ import torch
def offload_weight(weight, weight_name, offload_folder, index=None):
dtype = None
# Check the string instead of the dtype to be compatible with versions of PyTorch that don't have bfloat16.
if str(weight.dtype) == "torch.bfloat16":
# Need to reinterpret the underlying data as int16 since NumPy does not handle bfloat16s.
weight = weight.view(torch.int16)
dtype = "bfloat16"
array = weight.numpy()
tensor_file = os.path.join(offload_folder, f"{weight_name}.dat")
if index is not None:
index[weight_name] = {"dtype": str(array.dtype), "shape": list(array.shape)}
if dtype is None:
dtype = str(array.dtype)
index[weight_name] = {"dtype": dtype, "shape": list(array.shape)}
if array.ndim == 0:
array = array[None]
file_array = np.memmap(tensor_file, dtype=array.dtype, mode="w+", shape=array.shape)
@ -34,6 +42,28 @@ def offload_weight(weight, weight_name, offload_folder, index=None):
return index
def load_offloaded_weight(weight_file, weight_info):
shape = tuple(weight_info["shape"])
if shape == ():
# NumPy memory-mapped arrays can't have 0 dims, so it was saved as a 1d tensor
shape = (1,)
dtype = weight_info["dtype"]
if dtype == "bfloat16":
# NumPy does not support bfloat16 so this was saved as an int16
dtype = "int16"
weight = np.memmap(weight_file, dtype=dtype, shape=shape, mode="r")
if len(weight_info["shape"]) == 0:
weight = weight[0]
weight = torch.tensor(weight)
if weight_info["dtype"] == "bfloat16":
weight = weight.view(torch.bfloat16)
return weight
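A round-trip sketch of the new bfloat16 handling, mirroring `tests/test_offload.py` further down (assumes torch >= 1.10 so bfloat16 tensors are available):
import os
from tempfile import TemporaryDirectory

import torch
from accelerate.utils import load_offloaded_weight, offload_weight

weight = torch.randn(2, 3, dtype=torch.bfloat16)
with TemporaryDirectory() as tmp_dir:
    index = offload_weight(weight, "weight", tmp_dir, index={})
    # on disk the data is stored as int16; the index remembers the original "bfloat16" dtype
    reloaded = load_offloaded_weight(os.path.join(tmp_dir, "weight.dat"), index["weight"])
    assert torch.equal(weight, reloaded)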
def save_offload_index(index, offload_folder):
if index is None or len(index) == 0:
# Nothing to save
@ -129,12 +159,7 @@ class OffloadedWeightsLoader(Mapping):
return self.state_dict[key]
weight_info = self.index[key]
weight_file = os.path.join(self.save_folder, f"{key}.dat")
shape = tuple(weight_info["shape"])
if shape == ():
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=(1,), mode="r")[0]
else:
weight = np.memmap(weight_file, dtype=weight_info["dtype"], shape=shape, mode="r")
return torch.tensor(weight)
return load_offloaded_weight(weight_file, weight_info)
def __iter__(self):
return iter(self.all_keys)

View File

@ -29,7 +29,7 @@ from .imports import is_tpu_available
from .versions import is_torch_version
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm
@ -217,7 +217,11 @@ def gather(tensor):
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_gather(tensor, name="accelerate.utils.gather")
elif AcceleratorState().distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
elif AcceleratorState().distributed_type in [
DistributedType.DEEPSPEED,
DistributedType.MULTI_GPU,
DistributedType.FSDP,
]:
return _gpu_gather(tensor)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _cpu_gather(tensor)
@ -430,7 +434,7 @@ def reduce(tensor, reduction="mean"):
xm.all_reduce("sum", cloned_tensor)
return cloned_tensor
elif state.distributed_type in [DistributedType.DEEPSPEED, DistributedType.MULTI_GPU]:
torch.distributed.reduce(cloned_tensor, ReduceOp.SUM)
torch.distributed.all_reduce(cloned_tensor, ReduceOp.SUM)
return cloned_tensor
else:
if reduction == "sum":

View File

@ -28,7 +28,7 @@ from .imports import is_deepspeed_available, is_tpu_available
if is_deepspeed_available():
from deepspeed import DeepSpeedEngine
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm

View File

@ -23,7 +23,7 @@ from .dataclasses import DistributedType, RNGType
from .imports import is_tpu_available
if is_tpu_available():
if is_tpu_available(check_device=False):
import torch_xla.core.xla_model as xm

View File

@ -0,0 +1,49 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
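A hedged sketch of pointing Accelerate at a config file like this one; the path is a placeholder, and the `auto` entries are later filled in by `deepspeed_config_process`, which the new tests below exercise:
from accelerate import Accelerator
from accelerate.utils.dataclasses import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config_zero2.json")
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)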

View File

@ -0,0 +1,56 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -0,0 +1,584 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import io
import itertools
import json
import os
import tempfile
import unittest
from copy import deepcopy
from pathlib import Path
import torch
from torch.utils.data import DataLoader
from accelerate.accelerator import Accelerator
from accelerate.scheduler import AcceleratedScheduler
from accelerate.state import AcceleratorState
from accelerate.test_utils.testing import require_cuda, require_deepspeed
from accelerate.test_utils.training import RegressionDataset
from accelerate.utils.dataclasses import DeepSpeedPlugin
from accelerate.utils.deepspeed import (
DeepSpeedEngineWrapper,
DeepSpeedOptimizerWrapper,
DeepSpeedSchedulerWrapper,
DummyOptim,
DummyScheduler,
)
from parameterized import parameterized
from transformers import AutoModel, AutoModelForCausalLM, get_scheduler
from transformers.testing_utils import mockenv_context
from transformers.trainer_utils import set_seed
from transformers.utils import is_torch_bf16_available
set_seed(42)
T5_SMALL = "t5-small"
T5_TINY = "patrickvonplaten/t5-tiny-random"
GPT2_TINY = "sshleifer/tiny-gpt2"
ZERO2 = "zero2"
ZERO3 = "zero3"
FP16 = "fp16"
BF16 = "bf16"
CUSTOM_OPTIMIZER = "custom_optimizer"
CUSTOM_SCHEDULER = "custom_scheduler"
DS_OPTIMIZER = "deepspeed_optimizer"
DS_SCHEDULER = "deepspeed_scheduler"
stages = [ZERO2, ZERO3]
optims = [CUSTOM_OPTIMIZER, DS_OPTIMIZER]
schedulers = [CUSTOM_SCHEDULER, DS_SCHEDULER]
if is_torch_bf16_available():
dtypes = [FP16, BF16]
else:
dtypes = [FP16]
def parameterized_custom_name_func(func, param_num, param):
# customize the test name generator function: we want both params to appear in the sub-test
# name, because by default only the first param is shown
param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
return f"{func.__name__}_{param_based_name}"
# Cartesian products of ZeRO stages with dtypes, and of optimizer with scheduler types, to test
params = list(itertools.product(stages, dtypes))
optim_scheduler_params = list(itertools.product(optims, schedulers))
@require_deepspeed
@require_cuda
class DeepSpeedConfigIntegration(unittest.TestCase):
def setUp(self):
super().setUp()
self._test_file_path = inspect.getfile(self.__class__)
path = Path(self._test_file_path).resolve()
self.test_file_dir_str = str(path.parents[0])
self.ds_config_file = dict(
zero2=f"{self.test_file_dir_str}/ds_config_zero2.json",
zero3=f"{self.test_file_dir_str}/ds_config_zero3.json",
)
# Use `self.get_config_dict(stage)` to access these configs so the originals are not modified
with io.open(self.ds_config_file[ZERO2], "r", encoding="utf-8") as f:
config_zero2 = json.load(f)
with io.open(self.ds_config_file[ZERO3], "r", encoding="utf-8") as f:
config_zero3 = json.load(f)
# The following setting slows things down, so don't enable it by default unless needed by a test.
# It's in the file as a demo for users since we want everything to work out of the box even if slower.
config_zero3["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = False
self.ds_config_dict = dict(zero2=config_zero2, zero3=config_zero3)
self.dist_env = dict(
USE_DEEPSPEED="true",
MASTER_ADDR="localhost",
MASTER_PORT="10999",
RANK="0",
LOCAL_RANK="0",
WORLD_SIZE="1",
)
def get_config_dict(self, stage):
# As some tests modify the dict, always make a copy
return deepcopy(self.ds_config_dict[stage])
@parameterized.expand(stages, name_func=parameterized_custom_name_func)
def test_deepspeed_plugin(self, stage):
# Test zero3_init_flag will be set to False when ZeRO stage != 3
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
self.assertFalse(deepspeed_plugin.zero3_init_flag)
deepspeed_plugin.deepspeed_config = None
# Test zero3_init_flag will be set to True only when ZeRO stage == 3
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=3,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
self.assertTrue(deepspeed_plugin.zero3_init_flag)
deepspeed_plugin.deepspeed_config = None
# Test config files are loaded correctly
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[stage], zero3_init_flag=True)
if stage == ZERO2:
self.assertFalse(deepspeed_plugin.zero3_init_flag)
elif stage == ZERO3:
self.assertTrue(deepspeed_plugin.zero3_init_flag)
# Test `gradient_accumulation_steps` is set to 1 if unavailable in config file
with tempfile.TemporaryDirectory() as dirpath:
ds_config = self.get_config_dict(stage)
del ds_config["gradient_accumulation_steps"]
with open(os.path.join(dirpath, "ds_config.json"), "w") as out_file:
json.dump(ds_config, out_file)
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=os.path.join(dirpath, "ds_config.json"))
self.assertEqual(deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"], 1)
deepspeed_plugin.deepspeed_config = None
# Test `ValueError` is raised if `zero_optimization` is unavailable in config file
with tempfile.TemporaryDirectory() as dirpath:
ds_config = self.get_config_dict(stage)
del ds_config["zero_optimization"]
with open(os.path.join(dirpath, "ds_config.json"), "w") as out_file:
json.dump(ds_config, out_file)
with self.assertRaises(ValueError) as cm:
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=os.path.join(dirpath, "ds_config.json"))
self.assertTrue(
"Please specify the ZeRO optimization config in the DeepSpeed config." in str(cm.exception)
)
deepspeed_plugin.deepspeed_config = None
# Test `deepspeed_config_process`
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[stage])
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
deepspeed_plugin.deepspeed_config_process(**kwargs)
for ds_key_long, value in kwargs.items():
config, ds_key = deepspeed_plugin.hf_ds_config.find_config_node(ds_key_long)
if config.get(ds_key) is not None:
self.assertEqual(config.get(ds_key), value)
# Test mismatches
mismatches = {
"optimizer.params.lr": 1e-5,
"optimizer.params.weight_decay": 1e-5,
"gradient_accumulation_steps": 2,
}
with self.assertRaises(ValueError) as cm:
new_kwargs = deepcopy(kwargs)
new_kwargs.update(mismatches)
deepspeed_plugin.deepspeed_config_process(**new_kwargs)
for key in mismatches.keys():
self.assertTrue(
key in str(cm.exception),
f"{key} is not in the exception message:\n{cm.exception}",
)
# Test `ValueError` is raised if a config file field with an `auto` value is missing from `kwargs`
deepspeed_plugin.deepspeed_config["optimizer"]["params"]["lr"] = "auto"
with self.assertRaises(ValueError) as cm:
del kwargs["optimizer.params.lr"]
deepspeed_plugin.deepspeed_config_process(**kwargs)
self.assertTrue("`optimizer.params.lr` not found in kwargs." in str(cm.exception))
@parameterized.expand([FP16, BF16], name_func=parameterized_custom_name_func)
def test_accelerate_state_deepspeed(self, dtype):
state = AcceleratorState(_from_accelerator=True)
if state.initialized:
state.initialized = False
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=ZERO2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
with mockenv_context(**self.dist_env):
state = Accelerator(mixed_precision=dtype, deepspeed_plugin=deepspeed_plugin).state
self.assertTrue(state.deepspeed_plugin.deepspeed_config[dtype]["enabled"])
state.initialized = False
def test_init_zero3(self):
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=3,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=True,
zero3_init_flag=True,
)
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
from transformers.deepspeed import is_deepspeed_zero3_enabled
self.assertTrue(is_deepspeed_zero3_enabled())
accelerator.state.initialized = False
@parameterized.expand(optim_scheduler_params, name_func=parameterized_custom_name_func)
def test_prepare_deepspeed(self, optim_type, scheduler_type):
# 1. Testing with one of the ZeRO Stages is enough to test the `_prepare_deepspeed` function.
# Here we test using ZeRO Stage 2 with FP16 enabled.
from deepspeed.runtime.engine import DeepSpeedEngine
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
if optim_type == CUSTOM_OPTIMIZER and scheduler_type == CUSTOM_SCHEDULER:
# Test custom optimizer + custom scheduler
deepspeed_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=1,
gradient_clipping=1.0,
zero_stage=2,
offload_optimizer_device="cpu",
offload_param_device="cpu",
zero3_save_16bit_model=False,
zero3_init_flag=False,
)
with mockenv_context(**self.dist_env):
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot create a `DummyOptim` without specifying an optimizer in the config file."
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(
"You cannot create a `DummyScheduler` without specifying a scheduler in the config file."
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
self.assertTrue(
"You must specify a training or evaluation dataloader in `accelerate.prepare()` when using DeepSpeed."
in str(cm.exception)
)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(accelerator.deepspeed_config["zero_allow_untested_optimizer"])
self.assertTrue(accelerator.deepspeed_config["train_batch_size"], 16)
self.assertEqual(type(model), DeepSpeedEngine)
self.assertEqual(type(optimizer), DeepSpeedOptimizerWrapper)
self.assertEqual(type(lr_scheduler), AcceleratedScheduler)
self.assertEqual(type(accelerator.deepspeed_engine_wrapped), DeepSpeedEngineWrapper)
elif optim_type == DS_OPTIMIZER and scheduler_type == DS_SCHEDULER:
# Test DeepSpeed optimizer + DeepSpeed scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(
"You cannot specify an optimizer in the config file and in the code at the same time"
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot specify a scheduler in the config file and in the code at the same time"
in str(cm.exception)
)
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You cannot specify a scheduler in the config file and in the code at the same time"
in str(cm.exception)
)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(type(model) == DeepSpeedEngine)
self.assertTrue(type(optimizer) == DeepSpeedOptimizerWrapper)
self.assertTrue(type(lr_scheduler) == DeepSpeedSchedulerWrapper)
self.assertTrue(type(accelerator.deepspeed_engine_wrapped) == DeepSpeedEngineWrapper)
elif optim_type == CUSTOM_OPTIMIZER and scheduler_type == DS_SCHEDULER:
# Test custom optimizer + DeepSpeed scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
del accelerator.state.deepspeed_plugin.deepspeed_config["optimizer"]
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertTrue(type(model) == DeepSpeedEngine)
self.assertTrue(type(optimizer) == DeepSpeedOptimizerWrapper)
self.assertTrue(type(lr_scheduler) == DeepSpeedSchedulerWrapper)
self.assertTrue(type(accelerator.deepspeed_engine_wrapped) == DeepSpeedEngineWrapper)
elif optim_type == DS_OPTIMIZER and scheduler_type == CUSTOM_SCHEDULER:
# Test deepspeed optimizer + custom scheduler
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=self.ds_config_file[ZERO2])
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=10, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=5, shuffle=False)
model = AutoModel.from_pretrained(GPT2_TINY)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
name="linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=1000,
)
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
del accelerator.state.deepspeed_plugin.deepspeed_config["scheduler"]
with self.assertRaises(ValueError) as cm:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, lr_scheduler
)
self.assertTrue(
"You can only specify `accelerate.utils.DummyScheduler` in the code when using `accelerate.utils.DummyOptim`."
in str(cm.exception)
)
accelerator.state.initialized = False
def test_save_checkpoints(self):
deepspeed_plugin = DeepSpeedPlugin(
hf_ds_config=self.ds_config_file[ZERO3],
zero3_init_flag=True,
)
del deepspeed_plugin.deepspeed_config["bf16"]
kwargs = {
"fp16.enabled": True,
"bf16.enabled": False,
"optimizer.params.lr": 5e-5,
"optimizer.params.weight_decay": 0.0,
"scheduler.params.warmup_min_lr": 0.0,
"scheduler.params.warmup_max_lr": 5e-5,
"scheduler.params.warmup_num_steps": 0,
"train_micro_batch_size_per_gpu": 16,
"gradient_clipping": 1.0,
"train_batch_size": 16,
"zero_optimization.reduce_bucket_size": 5e5,
"zero_optimization.stage3_prefetch_bucket_size": 5e5,
"zero_optimization.stage3_param_persistence_threshold": 5e5,
"zero_optimization.stage3_gather_16bit_weights_on_model_save": False,
}
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
kwargs["train_batch_size"] = (
kwargs["train_micro_batch_size_per_gpu"]
* deepspeed_plugin.deepspeed_config["gradient_accumulation_steps"]
* accelerator.num_processes
)
accelerator.state.deepspeed_plugin.deepspeed_config_process(**kwargs)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_optimizer = DummyOptim(params=model.parameters())
dummy_lr_scheduler = DummyScheduler(dummy_optimizer)
model, _, train_dataloader, eval_dataloader, _ = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
with self.assertRaises(ValueError) as cm:
accelerator.get_state_dict(model)
msg = (
"Cannot get 16bit model weights because `stage3_gather_16bit_weights_on_model_save` in DeepSpeed config is False. "
"To save the model weights in 16bit, set `stage3_gather_16bit_weights_on_model_save` to True in DeepSpeed config file or "
"set `zero3_save_16bit_model` to True when using `accelerate config`. "
"To save the full checkpoint, run `model.save_checkpoint(save_dir)` and use `zero_to_fp32.py` to recover weights."
)
self.assertTrue(msg in str(cm.exception))
accelerator.state.initialized = False
def test_autofill_dsconfig(self):
deepspeed_plugin = DeepSpeedPlugin(
hf_ds_config=self.ds_config_file[ZERO3],
zero3_init_flag=True,
)
del deepspeed_plugin.deepspeed_config["bf16"]
del deepspeed_plugin.deepspeed_config["fp16"]
with mockenv_context(**self.dist_env):
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
train_set = RegressionDataset(length=80)
eval_set = RegressionDataset(length=20)
train_dataloader = DataLoader(train_set, batch_size=16, shuffle=True)
eval_dataloader = DataLoader(eval_set, batch_size=32, shuffle=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_optimizer = DummyOptim(params=model.parameters(), lr=5e-5, weight_decay=1e-4)
dummy_lr_scheduler = DummyScheduler(dummy_optimizer, warmup_num_steps=10, total_num_steps=1000)
hidden_size = model.config.hidden_size
model, _, train_dataloader, eval_dataloader, _ = accelerator.prepare(
model, dummy_optimizer, train_dataloader, eval_dataloader, dummy_lr_scheduler
)
self.assertEqual(accelerator.deepspeed_config["train_micro_batch_size_per_gpu"], 16)
self.assertEqual(accelerator.deepspeed_config["train_batch_size"], 16)
self.assertEqual(accelerator.deepspeed_config["optimizer"]["params"]["lr"], 5e-5)
self.assertEqual(accelerator.deepspeed_config["optimizer"]["params"]["weight_decay"], 1e-4)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_min_lr"], 0.0)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_max_lr"], 5e-5)
self.assertEqual(accelerator.deepspeed_config["scheduler"]["params"]["warmup_num_steps"], 10)
self.assertEqual(accelerator.deepspeed_config["gradient_clipping"], 1.0)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["reduce_bucket_size"], hidden_size * hidden_size
)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["stage3_prefetch_bucket_size"],
0.9 * hidden_size * hidden_size,
)
self.assertEqual(
accelerator.deepspeed_config["zero_optimization"]["stage3_param_persistence_threshold"],
10 * hidden_size,
)
self.assertFalse(
accelerator.deepspeed_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"]
)
accelerator.state.initialized = False

View File

@ -56,6 +56,29 @@ class BiggerModelForTest(nn.Module):
return self.linear4(self.linear3(self.batchnorm(self.linear2(self.linear1(x)))))
# To test preload_module_classes
class ModuleWithUnusedSubModules(nn.Module):
def __init__(self, input_dim, output_dim):
super().__init__()
self.linear = nn.Linear(input_dim, output_dim)
def forward(self, x):
return x @ self.linear.weight.t() + self.linear.bias
class ModelWithUnusedSubModulesForTest(nn.Module):
def __init__(self):
super().__init__()
self.linear1 = ModuleWithUnusedSubModules(3, 4)
self.linear2 = ModuleWithUnusedSubModules(4, 5)
self.batchnorm = nn.BatchNorm1d(5)
self.linear3 = ModuleWithUnusedSubModules(5, 6)
self.linear4 = ModuleWithUnusedSubModules(6, 5)
def forward(self, x):
return self.linear4(self.linear3(self.batchnorm(self.linear2(self.linear1(x)))))
class BigModelingTester(unittest.TestCase):
def test_init_empty_weights(self):
# base use
@ -94,14 +117,45 @@ class BigModelingTester(unittest.TestCase):
cpu_offload(model, execution_device=device)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
cpu_offload(model, execution_device=device, offload_buffers=True)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
def test_cpu_offload_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
x = torch.randn(2, 3)
expected = model(x)
device = torch.device(0 if torch.cuda.is_available() else "cpu")
cpu_offload(model, execution_device=device, preload_module_classes=["ModuleWithUnusedSubModules"])
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
cpu_offload(
model,
execution_device=device,
offload_buffers=True,
preload_module_classes=["ModuleWithUnusedSubModules"],
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
@slow
@require_cuda
@ -127,7 +181,9 @@ class BigModelingTester(unittest.TestCase):
with TemporaryDirectory() as tmp_dir:
disk_offload(model, tmp_dir, execution_device=device)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
@ -135,7 +191,41 @@ class BigModelingTester(unittest.TestCase):
with TemporaryDirectory() as tmp_dir:
disk_offload(model, tmp_dir, execution_device=device, offload_buffers=True)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu()))
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
def test_disk_offload_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
x = torch.randn(2, 3)
expected = model(x)
device = torch.device(0 if torch.cuda.is_available() else "cpu")
with TemporaryDirectory() as tmp_dir:
disk_offload(
model, tmp_dir, execution_device=device, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
# Clean up for next test.
remove_hook_from_submodules(model)
with TemporaryDirectory() as tmp_dir:
disk_offload(
model,
tmp_dir,
execution_device=device,
offload_buffers=True,
preload_module_classes=["ModuleWithUnusedSubModules"],
)
output = model(x)
self.assertTrue(
torch.allclose(expected, output.cpu(), 1e-4, 1e-5), msg=f"Expected: {expected}\nActual: {output.cpu()}"
)
@slow
@require_cuda
@ -229,6 +319,36 @@ class BigModelingTester(unittest.TestCase):
"Hello world! My name is Kiyoshi, and I'm a student at the University of Tokyo",
)
@require_cuda
def test_dispatch_model_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 0}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
dispatch_model(
model, device_map, offload_dir=tmp_dir, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_multi_gpu
def test_dispatch_model_with_unused_submodules_multi_gpu(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
dispatch_model(
model, device_map, offload_dir=tmp_dir, preload_module_classes=["ModuleWithUnusedSubModules"]
)
output = model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_cuda
def test_load_checkpoint_and_dispatch(self):
model = ModelForTest()
@ -274,3 +394,55 @@ class BigModelingTester(unittest.TestCase):
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_cuda
def test_load_checkpoint_and_dispatch_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "cpu", "batchnorm": 0, "linear3": 0, "linear4": 0}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
checkpoint = os.path.join(tmp_dir, "pt_model.bin")
torch.save(model.state_dict(), checkpoint)
new_model = ModelWithUnusedSubModulesForTest()
new_model = load_checkpoint_and_dispatch(
new_model, checkpoint, device_map=device_map, preload_module_classes=["ModuleWithUnusedSubModules"]
)
# CPU-offloaded weights are on the meta device while waiting for the forward pass.
self.assertEqual(new_model.linear1.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear2.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear3.linear.weight.device, torch.device(0))
self.assertEqual(new_model.linear4.linear.weight.device, torch.device(0))
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))
@require_multi_gpu
def test_load_checkpoint_and_dispatch_multi_gpu_with_unused_submodules(self):
model = ModelWithUnusedSubModulesForTest()
device_map = {"linear1": "cpu", "linear2": "cpu", "batchnorm": 0, "linear3": 0, "linear4": 1}
x = torch.randn(2, 3)
expected = model(x)
with TemporaryDirectory() as tmp_dir:
checkpoint = os.path.join(tmp_dir, "pt_model.bin")
torch.save(model.state_dict(), checkpoint)
new_model = ModelWithUnusedSubModulesForTest()
new_model = load_checkpoint_and_dispatch(
new_model, checkpoint, device_map=device_map, preload_module_classes=["ModuleWithUnusedSubModules"]
)
# CPU-offloaded weights are on the meta device while waiting for the forward pass.
self.assertEqual(new_model.linear1.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear2.linear.weight.device, torch.device("meta"))
self.assertEqual(new_model.linear3.linear.weight.device, torch.device(0))
self.assertEqual(new_model.linear4.linear.weight.device, torch.device(1))
output = new_model(x)
self.assertTrue(torch.allclose(expected, output.cpu(), atol=1e-5))

View File

@ -15,9 +15,10 @@
import unittest
from accelerate import debug_launcher
from accelerate.test_utils import test_script
from accelerate.test_utils import require_cpu, test_script
@require_cpu
class MultiCPUTester(unittest.TestCase):
def test_cpu(self):
debug_launcher(test_script.main)

View File

@ -16,14 +16,14 @@ import ast
import os
import re
import shutil
import subprocess
import tempfile
import unittest
from unittest import mock
from accelerate import Accelerator
import torch
from accelerate.test_utils.examples import compare_against_test
from accelerate.test_utils.testing import TempDirTestCase, slow
from accelerate.test_utils.testing import TempDirTestCase, require_trackers, run_command, slow
from accelerate.utils import write_basic_config
@ -31,7 +31,14 @@ from accelerate.utils import write_basic_config
# Should mock `{script_name}.get_dataloaders` via:
# @mock.patch("{script_name}.get_dataloaders", mocked_dataloaders)
EXCLUDE_EXAMPLES = ["cross_validation.py", "multi_process_metrics.py", "memory.py", "fsdp_with_peak_mem_tracking.py"]
EXCLUDE_EXAMPLES = [
"cross_validation.py",
"gradient_accumulation.py",
"multi_process_metrics.py",
"memory.py",
"fsdp_with_peak_mem_tracking.py",
"deepspeed_with_config_support.py",
]
class ExampleDifferenceTests(unittest.TestCase):
@ -137,7 +144,7 @@ class FeatureExamplesTests(TempDirTestCase):
--checkpointing_steps epoch
--output_dir {self.tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "epoch_1")))
def test_checkpointing_by_steps(self):
@ -146,7 +153,7 @@ class FeatureExamplesTests(TempDirTestCase):
--checkpointing_steps 1
--output_dir {self.tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE, env=os.environ)
_ = run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(self.tmpdir, "step_5")))
def test_load_states_by_epoch(self):
@ -154,9 +161,7 @@ class FeatureExamplesTests(TempDirTestCase):
examples/by_feature/checkpointing.py
--resume_from_checkpoint {os.path.join(self.tmpdir, "epoch_1")}
""".split()
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
output = run_command(self._launch_args + testargs, return_stdout=True)
self.assertNotIn("epoch 0:", output)
self.assertNotIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
@ -166,18 +171,18 @@ class FeatureExamplesTests(TempDirTestCase):
examples/by_feature/checkpointing.py
--resume_from_checkpoint {os.path.join(self.tmpdir, "step_5")}
""".split()
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
num_processes = Accelerator().num_processes
output = run_command(self._launch_args + testargs, return_stdout=True)
if torch.cuda.is_available():
num_processes = torch.cuda.device_count()
else:
num_processes = 1
if num_processes > 1:
self.assertNotIn("epoch 0:", output)
self.assertNotIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
else:
self.assertNotIn("epoch 0:", output)
self.assertIn("epoch 1:", output)
self.assertIn("epoch 2:", output)
self.assertIn("epoch 2:", output)
@slow
def test_cross_validation(self):
@ -186,16 +191,16 @@ class FeatureExamplesTests(TempDirTestCase):
--num_folds 2
""".split()
with mock.patch.dict(os.environ, {"TESTING_MOCKED_DATALOADERS": "0"}):
output = subprocess.run(
self._launch_args + testargs, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
).stdout
output = run_command(self._launch_args + testargs, return_stdout=True)
results = ast.literal_eval(re.findall("({.+})", output)[-1])
self.assertGreaterEqual(results["accuracy"], 0.75)
def test_multi_process_metrics(self):
testargs = ["examples/by_feature/multi_process_metrics.py"]
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
@require_trackers
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline"})
def test_tracking(self):
with tempfile.TemporaryDirectory() as tmpdir:
testargs = f"""
@ -203,5 +208,9 @@ class FeatureExamplesTests(TempDirTestCase):
--with_tracking
--logging_dir {tmpdir}
""".split()
_ = subprocess.run(self._launch_args + testargs, stdout=subprocess.PIPE)
run_command(self._launch_args + testargs)
self.assertTrue(os.path.exists(os.path.join(tmpdir, "tracking")))
def test_gradient_accumulation(self):
testargs = ["examples/by_feature/gradient_accumulation.py"]
run_command(self._launch_args + testargs)

tests/test_grad_sync.py (new file, 55 lines)
View File

@ -0,0 +1,55 @@
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import os
import unittest
import torch
import accelerate
from accelerate import debug_launcher
from accelerate.test_utils import (
execute_subprocess_async,
require_cpu,
require_multi_gpu,
require_single_gpu,
test_sync,
)
from accelerate.utils import get_launch_prefix, patch_environment
class SyncScheduler(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_sync.py"])
@require_cpu
def test_gradient_sync_cpu_noop(self):
debug_launcher(test_sync.main, num_processes=1)
@require_cpu
def test_gradient_sync_cpu_multi(self):
debug_launcher(test_sync.main)
@require_single_gpu
def test_gradient_sync_gpu(self):
test_sync.main()
@require_multi_gpu
def test_gradient_sync_gpu_multi(self):
print(f"Found {torch.cuda.device_count()} devices.")
cmd = get_launch_prefix() + [f"--nproc_per_node={torch.cuda.device_count()}", self.test_file_path]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())

View File

@ -77,20 +77,20 @@ class HooksModelTester(unittest.TestCase):
test_hook = PreForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, expected))
self.assertTrue(torch.allclose(output1, expected, atol=1e-5))
# Attaching a hook to a model that already has one replaces the existing hook, it does not chain them
test_hook = PreForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, expected))
self.assertTrue(torch.allclose(output1, expected, atol=1e-5))
# You need to use the sequential hook to chain two or more hooks
test_hook = SequentialHook(PreForwardHook(), PreForwardHook())
add_hook_to_module(test_model, test_hook)
output2 = test_model(x)
assert torch.allclose(output2, expected2)
assert torch.allclose(output2, expected2, atol=1e-5)
def test_post_forward_hook_is_executed(self):
test_model = ModelForTest()
@ -100,20 +100,20 @@ class HooksModelTester(unittest.TestCase):
test_hook = PostForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, output + 1))
self.assertTrue(torch.allclose(output1, output + 1, atol=1e-5))
# Attaching a hook to a model that already has one replaces the existing hook, it does not chain them
test_hook = PostForwardHook()
add_hook_to_module(test_model, test_hook)
output1 = test_model(x)
self.assertTrue(torch.allclose(output1, output + 1))
self.assertTrue(torch.allclose(output1, output + 1, atol=1e-5))
# You need to use the sequential hook to chain two or more hooks
test_hook = SequentialHook(PostForwardHook(), PostForwardHook())
add_hook_to_module(test_model, test_hook)
output2 = test_model(x)
assert torch.allclose(output2, output + 2)
assert torch.allclose(output2, output + 2, atol=1e-5)
def test_no_grad_in_hook(self):
test_model = ModelForTest()

View File

@ -21,6 +21,7 @@ from dataclasses import dataclass
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs, GradScalerKwargs
from accelerate.state import AcceleratorState
from accelerate.test_utils import execute_subprocess_async, require_cuda, require_multi_gpu
from accelerate.utils import KwargsHandler
@ -44,7 +45,8 @@ class DataLoaderTester(unittest.TestCase):
def test_grad_scaler_kwargs(self):
# If no defaults are changed, `to_kwargs` returns an empty dict.
scaler_handler = GradScalerKwargs(init_scale=1024, growth_factor=2)
accelerator = Accelerator(fp16=True, kwargs_handlers=[scaler_handler])
AcceleratorState._reset_state()
accelerator = Accelerator(mixed_precision="fp16", kwargs_handlers=[scaler_handler])
print(accelerator.use_fp16)
scaler = accelerator.scaler

View File

@ -14,7 +14,6 @@
import inspect
import os
import sys
import unittest
import torch
@ -22,35 +21,26 @@ import torch
import accelerate
from accelerate import Accelerator
from accelerate.test_utils import execute_subprocess_async, require_multi_gpu
from accelerate.utils import get_launch_prefix, patch_environment
class MultiGPUTester(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["test_script.py"])
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_script.py"])
@require_multi_gpu
def test_multi_gpu(self):
print(f"Found {torch.cuda.device_count()} devices.")
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={torch.cuda.device_count()}
--use_env
{self.test_file_path}
""".split()
cmd = [sys.executable] + distributed_args
execute_subprocess_async(cmd, env=os.environ.copy())
cmd = get_launch_prefix() + [self.test_file_path]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())
@require_multi_gpu
def test_pad_across_processes(self):
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={torch.cuda.device_count()}
--use_env
{inspect.getfile(self.__class__)}
""".split()
cmd = [sys.executable] + distributed_args
execute_subprocess_async(cmd, env=os.environ.copy())
cmd = get_launch_prefix() + [inspect.getfile(self.__class__)]
with patch_environment(omp_num_threads=1):
execute_subprocess_async(cmd, env=os.environ.copy())
if __name__ == "__main__":

View File

@ -19,7 +19,13 @@ from tempfile import TemporaryDirectory
import torch
import torch.nn as nn
from accelerate.utils import OffloadedWeightsLoader, offload_state_dict
from accelerate.utils import (
OffloadedWeightsLoader,
is_torch_version,
load_offloaded_weight,
offload_state_dict,
offload_weight,
)
class ModelForTest(nn.Module):
@ -35,8 +41,6 @@ class ModelForTest(nn.Module):
class OffloadTester(unittest.TestCase):
def test_offload_state_dict(self):
from tempfile import TemporaryDirectory
model = ModelForTest()
with TemporaryDirectory() as tmp_dir:
offload_state_dict(tmp_dir, model.state_dict())
@ -49,6 +53,22 @@ class OffloadTester(unittest.TestCase):
self.assertTrue(os.path.isfile(weight_file))
# TODO: add tests on the fact weights are properly loaded
def test_offload_weight(self):
dtypes = [torch.float16, torch.float32]
if is_torch_version(">=", "1.10"):
dtypes.append(torch.bfloat16)
for dtype in dtypes:
weight = torch.randn(2, 3, dtype=dtype)
with TemporaryDirectory() as tmp_dir:
index = offload_weight(weight, "weight", tmp_dir, {})
weight_file = os.path.join(tmp_dir, "weight.dat")
self.assertTrue(os.path.isfile(weight_file))
self.assertDictEqual(index, {"weight": {"shape": [2, 3], "dtype": str(dtype).split(".")[1]}})
new_weight = load_offloaded_weight(weight_file, index["weight"])
self.assertTrue(torch.equal(weight, new_weight))
def test_offload_weights_loader(self):
model = ModelForTest()
state_dict = model.state_dict()

View File

@ -18,6 +18,7 @@ from functools import partial
import torch
from accelerate import Accelerator, debug_launcher
from accelerate.test_utils import require_cpu
def scheduler_test(num_processes=2, step_scheduler_with_optimizer=True, split_batches=False):
@ -46,6 +47,7 @@ def scheduler_test(num_processes=2, step_scheduler_with_optimizer=True, split_ba
), f"Wrong lr found at second step, expected {expected_lr}, got {scheduler.get_last_lr()[0]}"
@require_cpu
class SchedulerTester(unittest.TestCase):
def test_scheduler_steps_with_optimizer_single_process(self):
debug_launcher(partial(scheduler_test, num_processes=1), num_processes=1)

View File

@ -24,7 +24,7 @@ from accelerate.test_utils import execute_subprocess_async, require_tpu
class MultiTPUTester(unittest.TestCase):
def setUp(self):
mod_file = inspect.getfile(accelerate.test_utils)
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["test_script.py"])
self.test_file_path = os.path.sep.join(mod_file.split(os.path.sep)[:-1] + ["scripts", "test_script.py"])
self.test_dir = os.path.sep.join(inspect.getfile(self.__class__).split(os.path.sep)[:-1])
@require_tpu

View File

@ -30,31 +30,22 @@ from accelerate.test_utils.testing import (
MockingTestCase,
TempDirTestCase,
require_comet_ml,
require_tensorflow,
require_tensorboard,
require_wandb,
)
from accelerate.tracking import CometMLTracker, GeneralTracker
from accelerate.utils import is_comet_ml_available, is_tensorflow_available
from accelerate.utils import is_comet_ml_available
if is_comet_ml_available():
from comet_ml import OfflineExperiment
if is_tensorflow_available():
import tensorflow as tf
from tensorboard.plugins.hparams import plugin_data_pb2
from tensorflow.core.util import event_pb2
from tensorflow.python.summary.summary_iterator import summary_iterator
logger = logging.getLogger(__name__)
@require_tensorboard
class TensorBoardTrackingTest(unittest.TestCase):
@require_tensorflow
def test_init_trackers(self):
hps = None
project_name = "test_project_with_config"
with tempfile.TemporaryDirectory() as dirpath:
accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
@ -63,29 +54,9 @@ class TensorBoardTrackingTest(unittest.TestCase):
accelerator.end_training()
for child in Path(f"{dirpath}/{project_name}").glob("*/**"):
log = list(filter(lambda x: x.is_file(), child.iterdir()))[0]
# The config log is stored one layer deeper in the logged directory
# And names are randomly generated each time
si = summary_iterator(str(log))
# Pull HPS through careful parsing
for event in si:
for value in event.summary.value:
proto_bytes = value.metadata.plugin_data.content
plugin_data = plugin_data_pb2.HParamsPluginData.FromString(proto_bytes)
if plugin_data.HasField("session_start_info"):
hps = dict(plugin_data.session_start_info.hparams)
self.assertNotEqual(str(log), "")
self.assertTrue(isinstance(hps, dict))
keys = list(hps.keys())
keys.sort()
self.assertEqual(keys, ["learning_rate", "num_iterations", "some_boolean", "some_string"])
self.assertEqual(hps["num_iterations"].number_value, 12)
self.assertEqual(hps["learning_rate"].number_value, 0.01)
self.assertEqual(hps["some_boolean"].bool_value, False)
self.assertEqual(hps["some_string"].string_value, "some_value")
@require_tensorflow
def test_log(self):
step = None
project_name = "test_project_with_log"
with tempfile.TemporaryDirectory() as dirpath:
accelerator = Accelerator(log_with="tensorboard", logging_dir=dirpath)
@ -96,21 +67,7 @@ class TensorBoardTrackingTest(unittest.TestCase):
# Logged values are stored in the outermost-tfevents file and can be read in as a TFRecord
# Names are randomly generated each time
log = list(filter(lambda x: x.is_file(), Path(f"{dirpath}/{project_name}").iterdir()))[0]
serialized_examples = tf.data.TFRecordDataset(log)
for e in serialized_examples:
event = event_pb2.Event.FromString(e.numpy())
if step is None:
step = event.step
for value in event.summary.value:
if value.tag == "total_loss":
total_loss = value.simple_value
elif value.tag == "iteration":
iteration = value.simple_value
elif value.tag == "my_text/text_summary": # Append /text_summary to the key
my_text = value.tensor.string_val[0].decode()
self.assertAlmostEqual(total_loss, values["total_loss"])
self.assertEqual(iteration, values["iteration"])
self.assertEqual(my_text, values["my_text"])
self.assertNotEqual(str(log), "")
def test_logging_dir(self):
with self.assertRaisesRegex(ValueError, "Logging with `tensorboard` requires a `logging_dir`"):

View File

@ -23,7 +23,7 @@ from accelerate.test_utils.training import RegressionModel
from accelerate.utils import convert_outputs_to_fp32, find_device, patch_environment, send_to_device
TestNamedTuple = namedtuple("TestNamedTuple", "a b c")
ExampleNamedTuple = namedtuple("ExampleNamedTuple", "a b c")
class UtilsTester(unittest.TestCase):
@ -50,8 +50,8 @@ class UtilsTester(unittest.TestCase):
self.assertTrue(torch.equal(result2["b"][1].cpu(), tensor))
self.assertEqual(result2["c"], 1)
result3 = send_to_device(TestNamedTuple(a=tensor, b=[tensor, tensor], c=1), device)
self.assertIsInstance(result3, TestNamedTuple)
result3 = send_to_device(ExampleNamedTuple(a=tensor, b=[tensor, tensor], c=1), device)
self.assertIsInstance(result3, ExampleNamedTuple)
self.assertTrue(torch.equal(result3.a.cpu(), tensor))
self.assertIsInstance(result3.b, list)
self.assertTrue(torch.equal(result3.b[0].cpu(), tensor))

View File

@ -78,7 +78,6 @@ def main():
# Patch sys.argv
sys.argv = [args.training_script] + args.training_script_args + ["--tpu_num_cores", str(args.num_cores)]
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)