drafting Jekyll webpage (#143)

Shaden Smith
2020-03-17 13:49:48 -07:00
committed by GitHub
parent d6bc44bfad
commit 5042dc0085
37 changed files with 1456 additions and 1050 deletions

7
.gitignore vendored

@ -9,3 +9,10 @@ build/
dist/
fused_lamb_*.so
deepspeed.egg-info/
# Website
docs/_site/
docs/code-docs/build
.sass-cache/
.jekyll-cache/
.jekyll-metadata

25
docs/404.html Normal file

@ -0,0 +1,25 @@
---
permalink: /404.html
layout: default
---
<style type="text/css" media="screen">
.container {
margin: 10px auto;
max-width: 600px;
text-align: center;
}
h1 {
margin: 30px 0;
font-size: 4em;
line-height: 1;
letter-spacing: -1px;
}
</style>
<div class="container">
<h1>404</h1>
<p><strong>Page not found :(</strong></p>
<p>The requested page could not be found.</p>
</div>


@ -1 +0,0 @@
www.deepspeed.ai

22
docs/Gemfile Normal file

@ -0,0 +1,22 @@
source "https://rubygems.org"
gem 'github-pages', group: :jekyll_plugins
# If you have any plugins, put them here!
group :jekyll_plugins do
gem "jekyll-feed"
gem "jekyll-paginate"
gem "jekyll-remote-theme"
gem "jekyll-include-cache"
gem "minimal-mistakes-jekyll"
end
# Windows and JRuby does not include zoneinfo files, so bundle the tzinfo-data gem
# and associated library.
install_if -> { RUBY_PLATFORM =~ %r!mingw|mswin|java! } do
gem "tzinfo", "~> 1.2"
gem "tzinfo-data"
end
# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.1", :install_if => Gem.win_platform?

268
docs/Gemfile.lock Normal file

@ -0,0 +1,268 @@
GEM
remote: https://rubygems.org/
specs:
activesupport (6.0.2.1)
concurrent-ruby (~> 1.0, >= 1.0.2)
i18n (>= 0.7, < 2)
minitest (~> 5.1)
tzinfo (~> 1.1)
zeitwerk (~> 2.2)
addressable (2.7.0)
public_suffix (>= 2.0.2, < 5.0)
coffee-script (2.4.1)
coffee-script-source
execjs
coffee-script-source (1.11.1)
colorator (1.1.0)
commonmarker (0.17.13)
ruby-enum (~> 0.5)
concurrent-ruby (1.1.6)
dnsruby (1.61.3)
addressable (~> 2.5)
em-websocket (0.5.1)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0.6.0)
ethon (0.12.0)
ffi (>= 1.3.0)
eventmachine (1.2.7)
execjs (2.7.0)
faraday (1.0.0)
multipart-post (>= 1.2, < 3)
ffi (1.12.2)
forwardable-extended (2.6.0)
gemoji (3.0.1)
github-pages (204)
github-pages-health-check (= 1.16.1)
jekyll (= 3.8.5)
jekyll-avatar (= 0.7.0)
jekyll-coffeescript (= 1.1.1)
jekyll-commonmark-ghpages (= 0.1.6)
jekyll-default-layout (= 0.1.4)
jekyll-feed (= 0.13.0)
jekyll-gist (= 1.5.0)
jekyll-github-metadata (= 2.13.0)
jekyll-mentions (= 1.5.1)
jekyll-optional-front-matter (= 0.3.2)
jekyll-paginate (= 1.1.0)
jekyll-readme-index (= 0.3.0)
jekyll-redirect-from (= 0.15.0)
jekyll-relative-links (= 0.6.1)
jekyll-remote-theme (= 0.4.1)
jekyll-sass-converter (= 1.5.2)
jekyll-seo-tag (= 2.6.1)
jekyll-sitemap (= 1.4.0)
jekyll-swiss (= 1.0.0)
jekyll-theme-architect (= 0.1.1)
jekyll-theme-cayman (= 0.1.1)
jekyll-theme-dinky (= 0.1.1)
jekyll-theme-hacker (= 0.1.1)
jekyll-theme-leap-day (= 0.1.1)
jekyll-theme-merlot (= 0.1.1)
jekyll-theme-midnight (= 0.1.1)
jekyll-theme-minimal (= 0.1.1)
jekyll-theme-modernist (= 0.1.1)
jekyll-theme-primer (= 0.5.4)
jekyll-theme-slate (= 0.1.1)
jekyll-theme-tactile (= 0.1.1)
jekyll-theme-time-machine (= 0.1.1)
jekyll-titles-from-headings (= 0.5.3)
jemoji (= 0.11.1)
kramdown (= 1.17.0)
liquid (= 4.0.3)
mercenary (~> 0.3)
minima (= 2.5.1)
nokogiri (>= 1.10.4, < 2.0)
rouge (= 3.13.0)
terminal-table (~> 1.4)
github-pages-health-check (1.16.1)
addressable (~> 2.3)
dnsruby (~> 1.60)
octokit (~> 4.0)
public_suffix (~> 3.0)
typhoeus (~> 1.3)
html-pipeline (2.12.3)
activesupport (>= 2)
nokogiri (>= 1.4)
http_parser.rb (0.6.0)
i18n (0.9.5)
concurrent-ruby (~> 1.0)
jekyll (3.8.5)
addressable (~> 2.4)
colorator (~> 1.0)
em-websocket (~> 0.5)
i18n (~> 0.7)
jekyll-sass-converter (~> 1.0)
jekyll-watch (~> 2.0)
kramdown (~> 1.14)
liquid (~> 4.0)
mercenary (~> 0.3.3)
pathutil (~> 0.9)
rouge (>= 1.7, < 4)
safe_yaml (~> 1.0)
jekyll-avatar (0.7.0)
jekyll (>= 3.0, < 5.0)
jekyll-coffeescript (1.1.1)
coffee-script (~> 2.2)
coffee-script-source (~> 1.11.1)
jekyll-commonmark (1.3.1)
commonmarker (~> 0.14)
jekyll (>= 3.7, < 5.0)
jekyll-commonmark-ghpages (0.1.6)
commonmarker (~> 0.17.6)
jekyll-commonmark (~> 1.2)
rouge (>= 2.0, < 4.0)
jekyll-default-layout (0.1.4)
jekyll (~> 3.0)
jekyll-feed (0.13.0)
jekyll (>= 3.7, < 5.0)
jekyll-gist (1.5.0)
octokit (~> 4.2)
jekyll-github-metadata (2.13.0)
jekyll (>= 3.4, < 5.0)
octokit (~> 4.0, != 4.4.0)
jekyll-include-cache (0.2.0)
jekyll (>= 3.7, < 5.0)
jekyll-mentions (1.5.1)
html-pipeline (~> 2.3)
jekyll (>= 3.7, < 5.0)
jekyll-optional-front-matter (0.3.2)
jekyll (>= 3.0, < 5.0)
jekyll-paginate (1.1.0)
jekyll-readme-index (0.3.0)
jekyll (>= 3.0, < 5.0)
jekyll-redirect-from (0.15.0)
jekyll (>= 3.3, < 5.0)
jekyll-relative-links (0.6.1)
jekyll (>= 3.3, < 5.0)
jekyll-remote-theme (0.4.1)
addressable (~> 2.0)
jekyll (>= 3.5, < 5.0)
rubyzip (>= 1.3.0)
jekyll-sass-converter (1.5.2)
sass (~> 3.4)
jekyll-seo-tag (2.6.1)
jekyll (>= 3.3, < 5.0)
jekyll-sitemap (1.4.0)
jekyll (>= 3.7, < 5.0)
jekyll-swiss (1.0.0)
jekyll-theme-architect (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-cayman (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-dinky (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-hacker (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-leap-day (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-merlot (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-midnight (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-minimal (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-modernist (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-primer (0.5.4)
jekyll (> 3.5, < 5.0)
jekyll-github-metadata (~> 2.9)
jekyll-seo-tag (~> 2.0)
jekyll-theme-slate (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-tactile (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-theme-time-machine (0.1.1)
jekyll (~> 3.5)
jekyll-seo-tag (~> 2.0)
jekyll-titles-from-headings (0.5.3)
jekyll (>= 3.3, < 5.0)
jekyll-watch (2.2.1)
listen (~> 3.0)
jemoji (0.11.1)
gemoji (~> 3.0)
html-pipeline (~> 2.2)
jekyll (>= 3.0, < 5.0)
kramdown (1.17.0)
liquid (4.0.3)
listen (3.2.1)
rb-fsevent (~> 0.10, >= 0.10.3)
rb-inotify (~> 0.9, >= 0.9.10)
mercenary (0.3.6)
mini_portile2 (2.4.0)
minima (2.5.1)
jekyll (>= 3.5, < 5.0)
jekyll-feed (~> 0.9)
jekyll-seo-tag (~> 2.1)
minimal-mistakes-jekyll (4.19.1)
jekyll (>= 3.7, < 5.0)
jekyll-feed (~> 0.1)
jekyll-gist (~> 1.5)
jekyll-include-cache (~> 0.1)
jekyll-paginate (~> 1.1)
jekyll-sitemap (~> 1.3)
minitest (5.14.0)
multipart-post (2.1.1)
nokogiri (1.10.9)
mini_portile2 (~> 2.4.0)
octokit (4.17.0)
faraday (>= 0.9)
sawyer (~> 0.8.0, >= 0.5.3)
pathutil (0.16.2)
forwardable-extended (~> 2.6)
public_suffix (3.1.1)
rb-fsevent (0.10.3)
rb-inotify (0.10.1)
ffi (~> 1.0)
rouge (3.13.0)
ruby-enum (0.7.2)
i18n
rubyzip (2.3.0)
safe_yaml (1.0.5)
sass (3.7.4)
sass-listen (~> 4.0.0)
sass-listen (4.0.0)
rb-fsevent (~> 0.9, >= 0.9.4)
rb-inotify (~> 0.9, >= 0.9.7)
sawyer (0.8.2)
addressable (>= 2.3.5)
faraday (> 0.8, < 2.0)
terminal-table (1.8.0)
unicode-display_width (~> 1.1, >= 1.1.1)
thread_safe (0.3.6)
typhoeus (1.3.1)
ethon (>= 0.9.0)
tzinfo (1.2.6)
thread_safe (~> 0.1)
tzinfo-data (1.2019.3)
tzinfo (>= 1.0.0)
unicode-display_width (1.7.0)
wdm (0.1.1)
zeitwerk (2.3.0)
PLATFORMS
ruby
DEPENDENCIES
github-pages
jekyll-feed
jekyll-include-cache
jekyll-paginate
jekyll-remote-theme
minimal-mistakes-jekyll
tzinfo (~> 1.2)
tzinfo-data
wdm (~> 0.1.1)
BUNDLED WITH
2.1.4

56
docs/_config.yml Normal file

@ -0,0 +1,56 @@
title: DeepSpeed
email: deepspeed@microsoft.com
description: >-
DeepSpeed is a deep learning optimization library that makes distributed
training easy, efficient, and effective.
locale : "en-US"
repository: microsoft/DeepSpeed
baseurl: "/" # the subpath of your site, e.g. /blog
url: "https://www.deepspeed.ai" # the base hostname & protocol for your site, e.g. http://example.com
# Build settings
remote_theme: "mmistakes/minimal-mistakes@4.19.0"
minimal_mistakes_skin : "air"
plugins:
- jekyll-feed
- jekyll-include-cache
- jekyll-paginate
#paginate: 10
#paginate_path: /blog/page:num
include: ["_pages"]
exclude: ["code-docs"]
collections:
tutorials:
output: true
permalink: /:collection/:path/
defaults:
- scope:
path: ""
type: posts
values:
layout: single
author_profile: false
read_time: true
comments: false
share: true
related: false
# _tutorials
- scope:
path: ""
type: tutorials
values:
layout: single
toc: true
toc_label: "Contents"
sidebar:
nav: "lnav"
timezone: America/Los_Angeles
breadcrumbs: true

21
docs/_data/navigation.yml Normal file

@ -0,0 +1,21 @@
main:
- title: "Getting Started"
url: /getting-started/
- title: "Blog"
url: /blog/
- title: "Tutorials"
url: /tutorials/
- title: "Documentation"
url: https://ghpages-test.readthedocs.io/
- title: "GitHub"
url: https://github.com/microsoft/DeepSpeed
lnav:
- title: "This is a floating nav bar."
- title: "Getting Started"
url: /getting-started/
children:
- title: "Installation"
url: /getting-started/#installation
- title: "Configuration"
url: /getting-started/#deepspeed-configuration


@ -0,0 +1,6 @@
---
title: "Tutorials"
layout: collection
collection: tutorials
permalink: /tutorials/
---


@ -0,0 +1,8 @@
---
layout: single
title: "ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters"
date: 2020-02-13
link: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
excerpt: "Developed by Microsoft AI & Research."
categories: news
---


@ -0,0 +1,8 @@
---
layout: single
title: "Turing-NLG: A 17-billion-parameter language model by Microsoft"
date: 2020-02-13
link: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
excerpt: "DeepSpeed was used to train the world's largest language model."
categories: news
---


@ -0,0 +1,12 @@
---
title: "ZeRO stage 1 with reduced communication"
date: 2020-03-13
excerpt: "Partition-aware ZeRO with up to 2x reduction in communication time!"
---
# ZeRO stage 1 with reduced communication
* Partition-aware approach replaces the initial implementation, which used a global collective (all-reduce)
* Total communication volume reduced from 1.5x to 1x of data parallelism
* Up to 2x reduction in communication time compared to all-reduce
# Further updates coming soon!


@ -0,0 +1,12 @@
---
title: "ZeRO stage 2"
date: 2020-03-13
excerpt: "Reduce memory footprint to enable training 10B models without model parallelism!"
---
# ZeRO Stage 2
* Reduce memory footprint of gradients
* Train larger models: e.g., 10B parameters on 32 GPUs without model parallelism
* Train larger batch sizes
# Further updates coming soon!


@ -1,4 +1,7 @@
# Tutorial: CIFAR-10 with DeepSpeed
---
title: "CIFAR-10 Tutorial"
excerpt: "Train your first model with DeepSpeed!"
---
If you haven't already, we advise you to first read through the [Getting
Started](../../README.md#getting-started) guide before stepping through this
@ -10,22 +13,22 @@ First we will go over how to run original CIFAR-10. Then we will proceed step-by
## 1 Running Original CIFAR-10
## Running Original CIFAR-10
The original model code is from the [CIFAR-10 Tutorial](https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/cifar10_tutorial.py). We have copied this code under [DeepSpeedExamples/cifar/](https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar) and made it available as a submodule. To download, execute:
```
```bash
git submodule update --init --recursive
```
To install requirements for CIFAR-10:
```
```bash
cd DeepSpeedExamples/cifar
pip install -r requirements.txt
```
Run `python cifar10_tutorial.py`; it downloads the training data set on the first run.
```less
```
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170500096it [00:02, 61124868.24it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
@ -63,10 +66,10 @@ cuda:0
## 2 Enabling DeepSpeed
## Enabling DeepSpeed
### 2.1 Argument Parsing
### Argument Parsing
The first step to apply DeepSpeed is adding DeepSpeed arguments to the CIFAR-10 model, using the `deepspeed.add_config_arguments()` function as shown below.
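A minimal sketch of this step, assuming a standard `argparse` setup for the tutorial (arguments such as `--local_rank` and `--epochs` are illustrative placeholders and may differ from the exact example code):
```python
import argparse

import deepspeed


def add_argument():
    # Illustrative argparse setup; only add_config_arguments() is DeepSpeed-specific.
    parser = argparse.ArgumentParser(description='CIFAR')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from the distributed launcher')
    parser.add_argument('-e', '--epochs', default=30, type=int,
                        help='number of total epochs to run')

    # Add DeepSpeed's configuration arguments (e.g. --deepspeed_config) to the parser.
    parser = deepspeed.add_config_arguments(parser)

    return parser.parse_args()
```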
@ -103,7 +106,7 @@ The first step to apply DeepSpeed is adding DeepSpeed arguments to CIFAR-10 mode
### 2.2 Initialization
### Initialization
We use `deepspeed.initialize` to create `model_engine`, `optimizer` and `trainloader`. Below is its definition.
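A minimal sketch of this call, assuming the tutorial's network `net` and the CIFAR-10 `trainset` have already been constructed (the keyword arguments follow the `deepspeed.initialize` API described in the Getting Started guide; the exact call in the example code may differ):
```python
# deepspeed.initialize returns (engine, optimizer, data loader, lr scheduler);
# passing training_data lets DeepSpeed build the distributed trainloader.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=net.parameters(),
    training_data=trainset)
```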
@ -144,27 +147,28 @@ The original device and optimizer can be removed after initializing DeepSpeed.
### 2.3 Training API
### Training API
The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_ that we will use to train the model using the forward, backward and step API.
```python
```python
for i, data in enumerate(trainloader):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
inputs = data[0].to(model_engine.device)
labels = data[1].to(model_engine.device)
outputs = model_engine(inputs)
loss = criterion(outputs, labels)
model_engine.backward(loss)
model_engine.step()
```
```
Zeroing the gradients is handled automatically by DeepSpeed after the weights have been updated using a mini-batch.
### 2.4 Configuration
### Configuration
The next step to use DeepSpeed is to create a configuration JSON file (`ds_config.json`). This file provides DeepSpeed-specific parameters defined by the user, e.g., batch size, optimizer, scheduler, and other parameters.
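A minimal `ds_config.json` sketch, composed only of parameters documented in this repository (the particular values below are illustrative):
```json
{
  "train_batch_size": 4,
  "steps_per_print": 2000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "wall_clock_breakdown": false
}
```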
@ -198,20 +202,17 @@ The next step to use DeepSpeed is to create a configuration JSON file (ds_config
### 2.6 Run CIFAR-10 Model with DeepSpeed Enabled
### Run CIFAR-10 Model with DeepSpeed Enabled
To start training the CIFAR-10 model with DeepSpeed applied, execute the following command; it will use all detected GPUs by default.
```bash
deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json
```
DeepSpeed usually prints more training details for the user to monitor, including training settings, performance statistics, and loss trends.
```less
deepspeed.pt --num_nodes 1 --num_gpus 1 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
```
deepspeed.pt cifar10_deepspeed.py --deepspeed_config ds_config.json
Warning: Permanently added '[192.168.0.22]:42227' (ECDSA) to the list of known hosts.
cmd=['pdsh', '-w', 'worker-0', 'export NCCL_VERSION=2.4.2; ', 'cd /data/users/deepscale/test/ds_v2/examples/cifar;', '/usr/bin/python', '-u', '-m', 'deepspeed.pt.deepspeed_launch', '--world_info=eyJ3b3JrZXItMCI6IFswXX0=', '--node_rank=%n', '--master_addr=192.168.0.22', '--master_port=29500', 'cifar10_deepspeed.py', '--deepspeed', '--deepspeed_config', 'ds_config.json']
worker-0: Warning: Permanently added '[192.168.0.22]:42227' (ECDSA) to the list of known hosts.


@ -0,0 +1,243 @@
---
title: "Getting Started"
permalink: /getting-started/
excerpt: "First steps with DeepSpeed"
---
## Installation
* Please see our [Azure tutorial](docs/azure.md) to get started with DeepSpeed on Azure!
* If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
* If you want to install DeepSpeed manually, we provide an install script [install.sh](install.sh) to help install on a local machine or across an entire cluster.
## Writing DeepSpeed Models
DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
can wrap any arbitrary model of type `torch.nn.Module` and has a minimal set of APIs
for training and checkpointing the model. Please see the tutorials for detailed
examples.
To initialize the DeepSpeed engine:
```python
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
model=model,
model_parameters=params)
```
`deepspeed.initialize` ensures that all of the necessary setup required for
distributed data parallel or mixed precision training is done
appropriately under the hood. In addition to wrapping the model, DeepSpeed can
construct and manage the training optimizer, data loader, and the learning rate
scheduler based on the parameters passed to `deepspeed.initialize` and the
DeepSpeed [configuration file](#deepspeed-configuration).
### Training
Once the DeepSpeed engine has been initialized, it can be used to train the
model using three simple APIs for forward propagation (`()`), backward
propagation (`backward`), and weight updates (`step`).
```python
for step, batch in enumerate(data_loader):
#forward() method
loss = model_engine(batch)
#runs backpropagation
model_engine.backward(loss)
#weight update
model_engine.step()
```
Under the hood, DeepSpeed automatically performs the necessary operations
required for distributed data parallel training, in mixed precision, with a
pre-defined learning rate schedule:
* **Gradient Averaging**: in distributed data parallel training, `backward`
ensures that gradients are averaged across data parallel processes after
training on a `train_batch_size`.
* **Loss Scaling**: in FP16/mixed precision training, the DeepSpeed
engine automatically handles scaling the loss to avoid precision loss in the
gradients.
* **Learning Rate Schedule**: if using DeepSpeed's learning rate
schedule, then DeepSpeed automatically handles any updates to the learning
rate when `step` is executed, as shown in the sample `scheduler` configuration below.
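For reference, a learning rate schedule is selected through the `scheduler` block of the DeepSpeed configuration JSON; the sketch below reuses the `WarmupLR` example from the configuration documentation:
```json
"scheduler": {
  "type": "WarmupLR",
  "params": {
    "warmup_min_lr": 0,
    "warmup_max_lr": 0.001,
    "warmup_num_steps": 1000
  }
}
```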
### Model Checkpointing
Saving and loading the training state is handled via the `save_checkpoint` and
`load_checkpoint` APIs in DeepSpeed, which take two arguments to uniquely
identify a checkpoint:
* `ckpt_dir`: the directory where checkpoints will be saved.
* `ckpt_id`: an identifier that uniquely identifies a checkpoint in the directory.
In the following code snippet, we use the loss value as the checkpoint identifier.
```python
#load checkpoint
_, client_sd = model_engine.load_checkpoint(args.load_dir, args.ckpt_id)
step = client_sd['step']
#advance data loader to ckpt step
dataloader_to_step(data_loader, step + 1)
for step, batch in enumerate(data_loader):
#forward() method
loss = model_engine(batch)
#runs backpropagation
model_engine.backward(loss)
#weight update
model_engine.step()
#save checkpoint
if step % args.save_interval == 0:
client_sd['step'] = step
ckpt_id = loss.item()
model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd = client_sd)
```
DeepSpeed can automatically save and restore the model, optimizer, and the
learning rate scheduler states while hiding away these details from the user.
However, the user may want to save additional data that are unique to a given
model training. To support these items, `save_checkpoint`
accepts a client state dictionary `client_sd` for saving. These items can be
retrieved from `load_checkpoint` as a return argument. In the example above,
the `step` value is stored as part of the `client_sd`.
## DeepSpeed Configuration
DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
is shown below. For a full set of features see [core API
doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html).
```json
{
"train_batch_size": 8,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015
}
},
"fp16": {
"enabled": true
},
"zero_optimization": true
}
```
## Multi-Node Environment Variables
When training across multiple nodes we have found it useful to support
propagating user-defined environment variables. By default DeepSpeed will
propagate all NCCL and PYTHON related environment variables that are set. If
you would like to propagate additional variables you can specify them in a
dot-file named `.deepspeed_env` that contains a new-line separated list of
`VAR=VAL` entries. The DeepSpeed launcher will look in the local path you are
executing from and also in your home directory (`~/`).
As a concrete example, some clusters require special NCCL variables to be set
prior to training. The user can simply add these variables to a
`.deepspeed_env` file in their home directory that looks like this:
```
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
```
DeepSpeed will then make sure that these environment variables are set when
launching each process on every node across their training job.
# Launching DeepSpeed Training
DeepSpeed installs the entry point `deepspeed` to launch distributed training.
We illustrate an example usage of DeepSpeed with the following assumptions:
1. You have already integrated DeepSpeed into your model
2. `client_entry.py` is the entry script for your model
3. `client args` is the `argparse` command line arguments
4. `ds_config.json` is the configuration file for DeepSpeed
## Resource Configuration (multi-node)
DeepSpeed configures multi-node compute resources with hostfiles that are compatible with
[OpenMPI](https://www.open-mpi.org/) and [Horovod](https://github.com/horovod/horovod).
A hostfile is a list of *hostnames* (or SSH aliases), which are machines accessible via passwordless
SSH, and *slot counts*, which specify the number of GPUs available on the system. For
example,
```
worker-1 slots=4
worker-2 slots=4
```
specifies that two machines named *worker-1* and *worker-2* each have four GPUs to use
for training.
Hostfiles are specified with the `--hostfile` command line option. If no hostfile is
specified, DeepSpeed searches for `/job/hostfile`. If no hostfile is specified or found,
DeepSpeed queries the number of GPUs on the local machine to discover the number of local
slots available.
The following command launches a PyTorch training job across all available nodes and GPUs
specified in `myhostfile`:
```bash
deepspeed <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile
```
Alternatively, DeepSpeed allows you to restrict distributed training of your model to a
subset of the available nodes and GPUs. This feature is enabled through two command line
arguments: `--num_nodes` and `--num_gpus`. For example, distributed training can be
restricted to use only two nodes with the following command:
```bash
deepspeed --num_nodes=2 \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
You can instead include or exclude specific resources using the `--include` and
`--exclude` flags. For example, to use all available resources **except** GPU 0 on node
*worker-2* and GPUs 0 and 1 on *worker-3*:
```bash
deepspeed --exclude="worker-2:0@worker-3:0,1" \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
Similarly, you can use **only** GPUs 0 and 1 on *worker-2*:
```bash
deepspeed --include="worker-2:0,1" \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
### MPI Compatibility
As described above, DeepSpeed provides its own parallel launcher to help launch
multi-node/multi-gpu training jobs. If you prefer to launch your training job
using MPI (e.g., mpirun), we provide support for this. It should be noted that
DeepSpeed will still use the torch distributed NCCL backend and *not* the MPI
backend. To launch your training job with mpirun + DeepSpeed you simply pass us
an additional flag `--deepspeed_mpi`. DeepSpeed will then use
[mpi4py](https://pypi.org/project/mpi4py/) to discover the MPI environment (e.g.,
rank, world size) and properly initialize torch distributed for training. In this
case you will explicitly invoke `python` to launch your model script instead of using
the `deepspeed` launcher. Here is an example:
```bash
mpirun <mpi-args> python \
<client_entry.py> <client args> \
--deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
```
If you want to use this feature of DeepSpeed, please ensure that mpi4py is
installed via `pip install mpi4py`.
## Resource Configuration (single-node)
In the case that we are only running on a single node (with one or more GPUs)
DeepSpeed *does not* require a hostfile as described above. If a hostfile is
not detected or passed in then DeepSpeed will query the number of GPUs on the
local machine to discover the number of slots available. The `--include` and
`--exclude` arguments work as normal, but the user should specify 'localhost'
as the hostname.
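For example, to train on only GPUs 0 and 1 of the local machine (a sketch following the flag usage shown above):
```bash
deepspeed --include="localhost:0,1" \
    <client_entry.py> <client args> \
    --deepspeed --deepspeed_config ds_config.json
```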



@ -1,129 +0,0 @@
# DeepSpeed with Azure
This tutorial will help you get started running DeepSpeed on [Azure virtual
machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
Looking forward, we will be integrating these techniques and additional enhancements
into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) platform to
benefit all your large model training jobs.
If you don't already have an Azure account please see more details here: [https://azure.microsoft.com/](https://azure.microsoft.com/).
To help with launching Azure instances we suggest using the [Azure
CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
several helper scripts to get you quickly started using DeepSpeed with Azure.
* Install Azure CLI on your local box: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
* Alternatively you can use the Azure in-browser shell: https://shell.azure.com/
## Create an SSH key
Generate an SSH key that will be used across this tutorial to SSH into your VMs and
between Docker containers. `ssh-keygen` is the recommended way of doing this. Our scripts
assume your key is located inside the same directory as the Azure scripts.
## Azure Config JSON
Our helper scripts depend on the following configuration JSON for deployment
and setup. We have provided a simple example JSON in `azure_config.json` that
sets up a basic environment with two VMs. This config uses the NV6_Promo
instance type which has one NVIDIA Tesla M60 GPU per VM. You can read more
details about the VM on the [Linux Virtual Machines
Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
page.
See the example below:
```json
{
"num_vms": 2,
"location": "southcentralus",
"azure_sku": "Standard_NV6_Promo",
"ssh_private_key": "id_rsa",
"docker_ssh_port": 2222
}
```
## Dependencies
The scripts in this tutorial require [jq](https://stedolan.github.io/jq/) to help with
parsing JSON from the command line. Also it is recommended to install
[pdsh](https://linux.die.net/man/1/pdsh) to help launch ssh connections in parallel.
## Create Azure VMs
We first need to allocate the VMs. We provide a script
```bash
./create_vms.sh
```
to create VMs with the Azure SKU in the region specified in `azure_config.json`. Feel
free to customize your JSON to your desired region/SKU. This step will take a few minutes
to complete while it sets up all of your VMs on Azure.
## Setup VM environment to use DeepSpeed
Next, we need to configure the VM environment for DeepSpeed. We provide a script
```bash
./setup_vms.sh
```
to generate a [hostfile](../README.md#resource-configuration) and SSH
configuration on all of the VMs. This configuration will be used by the DeepSpeed
Docker containers in the next step.
## Start the DeepSpeed docker container
We now set up the DeepSpeed Docker containers on the VMs. We provide a script
```bash
./setup_docker.sh
```
to pull the DeepSpeed image onto all VMs and start a container instance in the
background. This will take several minutes since it needs to pull the entire Docker
image.
## Access VMs
The tool [azure_ssh.sh](azure_ssh.sh) will let you SSH into any of the VMs with this
syntax:
```bash
./azure_ssh.sh <node-id> [command]
```
where the `node-id` is a number between `0` and `num_vms-1`. This script will find the
public IP address of your VM and use the SSH key provided in the Azure configuration
JSON.
## Access DeepSpeed container
Everything should be up and running at this point. Let's access the running DeepSpeed
container on the first VM and make sure we can talk to the other containers in our deployment.
* SSH into the first VM via: `./azure_ssh.sh 0`
* Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
* Attach the running docker container via: `./attach.sh`
* You should now be able to `ssh` into any other docker container; the containers can be
accessed via their SSH alias of `worker-N`, where `N` is the VM number between `0`
and `num_vms-1`. In this example we should be able to successfully run `ssh worker-1
hostname` which will return the hostname of worker-1.
## Parallel SSH across containers
DeepSpeed comes installed with a helper script `ds_ssh` which is a wrapper around
the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you issue commands
to groups of hosts (via SSH) in parallel. This wrapper simply connects with the
hostfile that defines all the containers in your deployment. For example if you run
`ds_ssh hostname` you should see a list of all the hostnames in your deployment.
## Run CIFAR-10 example model
We will now run the DeepSpeed CIFAR-10 model example to test the VM setup. From inside
the first DeepSpeed container:
1) Install the python dependencies necessary to run the CIFAR-10 example model. You can
do this across your cluster via:
```bash
ds_ssh pip install -r ~/workdir/DeepSpeed/DeepSpeedExamples/cifar/requirements.txt
```
2) Now change directories to the CIFAR example:
```bash
cd ~/workdir/DeepSpeed/DeepSpeedExamples/cifar
```
3) Finally, launch training across all VMs:
```bash
deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
```
## Megatron-LM GPT2
DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full
[Megatron tutorial](../docs/tutorials/MegatronGPT2Tutorial.md) for more details.
* In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of
Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and
a batch size of 1536 you should be able to complete 100k training steps (153.6 million
samples) in less than 2 weeks of training.

3
docs/blog/index.html Normal file

@ -0,0 +1,3 @@
---
layout: home
---

20
docs/code-docs/Makefile Normal file

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@ -0,0 +1,4 @@
#!/bin/bash
sphinx-apidoc -f -o source ../../deepspeed
make html


@ -0,0 +1,3 @@
sphinx
recommonmark
sphinx-rtd-theme


@ -0,0 +1,3 @@
tqdm
psutil
tensorboardX==1.8


@ -0,0 +1,93 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
# -- Project information -----------------------------------------------------
project = 'DeepSpeed'
copyright = '2020, Microsoft AI & Research'
author = 'Microsoft AI & Research'
# The full version, including alpha/beta/rc tags
release = '0.1.0'
master_doc = 'index'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.napoleon',
'recommonmark',
'sphinx_rtd_theme',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# GitHub integration
html_context = {
"display_github": True,
"github_user": "microsoft",
"github_repo": "DeepSpeed",
"github_version": "master",
"conf_py_path": "/docs/code-docs/source/",
}
# Mock imports so we don't have to install torch to build the docs.
from unittest.mock import MagicMock
sys.path.insert(0, os.path.abspath('../../../'))
class Mock(MagicMock):
@classmethod
def __getattr__(cls, name):
return MagicMock()
MOCK_MODULES = [
'torch',
'torch.utils',
'torch.utils.data',
'torch.utils.data.distributed',
'torch._utils',
'torch.cuda',
'torch.nn.modules',
'torch.nn',
'torch.distributed',
'torch.distributed.distributed_c10d',
'torch.optim',
'torch._six'
]
sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)


@ -0,0 +1,134 @@
deepspeed.pt package
====================
Submodules
----------
deepspeed.pt.deepspeed\_config module
-------------------------------------
.. automodule:: deepspeed.pt.deepspeed_config
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_constants module
----------------------------------------
.. automodule:: deepspeed.pt.deepspeed_constants
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_csr\_tensor module
------------------------------------------
.. automodule:: deepspeed.pt.deepspeed_csr_tensor
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_dataloader module
-----------------------------------------
.. automodule:: deepspeed.pt.deepspeed_dataloader
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_fused\_lamb module
------------------------------------------
.. automodule:: deepspeed.pt.deepspeed_fused_lamb
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_launch module
-------------------------------------
.. automodule:: deepspeed.pt.deepspeed_launch
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_light module
------------------------------------
.. automodule:: deepspeed.pt.deepspeed_light
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_lr\_schedules module
--------------------------------------------
.. automodule:: deepspeed.pt.deepspeed_lr_schedules
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_run module
----------------------------------
.. automodule:: deepspeed.pt.deepspeed_run
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_timer module
------------------------------------
.. automodule:: deepspeed.pt.deepspeed_timer
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_utils module
------------------------------------
.. automodule:: deepspeed.pt.deepspeed_utils
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.deepspeed\_zero\_optimizer module
----------------------------------------------
.. automodule:: deepspeed.pt.deepspeed_zero_optimizer
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.fp16\_optimizer module
-----------------------------------
.. automodule:: deepspeed.pt.fp16_optimizer
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.fp16\_unfused\_optimizer module
--------------------------------------------
.. automodule:: deepspeed.pt.fp16_unfused_optimizer
:members:
:undoc-members:
:show-inheritance:
deepspeed.pt.loss\_scaler module
--------------------------------
.. automodule:: deepspeed.pt.loss_scaler
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: deepspeed.pt
:members:
:undoc-members:
:show-inheritance:


@ -0,0 +1,17 @@
deepspeed package
=================
Subpackages
-----------
.. toctree::
deepspeed.pt
Module contents
---------------
.. automodule:: deepspeed
:members:
:undoc-members:
:show-inheritance:


@ -0,0 +1,17 @@
DeepSpeed
=========
.. toctree::
:maxdepth: 2
:caption: Contents:
modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


@ -0,0 +1,7 @@
deepspeed
=========
.. toctree::
:maxdepth: 4
deepspeed


@ -1,191 +0,0 @@
# PyTorch DeepSpeed Config JSON Documentation
## REQUIRED DeepSpeed Config JSON Parameters
***train\_batch\_size***: [integer]
| Description | Example |
| ------------------------------------------------------------ | ------- |
| The effective training batch size. This is the number of data samples that leads to one step of model update. ***train\_batch\_size*** is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., ***train\_step\_batch\_size***), the gradient accumulation steps (a.k.a., ***gradient\_accumulation\_steps***), and the number of GPUs. | `32` |
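For example, a ***train\_batch\_size*** of `32` can be obtained with a per-GPU batch size of 4, ***gradient\_accumulation\_steps*** of 2, and 4 GPUs, since 4 x 2 x 4 = 32.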
## OPTIONAL DeepSpeed Config JSON Parameters
### Batch Size Related Parameters
***train\_micro\_batch\_size\_per\_gpu***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ---------------------------- |
| Batch size to be processed by one GPU in one step (without gradient accumulation). When specified, ***gradient\_accumulation\_steps*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***gradient\_accumulation\_steps*** in the configuration JSON. | ***train\_batch\_size*** value |
***gradient\_accumulation\_steps***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. When specified, ***train\_step\_batch\_size*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***train\_step\_batch\_size*** in the configuration JSON. | `1` |
### Optimizer Parameters
***optimizer***: [dictionary]
| Fields | Value | Example |
| ------ | ------------------------------------------------------------ | ------------------------------ |
| type | The optimizer name. DeepSpeed natively supports Adam and LAMB optimizers and will import other optimizers from [torch](https://pytorch.org/docs/stable/optim.html). | `"Adam"` |
| params | Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam)). | `{"lr": 0.001, "eps": 1e-8}` |
Example of ***optimizer***
```json
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
```
### Scheduler Parameters
***scheduler***: [dictionary]
| Fields | Value | Example |
| ------ | ------------------------------------------------------------ | ------------------------------ |
| type | The scheduler name. See [here](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/pt/deepspeed_lr_schedules.m.html) for a list of supported schedulers. | `"1Cycle"` |
| params | Dictionary of parameters to instantiate scheduler. The parameter names should match scheduler constructor signature. | `{"lr": 0.001, "eps": 1e-8}` |
Example of ***scheduler***
```json
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
```
### Communication options
***fp32\_allreduce***: [boolean]
| Description | Default |
| ------------------------------------ | ------- |
| During gradient averaging perform allreduce with 32 bit values | `false` |
***disable\_allgather***: [boolean]
| Description | Default |
| ---------------------------- | ------- |
| Disable allgather when using ZeRO optimizer and instead use broadcast | `false` |
***prescale\_gradients***: [boolean]
| Description | Default |
| -------------------------------------- | ------- |
| Scale gradients before doing allreduce | `false` |
***sparse\_gradients***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable sparse compression of [torch.nn.Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) gradients. | `false` |
### FP16 training options
***zero\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***fp16***: [dictionary]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Configuration for using mixed precision/FP16 training that leverages [NVIDIA's Apex package](https://nvidia.github.io/apex/). An example, including the available dictionary keys is illustrated below. | None |
```json
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
}
```
***fp16:enabled***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***enabled*** is a **fp16** parameter indicating whether or not FP16 training is enabled. | `false` |
***fp16:loss\_scale***: [float]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***loss\_scale*** is a ***fp16*** parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. | `0.0` |
***fp16:initial\_scale\_power***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***initial\_loss\_scale\_power*** is a **fp16** parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2<sup>***initial\_loss\_scale\_power***</sup>. | `32` |
***fp16:loss\_scale\_window***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***loss\_scale\_window*** is a **fp16** parameter representing the window over which to raise/lower the dynamic loss scale value. | `1000` |
***fp16:hysteresis***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***hysteresis*** is a **fp16** parameter representing the delay shift in dynamic loss scaling. | `2` |
***fp16:min\_loss\_scale***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***min\_loss\_scale*** is a **fp16** parameter representing the minimum dynamic loss scale value. | `1000` |
### Gradient Clipping
***gradient\_clipping***: [float]
| Description | Default |
| ----------------------------------- | ------- |
| Enable gradient clipping with value | `0` |
### Logging
***steps\_per\_print***: [integer]
| Description | Default |
| ----------- | ------- |
| Print train loss every N steps | `10` |
***wall\_clock\_breakdown***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable timing of the latency of forward/backward/update training phases | `false` |
***dump_state***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Print out state information of DeepSpeed object after initialization | `false` |


@ -1,4 +1,9 @@
# Feature Overview
---
title: "Feature Overview"
layout: single
toc: true
toc_label: "Contents"
---
* [Distributed Training with Mixed Precision](#distributed-training-with-mixed-precision)
* 16-bit mixed precision


440
docs/index.md Normal file

@ -0,0 +1,440 @@
---
layout: single
toc: true
toc_label: "Contents"
---
DeepSpeed is a deep learning optimization library that makes distributed training easy,
efficient, and effective.
<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>5x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>
DeepSpeed can train DL models with over a hundred billion parameters on the current
generation of GPU clusters, while achieving over 5x system performance
compared to the state of the art. Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up the state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Yet still, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
## Distributed, Effective, and Efficient Training with Ease
The DeepSpeed API is a lightweight wrapper on [PyTorch](https://pytorch.org/). This
means that you can use everything you love in PyTorch and without learning a new
platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art
training techniques, such as distributed training, mixed precision, gradient
accumulation, and checkpoints so that you can focus on your model development. Most
importantly, you can leverage the distinctive efficiency and effectiveness benefit of
DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch
models.
## Speed
DeepSpeed achieves high performance and fast convergence through a combination of
efficiency optimizations on compute/communication/memory/IO and effectiveness
optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains BERT-large to parity in 14 hours using 64 GPUs (4 DGX-2 boxes) and in
3.7 hours using 256 GPUs (16 DGX-2 boxes).
**BERT-large Training Times**
| Devices | Source | Training Time (hours) |
| ------------- | --------- | ---------------------:|
| 64 TPUs | Google | 96 |
| 64 V100 GPUs | DeepSpeed | **14** |
| 256 V100 GPUs | NVIDIA | 3.9 |
| 256 V100 GPUs | DeepSpeed | **3.7** |
<!---*Read more*: [BERT tutorial](../../Tutorials/bert_pretraining/deepspeed_bert_training.md)-->
*BERT Tutorial*: Coming Soon
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art NVIDIA
Megatron on Azure GPUs.
*Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md)
## Memory efficiency
DeepSpeed provides memory-efficient data parallelism and enables training models without
model parallelism. For example, DeepSpeed can train models with up to 6 billion parameters on
NVIDIA V100 GPUs with 32GB of device memory. In comparison, existing frameworks (e.g.,
PyTorch's Distributed Data Parallel) run out of memory with 1.5 billion parameter models.
DeepSpeed reduces the training memory footprint through a novel solution called Zero
Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are
replicated across data-parallel processes, ZeRO partitions model states to save
significant memory. The current implementation (stage 1 of ZeRO) reduces memory by up to
4x relative to the state of the art. You can read more about ZeRO in our [paper](https://arxiv.org/abs/1910.02054).
With this impressive memory reduction, early adopters of DeepSpeed have already
produced a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
## Scalability
DeepSpeed supports efficient data parallelism, model parallelism, and their
combination. ZeRO boosts the scaling capability and efficiency further.
* DeepSpeed provides system support to run models up to 100 billion parameters,
10x larger than the state of the art (8 billion NVIDIA GPT, 11 billion Google T5).
* DeepSpeed can run large models more efficiently, up to 6x faster for models with
various sizes spanning 1.5B to 100B. More specifically, the data parallelism powered by ZeRO
is complementary and can be combined with different types of model parallelism. It allows
DeepSpeed to fit models using a lower degree of model parallelism and a higher batch size, offering
significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054),
and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md).
<!-- and [QANet tutorial](../../Tutorials/QANetTutorial.md). -->
![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</em>
</p>
## Fast convergence for effectiveness
DeepSpeed supports advanced hyperparameter tuning and large batch size
optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md),
<!---
and *BERT Tutorial*: Coming Soon.
[BERT tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md),
[QANet tutorial](../../Tutorials/QANet/QANetTutorial.md)
-->
## Good Usability
Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to six billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.3 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA's Megatron-LM.
## Features
Below we provide a brief feature list, see our detailed [feature
overview](features) for descriptions and usage.
* [Distributed Training with Mixed Precision](features.md#distributed-training-with-mixed-precision)
* 16-bit mixed precision
* Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](features.md#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Memory and Bandwidth Optimizations](features.md#memory-and-bandwidth-optimizations)
* The Zero Redundancy Optimizer (ZeRO)
* Constant Buffer Optimization (CBO)
* Smart Gradient Accumulation
* [Training Features](features.md#training-features)
* Simplified training API
* Gradient Clipping
* Automatic loss scaling with mixed precision
* [Training Optimizers](features.md#training-optimizers)
* Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* [Training Agnostic Checkpointing](features.md#training-agnostic-checkpointing)
* [Advanced Parameter Search](features.md#advanced-parameter-search)
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](features.md#simplified-data-loader)
* [Performance Analysis and Debugging](features.md#performance-analysis-and-debugging)
# Getting Started
## Installation
* Please see our [Azure tutorial](docs/azure.md) to get started with DeepSpeed on Azure!
* If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
* If you want to install DeepSpeed manually, we provide an install script [install.sh](install.sh) to help install on a local machine or across an entire cluster.
## Writing DeepSpeed Models
DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
can wrap any arbitrary model of type `torch.nn.Module` and has a minimal set of APIs
for training and checkpointing the model. Please see the tutorials for detailed
examples.
To initialize the DeepSpeed engine:
```python
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
model=model,
model_parameters=params)
```
`deepspeed.initialize` ensures that all of the necessary setup required for
distributed data parallel or mixed precision training is done
appropriately under the hood. In addition to wrapping the model, DeepSpeed can
construct and manage the training optimizer, data loader, and the learning rate
scheduler based on the parameters passed to `deepspeed.initialize` and the
DeepSpeed [configuration file](#deepspeed-configuration).
### Training
Once the DeepSpeed engine has been initialized, it can be used to train the
model using three simple APIs for forward propagation (`()`), backward
propagation (`backward`), and weight updates (`step`).
```python
for step, batch in enumerate(data_loader):
#forward() method
loss = model_engine(batch)
#runs backpropagation
model_engine.backward(loss)
#weight update
model_engine.step()
```
Under the hood, DeepSpeed automatically performs the necessary operations
required for distributed data parallel training, in mixed precision, with a
pre-defined learning rate schedule:
* **Gradient Averaging**: in distributed data parallel training, `backward`
ensures that gradients are averaged across data parallel processes after
training on a `train_batch_size`.
* **Loss Scaling**: in FP16/mixed precision training, the DeepSpeed
engine automatically handles scaling the loss to avoid precision loss in the
gradients.
* **Learning Rate Schedule**: if using DeepSpeed's learning rate
schedule, then DeepSpeed automatically handles any updates to the learning
rate when `step` is executed.
### Model Checkpointing
Saving and loading the training state is handled via the `save_checkpoint` and
`load_checkpoint` APIs in DeepSpeed, which take two arguments to uniquely
identify a checkpoint:
* `ckpt_dir`: the directory where checkpoints will be saved.
* `ckpt_id`: an identifier that uniquely identifies a checkpoint in the directory.
In the following code snippet, we use the loss value as the checkpoint identifier.
```python
#load checkpoint
_, client_sd = model_engine.load_checkpoint(args.load_dir, args.ckpt_id)
step = client_sd['step']
#advance data loader to ckpt step
dataloader_to_step(data_loader, step + 1)
for step, batch in enumerate(data_loader):
#forward() method
loss = model_engine(batch)
#runs backpropagation
model_engine.backward(loss)
#weight update
model_engine.step()
#save checkpoint
if step % args.save_interval == 0:
client_sd['step'] = step
ckpt_id = loss.item()
model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd = client_sd)
```
DeepSpeed can automatically save and restore the model, optimizer, and the
learning rate scheduler states while hiding away these details from the user.
However, the user may want to save additional data that are unique to a given
model training. To support these items, `save_checkpoint` accepts a client
state dictionary `client_sd` for saving. These items can be retrieved from
`load_checkpoint` as a return argument. In the example above, the `step` value
is stored as part of the `client_sd`; a sketch of the `dataloader_to_step`
helper used in that example follows below.
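`dataloader_to_step` is not part of the DeepSpeed API; it is a user-side helper that
fast-forwards the data loader to the step recorded in the checkpoint. A minimal
sketch, assuming the data loader yields batches in a deterministic order, might look
like the following.
```python
def dataloader_to_step(data_loader, step):
    # Illustrative helper (not a DeepSpeed API): skip `step` batches so that
    # training resumes where the checkpoint left off. This assumes the data
    # loader is an iterator (or wraps one) with a deterministic batch order;
    # for a map-style PyTorch DataLoader you would typically use a resumable
    # sampler instead of skipping batches one by one.
    data_iter = iter(data_loader)
    for _ in range(step):
        next(data_iter)
```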
## DeepSpeed Configuration
DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
is shown below. For a full set of features see [core API
doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html).
```json
{
"train_batch_size": 8,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015
}
},
"fp16": {
"enabled": true
},
"zero_optimization": true
}
```
## Multi-Node Environment Variables
When training across multiple nodes we have found it useful to support
propagating user-defined environment variables. By default DeepSpeed will
propagate all NCCL and PYTHON related environment variables that are set. If
you would like to propagate additional variables you can specify them in a
dot-file named `.deepspeed_env` that contains a new-line separated list of
`VAR=VAL` entries. The DeepSpeed launcher will look in the local path you are
executing from and also in your home directory (`~/`).
As a concrete example, some clusters require special NCCL variables to be set
prior to training. The user can simply add these variables to a
`.deepspeed_env` file in their home directory that looks like this:
```
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
```
DeepSpeed will then make sure that these environment variables are set when
launching each process on every node across the training job.
# Launching DeepSpeed Training
DeepSpeed installs the entry point `deepspeed` to launch distributed training.
We illustrate an example usage of DeepSpeed with the following assumptions:
1. You have already integrated DeepSpeed into your model
2. `client_entry.py` is the entry script for your model
3. `client args` are the `argparse` command line arguments
4. `ds_config.json` is the configuration file for DeepSpeed
## Resource Configuration (multi-node)
DeepSpeed configures multi-node compute resources with hostfiles that are compatible with
[OpenMPI](https://www.open-mpi.org/) and [Horovod](https://github.com/horovod/horovod).
A hostfile is a list of *hostnames* (or SSH aliases), which are machines accessible via passwordless
SSH, and *slot counts*, which specify the number of GPUs available on the system. For
example,
```
worker-1 slots=4
worker-2 slots=4
```
specifies that two machines named *worker-1* and *worker-2* each have four GPUs to use
for training.
Hostfiles are specified with the `--hostfile` command line option. If no hostfile is
specified, DeepSpeed searches for `/job/hostfile`. If no hostfile is specified or found,
DeepSpeed queries the number of GPUs on the local machine to discover the number of local
slots available.
The following command launches a PyTorch training job across all available nodes and GPUs
specified in `myhostfile`:
```bash
deepspeed <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile
```
Alternatively, DeepSpeed allows you to restrict distributed training of your model to a
subset of the available nodes and GPUs. This feature is enabled through two command line
arguments: `--num_nodes` and `--num_gpus`. For example, distributed training can be
restricted to use only two nodes with the following command:
```bash
deepspeed --num_nodes=2 \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
You can instead include or exclude specific resources using the `--include` and
`--exclude` flags. For example, to use all available resources **except** GPU 0 on node
*worker-2* and GPUs 0 and 1 on *worker-3*:
```bash
deepspeed --exclude="worker-2:0@worker-3:0,1" \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
Similarly, you can use **only** GPUs 0 and 1 on *worker-2*:
```bash
deepspeed --include="worker-2:0,1" \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
### MPI Compatibility
As described above, DeepSpeed provides its own parallel launcher to help launch
multi-node/multi-gpu training jobs. If you prefer to launch your training job
using MPI (e.g., mpirun), we provide support for this. It should be noted that
DeepSpeed will still use the torch distributed NCCL backend and *not* the MPI
backend. To launch your training job with mpirun + DeepSpeed you simply pass us
an additional flag `--deepspeed_mpi`. DeepSpeed will then use
[mpi4py](https://pypi.org/project/mpi4py/) to discover the MPI environment (e.g.,
rank, world size) and properly initialize torch distributed for training. In this
case you will explicitly invoke `python` to launch your model script instead of using
the `deepspeed` launcher; for example:
```bash
mpirun <mpi-args> python \
<client_entry.py> <client args> \
--deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
```
If you want to use this feature of DeepSpeed, please ensure that mpi4py is
installed via `pip install mpi4py`.
## Resource Configuration (single-node)
In the case that we are only running on a single node (with one or more GPUs)
DeepSpeed *does not* require a hostfile as described above. If a hostfile is
not detected or passed in then DeepSpeed will query the number of GPUs on the
local machine to discover the number of slots available. The `--include` and
`--exclude` arguments work as normal, but the user should specify 'localhost'
as the hostname.
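For example, the following sketch (assuming the local machine has at least two GPUs)
restricts training to GPUs 0 and 1 of the local machine:
```bash
deepspeed --include="localhost:0,1" \
	<client_entry.py> <client args> \
	--deepspeed --deepspeed_config ds_config.json
```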
# Further Reading
| Article | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](features.md) | DeepSpeed features |
| [DeepSpeed JSON Configuration](config_json.md) | Configuring DeepSpeed |
| [API Documentation](/code-docs/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](./docs/tutorials/CIFAR-10.md) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](./docs/tutorials/MegatronGPT2Tutorial.md) | Train GPT2 with DeepSpeed and Megatron-LM |
| [Learning Rate Range Test Tutorial](./docs/tutorials/lrrt.md) | Faster training with large learning rates |
| [1Cycle Tutorial](./docs/tutorials/1Cycle.md) | SOTA learning schedule in DeepSpeed |
# Contributing
DeepSpeed welcomes your contributions! Please see our
[contributing](CONTRIBUTING.md) guide for more details on formatting, testing,
etc.
## Contributor License Agreement
This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit
https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
follow the instructions provided by the bot. You will only need to do this once across
all repos using our CLA.
## Code of Conduct
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
comments.
# Publications
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. [ArXiv:1910.02054](https://arxiv.org/abs/1910.02054)
# Tutorial: 1-Cycle Schedule
This tutorial shows how to implement 1Cycle schedules for learning rate and
momentum in PyTorch.
## 1-Cycle Schedule
Recent research has demonstrated that the slow convergence problems of large
batch size training can be addressed by tuning critical hyperparameters such
as learning rate and momentum, during training using cyclic and decay
schedules. In DeepSpeed, we have implemented a state-of-the-art schedule called
[1-Cycle](https://arxiv.org/abs/1803.09820) to help data scientists
effectively use larger batch sizes to train their models in PyTorch.
## Prerequisites
To use the 1-cycle schedule for model training, you should satisfy these two requirements:
1. Integrate DeepSpeed into your training script using this
[guide](../../README.md#getting-started).
2. Add the parameters to configure a 1-Cycle schedule to the parameters of your
model. We will define the 1-Cycle parameters below.
## Overview
The 1-cycle schedule operates in two phases, a cycle phase and a decay phase,
which span one iteration over the training data. For concreteness, we will
review how 1-cycle schedule of learning rate works. In the cycle phase,
the learning rate oscillates between a minimum value and a maximum value over a
number of training steps. In the decay phase, the learning rate decays starting
from the minimum value of the cycle phase. An example of 1-cycle learning rate
schedule during model training is illustrated below.
![1cycle_lr](../figures/1cycle_lr.png)
### 1-Cycle Parameters
The 1-Cycle schedule is defined by a number of parameters that allow users to
explore different configurations. The literature recommends concurrent tuning
of learning rate and momentum because they are correlated hyperparameters. We
have leveraged this recommendation to reduce the configuration burden by
organizing the 1-cycle parameters into two groups:
1. Global parameters for configuring the cycle and decay phases
2. Local parameters for configuring learning rate and momentum
The global parameters for configuring the 1-cycle phases are:
1. `cycle_first_step_size`: The count of training steps to complete the first step of the cycle phase
2. `cycle_first_stair_count`: The count of updates (or stairs) in the first step of the cycle phase
3. `cycle_second_step_size`: The count of training steps to complete the second step of the cycle phase
4. `cycle_second_stair_count`: The count of updates (or stairs) in the second step of the cycle phase
5. `post_cycle_decay_step_size`: The interval, in training steps, at which to decay the hyperparameter in the decay phase
The local parameters for the hyperparameters are:
**Learning rate**:
1. `cycle_min_lr`: minimum learning rate in cycle phase
2. `cycle_max_lr`: maximum learning rate in cycle phase
3. `decay_lr_rate`: decay rate for learning rate in decay phase
Although appropriate `cycle_min_lr` and `cycle_max_lr` values can be
selected based on experience or expertise, we recommend using the [learning rate
range test](lrrt.md) feature of DeepSpeed to configure them.
**Momentum**
1. `cycle_min_mom`: minimum momentum in cycle phase
2. `cycle_max_mom`: maximum momentum in cycle phase
3. `decay_mom_rate`: decay rate for momentum in decay phase
## Required Model Configuration Changes
To illustrate the required model configuration changes to use 1-Cycle schedule
in model training, we will use a schedule with the following properties:
1. A symmetric cycle phase, where each half of the cycle spans the same number
of training steps. For this example, it will take 1000 training steps for the
learning rate to increase from 0.0001 to 0.0010 (10X scale), and then to
decrease back to 0.0001. The momentum will correspondingly cycle between 0.85
and 0.99 in a similar number of steps.
2. A decay phase, where learning rate decays by 0.001 every 1000 steps, while
momentum is not decayed.
Note that these parameters are processed by DeepSpeed as session parameters,
and so should be added to the appropriate section of the model configuration.
### **PyTorch model**
PyTorch versions 1.0.1 and newer provide a feature for implementing schedulers
for hyper-parameters, called [learning rate
schedulers](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html).
We have implemented the 1-Cycle schedule using this feature. You will add a
scheduler entry of type **"OneCycle"** as illustrated below.
```json
"scheduler": {
"type": "OneCycle",
"params": {
"cycle_first_step_size": 1000,
"cycle_first_stair_count": 500,
"cycle_second_step_size": 1000,
"cycle_second_stair_count": 500,
"decay_step_size": 1000,
"cycle_min_lr": 0.0001,
"cycle_max_lr": 0.0010,
"decay_lr_rate": 0.001,
"cycle_min_mom": 0.85,
"cycle_max_mom": 0.99,
"decay_mom_rate": 0.0
}
},
```
## Batch Scaling Example
As an example of how the 1-Cycle schedule can enable effective batch scaling, we
briefly share our experience with an internal model in Microsoft. In this case,
the model was well-tuned for fast convergence (in data samples) on a single
GPU, but was converging slowly to target performance (AUC) when training on 8
GPUs (8X batch size). The plot below shows model convergence with 8 GPUs for
these learning rate schedules:
1. **Fixed**: using an optimal fixed learning rate for 1-GPU training.
2. **LinearScale**: using a fixed learning rate that is 8X of **Fixed**.
3. **1Cycle**: using 1-Cycle schedule.
![model_convergence](../figures/model_convergence.png)
With **1Cycle**, the model converges faster than the other schedules to the
target AUC. In fact, **1Cycle** converges as fast as the optimal 1-GPU
training (not shown). For **Fixed**, convergence is about 5X slower (needs 5X
more data samples). With **LinearScale**, the model diverges because the
learning rate is too high. The plot below illustrates the schedules by
reporting the learning rate values during 8-GPU training.
![lr_schedule](../figures/lr_schedule.png)
We see that the learning rate for **1Cycle** is always larger than **Fixed**
and is briefly larger than **LinearScale** to achieve faster convergence. Also
**1Cycle** lowers the learning rate later during training to avoid model
divergence, in contrast to **LinearScale**. In summary, by configuring an
appropriate 1-Cycle schedule we were able to effectively scale the training batch
size for this model by 8X without loss of convergence speed.
# Tutorial: Megatron-LM GPT2 with DeepSpeed
If you haven't already, we advise you to first read through the [Getting
Started](../../README.md#getting-started) guide before stepping through this
tutorial.
In this tutorial we will be adding DeepSpeed to the Megatron-LM GPT2 model, a
large and powerful transformer. Megatron-LM supports model-parallel and multi-node
training. Please see the corresponding paper for more details: [Megatron-LM:
Training Multi-Billion Parameter Language Models Using Model
Parallelism](https://arxiv.org/abs/1909.08053).
First, we discuss data and environment setup and how to train the GPT-2 model with the
original Megatron-LM. Next, we proceed step-by-step in enabling this model to run with
DeepSpeed. Finally, we demonstrate the **_performance gains_**, and **_memory footprint
reduction_** from using DeepSpeed.
## 1 Training GPT-2 with the Original Megatron-LM
The original model code is from
[Megatron-LM](https://github.com/NVIDIA/Megatron-LM). We've copied this repo
under
[DeepSpeedExamples/Megatron-LM/](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM)
and made it available as a submodule. To download, execute:
```bash
git submodule update --init --recursive
```
### 1.1 Training Data Setup
* Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data)
to download the webtext data and place a symbolic link under `DeepSpeedExamples/Megatron-LM/data`:
### 1.2 Running Unmodified Megatron-LM GPT2 model
* For a single GPU run:
- change `scripts/pretrain_gpt2.sh`, set its `--train-data` argument as `"webtext"`.
- run `bash scripts/pretrain_gpt2.sh`
* For multiple GPUs and/or nodes run:
- change `scripts/pretrain_gpt2_model_parallel.sh`
- set its `--train-data` argument as `"webtext"`
  - `GPUS_PER_NODE` indicates how many GPUs per node are involved in the run
  - `NNODES` indicates how many nodes are involved in the run
- run `bash scripts/pretrain_gpt2_model_parallel.sh`
## 2 Enabling DeepSpeed
To use DeepSpeed we will modify three files:
* `arguments.py` : Arguments configurations
* `pretrain_gpt2.py` : Main entry point for training
* `utils.py` : Checkpoints saving and loading utilities
### 2.1 Argument Parsing
The first step in applying DeepSpeed is adding DeepSpeed arguments to the
Megatron-LM GPT2 model, using `deepspeed.add_config_arguments()` in
`arguments.py`.
```python
def get_args():
"""Parse all the args."""
parser = argparse.ArgumentParser(description='PyTorch BERT Model')
parser = add_model_config_args(parser)
parser = add_fp16_config_args(parser)
parser = add_training_args(parser)
parser = add_evaluation_args(parser)
parser = add_text_generate_args(parser)
parser = add_data_args(parser)
# Include DeepSpeed configuration arguments
parser = deepspeed.add_config_arguments(parser)
```
### 2.2 Initialization and Training
We modify `pretrain_gpt2.py` to enable training with DeepSpeed.
#### 2.2.1 Initialization
We use `deepspeed.initialize` to create `model_engine`, `optimizer` and LR
`scheduler`. Below is its definition:
```python
def initialize(args,
model,
optimizer=None,
model_parameters=None,
training_data=None,
lr_scheduler=None,
mpu=None,
dist_init_required=True,
collate_fn=None):
```
For the Megatron-LM GPT2 model, we initialize DeepSpeed in its
`setup_model_and_optimizer()` function as shown below, passing the raw `model`,
`optimizer`, `args`, `lr_scheduler` and `mpu`.
```python
def setup_model_and_optimizer(args):
"""Setup model and optimizer."""
model = get_model(args)
optimizer = get_optimizer(model, args)
lr_scheduler = get_learning_rate_scheduler(optimizer, args)
if args.deepspeed:
import deepspeed
print_rank_0("DeepSpeed is enabled.")
model, optimizer, _, lr_scheduler = deepspeed.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=mpu,
dist_init_required=False
)
```
Note that when FP16 is enabled, Megatron-LM GPT2 adds a wrapper to the `Adam`
optimizer. DeepSpeed has its own FP16 Optimizer, so we need to pass the `Adam`
optimizer to DeepSpeed directly without any wrapper. We return the unwrapped
Adam optimizer from `get_optimizer()` when DeepSpeed is enabled.
```python
def get_optimizer(model, args):
"""Setup the optimizer."""
......
# Use Adam.
optimizer = Adam(param_groups,
lr=args.lr, weight_decay=args.weight_decay)
if args.deepspeed:
# fp16 wrapper is not required for DeepSpeed.
return optimizer
```
#### 2.2.2 Using the Training API
The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_
that we will use to train the model using the forward, backward and step API.
##### Forward Propagation
The forward propagation API is compatible with PyTorch, so no change is required.
##### Backward Propagation
Backward propagation is done by calling `backward(loss)` directly on the model engine.
```python
def backward_step(optimizer, model, lm_loss, args, timers):
"""Backward step."""
# Total loss.
loss = lm_loss
# Backward pass.
if args.deepspeed:
model.backward(loss)
else:
optimizer.zero_grad()
if args.fp16:
optimizer.backward(loss, update_master_grads=False)
else:
loss.backward()
```
Zeroing the gradients is handled automatically by DeepSpeed after the weights
have been updated using a mini-batch.
Furthermore, DeepSpeed addresses distributed data parallel and FP16 under the
hood, simplifying code in multiple places.
(A) DeepSpeed also performs gradient averaging automatically at the gradient
accumulation boundaries, so we skip the allreduce communication.
```python
if args.deepspeed:
# DeepSpeed backward propagation already addressed all reduce communication.
# Reset the timer to avoid breaking timer logs below.
timers('allreduce').reset()
else:
torch.distributed.all_reduce(reduced_losses.data)
reduced_losses.data = reduced_losses.data / args.world_size
if not USE_TORCH_DDP:
timers('allreduce').start()
model.allreduce_params(reduce_after=False,
fp32_allreduce=args.fp32_allreduce)
timers('allreduce').stop()
```
(B) We also skip updating master gradients, since DeepSpeed addresses it internally.
```python
# Update master gradients.
if not args.deepspeed:
if args.fp16:
optimizer.update_master_grads()
# Clipping gradients helps prevent the exploding gradient.
if args.clip_grad > 0:
if not args.fp16:
mpu.clip_grad_norm(model.parameters(), args.clip_grad)
else:
optimizer.clip_master_grads(args.clip_grad)
return lm_loss_reduced
```
##### Updating the Model Parameters
The `step()` function in DeepSpeed engine updates the model parameters as well
as the learning rate.
```python
if args.deepspeed:
model.step()
else:
optimizer.step()
# Update learning rate.
if not (args.fp16 and optimizer.overflow):
lr_scheduler.step()
else:
skipped_iter = 1
```
##### Loss Scaling
The GPT2 training script logs the loss scaling value during training. Inside
the DeepSpeed optimizer, this value is stored as `cur_scale`, rather than
`loss_scale` as in Megatron's optimizer. Therefore, we appropriately replace it in
the logging string.
```python
if args.fp16:
log_string += ' loss scale {:.1f} |'.format(
optimizer.cur_scale if args.deepspeed else optimizer.loss_scale)
```
### 2.3 Checkpoints Saving & Loading
The DeepSpeed engine has flexible APIs for checkpoint saving and loading that
handle the states of both the client model and DeepSpeed's own internals.
```python
def save_checkpoint(self, save_dir, tag, client_state={})
def load_checkpoint(self, load_dir, tag)
```
Applying DeepSpeed requires updating `utils.py`, in which Megatron-LM GPT2 saves and
loads its checkpoints.
A new function `save_ds_checkpoint()`, shown below, is created for DeepSpeed. It
collects the client model states and passes them to the DeepSpeed engine by calling
DeepSpeed's `save_checkpoint()`.
```python
def save_ds_checkpoint(iteration, model, args):
"""Save a model checkpoint."""
sd = {}
sd['iteration'] = iteration
# rng states.
if not args.no_save_rng:
sd['random_rng_state'] = random.getstate()
sd['np_rng_state'] = np.random.get_state()
sd['torch_rng_state'] = torch.get_rng_state()
sd['cuda_rng_state'] = torch.cuda.get_rng_state()
sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
model.save_checkpoint(args.save, iteration, client_state = sd)
```
In the Megatron-LM GPT2 `save_checkpoint()` function, add the following lines to
invoke the above function for DeepSpeed.
```python
def save_checkpoint(iteration, model, optimizer,
lr_scheduler, args):
"""Save a model checkpoint."""
if args.deepspeed:
save_ds_checkpoint(iteration, model, args)
else:
......
```
In the `load_checkpoint()` function, use the DeepSpeed checkpoint loading API as below,
and return the states for the client model.
```python
def load_checkpoint(model, optimizer, lr_scheduler, args):
"""Load a model checkpoint."""
iteration, release = get_checkpoint_iteration(args)
if args.deepspeed:
checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
if checkpoint_name is None:
if mpu.get_data_parallel_rank() == 0:
print("Unable to load checkpoint.")
return iteration
else:
......
```
### 2.4 Train scripts
Assuming the webtext data was prepared in the previous steps, execute one of the
following commands to start training the Megatron-LM GPT2 model with DeepSpeed
enabled.
- Single GPU run
- run `bash scripts/ds_pretrain_gpt2.sh`
- Multiple GPUs/Nodes run
- run `bash scripts/ds_pretrain_gpt2_model_parallel.sh`
## 3 Performance Improvements
DeepSpeed enables training very large models effectively via the advanced [ZeRO
optimizer](https://arxiv.org/abs/1910.02054v2). ZeRO significantly reduces the memory
footprint for training large models which means large models can be trained with i) less
model parallelism and ii) larger batch sizes. A lower model parallelism degree improves
training efficiency by increasing the granularity of the computation such as the matrix
multiplication where performance is directly related to the size of the matrices.
Furthermore, less model parallelism also results in less communication between model
parallel GPUs, which further boosts performance. Larger batch size has a similar effect
of increasing the computational granularity as well as reducing communication, also
resulting in better performance. Therefore, DeepSpeed combines ZeRO-powered data parallelism with
Megatron-LM tensor-slicing model parallelism, which is
significantly faster than using Megatron-LM alone.
The observed performance improvements depend on several factors such as the memory per
GPU, the local GPU interconnect (i.e., PCI-E vs NVLINK vs NVSwitch), the model size,
inter node network interconnect, etc. Below, we show some of the performance improvements
from using DeepSpeed over Megatron on a 16 GPU Low Bandwidth (40 Gbps) cluster and a 400 GPU DGX-2 High Bandwidth (800 Gbps) cluster.
For details please see the [ZeRO Paper](https://arxiv.org/abs/1910.02054v2). We also
present performance improvement on a 64 GPU cluster along with detailed configuration
analysis to show where the improvements come from.
![DeepSpeed-vs-Megatron](../figures/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em>
</p>
### 3.1 On Low Bandwidth GPU Cluster
The figure above shows that training a 1.5B parameter model with DeepSpeed is
nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPUs per
node, and 16 GPUs in total. These GPUs have 16 GB of memory each; GPUs within a
node are connected via PCI-E, and nodes are connected via 40 Gbps InfiniBand.
The performance improvement comes from a lower model parallelism degree and a
larger batch size, as discussed earlier. Training the 1.5B parameter model with
Megatron-LM alone requires 4-way model parallelism, and can only fit an effective
batch size of 32 using all 16 GPUs. On the other hand, DeepSpeed does not
require any model-parallelism to train this model, and can support an
effective batch size of 128 without running out of memory, resulting in
significantly higher performance.
### 3.2 On High bandwidth DGX-2 GPU Cluster
Each GPU on the DGX-2 cluster has 32 GB of memory, and GPUs inside a node are connected via
the high-bandwidth NVSwitch. DGX-2 nodes are connected to each other via an 800 Gbps (8 x 100 Gbps) InfiniBand interconnect. As such, running a 1.5B parameter model on DGX-2 requires less model
parallelism, and the performance improvement from DeepSpeed for this model size is less
significant. However, at larger model sizes, Megatron still requires a significantly larger
model parallelism degree, and can only run much smaller batch sizes than DeepSpeed.
Therefore, as the model sizes get larger, DeepSpeed, by combining ZeRO with Megatron model parallelism, starts to significantly outperform
using Megatron-LM alone.
### 3.3 Performance Improvements with Configuration Details
The figure below compares DeepSpeed with Megatron on a 64 GPU cluster with 4
DGX-2 nodes. To give the readers a clear idea of the source of the performance
improvements, we also present the configuration table for both Megatron and
DeepSpeed. It shows the smallest model parallelism degree and the largest batch
size that can be used to train these models without running out of memory. As
discussed above, the tables demonstrate that DeepSpeed runs with a smaller model parallelism degree
and achieves better performance.
![DeepSpeed Performance SpeedUp](../figures/megatron-gpt2-perf-test.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em>
</p>
**a ) Megatron-LM GPT2 Baseline**
| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
| 1.5B | 2 | 32 | 64 | 512 | 48 | 1600 | 16 | 128.56 |
| 4B | 4 | 16 | 64 | 128 | 64 | 2304 | 16 | 49.36 |
| 8B | 4 | 16 | 64 | 128 | 72 | 3072 | 24 | 24.57 |
| 20B | 16 | 4 | 64 | 16 | 111 | 3808 | 32 | 3.42 |
**b ) Megatron-LM GPT2 with DeepSpeed**
| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
| 1.5B | 1 | 64 | 64 | 2048 | 48 | 1600 | 16 | 151.35 |
| 4B | 1 | 64 | 64 | 512 | 64 | 2304 | 16 | 75.13 |
| 8B | 2 | 32 | 64 | 512 | 72 | 3072 | 24 | 43.52 |
| 20B | 4 | 16 | 64 | 128 | 111 | 3808 | 32 | 12.65 |
# Tutorial: Learning Rate Range Test
This tutorial shows how to use DeepSpeed to perform learning rate range tests in PyTorch.
## Learning Rate Range Test (LRRT)
Learning rate range test ([LRRT](https://arxiv.org/abs/1803.09820)) is a
method for discovering the largest learning rate values that can be used to
train a model without divergence. Data scientists are often interested in this
information because large learning rates lead to faster model convergence than
small learning rates. Moreover, large learning rates are crucial in learning
rate schedules such as [CLR](https://arxiv.org/abs/1506.01186) and
[1Cycle](https://arxiv.org/abs/1803.09820), which are used to train effectively
with large batch sizes. DeepSpeed provides LRRT for training PyTorch models.
## Prerequisites
To use DeepSpeed's LRRT, you must satisfy the following two conditions:
1. Integrate DeepSpeed into your training script using this
[guide](../../README.md#getting-started).
2. Add the parameters to configure LRRT to the parameters of your model. The
LRRT parameters are defined below.
## LRRT Parameters
LRRT works by linearly increasing the learning rate by a predefined amount, at
predefined intervals. Thus, LRRT is a form of learning rate schedule, because it
defines how and when the learning rate should change during model training. To
configure LRRT, you will need to set these parameters (an illustrative sketch of
the growth rule follows the list):
1. `lr_range_test_min_lr` : The initial learning rate for training `(float)`
2. `lr_range_test_step_size`: The interval for scaling up learning rate,
defined in training steps `(integer)`
3. `lr_range_test_step_rate`: The scaling factor for increasing learning rate
`(float)`
4. `lr_range_test_staircase`: If true, learning rate is changed every
`lr_range_test_step_size` training steps, otherwise learning rate is changed at
every training step `(boolean)`
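For intuition only, the sketch below implements one possible linear-growth rule that
is consistent with the description above; it is an illustration under our own
assumptions and not necessarily DeepSpeed's exact formula.
```python
import math

def lrrt_lr(step, min_lr=0.0001, step_size=200, step_rate=5, staircase=False):
    # Illustrative growth rule (an assumption, not DeepSpeed's exact code):
    # the learning rate grows linearly with the number of elapsed intervals.
    interval = math.floor(step / step_size) if staircase else step / step_size
    return min_lr * (1 + step_rate * interval)

# With the defaults above, after 1000 steps the learning rate under this
# illustrative rule is 0.0001 * (1 + 5 * 5) = 0.0026.
print(lrrt_lr(1000))
```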
## Required Model Configuration Changes
We will illustrate the required model configuration changes with an example LRRT
schedule that:
1. Starts training with an initial learning rate of 0.0001
2. Uses a scaling rate of 5
3. Uses a scaling interval of 200 training steps
4. Scales learning rate at every training step, i.e., does not use staircase
### PyTorch
For PyTorch models, LRRT is implemented as a [learning rate
scheduler](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html),
a feature that is available in PyTorch versions 1.0.1 and newer. Thus, you can
add a `"scheduler"` entry of type `"LRRangeTest"` into your model configuration
as illustrated below:
```json
"scheduler": {
"type": "LRRangeTest",
"params": {
"lr_range_test_min_lr": 0.0001,
"lr_range_test_step_size": 200,
"lr_range_test_step_rate": 5,
"lr_range_test_staircase": false
}
}
```
## Example: Tuning for Large Batch Sizes
We illustrate how LRRT can benefit data scientists with a snippet of our
experience of tuning an internal production model to converge efficiently on
larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs
(batch size 2048). Our goal was to train the model with the larger batch size
to match the performance of the smaller batch size using the same amount of
data samples. The challenge here is the well known problem of slow convergence
of large batch size training. Our approach was to use a
[1Cycle](1Cycle.md) schedule in DeepSpeed to tackle
this problem, and we used LRRT to configure the schedule.
In the plots below, we illustrate using LRRT to discover the maximum learning
rates for effective training with batch size 2048. The plot on the left shows
the impact of large learning rates on validation loss over the first 9000
batches of training. The plot on the right shows the learning rate values
during the same period of training. Using grid search we discover that the
best fixed learning rate for the batch size 2048 is 0.0002. The blue line
(`lr=0.0002`) represents training with this fixed learning rate. We compare the
two LRRT schedules with this fixed learning rate. The orange
(`lr_range_test_step_rate=5`) and gray (`lr_range_test_step_rate=50`) lines
represent training with similar LRRT schedules that differ only in
`lr_range_test_step_rate` values. Although the LRRT schedules start from the
same base learning rate, the gray line's learning rate grows about 10 times
faster than the orange line. Also, the learning rates of the LRRT schedules had
grown larger than that of the blue line within the presented data points. We
subsequently refer to the gray and orange lines as the "fast growing" and
"slow growing" LRRT schedules, respectively.
![validation_loss](../figures/loss_and_lr.png)
We make the following observations from this small example.
1. Larger learning rates clearly benefit model performance, up to some point.
The fast growing LRRT schedule achieves validation loss of 0.46 after 3000
batches, which the fixed learning rate does not achieve with 9000 batches. The
slow growing LRRT does not match that score until after 6000 batches; however,
it maintains an increasing performance advantage over the fixed learning rate.
2. There is an upper bound on learning rate values that are useful for training
the model. The fast growing LRRT schedule hits this boundary quickly and
diverges, while the slow growing LRRT will later diverge for the same reason.
LRRT helped us discover these boundaries quickly, using less than 2% of the
training data. These boundaries are useful information for constructing
learning rate schedules.
These observations from LRRT helped us to configure the learning rate
boundaries and the cycle span for a 1Cycle schedule that solves the problem, as
shown below.
```json
"OneCycle": {
"cycle_min_lr": 0.002,
"cycle_max_lr": 0.005,
"cycle_first_step_size": 2000,
"cycle_second_step_size": 2000,
...
}
```
In our experience these are the four most critical parameters of 1Cycle schedules.
1. We chose to use the slower LRRT schedule (`lr_range_test_step_rate=5`) to
set `cycle_min_lr` because it achieves the best loss and the faster schedule
diverges fairly quickly.
2. We set `cycle_max_lr` to 0.005 even though the plot shows that performance
was still improving at slightly higher learning rates. This is because we
observed that if we wait until the maximum learning rate, the model could be at
the point of divergence and impossible to recover.
3. Since it takes 8000 batches for the learning rate to become 0.005, we set
`cycle_first_step_size` (and `cycle_second_step_size`) to 2000, which is the
number of steps that it takes for four GPUs to process 8000 batches.
We hope this brief example sparks your imagination on using LRRT for your own
unique tuning challenges.