[docs] website refresh (#2123)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Jeff Rasley
2022-07-21 16:56:17 -07:00
committed by GitHub
parent 9f5895cb7a
commit a2506b545a
21 changed files with 394 additions and 326 deletions

README.md

@@ -9,50 +9,90 @@
<img src="docs/assets/images/DeepSpeed_dark_transparent.svg#gh-dark-mode-only" width="400px">
</div>
<!--
Remove until pypi issue is resolved: https://status.python.org/incidents/2jj696st6yn5
[![Downloads](https://pepy.tech/badge/deepspeed/month)](https://pepy.tech/project/deepspeed)
-->
## Latest News
* [2022/07/20] [DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
* [Tutorial](https://www.deepspeed.ai/tutorials/model-compression/) and [Code examples](https://github.com/microsoft/DeepSpeedExamples/tree/master/model_compression).
* 50x model size reduction via [XTC](https://arxiv.org/abs/2206.01859) and 5000x compression cost reduction via [ZeroQuant](https://arxiv.org/abs/2206.01861).
* [2022/03/21] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
* [2022/03/07] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
* [2022/01/19] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
* [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/).
* [Mixture of Experts (MoE) Inference tutorial](https://www.deepspeed.ai/tutorials/moe-inference-tutorial).
* [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
* [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>
* [2022/07] [DeepSpeed Compression: A composable library for extreme compression](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
* [2022/03] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
* [2022/03] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
* [2022/01] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
* [2021/11] [Autotuning: Automatically discover the optimal DeepSpeed configuration](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
### DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai)
---
[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.
# Extreme Speed and Scale for DL Training and Inference
<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>
[DeepSpeed](https://www.deepspeed.ai/) is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, 3D parallelism of DeepSpeed can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, ZeRO-Offload of DeepSpeed can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
* Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution compared with dense transformers.
* Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 27x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam/LAMB, allowing for scaling to different types of GPU clusters and networks.
* Train/Inference dense or sparse models with billions or trillions of parameters
* Achieve excellent system throughput and efficiently scale to thousands of GPUs
* Train/Inference on resource constrained GPU systems
* Achieve unprecedented low latency and high throughput for inference
* Achieve extreme compression for unparalleled inference latency and model size reduction at low cost
Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
---
# DeepSpeed's three innovation pillars
<img src="docs/assets/images/3pillars.png" width="800px">
## DeepSpeed-Training
DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the training pillar. Learn more: [DeepSpeed-Training](/_pages/training)
## DeepSpeed-Inference
DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high-performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: [DeepSpeed-Inference](/_pages/inference)
## DeepSpeed-Compression
To further increase inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, state-of-the-art compression innovations such as ZeroQuant and XTC are included under the compression pillar. Learn more: [DeepSpeed-Compression](/_pages/compression)
---
# DeepSpeed Software Suite
## DeepSpeed Library
The [DeepSpeed](https://github.com/microsoft/deepspeed) library (this repository) implements and packages the innovations and technologies in DeepSpeed Training, Inference and Compression Pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The DeepSpeed Library is heavily adopted by the DL community, and has been used to enable some of the most powerful models (see [DeepSpeed Adoption](#deepspeed-adoption)).
## Model Implementations for Inference (MII)
[Model Implementations for Inference (MII)](https://github.com/microsoft/deepspeed-mii) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code, while achieving significant latency reduction compared to their vanilla open-sourced versions.
## DeepSpeed on Azure
DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest way to get started. The recommended method to try DeepSpeed on Azure is through AzureML [recipes](https://github.com/Azure/azureml-examples/tree/main/python-sdk/workflows/train/deepspeed). The job submission and data preparation scripts have been made available [here](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/azureml). For more details on how to use DeepSpeed on Azure, please follow the [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/).
---
# DeepSpeed Adoption
DeepSpeed is an important part of Microsoft's new
[AI at Scale](https://www.microsoft.com/en-us/research/project/ai-at-scale/)
initiative to enable next-generation AI capabilities at scale, where you can find more
information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale).
**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):
* [Megatron-Turing NLG (530B)](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* [Jurassic-1 (178B)](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
* [BLOOM (176B)](https://huggingface.co/blog/bloom-megatron-deepspeed)
* [YaLM (100B)](https://github.com/yandex/YaLM-100B)
* [GPT-NeoX (20B)](https://github.com/EleutherAI/gpt-neox)
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
| | Documentation |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| <img src="docs/assets/images/transformers-light.png#gh-light-mode-only" width="250px"><img src="docs/assets/images/transformers-dark.png#gh-dark-mode-only" width="250px"> | [Transformers with DeepSpeed](https://huggingface.co/docs/transformers/main/main_classes/deepspeed) |
| <img src="docs/assets/images/accelerate-light.png#gh-light-mode-only" width="250px"><img src="docs/assets/images/accelerate-dark.png#gh-dark-mode-only" width="250px"> | [Accelerate with DeepSpeed](https://huggingface.co/docs/accelerate/main/en/deepspeed) |
| <img src="docs/assets/images/lightning-light.svg#gh-light-mode-only" width="200px"><img src="docs/assets/images/lightning-dark.svg#gh-dark-mode-only" width="200px"> | [Lightning with DeepSpeed](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.strategies.DeepSpeedStrategy.html) |
| <img src="docs/assets/images/mosaicml.svg" width="200px"> | [MosaicML with DeepSpeed](https://docs.mosaicml.com/en/v0.8.0/trainer/using_the_trainer.html?highlight=deepspeed#deepspeed-integration) |
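For example, Hugging Face Transformers exposes DeepSpeed through a single `deepspeed` argument on `TrainingArguments`. The sketch below is a hedged, minimal illustration; the model name and the inline DeepSpeed config are placeholders, and the linked Transformers documentation describes the full integration.

```python
# Hedged sketch: enabling DeepSpeed from the Hugging Face Trainer. The model
# name and the inline DeepSpeed config below are illustrative placeholders.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,
    # The `deepspeed` argument accepts a config dict or a path to a JSON file;
    # "auto" lets the integration fill the value in from the Trainer arguments.
    deepspeed={"train_batch_size": "auto", "zero_optimization": {"stage": 2}},
)

trainer = Trainer(model=model, args=args)  # add train_dataset=... for a real run
# trainer.train() would then hand optimizer, ZeRO, and fp16 handling to DeepSpeed.
```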
---
# Build Pipeline Status
@@ -64,28 +104,6 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)
| Integrations | [![nv-transformers-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml) [![nv-lightning-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml) |
| Misc | [![Formatting](https://github.com/microsoft/DeepSpeed/actions/workflows/formatting.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/formatting.yml) [![pages-build-deployment](https://github.com/microsoft/DeepSpeed/actions/workflows/pages/pages-build-deployment/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/pages/pages-build-deployment) [![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)|
# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed) | DeepSpeed overview |
| [Install](#installation) | Installation details |
| [Features](#features) | Feature list and overview |
| [Further Reading](#further-reading) | Documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing |
| [Publications](#publications) | Publications related to DeepSpeed |
| [Videos](#videos) | Videos related to DeepSpeed |
# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up the state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Yet still, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
# Installation
The quickest way to get started with DeepSpeed is via pip; this will install
@@ -121,76 +139,21 @@ On Windows you can build wheel with following steps, currently only inference mo
4. Run `python setup.py bdist_wheel` to build the wheel in the `dist` folder
# Features
Below we provide a brief feature list; see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
* 16-bit mixed precision
* Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Pipeline Parallelism](https://www.deepspeed.ai/tutorials/pipeline/)
* 3D Parallelism
* [The Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/)
* Optimizer State and Gradient Partitioning
* Activation Partitioning
* Constant Buffer Optimization
* Contiguous Memory Optimization
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
* Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
* [Training Features](https://www.deepspeed.ai/features/#training-features)
* Simplified training API
* Gradient Clipping
* Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
* Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* CPU-Adam
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
Please check out the [DeepSpeed-Training](https://www.deepspeed.ai/docs/training), [DeepSpeed-Inference](https://www.deepspeed.ai/docs/inference) and [DeepSpeed-Compression](https://www.deepspeed.ai/docs/compression) pages for the full set of features offered under each of these three pillars.
# Further Reading
All DeepSpeed documentation can be found on our website: [deepspeed.ai](https://www.deepspeed.ai/)
All DeepSpeed documentation, tutorials, and blogs can be found on our website: [deepspeed.ai](https://www.deepspeed.ai/)
| Article | Description |
| | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/) | DeepSpeed features |
| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config-json/) | Configuring DeepSpeed |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/cifar-10) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/) | Train GPT2 with DeepSpeed and Megatron-LM |
| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/) | Faster training with large learning rates |
| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/one-cycle/) | SOTA learning schedule in DeepSpeed |
| [Tutorials](https://www.deepspeed.ai/tutorials/) | Tutorials |
| [Blogs](https://www.deepspeed.ai/posts/) | Blogs |
# Contributing


@@ -80,6 +80,8 @@ defaults:
path: "_pages"
values:
permalink: /docs/:basename/
toc: true
toc_label: "Contents"
- scope:
path: ""
type: posts


@@ -11,20 +11,15 @@ main:
url: https://github.com/microsoft/DeepSpeed
lnav:
- title: 'Feature Overview'
url: /features/
- title: 'Training'
url: /training/
- title: 'Inference'
url: /inference/
- title: 'Compression'
url: /compression/
- title: 'Getting Started'
url: /getting-started/
children:
- title: 'Installation'
url: /getting-started/#installation
- title: 'Writing models'
url: /getting-started/#writing-deepspeed-models
- title: 'Training'
url: /getting-started/#training
- title: 'Launching'
url: /getting-started/#launching-deepspeed-training
- title: 'Configuration'
- title: 'ds_config'
url: /docs/config-json/
children:
- title: 'Autotuning'
@@ -33,34 +28,16 @@ lnav:
url: /docs/config-json/#batch-size-related-parameters
- title: 'Optimizer'
url: /docs/config-json/#optimizer-parameters
- title: 'Scheduler'
url: /docs/config-json/#scheduler-parameters
- title: 'Communication'
url: /docs/config-json/#communication-options
- title: 'FP16'
url: /docs/config-json/#fp16-training-options
- title: 'BFLOAT16'
url: /docs/config-json/#bfloat16-training-options
- title: 'Gradient Clipping'
url: /docs/config-json/#gradient-clipping
- title: 'ZeRO optimizations'
url: /docs/config-json/#zero-optimizations-for-fp16-training
- title: 'Parameter Offloading'
url: /docs/config-json/#parameter-offloading
- title: 'Optimizer Offloading'
url: /docs/config-json/#optimizer-offloading
- title: 'Asynchronous I/O'
url: /docs/config-json/#asynchronous-io
- title: 'Logging'
url: /docs/config-json/#logging
- title: 'Flops Profiler'
url: /docs/config-json/#flops-profiler
- title: 'PyTorch Profiler'
url: /docs/config-json/#pytorch-profiler
- title: 'Activation checkpointing'
url: /docs/config-json/#activation-checkpointing
- title: 'Sparse Attention'
url: /docs/config-json/#sparse-attention
- title: 'Monitoring'
url: /docs/config-json/#monitoring-module-tensorboard-wandb-csv
- title: 'Model Compression'


@@ -0,0 +1,12 @@
---
title: "Compression Overview and Features"
layout: single
permalink: /compression/
toc: true
toc_label: "Contents"
---
DeepSpeed Compression is a library purpose-built to make it easy for researchers and practitioners to compress models while delivering faster speed, smaller model size, and significantly reduced compression cost. Please refer to our [blog](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/) for more details.
DeepSpeed Compression offers novel state-of-the-art compression techniques to achieve faster model compression with better model quality and lower compression cost. DeepSpeed Compression also takes an end-to-end approach to improve the computation efficiency of compressed models via a highly optimized inference engine. Furthermore, our library has multiple built-in state-of-the-art compression methods. It supports the synergistic composition of these methods and the system optimizations, offering the best of both worlds while allowing a seamless and easy-to-use pipeline for efficient DL model inference. We also highly recommend reading our blog to learn more (at a high level) about why we built DeepSpeed Compression and what benefits it provides to users. To try compressing your model using the DeepSpeed Compression library, please check out our [tutorial](https://www.deepspeed.ai/tutorials/model-compression/).
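As a rough sketch of the workflow, the entry points and config path below are assumptions taken from the model-compression example, so please verify them against the tutorial before use.

```python
# Hedged sketch of compression-enabled training, assuming the init_compression /
# redundancy_clean entry points described in the model-compression tutorial.
# "ds_config.json" is a placeholder config that enables compression settings.
import torch
from deepspeed.compression.compress import init_compression, redundancy_clean

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))

# Wrap the model so the compression schedule in the config (e.g. quantization,
# pruning, layer reduction) is applied during training.
model = init_compression(model, "ds_config.json")

# ... run your usual (DeepSpeed or plain PyTorch) training loop here ...

# After training, fold the learned compression decisions into the final model.
model = redundancy_clean(model, "ds_config.json")
```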


@@ -1,5 +1,7 @@
---
title: "DeepSpeed Configuration JSON"
toc: true
toc_label: "Contents"
---
### Batch Size Related Parameters
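The three batch-size keys are related by `train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × (number of data-parallel GPUs)`. The sketch below shows one consistent combination; the values are illustrative.

```python
# Illustrative, mutually consistent batch-size settings for 8 data-parallel GPUs.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # batch processed per GPU per step
    "gradient_accumulation_steps": 8,     # micro-steps accumulated per optimizer step
    "train_batch_size": 4 * 8 * 8,        # = micro batch x accumulation x 8 GPUs = 256
}
```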

docs/_pages/inference.md Executable file

@@ -0,0 +1,13 @@
---
title: "Inference Overview and Features"
layout: single
permalink: /inference/
toc: true
toc_label: "Contents"
---
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. To further reduce latency and cost, we introduce inference-customized kernels. Finally, we propose a novel approach to quantize models, called MoQ, to both shrink the model and reduce the inference cost in production. For more details on the inference-related optimizations in DeepSpeed, please refer to our [blog post](https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/).
DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don't require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multiple GPUs for compatible models, provide the model parallelism degree and the checkpoint information (or the model which is already loaded from a checkpoint), and DeepSpeed will do the rest. It will automatically partition the model as necessary, inject compatible high-performance kernels into your model and manage the inter-GPU communication. For a list of compatible models, please see [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_policy.py).
To get started with DeepSpeed-Inference, please check out our [tutorial](https://www.deepspeed.ai/tutorials/inference-tutorial/).
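The following is a minimal, hedged sketch of this flow on a small Hugging Face model; argument names follow the inference tutorial and may differ across DeepSpeed versions, and the model name is a placeholder.

```python
# Hedged sketch of DeepSpeed-Inference on a Hugging Face model; argument names
# follow the inference tutorial and may differ across DeepSpeed versions.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Inject optimized kernels and, when mp_size > 1, shard the model across GPUs.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # model-parallel degree
    dtype=torch.half,                 # run in fp16
    replace_with_kernel_inject=True,  # swap in DeepSpeed's high-performance kernels
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(torch.cuda.current_device())
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))
```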

docs/_pages/features.md → docs/_pages/training.md Executable file → Normal file

@@ -1,3 +1,180 @@
---
title: "Training Overview and Features"
layout: single
permalink: /training/
toc: true
toc_label: "Contents"
---
# Overview
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up the state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Yet still, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
## Distributed, Effective, and Efficient Training with Ease
The DeepSpeed API is a lightweight wrapper on [PyTorch](https://pytorch.org/). This
means that you can use everything you love in PyTorch without learning a new
platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art
training techniques, such as distributed training, mixed precision, gradient
accumulation, and checkpointing, so that you can focus on your model development. Most
importantly, you can leverage the distinctive efficiency and effectiveness benefits of
DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch
models.
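For reference, a ported training loop typically looks like the following minimal sketch; the model, dataset, and `ds_config.json` (assumed to define at least `train_batch_size` and an optimizer) are placeholders.

```python
# Minimal sketch of porting a PyTorch training loop to DeepSpeed.
# The model, dataset, and "ds_config.json" are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(784, 10)
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))

# deepspeed.initialize returns an engine that owns distributed setup, mixed
# precision, ZeRO, gradient accumulation, and checkpointing.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config="ds_config.json",
)

for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(inputs), labels)
    model_engine.backward(loss)   # replaces loss.backward()
    model_engine.step()           # replaces optimizer.step()
```

When running on multiple GPUs, such a script is normally started with the `deepspeed` launcher so that the distributed environment is set up automatically.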
## Speed
DeepSpeed achieves high performance and fast convergence through a combination of
efficiency optimizations on compute/communication/memory/IO and effectiveness
optimizations on advanced hyperparameter tuning and optimizers. For example:
* <span style="color:dodgerblue">DeepSpeed trains BERT-large to parity in 44
mins using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs
(16 DGX-2 boxes).</span>
**BERT-large Training Times**
| Devices | Source | Training Time |
| -------------- | --------- | ---------------------:|
| 1024 V100 GPUs | DeepSpeed | **44** min|
| 256 V100 GPUs | DeepSpeed | **2.4** hr|
| 64 V100 GPUs | DeepSpeed | **8.68** hr|
| 16 V100 GPUs | DeepSpeed | **33.22** hr|
*BERT codes and tutorials will be available soon.*
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art NVIDIA
Megatron on Azure GPUs.
*Read more*: [GPT tutorial](/tutorials/megatron/)
## Memory efficiency
DeepSpeed provides memory-efficient data parallelism and enables training models without
model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on
a single GPU. In comparison, existing frameworks (e.g.,
PyTorch's Distributed Data Parallel) run out of memory with 1.4 billion parameter models.
DeepSpeed reduces the training memory footprint through a novel solution called Zero
Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are
replicated across data-parallel processes, ZeRO partitions model states and gradients to save
significant memory. Furthermore, it also reduces activation memory and fragmented memory.
The current implementation (ZeRO-2) reduces memory by up to
8x relative to the state of the art. You can read more about ZeRO in our [paper](https://arxiv.org/abs/1910.02054), and
in our blog posts related to
[ZeRO-1](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/) and [ZeRO-2](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
With this impressive memory reduction, early adopters of DeepSpeed have already
produced a language model (LM) with over 17B parameters called
<a href="https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft">
<span style="color:dodgerblue">Turing-NLG</span></a>,
establishing a new SOTA in the LM category.
For model scientists with limited GPU resources, ZeRO-Offload leverages both CPU and GPU memory for training large models. Using a machine with **a single GPU**, our users can run **models of up to 13 billion parameters** without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models.
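As a rough illustration of how this is switched on, the fragment below follows the key names in the [config documentation](https://www.deepspeed.ai/docs/config-json/); the values are placeholders and should be checked against the version-accurate schema there.

```python
# Hedged sketch: a DeepSpeed config dict that turns on ZeRO stage 2 with
# optimizer-state offload to CPU memory. Key names follow the config-json docs.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in CPU RAM
    },
}

# Passed in place of a JSON file, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```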
## Scalability
DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their
combinations, which we call 3D parallelism.
* <span style="color:dodgerblue">3D parallelism of DeepSpeed provides system support to run models with trillions of parameters, read more in our [press-release]({{ site.press_release_v3 }}) and [tutorial](/tutorials/pipeline).</span>
* <span style="color:dodgerblue">DeepSpeed can run large models more efficiently, up to 10x
faster for models of
various sizes spanning 1.5B to hundreds of billions of parameters.</span> More specifically, the data parallelism powered by ZeRO
is complementary and can be combined with different types of model parallelism. It allows
DeepSpeed to fit models using a lower degree of model parallelism and a higher batch size, offering
significant performance gains compared to using model parallelism alone.
*Read more*: [ZeRO paper](https://arxiv.org/abs/1910.02054),
and [GPT tutorial](/tutorials/megatron).
![DeepSpeed Speedup](/assets/images/deepspeed-speedup.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</em>
</p>
## Communication efficiency
Pipeline parallelism of DeepSpeed reduces communication volume during distributed training, which allows users to train multi-billion-parameter models 27x faster on clusters with limited network bandwidth.
![Low-bandwidth GPT-2 Performance](/assets/images/pp-lowbw-gpt2.png)
1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. [1-bit Adam blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [1-bit Adam tutorial](https://www.deepspeed.ai/tutorials/onebit-adam/), [0/1 Adam tutorial](https://www.deepspeed.ai/tutorials/zero-one-adam/), [1-bit LAMB tutorial](https://www.deepspeed.ai/tutorials/onebit-lamb/).
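Switching to one of these optimizers is a configuration change; the fragment below is a hedged sketch based on the 1-bit Adam tutorial, and the parameter values are placeholders.

```python
# Hedged sketch of a config fragment selecting the 1-bit Adam optimizer.
# Parameter names follow the 1-bit Adam tutorial; "freeze_step" controls the
# uncompressed warmup phase before compressed communication kicks in.
ds_config = {
    "train_batch_size": 4096,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 23000,         # warmup steps (placeholder value)
            "comm_backend_name": "nccl",  # backend for the compressed all-reduce
        },
    },
    "fp16": {"enabled": True},
}
```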
## Supporting long sequence length
DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers **an order-of-magnitude longer input sequence** and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.53x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. [Read more here](https://www.deepspeed.ai/2020/09/08/sparse-attention.html).
## Fast convergence for effectiveness
DeepSpeed supports advanced hyperparameter tuning and large batch size
optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
*Read more*: [Tuning tutorial](/tutorials/one-cycle).
## Good Usability
Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to 13 billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports a flexible combination of ZeRO-powered data parallelism with custom model parallelism, such as tensor slicing of NVIDIA's Megatron-LM.
## Features
Below we provide a brief feature list; see our detailed [feature overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
* 16-bit mixed precision
* Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Pipeline Parallelism](https://www.deepspeed.ai/tutorials/pipeline/)
* 3D Parallelism
* [The Zero Redundancy Optimizer](https://www.deepspeed.ai/tutorials/zero/)
* Optimizer State and Gradient Partitioning
* Activation Partitioning
* Constant Buffer Optimization
* Contiguous Memory Optimization
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
* Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
* [Training Features](https://www.deepspeed.ai/features/#training-features)
* Simplified training API
* Gradient Clipping
* Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
* Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* CPU-Adam
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* [Progressive Layer Dropping](https://www.deepspeed.ai/2020/10/28/progressive-layer-dropping-news.html)
* Efficient and robust compressed training
* Up to 2.5x convergence speedup for pre-training
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
---
title: "Feature Overview"
layout: single

BIN
docs/assets/images/3pillars.png Executable file

BIN
docs/assets/images/accelerate.png Executable file

BIN
docs/assets/images/hf-logo.png Executable file

@@ -0,0 +1,10 @@
<svg width="732" height="198" viewBox="0 0 732 198" fill="none" xmlns="http://www.w3.org/2000/svg">
<path d="M80.7967 1.017L3.69127 46.7244C2.56854 47.3909 1.63645 48.3484 0.988564 49.5007C0.340673 50.653 -0.000172384 51.9595 6.54048e-08 53.2894V144.717C0.000431188 146.046 0.341535 147.353 0.989426 148.504C1.63723 149.657 2.56897 150.614 3.69127 151.282L80.7967 196.983C81.9228 197.649 83.1997 198 84.5 198C85.8003 198 87.0772 197.649 88.2033 196.983L165.309 151.282C166.431 150.614 167.363 149.657 168.011 148.504C168.659 147.353 169 146.046 169 144.717V53.2894C169 51.9595 168.659 50.653 168.011 49.5007C167.363 48.3484 166.431 47.3909 165.309 46.7244L88.2033 1.017C87.0772 0.350663 85.8003 0 84.5 0C83.1997 0 81.9228 0.350663 80.7967 1.017ZM68.229 153.423L77.3848 113.44C77.4503 113.151 77.4417 112.849 77.3598 112.565C77.2778 112.28 77.1252 112.022 76.9174 111.816L54.7265 89.6312C54.5627 89.4704 54.4322 89.2768 54.3432 89.0629C54.2541 88.849 54.2083 88.6191 54.2083 88.3869C54.2083 88.1543 54.2541 87.9241 54.3432 87.7102C54.4322 87.4963 54.5627 87.3032 54.7265 87.1423L98.1183 42.9408C98.3752 42.6755 98.7107 42.5039 99.0728 42.4519C99.4349 42.3999 99.8031 42.4706 100.122 42.6531C100.441 42.8355 100.693 43.1199 100.84 43.4629C100.987 43.8057 101.02 44.1886 100.935 44.5527L91.767 84.6282C91.6989 84.9176 91.7049 85.2203 91.786 85.5064C91.867 85.7924 92.0196 86.0516 92.2292 86.2583L114.291 108.318C114.45 108.479 114.576 108.671 114.662 108.882C114.748 109.093 114.792 109.319 114.792 109.548C114.792 109.776 114.748 110.002 114.662 110.214C114.576 110.425 114.45 110.616 114.291 110.777L71.0521 155.015C70.794 155.275 70.461 155.443 70.1024 155.495C69.7438 155.545 69.3788 155.475 69.0617 155.297C68.7446 155.118 68.4923 154.839 68.3422 154.501C68.1922 154.163 68.1525 153.785 68.229 153.423Z" fill="url(#paint0_linear_5_35)"/>
<path d="M251.524 141H206.939V60.4775H220.303V129.713H251.524V141ZM268.313 71.4272C266.18 71.4272 264.345 70.7347 262.811 69.3496C261.313 67.9645 260.564 66.2051 260.564 64.0713C260.564 61.9375 261.313 60.1593 262.811 58.7368C264.345 57.3143 266.18 56.603 268.313 56.603C270.522 56.603 272.394 57.3143 273.929 58.7368C275.464 60.1593 276.231 61.9375 276.231 64.0713C276.231 66.0928 275.464 67.8335 273.929 69.2935C272.394 70.716 270.522 71.4272 268.313 71.4272ZM274.771 141H261.744V83.5H274.771V141ZM344.175 136.396C344.175 157.509 333.562 168.065 312.337 168.065C304.85 168.065 298.318 166.811 292.74 164.303V152.399C299.029 155.993 305 157.79 310.652 157.79C324.316 157.79 331.148 151.07 331.148 137.631V131.342H330.923C326.618 138.679 320.142 142.348 311.495 142.348C304.494 142.348 298.842 139.802 294.537 134.711C290.269 129.582 288.135 122.713 288.135 114.103C288.135 104.333 290.438 96.5648 295.042 90.7998C299.646 85.0348 305.973 82.1523 314.021 82.1523C321.621 82.1523 327.255 85.2594 330.923 91.4736H331.148V83.5H344.175V136.396ZM331.26 114.665V107.196C331.26 103.153 329.913 99.7093 327.217 96.8643C324.559 93.9818 321.228 92.5405 317.222 92.5405C312.281 92.5405 308.406 94.3748 305.599 98.0435C302.828 101.675 301.443 106.766 301.443 113.317C301.443 118.97 302.772 123.499 305.43 126.906C308.125 130.275 311.682 131.959 316.099 131.959C320.591 131.959 324.241 130.35 327.049 127.13C329.856 123.874 331.26 119.718 331.26 114.665ZM412.232 141H399.205V109.555C399.205 98.1745 395.405 92.4844 387.806 92.4844C383.987 92.4844 380.768 94.1315 378.147 97.4258C375.527 100.72 374.217 104.931 374.217 110.06V141H361.133V55.873H374.217V93.0459H374.441C378.784 85.7835 384.998 82.1523 393.084 82.1523C405.849 82.1523 412.232 89.9575 412.232 105.568V141ZM458.389 140.382C455.844 141.655 452.493 142.292 448.338 142.292C437.182 142.292 431.604 136.938 431.604 126.232V93.7197H422.002V83.5H431.604V70.1919L444.632 66.4858V83.5H458.389V93.7197H444.632V122.47C444.632 125.876 445.25 128.31 446.485 129.77C447.72 131.229 449.779 131.959 452.662 131.959C454.87 131.959 456.779 131.323 458.389 130.05V140.382ZM520.831 141H507.803V108.6C507.803 97.8563 504.004 92.4844 496.404 92.4844C492.436 92.4844 489.161 93.9818 486.578 96.9766C483.995 99.9339 482.703 103.677 482.703 108.207V141H469.62V83.5H482.703V93.0459H482.928C487.233 85.7835 493.447 82.1523 501.57 82.1523C507.822 82.1523 512.595 84.1925 515.889 88.2729C519.183 92.3159 520.831 98.1745 520.831 105.849V141ZM543.348 71.4272C541.214 71.4272 539.38 70.7347 537.845 69.3496C536.347 67.9645 535.599 66.2051 535.599 64.0713C535.599 61.9375 536.347 60.1593 537.845 58.7368C539.38 57.3143 541.214 56.603 543.348 56.603C545.556 56.603 547.428 57.3143 548.963 58.7368C550.498 60.1593 551.265 61.9375 551.265 64.0713C551.265 66.0928 550.498 67.8335 548.963 69.2935C547.428 70.716 545.556 71.4272 543.348 71.4272ZM549.805 141H536.778V83.5H549.805V141ZM618.086 141H605.059V108.6C605.059 97.8563 601.259 92.4844 593.66 92.4844C589.692 92.4844 586.417 93.9818 583.833 96.9766C581.25 99.9339 579.959 103.677 579.959 108.207V141H566.875V83.5H579.959V93.0459H580.184C584.489 85.7835 590.703 82.1523 598.826 82.1523C605.078 82.1523 609.851 84.1925 613.145 88.2729C616.439 92.3159 618.086 98.1745 618.086 105.849V141ZM686.368 136.396C686.368 157.509 675.755 168.065 654.529 168.065C647.042 168.065 640.51 166.811 634.932 164.303V152.399C641.221 155.993 647.192 157.79 652.845 157.79C666.508 157.79 673.34 151.07 673.34 137.631V131.342H673.116C668.811 138.679 662.334 142.348 653.687 142.348C646.687 142.348 641.034 
139.802 636.729 134.711C632.461 129.582 630.328 122.713 630.328 114.103C630.328 104.333 632.63 96.5648 637.234 90.7998C641.839 85.0348 648.165 82.1523 656.214 82.1523C663.813 82.1523 669.447 85.2594 673.116 91.4736H673.34V83.5H686.368V136.396ZM673.453 114.665V107.196C673.453 103.153 672.105 99.7093 669.41 96.8643C666.752 93.9818 663.42 92.5405 659.415 92.5405C654.473 92.5405 650.599 94.3748 647.791 98.0435C645.021 101.675 643.636 106.766 643.636 113.317C643.636 118.97 644.965 123.499 647.623 126.906C650.318 130.275 653.874 131.959 658.292 131.959C662.784 131.959 666.434 130.35 669.241 127.13C672.049 123.874 673.453 119.718 673.453 114.665Z" fill="white"/>
<defs>
<linearGradient id="paint0_linear_5_35" x1="127.442" y1="25.514" x2="-40.088" y2="307.246" gradientUnits="userSpaceOnUse">
<stop stop-color="#792EE5"/>
<stop offset="1" stop-color="#3EABB3"/>
</linearGradient>
</defs>
</svg>


@@ -0,0 +1,10 @@
<svg width="732" height="198" viewBox="0 0 732 198" fill="none" xmlns="http://www.w3.org/2000/svg">
<path d="M80.7967 1.017L3.69127 46.7244C2.56854 47.3909 1.63645 48.3484 0.988564 49.5007C0.340673 50.653 -0.000172384 51.9595 6.54048e-08 53.2894V144.717C0.000431188 146.046 0.341535 147.353 0.989426 148.504C1.63723 149.657 2.56897 150.614 3.69127 151.282L80.7967 196.983C81.9228 197.649 83.1997 198 84.5 198C85.8003 198 87.0772 197.649 88.2033 196.983L165.309 151.282C166.431 150.614 167.363 149.657 168.011 148.504C168.659 147.353 169 146.046 169 144.717V53.2894C169 51.9595 168.659 50.653 168.011 49.5007C167.363 48.3484 166.431 47.3909 165.309 46.7244L88.2033 1.017C87.0772 0.350663 85.8003 0 84.5 0C83.1997 0 81.9228 0.350663 80.7967 1.017ZM68.229 153.423L77.3848 113.44C77.4503 113.151 77.4417 112.849 77.3598 112.565C77.2778 112.28 77.1252 112.022 76.9174 111.816L54.7265 89.6312C54.5627 89.4704 54.4322 89.2768 54.3432 89.0629C54.2541 88.849 54.2083 88.6191 54.2083 88.3869C54.2083 88.1543 54.2541 87.9241 54.3432 87.7102C54.4322 87.4963 54.5627 87.3032 54.7265 87.1423L98.1183 42.9408C98.3752 42.6755 98.7107 42.5039 99.0728 42.4519C99.4349 42.3999 99.8031 42.4706 100.122 42.6531C100.441 42.8355 100.693 43.1199 100.84 43.4629C100.987 43.8057 101.02 44.1886 100.935 44.5527L91.767 84.6282C91.6989 84.9176 91.7049 85.2203 91.786 85.5064C91.867 85.7924 92.0196 86.0516 92.2292 86.2583L114.291 108.318C114.45 108.479 114.576 108.671 114.662 108.882C114.748 109.093 114.792 109.319 114.792 109.548C114.792 109.776 114.748 110.002 114.662 110.214C114.576 110.425 114.45 110.616 114.291 110.777L71.0521 155.015C70.794 155.275 70.461 155.443 70.1024 155.495C69.7438 155.545 69.3788 155.475 69.0617 155.297C68.7446 155.118 68.4923 154.839 68.3422 154.501C68.1922 154.163 68.1525 153.785 68.229 153.423Z" fill="url(#paint0_linear_5_36)"/>
<path d="M251.524 141H206.939V60.4775H220.303V129.713H251.524V141ZM268.313 71.4272C266.18 71.4272 264.345 70.7347 262.811 69.3496C261.313 67.9645 260.564 66.2051 260.564 64.0713C260.564 61.9375 261.313 60.1593 262.811 58.7368C264.345 57.3143 266.18 56.603 268.313 56.603C270.522 56.603 272.394 57.3143 273.929 58.7368C275.464 60.1593 276.231 61.9375 276.231 64.0713C276.231 66.0928 275.464 67.8335 273.929 69.2935C272.394 70.716 270.522 71.4272 268.313 71.4272ZM274.771 141H261.744V83.5H274.771V141ZM344.175 136.396C344.175 157.509 333.562 168.065 312.337 168.065C304.85 168.065 298.318 166.811 292.74 164.303V152.399C299.029 155.993 305 157.79 310.652 157.79C324.316 157.79 331.148 151.07 331.148 137.631V131.342H330.923C326.618 138.679 320.142 142.348 311.495 142.348C304.494 142.348 298.842 139.802 294.537 134.711C290.269 129.582 288.135 122.713 288.135 114.103C288.135 104.333 290.438 96.5648 295.042 90.7998C299.646 85.0348 305.973 82.1523 314.021 82.1523C321.621 82.1523 327.255 85.2594 330.923 91.4736H331.148V83.5H344.175V136.396ZM331.26 114.665V107.196C331.26 103.153 329.913 99.7093 327.217 96.8643C324.559 93.9818 321.228 92.5405 317.222 92.5405C312.281 92.5405 308.406 94.3748 305.599 98.0435C302.828 101.675 301.443 106.766 301.443 113.317C301.443 118.97 302.772 123.499 305.43 126.906C308.125 130.275 311.682 131.959 316.099 131.959C320.591 131.959 324.241 130.35 327.049 127.13C329.856 123.874 331.26 119.718 331.26 114.665ZM412.232 141H399.205V109.555C399.205 98.1745 395.405 92.4844 387.806 92.4844C383.987 92.4844 380.768 94.1315 378.147 97.4258C375.527 100.72 374.217 104.931 374.217 110.06V141H361.133V55.873H374.217V93.0459H374.441C378.784 85.7835 384.998 82.1523 393.084 82.1523C405.849 82.1523 412.232 89.9575 412.232 105.568V141ZM458.389 140.382C455.844 141.655 452.493 142.292 448.338 142.292C437.182 142.292 431.604 136.938 431.604 126.232V93.7197H422.002V83.5H431.604V70.1919L444.632 66.4858V83.5H458.389V93.7197H444.632V122.47C444.632 125.876 445.25 128.31 446.485 129.77C447.72 131.229 449.779 131.959 452.662 131.959C454.87 131.959 456.779 131.323 458.389 130.05V140.382ZM520.831 141H507.803V108.6C507.803 97.8563 504.004 92.4844 496.404 92.4844C492.436 92.4844 489.161 93.9818 486.578 96.9766C483.995 99.9339 482.703 103.677 482.703 108.207V141H469.62V83.5H482.703V93.0459H482.928C487.233 85.7835 493.447 82.1523 501.57 82.1523C507.822 82.1523 512.595 84.1925 515.889 88.2729C519.183 92.3159 520.831 98.1745 520.831 105.849V141ZM543.348 71.4272C541.214 71.4272 539.38 70.7347 537.845 69.3496C536.347 67.9645 535.599 66.2051 535.599 64.0713C535.599 61.9375 536.347 60.1593 537.845 58.7368C539.38 57.3143 541.214 56.603 543.348 56.603C545.556 56.603 547.428 57.3143 548.963 58.7368C550.498 60.1593 551.265 61.9375 551.265 64.0713C551.265 66.0928 550.498 67.8335 548.963 69.2935C547.428 70.716 545.556 71.4272 543.348 71.4272ZM549.805 141H536.778V83.5H549.805V141ZM618.086 141H605.059V108.6C605.059 97.8563 601.259 92.4844 593.66 92.4844C589.692 92.4844 586.417 93.9818 583.833 96.9766C581.25 99.9339 579.959 103.677 579.959 108.207V141H566.875V83.5H579.959V93.0459H580.184C584.489 85.7835 590.703 82.1523 598.826 82.1523C605.078 82.1523 609.851 84.1925 613.145 88.2729C616.439 92.3159 618.086 98.1745 618.086 105.849V141ZM686.368 136.396C686.368 157.509 675.755 168.065 654.529 168.065C647.042 168.065 640.51 166.811 634.932 164.303V152.399C641.221 155.993 647.192 157.79 652.845 157.79C666.508 157.79 673.34 151.07 673.34 137.631V131.342H673.116C668.811 138.679 662.334 142.348 653.687 142.348C646.687 142.348 641.034 
139.802 636.729 134.711C632.461 129.582 630.328 122.713 630.328 114.103C630.328 104.333 632.63 96.5648 637.234 90.7998C641.839 85.0348 648.165 82.1523 656.214 82.1523C663.813 82.1523 669.447 85.2594 673.116 91.4736H673.34V83.5H686.368V136.396ZM673.453 114.665V107.196C673.453 103.153 672.105 99.7093 669.41 96.8643C666.752 93.9818 663.42 92.5405 659.415 92.5405C654.473 92.5405 650.599 94.3748 647.791 98.0435C645.021 101.675 643.636 106.766 643.636 113.317C643.636 118.97 644.965 123.499 647.623 126.906C650.318 130.275 653.874 131.959 658.292 131.959C662.784 131.959 666.434 130.35 669.241 127.13C672.049 123.874 673.453 119.718 673.453 114.665Z" fill="black"/>
<defs>
<linearGradient id="paint0_linear_5_36" x1="127.442" y1="25.514" x2="-40.088" y2="307.246" gradientUnits="userSpaceOnUse">
<stop stop-color="#792EE5"/>
<stop offset="1" stop-color="#3EABB3"/>
</linearGradient>
</defs>
</svg>

BIN
docs/assets/images/lightning.png Executable file

docs/assets/images/mosaicml.svg Executable file

@@ -0,0 +1,38 @@
<svg width="221" height="38" viewBox="0 0 221 38" fill="none" xmlns="http://www.w3.org/2000/svg">
<g clip-path="url(#clip0)">
<path d="M24.0822 31.9977L26.0824 23.4903L21.6462 4.62432C21.3778 3.48392 20.5595 0 17.9712 0V6.0023L23.0374 27.5463L24.0822 31.9977Z" fill="#13294E"/>
<path d="M48.0339 30.825L47.2726 27.5501L41.8733 4.62432C41.605 3.48392 40.7866 0 38.1964 0V6.0023L43.2626 27.5463L44.2428 31.7069C44.3012 31.9563 44.3097 32.2147 44.2679 32.4674C44.2261 32.72 44.1348 32.962 43.9992 33.1793C43.8636 33.3967 43.6863 33.5852 43.4777 33.734C43.269 33.8828 43.0329 33.9891 42.7831 34.0466C42.636 34.082 42.4852 34.0998 42.3339 34.0998C41.8957 34.0965 41.4712 33.9466 41.1283 33.6741C40.7854 33.4015 40.5438 33.0221 40.4422 32.5964L40.3375 32.1536L39.2641 27.5463L36.141 14.274L34.196 6.0023L32.1958 14.5097L35.2637 27.5463L36.6587 33.4802C36.8336 34.2261 37.154 34.9304 37.6015 35.5526C38.0489 36.1749 38.6147 36.7031 39.2666 37.107C39.9184 37.5109 40.6435 37.7826 41.4005 37.9067C42.1575 38.0307 42.9316 38.0046 43.6785 37.8299C44.4254 37.6552 45.1305 37.3352 45.7536 36.8884C46.3767 36.4415 46.9056 35.8764 47.31 35.2254C47.7144 34.5744 47.9865 33.8503 48.1107 33.0943C48.2349 32.3383 48.2088 31.5652 48.0339 30.8193V30.825Z" fill="#13294E"/>
<path d="M19.0375 27.5459L14.0036 6.14062L12.0034 14.6461L15.037 27.5459L16.4073 33.3714C16.5539 33.993 16.8622 35.3025 17.5283 36.3688C19.1612 35.5933 19.7949 33.2061 20.0461 32.1493L20.0842 31.9935L19.0375 27.5459Z" fill="#13294E"/>
<path d="M48.0339 30.825L47.2726 27.5501L41.8733 4.62432C41.605 3.48392 40.7866 0 38.1964 0V6.0023L43.2626 27.5463L44.2428 31.7069C44.3012 31.9563 44.3097 32.2147 44.2679 32.4674C44.2261 32.72 44.1348 32.962 43.9992 33.1793C43.8636 33.3967 43.6863 33.5852 43.4777 33.734C43.269 33.8828 43.0329 33.9891 42.7831 34.0466C42.636 34.082 42.4852 34.0998 42.3339 34.0998C41.8957 34.0965 41.4712 33.9466 41.1283 33.6741C40.7854 33.4015 40.5438 33.0221 40.4422 32.5964L40.3375 32.1536L39.2641 27.5463L36.141 14.274L34.196 6.0023L32.1958 14.5097L35.2637 27.5463L36.6587 33.4802C36.8336 34.2261 37.154 34.9304 37.6015 35.5526C38.0489 36.1749 38.6147 36.7031 39.2666 37.107C39.9184 37.5109 40.6435 37.7826 41.4005 37.9067C42.1575 38.0307 42.9316 38.0046 43.6785 37.8299C44.4254 37.6552 45.1305 37.3352 45.7536 36.8884C46.3767 36.4415 46.9056 35.8764 47.31 35.2254C47.7144 34.5744 47.9865 33.8503 48.1107 33.0943C48.2349 32.3383 48.2088 31.5652 48.0339 30.8193V30.825Z" fill="url(#paint0_linear)"/>
<path d="M21.6462 4.62432C21.3778 3.48392 20.5595 0 17.9712 0V6.0023L23.0374 27.5463L24.0822 31.9939L26.0824 23.4865L21.6462 4.62432Z" fill="url(#paint1_linear)"/>
<path d="M14.0036 6.14062L12.0034 14.6461L15.037 27.5459L16.4073 33.3714C16.5539 33.993 16.8622 35.3025 17.5283 36.3688C19.1612 35.5933 19.7949 33.2061 20.0461 32.1493L20.0842 31.9935L19.0375 27.5459L14.0036 6.14062Z" fill="url(#paint2_linear)"/>
<path d="M70.6506 13.2686C68.2508 13.2686 67.0518 15.1921 66.9433 18.1685V28.5139H62.2178V9.63834H66.9528V14.8291L68.0795 10.4746C68.6618 9.92914 69.897 8.9541 72.1503 8.9541C75.4942 8.9541 77.8921 11.0962 78.4745 14.6181L79.4927 10.4803C80.1474 9.7904 81.491 8.95981 83.5274 8.95981C87.7143 8.95981 90.2151 11.6207 90.2151 16.4389V28.5253H85.48V17.1897C85.48 14.5763 84.3172 13.2686 82.2085 13.2686C79.7744 13.2686 78.5735 15.0115 78.5735 18.0963V28.5139H73.8498V17.2258C73.8498 15.2282 73.0125 13.2686 70.6506 13.2686Z" fill="#EE3932"/>
<path d="M114.267 19.0047C114.267 24.8853 109.869 29.0592 103.945 29.0592C98.0203 29.0592 93.6221 24.8777 93.6221 19.0047C93.6221 13.1316 98.0203 8.94824 103.945 8.94824C109.869 8.94824 114.267 13.124 114.267 19.0047ZM103.945 24.739C107.034 24.739 109.361 22.5247 109.361 19.0047C109.361 15.4846 107.034 13.2685 103.945 13.2685C100.818 13.2685 98.5284 15.4827 98.5284 19.0047C98.5284 22.5266 100.856 24.739 103.945 24.739Z" fill="#EE3932"/>
<path d="M133.563 14.3573L129.129 16.028C128.984 13.8137 127.458 12.6068 125.022 12.6068C123.06 12.6068 121.897 13.4792 121.897 14.785C121.897 18.3069 133.782 15.4388 133.782 23.0263C133.782 27.164 130.075 29.0514 125.532 29.0514C120.732 29.0514 117.244 27.0918 116.553 23.4976L121.023 21.5742C121.214 24.332 123.168 25.3489 125.532 25.3489C127.275 25.3489 128.767 24.6589 128.767 23.3152C128.767 19.6127 117.065 22.6632 117.065 15.0758C117.065 11.5177 120.734 8.94043 125.024 8.94043C129.679 8.94803 132.84 11.054 133.563 14.3573Z" fill="#EE3932"/>
<path d="M156.793 20.0575L159.665 28.5136H154.689L152.468 21.6541L152.14 26.4057C150.76 28.0023 148.76 29.0553 145.86 29.0553C140.226 29.0553 136.3 24.7712 136.3 19.073C136.3 13.3007 140.19 8.94434 145.816 8.94434C148.543 8.94434 150.614 9.9973 152.068 11.6053V9.63808H156.793V20.0575ZM146.726 24.775C149.742 24.775 152.245 22.5607 152.245 19.073C152.245 15.5149 149.737 13.2284 146.726 13.2284C143.563 13.2284 141.207 15.5872 141.207 19.073C141.201 22.4885 143.563 24.775 146.726 24.775V24.775Z" fill="#EE3932"/>
<path d="M162.646 1.83398H167.552V6.55334H162.646V1.83398ZM167.44 9.63812V28.5136H162.714V9.63812H167.44Z" fill="#EE3932"/>
<path d="M190.123 14.3214L185.936 16.7543C185.5 14.6122 183.536 13.2685 181.211 13.2685C178.122 13.2685 175.941 15.3744 175.941 19.0047C175.941 22.6349 178.086 24.739 181.211 24.739C183.464 24.739 185.398 23.2508 185.936 21.3634L190.007 23.7221C188.736 26.7366 185.426 29.0592 181.211 29.0592C175.142 29.0592 171.035 24.9575 171.035 19.0047C171.035 13.0879 175.286 8.94824 181.211 8.94824C185.4 8.94824 188.888 11.2728 190.123 14.3214Z" fill="#EE3932"/>
<path d="M197.557 1.25781L200.672 11.9338L203.79 1.25781H208.5V15.5508H205.579V4.21335L202.288 15.414H198.939L195.667 4.20004V15.5565H192.751V1.26351L197.557 1.25781Z" fill="#EE3932"/>
<path d="M211.419 1.25781H214.419V12.7701H221V15.5508H211.419V1.25781Z" fill="#EE3932"/>
<path d="M38.1961 0C35.6078 0 34.7895 3.48012 34.5212 4.62432L34.1957 6.0023L28.0828 31.9977C28.0733 32.0452 28.06 32.1003 28.0466 32.1536C27.7935 33.2103 27.1617 35.5976 25.5269 36.373C26.0883 37.272 26.9009 38 28.0847 38C30.673 38 31.4913 34.5199 31.7597 33.3757L32.0471 32.1536L38.1961 6.0023C38.4226 5.04246 39.0354 2.44235 40.754 1.63077C40.1925 0.727952 39.3799 0 38.1961 0Z" fill="#EE3932"/>
<path d="M20.55 1.62697C19.9886 0.727955 19.1759 0 17.9922 0C15.4039 0 14.5855 3.48012 14.3172 4.62432L13.9936 6.0023L7.84265 32.1536L7.73988 32.5964C7.6376 33.0218 7.39585 33.4008 7.05308 33.6733C6.71031 33.9457 6.28618 34.0958 5.84814 34.0998C5.69684 34.1 5.54606 34.0821 5.399 34.0466C4.89657 33.9287 4.4615 33.6165 4.18927 33.1786C3.91704 32.7406 3.82987 32.2127 3.94689 31.7107L9.99321 6.0023C10.2178 5.04246 10.8306 2.44235 12.5491 1.63077C11.9877 0.731754 11.177 0.00379924 9.99321 0.00379924C7.40302 0.00379924 6.58467 3.48392 6.31632 4.62812L5.99278 6.0061L0.152005 30.825C-0.201318 32.3315 0.0590563 33.9164 0.875846 35.2311C1.69264 36.5459 2.99894 37.4827 4.50737 37.8356C6.01581 38.1885 7.60282 37.9284 8.91929 37.1127C10.2358 36.297 11.1738 34.9924 11.5271 33.4859L11.8393 32.1555L17.9884 6.0042C18.2186 5.04247 18.8315 2.44236 20.55 1.62697Z" fill="#EE3932"/>
<path d="M32.7535 1.62697C32.1921 0.727954 31.3794 0 30.1957 0C27.6074 0 26.789 3.48012 26.5207 4.62432L26.1971 6.0023L20.0842 31.9977L20.0461 32.1536C19.7949 33.2103 19.1612 35.5976 17.5264 36.373C18.0878 37.272 18.9004 38 20.0842 38C22.6725 38 23.4909 34.5199 23.7592 33.3757L24.0466 32.1536L30.1976 6.0023C30.4221 5.04246 31.0349 2.44235 32.7535 1.62697Z" fill="#EE3932"/>
</g>
<defs>
<linearGradient id="paint0_linear" x1="33.7431" y1="17.1041" x2="39.9841" y2="18.9826" gradientUnits="userSpaceOnUse">
<stop stop-color="#070D19"/>
<stop offset="1" stop-color="#13294E" stop-opacity="0"/>
</linearGradient>
<linearGradient id="paint1_linear" x1="2686.57" y1="8718.02" x2="2954.62" y2="8740.32" gradientUnits="userSpaceOnUse">
<stop stop-color="#070D19"/>
<stop offset="1" stop-color="#13294E" stop-opacity="0"/>
</linearGradient>
<linearGradient id="paint2_linear" x1="2581.7" y1="9255.66" x2="2848.61" y2="9279.07" gradientUnits="userSpaceOnUse">
<stop stop-color="#070D19"/>
<stop offset="1" stop-color="#13294E" stop-opacity="0"/>
</linearGradient>
<clipPath id="clip0">
<rect width="221" height="38" fill="white"/>
</clipPath>
</defs>
</svg>


View File

@ -5,214 +5,78 @@ toc_label: "Contents"
title: "Latest News"
---
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>
* [2022/07/20] [DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
* [Tutorial](https://www.deepspeed.ai/tutorials/model-compression/) and [Code examples](https://github.com/microsoft/DeepSpeedExamples/tree/master/model_compression).
* 50x model size reduction via [XTC](https://arxiv.org/abs/2206.01859) and 5000x compression cost reduction via [ZeroQuant](https://arxiv.org/abs/2206.01861).
* [2022/03/21] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
* [2022/03/07] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
* [2022/01/19] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
* [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/).
* [Mixture of Experts (MoE) Inference tutorial](https://www.deepspeed.ai/tutorials/moe-inference-tutorial).
* [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/2021/11/16/autotuning.html)
* [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
* [2022/07] [DeepSpeed Compression: A composable library for extreme compression](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
* [2022/03] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
* [2022/03] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
* [2022/01] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
* [2021/11] [Autotuning: Automatically discover the optimal DeepSpeed configuration](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
<!-- <b> DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai) </b> -->
# Extreme Speed and Scale for DL Training and Inference
DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:
<p style="text-align: center;"><em>Train/Inference dense or sparse models with billions or trillions of parameters</em></p>
<p style="text-align: center;"><em>Achieve excellent system throughput and efficiently scale to thousands of GPUs</em></p>
<p style="text-align: center;"><em>Train/Inference on resource constrained GPU systems</em></p>
<p style="text-align: center;"><em>Achieve unprecedented low latency and high thoughput for inference</em></p>
<p style="text-align: center;"><em>Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs</em></p>
<b> DeepSpeed+Megatron trained the world's most powerful language model: [MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) </b>
# DeepSpeed has three innovation pillars:
<b> DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai) </b>
DeepSpeed is a deep learning optimization library that makes distributed training easy,
efficient, and effective.
<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, 3D parallelism of DeepSpeed can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, ZeRO-Offload of DeepSpeed can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training such that many deep learning scientists can explore bigger and better models.
* Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution compared with dense transformers.
* Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 27x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam/LAMB, allowing for scaling to different types of GPU clusters and networks.
Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
DeepSpeed is an important part of Microsoft's new
[AI at Scale](https://www.microsoft.com/en-us/research/project/ai-at-scale/)
initiative to enable next-generation AI capabilities at scale, where you can find more
information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale).
# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up the state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Even so, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
## Distributed, Effective, and Efficient Training with Ease
The DeepSpeed API is a lightweight wrapper on [PyTorch](https://pytorch.org/). This
means that you can use everything you love in PyTorch without learning a new
platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art
training techniques, such as distributed training, mixed precision, gradient
accumulation, and checkpoints so that you can focus on your model development. Most
importantly, you can leverage the distinctive efficiency and effectiveness benefits of
DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch
models.
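As a minimal sketch of what those few lines look like (the model, batch size, and learning rate below are placeholders, and the inline config dict stands in for the usual `ds_config.json` file), a PyTorch model is handed to `deepspeed.initialize`, which returns an engine that manages the techniques listed above:

```python
import torch
import deepspeed

# Placeholder model; any torch.nn.Module works here.
model = torch.nn.Linear(1024, 1024)

# A small DeepSpeed config. It is normally kept in a ds_config.json file,
# but a plain dict can also be passed in (values here are illustrative).
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that manages distributed
# setup, mixed precision, ZeRO partitioning, and checkpointing.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```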
## Speed
DeepSpeed achieves high performance and fast convergence through a combination of
efficiency optimizations on compute/communication/memory/IO and effectiveness
optimizations on advanced hyperparameter tuning and optimizers. For example:
* <span style="color:dodgerblue">DeepSpeed trains BERT-large to parity in 44
mins using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs
(16 DGX-2 boxes).</span>
**BERT-large Training Times**
| Devices | Source | Training Time |
| -------------- | --------- | ---------------------:|
| 1024 V100 GPUs | DeepSpeed | **44** min|
| 256 V100 GPUs | DeepSpeed | **2.4** hr|
| 64 V100 GPUs | DeepSpeed | **8.68** hr|
| 16 V100 GPUs | DeepSpeed | **33.22** hr|
*BERT codes and tutorials will be available soon.*
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art NVIDIA
Megatron on Azure GPUs.
*Read more*: [GPT tutorial](/tutorials/megatron/)
![Three innovation pillars](/assets/images/3pillars.png){: .align-center}
## DeepSpeed-Training
## Memory efficiency
DeepSpeed provides memory-efficient data parallelism and enables training models without
model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on
a single GPU. In comparison, existing frameworks (e.g.,
PyTorch's Distributed Data Parallel) run out of memory with 1.4 billion parameter models.
DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the DeepSpeed-Training pillar. Learn more: [DeepSpeed-Training](/_pages/training)
DeepSpeed reduces the training memory footprint through a novel solution called Zero
Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are
replicated across data-parallel processes, ZeRO partitions model states and gradients to save
significant memory. Furthermore, it also reduces activation memory and fragmented memory.
The current implementation (ZeRO-2) reduces memory by up to
8x relative to the state of the art. You can read more about ZeRO in our [paper](https://arxiv.org/abs/1910.02054), and
in our blog posts related to
[ZeRO-1](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/) and [ZeRO-2](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
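ZeRO is selected and tuned entirely through the `zero_optimization` section of the DeepSpeed config. The sketch below is illustrative rather than a recommended setting; the flags and bucket sizes are tuning knobs documented in the [ZeRO tutorial](https://www.deepspeed.ai/tutorials/zero/):

```python
# ZeRO stage and tuning knobs live under "zero_optimization" in the DeepSpeed
# config (values here are illustrative, not recommendations).
zero_config = {
    "zero_optimization": {
        "stage": 2,                    # 1: optimizer states, 2: + gradients, 3: + parameters
        "contiguous_gradients": True,  # reduce memory fragmentation
        "overlap_comm": True,          # overlap gradient reduction with backward compute
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    }
}
```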
## DeepSpeed-Inference
With this impressive memory reduction, early adopters of DeepSpeed have already
produced a language model (LM) with over 17B parameters called
<a href="https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft">
<span style="color:dodgerblue">Turing-NLG</span></a>,
establishing a new SOTA in the LM category.
DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert, and ZeRO-parallelism, and combines them with high-performance custom inference kernels, communication optimizations, and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput, and cost reduction. This systematic composition of system technologies for inference falls under the DeepSpeed-Inference pillar. Learn more: [DeepSpeed-Inference](/_pages/inference)
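As a hedged sketch of the inference entry point (the Hugging Face GPT-2 model and the parallelism degree below are placeholders), a trained model is wrapped with `deepspeed.init_inference`, which injects the optimized kernels and tensor-model parallelism:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder Hugging Face model; any supported architecture can be injected.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# init_inference applies kernel injection and tensor-model parallelism.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # tensor-parallel degree
    dtype=torch.half,                # run the injected kernels in fp16
    replace_with_kernel_inject=True  # swap in DeepSpeed's fused inference kernels
)
model = ds_engine.module  # use as a regular torch module for generation
```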
For model scientists with limited GPU resources, ZeRO-Offload leverages both CPU and GPU memory for training large models. Using a machine with **a single GPU**, our users can run **models of up to 13 billion parameters** without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models.
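A minimal configuration sketch for ZeRO-Offload, assuming a ZeRO stage-2 setup (the values are illustrative, not tuned recommendations):

```python
# ZeRO-Offload: keep optimizer states and their updates in CPU memory so that
# multi-billion-parameter models fit on a single GPU (illustrative values).
offload_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",    # place optimizer states in CPU memory
            "pin_memory": True  # use pinned host memory for faster transfers
        }
    },
    "fp16": {"enabled": True},
    "train_batch_size": 8
}
```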
## DeepSpeed-Compression
## Scalability
DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their
combinations, which we call 3D parallelism.
* <span style="color:dodgerblue">3D parallelism of DeepSpeed provides system support to run models with trillions of parameters, read more in our [press-release]({{ site.press_release_v3 }}) and [tutorial](/tutorials/pipeline).</span>
* <span style="color:dodgerblue">DeepSpeed can run large models more efficiently, up to 10x
faster for models of
various sizes spanning 1.5B to hundreds of billions of parameters.</span> More specifically, the data parallelism powered by ZeRO
is complementary and can be combined with different types of model parallelism. It allows
DeepSpeed to fit models using a lower degree of model parallelism and a higher batch size, offering
significant performance gains compared to using model parallelism alone.
To further increase inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques that let researchers and practitioners compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. State-of-the-art compression innovations such as ZeroQuant and XTC are also included under the DeepSpeed-Compression pillar. Learn more: [DeepSpeed-Compression](/_pages/compression)
*Read more*: [ZeRO paper](https://arxiv.org/abs/1910.02054),
and [GPT tutorial](/tutorials/megatron).
# DeepSpeed Software Suite
![DeepSpeed Speedup](/assets/images/deepspeed-speedup.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</em>
</p>
## DeepSpeed Library
## Communication efficiency
Pipeline parallelism of DeepSpeed reduces communication volume during distributed training, which allows users to train multi-billion-parameter models 27x faster on clusters with limited network bandwidth.
![Low-bandwidth GPT-2 Performance](/assets/images/pp-lowbw-gpt2.png)
The [DeepSpeed](https://github.com/microsoft/deepspeed) library implements and packages the innovations and technologies of the DeepSpeed Training, Inference, and Compression pillars into a single easy-to-use, open-source repository. It allows for easy composition of a multitude of features within a single training, inference, or compression pipeline. The DeepSpeed library is heavily adopted by the DL community and has been used to enable some of the most powerful models (see [DeepSpeed Adoption](#deepspeed-adoption)).
1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. [1-bit Adam blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [1-bit Adam tutorial](https://www.deepspeed.ai/tutorials/onebit-adam/), [0/1 Adam tutorial](https://www.deepspeed.ai/tutorials/zero-one-adam/), [1-bit LAMB tutorial](https://www.deepspeed.ai/tutorials/onebit-lamb/).
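These optimizers are enabled purely through the DeepSpeed config. The sketch below shows 1-bit Adam with parameter names taken from its tutorial and placeholder values; the 0/1 Adam and 1-bit LAMB tutorials linked above cover the corresponding variants:

```python
# 1-bit Adam runs uncompressed Adam for `freeze_step` warm-up steps, then
# switches to 1-bit compressed communication (illustrative values only).
onebit_adam_config = {
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,         # uncompressed warm-up steps
            "cuda_aware": False,         # set True only with CUDA-aware MPI
            "comm_backend_name": "nccl"  # backend used for compressed communication
        }
    },
    "fp16": {"enabled": True},
    "train_batch_size": 32
}
```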
## Model Implementations for Inference (MII)
## Supporting long sequence length
DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers **an order-of-magnitude longer input sequence** and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.53x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. [Read more here](https://www.deepspeed.ai/2020/09/08/sparse-attention.html).
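A hedged configuration sketch for sparse attention (the key names follow the sparse-attention tutorial; treat the values as illustrative placeholders):

```python
# Sparse attention is enabled via a "sparse_attention" block in the DeepSpeed
# config; "fixed" is one of several built-in sparsity structures.
sparse_attention_config = {
    "sparse_attention": {
        "mode": "fixed",                 # built-in sparsity pattern
        "block": 16,                     # sparsity block size
        "num_local_blocks": 4,           # local attention window per block row
        "num_global_blocks": 1,          # blocks that attend globally
        "different_layout_per_head": True
    }
}
```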
[Model Implementations for Inference (MII)](https://github.com/microsoft/deepspeed-mii) is an open-source repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code while achieving significant latency reductions compared to their vanilla open-source versions.
## DeepSpeed on Azure
## Fast convergence for effectiveness
DeepSpeed supports advanced hyperparameter tuning and large batch size
optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
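A hedged config sketch combining the LAMB optimizer with the 1Cycle schedule (type and parameter names follow the DeepSpeed optimizer/scheduler documentation; the values are placeholders):

```python
# Large-batch training with LAMB plus a 1Cycle learning-rate schedule,
# both selected through the DeepSpeed config (illustrative values).
large_batch_config = {
    "train_batch_size": 4096,
    "optimizer": {
        "type": "Lamb",
        "params": {"lr": 2e-3, "weight_decay": 0.01}
    },
    "scheduler": {
        "type": "OneCycle",
        "params": {
            "cycle_min_lr": 1e-4,
            "cycle_max_lr": 2e-3,
            "cycle_first_step_size": 1000  # steps in the increasing phase
        }
    }
}
```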
DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest way to get started. The recommended path is through the AzureML [recipes](https://github.com/Azure/azureml-examples/tree/main/python-sdk/workflows/train/deepspeed). The job submission and data preparation scripts have been made available [here](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/azureml). For more details on how to use DeepSpeed on Azure, please follow the [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/).
*Read more*: [Tuning tutorial](/tutorials/one-cycle).
# DeepSpeed Adoption
DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):
## Good Usability
Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to 13 billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA's Megatron-LM.
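To make those few lines of code changes concrete, here is a hedged sketch of the resulting training loop, assuming `model_engine` was returned by `deepspeed.initialize` as in the earlier sketch and `train_loader` is a placeholder dataloader:

```python
import torch
import torch.nn.functional as F

# Assumes `model_engine` came from deepspeed.initialize (see the sketch above)
# and `train_loader` is any iterable of (inputs, labels) tensor pairs.
for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)

    outputs = model_engine(inputs)      # forward pass, unchanged
    loss = F.mse_loss(outputs, labels)  # placeholder loss

    model_engine.backward(loss)         # replaces loss.backward()
    model_engine.step()                 # replaces optimizer.step() and zero_grad()

# Launched with the DeepSpeed launcher, e.g.:
#   deepspeed train.py --deepspeed --deepspeed_config ds_config.json
```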
* [Megatron-Turing NLG (530B)](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* [Jurassic-1 (178B)](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
* [BLOOM (176B)](https://huggingface.co/blog/bloom-megatron-deepspeed)
* [YaLM (100B)](https://github.com/yandex/YaLM-100B)
* [GPT-NeoX (20B)](https://github.com/EleutherAI/gpt-neox)
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
## Features
| | Documentation |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| <img src="assets/images/transformers-light.png" width="300px"> | [Transformers with DeepSpeed](https://huggingface.co/docs/transformers/main/main_classes/deepspeed) |
| <img src="assets/images/accelerate-light.png" width="300px">| [Accelerate with DeepSpeed](https://huggingface.co/docs/accelerate/main/en/deepspeed) |
| <img src="assets/images/lightning-light.svg" width="250px"> | [Lightning with DeepSpeed](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.strategies.DeepSpeedStrategy.html) |
| <img src="assets/images/mosaicml.svg" width="250px"> | [MosaicML with DeepSpeed](https://docs.mosaicml.com/en/v0.8.0/trainer/using_the_trainer.html?highlight=deepspeed#deepspeed-integration) |
Below we provide a brief feature list; see our detailed [feature overview](https://www.deepspeed.ai/features/) for descriptions and usage.
DeepSpeed is an integral part of [Microsoft's AI at Scale initiative](https://www.microsoft.com/en-us/research/project/ai-at-scale/) to enable next-generation AI capabilities at scale.
* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
* 16-bit mixed precision
* Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Pipeline Parallelism](https://www.deepspeed.ai/tutorials/pipeline/)
* 3D Parallelism
* [The Zero Redundancy Optimizer](https://www.deepspeed.ai/tutorials/zero/)
* Optimizer State and Gradient Partitioning
* Activation Partitioning
* Constant Buffer Optimization
* Contiguous Memory Optimization
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
* Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
* [Training Features](https://www.deepspeed.ai/features/#training-features)
* Simplified training API
* Gradient Clipping
* Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
* Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* CPU-Adam
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* [Progressive Layer Dropping](https://www.deepspeed.ai/2020/10/28/progressive-layer-dropping-news.html)
* Efficient and robust compressed training
* Up to 2.5x convergence speedup for pre-training
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
# Contributing
DeepSpeed welcomes your contributions! Please see our