Minor fix in the SuperOffload blog (#7612)
Polish SuperOffload blog post; minor grammar and style fixes

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
@@ -9,10 +9,9 @@
---
## Table of Content
## Table of Content <!-- omit in toc -->
- [SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips](#superoffload-unleashing-the-power-of-large-scale-llm-training-on-superchips)
- [Table of Content](#table-of-content)
- [SuperOffload Highlights](#superoffload-highlights)
- [Introduction](#introduction)
- [How SuperOffload Works](#how-superoffload-works)
@@ -21,9 +20,10 @@
- [3. Superchip-Aware Casting](#3-superchip-aware-casting)
- [4. GraceAdam for Optimizer Efficiency](#4-graceadam-for-optimizer-efficiency)
- [Experience and Insights](#experience-and-insights)
- [Easy-to-Use](#easy-to-use)
- [Getting Started](#getting-started)
- [Acknowledgements](#acknowledgements)
- [BibTeX](#bibtex)
---
## SuperOffload Highlights
@@ -39,9 +39,9 @@
The emergence of tightly coupled heterogeneous GPU/CPU architectures (a.k.a. Superchips), such as NVIDIA GH200, GB200, and AMD MI300A, offers new optimization opportunities for large-scale AI. Yet how to make the best use of this new hardware for large-scale LLM training remains under-explored. Existing offloading solutions were designed for traditional loosely coupled architectures and are suboptimal on Superchips, suffering high overheads and low GPU utilization. To address this gap and to make the best use of Superchips for efficient LLM training, we have developed and open-sourced **SuperOffload**.
SuperOffload introduces a set of novel techniques that make the best use of Hopper GPU, Grace CPU, and NVLink-C2C, simultaneously for LLM training. Unlike prior offloading solutions which assume slow GPU-CPU interconnects (e.g., 64GB/sec for PCIe-Gen4), SuperOffload exploits the much faster interconnects (e.g., 900GB/sec for NVLink-C2C) to boost GPU and CPU utilization, and training throughput. With SuperOffload, models such as **GPT-OSS-20B**, **Qwen3-14B**, and **Phi-4** can be fully fine-tuned on a single GH200, delivering **up to 600 TFLOPS** training throughput under modest settings (sequence length 4k, batch size 4). This delivers up to **4×** higher throughput compared to prior work such as ZeRO-Offload. SuperOffload enables scaling to even larger models, including Qwen3-30B-A3B and Seed-OSS-36B on two GH200s and Llama-70B on four GH200s.
SuperOffload introduces a set of novel techniques that make the best use of Hopper GPU, Grace CPU, and NVLink-C2C, simultaneously for LLM training. Unlike prior offloading solutions which assume slow GPU-CPU interconnects (e.g., 64GB/sec for PCIe-Gen4), SuperOffload exploits the much faster interconnects (e.g., 900GB/sec for NVLink-C2C) to boost GPU and CPU utilization, and training throughput. With SuperOffload, models such as **GPT-OSS-20B**, **Qwen3-14B**, and **Phi-4** can be fully fine-tuned on a single GH200, delivering up to **600 TFLOPS** training throughput under modest settings (sequence length 4k, batch size 4). This delivers up to **4×** higher throughput compared to prior work such as ZeRO-Offload. SuperOffload enables scaling to even larger models, including Qwen3-30B-A3B and Seed-OSS-36B on two GH200s and Llama-70B on four GH200s.
SuperOffload is built on top of DeepSpeed ZeRO Stage 3, and available in DeepSpeed versions >= [0.18.0](https://github.com/deepspeedai/DeepSpeed/releases/tag/v0.18.0). To enable easy integration into LLM finetuning pipelines, SuperOffload is compatible with Hugging Face Transformers and does not require any changes to modeling code.
SuperOffload is built on top of DeepSpeed ZeRO Stage 3, and is available in DeepSpeed versions >= [0.18.0](https://github.com/deepspeedai/DeepSpeed/releases/tag/v0.18.0). To enable easy integration into LLM finetuning pipelines, SuperOffload is compatible with Hugging Face Transformers and does not require any changes to modeling code.
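To make the "no modeling-code changes" point concrete, the sketch below shows a plain Hugging Face `Trainer` fine-tuning script in which DeepSpeed (and hence SuperOffload) enters only through a JSON config file. This is an illustrative sketch rather than the tutorial's exact script: the model name, toy dataset, and `ds_config.json` path are placeholders, and the SuperOffload switch itself lives inside that config (see the Getting Started section below).

```python
# Sketch: standard Hugging Face fine-tuning where DeepSpeed/SuperOffload is
# pulled in purely via a config file, leaving the modeling code untouched.
# Model name, toy data, and ds_config.json are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "Qwen/Qwen3-14B"          # one of the models cited above (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny inline dataset so the sketch is self-contained; real runs tokenize a corpus
# and usually mask padding positions in the labels with -100.
texts = ["DeepSpeed SuperOffload fine-tuning example."] * 32
enc = tokenizer(texts, padding="max_length", max_length=64, return_tensors="pt")
train_dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": ids}
    for ids, mask in zip(enc["input_ids"], enc["attention_mask"])
]

args = TrainingArguments(
    output_dir="superoffload-out",
    per_device_train_batch_size=4,     # the "modest settings" mentioned above
    bf16=True,
    deepspeed="ds_config.json",        # ZeRO-3 config carrying the SuperOffload switch
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

Launching with the `deepspeed` launcher (or `accelerate`) is unchanged from any other ZeRO Stage 3 run.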
<!-- Recent models, especially MoE, at the scale of tens to hundreds of billions of parameters, make fine-tuning on limited GPUs difficult. Offloading to CPU memory helps reduce GPU demand but typically assumes GPU-CPU connections over PCIe, which is bandwidth-limited (e.g., 32 GB/s on PCIe-Gen4). Thus, prior work mainly optimizes data transfers to avoid PCIe becoming a major performance bottleneck. However, hardware vendors are introducing a new class of tightly coupled architectures—such as NVIDIA GH200, GB200, and AMD MI300A—that challenge these long-standing assumptions.
@@ -50,7 +50,7 @@ The open-source release of **SuperOffload** addresses this gap by providing a se
<!-- Built on top of ZeRO Stage 3, SuperOffload enables scaling to even larger models, including Qwen3-30B-A3B, Seed-OSS-36B on two GH200s and Llama-70B on four GH200s. All of this is supported natively through Hugging Face Transformers and DeepSpeed, with no need for changes to modeling code. -->
<div align="center">
<img src="./images/superoffload_comparision.jpg" alt="SuperOffload system overview" width="80%">
<img src="./images/superoffload_comparison.jpg" alt="SuperOffload system overview" width="90%">
<p align="center"><em>Figure 1: SuperOffload delivers up to 4× higher throughput than ZeRO-Offload for large-model fine-tuning across varying sequence lengths and batch sizes, achieving up to 600 TFLOPS throughput.</em></p>
</div>
@@ -93,7 +93,7 @@ SuperOffload improves optimizer efficiency beyond STV by partitioning optimizer
### 3. Superchip-Aware Casting
In mixed precision training with offloading, tensor transfers between GPU and CPU require casting between the low-precision format on GPU (e.g., BF16, FP16, etc.) and the high-precision format on CPU (i.e., FP32). To address the bandwidth limitations of PCIe interconnects, prior offloading solutions transfer tensors in low-precision and type cast tensors on both GPU and CPU as appropriate. However, this is a sub-optimal strategy on Superchip architectures because GPU compute throughput is ~100X higher than CPU, and high-bandwidth interconnects (e.g., NVLink-C2C) makes the transfer costs negligible. As illustration, Figure 4 below shows that the optimal strategy on GH200 is tensor casting on the GPU and transferring in high-precision format.
In mixed precision training with offloading, tensor transfers between GPU and CPU require casting between the low-precision format on GPU (e.g., BF16, FP16, etc.) and the high-precision format on CPU (i.e., FP32). To address the bandwidth limitations of PCIe interconnects, prior offloading solutions transfer tensors in low-precision and type cast tensors on both GPU and CPU as appropriate. However, this is a suboptimal strategy on Superchip architectures because GPU compute throughput is ~100X higher than CPU, and high-bandwidth interconnects (e.g., NVLink-C2C) makes the transfer costs negligible. As an illustration, Figure 4 below shows that the optimal strategy on GH200 is tensor casting on the GPU and transferring in high-precision format.
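As a rough way to see this trade-off outside of DeepSpeed, the micro-benchmark sketch below (our illustration, with an arbitrary 256 MB gradient tensor and iteration count) times the two orders of operations: transfer in low precision and cast on the CPU, versus cast on the GPU and transfer in full precision. On a PCIe-attached GPU the first variant usually wins; the claim above is that on a Superchip the second one does.

```python
# Illustrative micro-benchmark (not DeepSpeed's internal code) contrasting the two
# casting strategies described above. Tensor size and iteration count are arbitrary.
import time
import torch

def time_it(fn, iters=20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# ~256 MB of BF16 gradients living on the GPU.
grad_bf16 = torch.randn(256 * 1024 * 1024 // 2, dtype=torch.bfloat16, device="cuda")

def cast_on_cpu():
    # PCIe-era approach: ship the small BF16 tensor, upcast on the (slower) CPU.
    cpu_bf16 = grad_bf16.to("cpu")
    return cpu_bf16.float()

def cast_on_gpu():
    # Superchip-aware approach: upcast on the GPU, ship FP32 over the fast interconnect.
    gpu_fp32 = grad_bf16.float()
    return gpu_fp32.to("cpu")

print(f"transfer BF16, cast on CPU: {time_it(cast_on_cpu) * 1e3:.1f} ms")
print(f"cast on GPU, transfer FP32: {time_it(cast_on_gpu) * 1e3:.1f} ms")
```

DeepSpeed makes this decision internally; the sketch only illustrates why the Superchip-aware choice differs from the PCIe-era one.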
<!-- - Mixed precision training involves casting tensors between low precision data types (e.g., FP16, BF16, etc.) and full precision FP32.
- On superchips with high CPU↔GPU bandwidth, casting cost matters.
@@ -108,7 +108,7 @@ In mixed precision training with offloading, tensor transfers between GPU and CP
### 4. GraceAdam for Optimizer Efficiency
Existing offloading solutions for LLM training require CPU implementations of the popular Adam optimizer, such as PyTorch Adam and DeepSpeed CPU-Adam. However, these are inadequate for Superchips because they are not optimized for the Grace CPU architecture. To address this issue, we created GraceAdam, a highly efficient Adam optimizer implementation for Grace CPUs. GraceAdam achieves high performance exploiting the underlying ARM architecture features such as Scalable Vector Extension (SVE), explicit memory hierarchy management, and instruction-level parallelism. Figure 5 below shows that on GH200 Superchip, GraceAdam is 3X faster than PyTorch Adam (PT-CPU) and 1.3X faster than CPU-Adam. To our knowledge, GraceAdam is the first open sourced Adam optimizer implementation for Grace CPU.
Existing offloading solutions for LLM training require CPU implementations of the popular Adam optimizer, such as PyTorch Adam and DeepSpeed CPU-Adam. However, these are inadequate for Superchips because they are not optimized for the Grace CPU architecture. To address this issue, we created GraceAdam, a highly efficient Adam optimizer implementation for Grace CPUs. GraceAdam achieves high performance exploiting the underlying ARM architecture features such as Scalable Vector Extension (SVE), explicit memory hierarchy management, and instruction-level parallelism. Figure 5 below shows that on GH200 Superchip, GraceAdam is 3× faster than PyTorch Adam (PT-CPU) and 1.3× faster than CPU-Adam. To our knowledge, GraceAdam is the first open sourced Adam optimizer implementation for Grace CPU.
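To give a feel for the work GraceAdam optimizes, the sketch below times the "PT-CPU" style baseline from Figure 5: a stock `torch.optim.Adam` step over FP32 master parameters held in CPU memory. The parameter count and hyperparameters are arbitrary, and GraceAdam's own API is deliberately not shown here since this is only a baseline illustration.

```python
# Sketch of the CPU-side optimizer cost that GraceAdam targets: one full Adam step
# over offloaded FP32 master parameters, using stock PyTorch ("PT-CPU" in Figure 5).
import time
import torch

num_params = 200_000_000            # ~200M FP32 master parameters kept in CPU memory
params = [torch.zeros(num_params, dtype=torch.float32, requires_grad=True)]
params[0].grad = torch.randn(num_params, dtype=torch.float32)

opt = torch.optim.Adam(params, lr=1e-4)
opt.step()                          # warm-up: allocates exp_avg / exp_avg_sq state

start = time.perf_counter()
for _ in range(5):
    opt.step()
print(f"PT-CPU Adam step: {(time.perf_counter() - start) / 5:.3f} s")
```

Figure 5 compares implementations of this same per-step update (PT-CPU, CPU-Adam, GraceAdam) on the Grace CPU.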
<div align="center">
<img src="./images/superoffload_grace_adam.png" alt="GraceAdam" width="80%">
@@ -127,7 +127,7 @@ Existing offloading solutions for LLM training require CPU implementations of th
- **MPAM (Memory System Resource Partitioning and Monitoring):**
Reduces interference between CPU and GPU tasks.
**How to enable MPAM on Nvidia Superchips:**
**How to enable MPAM on NVIDIA Superchips:**
1. Install the kernel from [NVIDIA NV-Kernels](https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11).
2. Check MPAM support:
```bash
@@ -153,7 +153,7 @@ Existing offloading solutions for LLM training require CPU implementations of th
mkdir /sys/fs/resctrl/p1 /sys/fs/resctrl/p2
```
5. Set CPU cores & memory configs (example from experiments):
```
```bash
/sys/fs/resctrl/p1/cpus_list:
0-6
/sys/fs/resctrl/p2/cpus_list:
@@ -168,18 +168,18 @@ Existing offloading solutions for LLM training require CPU implementations of th
---
## Easy-to-Use
## Getting Started
End-to-end finetuning examples using SuperOffload are available in our tutorial/readme: [DeepSpeedExamples: SuperOffload](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-SuperOffload#readme). To enable SuperOffload quickly, add the following switch to your DeepSpeed config (see tutorial for full context):
<div align="center">
<img src="./images/superoffload_enable.jpg" alt="Enable SuperOffload" width="60%">
<p align="center"><em>Figure 5: Enable SuperOffload with a single line in the DeepSpeed config.</em></p>
<p align="center"><em>Figure 6: Enable SuperOffload with a single line in the DeepSpeed config.</em></p>
</div>
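For orientation, the snippet below sketches the overall shape of a ZeRO Stage 3 offload configuration that such a switch slots into. This is our illustrative sketch, not the tutorial's exact file: the SuperOffload-specific key appears only in Figure 6 and the linked tutorial, so it is left as a comment rather than spelled out here.

```python
# Rough shape of a ZeRO Stage 3 offload config (illustrative sketch only).
# The one-line SuperOffload switch from Figure 6 is NOT reproduced here --
# copy its exact key from the DeepSpeedExamples tutorial.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        # + the SuperOffload switch from Figure 6 / the tutorial goes here
    },
}

# The dict can be passed directly to TrainingArguments(deepspeed=ds_config)
# or written out as the ds_config.json referenced earlier in this post.
```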
Tip: On superchip platforms (e.g., GH200/GB200/MI300A), combine NUMA binding and MPAM settings from "Experience and Insights" to stabilize bandwidth and improve end-to-end performance.
Tip: On Superchip platforms (e.g., GH200/GB200/MI300A), combine NUMA binding and MPAM settings from "Experience and Insights" to stabilize bandwidth and improve end-to-end performance.
<!-- ## Status & Availability
@@ -191,7 +191,7 @@ Community feedback and contributions are welcome. For enablement and examples, s
## Acknowledgements
This work is a close collaboration among [University of Illinois Urbana-Champaign (UIUC)](https://supercomputing-system-ai-lab.github.io/), [AnyScale](https://www.anyscale.com/), and [Snowflake](https://www.snowflake.com/en/blog/authors/snowflake-ai-research/).
This work is a close collaboration among [University of Illinois Urbana-Champaign (UIUC)](https://supercomputing-system-ai-lab.github.io/), [Anyscale](https://www.anyscale.com/), and [Snowflake](https://www.snowflake.com/en/blog/authors/snowflake-ai-research/).
We also gratefully acknowledge William Gropp, Brett Bode, and Gregory H. Bauer from the National Center for Supercomputing Applications (NCSA), as well as Dan Ernst, Ian Karlin, Giridhar Chukkapalli, Kurt Rago, and others from NVIDIA for their valuable discussions and guidance on MPAM support on Grace CPU.
@@ -199,7 +199,7 @@ Community feedback and contributions are welcome. For enablement and examples, s
---
## BibTeX
## BibTeX <!-- omit in toc -->
```bibtex
@inproceedings{superoffload,