
Study of ZenFlow and ZeRO offload performance with DeepSpeed CPU core binding

TL;DR: ZenFlow is an improvement to ZeRO Offload contributed to DeepSpeed by Tingfeng Lan et al. After testing this feature, we explored the relationship between ZenFlow performance and DeepSpeed CPU core binding.

ZenFlow technology introduction

ZenFlow is a recent improvement to ZeRO Offload implemented in DeepSpeed. Its primary goal is to address the GPU stalls caused by ZeRO Offload. These stalls mainly originate from two sources: 1) the data transfer from the GPU to the CPU, which is limited by PCIe bandwidth, and 2) the computational overhead of executing the Adam optimizer on the CPU, which is constrained by CPU performance and memory bandwidth.

The core idea of ZenFlow is to separate gradients into two groups based on their norm. A very small portion of gradients with larger norms are classified as important gradients and are updated directly on the GPU. The vast majority of gradients, which have smaller norms, are used to update the weights on the CPU at a lower frequency than the important gradients. If these gradients are not scheduled for an update in the current training iteration, they are accumulated into a separate gradient buffer, and the accumulated gradients are then used for the weight update in a subsequent iteration.
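To make the idea concrete, below is a minimal sketch of this split, assuming a simple element-wise magnitude criterion as a stand-in for ZenFlow's norm-based importance ranking. This is our own illustration, not the DeepSpeed implementation; the topk_ratio name mirrors the config key used later in this post:

import torch

def split_and_accumulate(grad, cpu_accum, topk_ratio=0.1):
    # Illustrative sketch: keep the largest-magnitude gradient entries for an
    # immediate GPU update and accumulate the rest for a deferred CPU update.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * topk_ratio))
    _, topk_idx = torch.topk(flat.abs(), k)      # indices of "important" entries

    important = torch.zeros_like(flat)
    important[topk_idx] = flat[topk_idx]         # updated on the GPU right away

    rest = flat.clone()
    rest[topk_idx] = 0
    cpu_accum += rest.cpu()                      # accumulated until the next CPU update

    return important.view_as(grad), cpu_accum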

Furthermore, the weight updates on the CPU are designed to run in parallel with the computations on the GPU, further reducing GPU stalls.

To parallelize weight updates on the CPU with GPU computations, ZenFlow creates an additional process for each rank. This dedicated process handles the weight updates, while the original process of each rank continues executing GPU computation code. This design enables concurrency between weight updates and GPU computations. In addition to these optimizations, ZenFlow also performs CPU core binding for the weight update processes: it binds the update processes of different ranks to distinct CPU cores to improve CPU performance.
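As an illustration of this split between the main process and a dedicated update worker, here is a simplified sketch under our own assumptions; it is not ZenFlow's actual code, and cpu_optimizer_step is a hypothetical stand-in for the CPU Adam step:

import os
import torch
import torch.multiprocessing as mp

def cpu_update_worker(queue, core_ids):
    # Per-rank worker process that applies the deferred CPU optimizer steps.
    os.sched_setaffinity(0, core_ids)        # pin the worker to its own cores
    torch.set_num_threads(len(core_ids))     # size the CPU thread pool to match
    while True:
        item = queue.get()
        if item is None:                     # sentinel sent when training ends
            break
        params, accumulated_grads = item
        cpu_optimizer_step(params, accumulated_grads)   # hypothetical CPU Adam step

# In each rank's main (GPU) process:
#   ctx = mp.get_context("spawn")
#   work_queue = ctx.Queue()
#   ctx.Process(target=cpu_update_worker, args=(work_queue, worker_cores)).start()
#   # forward/backward keeps running on the GPU while the worker drains the queue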

DeepSpeed CPU core binding feature and its improvement to CPU offloading performance

This reminded us that DeepSpeed itself supports CPU core binding through the --bind_cores_to_rank flag. This switch was originally designed to improve multi-socket CPU inference performance. By binding cores, different workers can run on distinct CPU cores without interfering with each other, thereby improving locality. Additionally, DeepSpeed's core binding feature automatically configures the OMP_NUM_THREADS environment variable so that the OpenMP thread pool size matches the number of allocated cores.
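Conceptually, the effect is similar to the following sketch (our own simplified illustration, not DeepSpeed's launcher code), which slices the physical cores evenly across the local ranks and sizes the OpenMP thread pool accordingly:

def cores_for_rank(local_rank, local_world_size, physical_cores):
    # Evenly slice the machine's physical cores across the local ranks.
    per_rank = len(physical_cores) // local_world_size
    start = local_rank * per_rank
    return physical_cores[start:start + per_rank]

# Example: 128 physical cores (2x EPYC 7742) and 2 ranks
cores = list(range(128))
rank0 = cores_for_rank(0, 2, cores)   # cores 0-63,   OMP_NUM_THREADS=64
rank1 = cores_for_rank(1, 2, cores)   # cores 64-127, OMP_NUM_THREADS=64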

This raised a question: Could this switch also benefit ZeRO Offload? We conducted tests to explore this possibility.

Improvement to ZeRO Offload performance from DeepSpeed CPU core binding

Configuration       Avg. time of first 51 iterations (1st run)   2nd run     3rd run     Average
No core binding     2707.32ms                                    3127.24ms   2826.04ms   2887ms
With core binding   2649.06ms                                    2641.82ms   2200.76ms   2497ms

Model: Qwen2.5-3B

Test environment: 2xDGX-A100-SXM4-40GB, 2xAMD EPYC 7742 64-Core Processor, 1TB memory

Test code: DeepSpeedExamples/training/DeepSpeed-ZenFlow/finetuning (all subsequent tests use the same example)

Test command:

  • No core binding: deepspeed --num_gpus=2 finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zo_config.json --num_train_epochs 1
  • With core binding: deepspeed --num_gpus=2 --bind_cores_to_rank finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zo_config.json --num_train_epochs 1

Config file (zo_config.json):

{
    "train_batch_size": 8,
    "bf16": { "enabled": true },
    "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
      }
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": 2e-5,
        "betas": [0.9, 0.999],
        "eps": 1e-8,
        "weight_decay": 0.01
      }
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true,
    "wall_clock_breakdown": true
}

From this data, DeepSpeed's core binding provides approximately a 15% performance improvement for ZeRO Offload. So, could it also benefit ZenFlow's performance? With this question in mind, we decided to comment out the core binding logic within ZenFlow and instead directly use the --bind_cores_to_rank flag to run ZenFlow:

Improvement to ZenFlow performance from DeepSpeed CPU core binding

Configuration            Avg. time from iterations 5-51 (1st run)   2nd run     3rd run     Average
ZenFlow core binding     1337.66ms                                  1443.87ms   1475.04ms   1419ms
DeepSpeed core binding   1233.6ms                                   1228.36ms   1235ms      1232ms

Model: Qwen2.5-3B

Test environment: 2xDGX-A100-SXM4-40GB, 2xAMD EPYC 7742 64-Core Processor, 1TB memory

DeepSpeed commit: 1d7b90adc4

ZenFlow uses the first 4 iterations to compute gradient importance, so we measure time starting from the 5th iteration.

Test command:

  • No core binding: deepspeed --num_gpus=2 finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zf_config.json --num_train_epochs 1
  • With core binding: deepspeed --num_gpus=2 --bind_cores_to_rank finetune_llama.py --model_name Qwen/Qwen2.5-3B --output_dir output --lr 2e-5 --batch_size 8 --deepspeed_config zf_config.json --num_train_epochs 1

Config file (zf_config.json):

{
    "train_batch_size": 8,
    "bf16": { "enabled": true },
    "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
      },
      "zenflow": {
            "topk_ratio": 0.1,
            "update_interval": 4,
            "full_warm_up_rounds": 0,
            "overlap_step": true,
            "pt_reserved_cores_perc": 0.5
        }
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": 2e-5,
        "betas": [0.9, 0.999],
        "eps": 1e-8,
        "weight_decay": 0.01
      }
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true
}

We observed a performance improvement of approximately 15% from DeepSpeed CPU core binding compared to ZenFlow's own core binding. Why did this happen?

Our improvements to ZenFlow CPU core binding mechanism

After communicating with the authors of ZenFlow, we gained a new understanding of the core binding mechanism required by ZenFlow.

First, the ZenFlow worker processes need to use a dedicated set of CPU cores, separate from those used by the main process of each rank. Second, the ZenFlow workers and the main processes should be bound to different physical cores, avoiding binding to virtual cores (hyper-threads). Third, the OpenMP thread pool size should be appropriately set to match the number of cores allocated to the ZenFlow workers.

In the original ZenFlow implementation, all cores (including the virtual cores corresponding to physical cores) were used for core binding, meaning the workers were not properly isolated at the physical core level. In contrast, DeepSpeed's core binding specifically binds processes to physical cores only, which explains the performance improvement we observed.
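On Linux, one generic way to enumerate physical cores while skipping hyper-thread siblings (our own sketch, not DeepSpeed's or ZenFlow's code) is to read the CPU topology from sysfs:

import glob
import re

def physical_cores():
    # Return one logical CPU id per physical core by collapsing the
    # hyper-thread siblings reported in sysfs.
    seen, cores = set(), []
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        cpu_id = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            siblings = f.read().strip()     # e.g. "0,128" on an SMT-enabled EPYC
        if siblings not in seen:            # keep only the first sibling per core
            seen.add(siblings)
            cores.append(cpu_id)
    return sorted(cores)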

Based on this understanding, we collaborated with the ZenFlow authors to update its core binding mechanism.

First, before each rank launches a ZenFlow worker process, it needs to enumerate the list of available physical cores. If these lists of physical cores differ across ranks, it indicates that DeepSpeed has already performed physical core binding. Otherwise, each rank needs to allocate its own list of available cores from the total pool.

Finally, each rank allocates a subset of cores from its own list to the ZenFlow worker process and sets the corresponding OMP_NUM_THREADS environment variable. This ensures that all processes use distinct CPU cores, preventing interference, and allows the OpenMP thread pool size to be configured properly.
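The sketch below outlines this allocation logic under our own assumptions: the lists_differ_across_ranks flag stands in for an all-gather comparison of the ranks' affinity masks, the core list is assumed to contain physical cores only (e.g. as enumerated in the earlier sketch), and reserved_perc mirrors the pt_reserved_cores_perc config entry above. See the DeepSpeed source for the actual implementation:

import os

def allocate_zenflow_worker_cores(local_rank, local_world_size,
                                  reserved_perc=0.5,
                                  lists_differ_across_ranks=False):
    # Physical cores currently available to this rank's main process.
    my_cores = sorted(os.sched_getaffinity(0))

    if not lists_differ_across_ranks:
        # No launcher-level binding: carve the shared pool into per-rank slices.
        per_rank = len(my_cores) // local_world_size
        my_cores = my_cores[local_rank * per_rank:(local_rank + 1) * per_rank]

    # Reserve part of the slice for the main (GPU) process and hand the rest
    # to the ZenFlow worker; size the OpenMP thread pool to match.
    n_reserved = int(len(my_cores) * reserved_perc)
    worker_cores = my_cores[n_reserved:]
    os.environ["OMP_NUM_THREADS"] = str(len(worker_cores))
    return worker_cores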

Under this new core binding mechanism, we re-evaluated the performance of ZenFlow:

ZenFlow perf. with new core binding mechanism

Configuration                                              Avg. time from iterations 5-51 (1st run)   2nd run     3rd run     Average   Improvement over original binding
New ZenFlow worker core binding                            1321.21ms                                   1269.83ms   1384.47ms   1325ms    7%
DeepSpeed core binding + new ZenFlow worker core binding   1111.68ms                                   1125.38ms   1111.91ms   1116ms    10%

Model: Qwen2.5-3B

Test environment: 2xDGX-A100-SXM4-40GB, 2xAMD EPYC 7742 64-Core Processor, 1TB memory

DeepSpeed commit: 80033a8293

The results indicate that ZenFlow's performance was further enhanced under the new core binding mechanism. Compared to the original binding method, performance improved by 7% when not using DeepSpeed's core binding. When DeepSpeed's core binding was enabled, the performance gain reached 10%.

Why does DeepSpeed binding still provide an additional performance boost on top of the new ZenFlow binding?

We initially hypothesized that it might be because DeepSpeed uses numactl, which can bind a process to a specific NUMA node, ensuring the process always accesses local memory. However, upon examining the DeepSpeed logs, we found that numactl's -m (memory binding) switch was not used at runtime. Furthermore, when we replaced numactl with taskset, we still observed the performance improvement.

Our current conjecture is that the difference lies in how the binding is applied. numactl (and taskset in this context) operates at the process level, applying the binding to the entire process from the start. In contrast, ZenFlow's binding is applied within the code at the point of use. This distinction in the scope and timing of the binding application could be the source of the performance difference. This point may require more detailed investigation in the future.
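To illustrate the distinction (our own sketch): with launcher-level binding the restricted affinity is in place before the process allocates any memory or creates any threads, whereas in-code binding only takes effect once the call is reached:

import os

# Launcher-level binding (numactl/taskset style): the entire process starts
# with the restricted affinity, so thread pools and allocations created at
# import time are already placed on the chosen cores.
#   taskset -c 0-63 python finetune_llama.py ...

# In-code binding: threads or memory created before this call may already
# live on other cores or NUMA nodes.
os.sched_setaffinity(0, range(0, 64))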

Regardless, the key finding remains: the new ZenFlow binding mechanism improves performance irrespective of whether DeepSpeed binding is used. This conclusively demonstrates the effectiveness of physical core isolation for performance.

We conducted a comparative analysis of the performance across several configurations: ZeRO Offload without core binding, ZeRO Offload with core binding, and ZenFlow both before and after our improvements. The results are summarized as follows:

Perf comparison table

Configuration                                              Average time   Perf. improv. vs. baseline
ZeRO Offload without core binding (baseline)               2887ms         1x
ZeRO Offload with DeepSpeed core binding                   2497ms         1.16x
ZenFlow with original worker core binding                  1419ms         2.03x
DeepSpeed core binding + new ZenFlow worker core binding   1116ms         2.59x

Model: Qwen2.5-3B

Test environment: 2xDGX-A100-SXM4-40GB, 2xAMD EPYC 7742 64-Core Processor, 1TB memory

The result clearly shows that the improved ZenFlow achieves a 2.59x speedup compared to ZeRO Offload without core binding, and a 2.24x speedup compared to ZeRO Offload with core binding.

Given that ZenFlow's core innovations involve reducing the frequency of weight updates and parallelizing CPU/GPU execution, the 2.24x improvement over the core-bound ZeRO Offload is particularly significant. This comparison provides a more accurate reflection of ZenFlow's inherent performance advantages. By using the core-bound ZeRO Offload as the baseline, we effectively isolate and quantify the performance gains attributable specifically to ZenFlow's algorithmic optimizations, rather than those coming from general core-binding techniques. This strongly validates the effectiveness of ZenFlow's fundamental design.

Through our collaboration with the ZenFlow authors, the new core-binding mechanism has been integrated into the main branch of DeepSpeed. As a result, users can now achieve optimal offload performance by simply using ZenFlow in conjunction with the DeepSpeed --bind_cores_to_rank flag. This integration provides an out-of-the-box, high-performance experience that leverages the combined strengths of both the algorithmic innovations in ZenFlow and the low-level system optimizations in DeepSpeed's core binding.

A practicality metric for evaluating offloading technology

In addition to comparisons with ZeRO Offload, a comparison against training without offloading better demonstrates the practicality of ZenFlow or ZeRO Offload. ZeRO Offload and ZenFlow make it possible to fine-tune models with relatively limited VRAM, turning an impossible task into a possible one; however, if the performance gap versus non-offloaded training is too large, choosing to offload becomes a dilemma. We therefore define a practicality metric as the ratio of per-iteration time without offloading to per-iteration time with offloading. A value of 1 (100%) is the ideal case, meaning offloading has no performance impact; the smaller the value, the poorer the practicality, since users must wait considerably longer for fine-tuning.
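Expressed as a small helper (our own phrasing of the metric defined above), with times taken from the table that follows:

def practicality(time_without_offload_ms, time_with_offload_ms):
    # Ratio of per-iteration time without offloading to time with offloading:
    # 1.0 (100%) means offloading costs nothing; smaller means less practical.
    return time_without_offload_ms / time_with_offload_ms

print(f"{practicality(240, 1365):.1%}")   # ZeRO Offload -> 17.6%
print(f"{practicality(240, 569):.1%}")    # ZenFlow      -> 42.2%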

Since we couldn't run Qwen2.5-3B with ZeRO2 using the same config on two GPUs in our test environment, we conducted the practicality test using Qwen2.5-1.5B instead:

Practicality test

Configuration                                              Average time   Practicality metric
ZeRO2 (no offloading, baseline)                            240ms          100%
ZeRO Offload with DeepSpeed core binding                   1365ms         17.6%
DeepSpeed core binding + new ZenFlow worker core binding   569ms          42.2%

Model: Qwen2.5-1.5B

Test environment: 2xDGX-A100-SXM4-40GB, 2xAMD EPYC 7742 64-Core Processor, 1TB memory

Based on the tests conducted on 2xA100 GPUs, the practicality metric for ZeRO Offload was 17.6%, while ZenFlow achieved a practicality metric of 42.2%. This result demonstrates that ZenFlow significantly improves the practicality of offloading.

Summary

ZeRO Offload is an effective technique for reducing VRAM pressure, making the fine-tuning of large models possible. We have now seen that ZenFlow, as a new technology, achieves a breakthrough improvement in the practicality of ZeRO Offload, bringing it to a usable level. When combined with DeepSpeed's core binding, ZenFlow is able to deliver its optimal performance.

Disclaimer

All performance data presented in this article was measured for the sole purpose of discussing the effects of specific optimization techniques. There is no guarantee that the data was obtained under optimal software or hardware configurations, nor does it represent a performance evaluation of any software or hardware products mentioned. This article discusses only the relative performance changes resulting from specific optimization methods. The performance gain depends on the specific software and hardware configuration and may vary in your own runs.