add --bind_cores_to_rank to zero offload tutorial (#7474)

In ZeRO-Offload, significant time is spent in CPUAdam, which runs on the CPU. Using `--bind_cores_to_rank` in the deepspeed launch command therefore helps improve ZeRO-Offload performance. This PR adds the switch to the ZeRO-Offload tutorial to increase user awareness. For Qwen2.5-3B finetuning on 2 A100-40GB cards, on a host with 128 CPU cores, the average step times are as follows, a nearly 1.3x improvement:

- without `--bind_cores_to_rank`: 3084.44 ms per step
- with `--bind_cores_to_rank`: 2383.16 ms per step

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
@@ -74,4 +74,11 @@ Finally, here is a screenshot of `htop` showing host CPU and memory activity during
<img src="/assets/images/zero_offload_dp1_10B_cpu.png">
</a>

### CPU Adam perf tuning

ZeRO-Offload already supports multi-GPU training. If the workload uses a CPU optimizer, it can be tuned further by passing `--bind_cores_to_rank` to the deepspeed launch command (see the sketch after this list). This switch does two things:

1. Divides the physical CPU cores evenly among the ranks, so that each rank has a dedicated set of CPU cores on which to run the CPU optimizer.
2. Sets the `OMP_NUM_THREADS` environment variable to the number of CPU cores assigned to each rank, so that the OpenMP code in the CPU optimizer runs with near-optimal performance.
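
As a concrete illustration, here is a minimal sketch of a launch command with the switch enabled; the script name `finetune.py` and the config file `ds_config.json` are placeholders rather than files from this tutorial:

```bash
# Minimal sketch: launch 2 ranks with core binding enabled.
# On a 128-core host, each rank is bound to a dedicated set of 64 cores,
# and the launcher sets OMP_NUM_THREADS accordingly for each rank.
deepspeed --num_gpus 2 --bind_cores_to_rank finetune.py --deepspeed_config ds_config.json
```

On the 128-core, 2-GPU host from the commit message above, this split gives each rank 64 dedicated cores for CPUAdam.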

ZeRO-Offload is a hybrid workload that is heavy on both the GPU and the CPU, and DeepSpeed is optimized for both. Refer to [How to launch DeepSpeed on Intel Architecture CPU](https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/accelerator-setup-guide.md#how-to-launch-deepspeed-on-intel-architecture-cpu) for more details on how to tune core bindings for CPU performance.
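
For finer control than an even split over all host cores, the launcher also accepts an explicit core list via `--bind_core_list`; treat this flag and the core ranges below as an assumption for illustration, and check the linked guide for the exact options in your DeepSpeed version:

```bash
# Assumed sketch: restrict binding to an explicit core list
# (the ranges are illustrative, not a recommendation); the listed
# cores are divided among the ranks instead of all host cores.
deepspeed --num_gpus 2 --bind_cores_to_rank --bind_core_list 0-55,64-119 \
    finetune.py --deepspeed_config ds_config.json
```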
Congratulations! You have completed the ZeRO-Offload tutorial.