With autocast enabled, most weights are downcast before being used in calculations. Today zero3_compile gathers the FP32 weights before they are downcast. That is sub-optimal because the FP32 weights consume more allgather bandwidth and take more time to downcast. To reduce both communication and downcast time, this PR fuses the downcast into the allgather in the dc ops: the target dtype is now passed to allgather_param() and prefetch_params_fused(), which downcast the (partial) weights before launching the allgathers (a rough sketch of the pattern is at the end of this description). This corresponds to issue 1 of #7577.

Tested with https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 (run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3, LLAMA2_DEPTH = 4 for faster compilation) on 5090s (which have limited inter-GPU bandwidth): time per step decreases from 438 ms to 337 ms, and peak GPU memory usage drops from 9.5 GB to 8.5 GB.

Profiles of a single step before this PR:

<img width="1235" height="1029" alt="image" src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054" />
<img width="1466" height="616" alt="image" src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663" />

After this PR:

<img width="1218" height="1006" alt="image" src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691" />
<img width="1537" height="559" alt="image" src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b" />

This PR also reduces peak memory usage because today's `fast_free_schedule()` always places param allgathers and downcasts at the beginning of the graph. While the original FP32 params can be freed early, all of the FP16/BF16-cast params are kept in GPU memory from the beginning of the backward graph, leading to a higher memory peak.

P.S. Probably due to the organization's branch rule settings, I can't find anywhere to allow reviewers to modify the branch, so I'll update the branch per reviewers' comments and rebase if needed.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
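---

For reviewers skimming the diff, the fused pattern looks roughly like the sketch below. This is a minimal illustration, not the actual dc-op code: the helper name, the flat 1-D shard layout, and the direct use of torch.distributed are assumptions made for clarity.

```python
import torch
import torch.distributed as dist

def allgather_param_fused(shard: torch.Tensor,
                          target_dtype: torch.dtype,
                          group=None) -> torch.Tensor:
    """Gather a ZeRO-3 parameter shard, downcasting before communication.

    Hypothetical helper: the real change passes the target dtype into
    allgather_param() / prefetch_params_fused() inside the dc ops.
    """
    world_size = dist.get_world_size(group)
    # Downcast the local (partial) weights first, so the allgather moves
    # FP16/BF16 bytes instead of FP32 -- half the communication volume,
    # and no full-size FP32 tensor left to downcast afterwards.
    shard_lp = shard.to(target_dtype)
    gathered = torch.empty(world_size * shard_lp.numel(),
                           dtype=target_dtype,
                           device=shard_lp.device)
    dist.all_gather_into_tensor(gathered, shard_lp, group=group)
    return gathered
```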