Verl's Megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed (#495)

This PR combines multiple modifications.

# Qwen2.5 checkpoint saver bug fix

Thanks to @uygnef for the work contributed in #368. We use the new
saver in the model loader and saver to support 3D parallelism.

# Megatron backend 3D-parallelism test benches

We modified the scripts in `examples/ppo_trainer` and `tests/e2e`, as well
as the CI workflows; all of them have been tested.

# Bug Fix for 3D-parallelism

This includes fixes for configuration bugs as well as for module packing.

The original TP VocabParallelEntropy can lead to CUDA OOM, so we refactored
the implementation with `torch.bmm`.
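
For readers unfamiliar with vocab-parallel entropy, here is a minimal sketch of the underlying math, assuming the vocabulary logits are split across tensor-parallel ranks. It is not the verl implementation and does not use `torch.bmm`: plain Python lists stand in for per-rank shards, and the built-in `max` and `sum` stand in for all-reduce operations. The function names `vocab_parallel_entropy` and `reference_entropy` are hypothetical.

```python
import math


def vocab_parallel_entropy(logit_shards):
    """Entropy of softmax(logits) when the vocabulary is split into shards.

    logit_shards: list of lists, one inner list per (simulated) TP rank.
    """
    # Simulated all-reduce(max): global maximum for numerical stability.
    global_max = max(max(shard) for shard in logit_shards)

    # Each rank computes two partial sums over its own vocabulary shard.
    local_sum_exp = []    # sum of exp(x - max)
    local_sum_exp_x = []  # sum of exp(x - max) * (x - max)
    for shard in logit_shards:
        shifted = [x - global_max for x in shard]
        exps = [math.exp(s) for s in shifted]
        local_sum_exp.append(sum(exps))
        local_sum_exp_x.append(sum(e * s for e, s in zip(exps, shifted)))

    # Simulated all-reduce(sum) of the partial sums.
    z = sum(local_sum_exp)
    weighted = sum(local_sum_exp_x)

    # H = log Z - sum_i p_i * (x_i - max), with p_i = exp(x_i - max) / Z.
    return math.log(z) - weighted / z


def reference_entropy(logits):
    """Unsharded reference: H = -sum_i p_i * log(p_i)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]  # two simulated TP ranks
    print(vocab_parallel_entropy(shards))
    print(reference_entropy([1.0, 2.0, 3.0, 4.0]))
```

Only a few scalars per token need to be exchanged between ranks in this formulation. How the per-rank partial sums are computed in memory is a separate concern, and that is where the refactor described above applies.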

# Full migration to Megatron Core

verl now uses only Megatron Core and no longer calls any other
components. If they are needed, please integrate them into
`utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>
Blue Space
2025-03-07 13:38:58 +08:00
committed by GitHub
parent cb97d077e7
commit 35555d8ae9
23 changed files with 596 additions and 197 deletions

5
.gitignore vendored

@@ -118,3 +118,8 @@ tests/e2e/toy_examples/deepspeed/synchronous/output.txt
# data
*.parquet
# local logs
logs
log