Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed (#495)

mirror of https://github.com/volcengine/verl.git synced 2025-10-20 13:43:50 +08:00

This PR combines multiple modifications.

# QWen2.5 checkpoint saver bug fix

Thanks to @uygnef for the work contributed in #368. We use the new
saver as both the model loader and saver to support 3D parallelism.

# Megatron backend 3D-parallelism test benches

We modified the scripts in `examples/ppo_trainer` and `tests/e2e`, as well
as the CI workflows; all have been tested.

# Bug fixes for 3D-parallelism

The fixes cover configuration bugs as well as module packing.

The original TP `VocabParallelEntropy` can lead to CUDA OOM, so we
refactored the implementation with `torch.bmm`.
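
As a rough illustration of the math behind a vocabulary-parallel entropy, here is a minimal sketch that simulates tensor-parallel shards with plain Python lists. It is not verl's implementation: the function name `sharded_entropy` and the list-based "ranks" are hypothetical, and the cross-rank reductions are ordinary Python sums instead of collective operations. The per-shard dot product in step 3 is the kind of reduction that a batched matrix multiply such as `torch.bmm` can compute without materializing the full elementwise product, which is presumably where the memory saving comes from.

```python
import math


def sharded_entropy(logit_shards):
    """Entropy of softmax(logits) when the vocabulary is split across shards.

    logit_shards: one list of logits per simulated TP rank (hypothetical layout).
    """
    # Step 1: global max for numerical stability
    # (would be an all-reduce MAX across TP ranks).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each rank sums exp(z - max) over its own vocab slice;
    # the partial sums are then added (an all-reduce SUM).
    local_exp_sums = [
        sum(math.exp(z - global_max) for z in shard) for shard in logit_shards
    ]
    total_exp = sum(local_exp_sums)

    # Step 3: each rank computes its partial dot product <softmax, logits>;
    # the partial results are added (another all-reduce SUM).
    local_dots = [
        sum(math.exp(z - global_max) / total_exp * z for z in shard)
        for shard in logit_shards
    ]
    expected_logit = sum(local_dots)

    # H = logsumexp(z) - E_p[z]
    return global_max + math.log(total_exp) - expected_logit
```

The result does not depend on how the vocabulary is split: calling the function with `[[1.0, 2.0], [3.0, 4.0]]` and with `[[1.0, 2.0, 3.0, 4.0]]` gives the same entropy up to floating-point rounding.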

# Full migration to Megatron Core

verl now uses only Megatron Core and no longer calls any other
components. If those are needed, please integrate them into
`utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>

Blue Space

2025-03-07 13:38:58 +08:00

committed by

GitHub

parent cb97d077e7

commit 35555d8ae9

23 changed files with 596 additions and 197 deletions

5

.gitignore vendored

@@ -118,3 +118,8 @@ tests/e2e/toy_examples/deepspeed/synchronous/output.txt
 # data
 *.parquet
 # local logs
 logs
 log
