mirror of
https://github.com/volcengine/verl.git
synced 2025-10-20 13:43:50 +08:00
Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed (#495)
This PR combines multiple modifications.

# QWen2.5 checkpoint saver bug fix

Thanks to @uygnef for the work contributed in #368. We use the new saver for the model loader and saver, for 3D parallelism support.

# Megatron backend 3D-parallelism test benches

We modified the scripts in `examples/ppo_trainer` and `tests/e2e`, as well as the CI workflows; all are tested.

# Bug fixes for 3D parallelism

These include configuration bugs as well as the module packing. The original TP VocabParallelEntropy can lead to CUDA OOM, so we refactored the implementation with `torch.bmm`.

# Full migration to Megatron Core

verl now uses only Megatron Core and no longer calls other components. If they are needed, please integrate them into `utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>
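As a rough illustration of the entropy computation that the VocabParallelEntropy note refers to, the sketch below computes the entropy of a softmax when the vocabulary axis is split into shards, using only per-shard partial results. It is a plain-Python stand-in under stated assumptions, not the code from this PR: `sharded_entropy` and `dense_entropy` are hypothetical names, each shard stands in for one tensor-parallel rank, and the `max`/`sum` over shards stand in for all-reduce calls. The per-shard inner product of `exp` values and logits is the kind of operation a batched matrix product such as `torch.bmm` can express; how verl arranges that call is not shown in this excerpt.

```python
import math


def sharded_entropy(logit_shards):
    """Entropy of softmax(logits) when the vocab axis is split into shards.

    Assumption: each inner list stands in for the logits held by one
    tensor-parallel rank; reductions over shards stand in for all-reduce.
    """
    # Global max for numerical stability (stand-in for all-reduce MAX).
    global_max = max(max(shard) for shard in logit_shards)

    sum_exp = 0.0              # partition function Z (all-reduce SUM)
    sum_exp_times_logit = 0.0  # sum_i exp(x_i) * x_i (all-reduce SUM)
    for shard in logit_shards:
        shifted = [x - global_max for x in shard]
        exps = [math.exp(x) for x in shifted]
        sum_exp += sum(exps)
        # Inner product of exp values with shifted logits for this shard.
        sum_exp_times_logit += sum(e * x for e, x in zip(exps, shifted))

    # H = log Z - E_p[x], derived from H = -sum_i p_i log p_i.
    return math.log(sum_exp) - sum_exp_times_logit / sum_exp


def dense_entropy(logits):
    """Reference implementation on the unsharded vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)
```

The sharded form only ever reduces three scalars per row across shards (max, sum of exponentials, and the weighted sum), so no rank needs the full-vocabulary probability vector.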
5
.gitignore
vendored
@@ -118,3 +118,8 @@ tests/e2e/toy_examples/deepspeed/synchronous/output.txt

# data
*.parquet


# local logs
logs
log