Mirror of https://github.com/volcengine/verl.git (synced 2025-10-20 13:43:50 +08:00)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Allow overriding the transformer config to enable custom Megatron features such as a variable PP layer distribution, with CI tests. This is needed for larger MoE models with 94 layers (Qwen3 MoE) or 61 layers (DeepSeek V3).

We will first fix the e2e_prime CI by using fused kernels.

**Note that the imbalanced PP layer distribution is currently only compatible with dist_ckpt load and save; direct HuggingFace load/save is not supported.**

Other Megatron arguments can also be passed through the scripts.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

Breaking APIs:

```py
class MegatronWorker(Worker):
    def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):
        # and the models building
```

```yaml
actor:
  megatron:
    override_transformer_config: {}  # common transformer config for all models
```

To avoid entering the same transformer config arguments repeatedly, the other models reuse the actor's config, so it only needs to be specified once. (A minimal sketch of how such an override dict is applied is appended after the submission checklist below.)

### Usage Example

```bash
run_ppo_trainer_megatron.sh \
    +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
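To make the override flow concrete, here is a minimal sketch, not the PR's actual code, of how an `override_transformer_config` dict could be applied on top of the transformer config derived from the HF config. `TransformerConfig` below is a stand-in for Megatron-Core's `TransformerConfig`, and `apply_transformer_config_overrides` is a hypothetical helper name:

```py
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TransformerConfig:  # stand-in for Megatron-Core's TransformerConfig
    num_layers: int = 24
    num_layers_in_first_pipeline_stage: Optional[int] = None
    num_layers_in_last_pipeline_stage: Optional[int] = None


def apply_transformer_config_overrides(tf_config: TransformerConfig, overrides: dict) -> TransformerConfig:
    """Overwrite selected fields of tf_config with user-supplied override values."""
    valid_fields = {f.name for f in fields(tf_config)}
    for key, value in overrides.items():
        if key not in valid_fields:
            raise ValueError(f"unknown transformer config field: {key}")
        setattr(tf_config, key, value)  # overrides win over the HF-derived defaults
    return tf_config


# The two Hydra overrides from the usage example above arrive as a plain dict:
tf_config = apply_transformer_config_overrides(
    TransformerConfig(num_layers=24),
    {"num_layers_in_first_pipeline_stage": 13, "num_layers_in_last_pipeline_stage": 11},
)
print(tf_config)
```

Any key accepted by the underlying transformer config can be passed this way, which is how the CI jobs below forward flags such as `account_for_embedding_in_pipeline_split` without dedicated config entries.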
.github/workflows/e2e_ppo_trainer_megatron.yml (YAML, 309 lines, 16 KiB):
name: e2e_ppo_trainer_megatron

# latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch
  push:
    branches:
      - main
      - v0.*
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe/**"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      # Entrypoints
      - ".github/workflows/e2e_ppo_trainer_megatron.yml"
      - "examples/data_preprocess/gsm8k.py"
      - "tests/e2e/run_ppo_trainer_megatron.sh"
      - "verl/trainer/main_ppo.py"
      - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare read-only permissions on repository contents
permissions:
  contents: read

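# The jobs below run a minimal GSM8K PPO/GRPO training loop with the Megatron backend on
# 8 L20 GPUs, covering Qwen2.5, Qwen3 and DeepSeek models, mismatched train/infer tensor
# parallelism, and the override_transformer_config feature introduced in this PR.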
jobs:
  e2e_ppo_trainer_megatron-qwen:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with validation and saving
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) after resuming
        run: |
          ray stop --force
          RESUME_MODE=auto bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Test Megatron checkpoints merging function (Qwen Actor and Critic)
        run: |
          exp_name="qwen2.5-0.5b-megatron-gsm8k-minimal"
          python scripts/model_merger.py test --backend megatron --tie-word-embedding --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface --hf_model_path Qwen/Qwen2.5-0.5B
          python scripts/model_merger.py test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface --hf_model_path Qwen/Qwen2.5-0.5B
      - name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

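  # Same end-to-end pipeline as the Qwen job above, but for deepseek-ai/deepseek-coder-1.3b-instruct;
  # note that the checkpoint-merging test below omits --tie-word-embedding for this model.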
  e2e_ppo_trainer_megatron-deepseek:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
        run: |
          ray stop --force
          SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek) after resuming
        run: |
          ray stop --force
          RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
        run: |
          exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
          python scripts/model_merger.py test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface --hf_model_path deepseek-ai/deepseek-coder-1.3b-instruct
          python scripts/model_merger.py test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface --hf_model_path deepseek-ai/deepseek-coder-1.3b-instruct
      - name: clean up
        run: |
          rm -rf checkpoints

  e2e_ppo_trainer_megatron-qwen3:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) with validation and saving
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 MODEL_ID=Qwen/Qwen3-0.6B bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) after resuming
        run: |
          ray stop --force
          RESUME_MODE=auto MODEL_ID=Qwen/Qwen3-0.6B bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Test Megatron checkpoints merging function (Qwen3 Actor and Critic)
        run: |
          exp_name="qwen3-0.6b-megatron-gsm8k-minimal"
          python scripts/model_merger.py test --backend megatron --tie-word-embedding --hf_model_path Qwen/Qwen3-0.6B --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
          python scripts/model_merger.py test --backend megatron --is-value-model --hf_model_path Qwen/Qwen3-0.6B --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
      - name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3)
        run: |
          ray stop --force
          ADV_ESTIMATOR=grpo MODEL_ID=Qwen/Qwen3-0.6B bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

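  # The next two jobs run with training and rollout tensor-parallel sizes that differ
  # (TRAIN_TP vs INFER_TP); the second one additionally uses the tied-embedding Qwen2.5-1.5B.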
  e2e_ppo_trainer_megatron-different-train-infer-tp-qwen:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with train tp > infer tp
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=2 INFER_TP=1 bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with train tp < infer tp
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=1 INFER_TP=2 bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

  e2e_ppo_trainer_megatron-different-train-infer-tp-qwen-tie-embedding:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with tie-embedding Megatron (Qwen) with train tp > infer tp
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=2 INFER_TP=1 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with train tp < infer tp
        run: |
          ray stop --force
          VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=1 INFER_TP=2 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: clean up
        run: |
          rm -rf checkpoints

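  # This job exercises the override_transformer_config feature added in this PR: an uneven PP
  # layer split (8 layers in the first and 4 in the last of PP=4 stages) for Qwen2.5-0.5B.
  # Uneven splits only work with Megatron dist_ckpt load/save, so the HF checkpoint is first
  # converted with scripts/converter_hf_to_mcore.py and HF-format saving is skipped
  # (SKIP_SAVE_HF_MODEL=1).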
  e2e_ppo_trainer_megatron-qwen-override-transformer-config:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Prepare dist_ckpt of Qwen2.5-0.5B, uneven layer distribution only supports dist_ckpt
        run: |
          python3 scripts/converter_hf_to_mcore.py --hf_model_path ${HOME}/models/Qwen/Qwen2.5-0.5B --output_path checkpoints/verl-test/qwen2.5-0.5b-megatron
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
        run: |
          ray stop --force
          SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 SKIP_SAVE_HF_MODEL=1 bash tests/e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=8 +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=4 actor_rollout_ref.actor.megatron.use_dist_checkpointing=true actor_rollout_ref.actor.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron actor_rollout_ref.ref.megatron.use_dist_checkpointing=true actor_rollout_ref.ref.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron critic.megatron.use_dist_checkpointing=true critic.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron reward_model.megatron.use_dist_checkpointing=true reward_model.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron
          cp -r checkpoints checkpoints-dut
          SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 bash tests/e2e/run_ppo_trainer_megatron.sh
      - name: Test Megatron checkpoints merging function (Qwen Actor and Critic)
        run: |
          exp_name="qwen2.5-0.5b-megatron-gsm8k-minimal"
          python scripts/model_merger.py test --backend megatron --tie-word-embedding --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface --hf_model_path Qwen/Qwen2.5-0.5B
          python scripts/model_merger.py test --backend megatron --is-value-model --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface --hf_model_path Qwen/Qwen2.5-0.5B
      - name: clean up
        run: |
          rm -rf checkpoints checkpoints-dut

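  # This job passes additional Megatron transformer-config flags through override_transformer_config:
  # account_for_embedding_in_pipeline_split and account_for_loss_in_pipeline_split make the PP=2
  # split count the embedding and loss layers as pipeline-stage contents.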
  e2e_ppo_trainer_megatron-deepseek-override-transformer-config:
    runs-on: [L20x8]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    container:
      image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install --no-deps -e .[test]
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
        run: |
          ray stop --force
          SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=2 COMMON_VPP=null bash tests/e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=true +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=true
      - name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
        run: |
          exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
          python scripts/model_merger.py test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface --hf_model_path deepseek-ai/deepseek-coder-1.3b-instruct
          python scripts/model_merger.py test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface --hf_model_path deepseek-ai/deepseek-coder-1.3b-instruct
      - name: clean up
        run: |
          rm -rf checkpoints