### Checklist Before Starting
- [ ] Search for similar PR(s).
### What does this PR do?
Allow overriding the transformer config to enable custom Megatron
features such as a variable PP layer distribution, with CI tests. This
is needed for larger MoE models with 94 layers (Qwen3 MoE) or 61 layers
(DeepSeek V3).
We will first fix the e2e_prime CI by using fused kernels.
**Note that the imbalanced PP layer distribution is currently only
compatible with dist_ckpt load and save; direct HuggingFace load/save
is not supported.**
Other Megatron arguments can also be passed through scripts.
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### API
Breaking APIs:
```py
class MegatronWorker(Worker):
def _init_hf_config_and_tf_config(self, model_path, dtype, override_model_config, override_transformer_config):
# and the models building
```
```yaml
actor:
megatron:
override_transformer_config: {} # common transformer config for all models
```
To avoid entering the same transformer config arguments repeatedly, the
other models reuse the actor's config, so it only needs to be specified once.
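As an illustration of what applying `override_transformer_config` might look like, here is a minimal sketch; the `TransformerConfig` stand-in and the `apply_overrides` helper are hypothetical, not verl's actual implementation:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TransformerConfig:
    """Minimal stand-in for Megatron's TransformerConfig (illustrative)."""
    num_layers: int = 24
    num_layers_in_first_pipeline_stage: Optional[int] = None
    num_layers_in_last_pipeline_stage: Optional[int] = None


def apply_overrides(tf_config, overrides):
    """Apply override_transformer_config entries, rejecting unknown keys."""
    valid = {f.name for f in fields(tf_config)}
    for key, value in overrides.items():
        if key not in valid:
            raise ValueError(f"unknown transformer config key: {key}")
        setattr(tf_config, key, value)
    return tf_config


# Mirror the Qwen3-MoE-style usage example below: 94 layers, uneven PP stages.
cfg = apply_overrides(
    TransformerConfig(num_layers=94),
    {"num_layers_in_first_pipeline_stage": 13,
     "num_layers_in_last_pipeline_stage": 11},
)
```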
### Usage Example
```bash
run_ppo_trainer_megatron.sh \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=13 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11
```
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
### Checklist Before Starting
- [ ] Search for similar PR(s).
### What does this PR do?
Update images and fix the SGLang installation. The latest image:
`whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3`
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
- vLLM: 0.8.5.post1
- SGLang: 0.4.6.post4, fix installation
- Megatron: core_v0.12.0 announcement
- TransformerEngine: 2.3
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```python
# Add code snippet or script demonstrating how to use this
```
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]
### Checklist Before Submitting
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
This PR refactors `model_merge`, making the code cleaner and more
maintainable:
- The verl checkpoint manager now saves the model config and
processor/tokenizer (introduced in
https://github.com/volcengine/verl/pull/1288), so `hf_model_path` is no
longer needed. This PR deprecates the argument and keeps it only for
backward compatibility.
- The current `model_merge` serves two purposes: merging checkpoints and
testing checkpoints (mainly for CI). This PR separates them into two
sub-commands to better manage user input arguments for an improved user
experience.
- Generally cleans up the code.
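As a sketch of the sub-command split, assuming an argparse-style CLI; the command names and flags below are illustrative, not verl's actual interface:

```python
import argparse

# Hypothetical model_merge CLI with the two purposes split into sub-commands.
parser = argparse.ArgumentParser(prog="model_merge")
sub = parser.add_subparsers(dest="command", required=True)

# "merge": combine sharded checkpoints into a HuggingFace-format model.
merge = sub.add_parser("merge", help="merge sharded checkpoints")
merge.add_argument("--local_dir", required=True)
merge.add_argument("--target_dir", required=True)

# "test": compare merged weights against a reference (mainly for CI).
test = sub.add_parser("test", help="verify a merged checkpoint")
test.add_argument("--local_dir", required=True)
test.add_argument("--test_hf_dir", required=True)

args = parser.parse_args(["merge", "--local_dir", "ckpt", "--target_dir", "out"])
```

Separate sub-parsers let each mode declare only the arguments it actually needs, instead of one flat namespace with conditionally-required flags.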
### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.
The current CI should test this PR correctly.
### Additional Info.
- **Training**: both
- **Inference**: none
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
1. This PR eliminates the micro-dp group, as the article says, and
supports different train and inference TP sizes.
2. Side effect: Qwen3-MoE can now run on Megatron, aligned with FSDP.
3. CI tests have been added to verify the effect.
### High-Level Design
This PR eliminates the micro-dp group, as the article says: since the
`generate_sequence` process only involves the inference engine, there is
no need to consider the training side.
The only remaining problem is that the `dispatch/collect` functions
cannot directly use the inference parallel size, so the current solution
defines a new `MEGATRON_ALL_DP` dispatch method that views all ranks as
data-parallel ranks, the same as FSDP.
We then follow FSDP's approach to pre/post-process the data.
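The dispatch/collect idea can be sketched as follows; `dispatch_all_dp` and `collect_all_dp` are illustrative names, not the actual verl functions:

```python
def dispatch_all_dp(batch, world_size):
    """Split a batch evenly across all ranks, treating every rank as a DP rank."""
    chunk = len(batch) // world_size
    return [batch[i * chunk:(i + 1) * chunk] for i in range(world_size)]


def collect_all_dp(chunks):
    """Concatenate the per-rank outputs back into a single batch."""
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


batch = list(range(16))
chunks = dispatch_all_dp(batch, world_size=4)  # 4 chunks of 4 items each
restored = collect_all_dp(chunks)
```

Because every rank is treated as a DP rank, the split is independent of the training-side parallel layout, which is what allows the train and inference TP sizes to differ.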
### Specific Changes
Mainly in `megatron_vllm.py`
### API
None
### Usage Example
```sh
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
# or
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
```
### Test
Added CI tests.
For e2e test with Qwen 2.5 7B, please refer to
`examples/grpo_trainer/run_qwen2_5-7b_math_megatron_diff_tp.sh`.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: vLLM
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
Fix head_dim in GQA models when loading from a HuggingFace checkpoint.
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
- Change how the q and kv head_dim values are obtained so they are
compatible with GQA.
- Add the conversions of q_layernorm and k_layernorm in
`convert_megatron_model_to_transformers_model` for Qwen3.
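A sketch of the GQA-aware dimension computation; the field names follow HuggingFace config conventions, and the helper itself is illustrative rather than the exact code in this PR:

```python
def get_attention_dims(cfg):
    """Derive per-head dim and q/kv projection sizes in a GQA-aware way.

    Some checkpoints (e.g. Qwen3) set `head_dim` explicitly rather than
    implying it via hidden_size // num_attention_heads.
    """
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
    q_size = cfg.num_attention_heads * head_dim
    kv_size = cfg.num_key_value_heads * head_dim  # smaller than q_size under GQA
    return head_dim, q_size, kv_size


class Cfg:
    # Illustrative GQA values, not taken from a specific model card
    hidden_size = 4096
    num_attention_heads = 32
    num_key_value_heads = 8
    head_dim = 128


dims = get_attention_dims(Cfg())
```

The key point is that q and kv use different head counts, so assuming a single `hidden_size // num_heads` dimension for both breaks on GQA checkpoints.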
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```python
# Add code snippet or script demonstrating how to use this
```
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue #1510
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
---------
Signed-off-by: ShareLer <ShareLe@163.com>
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
Update the mcore image to use a vLLM version that supports Qwen3, and
rewrite the conda-based installation instructions.
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
Docker image and docs
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```python
# Add code snippet or script demonstrating how to use this
```
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: both
- **Inference**: both
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
Temporarily use the CPU to initialize larger models for the HuggingFace
to dist_ckpt conversion.
Also support GQA MoE models.
This may not require CI, as the function can run independently of verl,
but the current solution may need it.
Fix bugs related to #1165.
The Megatron backend reward model has no CI test; add one to the current
PPO trainer.
Fix `micro_batch_size_per_gpu`, though it is unclear whether this is
right for the reward config.
The output format is also not right with the current
`forward_micro_batch` implementation.
### Summary
#### Minimize Test Workloads
This PR minimizes the test workloads while keeping them meaningful,
reducing the time cost of a test from >10 min to 1~2 min. Specifically,
we
1. set batch sizes and steps to small but still meaningful numbers:
```bash
train_traj_micro_bsz_per_gpu=2 # b
n_resp_per_prompt=4 # g
train_traj_micro_bsz=$((train_traj_micro_bsz_per_gpu * NUM_GPUS)) # b * n
train_traj_mini_bsz=$((train_traj_micro_bsz * 2)) # 2 * b * n
train_prompt_mini_bsz=$((train_traj_mini_bsz / n_resp_per_prompt)) # 2 * b * n / g
train_prompt_bsz=$((train_prompt_mini_bsz * 2)) # 4 * b * n / g
# ...
TOT_TRAIN_STEPS=${TOT_TRAIN_STEPS:-1}
```
2. disable validation (this costs a lot!), saving, and resuming for
training tests by default, leaving them to specialized tests
```bash
# Validation
VAL_BEFORE_TRAIN=${VAL_BEFORE_TRAIN:-False}
TEST_FREQ=${TEST_FREQ:--1}
# Save & Resume
RESUME_MODE=${RESUME_MODE:-disable}
SAVE_FREQ=${SAVE_FREQ:--1}
```
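The batch-size relations from step 1 can be checked with a quick worked example (b = per-GPU micro batch of trajectories, n = NUM_GPUS, g = responses per prompt; the values below are illustrative):

```python
# Worked example of the batch-size relations in the inline comments above.
b, n, g = 2, 8, 4  # illustrative values

train_traj_micro_bsz = b * n                      # b * n     = 16
train_traj_mini_bsz = 2 * train_traj_micro_bsz    # 2*b*n     = 32
train_prompt_mini_bsz = train_traj_mini_bsz // g  # 2*b*n / g = 8
train_prompt_bsz = 2 * train_prompt_mini_bsz      # 4*b*n / g = 16
```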
#### Improve Triggering Mode
This PR introduces more comprehensive triggering logic. Specifically,
we
1. consider all Python code by default
2. include related entrypoints (the workflow config, scripts used by it
and hydra config, etc.)
3. exclude unrelated Python code from other components (e.g., recipes,
examples, Megatron, SFT, generation, evaluation, etc. for FSDP training)
An example from `e2e_ppo_trainer`:
```yaml
on:
paths:
- "**/*.py"
# Entrypoints
- ".github/workflows/e2e_ppo_trainer.yml"
- "examples/data_preprocess/gsm8k.py"
- "examples/data_preprocess/geo3k.py"
- "tests/e2e/ppo_trainer"
- "verl/trainer/main_ppo.py"
- "verl/trainer/config/ppo_trainer.yaml"
- "!examples"
- "!verl/trainer/main_*.py"
- "!verl/trainer/fsdp_sft_trainer.py"
# Recipes
- "!recipe"
# Megatron
- "!verl/workers/**/megatron_*.py"
```
#### Avoid Missing Errors
Some test scripts didn't end with the main Python command and could
therefore miss its error.
To address this, this PR introduces the following options:
```bash
set -xeuo pipefail
```
These options mean:
- `x`: Print each command before executing it (useful for debugging)
- `e`: Exit immediately if any command fails (returns a non-zero exit
status)
- `u`: Treat unset variables as an error
- `o pipefail`: Make a pipeline return the exit status of the rightmost
command that failed, or zero if all commands succeeded
Together, these options make the script fail fast and provide verbose
output, which helps with debugging and ensuring the script doesn't
continue after encountering errors.
#### Others
Besides, we also
1. unify runner labels into `"L20x8"` to enable preemptive scheduling of
jobs
2. consolidate test scripts with minimal differences, grouped by
entrypoint (e.g. `ppo_trainer`, `ppo_megatron_trainer`, recipes, etc.),
into a base script with options