### What does this PR do?
Refactor the profiler CI into a unified scheme.
TODO:
- nsys use `save_path`
- nsys discrete tests are disabled
- torch profiler
cc: @davidmlw
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
Global profiler config:
```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```
Local profiler config:
```yaml
profiler:
  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig
  # Profiler tool; defaults to profiler.tool in the global config.
  # Choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}
  # Whether to enable profiling on the critic
  enable: False
  # Whether to profile all ranks
  all_ranks: False
  # The ranks that will be profiled: [] or [0,1,...]
  ranks: []
  # Path where profiling results are saved
  save_path: ${oc.select:global_profiler.save_path,null}
  # Tool-specific config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```
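For reference, a minimal sketch of how such a section could be turned into a dataclass via `verl.utils.omega_conf_to_dataclass` (named in the config comment above); the file name and surrounding wiring are illustrative, not verl's actual entry point:
```python
# Illustrative only: load a YAML config and instantiate the profiler dataclass.
from omegaconf import OmegaConf

from verl.utils import omega_conf_to_dataclass  # import path assumed from the comment above

cfg = OmegaConf.load("ppo_trainer.yaml")  # hypothetical config file
profiler_cfg = omega_conf_to_dataclass(cfg.profiler)

if profiler_cfg.enable:
    ranks = "all" if profiler_cfg.all_ranks else profiler_cfg.ranks
    print(f"Profiling ranks {ranks} with {profiler_cfg.tool}; saving to {profiler_cfg.save_path}")
```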
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
> Remove the redundant `get_custom_reward_fn` function.
### High-Level Design
> None.
### Specific Changes
> "from verl.trainer.ppo.reward import get_custom_reward_fn" instead of
'get_custom_reward_fn' function in verl/recipe/dapo/main_dapo.py
verl/recipe/r1/main_eval.py verl/recipe/spin/main_spin.py
verl/verl/trainer/main_eval.py verl/verl/trainer/main_eval.py
> remove 'get_custom_reward_fn' function in
verl/verl/trainer/main_ppo.py
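A minimal sketch of the resulting call sites (the `config` argument is assumed from the trainer's usage):
```python
# Call sites now import the shared helper instead of redefining it locally.
from verl.trainer.ppo.reward import get_custom_reward_fn

reward_fn = get_custom_reward_fn(config)  # `config` is the trainer's OmegaConf config (assumed)
```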
### Additional Info.
- **Issue Number**: Fixes issue [#1716](https://github.com/volcengine/verl/issues/1716).
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
Because CI is slow, the checkpoint-related features and fixes are combined into this single PR.
# Add layer idx to decoder layers
It is hard to attach a "correct" global layer number to each layer: in verl's current Megatron implementation, each pp and vpp rank's layers start from index 0, which is inconvenient for the merging tool.
The difficulty mainly comes from the `torch.nn.ModuleList` implementation, which [suggests, and in fact forces, using the index directly rather than a custom layer number](8a40fca9a1/torch/nn/modules/container.py (L302C5-L324C66)).
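To illustrate the constraint: `nn.ModuleList` registers its children under positional keys, so checkpoint keys always carry the local index.
```python
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(2, 2), nn.Linear(2, 2)])
# State-dict keys are positional: ['0.weight', '0.bias', '1.weight', '1.bias'].
# There is no supported way to rename them to a global layer number in place.
print(list(layers.state_dict().keys()))
```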
The current solution is to rewrite the layer number to the actual number, offset by the pp and vpp rank, when saving a Megatron checkpoint, and to recover it when loading. The merging tool then needs no extra scans.
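A minimal sketch of the renumbering idea, with a hypothetical helper name (the offset mirrors the interleaved pipeline layout; verl's actual checkpoint code may differ):
```python
def layer_offset(num_layers: int, pp_rank: int, pp_size: int,
                 vpp_rank: int, vpp_size: int) -> int:
    """Global index of the first layer owned by this (pp, vpp) rank."""
    per_pp = num_layers // pp_size
    if vpp_size <= 1:
        return pp_rank * per_pp
    per_chunk = per_pp // vpp_size         # layers in one virtual chunk
    chunk_stride = num_layers // vpp_size  # distance between consecutive vpp chunks
    return vpp_rank * chunk_stride + pp_rank * per_chunk

# Saving: local layer i is stored under global index layer_offset(...) + i.
# Loading: subtract the same offset to recover the ModuleList-local index.
```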
# Huggingface Model loader logic simplified
Since every rank has access to the state_dict, there is no need to broadcast the weights from rank 0 across the mp and dp groups at all. The previous implementation was too costly and could cause OOM issues, because each rank could occupy the whole model's space on the GPU.
The loader logic was also not straightforward: each rank only needs to load its own vpp_size chunks of layers, so there is no reason to iterate over the whole num_layers.
The current solution is that every rank loads its own sharded weights from `state_dict`.
However, this requires that storage nodes be reachable from every compute node. For users who can only store the Hugging Face model on rank 0, we keep the original implementation in a deprecated module beside the new version of the file.
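A rough sketch of the per-rank loading idea, with hypothetical names; the checkpoint key layout is assumed to be HF-style and to match the target layers:
```python
def load_own_shard(model_chunks, state_dict, offsets):
    """Each rank loads only the layers it owns.

    model_chunks: this rank's vpp model chunks (attribute layout assumed).
    offsets[v]: global index of the first layer in chunk v (see the sketch above).
    """
    for v, chunk in enumerate(model_chunks):
        for local_idx, layer in enumerate(chunk.decoder.layers):
            g = offsets[v] + local_idx
            prefix = f"model.layers.{g}."  # HF-style key prefix (assumed)
            shard = {k[len(prefix):]: w for k, w in state_dict.items()
                     if k.startswith(prefix)}
            layer.load_state_dict(shard, strict=False)
```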
# Modify test scripts to reuse downloaded huggingface model
This avoids errors from contacting the Hugging Face Hub for metadata.
# Modify CI workflows to enable load-balance of CI machines
Currently the L20-0 runner takes up 6 more jobs than L20-1; this change tries to reduce the pipeline bubble of each task.
This PR combines multiple modifications.
# QWen2.5 checkpoint saver bug fix
Thanks to the efforts @uygnef contributed in #368, we use the new model loader and saver with 3D-parallelism support.
# Megatron backend 3D-parallelism test benches
We modify the scripts in `examples/ppo_trainer` and `tests/e2e`, as well as the CI workflows; all are tested.
# Bug Fix for 3D-parallelism
This includes configuration bugs as well as module packing. The original TP `VocabParallelEntropy` could lead to CUDA OOM; we refactor the implementation with `torch.bmm`.
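A minimal sketch of the `torch.bmm` trick (assumed from the description above, not necessarily verl's exact code): the expectation term of the entropy is computed with a batched matmul, so no extra `[tokens, vocab]` product tensor is materialized.
```python
import torch
import torch.nn.functional as F

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy: H = logsumexp(logits) - sum_v p_v * logits_v.

    logits: [tokens, vocab]. The dot product p . logits is computed with
    torch.bmm on [T, 1, V] x [T, V, 1], avoiding a full [T, V] intermediate
    for p * logits.
    """
    pd = F.softmax(logits, dim=-1)                              # [T, V]
    expected = torch.bmm(pd.unsqueeze(1), logits.unsqueeze(2))  # [T, 1, 1]
    return torch.logsumexp(logits, dim=-1) - expected.view(-1)
```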
# Full migration to Megatron Core
verl now only uses Megatron Core and no longer calls other Megatron components. If any are needed, please integrate them into `utils/megatron`.
---------
Co-authored-by: uygnef <admin@fengyu.org>
## What does this PR do?
This PR migrates the RL-on-VLMs feature from our
[EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL. We have
validated this feature using the Qwen2.5-VL 7B model on 8*H100 GPUs. The
configuration and data processing script are provided along with this PR for
easy reproduction.
## How to reproduce?
1. Download and preprocess the dataset
```bash
python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
```
2. Start GRPO training
```bash
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```
## Dependencies
- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)
## Major Changes
### New dataflow for multimodal RL
In this PR, we introduce two new concepts in the dataflow,
`multi_modal_data` and `multi_modal_inputs`. The former means the
multi-modal features required by the **rollout** worker (such as vLLM),
while the latter means the multi-modal features required by the
**actor/critic** worker (such as an HF model). They are different
because the rollout and actor workers have their own data format
requirements.
Taking Qwen2-VL + huggingface + vLLM as an example, the data structure
should be:
- **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
- **multi_modal_inputs**: {"pixel_values": torch.Tensor,
"image_grid_thw": torch.Tensor}
Both of them are converted to numpy objects and placed in the non-tensor
batch in DataProto.
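For illustration, the two dicts for a single sample might look like this (shapes are illustrative only, not exact Qwen2-VL values):
```python
import torch
from PIL import Image

# Consumed by the rollout worker (vLLM):
multi_modal_data = {"image": [Image.new("RGB", (448, 448))]}

# Consumed by the actor/critic worker (HF model); shapes illustrative:
multi_modal_inputs = {
    "pixel_values": torch.randn(256, 1176),       # one row per visual patch
    "image_grid_thw": torch.tensor([[1, 16, 16]]),  # (t, h, w) patch grid
}
# Both dicts are converted to numpy objects and stored in DataProto's non-tensor batch.
```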
This design can be extended to other modalities/VLMs easily because it is
model-agnostic.
### Other changes
- Data
  - Support pre-processing the [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) dataset.
  - Support `config.data.image_key`, which should be **a list of Pillow images**.
- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt to the m-rope.
- Rollout
  - Update the dtensor weight loader to adapt to the Qwen2-VL architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding** the input ids.
- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k dataset.
- Models
  - Support calculating the position ids for the m-rope in Qwen2-VL.
  - Support removing padding in flash attention 2 for the m-rope (transformers itself **does not support it**).
- Sharding Manager
  - Support all-gathering the non-tensor batch.
- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.
Note: The Ulysses parallelism is not completed yet. We will support it
in the next update.
## Performance
We provide the estimated MFU of the language model part for H100 GPUs.
These values are lower than the actual ones because **we did not compute
the FLOPs of the vision tower part**.
- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%
The training and test reward score curves are presented as follows.

## Who can review?
@vermouth1992 @PeterSH6