12 Commits

Author SHA1 Message Date
545f899844 [BREAKING] [perf] refactor: Profiler api refactor (#2894)
### What does this PR do?

Refactor the profiler CI into a unified setup.

TODO:

- nsys uses `save_path`
- nsys discrete tests are disabled
- torch profiler

cc: @davidmlw 

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
    `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
    `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
    `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like
    `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

Global profiler config:

```yaml
global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null
  steps: null
  profile_continuous_steps: false
  save_path: outputs/profile
  tool_config:
    nsys:
      _target_: verl.utils.profiler.config.NsightToolConfig
      discrete: false
    npu:
      _target_: verl.utils.profiler.config.NPUToolConfig
      discrete: false
      contents: []
      level: level1
      analysis: true
    torch:
      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
      step_start: 0
      step_end: null
```

Local profiler config:

```yaml
profiler:

  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
  _target_: verl.utils.profiler.ProfilerConfig

  # profiler tool, defaults to global_profiler.tool in the global config
  # choices: nsys, npu, torch
  tool: ${oc.select:global_profiler.tool,null}

  # whether to enable profiling on the critic
  enable: False

  # Whether to profile all ranks.
  all_ranks: False

  # The ranks that will be profiled. [] or [0,1,...]
  ranks: []

  # profile results saving path
  save_path: ${oc.select:global_profiler.save_path,null}

  # specific tool config
  tool_config: ${oc.select:global_profiler.tool_config,null}
```
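
A minimal usage sketch, assuming `verl.utils.omega_conf_to_dataclass`
(referenced in the comment above) accepts an OmegaConf node and returns the
dataclass named by `_target_`; the import path, file name, and config path
are illustrative:

```python
from omegaconf import OmegaConf
from verl.utils import omega_conf_to_dataclass  # import path is an assumption

# Load a config file that contains the `global_profiler` / `profiler` nodes above.
cfg = OmegaConf.load("ppo_trainer.yaml")

# Resolve the per-role profiler node into a ProfilerConfig dataclass.
profiler_cfg = omega_conf_to_dataclass(cfg.profiler)
print(profiler_cfg.tool, profiler_cfg.ranks, profiler_cfg.save_path)
```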

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
2025-08-11 09:52:41 +08:00
d255783a0a [docker] feat: upgrade vllm to 0.9.1 (#2747) 2025-07-29 07:32:04 +08:00
3126c8b428 remove redundant 'get_custom_reward_fn' function (#1791)
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

> Remove the redundant `get_custom_reward_fn` function.

### High-Level Design

> None.

### Specific Changes

> "from verl.trainer.ppo.reward import get_custom_reward_fn" instead of
'get_custom_reward_fn' function in verl/recipe/dapo/main_dapo.py
verl/recipe/r1/main_eval.py verl/recipe/spin/main_spin.py
verl/verl/trainer/main_eval.py verl/verl/trainer/main_eval.py
> remove 'get_custom_reward_fn' function in
verl/verl/trainer/main_ppo.py

### Additional Info.

- **[Issue Number](https://github.com/volcengine/verl/issues/1716)**:
Fixes issue #1716.

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
2025-06-01 21:45:54 +08:00
249c26fdc8 [tests] BREAKING: move recipe.dapo.src to recipe.dapo; move test files to their own namespaces (tests/verl/xxx -> tests/xxx) (#1392) 2025-05-10 11:21:53 +08:00
ccab83654c Megatron checkpoint default not save hf_models, and provide model merge tool. (#780)
Because CI is too slow, the checkpoint-related features and fixes are
combined into this single PR.

# Add layer index to decoder layers

It is hard to attach a "correct" layer number to each layer: in verl's
current Megatron implementation, each pp and vpp rank's layers start
from index 0, which is inconvenient for the merging tool.

The difficulty mainly comes from the `torch.nn.ModuleList`
implementation, [which forces direct indexing rather than custom layer
numbers](8a40fca9a1/torch/nn/modules/container.py (L302C5-L324C66)).

The current solution is to rewrite each layer number to its actual
global number, derived from the pp and vpp offsets, when saving a
Megatron checkpoint, and to restore the local numbering when loading.
The merging tool then needs no extra scans.
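
As a rough illustration of the mapping (not verl's actual helper; the
standard interleaved vpp layout and the argument names are assumptions):

```python
def global_layer_number(local_idx: int, pp_rank: int, vpp_rank: int,
                        pp_size: int, layers_per_chunk: int) -> int:
    """Map a chunk-local decoder-layer index (which always starts at 0)
    to its global layer number under an interleaved pp/vpp layout."""
    # Each vpp chunk on each pp rank owns `layers_per_chunk` consecutive layers.
    offset = (vpp_rank * pp_size + pp_rank) * layers_per_chunk
    return offset + local_idx
```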

# Huggingface Model loader logic simplified

Since every rank has access to the state_dict, there is no need to
broadcast the weights from rank 0 across the mp and dp groups at all.
The previous implementation was too costly and could cause OOM issues,
because each rank could take up the whole model's space on GPU.

The loader logic was also not straightforward: each rank only needs to
load its own vpp chunks of layers, so there is no reason to iterate over
the whole num_layers.

So the current solution is that every rank loads its own sharded
weights from the `state_dict`, as sketched below.
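
A rough sketch of the idea, with hypothetical names for the decoder
layers, the HF key pattern, and the owned layer indices (not verl's
actual loader):

```python
import torch

def load_own_layers(decoder_layers: torch.nn.ModuleList,
                    hf_state_dict: dict,
                    owned_global_layers: list[int]) -> None:
    # Each rank copies only the layers it owns from the full state_dict,
    # instead of receiving a broadcast of the whole model from rank 0.
    for local_idx, global_idx in enumerate(owned_global_layers):
        prefix = f"model.layers.{global_idx}."
        shard = {k[len(prefix):]: v for k, v in hf_state_dict.items()
                 if k.startswith(prefix)}
        decoder_layers[local_idx].load_state_dict(shard, strict=False)
```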

But this requires storage nodes that every compute node can reach. For
users who can only store the Hugging Face model on rank 0, we move the
original implementation to a deprecated path alongside the new version
of the file.

# Modify test scripts to reuse downloaded huggingface model

This avoids errors when connecting to Hugging Face to access metadata.

# Modify CI workflows to enable load-balance of CI machines

Currently L20-0 takes 6 more jobs than L20-1; we try to reduce the
pipeline bubble of each task.
2025-03-30 10:39:40 +08:00
333e6d624a [rollout] feat: add SGLang as rollout engine to verl (#490)
#22. WIP, will add more details tomorrow :)

---------

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-17 21:12:33 +08:00
35555d8ae9 Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism with multiple bug fixed (#495)
This PR combines multiple modifications.

# Qwen2.5 checkpoint saver bug fix

Thanks to the efforts @uygnef contributed in #368; we use the new model
loader and saver for 3D parallelism support.

# Megatron backend 3D-parallelism test benches

We modify the scripts in `examples/ppo_trainer` and `tests/e2e`, as well
as the CI workflows, all tested.

# Bug Fix for 3D-parallelism

This includes configuration bug fixes as well as module packing.

The original TP VocabParallelEntropy could lead to CUDA OOM; we refactor
the implementation with `torch.bmm`, as sketched below.
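
A minimal sketch of the `torch.bmm`-based entropy for logits of shape
`[tokens, vocab]` (an illustration of the trick, not the exact
tensor-parallel kernel):

```python
import torch

def entropy_from_logits_bmm(logits: torch.Tensor) -> torch.Tensor:
    # Per-token entropy H = logsumexp(logits) - E_p[logits]. The expectation
    # is reduced inside a batched matmul, avoiding another [tokens, vocab]
    # intermediate from an explicit (probs * logits) product.
    p = torch.softmax(logits, dim=-1)                                   # [N, V]
    expected = torch.bmm(p.unsqueeze(1), logits.unsqueeze(2)).view(-1)  # [N]
    return torch.logsumexp(logits, dim=-1) - expected
```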

# Full migration to Megatron Core

verl now only uses Megatron Core and no longer calls other Megatron
components. If any are needed, please integrate them into
`utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>
2025-03-07 13:38:58 +08:00
4a291fa760 [Hardware] Support AMD (Rocm kernel) (#360) 2025-03-06 13:56:20 +08:00
b46f55ecc9 [feat] Initial support for VLMs, add Qwen2.5VL GRPO example (#386)
## What does this PR do?

This PR migrates the RL-on-VLMs feature from our
[EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL. We have
validated this feature with the Qwen2.5-VL 7B model on 8*H100 GPUs. The
configuration and data processing script are provided along with this
PR for easy reproduction.

## How to reproduce?

1. Download and preprocess the dataset

```bash
python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
```

2. Start GRPO training

```bash
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```

## Dependencies

- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)

## Major Changes

### New dataflow for multimodal RL

In this PR, we introduce two new concepts in the dataflow,
`multi_modal_data` and `multi_modal_inputs`. The former means the
multi-modal features required by the **rollout** worker (such as vLLM),
while the latter means the multi-modal features required by the
**actor/critic** worker (such as an HF model). They are different
because the rollout and actor workers have their own data format
requirements.

Taking Qwen2-VL + huggingface + vLLM as an example, the data structure
should be:

- **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
- **multi_modal_inputs**: {"pixel_values": torch.Tensor,
"image_grid_thw": torch.Tensor}

Both of them are converted to numpy objects and placed in the non-tensor
batch in DataProto.

This design can easily be extended to other modalities/VLMs because it
is model-agnostic.
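
For concreteness, a minimal sketch of what the two fields might look
like for one Qwen2-VL sample (the tensor shapes and the 224x224 image
are assumptions for illustration):

```python
import numpy as np
import torch
from PIL import Image

# Consumed by the rollout worker (e.g. vLLM): raw PIL images.
multi_modal_data = {"image": [Image.new("RGB", (224, 224))]}

# Consumed by the actor/critic worker (HF model): processor outputs.
multi_modal_inputs = {
    "pixel_values": torch.randn(256, 1176),         # flattened vision patches (shape illustrative)
    "image_grid_thw": torch.tensor([[1, 16, 16]]),  # temporal/height/width patch grid
}

# Both are placed as numpy object arrays in DataProto's non-tensor batch.
non_tensor_batch = {
    "multi_modal_data": np.array([multi_modal_data], dtype=object),
    "multi_modal_inputs": np.array([multi_modal_inputs], dtype=object),
}
```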

### Other changes

- Data
  - Support pre-processing the [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) dataset.
  - Support `config.data.image_key`, which should be **a list of Pillow images**.

- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt to the m-rope.

- Rollout
  - Update dtensor weight loader to adapt to the Qwen2-VL architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding** the input ids.

- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k dataset.

- Models
  - Support calculating the position ids for the m-rope in Qwen2-VL.
  - Support removing padding in flash attention2 for m-rope (transformers itself **does not support it**).

- Sharding Manager
  - Support all-gathering the non-tensor batch.

- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.

Note: The Ulysses parallelism is not completed yet. We will support it
in the next update.

## Performance

We provide the estimated MFU of the language model part for H100 GPUs.
These values are lower than the actual ones because **we did not compute
the FLOPs of the vision tower part**.

- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%

The training and test reward score curves are presented as follows.


![image](https://github.com/user-attachments/assets/ecb9fc27-8591-4c5b-ae4b-4ba77c6e30f9)

## Who can review?

@vermouth1992 @PeterSH6
2025-03-03 19:41:28 +08:00
27484a7bbb [misc] feat: add ckpt manager in utils (#216)
- Support FSDPCheckpointManager
- Support hdfs_io import if installed
- Add CI for FSDPCheckpointManager

TODO:
- Will integrate in the next PR
2025-02-07 09:09:03 +08:00
e6b089c5a8 [example] docs: add getting started notebook with free GPUs from lightning (#92) 2025-01-11 09:10:53 -08:00
30911f133a [init] feat: upload first open source version of verl 2024-10-31 14:29:44 +08:00