verl/.gitignore
hoshi-hiyouga b46f55ecc9 [feat] Initial support for VLMs, add Qwen2.5VL GRPO example (#386)
## What does this PR do?

This PR migrates the RL-on-VLMs feature from our
[EasyR1](https://github.com/hiyouga/EasyR1) fork back into veRL. We have
validated it with the Qwen2.5-VL 7B model on 8×H100 GPUs. The
configuration and data preprocessing script are provided along with this
PR for easy reproduction.

## How to reproduce?

1. Download and preprocess the dataset

```bash
python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
```

2. Start GRPO training

```bash
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```

## Dependencies

- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)

## Major Changes

### New dataflow for multimodal RL

In this PR, we introduce two new concepts in the dataflow,
`multi_modal_data` and `multi_modal_inputs`. The former means the
multi-modal features required by the **rollout** worker (such as vLLM),
while the latter means the multi-modal features required by the
**actor/critic** worker (such as an HF model). They are different
because the rollout and actor workers have their own data format
requirements.

Taking Qwen2-VL + huggingface + vLLM as an example, the data structure
should be:

- **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
- **multi_modal_inputs**: {"pixel_values": torch.Tensor,
"image_grid_thw": torch.Tensor}

Both of them are converted to numpy objects and placed in the non-tensor
batch of DataProto.

This design is model-agnostic, so it can easily be extended to other
modalities and VLMs.
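As an illustrative sketch (not the exact verl API; the helper function, array shapes, and dummy values below are made up for the example), the two dicts for one sample might be built and stored like this:

```python
import numpy as np

# One sample with two images. Numpy arrays stand in for the PIL images /
# torch tensors used in practice; shapes are dummies.
multi_modal_data = {"image": [np.zeros((28, 28, 3), dtype=np.uint8),
                              np.zeros((56, 56, 3), dtype=np.uint8)]}
multi_modal_inputs = {"pixel_values": np.zeros((8, 1176), dtype=np.float32),
                      "image_grid_thw": np.array([[1, 2, 2], [1, 4, 4]])}

def to_non_tensor(per_sample_dicts):
    """Wrap per-sample dicts into a 1-D numpy object array (one entry per
    batch sample), the layout used for DataProto's non-tensor batch."""
    out = np.empty(len(per_sample_dicts), dtype=object)
    for i, d in enumerate(per_sample_dicts):
        out[i] = d
    return out

non_tensor_batch = {
    "multi_modal_data": to_non_tensor([multi_modal_data]),
    "multi_modal_inputs": to_non_tensor([multi_modal_inputs]),
}
```

The object-array wrapper is what lets arbitrary Python structures (PIL images, ragged tensors) travel alongside the regular tensor batch.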

### Other changes

- Data
  - Support pre-processing the
[Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k)
dataset.
  - Support `config.data.image_key`; the column it refers to should
contain **a list of Pillow images**.

- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt them to the m-rope.

- Rollout
  - Update the dtensor weight loader to adapt to the Qwen2-VL
architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding** the
input ids.
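The `raw_prompt_ids` idea can be sketched as follows (the helper name is hypothetical and the actual verl code differs): recover the unpadded per-sample token-id lists from a padded batch and hand those to vLLM directly, so no unpadding step is needed after generation.

```python
def raw_prompt_ids_from_padded(input_ids, attention_mask):
    """Recover per-sample unpadded token-id lists from a padded batch
    by keeping only positions where the attention mask is set."""
    return [
        [tok for tok, keep in zip(ids, mask) if keep]
        for ids, mask in zip(input_ids, attention_mask)
    ]

padded = [[0, 0, 5, 6, 7], [0, 8, 9, 10, 11]]  # left-padded with id 0
mask   = [[0, 0, 1, 1, 1], [0, 1, 1, 1, 1]]
raw = raw_prompt_ids_from_padded(padded, mask)
# raw == [[5, 6, 7], [8, 9, 10, 11]]
```

Passing ragged token-id lists is natural for vLLM, which does not require a rectangular batch.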

- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k
dataset.

- Models
  - Support calculating the position ids for the m-rope in Qwen2-VL.
  - Support removing padding in flash attention 2 for the m-rope
(transformers itself **does not support it**).

- Sharding Manager
  - Support all-gathering the non-tensor batch.

- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.

Note: Ulysses sequence parallelism is not supported yet. We will add it
in the next update.
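For intuition, the m-rope position ids mentioned under **Models** can be sketched in a simplified form (an illustrative approximation, not the exact Qwen2-VL algorithm): each token carries a 3-channel (temporal, height, width) position id; text tokens share one index across all three channels, while image tokens index into their (t, h, w) grid.

```python
import numpy as np

def mrope_position_ids(segments):
    """Simplified m-rope sketch. segments: list of ("text", length) or
    ("image", (t, h, w)). Returns a (3, seq_len) array of position ids."""
    pos = [[], [], []]  # temporal, height, width channels
    nxt = 0             # next available position index
    for kind, spec in segments:
        if kind == "text":
            # Text tokens: the same running index in all three channels.
            for i in range(spec):
                for ch in range(3):
                    pos[ch].append(nxt + i)
            nxt += spec
        else:
            # Image tokens: index into the (t, h, w) grid, offset by nxt.
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        pos[0].append(nxt + ti)
                        pos[1].append(nxt + hi)
                        pos[2].append(nxt + wi)
            nxt += max(t, h, w)
    return np.array(pos)

ids = mrope_position_ids([("text", 2), ("image", (1, 2, 2)), ("text", 1)])
```

Advancing the offset by `max(t, h, w)` keeps the text that follows an image at a position just past the largest index the image used.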

## Performance

We provide the estimated MFU of the language-model part on H100 GPUs.
These values understate the true utilization because **we did not
compute the FLOPs of the vision tower**.

- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%
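As a rough guide, MFU for the language-model part can be estimated with the common 6 · params · tokens approximation for training FLOPs; the helper below is an illustrative sketch (the function name, the throughput number, and the ~989 TFLOPs H100 BF16 dense peak are assumptions, not values measured in this PR):

```python
def estimate_mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu=989e12):
    """Rough MFU estimate for the language-model part only.

    Uses the 6 * params * tokens approximation for forward + backward
    training FLOPs; the vision tower is ignored, matching the caveat
    above. peak_flops_per_gpu defaults to H100 BF16 dense peak.
    """
    achieved = 6 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# Illustrative numbers only (not measured in this PR):
mfu = estimate_mfu(n_params=7e9, tokens_per_sec=30_000, n_gpus=8)
```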

The training and test reward score curves are presented as follows.


![image](https://github.com/user-attachments/assets/ecb9fc27-8591-4c5b-ae4b-4ba77c6e30f9)

## Who can review?

@vermouth1992 @PeterSH6
2025-03-03 19:41:28 +08:00


**/*.pt
**/checkpoints
**/wget-log
**/_build/
**/*.ckpt
**/outputs
**/*.tar.gz
**/playground
**/wandb
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
dataset/*
tensorflow/my_graph/*
.idea/
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# vscode
.vscode
# Mac
.DS_Store
# output logs
tests/e2e/toy_examples/deepspeed/synchronous/output.txt
# vim
*.swp
# lock files
*.lock
# data
*.parquet