`verl/recipe/deepeyes/README.md`

Commit 9f4161e250 by Minghui Jia, 2025-08-12: [recipe] feat: add deepeyes recipe (#2398)
### What does this PR do?

This PR introduces a complete training recipe for [DeepEyes:
Incentivizing "Thinking with Images" via Reinforcement
Learning](https://arxiv.org/abs/2505.14362).

The core feature is support for multi-turn visual tools, specifically
the `ImageZoomInTool`, integrated with a custom reward function based on
the "LLM-as-a-Judge" pattern to evaluate model performance.

Additionally, to better monitor and analyze the model's tool-use
behavior, this PR adds functionality to track tool call counts during
training and report these metrics to logging systems such as wandb.

### API and Usage Example

The primary change is the new training recipe for DeepEyes. Users can
start a training run by using the provided configuration file.

1. Preprocess the dataset. The script adds tool-related `extra_info` to each sample (an illustrative sketch of these fields follows the commands below):
```bash
python recipe/deepeyes/deepeyes47k_preprocess.py --dataset_dir <path_to_raw_dataset> --save_dir <path_to_processed_data>
```
2. Start the PPO training:
```bash
bash recipe/deepeyes/run_deepeyes_grpo.sh
```
The training process automatically loads the `ImageZoomInTool` and the
custom reward function defined in the recipe.

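For illustration, a processed sample might carry tool-related fields like the following. The exact schema is defined by `recipe/deepeyes/deepeyes47k_preprocess.py`; the key names below follow the convention used by verl's multi-turn tool examples and are hypothetical here:

```python
# Hypothetical sketch of the tool-related extra_info added during preprocessing;
# consult recipe/deepeyes/deepeyes47k_preprocess.py for the authoritative schema.
example_row = {
    "prompt": [{"role": "user", "content": "..."}],
    "extra_info": {
        "need_tools_kwargs": True,  # enable tool calling for this sample
        "tools_kwargs": {
            "image_zoom_in_tool": {  # tool key: hypothetical
                "create_kwargs": {"image": "<image bytes or path>"},
            },
        },
    },
}
```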

### Design & Code Changes

- **DeepEyes Recipe Integration**: Added a new recipe directory with
data preprocessing, tool config, and a custom reward function for
DeepEyes.
- **Visual Tool Support**: Implemented `ImageZoomInTool` with robust
bbox validation and resizing (see the sketch after this list).
- **Tool Call Statistics**: Modified the rollout and metrics code to
track and log tool call counts per sample and per step.
- **Bug Fixes**: Fixed image byte handling and ensured special tokens
are preserved during decoding for tool call formatting.
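
The bbox validation mentioned above might look roughly like this sketch (hypothetical helper names, not the recipe's actual code):

```python
# Minimal sketch of the kind of bbox validation ImageZoomInTool performs
# before cropping; validate_bbox and zoom_in are hypothetical names.
from PIL import Image


def validate_bbox(img: Image.Image, bbox: tuple[float, float, float, float]) -> tuple[int, int, int, int]:
    """Clamp an (x1, y1, x2, y2) bbox to the image bounds and reject degenerate boxes."""
    x1, y1, x2, y2 = bbox
    x1, x2 = sorted((max(0.0, min(x1, img.width)), max(0.0, min(x2, img.width))))
    y1, y2 = sorted((max(0.0, min(y1, img.height)), max(0.0, min(y2, img.height))))
    if x2 - x1 < 1 or y2 - y1 < 1:
        raise ValueError(f"Degenerate bbox after clamping: {bbox}")
    return int(x1), int(y1), int(x2), int(y2)


def zoom_in(img: Image.Image, bbox: tuple[float, float, float, float]) -> Image.Image:
    """Crop the validated region; the caller may additionally resize the result."""
    return img.crop(validate_bbox(img, bbox))
```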

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Maxwell-Jia <mr.minghui.jia@gamil.com>
Co-authored-by: xieck13 <xieck13@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

# DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

This directory contains the implementation for reproducing the DeepEyes paper within the verl framework, supporting multi-turn visual tool calls. This implementation is based on the original DeepEyes paper and its official implementation, integrated with the multi-modal and multi-turn capabilities of the verl framework.

## Reproducing the Experiment

**Note on the 'Chart' dataset:**

The provided preprocessing script intentionally excludes `data_v0.8_visual_toolbox_v2.parquet`, which contains the 'Chart' data. This subset consists of very high-resolution images, often resembling large figures composed of multiple sub-plots, much like those found in academic papers.

Consequently, even after using the zoom-in tool, the resulting cropped images remain large. This poses a significant risk of causing Out-of-Memory (OOM) errors, which can abruptly terminate the training process.

We strongly recommend against training on the 'Chart' dataset on a single node.

**Note on the 'thinklite' dataset:** Many images in the thinklite dataset have a very low resolution, with either height or width below 28 pixels. This fails to meet the minimum input size required by the Qwen2.5-VL image processor and would cause errors during data loading.

To mitigate this, we upscale these low-resolution images to satisfy the processor's requirements. However, please be aware that because the original resolution is low, subsequent crop operations by the zoom-in tool might frequently trigger exceptions, which could in turn affect the model's tool-use performance.
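
The upscaling described above might look like the following sketch; `MIN_SIDE = 28` reflects the minimum side length stated in the note, and the helper name is hypothetical:

```python
from PIL import Image

# Minimum side length accepted by the Qwen2.5-VL image processor, per the note above.
MIN_SIDE = 28


def upscale_if_needed(img: Image.Image) -> Image.Image:
    """Upscale an image so both sides meet the processor's minimum input size."""
    if min(img.size) >= MIN_SIDE:
        return img
    scale = MIN_SIDE / min(img.size)
    new_size = (
        max(MIN_SIDE, round(img.width * scale)),
        max(MIN_SIDE, round(img.height * scale)),
    )
    return img.resize(new_size, Image.LANCZOS)
```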

First, launch an inference service to act as a judge for reward calculation. You can use the following script as a reference:

```bash
python -m sglang.launch_server --model-path /path/to/Qwen2.5-72B-Instruct \
    --port 18901 \
    --tp-size 8 \
    --context-length 32768 \
    --trust-remote-code \
    --log-requests false
```
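
Once the server is up, the custom reward function queries it as a judge. As a quick connectivity check, you can hit sglang's OpenAI-compatible endpoint; this snippet is illustrative, and the model name must match the path you served:

```python
# Illustrative connectivity check for the judge server launched above;
# sglang exposes an OpenAI-compatible API under /v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18901/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/path/to/Qwen2.5-72B-Instruct",  # must match --model-path above
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```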

Next, you can start the training:

```bash
bash recipe/deepeyes/run_deepeyes_grpo.sh
```

## Performance

*Figures: training curves for score, entropy, and num_turns.*

See the linked PR comment for more details.

Note: `AgentLoop` does not directly record `num_tool_calls`; it records `num_turns`. In this scenario, you can compute the number of tool calls as `num_tool_calls = num_turns / 2 - 1`.
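
A one-liner for post-processing the logged metric accordingly:

```python
# Derived from the relation above: num_tool_calls = num_turns / 2 - 1.
def num_tool_calls(num_turns: int) -> int:
    return num_turns // 2 - 1


assert num_tool_calls(6) == 2  # 6 logged turns imply 2 tool calls
assert num_tool_calls(2) == 0  # 2 logged turns imply no tool use
```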

## References and Acknowledgements

- [DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning](https://arxiv.org/abs/2505.14362)
- The official DeepEyes implementation, on which this recipe builds.
If you need further details for reproduction or encounter any issues, feel free to open an issue or contact the maintainers.