DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
This directory contains the implementation for reproducing the DeepEyes paper within the verl framework, supporting multi-turn visual tool calls. This implementation is based on the original DeepEyes paper and its official implementation, integrated with the multi-modal and multi-turn capabilities of the verl framework.
Reproducing the Experiment
Note on the 'Chart' Dataset:
The provided preprocessing script intentionally excludes
data_v0.8_visual_toolbox_v2.parquet
, which contains the 'Chart' data. This subset consists of very high-resolution images, often resembling large figures composed of multiple sub-plots, much like those found in academic papers.Consequently, even after using the zoom-in tool, the resulting cropped images remain large. This poses a significant risk of causing Out-of-Memory (OOM) errors, which can abruptly terminate the training process.
We strongly recommend against training on the 'Chart' dataset on a single node.
Note on the 'thinklite' Dataset: Many images in the
thinklite
dataset have a very low resolution, with either a height or width below 28 pixels. This fails to meet the minimum input size required by the Qwen-2.5VL image processor and would cause errors during data loading.To mitigate this, we upscale these low-resolution images to satisfy the processor's requirements. However, please be aware that because the original resolution is low, subsequent
crop
operations by the zoom-in tool might frequently trigger exceptions, which could in turn affect the model's tool-use performance.
First, launch an inference service to act as a judge for reward calculation. You can use the following script as a reference:
python -m sglang.launch_server --model-path /path/to/Qwen2.5-72B-Instruct \
--port 18901 \
--tp-size 8 \
--context-length 32768 \
--trust-remote-code \
--log-requests false
Next, you can start the training:
bash recipe/deepeyes/run_deepeyes_grpo.sh
Performance
See Comment for more details.
Note: AgentLoop does not directly record num_tool_calls, but records num_turns. In our scenario, you can calculate the number of tool calls by num_tool_calls = num_turns / 2 - 1.
References and Acknowledgements
If you need further details for reproduction or encounter any issues, feel free to open an issue or contact the maintainers.