Files
verl/docs/examples/multi_modal_example.rst
7559a6a938 [doc] fix: add time info for each doc, assert sphinx warning in CI (#2255)
### What does this PR do?

Add time info for each doc, and assert on Sphinx warnings in CI.
The time info helps the community identify docs that may be outdated
before they are actually removed or updated.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`


### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2025-06-29 11:58:35 +08:00


Multi-Modal Example Architecture
=================================

Last updated: 04/28/2025.

Introduction
------------

verl now supports multi-modal training. You can use FSDP together with
vLLM or SGLang to start a multi-modal RL task; Megatron support is also
on the way.

Follow the steps below to quickly start a multi-modal RL task.


Step 1: Prepare dataset
-----------------------

.. code:: bash

   # the processed dataset will be saved in the $HOME/data/geo3k folder
   python examples/data_preprocess/geo3k.py
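
If you want to sanity-check the preprocessing output before training, a quick
look at the generated files can help. The snippet below is an optional,
illustrative check: it assumes the script writes ``train.parquet`` and
``test.parquet`` under ``$HOME/data/geo3k`` (file names may differ) and that
``pandas`` with parquet support is installed.

.. code:: bash

   # optional sanity check (assumes train.parquet is produced in $HOME/data/geo3k)
   ls -lh $HOME/data/geo3k/
   python3 -c "import pandas as pd; df = pd.read_parquet('$HOME/data/geo3k/train.parquet'); print(len(df), 'rows'); print(df.columns.tolist())"
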

Step 2: Download Model
----------------------

.. code:: bash

   # download the model from huggingface
   python3 -c "import transformers; transformers.pipeline(model='Qwen/Qwen2.5-VL-7B-Instruct')"
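
The ``transformers.pipeline`` call above downloads the checkpoint into the
local Hugging Face cache as a side effect. If you prefer an explicit download,
the standard Hugging Face CLI shown below is an alternative (it is not required
by verl; any method that places the model in the local cache works).

.. code:: bash

   # alternative: explicit download into the local Hugging Face cache
   huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
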

Step 3: Perform GRPO training with multi-modal model on Geo3K Dataset
----------------------------------------------------------------------

.. code:: bash

   # run the task
   bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
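
The script wraps verl's Hydra-style trainer entrypoint. The sketch below only
illustrates the kind of command-line overrides such a script typically passes
(dataset paths, model path, rollout backend, GRPO advantage estimator); it is
not a copy of ``run_qwen2_5_vl-7b.sh``, so take the exact keys and values from
the script itself.

.. code:: bash

   # illustrative sketch only -- see run_qwen2_5_vl-7b.sh for the actual settings
   python3 -m verl.trainer.main_ppo \
       algorithm.adv_estimator=grpo \
       data.train_files=$HOME/data/geo3k/train.parquet \
       data.val_files=$HOME/data/geo3k/test.parquet \
       actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-7B-Instruct \
       actor_rollout_ref.rollout.name=vllm \
       trainer.n_gpus_per_node=8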