### What does this PR do?

Support training Qwen3-VL with Megatron:

1. Add an image with vLLM 0.11 and NeMo's dedicated Megatron, which supports gpt-oss with optimized fused kernels.
2. Add a script for training Qwen3-VL-30B with Megatron.
3. Make the changes necessary to support Qwen3-VL with Megatron (only forward functions are registered; the modeling is handled through mbridge).

### Test

<img width="372" height="314" alt="image" src="https://github.com/user-attachments/assets/f1126e46-51a9-4e00-958f-5d034b8f94bd" />

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
# Dockerfiles of verl
We provide pre-built Docker images for quick setup. Starting from this version, we use a new image release hierarchy for productivity and stability.
The images fall into three broad categories:
- Base image: no inference or training frameworks, only basic dependencies. You can install vLLM or SGLang directly on top of it without reinstalling torch or CUDA.
- Application image: stable version with inference and training frameworks installed.
- Preview image: unstable version with the latest frameworks and features.
The first two types are hosted in the `verlai/verl` repository on Docker Hub, while the preview images are hosted in community repositories.
Image versions map to verl releases; for example, an image tagged `verl0.4` is built for verl release `v0.4.x`.
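The tag-to-release naming convention can be sketched as a small helper. This is a hypothetical illustration of the convention stated above, not a function that exists in verl.

```python
import re

def release_for_tag(tag: str) -> str:
    """Map an image tag like 'verl0.4' to the verl release line it targets.

    Illustrative only: the mapping is purely by naming convention.
    """
    m = re.match(r"verl(\d+)\.(\d+)$", tag)
    if m is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    major, minor = m.groups()
    return f"v{major}.{minor}.x"
```

For example, `release_for_tag("verl0.4")` yields `"v0.4.x"`.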
## Base Image
The stable base image is `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4`, with variants for different CUDA versions. The base image is updated infrequently, and application images can be built on top of it without reinstalling the base packages.
## Application Image
From this version, we provide separate images for vLLM and SGLang because their dependencies (such as FlashInfer) have diverged. Two types of application images are available:
- vLLM with FSDP and Megatron: `verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2`
- SGLang with FSDP and Megatron: `verlai/verl:app-verl0.5-transformers4.55.4-sglang0.4.10.post2-mcore0.13.0-te2.2`
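These tags encode their pinned package versions directly in the name. As a sketch, the following hypothetical helper splits an application tag into name/version pairs; it is illustrative only and not part of verl.

```python
import re

def parse_app_tag(tag: str) -> dict:
    """Split an application image tag such as
    'app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2'
    into {package: version}. Illustrative only."""
    parts = tag.split("-")
    if parts[0] != "app":
        raise ValueError(f"not an application image tag: {tag!r}")
    versions = {}
    for part in parts[1:]:
        # Package names are lowercase letters; the version starts at the
        # first digit (e.g. 'sglang0.4.10.post2' -> 'sglang', '0.4.10.post2').
        m = re.match(r"([a-z]+)([0-9].*)", part)
        if m is None:
            raise ValueError(f"unrecognized component: {part!r}")
        versions[m.group(1)] = m.group(2)
    return versions
```

This makes it easy to check, for instance, which Megatron-Core (`mcore`) or Transformer Engine (`te`) version a given image pins.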
Docker images with Megatron backends can run post-training of large language models such as `Qwen/Qwen3-235B-A22B` and `deepseek-ai/DeepSeek-V3-0324`. Refer to the [Large Language Model Post-Training documentation](../perf/dpsk) for more details.
Application images can be updated frequently; their Dockerfiles can be found at `docker/verl[version]-[packages]/Dockerfile.app.[frameworks]`. Based on the base image, it is easy to build your own application image with the desired inference and training frameworks.
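The Dockerfile location pattern above can be sketched as a small path helper. This is illustrative only; the `"cu126"` packages string and `"vllm"` framework name in the example are assumptions, not a guarantee that this exact directory exists in the repository.

```python
def app_dockerfile_path(version: str, packages: str, framework: str) -> str:
    """Build the expected Dockerfile location following the documented
    pattern: docker/verl[version]-[packages]/Dockerfile.app.[frameworks].

    Hypothetical helper; the arguments are naming-convention components.
    """
    return f"docker/verl{version}-{packages}/Dockerfile.app.{framework}"
```

For example, `app_dockerfile_path("0.5", "cu126", "vllm")` resolves to `docker/verl0.5-cu126/Dockerfile.app.vllm` under this convention.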
## Community Image
- For vLLM with FSDP, please refer to the `hiyouga/verl` repository; the latest version is `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0`.
- For SGLang with FSDP, please refer to the `ocss884/verl-sglang` repository; the latest version is `ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5`, provided by the SGLang RL Group.
- For the latest vLLM with Megatron, please refer to the `iseekyan/verl` repository; the latest version is `iseekyan/verl:nemo.gptoss_vllm0.11.0`.
See the files under `docker/` for NGC-based images or if you want to build your own.
Note that for AWS instances with the EFA network interface (SageMaker AI Pod), you need to install the EFA driver as shown in `docker/Dockerfile.extenstion.awsefa`.
## Installation from Docker
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
- Launch the desired Docker image and attach to it:

```shell
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
```
- If you use the images provided, you only need to install verl itself, without its dependencies:

```shell
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
```
- [Optional] If you want to switch between different frameworks, install verl with the corresponding extras:

```shell
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
```