Ethan (Yusheng) Su fd1a121324 [hardware] fix: update source in dockerfile.rocm (#3284)
### What does this PR do?

> Update the source in `Dockerfile.rocm`

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (this
  will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
    `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
    `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
    `env`, `tool`, `ckpt`, `doc`, `data`.
  - If this PR involves multiple modules, separate them with `,`, like
    `[megatron, fsdp, doc]`.
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

```bash
docker build -f Dockerfile.rocm -t verl-rocm:local .
docker run --rm -it verl-rocm:local python -c "import torch; print('ok')"
```

### Design & Code Changes

> Update the source in `Dockerfile.rocm`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
> otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Dockerfiles of verl

We provide pre-built Docker images for quick setup. Starting from this version, we use a new image release hierarchy for productivity and stability.

The images are divided into three categories:

  • Base Image: Only basic dependencies are installed, without inference or training frameworks. vLLM or SGLang can be installed directly on top of it, with no need to reinstall torch or CUDA.
  • Application Image: Stable version with inference and training frameworks installed.
  • Preview Image: Unstable version with the latest frameworks and features.

The first two types of images are hosted in the verlai/verl repository on Docker Hub, while the preview images are hosted in community repositories.

The image versions map to verl releases; for example, an image tagged verl0.4 is built for verl release v0.4.x.

Base Image

The stable base image is verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4, with variants available for different CUDA versions.

The base image is updated infrequently, and the app image can be built on top of it without reinstalling base packages.

Application Image

From this version, we provide separate images for vLLM and SGLang, since their dependencies (such as FlashInfer) have diverged. Two types of application images are available, with a pull example after the list:

  • vLLM with FSDP and Megatron: verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2
  • SGLang with FSDP and Megatron: verlai/verl:app-verl0.5-transformers4.55.4-sglang0.4.10.post2-mcore0.13.0-te2.2
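
For example, to fetch the vLLM application image (the tag is copied from the list above):

docker pull verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2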

Docker images with Megatron backends can run post-training for large language models such as Qwen/Qwen3-235B-A22B and deepseek-ai/DeepSeek-V3-0324. Refer to the Large Language Model Post-Training documentation (docs/perf/dpsk) for more details.

Application images can be updated frequently, and the Dockerfile can be found at docker/verl[version]-[packages]/Dockerfile.app.[frameworks]. Based on the base image, it is easy to build your own application image with the desired inference and training frameworks, as sketched below.
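
A minimal sketch of such a custom application image, assuming the stable base image above; the file name and pinned versions are illustrative, not an official verl Dockerfile:

# Dockerfile.app.custom (hypothetical file name)
# Build on the stable base image: torch, CUDA, and cuDNN are already installed,
# so only the inference framework and its companion packages are added.
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4
# Versions mirror the app image tags above; adjust to the release you target.
RUN pip3 install --no-cache-dir vllm==0.10.0 transformers==4.55.4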

Community Image

For vLLM with FSDP, please refer to the hiyouga/verl repository; the latest version is hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0.

For SGLang with FSDP, please refer to the ocss884/verl-sglang repository; the latest version is ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5, which is provided by the SGLang RL Group.

See the files under docker/ for NGC-based images, or if you want to build your own.

Note that for AWS instances with an EFA network interface (SageMaker AI Pod), you need to install the EFA driver as shown in docker/Dockerfile.extenstion.awsefa.
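
A build-command sketch for that extension image (the tag is hypothetical; run from the repository root and check the Dockerfile for any expected base-image build args):

# verl-efa:local is a hypothetical tag chosen for this example
docker build -f docker/Dockerfile.extenstion.awsefa -t verl-efa:local .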

Installation from Docker

After pulling the desired Docker image and installing the desired inference and training frameworks, you can launch it with the following steps:

  1. Launch the desired Docker image and attach to it:

docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash

  2. If you use the provided images, you only need to install verl itself, without its dependencies:

# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
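
To sanity-check the editable install from inside the container (a minimal check, not part of verl's test suite; it confirms Python resolves verl to the mounted source tree):

python3 -c "import verl; print(verl.__file__)"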

[Optional] If you want to switch between different frameworks, you can install verl together with the framework extras:

# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
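
After installing either extra, a quick import smoke test (assuming the packages expose __version__, which current vLLM and SGLang releases do):

python3 -c "import vllm; print(vllm.__version__)"
python3 -c "import sglang; print(sglang.__version__)"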