verl/docker/Dockerfile.rocm_verl-0.3.0.post1
Ethan (Yusheng) Su 526098d664 [Hardware] feat: Support AMD (ROCm Kernel) - Update Dockerfile/Docker Image (#2390)
### What does this PR do?

> Update Dockerfile/Docker Image

### Checklist Before Starting
- [X] Search for similar PRs. 
- [X] Format the PR title (This will be checked by the CI)

### Test
>  Done

### API and Usage Example

> See the [AMD tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst) for build and usage instructions.
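
For reference, a minimal build command from the repo root (a sketch; the image tag follows the comment at the top of the Dockerfile and can be anything):

```bash
# Build the ROCm image from the verl repo root (tag is illustrative)
docker build -f docker/Dockerfile.rocm_verl-0.3.0.post1 -t verl-rocm:03.04.2015 .
```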


### Design & Code Changes

> Dockerfile/Docker image dependencies:
- ROCm: 6.3.4 (patch version)
- PyTorch: 2.7.0
- vLLM: >=0.8.5
- SGLang: >=v0.4.6.post4
- Megatron-LM: TransformerEngine==1.14.0, megatron-core==0.12.0
- Ray: >=2.45

Also enables VLM training.
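
To sanity-check the resulting image, the key versions can be printed inside the container (a minimal sketch; import names are assumptions based on the list above):

```bash
# Rough version check inside the container (not exhaustive)
python3 -c "import torch; print('torch', torch.__version__, 'hip', torch.version.hip)"
python3 -c "import vllm, sglang, ray; print('vllm', vllm.__version__, 'sglang', sglang.__version__, 'ray', ray.__version__)"
```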

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
2025-07-09 10:05:43 -07:00

# Build the docker image in the repo dir:
# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
# docker images # you can find your built docker image
# Supported - Training: FSDP; Inference: vLLM
# FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
# Supported - Training: FSDP; Inference: vLLM, SGLang
FROM lmsysorg/sglang:v0.4.6.post5-rocm630
# Set working directory
# WORKDIR $PWD/app
# Set environment variables
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
# Install vLLM from source
RUN pip uninstall -y vllm && \
    rm -rf vllm && \
    git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    MAX_JOBS=$(nproc) python3 setup.py install && \
    cd .. && \
    rm -rf vllm
# Copy the entire project directory
COPY . .
# Install Python dependencies
RUN pip install "tensordict==0.6.2" --no-deps && \
    pip install accelerate \
        codetiming \
        datasets \
        dill \
        hydra-core \
        liger-kernel \
        numpy \
        pandas \
        peft \
        "pyarrow>=15.0.0" \
        pylatexenc \
        "ray[data,train,tune,serve]<2.45.0" \
        torchdata \
        transformers \
        wandb \
        orjson \
        pybind11
# Install verl from source in editable mode
RUN git clone https://github.com/volcengine/verl.git && \
    cd verl && \
    pip install -e .
# Install torch_memory_saver
RUN pip install git+https://github.com/ExtremeViscent/torch_memory_saver.git --no-deps
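# Run the built image (a minimal sketch; the device/flag choices follow common
# ROCm container practice and may need adjustment for your machine):
# docker run --rm -it \
#   --device /dev/kfd --device /dev/dri \
#   --network host --ipc host --shm-size 16G \
#   --group-add video --cap-add SYS_PTRACE \
#   --security-opt seccomp=unconfined \
#   verl-rocm:03.04.2015 /bin/bash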