[Hardware] feat: Support AMD (ROCm Kernel) - Update Dockerfile/Docker Image (#2390)

### What does this PR do?

> Update Dockerfile/Docker Image

### Checklist Before Starting

- [X] Search for similar PRs.
- [X] Format the PR title (This will be checked by the CI)

### Test

> Done

### API and Usage Example

> Usage example(s): [AMD_tutorial](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).

### Design & Code Changes

> Dockerfile/Docker Image dependency:

- ROCm: 6.3.4 (patch version)
- PyTorch: 2.7.0
- vllm: >=0.8.5
- sglang: >=v0.4.6.post4
- megatron-lm: TransformerEngine==1.14.0, megatron-core==0.12.0
- Ray: >=2.45

Also allows VLM training.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/docs/amd_tutorial/amd_build_dockerfile_page.rst).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
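
### Checking the minimum versions (illustrative)

The minimum versions listed under "Design & Code Changes" could be spot-checked inside a built container with something like the sketch below. It is not part of this PR. The helper names `version_ge` and `check_min` are hypothetical, and the comparison assumes a `sort` that supports `-V` (GNU coreutils). `sort -V` does not implement Python's version rules, so suffixes such as `.post4` or `.dev` may not order as `pip` would.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; they are not part of the PR.

# version_ge A B: succeed when version A >= version B under `sort -V` ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED MINIMUM: print a one-line verdict for one package.
check_min() {
    if version_ge "$2" "$3"; then
        echo "OK   $1 $2 (>= $3)"
    else
        echo "FAIL $1 $2 (< $3)"
    fi
}

# Literal versions are used here so the example runs anywhere.
check_min vllm 0.8.5 0.8.5
check_min ray 2.47.0 2.45
```

Inside the image, the installed version could be read from `pip show` instead of being typed in, for example `check_min ray "$(pip show ray | awk '/^Version:/{print $2}')" 2.45`.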
2025-10-20 13:43:50 +08:00 · 2025-07-09 10:05:43 -07:00
parent b5e711eab5
commit 526098d664
6 changed files with 957 additions and 59 deletions
--- a/docker/Dockerfile.rocm
+++ b/docker/Dockerfile.rocm
@@ -1,36 +1,294 @@
-#  Build the docker in the repo dir:
-# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
-# docker images # you can find your built docker
+# FROM "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:94_ubuntu22.04_py3.10_pytorch_release-2.7_575e247"
+FROM "rlfoundation.azurecr.io/rocm6.3.4:vllm-0.8.5-numa-patch-ubuntu-22.04"
+
+SHELL ["/bin/bash", "-ceuxo", "pipefail"]
+
+ENV MAX_JOBS=512
+
+ENV PATH="/usr/local/python3.12/bin:$PATH"
+RUN ln -sf /usr/bin/python3.12 /usr/bin/python && \
+    ln -sf /usr/bin/pip3.12 /usr/bin/pip
+
+############################################
+############################################
+RUN apt-get update
+RUN apt-get install -y pkg-config liblzma-dev
+############################################
+############################################


-# Support - Training: fsdp; Inference: vllm
-# FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
-# Support - Training: fsdp; Inference: vllm, sglang
-FROM lmsysorg/sglang:v0.4.6.post5-rocm630
+###########################################
+##########Install TransformerEngine########
+###########################################
+WORKDIR /workspace/
+# transformer-engine install
+# https://github.com/ROCm/TransformerEngine

-# Set working directory
-# WORKDIR $PWD/app
+RUN rm -rf TransformerEngine 
+RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
+WORKDIR /workspace/TransformerEngine
+RUN git checkout 236178e5
+# git checkout bb061ade
+# git checkout 864405c

+ENV NVTE_FRAMEWORK=pytorch 
+ENV NVTE_ROCM_ARCH=gfx942 
+ENV NVTE_USE_HIPBLASLT=1
+ENV NVTE_USE_ROCM=1  
+
+# export CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr:${CMAKE_PREFIX_PATH:-}"
+ENV CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr"
+
+
+# ENV NVTE_BUILD_MAX_JOBS=$(MAX_JOBS)
+
+RUN MAX_JOBS=${MAX_JOBS} pip install . -vvv
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+
+####################################################################################
+################Install vllm - sglang require vllm 0.6.7 dependency#################
+####################################################################################
+#### Require vllm 0.6.7 - checkout 113274a0
+WORKDIR /workspace/
+RUN rm -rf vllm
+RUN pip uninstall -y vllm
+# Refer to here (down-grade vllm to 0.6.3): https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html
+RUN git clone https://github.com/ROCm/vllm.git
+# git clone https://github.com/vllm-project/vllm.git
+WORKDIR /workspace/vllm
+RUN git checkout 113274a0
+ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+#ENV MAX_JOBS=512
+ENV MAX_JOBS=${MAX_JOBS}
+RUN pip install "boto3>=1.26.0"
+RUN pip install setuptools_scm
+# will add src into py. You can delete the repo
+RUN python3 setup.py install
+WORKDIR /workspace/
+####################################################################################
+####################################################################################
+####################################################################################
+
+
+
+###########################################
+############For hack docker################
+###########################################
+RUN pip install setuptools==75.8.0
+###########################################
+###########################################
+###########################################
+
+
+
+###########################################
+############build sglang###################
+###########################################
 # Set environment variables
+ENV BASE_DIR=/sgl-workspace
+ENV BUILD_TYPE=all
+ENV SGL_REPO=https://github.com/sgl-project/sglang
+ENV SGL_BRANCH=v0.4.6.post5
+ENV TRITON_REPO=https://github.com/ROCm/triton.git
+ENV TRITON_COMMIT=improve_fa_decode_3.0.0
+ENV AITER_REPO=https://github.com/ROCm/aiter.git
+ENV AITER_COMMIT=v0.1.2
+# v0.1.2 version - commit id: 9d11f47
+# ENV AITER_COMMIT=9d11f47
+
+ENV HIP_FORCE_DEV_KERNARG=1
+ENV HSA_NO_SCRATCH_RECLAIM=1
+ENV SGLANG_SET_CPU_AFFINITY=1
+ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
+ENV NCCL_MIN_NCHANNELS=112
+ENV MOE_PADDING=1
+ENV VLLM_FP8_PADDING=1
+ENV VLLM_FP8_ACT_PADDING=1
+ENV VLLM_FP8_WEIGHT_PADDING=1
+ENV VLLM_FP8_REDUCE_CONV=1
+ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
+ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
+ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+ENV AMDGPU_TARGETS=gfx942
+ENV ROCM_ARCH=gfx942
 ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"

+# Switch to working directory
+WORKDIR /sgl-workspace
+
+# Clean and create directory
+RUN rm -rf /sgl-workspace && mkdir -p /sgl-workspace
+
+# Clone and build sglang
+RUN git clone ${SGL_REPO} \
+    && cd sglang \
+    && git checkout ${SGL_BRANCH} || echo "Using default branch" \
+    && cd sgl-kernel \
+    && rm -f pyproject.toml \
+    && mv pyproject_rocm.toml pyproject.toml \
+    && python setup_rocm.py install \
+    && cd .. \
+    && if [ "$BUILD_TYPE" = "srt" ]; then \
+         python -m pip --no-cache-dir install -e "python[srt_hip]"; \
+       else \
+         python -m pip --no-cache-dir install -e "python[all_hip]"; \
+       fi \
+    && cd /sgl-workspace \
+    && cp -r /sgl-workspace/sglang /sglang \
+    && python -m pip cache purge
+
+# Install common Python packages
+RUN pip install IPython orjson python-multipart torchao pybind11
+
+# Rebuild Triton
+RUN pip uninstall -y triton || true \
+    && git clone ${TRITON_REPO} \
+    && cd triton \
+    && git checkout ${TRITON_COMMIT} \
+    && cd python \
+    && python3 setup.py install \
+    && cd /sgl-workspace
+
+
+# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942 --amdgpu-lower-module-lds-strategy=1"
+# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+
+# Build aiter
+#version: Commit 9d11f47
+    # && git checkout ${AITER_COMMIT} \
+RUN pip uninstall -y aiter || true
+RUN git clone ${AITER_REPO} \
+    && cd aiter \
+    && git checkout ${AITER_COMMIT} \
+    && git submodule sync \
+    && git submodule update --init --recursive \
+    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py install \
+    && cd /sgl-workspace
+    # && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
+    # && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
+
+# Copy MI300X config 
+RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
+         /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
+         -type f -name '*MI300X*' | \
+         xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
+
+# Environment setup complete.
+RUN echo "Environment setup complete."
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+
+
+###########################################
+###############vllm v0.8.5#################
+###########################################
+# ENV GITHUB_USERNAME=yushengsu-thu
+# ENV GITHUB_MAIL=yushengsu@gmail.com
+
+# RUN git config --global user.name "${GITHUB_USERNAME}" \
+#     && git config --global user.email "${GITHUB_MAIL}" 
+
+WORKDIR /workspace/
+
+ENV VLLM_TARGET_DEVICE=rocm 
+ENV ROCM_PATH=/opt/rocm 
+ENV SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev
+
+# Find the repo path in: DockerFile/Dockerfile.rocm_yang
+# RUN git clone https://github.com/RLFoundation/vllm-patch.git
+RUN pip uninstall -y vllm || true
+RUN rm -rf vllm-patch
+RUN git clone https://github.com/RLFoundation/vllm-patch.git \
+    && cd vllm-patch \
+    && git checkout v0.8.5-sleep-numa \
+    && rm -rf build/ dist/ *.egg-info \
+    && ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so \
+    && SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py install
+    # RUN SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py develop
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+#########################################
+#### Install megatron-core###############
+#########################################
+RUN pip uninstall -y megatron-core && \
+    git clone https://github.com/yushengsu-thu/Megatron-LM-amd_version.git && \
+    cd Megatron-LM-amd_version && \
+    pip install -vvv -e . && \
+    cd /workspace/
+#########################################
+#########################################
+#########################################
+
+
+
+
+#######################################
+################apex###################
+#######################################
+WORKDIR /workspace/
+RUN pip uninstall -y apex && \
+    git clone https://github.com/ROCm/apex.git && \
+    cd apex && \
+    python setup.py install && \
+    cd /workspace/ 
+#######################################
+#######################################
+#######################################
+
+
+
+
+################################################################################
+###########################Add torch_memory_saver###############################
+################################################################################
+# Set environment variables
 ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
 ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
 ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
+RUN pip install "git+https://github.com/YangWang92/torch_memory_saver_numa.git@numa"
+################################################################################
+################################################################################
+################################################################################

-# Install vllm
-RUN pip uninstall -y vllm && \
-    rm -rf vllm && \
-    git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
-    cd vllm && \
-    MAX_JOBS=$(nproc) python3 setup.py install && \
-    cd .. && \
-    rm -rf vllm

-# Copy the entire project directory
-COPY . .

-# Install dependencies
+########################################
+######Install ray#######################
+########################################
+# need to add this patch: https://github.com/ray-project/ray/pull/53531/files
+RUN pip uninstall ray -y
+RUN pip install "ray[data,train,tune,serve]>=2.47.0" 
+########################################
+########################################
+########################################
+
+
+
+##########################################
+#######Install other dependencies#########
+##########################################
 RUN pip install "tensordict==0.6.2" --no-deps && \
    pip install accelerate \
    codetiming \
@@ -43,14 +301,21 @@ RUN pip install "tensordict==0.6.2" --no-deps && \
    peft \
    "pyarrow>=15.0.0" \
    pylatexenc \
-    "ray[data,train,tune,serve]<2.45.0" \
    torchdata \
-    transformers \
    wandb \
    orjson \
-    pybind11 && \
-    pip install -e . --no-deps && \
-    python setup.py bdist_wheel
+    pybind11
    
-# Install torch_memory_saver
-RUN pip install git+https://github.com/ExtremeViscent/torch_memory_saver.git --no-deps
+WORKDIR /workspace/
+RUN git clone https://github.com/volcengine/verl.git && \
+    cd verl && \
+    pip install -e . 
+##########################################
+##########################################
+##########################################
+
+
+
+WORKDIR /workspace/
+
+CMD ["/usr/bin/bash"]
--- a/docker/Dockerfile.rocm_verl-0.3.0.post1
+++ b/docker/Dockerfile.rocm_verl-0.3.0.post1
@@ -0,0 +1,58 @@
+#  Build the docker in the repo dir:
+# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
+# docker images # you can find your built docker
+
+
+# Support - Training: fsdp; Inference: vllm
+# FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+# Support - Training: fsdp; Inference: vllm, sglang
+FROM lmsysorg/sglang:v0.4.6.post5-rocm630
+
+# Set working directory
+# WORKDIR $PWD/app
+
+# Set environment variables
+ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+
+ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
+ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
+ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
+
+# Install vllm
+RUN pip uninstall -y vllm && \
+    rm -rf vllm && \
+    git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
+    cd vllm && \
+    MAX_JOBS=$(nproc) python3 setup.py install && \
+    cd .. && \
+    rm -rf vllm
+
+# Copy the entire project directory
+COPY . .
+
+# Install dependencies
+RUN pip install "tensordict==0.6.2" --no-deps && \
+    pip install accelerate \
+    codetiming \
+    datasets \
+    dill \
+    hydra-core \
+    liger-kernel \
+    numpy \
+    pandas \
+    peft \
+    "pyarrow>=15.0.0" \
+    pylatexenc \
+    "ray[data,train,tune,serve]<2.45.0" \
+    torchdata \
+    transformers \
+    wandb \
+    orjson \
+    pybind11
+    
+RUN git clone https://github.com/volcengine/verl.git && \
+    cd verl && \
+    pip install -e . 
+
+# Install torch_memory_saver
+RUN pip install git+https://github.com/ExtremeViscent/torch_memory_saver.git --no-deps
--- a/docker/Dockerfile.rocm_verl-0.4.1
+++ b/docker/Dockerfile.rocm_verl-0.4.1
@@ -0,0 +1,322 @@
+# FROM "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:94_ubuntu22.04_py3.10_pytorch_release-2.7_575e247"
+FROM "rlfoundation.azurecr.io/rocm6.3.4:vllm-0.8.5-numa-patch-ubuntu-22.04"
+
+SHELL ["/bin/bash", "-ceuxo", "pipefail"]
+
+ENV MAX_JOBS=512
+
+ENV PATH="/usr/local/python3.12/bin:$PATH"
+RUN ln -sf /usr/bin/python3.12 /usr/bin/python && \
+    ln -sf /usr/bin/pip3.12 /usr/bin/pip
+
+############################################
+############################################
+RUN apt-get update
+RUN apt-get install -y pkg-config liblzma-dev
+############################################
+############################################
+
+
+###########################################
+##########Install TransformerEngine########
+###########################################
+WORKDIR /workspace/
+# transformer-engine install
+# https://github.com/ROCm/TransformerEngine
+
+RUN rm -rf TransformerEngine 
+RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
+WORKDIR /workspace/TransformerEngine
+RUN git checkout 236178e5
+# git checkout bb061ade
+# git checkout 864405c
+
+ENV NVTE_FRAMEWORK=pytorch 
+ENV NVTE_ROCM_ARCH=gfx942 
+ENV NVTE_USE_HIPBLASLT=1
+ENV NVTE_USE_ROCM=1  
+
+# export CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr:${CMAKE_PREFIX_PATH:-}"
+ENV CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr"
+
+
+# ENV NVTE_BUILD_MAX_JOBS=$(MAX_JOBS)
+
+RUN MAX_JOBS=${MAX_JOBS} pip install . -vvv
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+
+####################################################################################
+################Install vllm - sglang require vllm 0.6.7 dependency#################
+####################################################################################
+#### Require vllm 0.6.7 - checkout 113274a0
+WORKDIR /workspace/
+RUN rm -rf vllm
+RUN pip uninstall -y vllm
+# Refer to here (down-grade vllm to 0.6.3): https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html
+RUN git clone https://github.com/ROCm/vllm.git
+# git clone https://github.com/vllm-project/vllm.git
+WORKDIR /workspace/vllm
+RUN git checkout 113274a0
+ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+#ENV MAX_JOBS=512
+ENV MAX_JOBS=${MAX_JOBS}
+RUN pip install "boto3>=1.26.0"
+RUN pip install setuptools_scm
+# will add src into py. You can delete the repo
+RUN python3 setup.py install
+WORKDIR /workspace/
+####################################################################################
+####################################################################################
+####################################################################################
+
+
+
+###########################################
+############For hack docker################
+###########################################
+RUN pip install setuptools==75.8.0
+###########################################
+###########################################
+###########################################
+
+
+
+###########################################
+############build sglang###################
+###########################################
+# Set environment variables
+ENV BASE_DIR=/sgl-workspace
+ENV BUILD_TYPE=all
+ENV SGL_REPO=https://github.com/sgl-project/sglang
+ENV SGL_BRANCH=v0.4.6.post5
+ENV TRITON_REPO=https://github.com/ROCm/triton.git
+ENV TRITON_COMMIT=improve_fa_decode_3.0.0
+ENV AITER_REPO=https://github.com/ROCm/aiter.git
+ENV AITER_COMMIT=v0.1.2
+# v0.1.2 version - commit id: 9d11f47
+# ENV AITER_COMMIT=9d11f47
+
+ENV HIP_FORCE_DEV_KERNARG=1
+ENV HSA_NO_SCRATCH_RECLAIM=1
+ENV SGLANG_SET_CPU_AFFINITY=1
+ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
+ENV NCCL_MIN_NCHANNELS=112
+ENV MOE_PADDING=1
+ENV VLLM_FP8_PADDING=1
+ENV VLLM_FP8_ACT_PADDING=1
+ENV VLLM_FP8_WEIGHT_PADDING=1
+ENV VLLM_FP8_REDUCE_CONV=1
+ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
+ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
+ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+ENV AMDGPU_TARGETS=gfx942
+ENV ROCM_ARCH=gfx942
+ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+
+# Switch to working directory
+WORKDIR /sgl-workspace
+
+# Clean and create directory
+RUN rm -rf /sgl-workspace && mkdir -p /sgl-workspace
+
+# Clone and build sglang
+RUN git clone ${SGL_REPO} \
+    && cd sglang \
+    && git checkout ${SGL_BRANCH} || echo "Using default branch" \
+    && cd sgl-kernel \
+    && rm -f pyproject.toml \
+    && mv pyproject_rocm.toml pyproject.toml \
+    && python setup_rocm.py install \
+    && cd .. \
+    && if [ "$BUILD_TYPE" = "srt" ]; then \
+         python -m pip --no-cache-dir install -e "python[srt_hip]"; \
+       else \
+         python -m pip --no-cache-dir install -e "python[all_hip]"; \
+       fi \
+    && cd /sgl-workspace \
+    && cp -r /sgl-workspace/sglang /sglang \
+    && python -m pip cache purge
+
+# Install common Python packages
+RUN pip install IPython orjson python-multipart torchao pybind11
+
+# Rebuild Triton
+RUN pip uninstall -y triton || true \
+    && git clone ${TRITON_REPO} \
+    && cd triton \
+    && git checkout ${TRITON_COMMIT} \
+    && cd python \
+    && python3 setup.py install \
+    && cd /sgl-workspace
+
+
+# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942 --amdgpu-lower-module-lds-strategy=1"
+# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+
+# Build aiter
+#version: Commit 9d11f47
+    # && git checkout ${AITER_COMMIT} \
+RUN pip uninstall -y aiter || true
+RUN git clone ${AITER_REPO} \
+    && cd aiter \
+    && git checkout ${AITER_COMMIT} \
+    && git submodule sync \
+    && git submodule update --init --recursive \
+    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py install \
+    && cd /sgl-workspace
+    # && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
+    # && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
+
+# Copy MI300X config 
+RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
+         /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
+         -type f -name '*MI300X*' | \
+         xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
+
+# Environment setup complete.
+RUN echo "Environment setup complete."
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+
+
+###########################################
+###############vllm v0.8.5#################
+###########################################
+# ENV GITHUB_USERNAME=yushengsu-thu
+# ENV GITHUB_MAIL=yushengsu@gmail.com
+
+# RUN git config --global user.name "${GITHUB_USERNAME}" \
+#     && git config --global user.email "${GITHUB_MAIL}" 
+
+WORKDIR /workspace/
+
+ENV VLLM_TARGET_DEVICE=rocm 
+ENV ROCM_PATH=/opt/rocm 
+ENV SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev
+
+# Find the repo path in: DockerFile/Dockerfile.rocm_yang
+# RUN git clone https://github.com/RLFoundation/vllm-patch.git
+RUN pip uninstall -y vllm || true
+RUN rm -rf vllm-patch
+RUN git clone https://github.com/RLFoundation/vllm-patch.git \
+    && cd vllm-patch \
+    && git checkout v0.8.5-sleep-numa \
+    && rm -rf build/ dist/ *.egg-info \
+    && ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so \
+    && SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py install
+    # RUN SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py develop
+
+WORKDIR /workspace/
+###########################################
+###########################################
+###########################################
+
+
+
+
+#########################################
+#### Install megatron-core###############
+#########################################
+RUN pip uninstall -y megatron-core && \
+    git clone https://github.com/yushengsu-thu/Megatron-LM-amd_version.git && \
+    cd Megatron-LM-amd_version && \
+    pip install -vvv -e . && \
+    cd /workspace/
+#########################################
+#########################################
+#########################################
+
+
+
+
+#######################################
+################apex###################
+#######################################
+WORKDIR /workspace/
+RUN pip uninstall -y apex && \
+    git clone https://github.com/ROCm/apex.git && \
+    cd apex && \
+    python setup.py install && \
+    cd /workspace/ 
+#######################################
+#######################################
+#######################################
+
+
+
+
+################################################################################
+###########################Add torch_memory_saver###############################
+################################################################################
+# Set environment variables
+ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
+ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
+ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
+RUN pip install "git+https://github.com/YangWang92/torch_memory_saver_numa.git@numa"
+################################################################################
+################################################################################
+################################################################################
+
+
+
+########################################
+######Install ray#######################
+########################################
+# need to add this patch: https://github.com/ray-project/ray/pull/53531/files
+RUN pip uninstall ray -y
+RUN pip install "ray[data,train,tune,serve]>=2.47.0" 
+########################################
+########################################
+########################################
+
+
+
+##########################################
+#######Install other dependencies#########
+##########################################
+RUN pip install "tensordict==0.6.2" --no-deps && \
+    pip install accelerate \
+    codetiming \
+    datasets \
+    dill \
+    hydra-core \
+    liger-kernel \
+    numpy \
+    pandas \
+    peft \
+    "pyarrow>=15.0.0" \
+    pylatexenc \
+    torchdata \
+    wandb \
+    orjson \
+    pybind11
+    
+WORKDIR /workspace/
+RUN git clone https://github.com/volcengine/verl.git && \
+    cd verl && \
+    pip install -e . 
+##########################################
+##########################################
+##########################################
+
+
+
+WORKDIR /workspace/
+
+CMD ["/usr/bin/bash"]
+CMD ["/usr/bin/bash"]
--- a/docs/amd_tutorial/amd_build_dockerfile_page.rst
+++ b/docs/amd_tutorial/amd_build_dockerfile_page.rst
@@ -1,7 +1,7 @@
 Getting started with AMD (ROCM Kernel)
 =====================================================

-Last updated: 06/02/2025.
+Last updated: 07/06/2025.

 Author: `Yusheng Su <https://yushengsu-thu.github.io/>`_

@@ -16,40 +16,267 @@ docker/Dockerfile.rocm

 .. code-block:: bash

-    # Build the docker in the repo dir:
-    # docker build -f docker/Dockerfile.rocm -t verl-rocm .
-    # docker images # you can find your built docker
+    FROM "rlfoundation.azurecr.io/rocm6.3.4:vllm-0.8.5-numa-patch-ubuntu-22.04"
+
+    SHELL ["/bin/bash", "-ceuxo", "pipefail"]
+
+    ENV MAX_JOBS=512
+
+    ENV PATH="/usr/local/python3.12/bin:$PATH"
+    RUN ln -sf /usr/bin/python3.12 /usr/bin/python && \
+        ln -sf /usr/bin/pip3.12 /usr/bin/pip
+
+    ############################################
+    RUN apt-get update
+    RUN apt-get install -y pkg-config liblzma-dev
+    ############################################
+
+    ###########################################
+    ##########Install TransformerEngine########
+    ###########################################
+    WORKDIR /workspace/
+    # transformer-engine install
+    # https://github.com/ROCm/TransformerEngine
+    RUN rm -rf TransformerEngine 
+    RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
+    WORKDIR /workspace/TransformerEngine
+    RUN git checkout 236178e5
+    # git checkout bb061ade
+    # git checkout 864405c
+    ENV NVTE_FRAMEWORK=pytorch 
+    ENV NVTE_ROCM_ARCH=gfx942 
+    ENV NVTE_USE_HIPBLASLT=1
+    ENV NVTE_USE_ROCM=1  
+    # export CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr:${CMAKE_PREFIX_PATH:-}"
+    ENV CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr"
+    RUN MAX_JOBS=${MAX_JOBS} pip install . -vvv
+    WORKDIR /workspace/
+    ###########################################
+    ###########################################
+    ###########################################


-    # Support - Training: fsdp; Inference: vllm
-    # FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
-    # Support - Training: fsdp; Inference: vllm, sglang
-    FROM lmsysorg/sglang:v0.4.6.post5-rocm630

-    # Set working directory
-    # WORKDIR $PWD/app

-    # Set environment variables
+
+    ####################################################################################
+    ################Install vllm - sglang require vllm 0.6.7 dependency#################
+    ####################################################################################
+    #### Require vllm 0.6.7 - checkout 113274a0
+    WORKDIR /workspace/
+    RUN rm -rf vllm
+    RUN pip uninstall -y vllm
+    # Refer to here (down-grade vllm to 0.6.3): https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html
+    RUN git clone https://github.com/ROCm/vllm.git
+    # git clone https://github.com/vllm-project/vllm.git
+    WORKDIR /workspace/vllm
+    RUN git checkout 113274a0
    ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+    #ENV MAX_JOBS=512
+    ENV MAX_JOBS=${MAX_JOBS}
+    RUN pip install "boto3>=1.26.0"
+    RUN pip install setuptools_scm
+    # will add src into py. You can delete the repo
+    RUN python3 setup.py install
+    WORKDIR /workspace/
+    ####################################################################################
+    ####################################################################################
+    ####################################################################################

+
+
+    ###########################################
+    ############For hack docker################
+    ###########################################
+    RUN pip install setuptools==75.8.0
+    ###########################################
+    ###########################################
+    ###########################################
+
+
+
+    ###########################################
+    ############build sglang###################
+    ###########################################
+    # Set environment variables
+    ENV BASE_DIR=/sgl-workspace
+    ENV BUILD_TYPE=all
+    ENV SGL_REPO=https://github.com/sgl-project/sglang
+    ENV SGL_BRANCH=v0.4.6.post5
+    ENV TRITON_REPO=https://github.com/ROCm/triton.git
+    ENV TRITON_COMMIT=improve_fa_decode_3.0.0
+    ENV AITER_REPO=https://github.com/ROCm/aiter.git
+    ENV AITER_COMMIT=v0.1.2
+    # v0.1.2 version - commit id: 9d11f47
+    # ENV AITER_COMMIT=9d11f47
+    ENV HIP_FORCE_DEV_KERNARG=1
+    ENV HSA_NO_SCRATCH_RECLAIM=1
+    ENV SGLANG_SET_CPU_AFFINITY=1
+    ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
+    ENV NCCL_MIN_NCHANNELS=112
+    ENV MOE_PADDING=1
+    ENV VLLM_FP8_PADDING=1
+    ENV VLLM_FP8_ACT_PADDING=1
+    ENV VLLM_FP8_WEIGHT_PADDING=1
+    ENV VLLM_FP8_REDUCE_CONV=1
+    ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
+    ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
+    ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+    ENV AMDGPU_TARGETS=gfx942
+    ENV ROCM_ARCH=gfx942
+    ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+    # Switch to working directory
+    WORKDIR /sgl-workspace
+    # Clean and create directory
+    RUN rm -rf /sgl-workspace && mkdir -p /sgl-workspace
+
+    # Clone and build sglang
+    RUN git clone ${SGL_REPO} \
+        && cd sglang \
+        && git checkout ${SGL_BRANCH} || echo "Using default branch" \
+        && cd sgl-kernel \
+        && rm -f pyproject.toml \
+        && mv pyproject_rocm.toml pyproject.toml \
+        && python setup_rocm.py install \
+        && cd .. \
+        && if [ "$BUILD_TYPE" = "srt" ]; then \
+            python -m pip --no-cache-dir install -e "python[srt_hip]"; \
+        else \
+            python -m pip --no-cache-dir install -e "python[all_hip]"; \
+        fi \
+        && cd /sgl-workspace \
+        && cp -r /sgl-workspace/sglang /sglang \
+        && python -m pip cache purge
+
+    # Install common Python packages
+    RUN pip install IPython orjson python-multipart torchao pybind11
+    # Rebuild Triton
+    RUN pip uninstall -y triton || true \
+        && git clone ${TRITON_REPO} \
+        && cd triton \
+        && git checkout ${TRITON_COMMIT} \
+        && cd python \
+        && python3 setup.py install \
+        && cd /sgl-workspace
+    # ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942 --amdgpu-lower-module-lds-strategy=1"
+    # ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
+
+    # Build aiter
+    #version: Commit 9d11f47
+        # && git checkout ${AITER_COMMIT} \
+    RUN pip uninstall -y aiter || true
+    RUN git clone ${AITER_REPO} \
+        && cd aiter \
+        && git checkout ${AITER_COMMIT} \
+        && git submodule sync \
+        && git submodule update --init --recursive \
+        && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py install \
+        && cd /sgl-workspace
+
+    # Copy MI300X config 
+    RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
+            /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
+            -type f -name '*MI300X*' | \
+            xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
+
+    # Environment setup complete.
+    RUN echo "Environment setup complete."
+
+    WORKDIR /workspace/
+    ###########################################
+    ###########################################
+    ###########################################
+
+
+
+
+
+
+    ###########################################
+    ###############vllm v0.8.5#################
+    ###########################################
+    WORKDIR /workspace/
+
+    ENV VLLM_TARGET_DEVICE=rocm 
+    ENV ROCM_PATH=/opt/rocm 
+    ENV SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev
+    # Find the repo path in: DockerFile/Dockerfile.rocm_yang
+    # RUN git clone https://github.com/RLFoundation/vllm-patch.git
+    RUN pip uninstall -y vllm || true
+    RUN rm -rf vllm-patch
+    RUN git clone https://github.com/RLFoundation/vllm-patch.git \
+        && cd vllm-patch \
+        && git checkout v0.8.5-sleep-numa \
+        && rm -rf build/ dist/ *.egg-info \
+        && ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so \
+        && SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py install
+        # RUN SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py develop
+    WORKDIR /workspace/
+    ###########################################
+    ###########################################
+    ###########################################
+
+
+
+
+    #########################################
+    #### Install megatron-core###############
+    #########################################
+    RUN pip uninstall -y megatron-core && \
+        git clone https://github.com/yushengsu-thu/Megatron-LM-amd_version.git && \
+        cd Megatron-LM-amd_version && \
+        pip install -vvv -e . && \
+        cd /workspace/
+    #########################################
+    #########################################
+    #########################################
+
+
+
+
+    #######################################
+    ################apex###################
+    #######################################
+    WORKDIR /workspace/
+    RUN pip uninstall -y apex && \
+        git clone https://github.com/ROCm/apex.git && \
+        cd apex && \
+        python setup.py install && \
+        cd /workspace/ 
+    #######################################
+    #######################################
+    #######################################
+
+
+    ################################################################################
+    ###########################Add torch_memory_saver###############################
+    ################################################################################
+    # Set environment variables
    ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
    ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
    ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
+    RUN pip install "git+https://github.com/YangWang92/torch_memory_saver_numa.git@numa"
+    ################################################################################
+    ################################################################################
+    ################################################################################

-    # Install vllm
-    RUN pip uninstall -y vllm && \
-        rm -rf vllm && \
-        git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
-        cd vllm && \
-        MAX_JOBS=$(nproc) python3 setup.py install && \
-        cd .. && \
-        rm -rf vllm

-    # Copy the entire project directory
-    COPY . .

-    # Install dependencies
-    RUN pip install "tensordict<0.6" --no-deps && \
+    ########################################
+    ######Install ray#######################
+    ########################################
+    # need to add this patch: https://github.com/ray-project/ray/pull/53531/files
+    RUN pip uninstall ray -y
+    RUN pip install "ray[data,train,tune,serve]>=2.47.0" 
+    ########################################
+    ########################################
+    ########################################
+
+
+    ##########################################
+    #######Install other dependencies#########
+    ##########################################
+    RUN pip install "tensordict==0.6.2" --no-deps && \
        pip install accelerate \
        codetiming \
        datasets \
@ -61,16 +288,21 @@ docker/Dockerfile.rocm
        peft \
        "pyarrow>=15.0.0" \
        pylatexenc \
-        "ray[data,train,tune,serve]>=2.45.0" \
        torchdata \
-        transformers \
        wandb \
        orjson \
-        pybind11 && \
-        pip install -e . --no-deps
+        pybind11
+        
+    WORKDIR /workspace/
+    RUN git clone https://github.com/volcengine/verl.git && \
+        cd verl && \
+        pip install -e . 
+    ##########################################
+    ##########################################
+    ##########################################

-    # Install torch_memory_saver
-    RUN pip install git+https://github.com/ExtremeViscent/torch_memory_saver.git --no-deps
+    WORKDIR /workspace/
+    CMD ["/usr/bin/bash"]


 Build the image:
@ -78,7 +310,20 @@ Build the image:

 .. code-block:: bash

-    docker build -t verl-rocm .
+    docker build -f docker/Dockerfile.rocm -t verl-rocm .
+
+Pull the image:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Note: Instead of building locally, you can pull a prebuilt image from DockerHub: `RLSys Foundation <https://hub.docker.com/u/yushengsuthu>`_.
+
+.. code-block:: bash
+
+    docker pull yushengsuthu/verl:verl-0.4.1_ubuntu-22.04_rocm6.3.4-numa-patch_vllm0.8.5_sglang0.4.6.post4
+
+    docker tag yushengsuthu/verl:verl-0.4.1_ubuntu-22.04_rocm6.3.4-numa-patch_vllm0.8.5_sglang0.4.6.post4 verl-rocm:latest

 Run the container
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -111,7 +356,7 @@ Example
 -------

 Due to a special setting in AMD (ROCm) torch,
-1. If your ``ray>=2.45.0`` (default), you need to set ``RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`` when starting ray in verl's RLHF training.
+1. If your ``ray>=2.45.0`` (default), you need to set ``RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`` when starting ray in verl's RLHF training and apply this `patch <https://github.com/ray-project/ray/pull/53531/files>`_.
 2. If your ``ray<2.45.0``, you need to set ``RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`` when starting ray in verl's RLHF training.
 Inference ``$ENGINE`` can be ``vllm`` or ``sglang``. We choose ``vllm`` as default in the following examples.
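
The version rule above can be sketched as a small helper. This is only an illustration, not part of verl or Ray: the function names ``ray_visibility_env_var`` and ``configure_ray_env`` are hypothetical, and the sketch assumes the version string begins with a plain numeric ``major.minor`` prefix. The value ``1`` mirrors the ``export ...=1`` lines used in the example scripts.

```python
import os


def ray_visibility_env_var(ray_version: str) -> str:
    """Return the env var name that the rule above pairs with a Ray version.

    Hypothetical helper: only the numeric (major, minor) prefix is compared,
    e.g. "2.45.0" -> (2, 45).
    """
    major, minor = (int(part) for part in ray_version.split(".")[:2])
    if (major, minor) >= (2, 45):
        return "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"
    return "RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"


def configure_ray_env(ray_version: str, env=os.environ) -> str:
    """Set the matching variable to "1" in `env` and return its name."""
    name = ray_visibility_env_var(ray_version)
    env[name] = "1"
    return name
```

In a launch script one might call ``configure_ray_env(ray.__version__)`` before starting Ray. Passing a plain ``dict`` as ``env`` makes the selection easy to check without touching the real environment.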

@ -126,6 +371,8 @@ PPO
    YOUR_RUN_NAME=r1-training_ppo-upstream 
    # export HYDRA_FULL_ERROR=1

+    export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+    
    # [ray] < 2.45.0
    #export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

@ -178,6 +425,8 @@ GRPO
    # export HYDRA_FULL_ERROR=1
    # export FSDP_VERBOSE=1 

+    #export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
    # [ray] < 2.45.0
    #export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

@ -304,6 +553,9 @@ slurm_script.sh
    export HSA_NO_SCRATCH_RECLAIM=1
    ##########################################################################

+    ## Select which GPUs to use
+    export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
    ### For rocm and training script
    # [ray] < 2.45.0
    #export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
--- a/pyproject.toml
+++ b/pyproject.toml
@ -19,7 +19,7 @@ name = "verl"
 dynamic = ["version", "dependencies", "optional-dependencies", "authors", "urls"]

 description = "verl: Volcano Engine Reinforcement Learning for LLM"
-license = {file = "LICENSE"}  # or "Apache-2.0", if you prefer an SPDX identifier
+license = {text = "Apache-2.0"}  # Changed from file to text format
 readme = {file = "README.md", content-type = "text/markdown"}
 requires-python = ">=3.8"

--- a/verl/single_controller/base/worker.py
+++ b/verl/single_controller/base/worker.py
@ -218,6 +218,7 @@ class Worker(WorkerHelper):
            else:
                cuda_val = val
                os.environ["CUDA_VISIBLE_DEVICES"] = val
+                # os.environ["HIP_VISIBLE_DEVICES"] = val

        if rocr_val:
            # You must take care if both HIP/CUDA and ROCR env vars are set as they have