mirror of
https://github.com/volcengine/verl.git
synced 2025-10-20 13:43:50 +08:00
### What does this PR do? > Update the resource in `Dockerfile.rocm` ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > docker build -f Dockerfile.rocm -t verl-rocm:local . ``` docker run --rm -it verl-rocm:local python -c "import torch; print('ok')" ``` ### Design & Code Changes > Update the resource in `Dockerfile.rocm` ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
323 lines
10 KiB
Docker
323 lines
10 KiB
Docker
# FROM "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:94_ubuntu22.04_py3.10_pytorch_release-2.7_575e247"
|
|
# FROM "rlfoundation.azurecr.io/rocm6.3.4:vllm-0.8.5-numa-patch-ubuntu-22.04"
|
|
FROM "rlsys/rocm-6.3.4-patch:rocm6.3.4-numa-patch_ubuntu-22.04"
|
|
|
|
SHELL ["/bin/bash", "-ceuxo", "pipefail"]
|
|
|
|
ENV MAX_JOBS=512
|
|
|
|
ENV PATH="/usr/local/python3.12/bin:$PATH"
|
|
RUN ln -sf /usr/bin/python3.12 /usr/bin/python && \
|
|
ln -sf /usr/bin/pip3.12 /usr/bin/pip
|
|
|
|
############################################
|
|
############################################
|
|
RUN apt-get update
|
|
RUN apt-get install -y pkg-config liblzma-dev
|
|
############################################
|
|
############################################
|
|
|
|
|
|
###########################################
|
|
##########Install TransformerEngine########
|
|
###########################################
|
|
WORKDIR /workspace/
|
|
# transformer-engine install
|
|
# https://github.com/ROCm/TransformerEngine
|
|
|
|
RUN rm -rf TransformerEngine
|
|
RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
|
|
WORKDIR /workspace/TransformerEngine
|
|
RUN git checkout 236178e5
|
|
# git checkout bb061ade
|
|
# git checkout 864405c
|
|
|
|
ENV NVTE_FRAMEWORK=pytorch
|
|
ENV NVTE_ROCM_ARCH=gfx942
|
|
ENV NVTE_USE_HIPBLASLT=1
|
|
ENV NVTE_USE_ROCM=1
|
|
|
|
# export CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr:${CMAKE_PREFIX_PATH:-}"
|
|
ENV CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr"
|
|
|
|
|
|
# ENV NVTE_BUILD_MAX_JOBS=$(MAX_JOBS)
|
|
|
|
RUN MAX_JOBS=$(MAX_JOBS) pip install . -vvv
|
|
|
|
WORKDIR /workspace/
|
|
###########################################
|
|
###########################################
|
|
###########################################
|
|
|
|
|
|
|
|
|
|
|
|
####################################################################################
|
|
################Install vllm - sglang require vllm 0.6.7 dependency#################
|
|
####################################################################################
|
|
#### Require vllm 0.6.7 - checkout 113274a0
|
|
WORKDIR /workspace/
|
|
RUN rm -rf vllm
|
|
RUN pip uninstall -y vllm
|
|
# Refer to here (down-grade vllm to 0.6.3): https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html
|
|
RUN git clone https://github.com/ROCm/vllm.git
|
|
# git clone https://github.com/vllm-project/vllm.git
|
|
WORKDIR /workspace/vllm
|
|
RUN git checkout 113274a0
|
|
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
|
#ENV MAX_JOBS=512
|
|
ENV MAX_JOBS=${MAX_JOBS}
|
|
RUN pip install "boto3>=1.26.0"
|
|
RUN pip install setuptools_scm
|
|
# will add src into py. You can delete the repo
|
|
RUN python3 setup.py install
|
|
WORKDIR /workspace/
|
|
####################################################################################
|
|
####################################################################################
|
|
####################################################################################
|
|
|
|
|
|
|
|
###########################################
|
|
############For hack docker################
|
|
###########################################
|
|
RUN pip install setuptools==75.8.0
|
|
###########################################
|
|
###########################################
|
|
###########################################
|
|
|
|
|
|
|
|
###########################################
|
|
############build sgalng###################
|
|
###########################################
|
|
# Set environment variables
|
|
ENV BASE_DIR=/sgl-workspace
|
|
ENV BUILD_TYPE=all
|
|
ENV SGL_REPO=https://github.com/sgl-project/sglang
|
|
ENV SGL_BRANCH=v0.4.6.post5
|
|
ENV TRITON_REPO=https://github.com/ROCm/triton.git
|
|
ENV TRITON_COMMIT=improve_fa_decode_3.0.0
|
|
ENV AITER_REPO=https://github.com/ROCm/aiter.git
|
|
ENV AITER_COMMIT=v0.1.2
|
|
# v0.1.2 version - commit id: 9d11f47
|
|
# ENV AITER_COMMIT=9d11f47
|
|
|
|
ENV HIP_FORCE_DEV_KERNARG=1
|
|
ENV HSA_NO_SCRATCH_RECLAIM=1
|
|
ENV SGLANG_SET_CPU_AFFINITY=1
|
|
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
|
|
ENV NCCL_MIN_NCHANNELS=112
|
|
ENV MOE_PADDING=1
|
|
ENV VLLM_FP8_PADDING=1
|
|
ENV VLLM_FP8_ACT_PADDING=1
|
|
ENV VLLM_FP8_WEIGHT_PADDING=1
|
|
ENV VLLM_FP8_REDUCE_CONV=1
|
|
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
|
|
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
|
|
ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
|
|
ENV AMDGPU_TARGETS=gfx942
|
|
ENV ROCM_ARCH=gfx942
|
|
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
|
|
|
# Switch to working directory
|
|
WORKDIR /sgl-workspace
|
|
|
|
# Clean and create directory
|
|
RUN rm -rf /sgl-workspace && mkdir -p /sgl-workspace
|
|
|
|
# Clone and build sglang
|
|
RUN git clone ${SGL_REPO} \
|
|
&& cd sglang \
|
|
&& git checkout ${SGL_BRANCH} || echo "Using default branch" \
|
|
&& cd sgl-kernel \
|
|
&& rm -f pyproject.toml \
|
|
&& mv pyproject_rocm.toml pyproject.toml \
|
|
&& python setup_rocm.py install \
|
|
&& cd .. \
|
|
&& if [ "$BUILD_TYPE" = "srt" ]; then \
|
|
python -m pip --no-cache-dir install -e "python[srt_hip]"; \
|
|
else \
|
|
python -m pip --no-cache-dir install -e "python[all_hip]"; \
|
|
fi \
|
|
&& cd /sgl-workspace \
|
|
&& cp -r /sgl-workspace/sglang /sglang \
|
|
&& python -m pip cache purge
|
|
|
|
# Install common Python packages
|
|
RUN pip install IPython orjson python-multipart torchao pybind11
|
|
|
|
# Rebuild Triton
|
|
RUN pip uninstall -y triton || true \
|
|
&& git clone ${TRITON_REPO} \
|
|
&& cd triton \
|
|
&& git checkout ${TRITON_COMMIT} \
|
|
&& cd python \
|
|
&& python3 setup.py install \
|
|
&& cd /sgl-workspace
|
|
|
|
|
|
# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942 --amdgpu-lower-module-lds-strategy=1"
|
|
# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
|
|
|
|
# Build aiter
|
|
#version: Commit 9d11f47
|
|
# && git checkout ${AITER_COMMIT} \
|
|
RUN pip uninstall -y aiter || true
|
|
RUN git clone ${AITER_REPO} \
|
|
&& cd aiter \
|
|
&& git checkout ${AITER_COMMIT} \
|
|
&& git submodule sync \
|
|
&& git submodule update --init --recursive \
|
|
&& PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py install \
|
|
&& cd /sgl-workspace
|
|
# && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
|
|
# && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop \
|
|
|
|
# Copy MI300X config
|
|
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
|
|
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
|
|
-type f -name '*MI300X*' | \
|
|
xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
|
|
|
|
# Environment setup complete.
|
|
RUN echo "Environment setup complete."
|
|
|
|
WORKDIR /workspace/
|
|
###########################################
|
|
###########################################
|
|
###########################################
|
|
|
|
|
|
|
|
|
|
|
|
|
|
###########################################
|
|
###############vllm v0.8.5#################
|
|
###########################################
|
|
# ENV GITHUB_USERNAME=yushengsu-thu
|
|
# ENV GITHUB_MAIL=yushengsu@gmail.com
|
|
|
|
# RUN git config --global user.name "${GITHUB_USERNAME}" \
|
|
# && git config --global user.email "${GITHUB_MAIL}"
|
|
|
|
WORKDIR /workspace/
|
|
|
|
ENV VLLM_TARGET_DEVICE=rocm
|
|
ENV ROCM_PATH=/opt/rocm
|
|
ENV SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev
|
|
|
|
# Find the repo path in: DockerFile/Dockerfile.rocm_yang
|
|
# RUN git clone https://github.com/RLFoundation/vllm-patch.git
|
|
RUN pip uninstall -y vllm || true
|
|
RUN rm -rf vllm-patch
|
|
RUN git clone https://github.com/RLFoundation/vllm-patch.git \
|
|
&& cd vllm-patch \
|
|
&& git checkout v0.8.5-sleep-numa \
|
|
&& rm -rf build/ dist/ *.egg-info \
|
|
&& ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so \
|
|
&& SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py install
|
|
# RUN SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py develop
|
|
|
|
WORKDIR /workspace/
|
|
###########################################
|
|
###########################################
|
|
###########################################
|
|
|
|
|
|
|
|
|
|
#########################################
|
|
#### Install megatron-core###############
|
|
#########################################
|
|
RUN pip uninstall -y megatron-core && \
|
|
git clone https://github.com/yushengsu-thu/Megatron-LM-amd_version.git && \
|
|
cd Megatron-LM-amd_version && \
|
|
pip install -vvv -e . && \
|
|
cd /workspace/
|
|
#########################################
|
|
#########################################
|
|
#########################################
|
|
|
|
|
|
|
|
|
|
#######################################
|
|
################apex###################
|
|
#######################################
|
|
WORKDIR /workspace/
|
|
RUN pip uninstall -y apex && \
|
|
git clone https://github.com/ROCm/apex.git && \
|
|
cd apex && \
|
|
python setup.py install && \
|
|
cd /workspace/
|
|
#######################################
|
|
#######################################
|
|
#######################################
|
|
|
|
|
|
|
|
|
|
################################################################################
|
|
###########################Add torch_memory_saver###############################
|
|
################################################################################
|
|
# Set environment variables
|
|
ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
|
|
ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
|
|
ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
|
|
RUN pip install "git+https://github.com/YangWang92/torch_memory_saver_numa.git@numa"
|
|
################################################################################
|
|
################################################################################
|
|
################################################################################
|
|
|
|
|
|
|
|
########################################
|
|
######Install ray#######################
|
|
########################################
|
|
# need to add this patch: https://github.com/ray-project/ray/pull/53531/files
|
|
RUN pip uninstall ray -y
|
|
RUN pip install "ray[data,train,tune,serve]>=2.47.0"
|
|
########################################
|
|
########################################
|
|
########################################
|
|
|
|
|
|
|
|
##########################################
|
|
#######Install other dependencies#########
|
|
##########################################
|
|
RUN pip install "tensordict==0.6.2" --no-deps && \
|
|
pip install accelerate \
|
|
codetiming \
|
|
datasets \
|
|
dill \
|
|
hydra-core \
|
|
liger-kernel \
|
|
numpy \
|
|
pandas \
|
|
peft \
|
|
"pyarrow>=15.0.0" \
|
|
pylatexenc \
|
|
torchdata \
|
|
wandb \
|
|
orjson \
|
|
pybind11
|
|
|
|
WORKDIR /workspace/
|
|
RUN git clone https://github.com/volcengine/verl.git && \
|
|
cd verl && \
|
|
pip install -e .
|
|
##########################################
|
|
##########################################
|
|
##########################################
|
|
|
|
|
|
|
|
WORKDIR /workspace/
|
|
|
|
CMD ["/usr/bin/bash"]
|