Mirror of https://github.com/volcengine/verl.git, synced 2025-10-20 13:43:50 +08:00
# [docker] fix: downgrade TransformerEngine to 2.2.1 to allow the mcore image to use RoPE fusion, and provide another set of v0.5 images (#2611)

### What does this PR do?

Downgrade the TransformerEngine version so that the mcore images can use RoPE fusion, and provide another set of v0.5 images.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, e.g. `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
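Since the whole point of the change is the version pins, a quick way to confirm a pulled image actually carries them is to query pip inside the container. This is a minimal sketch, assuming the standard distribution names `transformer-engine` and `megatron-core` that the Dockerfiles in this commit install:

```bash
# Minimal sanity check, run inside one of the updated containers.
# Expected after this PR: TransformerEngine 2.2.1 (downgraded so the
# mcore image can use RoPE fusion) and Megatron core 0.12.2 (bumped).
pip show transformer-engine | grep -i '^version'
pip show megatron-core | grep -i '^version'
```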
.github/workflows/README.md (vendored, 2 changed lines)

@@ -31,7 +31,7 @@ permissions:
   contents: read

 env:
-  IMAGE: "your vemlp image" # e.g. "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
+  IMAGE: "your vemlp image" # e.g. "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2"
   DYNAMIC_RUNNER_URL: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner" # public veFaas api

 jobs:
.github/workflows/checkpoint_converter.yml (vendored, 4 changed lines)

@@ -84,7 +84,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -114,7 +114,7 @@ jobs:
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
       HF_ENDPOINT: "https://hf-mirror.com"
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/e2e_dapo.yml (vendored, 2 changed lines)

@@ -94,7 +94,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/e2e_eval_aime24.yml (vendored, 2 changed lines)

@@ -88,7 +88,7 @@ permissions:
   contents: read

 env:
-  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
+  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2"
   DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

 jobs:
.github/workflows/e2e_ppo_trainer.yml (vendored, 12 changed lines)

@@ -87,7 +87,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -223,7 +223,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=50g # Visual dataloader requires large memory
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -280,7 +280,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -319,7 +319,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -352,7 +352,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=50g # Visual dataloader requires large memory
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -409,7 +409,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -85,7 +85,7 @@ permissions:
   contents: read

 env:
-  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1"
+  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2"
   DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

 jobs:

@@ -85,7 +85,7 @@ permissions:
   contents: read

 env:
-  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
+  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2"
   DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

 jobs:
.github/workflows/e2e_spin.yml (vendored, 2 changed lines)

@@ -68,7 +68,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/e2e_sppo.yml (vendored, 2 changed lines)

@@ -66,7 +66,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/gpu_unit_tests.yml (vendored, 2 changed lines)

@@ -80,7 +80,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/model.yml (vendored, 2 changed lines)

@@ -71,7 +71,7 @@ jobs:
       HF_ENDPOINT: "https://hf-mirror.com"
       HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
.github/workflows/sgl.yml (vendored, 2 changed lines)

@@ -88,7 +88,7 @@ jobs:
       HF_HUB_ENABLE_HF_TRANSFER: 1
       SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
     container:
-      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
+      image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -1,6 +1,6 @@
 # Base Image support aws EFA
 # Build Image with frameworks based on this
-FROM verlai/verl:app-verl0.5-sglang0.4.6.post5-mcore0.12.1
+FROM verlai/verl:app-verl0.5-sglang0.4.6.post5-mcore0.12.2

 # For aws instances with EFA net interface (Sagemaker AI Pod)
 # install EFA driver:
@@ -97,7 +97,7 @@ RUN git clone https://github.com/NVIDIA/apex.git && \
 RUN export NVTE_FRAMEWORK=pytorch && pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/TransformerEngine.git@v2.3

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Fix opencv
 RUN pip install opencv-python
@@ -26,12 +26,12 @@ From this version, we divide images built for vLLM and SGLang as the divergence

 There are four types of application images available:

-- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``
-- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1``
-- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
-- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
+- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2``, with Deep-EP support: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2-deepep``.
+- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2`` (needs vLLM support, but can have some package conflicts), with Deep-EP support: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2-deepep``.
+- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``
+- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``

-For Megatron 0.13.0, we offer preview images, to use latest codes, just replace ``mcore0.12.1`` with ``mcore0.13.0-preview`` in the above image tag.
+For Megatron 0.13.0, we offer preview images; to use the latest code, just replace ``mcore0.12.2`` with ``mcore0.13.0-preview`` in the above image tag.

 The latest vLLM support is coming soon.
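To use one of the retagged v0.4 application images outside CI, a typical pull-and-run sequence looks like the sketch below; the tag comes from the list above, while the mount path is illustrative and the `--gpus`/`--shm-size` options mirror what the CI workflows in this commit pass (10g, or 50g for visual dataloaders):

```bash
# Pull the updated vLLM + Megatron application image (tag from the list above).
docker pull verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2

# Start an interactive container with all GPUs and an enlarged shared-memory
# segment, mirroring the container options used by the CI workflows here.
docker run --rm -it --gpus all --shm-size=10g \
    -v "$PWD":/workspace/verl \
    verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2 \
    bash
```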
@@ -29,10 +29,10 @@ RUN pip uninstall -y pynvml nvidia-ml-py && \
 RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

 # Install TransformerEngine
-RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Fix for transformers 4.53.0
 RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
@@ -29,10 +29,10 @@ RUN pip uninstall -y pynvml nvidia-ml-py && \
 RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

 # Install TransformerEngine
-RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Fix for transformers 4.53.0
 RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
@@ -10,7 +10,7 @@ ENV NODE_OPTIONS=""
 ENV PIP_ROOT_USER_ACTION=ignore
 ENV HF_HUB_ENABLE_HF_TRANSFER="1"

-# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
+# Install torch-2.6.0+cu124 + vllm-0.8.5.post1
 # torch-2.6.0+cu124: cxx11abi=False
 # torch-2.6.0+cu126: cxx11abi=True
 # see https://github.com/flashinfer-ai/flashinfer/issues/911
@@ -35,10 +35,10 @@ RUN pip uninstall -y pynvml nvidia-ml-py && \
 RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

 # Install TransformerEngine
-RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Fix for transformers 4.53.0
 RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
@@ -10,7 +10,7 @@ ENV NODE_OPTIONS=""
 ENV PIP_ROOT_USER_ACTION=ignore
 ENV HF_HUB_ENABLE_HF_TRANSFER="1"

-# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
+# Install torch-2.6.0+cu124 + vllm-0.8.5.post1
 # torch-2.6.0+cu124: cxx11abi=False
 # torch-2.6.0+cu126: cxx11abi=True
 # see https://github.com/flashinfer-ai/flashinfer/issues/911
@@ -35,10 +35,10 @@ RUN pip uninstall -y pynvml nvidia-ml-py && \
 RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

 # Install TransformerEngine
-RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Fix for transformers 4.53.0
 RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
@@ -10,7 +10,7 @@ ENV NODE_OPTIONS=""
 ENV PIP_ROOT_USER_ACTION=ignore
 ENV HF_HUB_ENABLE_HF_TRANSFER="1"

-# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
+# Install torch-2.6.0+cu124 + vllm-0.8.5.post1
 # torch-2.6.0+cu124: cxx11abi=False
 # torch-2.6.0+cu126: cxx11abi=True
 # see https://github.com/flashinfer-ai/flashinfer/issues/911
@@ -38,7 +38,7 @@ RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
 RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Install mbridge
 RUN pip3 install --no-cache-dir mbridge
@@ -11,7 +11,7 @@ sglang==0.4.6.post5
 vllm==0.8.5.post1
 nvidia-cudnn-cu12==9.8.0.87
 transformer_engine==2.3
-megatron.core==core_v0.12.1
+megatron.core==core_v0.12.2
 # Preview
 transformer_engine==2.5
 megatron.core==core_r0.13.0
@@ -22,10 +22,10 @@ megatron.core==core_r0.13.0
 - Base image:
   - `verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4`
 - App image:
-  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`: SGLang requires vLLM in 0.4.6.post5 version, vLLM can have some package conflicts with SGLang
-  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1-deepep`: Built with deepep
-  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1`
-  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1-deepep`: Built with deepep
+  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2`: SGLang requires vLLM in 0.4.6.post5 version, vLLM can have some package conflicts with SGLang
+  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2-deepep`: Built with deepep
+  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2`
+  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2-deepep`: Built with deepep
 - Preview image:
-  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.13.0-preview`
-  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview`
+  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.13.0-te2.2-preview`
+  - `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-te2.2-preview`
@@ -0,0 +1,37 @@
+# Start from the verl base image
+# Dockerfile.base
+FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4
+
+# Define environments
+ENV MAX_JOBS=8
+ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
+ENV DEBIAN_FRONTEND=noninteractive
+ENV NODE_OPTIONS=""
+ENV PIP_ROOT_USER_ACTION=ignore
+ENV HF_HUB_ENABLE_HF_TRANSFER="1"
+
+# Install sglang-0.4.8 and torch-memory-saver
+# Install FlashInfer Python package
+RUN pip install --upgrade pip setuptools packaging
+RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
+RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
+
+# Fix packages
+RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.52.3" accelerate datasets peft hf-transfer \
+    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
+    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
+    pytest py-spy pyext pre-commit ruff
+
+RUN pip uninstall -y pynvml nvidia-ml-py && \
+    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
+
+RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
+
+# Install TransformerEngine
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1
+
+# Install Megatron-LM
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2
+
+# Install mbridge
+RUN pip3 install --no-cache-dir mbridge
@@ -0,0 +1,34 @@
+# Start from the verl base image
+# Dockerfile.base
+FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4
+
+# Define environments
+ENV MAX_JOBS=32
+ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
+ENV DEBIAN_FRONTEND=noninteractive
+ENV NODE_OPTIONS=""
+ENV PIP_ROOT_USER_ACTION=ignore
+ENV HF_HUB_ENABLE_HF_TRANSFER="1"
+
+# Install torch-2.7.0+cu126 + vllm-0.9.1
+RUN pip install --resume-retries 999 --no-cache-dir vllm==0.9.1
+
+# Fix packages
+RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
+    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
+    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
+    pytest py-spy pyext pre-commit ruff
+
+RUN pip uninstall -y pynvml nvidia-ml-py && \
+    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
+
+RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
+
+# Install TransformerEngine
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1
+
+# Install Megatron-LM
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2
+
+# Install mbridge
+RUN pip3 install --no-cache-dir mbridge
docker/verl0.5-cu126-torch2.7-fa2.7.4/Dockerfile.base.torch2.7.0 (new file, 133 lines)

@@ -0,0 +1,133 @@
+# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
+# Target: verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
+# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
+FROM nvcr.io/nvidia/pytorch:24.08-py3
+
+# Define environments
+ENV MAX_JOBS=16
+ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
+ENV DEBIAN_FRONTEND=noninteractive
+ENV NODE_OPTIONS=""
+ENV PIP_ROOT_USER_ACTION=ignore
+ENV HF_HUB_ENABLE_HF_TRANSFER="1"
+
+# Define installation arguments
+ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
+ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Set apt source
+RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
+    { \
+    echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
+    } > /etc/apt/sources.list
+
+# Install systemctl
+RUN apt-get update && \
+    apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
+    apt-get clean
+
+# Install tini
+RUN apt-get update && \
+    apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
+    apt-get clean
+
+# Change pip source
+RUN pip config set global.index-url "${PIP_INDEX}" && \
+    pip config set global.extra-index-url "${PIP_INDEX}" && \
+    python -m pip install --upgrade pip
+
+# Uninstall nv-pytorch fork
+RUN pip uninstall -y torch torchvision torchaudio \
+    pytorch-quantization pytorch-triton torch-tensorrt \
+    xgboost transformer_engine flash_attn apex megatron-core grpcio
+
+RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0
+
+# Install flash-attn-2.7.4.post1, although built with torch2.6, it is compatible with torch2.7
+# https://github.com/Dao-AILab/flash-attention/issues/1644#issuecomment-2899396361
+RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
+    URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
+    FILE="flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
+    wget -nv "${URL}" && \
+    pip install --no-cache-dir "${FILE}"
+
+# Fix packages
+RUN pip uninstall -y pynvml nvidia-ml-py && \
+    pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
+
+# Install cudnn
+RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
+    dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
+    cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
+    apt-get update && \
+    apt-get -y install cudnn-cuda-12 && \
+    rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
+
+# Install Apex
+RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
+
+# Profiling tools
+RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
+    apt-get update && apt-get install -y libxcb-cursor0
+
+RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
+    rm -rf /usr/local/cuda/bin/nsys && \
+    ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
+    rm -rf /usr/local/cuda/bin/nsys-ui && \
+    ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
+    rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
+
+RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
+    "numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
+    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
+    pytest py-spy pyext pre-commit ruff
+
+# Install DeepEP
+## the dependency of IBGDA
+RUN ln -s /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so
+
+## Clone and build deepep and deepep-nvshmem
+RUN git clone -b v2.3.1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone https://github.com/deepseek-ai/DeepEP.git && \
+    cd DeepEP && git checkout a84a248
+
+# Prepare nvshmem
+RUN wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz && \
+    tar -xvf nvshmem_src_3.2.5-1.txz && mv nvshmem_src deepep-nvshmem && \
+    cd deepep-nvshmem && git apply ../DeepEP/third-party/nvshmem.patch
+
+ENV CUDA_HOME=/usr/local/cuda
+### Set MPI environment variables. Having errors when not set.
+ENV CPATH=/usr/local/mpi/include:$CPATH
+ENV LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH
+ENV LD_LIBRARY_PATH=/usr/local/x86_64-linux-gnu:$LD_LIBRARY_PATH
+ENV GDRCOPY_HOME=/workspace/gdrcopy
+
+## Build deepep-nvshmem
+RUN cd deepep-nvshmem && \
+    NVSHMEM_SHMEM_SUPPORT=0 \
+    NVSHMEM_UCX_SUPPORT=0 \
+    NVSHMEM_USE_NCCL=0 \
+    NVSHMEM_MPI_SUPPORT=0 \
+    NVSHMEM_IBGDA_SUPPORT=1 \
+    NVSHMEM_PMIX_SUPPORT=0 \
+    NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
+    NVSHMEM_USE_GDRCOPY=1 \
+    cmake -G Ninja -S . -B build/ -DCMAKE_INSTALL_PREFIX=/workspace/deepep-nvshmem/install && cmake --build build/ --target install
+
+ENV NVSHMEM_DIR=/workspace/deepep-nvshmem/install
+ENV LD_LIBRARY_PATH=$NVSHMEM_DIR/lib:$LD_LIBRARY_PATH
+ENV PATH=$NVSHMEM_DIR/bin:$PATH
+
+## Build deepep
+RUN cd DeepEP && \
+    python setup.py install
+
+# Reset pip config
+RUN pip config unset global.index-url && \
+    pip config unset global.extra-index-url
docker/verl0.5-cu126-torch2.7-fa2.7.4/Dockerfile.base.torch2.7.1 (new file, 133 lines)

@@ -0,0 +1,133 @@
+# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
+# Target: verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
+# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
+FROM nvcr.io/nvidia/pytorch:24.08-py3
+
+# Define environments
+ENV MAX_JOBS=16
+ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
+ENV DEBIAN_FRONTEND=noninteractive
+ENV NODE_OPTIONS=""
+ENV PIP_ROOT_USER_ACTION=ignore
+ENV HF_HUB_ENABLE_HF_TRANSFER="1"
+
+# Define installation arguments
+ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
+ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Set apt source
+RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
+    { \
+    echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
+    echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
+    } > /etc/apt/sources.list
+
+# Install systemctl
+RUN apt-get update && \
+    apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
+    apt-get clean
+
+# Install tini
+RUN apt-get update && \
+    apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
+    apt-get clean
+
+# Change pip source
+RUN pip config set global.index-url "${PIP_INDEX}" && \
+    pip config set global.extra-index-url "${PIP_INDEX}" && \
+    python -m pip install --upgrade pip
+
+# Uninstall nv-pytorch fork
+RUN pip uninstall -y torch torchvision torchaudio \
+    pytorch-quantization pytorch-triton torch-tensorrt \
+    xgboost transformer_engine flash_attn apex megatron-core grpcio
+
+RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
+
+# Install flash-attn-2.7.4.post1, although built with torch2.6, it is compatible with torch2.7
+# https://github.com/Dao-AILab/flash-attention/issues/1644#issuecomment-2899396361
+RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
+    URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
+    FILE="flash_attn-2.7.4.post1+cu12torch2.6cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
+    wget -nv "${URL}" && \
+    pip install --no-cache-dir "${FILE}"
+
+# Fix packages
+RUN pip uninstall -y pynvml nvidia-ml-py && \
+    pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
+
+# Install cudnn
+RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
+    dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
+    cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
+    apt-get update && \
+    apt-get -y install cudnn-cuda-12 && \
+    rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
+
+# Install Apex
+RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
+
+# Profiling tools
+RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
+    apt-get update && apt-get install -y libxcb-cursor0
+
+RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
+    rm -rf /usr/local/cuda/bin/nsys && \
+    ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
+    rm -rf /usr/local/cuda/bin/nsys-ui && \
+    ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
+    rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
+
+RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.52.3" accelerate datasets peft hf-transfer \
+    "numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
+    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
+    pytest py-spy pyext pre-commit ruff
+
+# Install DeepEP
+## the dependency of IBGDA
+RUN ln -s /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so
+
+## Clone and build deepep and deepep-nvshmem
+RUN git clone -b v2.3.1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone https://github.com/deepseek-ai/DeepEP.git && \
+    cd DeepEP && git checkout a84a248
+
+# Prepare nvshmem
+RUN wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz && \
+    tar -xvf nvshmem_src_3.2.5-1.txz && mv nvshmem_src deepep-nvshmem && \
+    cd deepep-nvshmem && git apply ../DeepEP/third-party/nvshmem.patch
+
+ENV CUDA_HOME=/usr/local/cuda
+### Set MPI environment variables. Having errors when not set.
+ENV CPATH=/usr/local/mpi/include:$CPATH
+ENV LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH
+ENV LD_LIBRARY_PATH=/usr/local/x86_64-linux-gnu:$LD_LIBRARY_PATH
+ENV GDRCOPY_HOME=/workspace/gdrcopy
+
+## Build deepep-nvshmem
+RUN cd deepep-nvshmem && \
+    NVSHMEM_SHMEM_SUPPORT=0 \
+    NVSHMEM_UCX_SUPPORT=0 \
+    NVSHMEM_USE_NCCL=0 \
+    NVSHMEM_MPI_SUPPORT=0 \
+    NVSHMEM_IBGDA_SUPPORT=1 \
+    NVSHMEM_PMIX_SUPPORT=0 \
+    NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
+    NVSHMEM_USE_GDRCOPY=1 \
+    cmake -G Ninja -S . -B build/ -DCMAKE_INSTALL_PREFIX=/workspace/deepep-nvshmem/install && cmake --build build/ --target install
+
+ENV NVSHMEM_DIR=/workspace/deepep-nvshmem/install
+ENV LD_LIBRARY_PATH=$NVSHMEM_DIR/lib:$LD_LIBRARY_PATH
+ENV PATH=$NVSHMEM_DIR/bin:$PATH
+
+## Build deepep
+RUN cd DeepEP && \
+    python setup.py install
+
+# Reset pip config
+RUN pip config unset global.index-url && \
+    pip config unset global.extra-index-url
docker/verl0.5-cu126-torch2.7-fa2.7.4/README.md (new file, 27 lines)

@@ -0,0 +1,27 @@
+# verl image with verl v0.5
+
+## Important packages version
+
+```txt
+cuda==12.6
+cudnn==9.8.0
+torch==2.7.1
+flash_attn==2.8.0
+sglang==0.4.8
+vllm==0.8.5.post1
+nvidia-cudnn-cu12==9.8.0.87
+transformer_engine==2.3
+megatron.core==core_v0.12.2
+# Preview
+transformer_engine==2.5
+megatron.core==core_r0.13.0
+```
+
+## Target
+
+- Base image:
+  - `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4`: We offer a base image with DeepEP built in, for vLLM
+  - `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4`: We offer a base image with DeepEP built in, for SGLang
+- App image:
+  - `verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2`
+  - `verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.2-te2.2`
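To rebuild these images locally instead of pulling them, a plausible invocation over the Dockerfiles added in this commit is sketched below; only the `-f` path is taken from the tree, and the output tag is illustrative:

```bash
# Build the torch-2.7.1 base image (the SGLang-oriented variant) from this
# commit's Dockerfile; run from the repository root.
docker build \
    -f docker/verl0.5-cu126-torch2.7-fa2.7.4/Dockerfile.base.torch2.7.1 \
    -t verl-local:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4 \
    .
```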
@@ -31,7 +31,7 @@ RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
 RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Install mbridge
 RUN pip3 install --no-cache-dir mbridge
@@ -31,7 +31,7 @@ RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
 RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5

 # Install Megatron-LM
-RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2

 # Install mbridge
 RUN pip3 install --no-cache-dir mbridge
@@ -80,7 +80,7 @@ RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
     ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
     rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb

-RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
+RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.53" accelerate datasets peft hf-transfer \
     "numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
     ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
     pytest py-spy pyext pre-commit ruff
@@ -11,7 +11,7 @@ sglang==0.4.8
 vllm==0.8.5.post1
 nvidia-cudnn-cu12==9.8.0.87
 transformer_engine==2.3
-megatron.core==core_v0.12.1
+megatron.core==core_v0.12.2
 # Preview
 transformer_engine==2.5
 megatron.core==core_r0.13.0
@@ -22,6 +22,6 @@ megatron.core==core_r0.13.0
 - Base image:
   - `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`: We offer a base image with DeepEP built in
 - App image:
-  - `verlai/verl:app-verl0.5-sglang0.4.9-mcore0.12.1`
+  - `verlai/verl:app-verl0.5-sglang0.4.9-mcore0.12.2`
   - `verlai/verl:app-verl0.5-sglang0.4.9-mcore0.13.0-preview`
   - the latest vLLM is temporarily not supported
@@ -24,7 +24,7 @@ and the megatron backend now has a wider list of models supported:

 ### DeepSeek 671b

-The recommended image with pre-built megatron dependency is `whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3`, built with the Dockerfile in [docker/Dockerfile.vllm.sglang.megatron.deepseek](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.vllm.sglang.megatron.deepseek).
+The recommended image with pre-built megatron dependency is `whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.2-te2.3-deepseekv3`, built with the Dockerfile in [docker/Dockerfile.vllm.sglang.megatron.deepseek](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.vllm.sglang.megatron.deepseek).

 For checkpoint loading, we rely on megatron dist-ckpt for resharding. A converted dist-ckpt for DeepSeek-V3 is available from [huggingface BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt](https://huggingface.co/BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt/tree/main).
@@ -19,7 +19,7 @@ Choices of Backend Engines

 We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.

-For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.1 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.1>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
+For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.2 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.2>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.

 2. Inference:
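If you install the Megatron backend into an existing environment rather than using the prebuilt images, the pinned pair from this commit can be reproduced with the same pip commands the Dockerfiles use; a sketch with the flags trimmed to the essentials:

```bash
# Pin the versions this commit settles on: TransformerEngine 2.2.1
# (downgraded so RoPE fusion works with mcore) and Megatron core 0.12.2.
export NVTE_FRAMEWORK=pytorch
pip3 install --no-deps --no-build-isolation \
    "git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1"
pip3 install --no-deps --no-build-isolation \
    "git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2"
```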
@@ -65,10 +65,10 @@ From this version, we divide images built for vLLM and SGLang as the divergence

 There are four types of application images available:

-- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``, with Deep-EP support: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1-deepep``.
-- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`` (need vLLM support, but can have some package conflicts), with Deep-EP support: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1-deepep``.
-- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
-- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
+- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2``, with Deep-EP support: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2-deepep``.
+- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2`` (needs vLLM support, but can have some package conflicts), with Deep-EP support: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2-te2.2-deepep``.
+- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``
+- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``

 The latest vLLM support is coming soon.
@@ -35,8 +35,8 @@ wget -nv https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.po
 if [ $USE_MEGATRON -eq 1 ]; then
     echo "4. install TransformerEngine and Megatron"
     echo "Notice that TransformerEngine installation can take very long time, please be patient"
-    NVTE_FRAMEWORK=pytorch pip3 install --no-deps git+https://github.com/NVIDIA/TransformerEngine.git@v2.2
-    pip3 install --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.0rc3
+    NVTE_FRAMEWORK=pytorch pip3 install --no-deps git+https://github.com/NVIDIA/TransformerEngine.git@v2.2.1
+    pip3 install --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.2
 fi