[docker] refactor: Migrate images to verlai, support the latest flash attention and newer CUDA versions in the future (#2085)

### Checklist Before Starting

- [ ] Searched for similar PR(s).
- [ ] Checked PR Title format
  - In format of: [modules] type: Title
  - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore, test`
  - can involve multiple modules, separated by `,` or space, like `[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

Migrate the Docker images to the verlai/verl repository, upgrade CUDA support to 12.6, and support the latest flash attention.

```txt
docker
├── README.md
├── verl0.4-cu124-torch2.6-fa2.7.4
│   ├── Dockerfile.app.sglang.vllm.mcore0.12
│   ├── Dockerfile.app.sglang.vllm.mcore0.13.preview
│   ├── Dockerfile.app.vllm.mcore0.12
│   ├── Dockerfile.app.vllm.mcore0.13.preview
│   ├── Dockerfile.base
│   └── README.md
├── verl0.5-cu126-torch2.7.1-fa2.8.0
│   ├── Dockerfile.app.sglang.mcore0.12
│   ├── Dockerfile.app.sglang.mcore0.13.preview
│   ├── Dockerfile.base.fi0.2.6
│   └── README.md
└── verl0.5-preview-cu128-torch2.7.1-fa2.8.0
    ├── Dockerfile.app.sglang.megatron
    ├── Dockerfile.base.fi0.2.6
    └── README.md
```

- verlai/verl
  - verl0.4
    - base
    - app.sglang.vllm.mcore
    - app.vllm.mcore
  - verl0.5
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet, for debugging]
  - verl0.5-preview
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet, for debugging]
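
For reference, the hierarchy above maps to Docker Hub tags such as the following (taken from the new `docker/README.md`; check [verlai/verl](https://hub.docker.com/r/verlai/verl) for the full, current list):

```sh
# Stable verl0.4 application images (SGLang+vLLM+Megatron / vLLM+Megatron)
docker pull verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
docker pull verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1

# verl0.5 SGLang application image (CUDA 12.6)
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1

# verl0.5-preview SGLang application image (CUDA 12.8)
docker pull verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1
```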


### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```
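
For instance, a typical workflow with one of the new application images, adapted from the `docker/README.md` added in this PR (the container name and mount path are just examples):

```sh
# Launch one of the new verlai/verl application images and attach to it
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN \
    -v .:/workspace/verl --name verl \
    verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1 sleep infinity
docker start verl
docker exec -it verl bash

# Inside the container, the frameworks are already installed,
# so only verl itself needs to be installed (without dependencies)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
```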

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that cover the code path.
Author: Blue Space
Date: 2025-07-04 14:32:02 +08:00
Committed by: GitHub
Parent: a53fb3089e
Commit: ebb21b7fc7
34 changed files with 1233 additions and 193 deletions

View File

@ -84,7 +84,7 @@ jobs:
NO_PROXY: "localhost,127.0.0.1"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -114,7 +114,7 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
HF_ENDPOINT: "https://hf-mirror.com"
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -94,7 +94,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -98,7 +98,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -84,7 +84,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -210,7 +210,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=50g # Visual dataloader requires large memory
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -219,6 +219,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install -e .[test,gpu,vllm,geo,trl]
pip install "transformers[hf_xet]<4.53.0" # Fix for transformers 4.53.0
# Geo3k
- name: Prepare GEO3K dataset
run: |
@ -266,7 +267,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -305,7 +306,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -338,7 +339,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=50g # Visual dataloader requires large memory
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -395,7 +396,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -0,0 +1,378 @@
# # Tests layout
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...
# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments
# Accelerators for tests
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
# # Workflow layout
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
# - `gpu_unit_tests.yml`, run pytest on all test scripts without the `on_cpu.py` suffix.
# - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded in them when
# - new workflow yaml is added to `.github/workflows`
# - new tests are added to workflow mentioned in 2.
name: e2e_ppo_trainer_megatron_sglang
on:
# Trigger the workflow on push or pull request,
# but only for the main branch.
# For push, for now only anti-patterns are specified so it is more conservative
# and achieves higher coverage.
push:
branches:
- main
- v0.*
paths:
- "**/*.py"
# Other entrypoints
- "!verl/trainer/fsdp_sft_trainer.py"
# Recipes
- "!recipe/**"
# FSDP
- "!verl/workers/**/*dp_*.py"
pull_request:
branches:
- main
- v0.*
paths:
- "**/*.py"
# Other entrypoints
- "!docker/**"
# Docs
- "!**/*.md"
- "!docs/**"
- "!examples/**"
- "!tests/**"
- "!verl/trainer/main_*.py"
- "!verl/trainer/fsdp_sft_trainer.py"
# Recipes
- "!recipe/**"
# FSDP
- "!verl/workers/**/*dp_*.py"
# Entrypoints
- ".github/workflows/e2e_ppo_trainer_megatron.yml"
- "examples/data_preprocess/gsm8k.py"
- "examples/data_preprocess/geo3k.py"
- "tests/special_e2e/run_ppo_trainer_megatron.sh"
- "verl/trainer/main_ppo.py"
- "verl/trainer/config/ppo_megatron_trainer.yaml"
# Cancel jobs on the same ref if a new one is triggered
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
# Declare permissions just read content.
permissions:
contents: read
env:
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1"
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
jobs:
setup:
runs-on: ubuntu-latest
outputs:
runner-label: ${{ steps.create-runner.outputs.runner_label }}
mlp-task-id: ${{ steps.create-runner.outputs.mlp_task_id }}
steps:
- name: create runner
id: create-runner
shell: bash
run: |
if [[ "${{ github.event.repository.full_name }}" != "volcengine/verl" ]]; then
echo "no need create runner, skip"
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/create" \
-d '{"Image": "${{ env.IMAGE }}"}')
runner_label=$(echo $resp | jq -r '.runner_label')
if [[ -z $runner_label || $runner_label == "null" ]]; then
echo "create runner failed"
exit 1
fi
echo "runner_label=$runner_label" >> $GITHUB_OUTPUT
mlp_task_id=$(echo $resp | jq -r '.task_id')
echo "mlp_task_id=$mlp_task_id" >> $GITHUB_OUTPUT
e2e_ppo_trainer_megatron-deepseek:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ENGINE=sglang ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
export VLLM_USE_V1=1
ray start --head
ENGINE=sglang MODE=async RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
run: |
exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Deepseek)
run: |
ray stop --force
ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen3:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) with validation and saving
run: |
ray stop --force
ENGINE=sglang ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
run: |
ray stop --force
ENGINE=sglang LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (Qwen3 Actor and Critic)
run: |
exp_name="qwen3-0.6b-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --tie-word-embedding --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-different-train-infer-tp-qwen-tie-embedding:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with tie-embedding Megatron (Qwen) with train tp > infer tp
run: |
ray stop --force
ENGINE=sglang VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=2 INFER_TP=1 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with train tp < infer tp
run: |
ray stop --force
ENGINE=sglang VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=1 INFER_TP=2 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen-override-transformer-config:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Prepare dist_ckpt of Qwen2.5-0.5B, uneven layer distribution only supports dist_ckpt
run: |
python3 scripts/converter_hf_to_mcore.py --hf_model_path ${HOME}/models/Qwen/Qwen2.5-0.5B --output_path checkpoints/verl-test/qwen2.5-0.5b-megatron
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
run: |
ray stop --force
ENGINE=sglang SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=8 +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=4 actor_rollout_ref.actor.megatron.use_dist_checkpointing=true actor_rollout_ref.actor.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron actor_rollout_ref.ref.megatron.use_dist_checkpointing=true actor_rollout_ref.ref.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron critic.megatron.use_dist_checkpointing=true critic.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron reward_model.megatron.use_dist_checkpointing=true reward_model.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron
cp -r checkpoints checkpoints-dut
ENGINE=sglang SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (Qwen Actor and Critic)
run: |
exp_name="qwen2.5-0.5b-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --tie-word-embedding --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-deepseek-override-transformer-config:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ENGINE=sglang SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=2 COMMON_VPP=null bash tests/special_e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=true +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=true
- name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
run: |
exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-moe-expert-parallel:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
PPO_MAX_TOKEN_LEN=512 FWD_MAX_TOKEN_LEN=512 \
MAX_PROMPT_LENGTH=256 MAX_RESPONSE_LENGTH=256 \
MODEL_ID=Qwen/Qwen1.5-MoE-A2.7B-Chat \
ENGINE=sglang COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=8 \
USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen2_5vl-3b:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare Geo3k dataset
run: |
python3 examples/data_preprocess/geo3k.py
- name: Prepare dist_ckpt of Qwen2.5-VL-3B, only supports dist_ckpt
run: |
python3 scripts/converter_hf_to_mcore.py --hf_model_path ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct --output_path checkpoints/verl-test/qwen2.5-vl-3b-megatron
- name: Running Geo3k E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
run: |
ray stop --force
ENGINE=sglang TRAIN_FILES=${HOME}/data/geo3k/train.parquet VAL_FILES=${HOME}/data/geo3k/test.parquet MAX_PROMPT_LENGTH=1024 MAX_RESPONSE_LENGTH=2048 MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False SKIP_SAVE_HF_MODEL=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 COMMON_TP=2 USE_DIST_CKPT=true DIST_CKPT_PATH=checkpoints/verl-test/qwen2.5-vl-3b-megatron bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
cleanup:
runs-on: ubuntu-latest
needs: [setup,
e2e_ppo_trainer_megatron-deepseek,
e2e_ppo_trainer_megatron-qwen3,
e2e_ppo_trainer_megatron-different-train-infer-tp-qwen-tie-embedding,
e2e_ppo_trainer_megatron-qwen-override-transformer-config,
e2e_ppo_trainer_megatron-deepseek-override-transformer-config,
e2e_ppo_trainer_megatron-moe-expert-parallel,
e2e_ppo_trainer_megatron-qwen2_5vl-3b]
if: always()
steps:
- name: remove runner
run: |
if [[ -z "${{ needs.setup.outputs.mlp-task-id }}" ]]; then
echo "no need remove runner, skip"
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/delete" \
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')

View File

@ -29,8 +29,7 @@
# - new workflow yaml is added to `.github/workflows`
# - new tests are added to workflow mentioned in 2.
name: e2e_ppo_trainer_megatron
# latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0
name: e2e_ppo_trainer_megatron_vllm
on:
# Trigger the workflow on push or pull request,
@ -86,7 +85,7 @@ permissions:
contents: read
env:
IMAGE: "verl-ci-cn-beijing.cr.volces.com/whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3"
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
jobs:
@ -311,6 +310,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
pip3 install mbridge
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
@ -343,6 +343,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
pip3 install "transformers[hf_xet]<4.52.0"
- name: Prepare Geo3k dataset
run: |
python3 examples/data_preprocess/geo3k.py
@ -376,4 +377,4 @@ jobs:
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/delete" \
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')

View File

@ -80,7 +80,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -68,7 +68,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.5.post3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -66,7 +66,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -80,7 +80,7 @@ jobs:
NO_PROXY: "localhost,127.0.0.1"
HF_HUB_ENABLE_HF_TRANSFER: 1
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -71,7 +71,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -129,7 +129,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -88,7 +88,7 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER: 1
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -1,4 +1,6 @@
FROM whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
# Base Image support aws EFA
# Build Image with frameworks based on this
FROM verlai/verl:app-verl0.5-sglang0.4.6.post5-mcore0.12.1
# For aws instances with EFA net interface (Sagemaker AI Pod)
# install EFA driver:
@ -48,6 +50,6 @@ ENV OMPI_MCA_pml=^cm,ucx \
NCCL_SOCKET_IFNAME=^docker,lo,veth_def_agent \
FI_EFA_USE_HUGE_PAGE=0
# docker build -t whatcanyousee/verl:awsefa --label "commit=$(git rev-parse --short HEAD)" .
# docker build -t verl:awsefa --label "commit=$(git rev-parse --short HEAD)" .
# on aws:
# docker run --ipc=host --privileged --name verldev --gpus all --network=host --shm-size=1800gb -itd whatcanyousee/verl:awsefa
# docker run --ipc=host --privileged --name verldev --gpus all --network=host --shm-size=1800gb -itd verl:awsefa

docker/README.md (new file, 79 lines)
View File

@ -0,0 +1,79 @@
# Dockerfiles of verl
We provide pre-built Docker images for quick setup. Starting from this version, we use a new image release hierarchy for productivity and stability.
The images are divided into three categories:
- **Base Image**: Only basic dependencies are installed, without inference or training frameworks. vLLM or SGLang can be installed directly on top of it, with no need to reinstall torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Preview Image**: Unstable version with the latest frameworks and features.
The first two categories are hosted in the Docker Hub [verlai/verl](https://hub.docker.com/r/verlai/verl) repository, while the preview images are hosted in community repositories.
> Image versions map to verl releases; for example, an image with tag ``verl0.4`` is built for the verl release ``v0.4.x``.
## Base Image
The stable base image is ``verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4-te2.3``. The installed package versions can be read from the tag, and the Dockerfile can be found in ``verl[version]-[packages]/Dockerfile.base``.
The base images for preview are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`` and ``verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0`` with different CUDA versions.
The base image is updated infrequently, and application images can be built on top of it without reinstalling the base packages.
## Application Image
From this version on, vLLM and SGLang images are built separately because of diverging dependencies such as FlashInfer.
There are four types of application images available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
For Megatron 0.13.0, we offer preview images; to use the latest code, just replace ``mcore0.12.1`` with ``mcore0.13.0-preview`` in the image tags above.
The latest vLLM support is coming soon.
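For example, switching the verl0.5 SGLang image to the Megatron preview build (both tags are listed in the verl0.5 README of this PR):

```sh
# Stable Megatron 0.12.1 tag
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1
# Megatron 0.13.0 preview tag
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.13.0-preview
```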
Docker images with the Megatron backend support post-training of large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the Large Language Model Post-Training documentation (``docs/perf/dpsk``) for more details.
Application images are updated frequently, and the Dockerfiles can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Starting from the base image, it is easy to build your own application image with the desired inference and training frameworks.
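For example, a minimal sketch of building a custom application image from one of the Dockerfiles added in this PR (the output tag `my-verl:app` is only an illustration):

```sh
# Build an SGLang+vLLM+Megatron application image on top of the published base image
git clone https://github.com/volcengine/verl && cd verl
docker build \
    -f docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.sglang.vllm.mcore0.12 \
    -t my-verl:app .
```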
## Community Image
For vLLM with FSDP, please refer to the [hiyouga/verl](https://hub.docker.com/r/hiyouga/verl) repository; the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For SGLang with FSDP, please refer to the [ocss884/verl-sglang](https://hub.docker.com/r/ocss884/verl-sglang) repository; the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5``, which is provided by the SGLang RL Group.
See the files under ``docker/`` for the NGC-based images or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod), you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
## Installation from Docker
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the desired Docker image and attach into it:
```sh
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
```
2. If you use the images provided, you only need to install verl itself without dependencies:
```sh
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
```
[Optional] If you want to switch between different frameworks, you can install verl with the following commands:
```sh
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
```

View File

@ -0,0 +1,41 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.6.post5 and torch-memory-saver
RUN pip install --resume-retries 999 "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
# Some sglang operations in 0.4.6.post5 require vllm
# [Warning] vllm can have some packages not compatible with sglang, for example, flashinfer
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,41 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.6.post5 and torch-memory-saver
RUN pip install --resume-retries 999 "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
# Some sglang operations in 0.4.6.post5 require vllm
# [Warning] vllm can have some packages not compatible with sglang, for example, flashinfer
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,47 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Install flashinfer-0.2.2.post1+cu126 (cxx11abi=True)
# vllm-0.8.3 does not support flashinfer>=0.2.3
# see https://github.com/vllm-project/vllm/pull/15777
RUN aria2c --max-tries=9999 https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
pip install --no-cache-dir flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
rm flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,44 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Install flashinfer-0.2.2.post1+cu126 (cxx11abi=True)
# vllm-0.8.3 does not support flashinfer>=0.2.3
# see https://github.com/vllm-project/vllm/pull/15777
RUN aria2c --max-tries=9999 https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
pip install --no-cache-dir flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
rm flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -1,9 +1,11 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-v2-cu124-cudnn9.8-torch2.6-fa2.8.0-te2.3
# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:24.08-py3
# Define environments
ENV MAX_JOBS=32
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
@ -56,18 +58,11 @@ RUN aria2c --always-resume=true --max-tries=99999 https://developer.download.nvi
update-alternatives --set cuda /usr/local/cuda-12.4 && \
rm -rf /usr/local/cuda-12.6
# Install torch-2.6.0+cu124 + vllm-0.8.5.post1 + sglang-0.4.6.post5
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
# Install sglang-0.4.6.post1 and torch-memory-saver
RUN pip install "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
RUN pip install --resume-retries 999 --no-cache-dir torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
RUN pip install --no-cache-dir "vllm==0.8.5.post1" "torch==2.6.0" "torchvision==0.21.0" "torchaudio==2.6.0" "tensordict==0.6.2" torchdata
RUN pip install --no-cache-dir "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=15.0.0" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile \
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
# Install flash-attn-2.7.4.post1 (cxx11abi=False)
@ -86,39 +81,33 @@ RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
RUN pip install --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install Apex
RUN git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.0
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0 && \
dpkg -i ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
# Fix opencv
RUN pip install opencv-python
RUN pip install --resume-retries 999 --no-cache-dir opencv-python
RUN pip install opencv-fixer && \
RUN pip install --resume-retries 999 --no-cache-dir opencv-fixer && \
python -c "from opencv_fixer import AutoFix; AutoFix()"
# Install verl
RUN pip install --resume-retries 999 --no-cache-dir cuda-bindings
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url
RUN apt-get update && \
apt-get install -y aria2 libfreeimage3 libfreeimage-dev zlib1g
RUN apt-get update && \
apt-get install -y libfreeimage3 libfreeimage-dev zlib1g htop
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0 && \
dpkg -i ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb

View File

@ -0,0 +1,29 @@
# verl image with verl v0.4.x
## Important packages version
```txt
cuda==12.4
cudnn==9.8.0
torch==2.6.0
flash_attn==2.7.4
sglang==0.4.6.post5
vllm==0.8.5.post1
nvidia-cudnn-cu12==9.8.0.87
transformer_engine==2.3
megatron.core==core_v0.12.1
# Preview
transformer_engine==2.5
megatron.core==core_r0.13.0
```
## Target
- Base image:
- `verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4`
- App image:
  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`: SGLang 0.4.6.post5 requires vLLM, which can introduce some package conflicts with SGLang
- `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1`
- Preview image:
- `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.13.0-preview`
- `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview`

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,91 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:24.08-py3
# Define environments
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Define installation arguments
ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Set apt source
RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
{ \
echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
} > /etc/apt/sources.list
# Install systemctl
RUN apt-get update && \
apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
apt-get clean
# Install tini
RUN apt-get update && \
apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
apt-get clean
# Change pip source
RUN pip config set global.index-url "${PIP_INDEX}" && \
pip config set global.extra-index-url "${PIP_INDEX}" && \
python -m pip install --upgrade pip
# Uninstall nv-pytorch fork
RUN pip uninstall -y torch torchvision torchaudio \
pytorch-quantization pytorch-triton torch-tensorrt \
xgboost transformer_engine flash_attn apex megatron-core grpcio
RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
# Install flash-attn-2.8.0.post2 (cxx11abi=True)
RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
FILE="flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
wget -nv "${URL}" && \
pip install --no-cache-dir "${FILE}"
# Fix packages
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
# Install cudnn
RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && \
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
# Install Apex
RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0
RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url

View File

@ -0,0 +1,27 @@
# verl image with verl v0.5
## Important packages version
```txt
cuda==12.6
cudnn==9.8.0
torch==2.7.1
flash_attn==2.8.0
sglang==0.4.8
vllm==0.8.5.post1
nvidia-cudnn-cu12==9.8.0.87
transformer_engine==2.3
megatron.core==core_v0.12.1
# Preview
transformer_engine==2.5
megatron.core==core_r0.13.0
```
## Target
- Base image:
  - `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`: We offer a base image with FlashInfer 0.2.6.post1 built in
- App image:
- `verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1`
- `verlai/verl:app-verl0.5-sglang0.4.8-mcore0.13.0-preview`
  - vLLM is temporarily not supported in this version

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,91 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Start from the NVIDIA official image (cuda-12.8 + python-3.12)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:25.02-py3
# Define environments
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Define installation arguments
ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Set apt source
RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
{ \
echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
} > /etc/apt/sources.list
# Install systemctl
RUN apt-get update && \
apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
apt-get clean
# Install tini
RUN apt-get update && \
apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
apt-get clean
# Change pip source
RUN pip config set global.index-url "${PIP_INDEX}" && \
pip config set global.extra-index-url "${PIP_INDEX}" && \
python -m pip install --upgrade pip
# Uninstall nv-pytorch fork
RUN pip uninstall -y torch torchvision torchaudio \
pytorch-quantization pytorch-triton torch-tensorrt \
xgboost transformer_engine flash_attn apex megatron-core grpcio
RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
# Install flash-attn-2.8.0.post2 (cxx11abi flag detected from the installed torch)
RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp312-cp312-linux_x86_64.whl" && \
FILE="flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp312-cp312-linux_x86_64.whl" && \
wget -nv "${URL}" && \
pip install --no-cache-dir "${FILE}"
# Fix packages
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
# Install cudnn
RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && \
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
# Install Apex
RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0
RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pre-commit ruff
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url

View File

@ -0,0 +1,26 @@
# verl image with verl v0.5
## Important package versions
```txt
cuda==12.8
cudnn==9.8.0
torch==2.7.1
flash_attn==2.8.0
sglang==0.4.8
transformer_engine==2.5
megatron.core==core_r0.13.0
nvidia-cudnn-cu12==9.8.0.87
```
## Target
- Base image:
- `verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0`: we offer a base image with FlashInfer 0.2.6.post1 built in
- App image:
- `verlai/verl:app-verl0.5-preview-sglang0.4.8-mcore0.13.0-preview`
- vLLM: the latest version is temporarily not supported
## !!!Notice!!!
- pyext lacks maintenance and does not work with Python 3.12; consider finding a replacement and deprecating this package.
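## Version check
To confirm that a running container matches the versions listed above, a quick check like the following can help (a minimal sketch; exact distribution names in the `pip list` output may differ slightly):
```bash
# report torch and its CUDA build, then grep the other key packages
python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"
pip list 2>/dev/null | grep -Ei "torch|flash|sglang|transformer|megatron|cudnn"
```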

View File

@ -19,7 +19,7 @@ Choices of Backend Engines
We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.11 <https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.1 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.1>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
2. Inference:
@ -33,58 +33,94 @@ For huggingface TGI integration, it is usually used for debugging and single GPU
Install from docker image
-------------------------
We provide pre-built Docker images for quick setup.
We provide pre-built Docker images for quick setup. Starting from this version,
we use a new image release hierarchy for better productivity and stability.
For vLLM with Megatron or FSDP, please use the stable version of image ``whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3``, which supports DeepSeek-V3 671B post-training.
The images are divided into three categories:
For latest vLLM with FSDP, please refer to ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
- **Base Image**: Only basic dependencies are installed, without inference or training frameworks.
You can install vLLM or SGLang directly on top of it without reinstalling torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Community Image**: Unstable version with the latest frameworks and features.
For SGLang with FSDP, please use ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5`` which is provided by SGLang RL Group.
The first two types of images are hosted in the Docker Hub `verlai/verl <https://hub.docker.com/r/verlai/verl>`_ repository, while community images are hosted in community repositories.
.. note::
Image versions map to verl releases; for example, an image tagged ``verl0.4`` is built for verl release ``v0.4.x``.
Base Image
::::::::::
The stable base image is ``verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4-te2.3``. The installed package versions can be read from the tag, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.base``.
The preview base images are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`` and ``verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0``, which differ in CUDA version.
The base image is updated infrequently, and app images can be built on top of it without reinstalling the base packages.
Application Image
:::::::::::::::::
From this version on, we provide separate images for vLLM and SGLang because their dependencies (such as FlashInfer) have diverged.
There are four types of application images available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`` (vLLM is still required and therefore also installed, but some package conflicts may occur)
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
The latest vLLM support is coming soon.
Docker images with the Megatron backend can run post-training for large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the :doc:`Large Language Model Post-Training documentation<../perf/dpsk>` for more details.
Application images may be updated frequently, and their Dockerfiles can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Starting from the base image, it is easy to build your own application image with the desired inference and training frameworks.
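As a rough sketch of how such a custom application image can be assembled on top of a base image (the file name, target tag, and exact package pin below are illustrative, not an officially published recipe):
.. code:: bash

# write a minimal Dockerfile that starts from the published base image
cat > Dockerfile.custom <<'EOF'
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0
# torch, CUDA, cuDNN and flash-attn are already in place; add only the inference engine
RUN pip install --no-cache-dir "sglang[all]==0.4.8"
EOF

# build and tag the custom application image
docker build -f Dockerfile.custom -t my-org/verl:custom-sglang .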
Community Image
:::::::::::::::
Community images are provided by the community. They track the latest versions of vLLM and SGLang, may include experimental features or configurations, and also cover other hardware and platforms such as AMD GPUs with ROCm, AWS EFA, and SageMaker.
For the latest vLLM with FSDP, please refer to the `hiyouga/verl <https://hub.docker.com/r/hiyouga/verl>`_ repository; the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For the latest SGLang with FSDP, please refer to the `ocss884/verl-sglang <https://hub.docker.com/r/ocss884/verl-sglang>`_ repository; the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5``, which is provided by the SGLang RL Group.
See the files under ``docker/`` for NGC-based images or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
Installation from Docker
::::::::::::::::::::::::
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the desired Docker image and attach to it:
.. code:: bash
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag>
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
2. Inside the container, install latest verl:
2. If you use the provided images, you only need to install verl itself, without its dependencies:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
# pick your choice of inference engine: vllm or sglang
# pip3 install -e .[vllm]
# pip3 install -e .[sglang]
# or install from pypi instead of git via:
# pip3 install verl[vllm]
# pip3 install verl[sglang]
pip3 install --no-deps -e .
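As an optional, quick sanity check that the editable install is picked up:
.. code:: bash

pip show verl
python -c "import verl; print(verl.__file__)"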
.. note::
[Optional] If you want to switch between different frameworks, you can install verl with the following command:
The Docker image ``whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3`` is built with the following configurations:
.. code:: bash
- **PyTorch**: 2.6.0+cu124
- **CUDA**: 12.4
- **cuDNN**: 9.8.0
- **nvidia-cudnn-cu12**: 9.8.0.87, **important for the usage of Megatron FusedAttention with MLA Support**
- **Flash Attenttion**: 2.7.4.post1
- **Flash Infer**: 0.2.5
- **vLLM**: 0.8.5
- **SGLang**: 0.4.6.post5
- **Megatron-LM**: core_v0.12.1
- **TransformerEngine**: 2.3
- **Ray**: 2.44.1
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
.. note::
For AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.awsefa``.
Install from custom environment
---------------------------------------------
@ -99,6 +135,11 @@ For training and inference engines to utilize better and faster hardware support
and some of the dependencies are easily overridden when installing other packages,
so we put them in the :ref:`Post-installation` step.
.. note::
The installation steps below are the recommended configuration for the latest version of verl.
If you are customizing your own environment, you can ignore the strict version constraints.
We need to install the following pre-requisites:
- **CUDA**: Version >= 12.4

View File

@ -4,7 +4,7 @@ codetiming
datasets
dill
hydra-core
numpy
numpy<2.0.0
pandas
peft
pyarrow>=15.0.0

View File

@ -6,7 +6,7 @@ dill
flash-attn
hydra-core
liger-kernel
numpy
numpy<2.0.0
pandas
peft
pyarrow>=19.0.0

View File

@ -5,7 +5,7 @@ datasets
dill
flash-attn
hydra-core
numpy
numpy<2.0.0
pandas
peft
pyarrow>=19.0.0

View File

@ -29,7 +29,7 @@ install_requires = [
"datasets",
"dill",
"hydra-core",
"numpy",
"numpy<2.0.0",
"pandas",
"peft",
"pyarrow>=19.0.0",

View File

@ -123,110 +123,112 @@ if [ "$USE_DIST_CKPT" = "True" ]; then
--output_path "${DIST_CKPT_PATH}"
fi
ENGINES=("vllm" "sglang")
ENGINE=${ENGINE:-"vllm"}
exp_name="$(basename "${MODEL_ID,,}")-megatron-gsm8k-minimal"
for ENGINE in "${ENGINES[@]}"; do
if [ "$ENGINE" = "vllm" ]; then
MODE=${MODE:-"sync"}
ROLLOUT_MODE_ARG="actor_rollout_ref.rollout.mode=${MODE}"
if [ "$MODE" = "async" ]; then
ROLLOUT_MODE_ARG="${ROLLOUT_MODE_ARG} data.return_raw_chat=True"
fi
else
ROLLOUT_MODE_ARG=""
if [ "$ENGINE" = "vllm" ]; then
MODE=${MODE:-"sync"}
ROLLOUT_MODE_ARG="actor_rollout_ref.rollout.mode=${MODE}"
if [ "$MODE" = "async" ]; then
ROLLOUT_MODE_ARG="${ROLLOUT_MODE_ARG} data.return_raw_chat=True"
fi
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator="${ADV_ESTIMATOR}" \
data.train_files="${TRAIN_FILES}" \
data.val_files="${VAL_FILES}" \
data.train_batch_size=${train_prompt_bsz} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.actor.use_dynamic_bsz=${USE_DYNAMIC_BSZ} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$ACTOR_PP \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=$ACTOR_VPP \
actor_rollout_ref.actor.megatron.context_parallel_size=$ACTOR_CP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$ACTOR_TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$ACTOR_EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ACTOR_ETP \
actor_rollout_ref.actor.megatron.param_offload=${ACTOR_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.optimizer_offload=${ACTOR_OPTIMIZER_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${ACTOR_GRAD_OFFLOAD} \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
actor_rollout_ref.rollout.name="${ENGINE}" ${ROLLOUT_MODE_ARG}\
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$REF_PP \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=$REF_VPP \
actor_rollout_ref.ref.megatron.context_parallel_size=$REF_CP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$REF_TP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$REF_EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$REF_ETP \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.param_offload=${REF_PARAM_OFFLOAD} \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.ref.megatron.use_mbridge=${USE_MBRIDGE} \
critic.optim.lr=2e-5 \
critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
critic.model.path="${MODEL_PATH}" \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
critic.megatron.pipeline_model_parallel_size=$CRITIC_PP \
critic.megatron.virtual_pipeline_model_parallel_size=$CRITIC_VPP \
critic.megatron.context_parallel_size=$CRITIC_CP \
critic.megatron.tensor_model_parallel_size=$CRITIC_TP \
critic.megatron.expert_model_parallel_size=$CRITIC_EP \
critic.megatron.expert_tensor_parallel_size=$CRITIC_ETP \
critic.megatron.param_offload=${CRITIC_PARAM_OFFLOAD} \
critic.megatron.optimizer_offload=${CRITIC_OPTIMIZER_OFFLOAD} \
critic.megatron.grad_offload=${CRITIC_GRAD_OFFLOAD} \
critic.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
critic.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
reward_model.enable=True \
reward_model.model.path="${MODEL_PATH}" \
reward_model.micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
reward_model.megatron.pipeline_model_parallel_size=$RM_PP \
reward_model.megatron.virtual_pipeline_model_parallel_size=$RM_VPP \
reward_model.megatron.context_parallel_size=$RM_CP \
reward_model.megatron.tensor_model_parallel_size=$RM_TP \
reward_model.megatron.expert_model_parallel_size=$RM_EP \
reward_model.megatron.expert_tensor_parallel_size=$RM_ETP \
reward_model.megatron.param_offload=${RM_PARAM_OFFLOAD} \
reward_model.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
reward_model.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
algorithm.use_kl_in_reward=False \
algorithm.kl_penalty=kl \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl-test' \
trainer.experiment_name="${exp_name}" \
trainer.nnodes=1 \
trainer.n_gpus_per_node=${NUM_GPUS} \
trainer.val_before_train="${VAL_BEFORE_TRAIN}" \
trainer.test_freq="${TEST_FREQ}" \
trainer.save_freq="${SAVE_FREQ}" \
trainer.resume_mode="${RESUME_MODE}" \
trainer.total_epochs=2 \
trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" $@
done
else
ROLLOUT_MODE_ARG=""
fi
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator="${ADV_ESTIMATOR}" \
data.train_files="${TRAIN_FILES}" \
data.val_files="${VAL_FILES}" \
data.train_batch_size=${train_prompt_bsz} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.actor.use_dynamic_bsz=${USE_DYNAMIC_BSZ} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} \
actor_rollout_ref.actor.megatron.use_mbridge=${USE_MBRIDGE} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$ACTOR_PP \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=$ACTOR_VPP \
actor_rollout_ref.actor.megatron.context_parallel_size=$ACTOR_CP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$ACTOR_TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$ACTOR_EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ACTOR_ETP \
actor_rollout_ref.actor.megatron.param_offload=${ACTOR_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.optimizer_offload=${ACTOR_OPTIMIZER_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${ACTOR_GRAD_OFFLOAD} \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
actor_rollout_ref.rollout.name="${ENGINE}" ${ROLLOUT_MODE_ARG}\
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.use_mbridge=${USE_MBRIDGE} \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$REF_PP \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=$REF_VPP \
actor_rollout_ref.ref.megatron.context_parallel_size=$REF_CP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$REF_TP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$REF_EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$REF_ETP \
actor_rollout_ref.ref.megatron.param_offload=${REF_PARAM_OFFLOAD} \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.optim.lr=2e-5 \
critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
critic.model.path="${MODEL_PATH}" \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
critic.megatron.use_mbridge=${USE_MBRIDGE} \
critic.megatron.pipeline_model_parallel_size=$CRITIC_PP \
critic.megatron.virtual_pipeline_model_parallel_size=$CRITIC_VPP \
critic.megatron.context_parallel_size=$CRITIC_CP \
critic.megatron.tensor_model_parallel_size=$CRITIC_TP \
critic.megatron.expert_model_parallel_size=$CRITIC_EP \
critic.megatron.expert_tensor_parallel_size=$CRITIC_ETP \
critic.megatron.param_offload=${CRITIC_PARAM_OFFLOAD} \
critic.megatron.optimizer_offload=${CRITIC_OPTIMIZER_OFFLOAD} \
critic.megatron.grad_offload=${CRITIC_GRAD_OFFLOAD} \
critic.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
critic.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
reward_model.enable=True \
reward_model.model.path="${MODEL_PATH}" \
reward_model.micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
reward_model.megatron.use_mbridge=${USE_MBRIDGE} \
reward_model.megatron.pipeline_model_parallel_size=$RM_PP \
reward_model.megatron.virtual_pipeline_model_parallel_size=$RM_VPP \
reward_model.megatron.context_parallel_size=$RM_CP \
reward_model.megatron.tensor_model_parallel_size=$RM_TP \
reward_model.megatron.expert_model_parallel_size=$RM_EP \
reward_model.megatron.expert_tensor_parallel_size=$RM_ETP \
reward_model.megatron.param_offload=${RM_PARAM_OFFLOAD} \
reward_model.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
reward_model.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
algorithm.use_kl_in_reward=False \
algorithm.kl_penalty=kl \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl-test' \
trainer.experiment_name="${exp_name}" \
trainer.nnodes=1 \
trainer.n_gpus_per_node=${NUM_GPUS} \
trainer.val_before_train="${VAL_BEFORE_TRAIN}" \
trainer.test_freq="${TEST_FREQ}" \
trainer.save_freq="${SAVE_FREQ}" \
trainer.resume_mode="${RESUME_MODE}" \
trainer.total_epochs=2 \
trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" $@

View File

@ -240,6 +240,7 @@ def apply_monkey_patch(
if is_transformers_version_in_range(min_version="4.53.0"):
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLAttention
# TODO: Support transformers 4.53
raise ValueError("Transformers 4.53 is not supported")
else:
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
@ -266,6 +267,7 @@ def apply_monkey_patch(
if is_transformers_version_in_range(min_version="4.53.0"):
from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLAttention
# TODO: Support transformers 4.53
raise ValueError("Transformers 4.53 is not supported")
else:
from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLFlashAttention2 as Qwen2VLAttention