[docker] refactor: Migrate images to verlai, support the latest flash attention and newer CUDA versions in the future (#2085)

### Checklist Before Starting

- [ ] Searched for similar PR(s).
- [ ] Checked PR Title format
  - In format of: [modules] type: Title
  - modules are in `fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data`
  - type is in `feat, fix, refactor, chore, test`
  - can involve multiple modules, separated by `,` or space, like `[megatron, fsdp, doc] feat: xxx`

### What does this PR do?

Migrate the Docker images to the verlai/verl repository, upgrade CUDA support to 12.6, and support the latest flash attention.

```txt
docker
├── README.md
├── verl0.4-cu124-torch2.6-fa2.7.4
│   ├── Dockerfile.app.sglang.vllm.mcore0.12
│   ├── Dockerfile.app.sglang.vllm.mcore0.13.preview
│   ├── Dockerfile.app.vllm.mcore0.12
│   ├── Dockerfile.app.vllm.mcore0.13.preview
│   ├── Dockerfile.base
│   └── README.md
├── verl0.5-cu126-torch2.7.1-fa2.8.0
│   ├── Dockerfile.app.sglang.mcore0.12
│   ├── Dockerfile.app.sglang.mcore0.13.preview
│   ├── Dockerfile.base.fi0.2.6
│   └── README.md
└── verl0.5-preview-cu128-torch2.7.1-fa2.8.0
    ├── Dockerfile.app.sglang.megatron
    ├── Dockerfile.base.fi0.2.6
    └── README.md
```

- verlai/verl
  - verl0.4
    - base
    - app.sglang.vllm.mcore
    - app.vllm.mcore
  - verl0.5
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet, for debugging]
  - verl0.5-preview
    - base
    - app.sglang.mcore
    - app.vllm.mcore [may not be supported yet, for debugging]
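
For reference, the hierarchy above maps to Docker Hub tags such as the following (taken from the new `docker/README.md`; check [verlai/verl](https://hub.docker.com/r/verlai/verl) for the full, current list):

```sh
# Stable verl0.4 application images (SGLang+vLLM+Megatron / vLLM+Megatron)
docker pull verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
docker pull verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1

# verl0.5 SGLang application image (CUDA 12.6)
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1

# verl0.5-preview SGLang application image (CUDA 12.8)
docker pull verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1
```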


### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```
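
For instance, a typical workflow with one of the new application images, adapted from the `docker/README.md` added in this PR (the container name and mount path are just examples):

```sh
# Launch one of the new verlai/verl application images and attach to it
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN \
    -v .:/workspace/verl --name verl \
    verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1 sleep infinity
docker start verl
docker exec -it verl bash

# Inside the container, the frameworks are already installed,
# so only verl itself needs to be installed (without dependencies)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
```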

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title `description` if it breaks any
API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that cover the code path.
Author: Blue Space
Date: 2025-07-04 14:32:02 +08:00
Committed by: GitHub
Parent: a53fb3089e
Commit: ebb21b7fc7
34 changed files with 1233 additions and 193 deletions

View File

@ -84,7 +84,7 @@ jobs:
NO_PROXY: "localhost,127.0.0.1"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -114,7 +114,7 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
HF_ENDPOINT: "https://hf-mirror.com"
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -94,7 +94,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -98,7 +98,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -84,7 +84,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -210,7 +210,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0
image: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=50g # Visual dataloader requires large memory
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -219,6 +219,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install -e .[test,gpu,vllm,geo,trl]
pip install "transformers[hf_xet]<4.53.0" # Fix for transformers 4.53.0
# Geo3k
- name: Prepare GEO3K dataset
run: |
@ -266,7 +267,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -305,7 +306,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -338,7 +339,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=50g # Visual dataloader requires large memory
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -395,7 +396,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -0,0 +1,378 @@
# # Tests layout
# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...
# There are a few folders with `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments
# Accelerators for tests
# - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.
# # Workflow layout
# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
# - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
# - `gpu_unit_tests.yml`, run pytest on all test scripts without the `on_cpu.py` suffix.
# - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded in them when
# - new workflow yaml is added to `.github/workflows`
# - new tests are added to workflow mentioned in 2.
name: e2e_ppo_trainer_megatron_sglang
on:
# Trigger the workflow on push or pull request,
# but only for the main branch.
# For push, for now only anti-patterns are specified so it is more conservative
# and achieves higher coverage.
push:
branches:
- main
- v0.*
paths:
- "**/*.py"
# Other entrypoints
- "!verl/trainer/fsdp_sft_trainer.py"
# Recipes
- "!recipe/**"
# FSDP
- "!verl/workers/**/*dp_*.py"
pull_request:
branches:
- main
- v0.*
paths:
- "**/*.py"
# Other entrypoints
- "!docker/**"
# Docs
- "!**/*.md"
- "!docs/**"
- "!examples/**"
- "!tests/**"
- "!verl/trainer/main_*.py"
- "!verl/trainer/fsdp_sft_trainer.py"
# Recipes
- "!recipe/**"
# FSDP
- "!verl/workers/**/*dp_*.py"
# Entrypoints
- ".github/workflows/e2e_ppo_trainer_megatron.yml"
- "examples/data_preprocess/gsm8k.py"
- "examples/data_preprocess/geo3k.py"
- "tests/special_e2e/run_ppo_trainer_megatron.sh"
- "verl/trainer/main_ppo.py"
- "verl/trainer/config/ppo_megatron_trainer.yaml"
# Cancel jobs on the same ref if a new one is triggered
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
# Declare permissions just read content.
permissions:
contents: read
env:
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1"
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
jobs:
setup:
runs-on: ubuntu-latest
outputs:
runner-label: ${{ steps.create-runner.outputs.runner_label }}
mlp-task-id: ${{ steps.create-runner.outputs.mlp_task_id }}
steps:
- name: create runner
id: create-runner
shell: bash
run: |
if [[ "${{ github.event.repository.full_name }}" != "volcengine/verl" ]]; then
echo "no need create runner, skip"
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/create" \
-d '{"Image": "${{ env.IMAGE }}"}')
runner_label=$(echo $resp | jq -r '.runner_label')
if [[ -z $runner_label || $runner_label == "null" ]]; then
echo "create runner failed"
exit 1
fi
echo "runner_label=$runner_label" >> $GITHUB_OUTPUT
mlp_task_id=$(echo $resp | jq -r '.task_id')
echo "mlp_task_id=$mlp_task_id" >> $GITHUB_OUTPUT
e2e_ppo_trainer_megatron-deepseek:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ENGINE=sglang ALL_OFFLOAD=True SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
export VLLM_USE_V1=1
ray start --head
ENGINE=sglang MODE=async RESUME_MODE=auto MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct TOTAL_TRAIN_STEPS=2 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
run: |
exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: Running GRPO GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Deepseek)
run: |
ray stop --force
ENGINE=sglang ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen3:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) with validation and saving
run: |
ray stop --force
ENGINE=sglang ALL_OFFLOAD=True VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen3) testing learning rate scheduler
run: |
ray stop --force
ENGINE=sglang LR_WARMUP_STEPS=1 TOTAL_TRAIN_STEPS=2 MODEL_ID=Qwen/Qwen3-0.6B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (Qwen3 Actor and Critic)
run: |
exp_name="qwen3-0.6b-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --tie-word-embedding --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-different-train-infer-tp-qwen-tie-embedding:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with tie-embedding Megatron (Qwen) with train tp > infer tp
run: |
ray stop --force
ENGINE=sglang VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=2 INFER_TP=1 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen) with train tp < infer tp
run: |
ray stop --force
ENGINE=sglang VAL_BEFORE_TRAIN=True TEST_FREQ=1 SAVE_FREQ=1 TRAIN_TP=1 INFER_TP=2 MODEL_ID=Qwen/Qwen2.5-1.5B bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen-override-transformer-config:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Prepare dist_ckpt of Qwen2.5-0.5B, uneven layer distribution only supports dist_ckpt
run: |
python3 scripts/converter_hf_to_mcore.py --hf_model_path ${HOME}/models/Qwen/Qwen2.5-0.5B --output_path checkpoints/verl-test/qwen2.5-0.5b-megatron
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
run: |
ray stop --force
ENGINE=sglang SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=8 +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=4 actor_rollout_ref.actor.megatron.use_dist_checkpointing=true actor_rollout_ref.actor.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron actor_rollout_ref.ref.megatron.use_dist_checkpointing=true actor_rollout_ref.ref.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron critic.megatron.use_dist_checkpointing=true critic.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron reward_model.megatron.use_dist_checkpointing=true reward_model.megatron.dist_checkpointing_path=checkpoints/verl-test/qwen2.5-0.5b-megatron
cp -r checkpoints checkpoints-dut
ENGINE=sglang SAVE_FREQ=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: Test Megatron checkpoints merging function (Qwen Actor and Critic)
run: |
exp_name="qwen2.5-0.5b-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --tie-word-embedding --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints-dut/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-deepseek-override-transformer-config:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ENGINE=sglang SAVE_FREQ=1 MODEL_ID=deepseek-ai/deepseek-coder-1.3b-instruct COMMON_PP=2 COMMON_VPP=null bash tests/special_e2e/run_ppo_trainer_megatron.sh +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=true +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=true
- name: Test Megatron checkpoints merging function (DeepSeek Actor and Critic)
run: |
exp_name="deepseek-coder-1.3b-instruct-megatron-gsm8k-minimal"
python -m verl.model_merger test --backend megatron --local_dir checkpoints/verl-test/${exp_name}/global_step_1/actor --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/actor/huggingface
python -m verl.model_merger test --backend megatron --is-value-model --local_dir checkpoints/verl-test/${exp_name}/global_step_1/critic --test_hf_dir checkpoints/verl-test/${exp_name}/global_step_1/critic/huggingface
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-moe-expert-parallel:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
- name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (DeepSeek)
run: |
ray stop --force
ADV_ESTIMATOR=grpo USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen2moe_minimal.json \
PPO_MAX_TOKEN_LEN=512 FWD_MAX_TOKEN_LEN=512 \
MAX_PROMPT_LENGTH=256 MAX_RESPONSE_LENGTH=256 \
MODEL_ID=Qwen/Qwen1.5-MoE-A2.7B-Chat \
ENGINE=sglang COMMON_PP=2 COMMON_VPP=null COMMON_CP=1 COMMON_TP=4 COMMON_EP=4 COMMON_ETP=1 INFER_TP=8 \
USE_DIST_CKPT=True ALL_OFFLOAD=True SKIP_SAVE_HF_MODEL=1 bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
e2e_ppo_trainer_megatron-qwen2_5vl-3b:
needs: setup
runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
timeout-minutes: 60 # Increase this timeout value as needed
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
- name: Prepare Geo3k dataset
run: |
python3 examples/data_preprocess/geo3k.py
- name: Prepare dist_ckpt of Qwen2.5-VL-3B, only supports dist_ckpt
run: |
python3 scripts/converter_hf_to_mcore.py --hf_model_path ${HOME}/models/Qwen/Qwen2.5-VL-3B-Instruct --output_path checkpoints/verl-test/qwen2.5-vl-3b-megatron
- name: Running Geo3k E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
run: |
ray stop --force
ENGINE=sglang TRAIN_FILES=${HOME}/data/geo3k/train.parquet VAL_FILES=${HOME}/data/geo3k/test.parquet MAX_PROMPT_LENGTH=1024 MAX_RESPONSE_LENGTH=2048 MODEL_ID=Qwen/Qwen2.5-VL-3B-Instruct ADV_ESTIMATOR=grpo USE_DYNAMIC_BSZ=False SKIP_SAVE_HF_MODEL=1 COMMON_PP=4 COMMON_VPP=null COMMON_CP=1 COMMON_TP=2 USE_DIST_CKPT=true DIST_CKPT_PATH=checkpoints/verl-test/qwen2.5-vl-3b-megatron bash tests/special_e2e/run_ppo_trainer_megatron.sh
- name: clean up
run: |
rm -rf checkpoints
cleanup:
runs-on: ubuntu-latest
needs: [setup,
e2e_ppo_trainer_megatron-deepseek,
e2e_ppo_trainer_megatron-qwen3,
e2e_ppo_trainer_megatron-different-train-infer-tp-qwen-tie-embedding,
e2e_ppo_trainer_megatron-qwen-override-transformer-config,
e2e_ppo_trainer_megatron-deepseek-override-transformer-config,
e2e_ppo_trainer_megatron-moe-expert-parallel,
e2e_ppo_trainer_megatron-qwen2_5vl-3b]
if: always()
steps:
- name: remove runner
run: |
if [[ -z "${{ needs.setup.outputs.mlp-task-id }}" ]]; then
echo "no need remove runner, skip"
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/delete" \
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')

View File

@ -29,8 +29,7 @@
# - new workflow yaml is added to `.github/workflows`
# - new tests are added to workflow mentioned in 2.
name: e2e_ppo_trainer_megatron
# latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0
name: e2e_ppo_trainer_megatron_vllm
on:
# Trigger the workflow on push or pull request,
@ -86,7 +85,7 @@ permissions:
contents: read
env:
IMAGE: "verl-ci-cn-beijing.cr.volces.com/whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3"
IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1"
DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"
jobs:
@ -311,6 +310,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
pip3 install mbridge
- name: Prepare GSM8K dataset
run: |
python3 examples/data_preprocess/gsm8k.py
@ -343,6 +343,7 @@ jobs:
- name: Install the current repository
run: |
pip3 install --no-deps -e .[test]
pip3 install "transformers[hf_xet]<4.52.0"
- name: Prepare Geo3k dataset
run: |
python3 examples/data_preprocess/geo3k.py
@ -376,4 +377,4 @@ jobs:
exit 0
fi
resp=$(curl -X POST "${{ env.DYNAMIC_RUNNER_ENDPOINT }}/delete" \
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')
-d '{"TaskId": "${{ needs.setup.outputs.mlp-task-id }}"}')

View File

@ -80,7 +80,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -68,7 +68,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.5.post3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -66,7 +66,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -80,7 +80,7 @@ jobs:
NO_PROXY: "localhost,127.0.0.1"
HF_HUB_ENABLE_HF_TRANSFER: 1
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -71,7 +71,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@ -129,7 +129,7 @@ jobs:
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -88,7 +88,7 @@ jobs:
HF_HUB_ENABLE_HF_TRANSFER: 1
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
container:
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.0-te2.3
image: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1
options: --gpus all --shm-size=10g
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -1,4 +1,6 @@
FROM whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
# Base Image support aws EFA
# Build Image with frameworks based on this
FROM verlai/verl:app-verl0.5-sglang0.4.6.post5-mcore0.12.1
# For aws instances with EFA net interface (Sagemaker AI Pod)
# install EFA driver:
@ -48,6 +50,6 @@ ENV OMPI_MCA_pml=^cm,ucx \
NCCL_SOCKET_IFNAME=^docker,lo,veth_def_agent \
FI_EFA_USE_HUGE_PAGE=0
# docker build -t whatcanyousee/verl:awsefa --label "commit=$(git rev-parse --short HEAD)" .
# docker build -t verl:awsefa --label "commit=$(git rev-parse --short HEAD)" .
# on aws:
# docker run --ipc=host --privileged --name verldev --gpus all --network=host --shm-size=1800gb -itd whatcanyousee/verl:awsefa
# docker run --ipc=host --privileged --name verldev --gpus all --network=host --shm-size=1800gb -itd verl:awsefa

docker/README.md (new file, 79 lines)
View File

@ -0,0 +1,79 @@
# Dockerfiles of verl
We provide pre-built Docker images for quick setup. Starting from this version, we use a new image release hierarchy for productivity and stability.
The images are divided into three categories:
- **Base Image**: Only basic dependencies are installed, without inference or training frameworks. vLLM or SGLang can be installed directly on top of it, with no need to reinstall torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Preview Image**: Unstable version with the latest frameworks and features.
The first two categories are hosted in the Docker Hub [verlai/verl](https://hub.docker.com/r/verlai/verl) repository, while the preview images are hosted in community repositories.
> Image versions map to verl releases; for example, an image with tag ``verl0.4`` is built for the verl release ``v0.4.x``.
## Base Image
The stable base image is ``verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4-te2.3``. The installed package versions can be read from the tag, and the Dockerfile can be found in ``verl[version]-[packages]/Dockerfile.base``.
The base images for preview are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`` and ``verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0`` with different CUDA versions.
The base image is updated infrequently, and application images can be built on top of it without reinstalling the base packages.
## Application Image
From this version on, vLLM and SGLang images are built separately because of diverging dependencies such as FlashInfer.
There are four types of application images available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
For Megatron 0.13.0, we offer preview images; to use the latest code, just replace ``mcore0.12.1`` with ``mcore0.13.0-preview`` in the image tags above.
The latest vLLM support is coming soon.
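For example, switching the verl0.5 SGLang image to the Megatron preview build (both tags are listed in the verl0.5 README of this PR):

```sh
# Stable Megatron 0.12.1 tag
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1
# Megatron 0.13.0 preview tag
docker pull verlai/verl:app-verl0.5-sglang0.4.8-mcore0.13.0-preview
```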
Docker images with the Megatron backend support post-training of large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the Large Language Model Post-Training documentation (``docs/perf/dpsk``) for more details.
Application images are updated frequently, and the Dockerfiles can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Starting from the base image, it is easy to build your own application image with the desired inference and training frameworks.
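For example, a minimal sketch of building a custom application image from one of the Dockerfiles added in this PR (the output tag `my-verl:app` is only an illustration):

```sh
# Build an SGLang+vLLM+Megatron application image on top of the published base image
git clone https://github.com/volcengine/verl && cd verl
docker build \
    -f docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.sglang.vllm.mcore0.12 \
    -t my-verl:app .
```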
## Community Image
For vLLM with FSDP, please refer to the [hiyouga/verl](https://hub.docker.com/r/hiyouga/verl) repository; the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For SGLang with FSDP, please refer to the [ocss884/verl-sglang](https://hub.docker.com/r/ocss884/verl-sglang) repository; the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5``, which is provided by the SGLang RL Group.
See the files under ``docker/`` for the NGC-based images or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod), you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
## Installation from Docker
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the desired Docker image and attach into it:
```sh
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
```
2. If you use the images provided, you only need to install verl itself without dependencies:
```sh
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
```
[Optional] If you want to switch between different frameworks, you can install verl with the following commands:
```sh
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
```

View File

@ -0,0 +1,41 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.6.post5 and torch-memory-saver
RUN pip install --resume-retries 999 "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
# Some sglang operations in 0.4.6.post5 require vllm
# [Warning] vllm can have some packages not compatible with sglang, for example, flashinfer
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,41 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.6.post5 and torch-memory-saver
RUN pip install --resume-retries 999 "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
# Some sglang operations in 0.4.6.post5 require vllm
# [Warning] vllm can have some packages not compatible with sglang, for example, flashinfer
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,47 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Install flashinfer-0.2.2.post1+cu126 (cxx11abi=True)
# vllm-0.8.3 does not support flashinfer>=0.2.3
# see https://github.com/vllm-project/vllm/pull/15777
RUN aria2c --max-tries=9999 https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
pip install --no-cache-dir flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
rm flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Fix for transformers 4.53.0
RUN pip3 install --no-cache-dir "transformers[hf_xet]<4.52.0"
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,44 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4
# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install torch-2.6.0+cu126 + vllm-0.8.5.post1
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.8.5.post1
# Install flashinfer-0.2.2.post1+cu126 (cxx11abi=True)
# vllm-0.8.3 does not support flashinfer>=0.2.3
# see https://github.com/vllm-project/vllm/pull/15777
RUN aria2c --max-tries=9999 https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
pip install --no-cache-dir flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl && \
rm flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -1,9 +1,11 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-v2-cu124-cudnn9.8-torch2.6-fa2.8.0-te2.3
# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:24.08-py3
# Define environments
ENV MAX_JOBS=32
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
@ -56,18 +58,11 @@ RUN aria2c --always-resume=true --max-tries=99999 https://developer.download.nvi
update-alternatives --set cuda /usr/local/cuda-12.4 && \
rm -rf /usr/local/cuda-12.6
# Install torch-2.6.0+cu124 + vllm-0.8.5.post1 + sglang-0.4.6.post5
# torch-2.6.0+cu124: cxx11abi=False
# torch-2.6.0+cu126: cxx11abi=True
# see https://github.com/flashinfer-ai/flashinfer/issues/911
# Install sglang-0.4.6.post1 and torch-memory-saver
RUN pip install "sglang[all]==0.4.6.post5" --no-cache-dir --find-links https://flashinfer.ai/whl/cu124/torch2.6/flashinfer-python && pip install torch-memory-saver --no-cache-dir
RUN pip install --resume-retries 999 --no-cache-dir torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
RUN pip install --no-cache-dir "vllm==0.8.5.post1" "torch==2.6.0" "torchvision==0.21.0" "torchaudio==2.6.0" "tensordict==0.6.2" torchdata
RUN pip install --no-cache-dir "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=15.0.0" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile \
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
# Install flash-attn-2.7.4.post1 (cxx11abi=False)
@ -86,39 +81,33 @@ RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
RUN pip install --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install Apex
RUN git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.0
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0 && \
dpkg -i ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
# Fix opencv
RUN pip install opencv-python
RUN pip install --resume-retries 999 --no-cache-dir opencv-python
RUN pip install opencv-fixer && \
RUN pip install --resume-retries 999 --no-cache-dir opencv-fixer && \
python -c "from opencv_fixer import AutoFix; AutoFix()"
# Install verl
RUN pip install --resume-retries 999 --no-cache-dir cuda-bindings
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url
RUN apt-get update && \
apt-get install -y aria2 libfreeimage3 libfreeimage-dev zlib1g
RUN apt-get update && \
apt-get install -y libfreeimage3 libfreeimage-dev zlib1g htop
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0 && \
dpkg -i ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb

View File

@ -0,0 +1,29 @@
# verl image with verl v0.4.x
## Important packages version
```txt
cuda==12.4
cudnn==9.8.0
torch==2.6.0
flash_attn==2.7.4
sglang==0.4.6.post5
vllm==0.8.5.post1
nvidia-cudnn-cu12==9.8.0.87
transformer_engine==2.3
megatron.core==core_v0.12.1
# Preview
transformer_engine==2.5
megatron.core==core_r0.13.0
```
## Target
- Base image:
- `verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4`
- App image:
  - `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`: SGLang 0.4.6.post5 requires vLLM, which can introduce some package conflicts with SGLang
- `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1`
- Preview image:
- `verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.13.0-preview`
- `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview`

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v2.3
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,91 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Start from the NVIDIA official image (ubuntu-22.04 + cuda-12.6 + python-3.10)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:24.08-py3
# Define environments
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Define installation arguments
ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Set apt source
RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
{ \
echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
} > /etc/apt/sources.list
# Install systemctl
RUN apt-get update && \
apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
apt-get clean
# Install tini
RUN apt-get update && \
apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
apt-get clean
# Change pip source
RUN pip config set global.index-url "${PIP_INDEX}" && \
pip config set global.extra-index-url "${PIP_INDEX}" && \
python -m pip install --upgrade pip
# Uninstall nv-pytorch fork
RUN pip uninstall -y torch torchvision torchaudio \
pytorch-quantization pytorch-triton torch-tensorrt \
xgboost transformer_engine flash_attn apex megatron-core grpcio
RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
# Install flash-attn-2.8.0.post2 (cxx11abi=True)
RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
FILE="flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp310-cp310-linux_x86_64.whl" && \
wget -nv "${URL}" && \
pip install --no-cache-dir "${FILE}"
# Fix packages
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
# Install cudnn
RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && \
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
# Install Apex
RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0
RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pyext pre-commit ruff
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url

View File

@ -0,0 +1,27 @@
# verl image with verl v0.5
## Important packages version
```txt
cuda==12.6
cudnn==9.8.0
torch==2.7.1
flash_attn==2.8.0
sglang==0.4.8
vllm==0.8.5.post1
nvidia-cudnn-cu12==9.8.0.87
transformer_engine==2.3
megatron.core==core_v0.12.1
# Preview
transformer_engine==2.5
megatron.core==core_r0.13.0
```
## Target
- Base image:
  - `verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`: We offer a base image with FlashInfer 0.2.6.post1 built in
- App image:
- `verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1`
- `verlai/verl:app-verl0.5-sglang0.4.8-mcore0.13.0-preview`
  - vLLM is temporarily not supported in this version

View File

@ -0,0 +1,36 @@
# Start from the verl base image
# Dockerfile.base
FROM verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Define environments
ENV MAX_JOBS=8
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Install sglang-0.4.8 and torch-memory-saver
# Install FlashInfer Python package
RUN pip install --resume-retries 999 --no-cache-dir --no-build-isolation flashinfer-python==0.2.6.post1
RUN pip install --resume-retries 999 --no-cache-dir "sglang[all]==0.4.8" && pip install torch-memory-saver --no-cache-dir
# Fix packages
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
# Install mbridge
RUN pip3 install --no-cache-dir mbridge

View File

@ -0,0 +1,91 @@
# Base Docker Image of verl, with CUDA/Torch/FlashAttn/Apex/TransformerEngine, without other frameworks
# Target: verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0-fi0.2.6
# Start from the NVIDIA official image (cuda-12.8 + python-3.12)
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-08.html
FROM nvcr.io/nvidia/pytorch:25.02-py3
# Define environments
ENV MAX_JOBS=16
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"
# Define installation arguments
ARG APT_SOURCE=https://mirrors.tuna.tsinghua.edu.cn/ubuntu/
ARG PIP_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Set apt source
RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak && \
{ \
echo "deb ${APT_SOURCE} jammy main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-updates main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-backports main restricted universe multiverse"; \
echo "deb ${APT_SOURCE} jammy-security main restricted universe multiverse"; \
} > /etc/apt/sources.list
# Install systemctl
RUN apt-get update && \
apt-get install -y -o Dpkg::Options::="--force-confdef" systemd && \
apt-get clean
# Install tini
RUN apt-get update && \
apt-get install -y tini aria2 libfreeimage3 libfreeimage-dev zlib1g htop && \
apt-get clean
# Change pip source
RUN pip config set global.index-url "${PIP_INDEX}" && \
pip config set global.extra-index-url "${PIP_INDEX}" && \
python -m pip install --upgrade pip
# Uninstall nv-pytorch fork
RUN pip uninstall -y torch torchvision torchaudio \
pytorch-quantization pytorch-triton torch-tensorrt \
xgboost transformer_engine flash_attn apex megatron-core grpcio
RUN pip install --resume-retries 999 --no-cache-dir torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
# Install flash-attn-2.8.0.post2 (cxx11abi flag detected from the installed torch)
RUN ABI_FLAG=$(python -c "import torch; print('TRUE' if torch._C._GLIBCXX_USE_CXX11_ABI else 'FALSE')") && \
URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp312-cp312-linux_x86_64.whl" && \
FILE="flash_attn-2.8.0.post2+cu12torch2.7cxx11abi${ABI_FLAG}-cp312-cp312-linux_x86_64.whl" && \
wget -nv "${URL}" && \
pip install --no-cache-dir "${FILE}"
# Fix packages
RUN pip uninstall -y pynvml nvidia-ml-py && \
pip install --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
# Install cudnn
RUN aria2c --max-tries=9999 https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb && \
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && \
apt-get -y install cudnn-cuda-12 && \
rm cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
# Install Apex
RUN pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" --resume-retries 999 git+https://github.com/NVIDIA/apex.git
# Profiling tools
RUN aria2c --always-resume=true --max-tries=99999 https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
apt-get update && apt-get install -y libxcb-cursor0
RUN apt-get install -y ./nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb && \
rm -rf /usr/local/cuda/bin/nsys && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys /usr/local/cuda/bin/nsys && \
rm -rf /usr/local/cuda/bin/nsys-ui && \
ln -s /opt/nvidia/nsight-systems/2025.3.1/target-linux-x64/nsys-ui /usr/local/cuda/bin/nsys-ui && \
rm nsight-systems-2025.3.1_2025.3.1.90-1_amd64.deb
RUN pip install --resume-retries 999 --no-cache-dir "tensordict==0.6.2" torchdata "transformers[hf_xet]>=4.51.0" accelerate datasets peft hf-transfer \
"numpy<2.0.0" "pyarrow>=19.0.1" pandas cuda-bindings \
ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
pytest py-spy pre-commit ruff
# Reset pip config
RUN pip config unset global.index-url && \
pip config unset global.extra-index-url

View File

@ -0,0 +1,26 @@
# verl image with verl v0.5
## Important package versions
```txt
cuda==12.8
cudnn==9.8.0
torch==2.7.1
flash_attn==2.8.0
sglang==0.4.8
transformer_engine==2.5
megatron.core==core_r0.13.0
nvidia-cudnn-cu12==9.8.0.87
```
## Target
- Base image:
- `verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0`: we offer a base image with FlashInfer 0.2.6.post1 built in
- App image:
- `verlai/verl:app-verl0.5-preview-sglang0.4.8-mcore0.13.0-preview`
- vLLM: the latest version is temporarily not supported
## !!!Notice!!!
- pyext lacks maintenance and does not work with Python 3.12; consider finding a replacement and deprecating this package.
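## Version check
To confirm that a running container matches the versions listed above, a quick check like the following can help (a minimal sketch; exact distribution names in the `pip list` output may differ slightly):
```bash
# report torch and its CUDA build, then grep the other key packages
python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"
pip list 2>/dev/null | grep -Ei "torch|flash|sglang|transformer|megatron|cudnn"
```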

View File

@ -19,7 +19,7 @@ Choices of Backend Engines
We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.11 <https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.1 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.1>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
2. Inference:
@ -33,58 +33,94 @@ For huggingface TGI integration, it is usually used for debugging and single GPU
Install from docker image
-------------------------
We provide pre-built Docker images for quick setup.
We provide pre-built Docker images for quick setup. Starting from this version,
we use a new image release hierarchy for better productivity and stability.
For vLLM with Megatron or FSDP, please use the stable version of image ``whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3``, which supports DeepSeek-V3 671B post-training.
The images are divided into three categories:
For latest vLLM with FSDP, please refer to ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
- **Base Image**: Only basic dependencies are installed, without inference or training frameworks.
You can install vLLM or SGLang directly on top of it without reinstalling torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Community Image**: Unstable version with the latest frameworks and features.
For SGLang with FSDP, please use ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5`` which is provided by SGLang RL Group.
The first two types of images are hosted in the Docker Hub `verlai/verl <https://hub.docker.com/r/verlai/verl>`_ repository, while community images are hosted in community repositories.
.. note::
Image versions map to verl releases; for example, an image tagged ``verl0.4`` is built for verl release ``v0.4.x``.
Base Image
::::::::::
The stable base image is ``verlai/verl:base-verl0.4-cu124-cudnn9.8-torch2.6-fa2.7.4-te2.3``. The installed package versions can be read from the tag, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.base``.
The preview base images are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0`` and ``verlai/verl:base-verl0.5-preview-cu128-cudnn9.8-torch2.7.1-fa2.8.0``, which differ in CUDA version.
The base image is updated infrequently, and app images can be built on top of it without reinstalling the base packages.
Application Image
:::::::::::::::::
From this version on, we provide separate images for vLLM and SGLang because their dependencies (such as FlashInfer) have diverged.
There are four types of application images available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1`` (vLLM is still required and therefore also installed, but some package conflicts may occur)
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.6**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1``
- **Preview version of SGLang with FSDP and Megatron, CUDA 12.8**: ``verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1``
The latest vLLM support is coming soon.
Docker images with the Megatron backend can run post-training for large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the :doc:`Large Language Model Post-Training documentation<../perf/dpsk>` for more details.
Application images may be updated frequently, and their Dockerfiles can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Starting from the base image, it is easy to build your own application image with the desired inference and training frameworks.
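As a rough sketch of how such a custom application image can be assembled on top of a base image (the file name, target tag, and exact package pin below are illustrative, not an officially published recipe):
.. code:: bash

# write a minimal Dockerfile that starts from the published base image
cat > Dockerfile.custom <<'EOF'
FROM verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.8.0
# torch, CUDA, cuDNN and flash-attn are already in place; add only the inference engine
RUN pip install --no-cache-dir "sglang[all]==0.4.8"
EOF

# build and tag the custom application image
docker build -f Dockerfile.custom -t my-org/verl:custom-sglang .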
Community Image
:::::::::::::::
Community images are provided by the community. They track the latest versions of vLLM and SGLang, may include experimental features or configurations, and also cover other hardware and platforms such as AMD GPUs with ROCm, AWS EFA, and SageMaker.
For the latest vLLM with FSDP, please refer to the `hiyouga/verl <https://hub.docker.com/r/hiyouga/verl>`_ repository; the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For the latest SGLang with FSDP, please refer to the `ocss884/verl-sglang <https://hub.docker.com/r/ocss884/verl-sglang>`_ repository; the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5``, which is provided by the SGLang RL Group.
See the files under ``docker/`` for NGC-based images or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
Installation from Docker
::::::::::::::::::::::::
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the desired Docker image and attach to it:
.. code:: bash
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag>
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
2. Inside the container, install latest verl:
2. If you use the provided images, you only need to install verl itself, without its dependencies:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
# pick your choice of inference engine: vllm or sglang
# pip3 install -e .[vllm]
# pip3 install -e .[sglang]
# or install from pypi instead of git via:
# pip3 install verl[vllm]
# pip3 install verl[sglang]
pip3 install --no-deps -e .
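As an optional, quick sanity check that the editable install is picked up:
.. code:: bash

pip show verl
python -c "import verl; print(verl.__file__)"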
.. note::
[Optional] If you want to switch between different frameworks, you can install verl with the following command:
The Docker image ``whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3`` is built with the following configurations:
.. code:: bash
- **PyTorch**: 2.6.0+cu124
- **CUDA**: 12.4
- **cuDNN**: 9.8.0
- **nvidia-cudnn-cu12**: 9.8.0.87, **important for the usage of Megatron FusedAttention with MLA Support**
- **Flash Attenttion**: 2.7.4.post1
- **Flash Infer**: 0.2.5
- **vLLM**: 0.8.5
- **SGLang**: 0.4.6.post5
- **Megatron-LM**: core_v0.12.1
- **TransformerEngine**: 2.3
- **Ray**: 2.44.1
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
.. note::
For AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.awsefa``.
Install from custom environment
---------------------------------------------
@ -99,6 +135,11 @@ For training and inference engines to utilize better and faster hardware support
and some of the dependencies are easily overridden when installing other packages,
so we put them in the :ref:`Post-installation` step.
.. note::
The installation steps below are the recommended configuration for the latest version of verl.
If you are customizing your own environment, you can ignore the strict version constraints.
We need to install the following pre-requisites:
- **CUDA**: Version >= 12.4

View File

@ -4,7 +4,7 @@ codetiming
datasets
dill
hydra-core
numpy
numpy<2.0.0
pandas
peft
pyarrow>=15.0.0

View File

@ -6,7 +6,7 @@ dill
flash-attn
hydra-core
liger-kernel
numpy
numpy<2.0.0
pandas
peft
pyarrow>=19.0.0

View File

@ -5,7 +5,7 @@ datasets
dill
flash-attn
hydra-core
numpy
numpy<2.0.0
pandas
peft
pyarrow>=19.0.0

View File

@ -29,7 +29,7 @@ install_requires = [
"datasets",
"dill",
"hydra-core",
"numpy",
"numpy<2.0.0",
"pandas",
"peft",
"pyarrow>=19.0.0",

View File

@ -123,110 +123,112 @@ if [ "$USE_DIST_CKPT" = "True" ]; then
--output_path "${DIST_CKPT_PATH}"
fi
ENGINES=("vllm" "sglang")
ENGINE=${ENGINE:-"vllm"}
exp_name="$(basename "${MODEL_ID,,}")-megatron-gsm8k-minimal"
for ENGINE in "${ENGINES[@]}"; do
if [ "$ENGINE" = "vllm" ]; then
MODE=${MODE:-"sync"}
ROLLOUT_MODE_ARG="actor_rollout_ref.rollout.mode=${MODE}"
if [ "$MODE" = "async" ]; then
ROLLOUT_MODE_ARG="${ROLLOUT_MODE_ARG} data.return_raw_chat=True"
fi
else
ROLLOUT_MODE_ARG=""
if [ "$ENGINE" = "vllm" ]; then
MODE=${MODE:-"sync"}
ROLLOUT_MODE_ARG="actor_rollout_ref.rollout.mode=${MODE}"
if [ "$MODE" = "async" ]; then
ROLLOUT_MODE_ARG="${ROLLOUT_MODE_ARG} data.return_raw_chat=True"
fi
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator="${ADV_ESTIMATOR}" \
data.train_files="${TRAIN_FILES}" \
data.val_files="${VAL_FILES}" \
data.train_batch_size=${train_prompt_bsz} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.actor.use_dynamic_bsz=${USE_DYNAMIC_BSZ} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$ACTOR_PP \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=$ACTOR_VPP \
actor_rollout_ref.actor.megatron.context_parallel_size=$ACTOR_CP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$ACTOR_TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$ACTOR_EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ACTOR_ETP \
actor_rollout_ref.actor.megatron.param_offload=${ACTOR_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.optimizer_offload=${ACTOR_OPTIMIZER_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${ACTOR_GRAD_OFFLOAD} \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
actor_rollout_ref.rollout.name="${ENGINE}" ${ROLLOUT_MODE_ARG}\
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$REF_PP \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=$REF_VPP \
actor_rollout_ref.ref.megatron.context_parallel_size=$REF_CP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$REF_TP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$REF_EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$REF_ETP \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.param_offload=${REF_PARAM_OFFLOAD} \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.ref.megatron.use_mbridge=${USE_MBRIDGE} \
critic.optim.lr=2e-5 \
critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
critic.model.path="${MODEL_PATH}" \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
critic.megatron.pipeline_model_parallel_size=$CRITIC_PP \
critic.megatron.virtual_pipeline_model_parallel_size=$CRITIC_VPP \
critic.megatron.context_parallel_size=$CRITIC_CP \
critic.megatron.tensor_model_parallel_size=$CRITIC_TP \
critic.megatron.expert_model_parallel_size=$CRITIC_EP \
critic.megatron.expert_tensor_parallel_size=$CRITIC_ETP \
critic.megatron.param_offload=${CRITIC_PARAM_OFFLOAD} \
critic.megatron.optimizer_offload=${CRITIC_OPTIMIZER_OFFLOAD} \
critic.megatron.grad_offload=${CRITIC_GRAD_OFFLOAD} \
critic.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
critic.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
reward_model.enable=True \
reward_model.model.path="${MODEL_PATH}" \
reward_model.micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
reward_model.megatron.pipeline_model_parallel_size=$RM_PP \
reward_model.megatron.virtual_pipeline_model_parallel_size=$RM_VPP \
reward_model.megatron.context_parallel_size=$RM_CP \
reward_model.megatron.tensor_model_parallel_size=$RM_TP \
reward_model.megatron.expert_model_parallel_size=$RM_EP \
reward_model.megatron.expert_tensor_parallel_size=$RM_ETP \
reward_model.megatron.param_offload=${RM_PARAM_OFFLOAD} \
reward_model.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
reward_model.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
algorithm.use_kl_in_reward=False \
algorithm.kl_penalty=kl \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl-test' \
trainer.experiment_name="${exp_name}" \
trainer.nnodes=1 \
trainer.n_gpus_per_node=${NUM_GPUS} \
trainer.val_before_train="${VAL_BEFORE_TRAIN}" \
trainer.test_freq="${TEST_FREQ}" \
trainer.save_freq="${SAVE_FREQ}" \
trainer.resume_mode="${RESUME_MODE}" \
trainer.total_epochs=2 \
trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" $@
done
else
ROLLOUT_MODE_ARG=""
fi
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator="${ADV_ESTIMATOR}" \
data.train_files="${TRAIN_FILES}" \
data.val_files="${VAL_FILES}" \
data.train_batch_size=${train_prompt_bsz} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.actor.use_dynamic_bsz=${USE_DYNAMIC_BSZ} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} \
actor_rollout_ref.actor.megatron.use_mbridge=${USE_MBRIDGE} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$ACTOR_PP \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=$ACTOR_VPP \
actor_rollout_ref.actor.megatron.context_parallel_size=$ACTOR_CP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$ACTOR_TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$ACTOR_EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ACTOR_ETP \
actor_rollout_ref.actor.megatron.param_offload=${ACTOR_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.optimizer_offload=${ACTOR_OPTIMIZER_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${ACTOR_GRAD_OFFLOAD} \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
actor_rollout_ref.rollout.name="${ENGINE}" ${ROLLOUT_MODE_ARG}\
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
actor_rollout_ref.ref.megatron.use_mbridge=${USE_MBRIDGE} \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$REF_PP \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=$REF_VPP \
actor_rollout_ref.ref.megatron.context_parallel_size=$REF_CP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$REF_TP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$REF_EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$REF_ETP \
actor_rollout_ref.ref.megatron.param_offload=${REF_PARAM_OFFLOAD} \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.optim.lr=2e-5 \
critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
critic.model.path="${MODEL_PATH}" \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
critic.megatron.use_mbridge=${USE_MBRIDGE} \
critic.megatron.pipeline_model_parallel_size=$CRITIC_PP \
critic.megatron.virtual_pipeline_model_parallel_size=$CRITIC_VPP \
critic.megatron.context_parallel_size=$CRITIC_CP \
critic.megatron.tensor_model_parallel_size=$CRITIC_TP \
critic.megatron.expert_model_parallel_size=$CRITIC_EP \
critic.megatron.expert_tensor_parallel_size=$CRITIC_ETP \
critic.megatron.param_offload=${CRITIC_PARAM_OFFLOAD} \
critic.megatron.optimizer_offload=${CRITIC_OPTIMIZER_OFFLOAD} \
critic.megatron.grad_offload=${CRITIC_GRAD_OFFLOAD} \
critic.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
critic.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
critic.checkpoint.save_contents=$CHECKPOINT_CONTENTS \
reward_model.enable=True \
reward_model.model.path="${MODEL_PATH}" \
reward_model.micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
reward_model.megatron.use_mbridge=${USE_MBRIDGE} \
reward_model.megatron.pipeline_model_parallel_size=$RM_PP \
reward_model.megatron.virtual_pipeline_model_parallel_size=$RM_VPP \
reward_model.megatron.context_parallel_size=$RM_CP \
reward_model.megatron.tensor_model_parallel_size=$RM_TP \
reward_model.megatron.expert_model_parallel_size=$RM_EP \
reward_model.megatron.expert_tensor_parallel_size=$RM_ETP \
reward_model.megatron.param_offload=${RM_PARAM_OFFLOAD} \
reward_model.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
reward_model.megatron.dist_checkpointing_path=${DIST_CKPT_PATH} \
algorithm.use_kl_in_reward=False \
algorithm.kl_penalty=kl \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl-test' \
trainer.experiment_name="${exp_name}" \
trainer.nnodes=1 \
trainer.n_gpus_per_node=${NUM_GPUS} \
trainer.val_before_train="${VAL_BEFORE_TRAIN}" \
trainer.test_freq="${TEST_FREQ}" \
trainer.save_freq="${SAVE_FREQ}" \
trainer.resume_mode="${RESUME_MODE}" \
trainer.total_epochs=2 \
trainer.total_training_steps="${TOTAL_TRAIN_STEPS}" $@

View File

@ -240,6 +240,7 @@ def apply_monkey_patch(
if is_transformers_version_in_range(min_version="4.53.0"):
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLAttention
# TODO: Support transformers 4.53
raise ValueError("Transformers 4.53 is not supported")
else:
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
@ -266,6 +267,7 @@ def apply_monkey_patch(
if is_transformers_version_in_range(min_version="4.53.0"):
from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLAttention
# TODO: Support transformers 4.53
raise ValueError("Transformers 4.53 is not supported")
else:
from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLFlashAttention2 as Qwen2VLAttention