Mirror of https://github.com/volcengine/verl.git (synced 2025-10-20 13:43:50 +08:00)
[BREAKING][megatron] refactor: activation checkpointing APIs (#2651)
### What does this PR do?

Since we already expose the `override_transformer_config` option, we use it directly to configure activation recomputation. The default settings are the same as in `megatron.training`.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
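For concreteness, a minimal sketch of the new configuration surface, assembled from the config diffs below (the nesting is assumed to mirror the generated trainer config shown further down; the comments summarize the option semantics documented there):

```yaml
# Sketch only: Megatron activation recomputation is now driven by
# override_transformer_config; enable_gradient_checkpointing and
# gradient_checkpointing_kwargs are removed for Megatron workers.
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: selective  # "full" or "selective" (default)
        recompute_modules: ["core_attn"]  # what to recompute under "selective"
        recompute_method: null            # "uniform" or "block", for "full" granularity
        recompute_num_layers: null        # layer count used by the chosen method
```

The same keys can also be set from a launch script as dotted Hydra-style overrides, e.g. `actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full`.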
@@ -213,8 +213,9 @@ Actor/Rollout/Reference Policy
   the Huggingface system.
 - ``actor_rollout_ref.model.override_config``: Used to override some of
   the model's original configurations, mainly dropout
-- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to
-  enable gradient checkpointing for the actor
+- ``actor_rollout_ref.model.enable_gradient_checkpointing``: FSDP only; decides
+  whether to enable gradient checkpointing for the actor.
+  Megatron uses the recompute options in ``override_transformer_config`` to set this.
 - ``actor_rollout_ref.model.enable_activation_offload``: Whether to enable
   activation offloading for the actor
 - ``actor_rollout_ref.model.trust_remote_code``: Whether to enable loading
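In config terms, the split documented above looks roughly like this (a sketch; only the relevant keys are shown):

```yaml
# FSDP backend: the model-level flag still controls gradient checkpointing
actor_rollout_ref:
  model:
    enable_gradient_checkpointing: True
---
# Megatron backend: the flag no longer applies; use recompute options instead
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: selective
        recompute_modules: ["core_attn"]
```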
@@ -33,7 +33,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -34,7 +34,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -40,7 +40,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -31,7 +31,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -29,7 +29,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=sglang \
@@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -82,7 +82,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -36,7 +36,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
     critic.optim.lr=1e-5 \
     critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -45,7 +45,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.profiler.discrete=$DISCRETE \
     critic.optim.lr=1e-5 \
     critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     critic.profiler.ranks=$PROFILE_RANKS \
     critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
@@ -62,7 +62,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
     critic.optim.lr=1e-5 \
     critic.model.path=$LLM \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -38,7 +38,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.megatron.seed=42 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
     actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2 \
     actor_rollout_ref.ref.megatron.context_parallel_size=2 \
@@ -84,7 +84,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -86,7 +86,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -95,7 +95,6 @@ python3 -m recipe.one_step_off_policy.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -89,7 +89,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -191,7 +191,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     critic.optim.lr=2e-5 \
     critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
     critic.model.path="${MODEL_PATH}" \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
     critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
     critic.megatron.use_mbridge=${USE_MBRIDGE} \
@@ -71,7 +71,12 @@ actor_rollout_ref:
       dist_checkpointing_path: null
       seed: 42
       override_ddp_config: {}
-      override_transformer_config: {}
+      override_transformer_config:
+        recompute_granularity: selective
+        recompute_modules:
+        - core_attn
+        recompute_method: null
+        recompute_num_layers: null
       use_mbridge: false
     profile:
       use_profile: false
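These generated defaults amount to selective recomputation of core attention only. When more memory needs to be reclaimed, the same override can request full recomputation; a hypothetical example (values are illustrative, semantics per the commented source config further down):

```yaml
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: full  # checkpoint entire transformer layers
        recompute_method: uniform    # divide layers into equal chunks, checkpoint each chunk's input
        recompute_num_layers: 1      # one layer per chunk: most memory saved, most recompute cost
```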
@@ -184,11 +189,6 @@ actor_rollout_ref:
       model_config: {}
       moe_config:
         freeze_moe_router: false
-    enable_gradient_checkpointing: false
-    gradient_checkpointing_kwargs:
-      activations_checkpoint_method: null
-      activations_checkpoint_granularity: null
-      activations_checkpoint_num_layers: null
     use_fused_kernels: false
     trust_remote_code: false
   profiler:
@@ -305,12 +305,7 @@ critic:
       moe_config:
         freeze_moe_router: false
     external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-    enable_gradient_checkpointing: false
     trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
-    gradient_checkpointing_kwargs:
-      activations_checkpoint_method: null
-      activations_checkpoint_granularity: null
-      activations_checkpoint_num_layers: null
   ppo_mini_batch_size: ${oc.select:actor_rollout_ref.actor.ppo_mini_batch_size,256}
   ppo_micro_batch_size: null
   ppo_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size,null}
@@ -393,7 +388,7 @@ reward_model:
     use_dist_checkpointing: false
     dist_checkpointing_path: null
     seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
-    override_transformer_config: {}
+    override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
     use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
     load_weight: true
 custom_reward_function:
@@ -263,9 +263,9 @@ critic:
     tokenizer_path: ${oc.select:actor_rollout_ref.model.path,"~/models/deepseek-llm-7b-chat"}
     override_config: {}
     external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-    enable_gradient_checkpointing: true
     trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
     use_shm: false
+    enable_gradient_checkpointing: true
     enable_activation_offload: false
     use_remove_padding: false
     fsdp_config:
@@ -81,7 +81,23 @@ megatron:
 
   # additional transformer config like: num_layers_in_first(/last)_pipeline_stage
   # oc.select: default val for ref.megatron.override_transformer_config
-  override_transformer_config: {}
+  override_transformer_config:
+    # Recompute configuration, same as in megatron.training.arguments; the defaults use the recompute options with the least performance interference
+    # Recompute granularity, choices: ["full", "selective"];
+    # 'full' checkpoints entire transformer layers, 'selective' only checkpoints the memory-intensive parts of attention
+    recompute_granularity: selective
+
+    # Recompute modules, multiple choices: ["core_attn", "moe_act", "layernorm", "mla_up_proj", "mlp", "moe"]
+    # Please use modules that exist in the matched model
+    recompute_modules: ["core_attn"]
+
+    # Recompute method, choices: ['uniform', 'block'];
+    # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
+    # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
+    recompute_method: null
+
+    # Number of layers to recompute, used together with recompute_method above
+    recompute_num_layers: null
 
   # oc.select: default val for ref.megatron.use_mbridge
   use_mbridge: False
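To illustrate the 'block' method described in the comments above, a hypothetical stage-wise setting (the layer count is arbitrary):

```yaml
override_transformer_config:
  recompute_granularity: full
  recompute_method: block  # recompute a fixed number of layers in each pipeline stage
  recompute_num_layers: 2  # checkpoint two layers per stage; the rest keep their activations
```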
@@ -31,9 +31,6 @@ model:
   # External model implementation (optional)
   external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
 
-  # Enable gradient checkpointing to save memory
-  enable_gradient_checkpointing: True
-
   # Whether to trust remote code from Hugging Face models
   trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
 
@@ -33,6 +33,9 @@ model:
   # Whether to use shared memory for loading the model
   use_shm: False
 
+  # Enable gradient checkpointing to save memory
+  enable_gradient_checkpointing: True
+
   # Offload activations to CPU to reduce GPU memory usage
   enable_activation_offload: False
 
@@ -55,19 +55,13 @@ model:
 
   # override default empty mapping
   override_config:
 
     model_config: {}
 
+    moe_config:
+
+      freeze_moe_router: False
-
-  # Enable gradient checkpointing to save memory
-  enable_gradient_checkpointing: False
-
-  # Activation Checkpointing settings
-  gradient_checkpointing_kwargs:
-    activations_checkpoint_method: null
-    activations_checkpoint_granularity: null
-    activations_checkpoint_num_layers: null
 
   # megatron-specific parallelism settings
   megatron:
 
@@ -41,20 +41,6 @@ actor_rollout_ref:
 
         freeze_moe_router: False
 
-    enable_gradient_checkpointing: False
-
-    gradient_checkpointing_kwargs:
-
-      ## Activation Checkpointing
-      activations_checkpoint_method: null # 'uniform', 'block'; not used with 'selective'
-
-      # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
-      # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
-      activations_checkpoint_granularity: null # 'selective' or 'full'
-
-      # 'full' will checkpoint the entire transformer layer and 'selective' only checkpoints memory intensive part of attention
-      activations_checkpoint_num_layers: null # not used with 'selective'
-
     use_fused_kernels: False # Whether to use custom fused kernels (PostProcessing, for memory efficiency)
 
     trust_remote_code: False
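For migration, the removed kwargs map onto the new Megatron-native options by name; an assumed correspondence based on the parallel naming (verify against your Megatron version):

```yaml
# Before (removed from Megatron model configs):
model:
  enable_gradient_checkpointing: False
  gradient_checkpointing_kwargs:
    activations_checkpoint_method: null       # -> recompute_method
    activations_checkpoint_granularity: null  # -> recompute_granularity
    activations_checkpoint_num_layers: null   # -> recompute_num_layers
---
# After (under megatron.override_transformer_config):
megatron:
  override_transformer_config:
    recompute_method: null
    recompute_granularity: selective
    recompute_num_layers: null
    recompute_modules: ["core_attn"]  # new: selective-recompute targets
```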
@@ -52,7 +52,7 @@ megatron:
   seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
 
   # Any overrides to transformer config
-  override_transformer_config: {}
+  override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
 
   # Whether to use mbridge for faster comms
   use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
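Because of the `oc.select` defaults above, the ref worker and reward model inherit the actor's transformer overrides unless set explicitly. A sketch of diverging the ref worker from the actor (hypothetical values):

```yaml
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: full       # actor: aggressive recomputation
        recompute_method: uniform
        recompute_num_layers: 1
  ref:
    megatron:
      # without this block, the ref worker inherits the actor's settings via oc.select
      override_transformer_config:
        recompute_granularity: selective  # ref: cheaper selective recomputation
        recompute_modules: ["core_attn"]
```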