[BREAKING][megatron] refactor: activation checkpointing APIs (#2651)

### What does this PR do?

Because the `override_transformer_config` option is already exposed,
Megatron activation recomputation is now configured directly through it.
The default settings match those of `megatron.training`.
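
As a rough illustration of the new entry point, the sketch below builds Hydra-style command-line overrides for the recompute options. The helper `recompute_overrides` is hypothetical and not part of verl. The key prefix and option names are taken from the config changes in this PR, and the rule that `full` granularity needs a method and a layer count is an assumption about Megatron's validation.

```python
from typing import List, Optional, Sequence

# Config path used in this PR for Megatron transformer overrides.
PREFIX = "actor_rollout_ref.actor.megatron.override_transformer_config"


def recompute_overrides(
    granularity: str = "selective",
    modules: Sequence[str] = ("core_attn",),
    method: Optional[str] = None,
    num_layers: Optional[int] = None,
    prefix: str = PREFIX,
) -> List[str]:
    """Hypothetical helper: build Hydra CLI overrides for the recompute options."""
    if granularity not in ("full", "selective"):
        raise ValueError(f"unknown recompute_granularity: {granularity!r}")
    overrides = [f"{prefix}.recompute_granularity={granularity}"]
    if granularity == "selective":
        # Selective recompute names individual modules, e.g. core attention.
        overrides.append(f"{prefix}.recompute_modules=[{','.join(modules)}]")
    else:
        # Assumption: full recompute needs both a method and a layer count.
        if method not in ("uniform", "block"):
            raise ValueError("full recompute needs method 'uniform' or 'block'")
        if num_layers is None:
            raise ValueError("full recompute needs recompute_num_layers")
        overrides.append(f"{prefix}.recompute_method={method}")
        overrides.append(f"{prefix}.recompute_num_layers={num_layers}")
    return overrides
```

Calling `recompute_overrides()` with no arguments reproduces the defaults introduced here (selective recompute of `core_attn`); the returned strings can be appended to a `python3 -m verl.trainer.main_ppo ...` command line.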

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
  will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
    `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
    `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
    `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like
    `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```
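
A possible migration path for existing Megatron configs, as a minimal sketch: the removed `gradient_checkpointing_kwargs` keys are translated into recompute keys for `override_transformer_config`. The helper `migrate_model_config` is hypothetical, the old-to-new key pairing is inferred from the matching names in this PR's config diff, and treating `recompute_granularity: null` as "recompute off" is an assumption.

```python
from typing import Any, Dict

# Removed key under gradient_checkpointing_kwargs -> new recompute key.
# Assumption: the pairing is inferred from the names, not stated in the PR.
KEY_MAP = {
    "activations_checkpoint_method": "recompute_method",
    "activations_checkpoint_granularity": "recompute_granularity",
    "activations_checkpoint_num_layers": "recompute_num_layers",
}


def migrate_model_config(model_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map removed model options to recompute overrides."""
    if not model_cfg.get("enable_gradient_checkpointing", False):
        # Assumption: a null granularity turns recompute off, which overrides
        # the new selective default and keeps the old "disabled" behaviour.
        return {"recompute_granularity": None}
    old_kwargs = model_cfg.get("gradient_checkpointing_kwargs") or {}
    new_cfg: Dict[str, Any] = {}
    for old_key, new_key in KEY_MAP.items():
        value = old_kwargs.get(old_key)
        if value is not None:
            new_cfg[new_key] = value
    return new_cfg
```

The returned mapping is meant to be placed under `actor_rollout_ref.actor.megatron.override_transformer_config`. An empty result means the old config enabled checkpointing without specifying details, in which case the new defaults apply.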

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
This commit is contained in:
Blue Space
2025-07-22 10:24:28 +08:00
committed by GitHub
parent 72cae971d0
commit c5b189a1af
26 changed files with 35 additions and 60 deletions

View File

@ -213,8 +213,9 @@ Actor/Rollout/Reference Policy
the Huggingface system.
- ``actor_rollout_ref.model.override_config``: Used to override some of
the model's original configurations, mainly dropout
-- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to
-enable gradient checkpointing for the actor
+- ``actor_rollout_ref.model.enable_gradient_checkpointing``: FSDP only.
+Whether to enable gradient checkpointing for the actor. Megatron instead
+uses the recompute options in ``override_transformer_config``.
- ``actor_rollout_ref.model.enable_activation_offload``: Whether to enable
activation offloading for the actor
- ``actor_rollout_ref.model.trust_remote_code``: Whether to enable loading

View File

@ -33,7 +33,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -34,7 +34,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -40,7 +40,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -31,7 +31,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -29,7 +29,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=sglang \

View File

@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -82,7 +82,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \

View File

@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \

View File

@ -36,7 +36,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
critic.optim.lr=1e-5 \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \

View File

@ -45,7 +45,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
actor_rollout_ref.profiler.discrete=$DISCRETE \
critic.optim.lr=1e-5 \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.profiler.ranks=$PROFILE_RANKS \
critic.profiler.all_ranks=$PROFILE_RANKS_ALL \

View File

@ -62,7 +62,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
critic.optim.lr=1e-5 \
critic.model.path=$LLM \
-critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \

View File

@ -38,7 +38,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.megatron.seed=42 \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.context_parallel_size=2 \

View File

@ -84,7 +84,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \

View File

@ -86,7 +86,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \

View File

@ -95,7 +95,6 @@ python3 -m recipe.one_step_off_policy.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \

View File

@ -89,7 +89,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
-actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \

View File

@ -191,7 +191,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
critic.optim.lr=2e-5 \
critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
critic.model.path="${MODEL_PATH}" \
-critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
critic.megatron.use_mbridge=${USE_MBRIDGE} \

View File

@ -71,7 +71,12 @@ actor_rollout_ref:
dist_checkpointing_path: null
seed: 42
override_ddp_config: {}
-override_transformer_config: {}
+override_transformer_config:
+  recompute_granularity: selective
+  recompute_modules:
+  - core_attn
+  recompute_method: null
+  recompute_num_layers: null
use_mbridge: false
profile:
use_profile: false
@ -184,11 +189,6 @@ actor_rollout_ref:
model_config: {}
moe_config:
freeze_moe_router: false
-enable_gradient_checkpointing: false
-gradient_checkpointing_kwargs:
-  activations_checkpoint_method: null
-  activations_checkpoint_granularity: null
-  activations_checkpoint_num_layers: null
use_fused_kernels: false
trust_remote_code: false
profiler:
@ -305,12 +305,7 @@ critic:
moe_config:
freeze_moe_router: false
external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-enable_gradient_checkpointing: false
trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
-gradient_checkpointing_kwargs:
-  activations_checkpoint_method: null
-  activations_checkpoint_granularity: null
-  activations_checkpoint_num_layers: null
ppo_mini_batch_size: ${oc.select:actor_rollout_ref.actor.ppo_mini_batch_size,256}
ppo_micro_batch_size: null
ppo_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size,null}
@ -393,7 +388,7 @@ reward_model:
use_dist_checkpointing: false
dist_checkpointing_path: null
seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
-override_transformer_config: {}
+override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
load_weight: true
custom_reward_function:

View File

@ -263,9 +263,9 @@ critic:
tokenizer_path: ${oc.select:actor_rollout_ref.model.path,"~/models/deepseek-llm-7b-chat"}
override_config: {}
external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-enable_gradient_checkpointing: true
trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
use_shm: false
+enable_gradient_checkpointing: true
enable_activation_offload: false
use_remove_padding: false
fsdp_config:

View File

@ -81,7 +81,23 @@ megatron:
# additional transformer config like: num_layers_in_first(/last)_pipeline_stage
# oc.select: default val for ref.megatron.override_transformer_config
-override_transformer_config: {}
+override_transformer_config:
+  # Recompute configuration, same as in megatron.training.arguments
+  # The defaults use the recompute settings that interfere least with performance
+  # Recompute granularity, choices: ["full", "selective"]
+  # 'full' will checkpoint the entire transformer layer and 'selective' only checkpoints the memory-intensive part of attention
+  recompute_granularity: selective
+  # Recompute modules, multiple choices: ["core_attn", "moe_act", "layernorm", "mla_up_proj", "mlp", "moe"]
+  # Use only the modules that exist in the model being trained
+  recompute_modules: ["core_attn"]
+  # Recompute method, choices: 'uniform', 'block'; not used with 'selective'
+  # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
+  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
+  recompute_method: null
+  # Number of layers used by the recompute method; not used with 'selective'
+  recompute_num_layers: null
# oc.select: default val for ref.megatron.use_mbridge
use_mbridge: False

View File

@ -31,9 +31,6 @@ model:
# External model implementation (optional)
external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-# Enable gradient checkpointing to save memory
-enable_gradient_checkpointing: True
# Whether to trust remote code from Hugging Face models
trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}

View File

@ -33,6 +33,9 @@ model:
# Whether to use shared memory for loading the model
use_shm: False
+# Enable gradient checkpointing to save memory
+enable_gradient_checkpointing: True
# Offload activations to CPU to reduce GPU memory usage
enable_activation_offload: False

View File

@ -55,19 +55,13 @@ model:
# override default empty mapping
override_config:
model_config: {}
moe_config:
freeze_moe_router: False
-# Enable gradient checkpointing to save memory
-enable_gradient_checkpointing: False
-# Activation Checkpointing settings
-gradient_checkpointing_kwargs:
-  activations_checkpoint_method: null
-  activations_checkpoint_granularity: null
-  activations_checkpoint_num_layers: null
# megatron-specific parallelism settings
megatron:

View File

@ -41,20 +41,6 @@ actor_rollout_ref:
freeze_moe_router: False
-enable_gradient_checkpointing: False
-gradient_checkpointing_kwargs:
-  ## Activation Checkpointing
-  activations_checkpoint_method: null # 'uniform', 'block'; not used with 'selective'
-  # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
-  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
-  activations_checkpoint_granularity: null # 'selective' or 'full'
-  # 'full' will checkpoint the entire transformer layer and 'selective' only checkpoints memory intensive part of attention
-  activations_checkpoint_num_layers: null # not used with 'selective'
use_fused_kernels: False # Whether to use custom fused kernels (PostProcessing, for memory efficiency)
trust_remote_code: False

View File

@ -52,7 +52,7 @@ megatron:
seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
# Any overrides to transformer config
-override_transformer_config: {}
+override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
# Whether to use mbridge for faster comms
use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}