Mirror of https://github.com/volcengine/verl.git (synced 2025-10-20 13:43:50 +08:00)
[BREAKING][megatron] refactor: activation checkpointing APIs (#2651)
### What does this PR do?

Since we already expose the `override_transformer_config` option, we use it directly to configure activation recomputation. The default settings are the same as in `megatron.training`.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
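For concreteness, a minimal sketch of the new configuration surface, assembled from the config diffs below (the nesting is assumed to mirror the generated trainer config shown further down; the comments summarize the option semantics documented there):

```yaml
# Sketch only: Megatron activation recomputation is now driven by
# override_transformer_config; enable_gradient_checkpointing and
# gradient_checkpointing_kwargs are removed for Megatron workers.
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: selective  # "full" or "selective" (default)
        recompute_modules: ["core_attn"]  # what to recompute under "selective"
        recompute_method: null            # "uniform" or "block", for "full" granularity
        recompute_num_layers: null        # layer count used by the chosen method
```

The same keys can also be set from a launch script as dotted Hydra-style overrides, e.g. `actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full`.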
@@ -213,8 +213,9 @@ Actor/Rollout/Reference Policy
   the Huggingface system.
 - ``actor_rollout_ref.model.override_config``: Used to override some of
   the model's original configurations, mainly dropout
-- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to
-  enable gradient checkpointing for the actor
+- ``actor_rollout_ref.model.enable_gradient_checkpointing``: FSDP only; decides
+  whether to enable gradient checkpointing for the actor.
+  Megatron uses the recompute options in ``override_transformer_config`` to set this.
 - ``actor_rollout_ref.model.enable_activation_offload``: Whether to enable
   activation offloading for the actor
 - ``actor_rollout_ref.model.trust_remote_code``: Whether to enable loading
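In config terms, the split documented above looks roughly like this (a sketch; only the relevant keys are shown):

```yaml
# FSDP backend: the model-level flag still controls gradient checkpointing
actor_rollout_ref:
  model:
    enable_gradient_checkpointing: True
---
# Megatron backend: the flag no longer applies; use recompute options instead
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: selective
        recompute_modules: ["core_attn"]
```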
@@ -33,7 +33,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -34,7 +34,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -40,7 +40,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -31,7 +31,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=vllm \
@@ -29,7 +29,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.name=sglang \
@@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -82,7 +82,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -30,7 +30,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.actor.kl_loss_coef=0.001 \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
     actor_rollout_ref.rollout.name=vllm \
@@ -36,7 +36,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2 \
     critic.optim.lr=1e-5 \
     critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -45,7 +45,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.profiler.discrete=$DISCRETE \
     critic.optim.lr=1e-5 \
     critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     critic.profiler.ranks=$PROFILE_RANKS \
     critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
@@ -62,7 +62,6 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
     critic.optim.lr=1e-5 \
     critic.model.path=$LLM \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=4 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
@@ -38,7 +38,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.kl_loss_type=low_var_kl \
     actor_rollout_ref.actor.entropy_coeff=0 \
     actor_rollout_ref.actor.megatron.seed=42 \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
     actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2 \
     actor_rollout_ref.ref.megatron.context_parallel_size=2 \
@@ -84,7 +84,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -86,7 +86,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -95,7 +95,6 @@ python3 -m recipe.one_step_off_policy.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -89,7 +89,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.model.path="${MODEL_PATH}" \
-    actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
     actor_rollout_ref.actor.optim.weight_decay=0.1 \
@@ -191,7 +191,6 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     critic.optim.lr=2e-5 \
     critic.optim.lr_warmup_steps=$LR_WARMUP_STEPS \
     critic.model.path="${MODEL_PATH}" \
-    critic.model.enable_gradient_checkpointing=False \
     critic.ppo_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
     critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
     critic.megatron.use_mbridge=${USE_MBRIDGE} \
@@ -71,7 +71,12 @@ actor_rollout_ref:
       dist_checkpointing_path: null
       seed: 42
       override_ddp_config: {}
-      override_transformer_config: {}
+      override_transformer_config:
+        recompute_granularity: selective
+        recompute_modules:
+        - core_attn
+        recompute_method: null
+        recompute_num_layers: null
       use_mbridge: false
     profile:
       use_profile: false
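These generated defaults amount to selective recomputation of core attention only. When more memory needs to be reclaimed, the same override can request full recomputation; a hypothetical example (values are illustrative, semantics per the commented source config further down):

```yaml
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: full  # checkpoint entire transformer layers
        recompute_method: uniform    # divide layers into equal chunks, checkpoint each chunk's input
        recompute_num_layers: 1      # one layer per chunk: most memory saved, most recompute cost
```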
@@ -184,11 +189,6 @@ actor_rollout_ref:
       model_config: {}
       moe_config:
         freeze_moe_router: false
-    enable_gradient_checkpointing: false
-    gradient_checkpointing_kwargs:
-      activations_checkpoint_method: null
-      activations_checkpoint_granularity: null
-      activations_checkpoint_num_layers: null
     use_fused_kernels: false
     trust_remote_code: false
   profiler:
@@ -305,12 +305,7 @@ critic:
       moe_config:
         freeze_moe_router: false
     external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-    enable_gradient_checkpointing: false
     trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
-    gradient_checkpointing_kwargs:
-      activations_checkpoint_method: null
-      activations_checkpoint_granularity: null
-      activations_checkpoint_num_layers: null
   ppo_mini_batch_size: ${oc.select:actor_rollout_ref.actor.ppo_mini_batch_size,256}
   ppo_micro_batch_size: null
   ppo_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size,null}
@@ -393,7 +388,7 @@ reward_model:
     use_dist_checkpointing: false
     dist_checkpointing_path: null
     seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
-    override_transformer_config: {}
+    override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
     use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
     load_weight: true
 custom_reward_function:
@@ -263,9 +263,9 @@ critic:
     tokenizer_path: ${oc.select:actor_rollout_ref.model.path,"~/models/deepseek-llm-7b-chat"}
     override_config: {}
     external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
-    enable_gradient_checkpointing: true
     trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
     use_shm: false
+    enable_gradient_checkpointing: true
     enable_activation_offload: false
     use_remove_padding: false
     fsdp_config:
@@ -81,7 +81,23 @@ megatron:
 
   # additional transformer config like: num_layers_in_first(/last)_pipeline_stage
   # oc.select: default val for ref.megatron.override_transformer_config
-  override_transformer_config: {}
+  override_transformer_config:
+    # Recompute configuration, same as in megatron.training.arguments; the defaults use the recompute options with the least performance interference
+    # Recompute granularity, choices: ["full", "selective"];
+    # 'full' checkpoints entire transformer layers, 'selective' only checkpoints the memory-intensive parts of attention
+    recompute_granularity: selective
+
+    # Recompute modules, multiple choices: ["core_attn", "moe_act", "layernorm", "mla_up_proj", "mlp", "moe"]
+    # Please use modules that exist in the matched model
+    recompute_modules: ["core_attn"]
+
+    # Recompute method, choices: ['uniform', 'block'];
+    # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
+    # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
+    recompute_method: null
+
+    # Number of layers to recompute, used together with recompute_method above
+    recompute_num_layers: null
 
   # oc.select: default val for ref.megatron.use_mbridge
   use_mbridge: False
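To illustrate the 'block' method described in the comments above, a hypothetical stage-wise setting (the layer count is arbitrary):

```yaml
override_transformer_config:
  recompute_granularity: full
  recompute_method: block  # recompute a fixed number of layers in each pipeline stage
  recompute_num_layers: 2  # checkpoint two layers per stage; the rest keep their activations
```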
@@ -31,9 +31,6 @@ model:
   # External model implementation (optional)
   external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
 
-  # Enable gradient checkpointing to save memory
-  enable_gradient_checkpointing: True
-
   # Whether to trust remote code from Hugging Face models
   trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
 
@@ -33,6 +33,9 @@ model:
   # Whether to use shared memory for loading the model
   use_shm: False
 
+  # Enable gradient checkpointing to save memory
+  enable_gradient_checkpointing: True
+
   # Offload activations to CPU to reduce GPU memory usage
   enable_activation_offload: False
 
@@ -55,19 +55,13 @@ model:
 
   # override default empty mapping
   override_config:
 
     model_config: {}
 
+    moe_config:
+
+      freeze_moe_router: False
-
-  # Enable gradient checkpointing to save memory
-  enable_gradient_checkpointing: False
-
-  # Activation Checkpointing settings
-  gradient_checkpointing_kwargs:
-    activations_checkpoint_method: null
-    activations_checkpoint_granularity: null
-    activations_checkpoint_num_layers: null
 
   # megatron-specific parallelism settings
   megatron:
 
@@ -41,20 +41,6 @@ actor_rollout_ref:
 
         freeze_moe_router: False
 
-    enable_gradient_checkpointing: False
-
-    gradient_checkpointing_kwargs:
-
-      ## Activation Checkpointing
-      activations_checkpoint_method: null # 'uniform', 'block'; not used with 'selective'
-
-      # 'uniform' divides the total number of transformer layers and checkpoints the input activation of each chunk
-      # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
-      activations_checkpoint_granularity: null # 'selective' or 'full'
-
-      # 'full' will checkpoint the entire transformer layer and 'selective' only checkpoints memory intensive part of attention
-      activations_checkpoint_num_layers: null # not used with 'selective'
-
     use_fused_kernels: False # Whether to use custom fused kernels (PostProcessing, for memory efficiency)
 
     trust_remote_code: False
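For migration, the removed kwargs map onto the new Megatron-native options by name; an assumed correspondence based on the parallel naming (verify against your Megatron version):

```yaml
# Before (removed from Megatron model configs):
model:
  enable_gradient_checkpointing: False
  gradient_checkpointing_kwargs:
    activations_checkpoint_method: null       # -> recompute_method
    activations_checkpoint_granularity: null  # -> recompute_granularity
    activations_checkpoint_num_layers: null   # -> recompute_num_layers
---
# After (under megatron.override_transformer_config):
megatron:
  override_transformer_config:
    recompute_method: null
    recompute_granularity: selective
    recompute_num_layers: null
    recompute_modules: ["core_attn"]  # new: selective-recompute targets
```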
@@ -52,7 +52,7 @@ megatron:
   seed: ${oc.select:actor_rollout_ref.actor.megatron.seed,42}
 
   # Any overrides to transformer config
-  override_transformer_config: {}
+  override_transformer_config: ${oc.select:actor_rollout_ref.actor.megatron.override_transformer_config,{}}
 
   # Whether to use mbridge for faster comms
   use_mbridge: ${oc.select:actor_rollout_ref.actor.megatron.use_mbridge,False}
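Because of the `oc.select` defaults above, the ref worker and reward model inherit the actor's transformer overrides unless set explicitly. A sketch of diverging the ref worker from the actor (hypothetical values):

```yaml
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: full       # actor: aggressive recomputation
        recompute_method: uniform
        recompute_num_layers: 1
  ref:
    megatron:
      # without this block, the ref worker inherits the actor's settings via oc.select
      override_transformer_config:
        recompute_granularity: selective  # ref: cheaper selective recomputation
        recompute_modules: ["core_attn"]
```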