verl/recipe/prime/config/prime_trainer.yaml
yangbaoxing 7f27789961 [fsdp,doc] refactor: rename warmup_style@FSDPOptimizerConfig -> lr_scheduler_type (#3739)
### What does this PR do?

> Rename `warmup_style` in `FSDPOptimizerConfig` to `lr_scheduler_type` to
align with the Hugging Face Trainer API.

The optimizer was already refactored in the following pull request, but
the naming issue persisted:
https://github.com/volcengine/verl/pull/3656
### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
    `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
    `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
    `env`, `tool`, `ckpt`, `doc`, `data`
    - If this PR involves multiple modules, separate them with `,` like
      `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```
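As a rough illustration (not taken from the PR itself), an FSDP optimizer
section is now written with `lr_scheduler_type` instead of `warmup_style`.
The keys below mirror the `optim` block of `prime_trainer.yaml` further
down; the OmegaConf wrapper is only there to make the snippet runnable.

```python
from omegaconf import OmegaConf

# Illustrative sketch: an FSDP optimizer section using the new key name.
# Only the warmup_style -> lr_scheduler_type rename comes from this PR;
# the surrounding keys mirror reward_model.model.optim in prime_trainer.yaml.
optim_cfg = OmegaConf.create(
    {
        "lr": 1e-6,
        "lr_warmup_steps": -1,
        "lr_warmup_steps_ratio": 0.0,
        "min_lr_ratio": None,
        "warmup_style": None,             # deprecated; previously e.g. "constant"
        "lr_scheduler_type": "constant",  # new name, aligned with the HF Trainer API
        "total_training_steps": -1,
    }
)
print(OmegaConf.to_yaml(optim_cfg))
```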

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
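The PR body leaves this section at the template text. As a hedged sketch of
the kind of change involved (not the actual diff; the defaults, extra fields,
and alias handling below are assumptions), the rename in `FSDPOptimizerConfig`
amounts to introducing `lr_scheduler_type` while keeping `warmup_style` as a
deprecated field, consistent with the `warmup_style: null # deprecated` entry
in the config below.

```python
from dataclasses import dataclass
from typing import Optional
import warnings


@dataclass
class FSDPOptimizerConfig:
    """Hypothetical sketch; only the rename itself is taken from the PR."""

    lr: float = 1e-6
    lr_warmup_steps: int = -1
    lr_warmup_steps_ratio: float = 0.0
    min_lr_ratio: Optional[float] = None
    lr_scheduler_type: str = "constant"   # renamed from `warmup_style`
    warmup_style: Optional[str] = None    # deprecated alias kept for old configs
    total_training_steps: int = -1

    def __post_init__(self):
        # Assumed backward-compatibility behavior: forward the old key if set.
        if self.warmup_style is not None:
            warnings.warn(
                "`warmup_style` is deprecated; use `lr_scheduler_type` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            self.lr_scheduler_type = self.warmup_style
```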

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: weiqi.li <weiqi.li@bytedance.com>
2025-10-13 15:58:59 +08:00


# the prime config will override default ppo_trainer.yaml

hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

data:
  filter_accuracy: True
  accuracy_lower_bound: 0.2
  accuracy_upper_bound: 0.8
  oversample_factor: 4.0 # Sample more responses than the batch size. prompts satisfying the filter will be prioritized.
  filter_truncate: True
  truncation: right

actor_rollout_ref:
  hybrid_engine: True
  model:
    use_remove_padding: True
  rollout:
    # number of responses (i.e. num sample times)
    n: 4
  actor:
    entropy_coeff: 0.001

reward_model:
  enable: True
  strategy: fsdp
  model:
    ref_path: ${reward_model.model.path}
    use_remove_padding: True
    use_fused_kernels: ${actor_rollout_ref.model.use_fused_kernels}
    fused_kernel_options:
      impl_backend: torch # triton, torch
    tokenizer_path: ${actor_rollout_ref.model.path}
    enable_gradient_checkpointing: ${actor_rollout_ref.model.enable_gradient_checkpointing}
    ref_type: freeze
    fsdp_config:
      min_num_params: 0
      param_offload: ${actor_rollout_ref.actor.fsdp_config.param_offload}
      optimizer_offload: ${actor_rollout_ref.actor.fsdp_config.optimizer_offload}
    update: before # ``before`` for double-forward, ``after`` for single-forward
    optim:
      lr: 1e-6
      lr_warmup_steps: -1 # Prioritized. Negative values mean delegating to lr_warmup_steps_ratio.
      lr_warmup_steps_ratio: 0. # the total steps will be injected during runtime
      min_lr_ratio: null
      warmup_style: null # deprecated
      lr_scheduler_type: constant
      total_training_steps: -1 # must be overridden by program
      weight_decay: 0.
      grad_clip: 10.0
    beta_train: 0.05
    loss_type: ce # currently only supports ce loss
    prime_granularity: token
    prime_norm: batch_norm # batch_norm or none. if set to none, the normalizer is beta_train
  mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size}
  reward_manager: prime

algorithm:
  adv_estimator: rloo
  # now supports rloo. it treats different source of reward separately.
  kl_ctrl:
    type: fixed
    kl_coef: 0.000
  reward_gt_coef: 5
  reward_dpo_coef: 5

trainer:
  project_name: prime
  experiment_name: examples
  val_before_train: False
  balance_batch: False