9 Commits

Author SHA1 Message Date
a31a8f251f [doc] fix: quickstart example can't work on zsh (#2509)
### What does this PR do?

I followed the instructions at
https://verl.readthedocs.io/en/latest/start/quickstart.html to run the
PPO example on my devbox, which uses zsh. However, I got the error
`zsh: no matches found: trainer.logger=[console]`, because `[]` is
interpreted as a glob pattern in zsh.

```
(verl) ➜  verl git:(20250713-devbox-2-tmux0-verl-2) ✗ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console'] \
 trainer.val_before_train=False \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
zsh: no matches found: trainer.logger=[console]
```

This PR makes 3 changes (see the condensed before/after sketch after this list):
* `trainer.logger=['console']` -> `trainer.logger=console`
* `trainer.logger=['console','wandb']` ->
`trainer.logger='["console","wandb"]'`
* `trainer.logger=['console','tensorboard']` ->
`trainer.logger='["console","tensorboard"]'`

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

* `trainer.logger=console` (zsh)
<img width="898" height="564" alt="image"
src="https://github.com/user-attachments/assets/a957a493-75e6-462b-9974-6b1c4cdf5a80"
/>

* ``trainer.logger='["console","wandb"]'`` (zsh)
<img width="870" height="565" alt="image"
src="https://github.com/user-attachments/assets/e20613bf-2ccc-4653-b23f-90edc3d568d1"
/>

* `trainer.logger=console` (bash)
  ```bash
ubuntu@ip-xxx-xx-x-xxx:~/verl$ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
  >  data.train_files=$HOME/data/gsm8k/train.parquet \
  >  data.val_files=$HOME/data/gsm8k/test.parquet \
  >  data.train_batch_size=256 \
  >  data.max_prompt_length=512 \
  >  data.max_response_length=256 \
  >  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  >  actor_rollout_ref.actor.optim.lr=1e-6 \
  >  actor_rollout_ref.actor.ppo_mini_batch_size=64 \
  >  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
  >  actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
  >  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  >  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  >  actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
  >  critic.optim.lr=1e-5 \
  >  critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  >  critic.ppo_micro_batch_size_per_gpu=4 \
  >  algorithm.kl_ctrl.kl_coef=0.001 \
  >  trainer.logger=console \
  >  trainer.val_before_train=False \
  >  trainer.n_gpus_per_node=1 \
  >  trainer.nnodes=1 \
  >  trainer.save_freq=10 \
  >  trainer.test_freq=10 \
  >  trainer.total_epochs=15 2>&1 | tee verl_demo.log
2025-07-14 02:52:27,669 INFO worker.py:1908 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(TaskRunner pid=1799248) TaskRunner hostname: ip-172-31-9-244, PID: 1799248
(TaskRunner pid=1799248) {'actor_rollout_ref': {'actor': {'checkpoint': {'load_contents': ['model',
(TaskRunner pid=1799248) 'optimizer',
(TaskRunner pid=1799248) 'extra'],
(TaskRunner pid=1799248) 'save_contents': ['model',
(TaskRunner pid=1799248) 'optimizer',
(TaskRunner pid=1799248) 'extra']},
  ```

* `trainer.logger='["console","wandb"]'` (bash)
  ```bash
ubuntu@ip-xxx-xx-x-xxx:~/verl$ PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
  >  data.train_files=$HOME/data/gsm8k/train.parquet \
  >  data.val_files=$HOME/data/gsm8k/test.parquet \
  >  data.train_batch_size=256 \
  >  data.max_prompt_length=512 \
  >  data.max_response_length=256 \
  >  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  >  actor_rollout_ref.actor.optim.lr=1e-6 \
  >  actor_rollout_ref.actor.ppo_mini_batch_size=64 \
  >  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
  >  actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
  >  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  >  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  >  actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
  >  critic.optim.lr=1e-5 \
  >  critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  >  critic.ppo_micro_batch_size_per_gpu=4 \
  >  algorithm.kl_ctrl.kl_coef=0.001 \
  >  trainer.logger='["console","wandb"]' \
  >  trainer.val_before_train=False \
  >  trainer.n_gpus_per_node=1 \
  >  trainer.nnodes=1 \
  >  trainer.save_freq=10 \
  >  trainer.test_freq=10 \
  >  trainer.total_epochs=15 2>&1 | tee verl_demo.log
2025-07-14 02:54:13,989 INFO worker.py:1908 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(TaskRunner pid=1805000) TaskRunner hostname: ip-172-31-9-244, PID: 1805000
(TaskRunner pid=1805000) {'actor_rollout_ref': {'actor': {'checkpoint': {'load_contents': ['model',
(TaskRunner pid=1805000) 'optimizer',
(TaskRunner pid=1805000) 'extra'],
(TaskRunner pid=1805000) 'save_contents': ['model',
(TaskRunner pid=1805000) 'optimizer',
(TaskRunner pid=1805000) 'extra']},
  ```

### API and Usage Example

No

### Design & Code Changes

No

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2025-07-14 13:26:32 +08:00
4081d8af1f refactor example and test scripts to use megatron comm/comp overlap and checkpoint save (#1202)
The example megatron scripts are outdated.
2025-04-23 11:30:30 +08:00
072fc9feed feat: support no reference model; fix KL issues (#644)
### Before getting started

The difference between the in-reward KL penalty and the KL loss:

>  [!TIP]
>
>  1. In-reward KL penalty
>
>
>  $$
>  r_t = r_{\varphi}(q, o_{\leq t}) - \beta\ \boxed{\log \frac{\pi_{\theta}(o_t | q, o_{<t})}{\pi_{\text{ref}}(o_t | q, o_{<t})}}
>  $$
>
>  2. KL Loss
>
>  $$
>  L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left(\text{ratio}_t A_t,\ \text{clip}(\text{ratio}_t, 1 - \epsilon, 1 + \epsilon) A_t\right) \right] - \beta\ \boxed{D_{\text{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}})}
>  $$

### Problems

1. The current code doesn't support running without a reference model

This feature has been half-implemented since the very first commit but was never completed; e.g., `RayPPOTrainer` has a `use_reference_policy` attribute, but it is always True because `role_worker_mapping` always contains `Role.RefPolicy`.

2. Restriction of `use_kl_loss`

Currently, `use_kl_loss` determines whether the in-reward KL penalty or the KL loss is used, so we cannot use **both or neither**.


87a813658f/verl/trainer/ppo/ray_trainer.py (L875-L879)


87a813658f/verl/workers/actor/dp_actor.py (L299-L307)

>  [!CAUTION]  
>
>  ### You may have unintentionally adopted the in-reward KL penalty
>
> For the experiments you've conducted, if you set `actor.use_kl_loss=False` or didn't set it (the default is False), ***you unintentionally used the in-reward KL penalty.*** If you don't want any KL at all, set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (or simply leave them unset, since these are the defaults) after this commit.

3. Deprecated config

After investigation, I guess the Critic may once have been responsible for the in-reward KL penalty, but this feature appears to be broken.

1. At line 290, there may once have been `config.algorithm.kl_ctrl.target_kl` and `config.critic.kl_ctrl.horizon`, which are not supported currently.


3ec83117c3/verl/trainer/ppo/ray_trainer.py (L289-L293)

2. In `verl/workers/critic/megatron_critic.py`: a redundant assignment of `self.kl_ctrl`.


3b18b0eb74/verl/workers/critic/megatron_critic.py (L69-L73)


### What’s Changed?

1. Add support for not using a reference model.
2. Fix the incomplete KL controller code.
3. Add a test case that uses both KL terms.
4. Fix some other misc issues in the code.

### How to disable the reference model

* Set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (they are False by default, so you can simply leave them unset), as shown in the sketch below.
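
A minimal sketch of the relevant overrides (only the KL-related flags are spelled out; the remaining data/model/trainer overrides from the quickstart-style command earlier in this log are assumed):

```bash
# Run without any KL term, and hence without a reference model;
# both flags below are already the defaults, so they can also be omitted.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.use_kl_loss=False \
    algorithm.use_kl_in_reward=False  # plus the usual data/model/trainer overrides

# After this PR the two terms are independent, so both can also be enabled together:
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.use_kl_loss=True \
    algorithm.use_kl_in_reward=True  # plus the usual data/model/trainer overrides
```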
2025-04-01 10:14:38 +08:00
386cfabed2 [misc] feat: make filter long prompt an option (#506)
# Background

In RLHFDataset, we filter out prompts that are too long. This requires applying `apply_chat_template` to the whole dataset, which is not scalable when the dataset is large.
https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132

Instead of performing the filtering online, we probably want to move this process offline and either add an assertion to avoid truncation or simply perform truncation.

Reference: #502 

# Key Changes

- Add an option `data.filter_overlong_prompts=True` to enable the above data filtering. The default value is False, but we enable it for all the example scripts.
- Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed `max_prompt_length`. The default is `error`, which does not allow `max_prompt_length` to be exceeded; users should increase `max_prompt_length` if this error is thrown. You can also set it to `left` or `right`.

### Suggestion for large-scale datasets
For large-scale datasets, filtering overlong prompts could be time-consuming. You should set `data.filter_overlong_prompts=False` and set `data.truncation='left'`. Also, please note that you should increase `data.max_prompt_length` to avoid over-truncation of the prompts. For example:
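
A minimal sketch of that configuration, with the `data.max_prompt_length` value as a placeholder and the remaining overrides omitted:

```bash
# Skip the online filtering and truncate overlong prompts from the left instead;
# raise max_prompt_length so that truncation rarely kicks in.
python3 -m verl.trainer.main_ppo \
    data.filter_overlong_prompts=False \
    data.truncation=left \
    data.max_prompt_length=4096  # placeholder value; plus the usual data/model/trainer overrides
```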
2025-03-07 19:27:25 +08:00
4011f407b0 [Fix] Deprecate val_batch_size (#353)
Validation datasets are sent to the inference engines as a whole batch; the engines schedule the memory themselves.

- [x] Remove `val_batch_size` from examples
- [x] Set default values of `val_batch_size` in configs as `null` and
add DEPRECATED comments
- [x] Add deprecation warnings about `val_batch_size` in
`_validate_config`
2025-02-24 10:24:24 +08:00
9db52329f6 [misc] feat: support offload parameter and optimizer during rollout (#284)
- Fixed FSDP1 model offload
- With `actor_rollout_ref.actor.fsdp_config.param_offload=True` and `actor_rollout_ref.actor.fsdp_config.optimizer_offload=True`, the GPU memory utilization can be increased to 0.9 (see the sketch after this list).
- With actor, critic, and reference offload all enabled, there is only one model copy in GPU memory at a time. Therefore, we can further increase `micro_batch_size_per_gpu` or `max_token_per_gpu`.
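
A minimal sketch of the actor offload overrides named above (the critic and reference workers have analogous offload options, as noted in the bullets above; the remaining overrides are omitted):

```bash
# Offload actor parameters and optimizer state to CPU between phases,
# which frees GPU memory for rollout and allows a higher utilization target.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9  # plus the usual data/model/trainer overrides
```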

**Specifically:**
- During rollout, only rollout model and KVCache are in the GPU memory.
- During critic compute values, only the critic model will stay in the
GPU memory while its optimizer and other model states are in CPU main
memory
- During actor update, the actor model, optimizer are stored on GPU
while the reference model and critic model, critic optimizer are
offloaded to CPU.
2025-02-17 14:07:43 +08:00
f2a76acd94 [BREAKING][misc] feat: change micro_batch_size to micro_batch_size_per_gpu (#136)
## Summary

This PR renames all `micro_batch_size` parameters to `micro_batch_size_per_gpu`.

**The core logic of setting batch sizes:**
- **All algorithmic parameters** (train batch size, PPO mini batch size) are global (from the perspective of the single controller) and are normalized in each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters that represent the data sizes per GPU (i.e., per Worker); see the sketch after this list.
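
As an illustration (a sketch reusing the override names from the quickstart command earlier in this log; remaining overrides omitted):

```bash
# train_batch_size and ppo_mini_batch_size are global values seen by the
# single controller and are normalized across workers; the *_per_gpu micro
# batch size is a local value used as-is on every GPU.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=256 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4  # plus the usual data/model/trainer overrides
```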

## Main Changes

1. Change the scripts and config and delete the normalization for
micro_bsz
2. Fix CI for SFT
2025-01-27 11:26:52 +08:00
5b90cd7dd3 [misc] fix: normalize batch size should divide sp size (#125)
- The actual DP size when using SP is (DP // SP), since each group of SP GPUs holds the same sequences, just different parts of them. For example, with DP = 8 and SP = 2, the effective DP size is 4, so global batch sizes should be normalized by 4 rather than 8.
2025-01-23 18:34:37 +08:00
e8eb9e4eca [misc][Long Context] feat: support ulysses for long context training (#109) 2025-01-18 12:21:41 +08:00