### Before getting started
The difference between the in-reward KL penalty and the KL loss:
> [!TIP]
>
> 1. In-reward KL penalty
>
>
> $$
> r_t = r_{\varphi}(q, o_{\leq t}) - \beta\ \boxed{\log \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\text{ref}}(o_t \mid q, o_{<t})}}
> $$
>
> 2. KL Loss
>
> $$
> L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left(\text{ratio}_t A_t,\ \text{clip}(\text{ratio}_t, 1 - \epsilon, 1 + \epsilon) A_t\right) \right] - \beta\ \boxed{D_{\text{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}})}
> $$
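To make the difference concrete, here is a minimal PyTorch sketch (not verl's actual implementation) of where each term enters: the in-reward penalty modifies the per-token reward before advantages are estimated, whereas the KL loss is added directly to the actor loss. The tensor shapes, the clipped surrogate, and the simple `logprob - ref_logprob` KL estimator are assumptions for illustration.

```python
import torch

def in_reward_kl_penalty(token_rewards, logprob, ref_logprob, beta):
    # 1. In-reward KL penalty: subtract beta * log(pi_theta / pi_ref) from the
    #    per-token reward *before* advantages are computed.
    #    Shapes are assumed to be [batch, response_len].
    return token_rewards - beta * (logprob - ref_logprob)

def ppo_loss_with_kl_loss(logprob, old_logprob, ref_logprob, advantages,
                          clip_ratio=0.2, beta=1e-3):
    # 2. KL loss: leave the reward untouched and add beta * KL(pi_theta || pi_ref)
    #    directly to the actor loss (the surrogate is negated because we minimize).
    ratio = torch.exp(logprob - old_logprob)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_loss = (logprob - ref_logprob).mean()  # simple k1 estimator of the KL
    return pg_loss + beta * kl_loss
```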
### Problems
1. The current code does not support training without a reference model
This feature has been half-implemented since the very first commit but was never completed; e.g., `RayPPOTrainer` has a `use_reference_policy` attribute, but it is always True because `role_worker_mapping` always contains `Role.RefPolicy`.
2. Restriction of `use_kl_loss`
Currently, `use_kl_loss` determines whether to use the in-reward KL penalty or the KL loss, so we cannot use **both or neither**.
87a813658f/verl/trainer/ppo/ray_trainer.py (L875-L879)
87a813658f/verl/workers/actor/dp_actor.py (L299-L307)
> [!CAUTION]
>
> ### You may have unintentionally adopted in-reward KL penalty
>
> For the experiments you have conducted so far: if you set `actor_rollout_ref.actor.use_kl_loss=False` or did not set it (the default is False), ***you unintentionally used the in-reward KL penalty.*** After this commit, if you do not want any KL term, set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (or simply leave them unset, since these are the default config values).
3. Deprecated config
After investigation, I suspect the Critic used to be responsible for the in-reward KL, but this feature appears to be broken.
1. At line 290, there probably used to be `config.algorithm.kl_ctrl.target_kl` and `config.critic.kl_ctrl.horizon`, which are not supported currently.
3ec83117c3/verl/trainer/ppo/ray_trainer.py (L289-L293)
2. In `verl/workers/critic/megatron_critic.py`: `self.kl_ctrl` is set redundantly.
3b18b0eb74/verl/workers/critic/megatron_critic.py (L69-L73)
### What’s Changed?
1. Add support for training without a reference model
2. Fix the incomplete code of the KL controller
3. Add a test case that uses both KL terms
4. Fix some other miscellaneous issues in the code
### How to disable the reference model
* Set `actor_rollout_ref.actor.use_kl_loss=False` and `algorithm.use_kl_in_reward=False` (both are False by default, so you can simply leave them unset); see the sketch below.
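As a rough sketch of the intended behavior (not verl's exact code), the trainer only needs a reference policy when at least one of the two KL terms is enabled; the helper name below is hypothetical.

```python
def needs_reference_policy(use_kl_in_reward: bool, use_kl_loss: bool) -> bool:
    # The reference model is only used to compute log pi_ref for a KL term,
    # so it can be skipped entirely when both flags are False.
    return use_kl_in_reward or use_kl_loss

# With the default config (both flags False) there is no KL term,
# and no reference model worker needs to be created.
assert needs_reference_policy(False, False) is False
```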
# Background
In `RLHFDataset`, we filter out prompts that are too long. This requires applying `apply_chat_template` to the whole dataset, which is not scalable when the dataset is large.
https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132
Instead of performing this filtering online, we probably want to move the process offline and either add an assertion to avoid truncation or simply perform truncation.
Reference: #502
# Key Changes
- Add an option `data.filter_overlong_prompts=True` to enable the above data filtering. The default value is False, but we enable it in all the example scripts.
- Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed `max_prompt_length`. The default is `error`, which does not allow `max_prompt_length` to be exceeded; users should increase `max_prompt_length` if this error is thrown. You can also set it to `left` or `right` (see the sketch below).
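A rough sketch of what these two options do, assuming a chat-format dataset and a tokenizer with `apply_chat_template`; the model name and data are placeholders, and this is not verl's `RLHFDataset` code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
max_prompt_length = 1024

dataset = [
    {"prompt": [{"role": "user", "content": "Solve 1 + 1."}]},
    # ... potentially millions of rows ...
]

def tokenize_prompt(messages):
    # apply_chat_template with tokenize=True (the default) returns token ids.
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# data.filter_overlong_prompts=True: tokenize every prompt up front and drop
# the overlong ones -- accurate, but expensive on very large datasets.
filtered = [ex for ex in dataset if len(tokenize_prompt(ex["prompt"])) <= max_prompt_length]

# data.truncation: what to do when a prompt still exceeds the budget.
def enforce_length(input_ids, truncation="error"):
    if len(input_ids) <= max_prompt_length:
        return input_ids
    if truncation == "error":
        raise ValueError("Prompt exceeds max_prompt_length; increase data.max_prompt_length.")
    return input_ids[-max_prompt_length:] if truncation == "left" else input_ids[:max_prompt_length]
```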
### Suggestion for large-scale datasets
For large-scale datasets, filtering overlong prompts can be time-consuming. You should set `data.filter_overlong_prompts=False` and set `data.truncation='left'`. Also, please note that you should increase `data.max_prompt_length` to avoid over-truncating the prompts.
Validation datasets are sent to the inference engines as a whole batch, and the engines schedule the memory themselves.
- [x] Remove `val_batch_size` from examples
- [x] Set default values of `val_batch_size` in configs as `null` and
add DEPRECATED comments
- [x] Add a deprecation warning about `val_batch_size` in `_validate_config` (sketched below)
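A rough sketch of such a warning; the function name, message, and config access style are assumptions, not the actual code in `_validate_config`.

```python
import warnings

def warn_deprecated_val_batch_size(data_config: dict) -> None:
    # val_batch_size is deprecated: validation data is sent to the inference
    # engine as one whole batch, so this setting no longer has any effect.
    if data_config.get("val_batch_size") is not None:
        warnings.warn(
            "data.val_batch_size is deprecated and ignored; validation datasets "
            "are sent to the inference engines as a whole batch.",
            DeprecationWarning,
        )
```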
- Fixed FSDP1 model offload
- With `actor_rollout_ref.actor.fsdp_config.param_offload=True` and `actor_rollout_ref.actor.fsdp_config.optimizer_offload=True`, the GPU memory utilization can be increased to 0.9.
- With actor, critic, and reference offload all enabled, there is only one model copy in GPU memory at a time. Therefore, we can further increase `micro_batch_size_per_gpu` or `max_token_per_gpu`.
**Specifically:**
- During rollout, only the rollout model and the KVCache are in GPU memory.
- While the critic computes values, only the critic model stays in GPU memory; its optimizer and other model states remain in CPU main memory.
- During the actor update, the actor model and its optimizer are stored on the GPU, while the reference model, the critic model, and the critic optimizer are offloaded to CPU (see the sketch below).
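Below is a minimal sketch of the offload pattern described above. The helper names are hypothetical and operate on a plain module; verl's actual FSDP offload utilities are more involved.

```python
import torch
import torch.nn as nn

def offload_model_to_cpu(model: nn.Module) -> None:
    # Move parameters to CPU between stages so the GPU copy is freed.
    for param in model.parameters():
        param.data = param.data.to("cpu", non_blocking=True)
    torch.cuda.empty_cache()

def load_model_to_gpu(model: nn.Module, device: str = "cuda") -> None:
    # Bring the parameters back right before this worker's stage runs.
    for param in model.parameters():
        param.data = param.data.to(device, non_blocking=True)

# Usage around the actor update, for example:
#   load_model_to_gpu(actor_module)      # actor weights on GPU for the update
#   ...run the PPO update...
#   offload_model_to_cpu(actor_module)   # free GPU memory for rollout / critic
```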
## Summary
This PR renames every `micro_batch_size` option to `micro_batch_size_per_gpu`.
**The core logic of setting batch sizes** (see the sketch after this list):
- **All algorithmic metrics** (train batch size, PPO mini batch size) are global (from the perspective of the single controller) and are normalized inside each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters that represent the data size per GPU (i.e., per Worker).
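A minimal sketch of that convention; the numbers and the `world_size` value are illustrative, not verl defaults.

```python
# Global, algorithm-level sizes: specified once from the single controller.
train_batch_size = 1024
ppo_mini_batch_size = 256

# Local, performance-only knob: already per GPU, so it is never normalized.
ppo_micro_batch_size_per_gpu = 4

world_size = 8  # number of GPUs, i.e. Workers

# Each Worker normalizes the global mini batch size by the world size...
ppo_mini_batch_size_per_worker = ppo_mini_batch_size // world_size  # 32

# ...and derives its gradient-accumulation steps from the per-GPU micro batch size.
grad_accum_steps = ppo_mini_batch_size_per_worker // ppo_micro_batch_size_per_gpu  # 8
```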
## Main Changes
1. Change the scripts and configs and delete the normalization of the micro batch size
2. Fix the CI for SFT