Change non_eos_penalty to missing_eos_penalty to be consistent across OnPolicy trainers (#2033)

* Subtract a penalty from the reward score in OnPolicy trainers if the output does not contain an EOS token (see the sketch after this list)

* Caught a few other problems

* Updated the documentation for RLOO trainer and PPOv2Trainer

* Corrected the default type and value for missing_eos_penalty

* Made RLOO Trainer consistent with Online DPO and PPOv2

* Removed --non_eos_penalty from all documentation

* Made missing_eos_penalty examples positive (because we subtract).

* Caught two more incorrect examples

* Removed unnecessary whitespace to make ruff happy

* Update trl/trainer/utils.py
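For reference, the applied logic amounts to subtracting the penalty from the reward-model score of any completion that never emits EOS. A minimal sketch of that logic; the helper name and signature below are illustrative, not TRL's exact internals:

```python
import torch


def penalize_missing_eos(scores, completion_ids, eos_token_id, missing_eos_penalty=1.0):
    """Subtract `missing_eos_penalty` from the score of every completion
    that never produced the EOS token. Hypothetical helper; the trainers
    apply the same masking inline on reward-model scores."""
    contains_eos = (completion_ids == eos_token_id).any(dim=-1)  # shape: (batch,)
    scores = scores.clone()
    # The penalty is subtracted, so a *positive* configured value lowers the score.
    scores[~contains_eos] -= missing_eos_penalty
    return scores
```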

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Author: Rylan Schaeffer
Date: 2024-09-10 08:40:23 -04:00 (committed by GitHub)
parent ac071d6225
commit 2ee0b62cdb
12 changed files with 37 additions and 38 deletions


@@ -24,7 +24,8 @@ python -i examples/scripts/rloo/rloo.py \
     --gradient_accumulation_steps 1 \
     --total_episodes 10000 \
     --model_name_or_path EleutherAI/pythia-1b-deduped \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
 accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     examples/scripts/rloo/rloo.py \
     --output_dir models/minimal/rloo \
@@ -40,7 +41,7 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
     --reward_model_path EleutherAI/pythia-1b-deduped \
     --local_rollout_forward_batch_size 1 \
     --deepspeed3 \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
 """