Change non_eos_penalty to missing_eos_penalty to be consistent across OnPolicy trainers (#2033)

* Subtract a penalty from the reward score in OnPolicy trainers if the output does not contain an EOS token (see the sketch after this list)

* Caught a few other problems

* Updated the documentation for RLOO trainer and PPOv2Trainer

* Corrected the default type and value for missing_eos_penalty

* Made RLOO Trainer consistent with Online DPO and PPOv2

* Removed --non_eos_penalty from all documentation

* Made missing_eos_penalty examples positive (because we subtract).

* Caught two more incorrect examples

* Removed unnecessary whitespace to make ruff happy

* Update trl/trainer/utils.py
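For reference, the applied logic amounts to subtracting the penalty from the reward-model score of any completion that never emits EOS. A minimal sketch of that logic; the helper name and signature below are illustrative, not TRL's exact internals:

```python
import torch


def penalize_missing_eos(scores, completion_ids, eos_token_id, missing_eos_penalty=1.0):
    """Subtract `missing_eos_penalty` from the score of every completion
    that never produced the EOS token. Hypothetical helper; the trainers
    apply the same masking inline on reward-model scores."""
    contains_eos = (completion_ids == eos_token_id).any(dim=-1)  # shape: (batch,)
    scores = scores.clone()
    # The penalty is subtracted, so a *positive* configured value lowers the score.
    scores[~contains_eos] -= missing_eos_penalty
    return scores
```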

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Author: Rylan Schaeffer
Date: 2024-09-10 08:40:23 -04:00 (committed by GitHub)
parent ac071d6225
commit 2ee0b62cdb
12 changed files with 37 additions and 38 deletions


@@ -24,7 +24,8 @@ python -i examples/scripts/rloo/rloo.py \
     --gradient_accumulation_steps 1 \
     --total_episodes 10000 \
     --model_name_or_path EleutherAI/pythia-1b-deduped \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
 accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     examples/scripts/rloo/rloo.py \
     --output_dir models/minimal/rloo \
@@ -40,7 +41,7 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
     --reward_model_path EleutherAI/pythia-1b-deduped \
     --local_rollout_forward_batch_size 1 \
     --deepspeed3 \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
 """