Mirror of https://github.com/huggingface/trl.git (synced 2025-10-21 02:53:59 +08:00)
Change non_eos_penalty to missing_eos_penalty to be consistent across OnPolicy trainers (#2033)
* Subtract a penalty from OnPolicy Trainers if output does not contain an EOS token
* Caught a few other problems
* Updated the documentation for RLOO trainer and PPOv2Trainer
* Corrected the default type and value for missing_eos_penalty
* Made RLOO Trainer consistent with Online DPO and PPOv2
* Removed --non_eos_penalty from all documentation
* Made missing_eos_penalty examples positive (because we subtract)
* Caught two more incorrect examples
* Removed unnecessary whitespace to make ruff happy
* Update trl/trainer/utils.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
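The substance of the change: on-policy trainers now subtract a user-supplied positive penalty from the reward score whenever a completion contains no EOS token, which is why the documented example values are positive. A minimal sketch of that logic, assuming a `scores` tensor of shape `(batch,)` and `completion_ids` of shape `(batch, seq_len)`; the helper name and shapes here are illustrative, not the trainers' actual internals:

```python
import torch

def apply_missing_eos_penalty(
    scores: torch.Tensor,          # (batch,) reward-model scores
    completion_ids: torch.Tensor,  # (batch, seq_len) generated token ids
    eos_token_id: int,
    missing_eos_penalty: float,    # positive value; it is subtracted
) -> torch.Tensor:
    # Mark completions that never emit the EOS token.
    contains_eos = (completion_ids == eos_token_id).any(dim=-1)
    scores = scores.clone()
    # Subtract the penalty from truncated completions, per the commit message.
    scores[~contains_eos] -= missing_eos_penalty
    return scores
```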
--- a/examples/scripts/rloo/rloo.py
+++ b/examples/scripts/rloo/rloo.py
@@ -24,7 +24,8 @@ python -i examples/scripts/rloo/rloo.py \
     --gradient_accumulation_steps 1 \
     --total_episodes 10000 \
     --model_name_or_path EleutherAI/pythia-1b-deduped \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
+
 accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     examples/scripts/rloo/rloo.py \
     --output_dir models/minimal/rloo \
@@ -40,7 +41,7 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
     --reward_model_path EleutherAI/pythia-1b-deduped \
     --local_rollout_forward_batch_size 1 \
     --deepspeed3 \
-    --non_eos_penalty \
+    --missing_eos_penalty 1.0
 """