6859e048da
Fix PPO/RLOO examples (#2100)
2024-09-23 11:49:36 +02:00
10c2f63b2a
`training_args` for all `TrainingArguments` (#2082)
2024-09-19 15:03:47 +02:00
40f05226de
Standardizing datasets for testing (#2065)
...
* zen dataset
* Update dataset test bco
* some tests
* Simple chat template
* bco
* xpo
* kto
* gkd
* trainer_args
* sft
* online dpo
* orpo
* zen script
2024-09-14 22:34:15 +02:00
4c92ba5769
©️ Copyrights (#2063)
...
* copyrights
* fail if missing
2024-09-13 14:18:47 +02:00
2ee0b62cdb
Change `non_eos_penalty` to `missing_eos_penalty` to be consistent across OnPolicy trainers (#2033)
...
* Subtract a penalty from OnPolicy Trainers if output does not contain an EOS token
* Caught a few other problems
* Updated the documentation for RLOO trainer and PPOv2Trainer
* Corrected the default type and value for missing_eos_penalty
* Made RLOO Trainer consistent with Online DPO and PPOv2
* Removed --non_eos_penalty from all documentation
* Made missing_eos_penalty examples positive (because we subtract).
* Caught two more incorrect examples
* Removed unnecessary whitespace to make ruff happy
* Update trl/trainer/utils.py
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-10 14:40:23 +02:00
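The `missing_eos_penalty` change above boils down to one operation: if a completion never emits the EOS token, a positive penalty is subtracted from its reward score. A minimal sketch of that idea (the tensor values and the penalty of 1.0 are illustrative assumptions, not TRL's exact code):

```python
import torch

eos_token_id = 2
# Toy batch of completion token ids; the second row never emits EOS.
completions = torch.tensor([[5, 8, 2, 0],
                            [7, 9, 4, 3]])
scores = torch.tensor([1.3, 0.9])  # one scalar reward per completion

# missing_eos_penalty is positive because it is subtracted from the score.
missing_eos_penalty = 1.0
contain_eos = (completions == eos_token_id).any(dim=-1)
scores[~contain_eos] -= missing_eos_penalty
print(scores)  # tensor([ 1.3000, -0.1000])
```

Keeping the penalty positive and subtracting it is what the "Made missing_eos_penalty examples positive" bullet above refers to.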
f05f63c1ea
`PartialState().local_main_process_first()` when map in examples (#1926)
...
* `PartialState().local_main_process_first()` when map in examples
* allow load from cache
---------
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-14 12:01:03 +02:00
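The `PartialState().local_main_process_first()` pattern referenced above lets the local main process run `Dataset.map` first and write the cache, while the remaining processes on the node wait and then load from that cache instead of recomputing. A hedged sketch with a placeholder tokenizer and dataset:

```python
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
dataset = load_dataset("imdb", split="train[:1%]")       # placeholder dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

# Local main process maps first; the other local processes reuse its cache.
with PartialState().local_main_process_first():
    dataset = dataset.map(tokenize, batched=True, load_from_cache_file=True)
```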
54f806b6ff
Standardize `dataset_num_proc` usage (#1925)
...
* uniform dataset_num_proc
* num_proc in shuffle
* Update examples/datasets/anthropic_hh.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update examples/scripts/ppo.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update examples/scripts/ppo.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
---------
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-13 15:10:39 +02:00
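The `dataset_num_proc` standardization above is about threading one configurable value into every `Dataset.map` call instead of hard-coding `num_proc` per script. A rough sketch of that wiring (the dataclass and field here are an illustrative stand-in for the scripts' argument classes):

```python
from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset

@dataclass
class ScriptArguments:
    # Single knob reused by every dataset preprocessing call.
    dataset_num_proc: Optional[int] = None

args = ScriptArguments(dataset_num_proc=4)
dataset = load_dataset("imdb", split="train[:1%]")  # placeholder dataset

def add_length(example):
    return {"length": len(example["text"])}

# The same value is forwarded wherever the dataset is processed.
dataset = dataset.map(add_length, num_proc=args.dataset_num_proc)
```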
7ddef5c158
Make use of `trust_remote_code` consistent (#1806)
...
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-10 18:26:11 +02:00
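Making `trust_remote_code` consistent, as in the entry above, means forwarding a single flag to every `from_pretrained` call rather than enabling it ad hoc. A small sketch with a placeholder model and a hypothetical config dataclass:

```python
from dataclasses import dataclass

from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class ModelArguments:
    model_name_or_path: str = "gpt2"  # placeholder model
    trust_remote_code: bool = False   # set once (e.g. via CLI), forwarded everywhere

model_args = ModelArguments()

# The same flag reaches both tokenizer and model loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
)
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
)
```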
34d273f227
Support `num_train_epochs` (#1743)
...
* add a test case for num_train_epochs
* fix ci
* quick change
* disable push to hub
* debug windows ci
* try another fix
* skip subprocess tests on windows
2024-06-20 13:16:43 -04:00
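The `num_train_epochs` entry above adds epoch-based runs alongside step-based ones. Assuming the trainer config builds on `transformers.TrainingArguments`, as TRL's configs generally do, the flag is the standard one; a minimal illustration:

```python
from transformers import TrainingArguments

# Train for a fixed number of epochs instead of specifying max_steps.
args = TrainingArguments(output_dir="out", num_train_epochs=3)
assert args.max_steps == -1  # -1 means steps are derived from epochs and dataset size
```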
e7cb597230
Fix ppov2 test case (#1661)
...
* Fix PPOv2 / RLOO refactor's stuff
* update terminology to use stop token
2024-05-23 11:37:16 -04:00
13454d2f4b
PPO / Reinforce Trainers (#1540)
...
* Add ppov2 trainer
* make eos trick optional, remove unused args
* quick fix
* precommit
* update debugging script
* fix out of bound `drop_last=True`; use built-in scheduler
* Add PPO examples
* push changes
* quick change
* quick change
* various bug fixes
* remove unnecessary grad accumulation setting
* push new changes
* fix DS3 model saving
* update ppo.py
* refactor
* quick change
* refactor
* update ppo trainer
* refactor
* quick test
* add ds2/ds3 7 processes config
* add vllm trainer
* quick change
* experiment with reward normalization
* push changes
* quick push
* push changes
* push various changes
* refactor to use ModelConfig
* quick change
* refactor
* refactor
* Simplify DS logic
* quick update
* remove unnecessary files
* precommit
* deepspeed fix; handle edge case when eos_token_id = 0
* add PPO tldr example
* add TL;DR example
* fix undefined var
* utilize all samples in rloo
* quick setting
* remove the unnecessary `value_model`
* use exact_div
* allow saving the deepspeed model
* refactor
* remove dead code
* Use some shared utilities
* add some end-to-end test cases
* add PPOv2 docs and RLOO docs / tests
* update docs
* quick push
* fix ci
* fix type annotation for ci
* quick update
* update trainer docs
2024-05-22 08:31:10 -04:00
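Among the bullets above, "utilize all samples in rloo" refers to the REINFORCE leave-one-out (RLOO) baseline: with k completions sampled per prompt, each completion's baseline is the mean reward of the other k - 1 completions, so every sample serves both as a policy sample and as part of the baselines. An illustrative sketch of that advantage computation (shapes and variable names are assumptions):

```python
import torch

k = 4               # completions sampled per prompt
num_prompts = 2
rewards = torch.randn(num_prompts * k)   # one scalar reward per completion

# Column j holds the k rewards belonging to prompt j.
rewards = rewards.reshape(k, num_prompts)

# Leave-one-out baseline: mean reward of the other k - 1 completions for the same prompt.
baseline = (rewards.sum(dim=0, keepdim=True) - rewards) / (k - 1)
advantages = rewards - baseline          # fed into the REINFORCE policy-gradient loss
print(advantages.shape)  # torch.Size([4, 2])
```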