Compare commits

...

386 Commits

Author SHA1 Message Date
7a95cc8696 release: v0.7.9 (#1206) 2024-01-09 13:02:31 +01:00
d1715514de Revert "Address issue #1122 (#1174)" (#1205)
This reverts commit d57d0f9ca46a63d370b91791352edda0154576f5.
2024-01-09 10:20:50 +01:00
d116887ed4 [DPOTrainer] Fix peft + DPO + bf16 if one uses generate_during_eval or pre-computed logits (#1203)
* fix peft + DPO + bf16

* fix

* revert old behaviour

* fix tests

* fix

* fix

* fix

* fix
2024-01-09 09:35:50 +01:00
a236c5750f Fix reported KL in PPO trainer (#1180)
* Fix reported KL in PPO trainer

Previously this was always reporting the estimated KL, even when using `kl_penalty = 'full'` (or `abs`, etc.).
Now we return the actual KL calculated in `compute_rewards()` and report that.

* fix test
2024-01-09 06:48:25 +01:00
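
For context on #1180 above, the difference between the reported estimators can be sketched as follows. This is a minimal illustration rather than the trainer's exact code; the tensor arguments are hypothetical stand-ins for per-token log-probs and full vocabulary logits.

```python
import torch.nn.functional as F

def kl_penalty(logprob, ref_logprob, full_logits=None, full_ref_logits=None, kind="kl"):
    """Illustrative KL penalty variants; only the sampled-token estimator needs logprobs."""
    if kind == "kl":    # estimator: log pi(x) - log pi_ref(x) at the sampled token
        return logprob - ref_logprob
    if kind == "abs":   # absolute value of that estimator
        return (logprob - ref_logprob).abs()
    if kind == "full":  # exact KL(pi || pi_ref) over the whole vocabulary
        return F.kl_div(
            F.log_softmax(full_ref_logits, dim=-1),   # input: reference log-probs
            F.log_softmax(full_logits, dim=-1),       # target: policy log-probs
            log_target=True,
            reduction="none",
        ).sum(-1)
    raise ValueError(f"unknown kl_penalty kind: {kind}")
```

With the fix, whichever variant is actually used inside `compute_rewards()` is also the one that ends up in the logged KL metric.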
4ae35afdd6 Fix instruction token masking (#1185)
* Fix instruction token masking

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.

* Bugfix for edge case

(in case either of the templates isn't found at all, ...idxs[0] might not exist)

* Add test for instruction masking fix
2024-01-09 06:41:53 +01:00
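
The intent of the masking that #1185 repairs can be illustrated with a small standalone sketch (this is not the collator's implementation; the span indices are hypothetical):

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_non_response_tokens(labels, response_spans):
    """Keep loss only on response tokens; mask instructions and anything
    before the first detected response with IGNORE_INDEX."""
    masked = [IGNORE_INDEX] * len(labels)
    for start, end in response_spans:  # [start, end) token index pairs
        masked[start:end] = labels[start:end]
    return masked

labels = [11, 12, 13, 14, 15, 16, 17]
print(mask_non_response_tokens(labels, response_spans=[(3, 6)]))
# [-100, -100, -100, 14, 15, 16, -100]
```

The edge case fixed here is when the first instruction is tokenized differently from the others (or no instruction precedes the first response), so the first span boundary was computed incorrectly.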
b21ed0ddbc set dev version (#1201) 2024-01-09 05:19:10 +01:00
384b868fe6 Release: v0.7.8 (#1200) 2024-01-09 05:13:26 +01:00
3267be0fcd Allow swapping PEFT adapters for target/ref model. (#1193)
* Allow swapping PEFT adapters for target/ref model.

* Update DPOTrainer docs.

* python format

* isort

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-01-08 16:12:45 +01:00
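
A hedged usage sketch of what #1193 enables: a single PEFT model carries two adapters and the trainer switches between them instead of holding a separate reference model. The keyword names follow the PR/docs for this feature, and the surrounding objects (`model`, `training_args`, `dataset`, `tokenizer`) are assumed to be set up as in the usual DPO examples.

```python
from trl import DPOTrainer

# `model` is a PEFT model that already has two adapters loaded,
# e.g. "train" (to be optimized) and "reference" (kept frozen).
trainer = DPOTrainer(
    model,
    ref_model=None,                  # no separate reference copy in memory
    args=training_args,
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    model_adapter_name="train",      # adapter updated by DPO
    ref_adapter_name="reference",    # adapter used for reference log-probs
)
trainer.train()
```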
dbcb2f0021 Allow separate devices for target/ref models. (#1190)
* Allow separate devices for target/ref models.

* Remove original/duplicate.

* Cleanup original, black formatting.

---------

Co-authored-by: Jon Durbin <jonathan@convai.com>
2024-01-08 10:26:40 +01:00
d5910b0ff5 Handle last token from generation prompt (#1153)
* Handle last token from generation prompt

* Remove prints

* Reformat dpo_trainer file
2024-01-08 09:15:53 +01:00
104a02d207 SFTTrainer: follow args.remove_unused_columns (#1188) 2024-01-08 06:09:10 +01:00
ad597dbcb3 Fix misleading variable "epoch" from the training loop from PPOTrainer Doc. (#1171)
* Fix misleading variable "epoch" from PPOTrainer Doc. 

The usage of the variable “epoch” is misleading in the original doc: the dataloader does not contain the data for ALL epochs, but for a single one, so
`for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader))`
is misleading and does not actually store the epoch number.

The correct version comes from the TRL PPO notebook tutorial 
(https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb), which uses an outer loop to capture the epochs.

I posted also the question on forum: https://discuss.huggingface.co/t/confusing-and-possibly-misleading-ppo-trainer-code-from-trl-api-doc-tutorial/67531

* Remove batch_id
2024-01-08 05:50:00 +01:00
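
The corrected pattern described in #1171 looks roughly like the following; it assumes a `ppo_trainer`, `tokenizer`, `generation_kwargs`, and a user-defined reward computation already exist, as in the referenced notebook.

```python
from tqdm import tqdm

num_epochs = 4  # hypothetical; the dataloader covers one pass over the data

for epoch in range(num_epochs):
    for batch in tqdm(ppo_trainer.dataloader):
        query_tensors = batch["input_ids"]
        response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
        batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
        rewards = compute_rewards(batch)  # placeholder for the task-specific reward model call
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
```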
d57d0f9ca4 Address issue #1122 (#1174)
* Address issue #1122

    Addresses issue [#1122](https://github.com/huggingface/trl/issues/1122),
    an inconsistency between `_prepare_packed_dataloader`
    and `_prepare_non_packed_dataloader`

* made attention_mask field in ConstantLengthDataset a tensor
2024-01-08 05:43:34 +01:00
ec3d41b879 Fix batch all gather (#1177)
* Fix batch all gather

* quick fix
2024-01-04 17:41:52 +01:00
be32d304db Update sft_trainer.py (#1162)
Fix spelling mistakes in argument description for trl -> SFT Trainer
2024-01-04 16:33:53 +01:00
dc53b8c6b0 Correct shape (#1170) 2024-01-04 16:27:39 +01:00
20428c48ba add: support for peft in ddpo. (#1165)
* add: support for peft in ddpo.

* revert to the original modeling_base.

* style

* specify weight_name

* explicitly specify weight_name

* fix: parameter parsing

* fix: trainable_layers.

* parameterize use_lora.

* fix one more trainable_layers

* debug

* debug

* more fixes.

* manually set unet of sd_pipeline

* make trainable_layers cleaner.

* more fixes

* remove prints.

* tester class for LoRA too.
2024-01-02 12:52:36 +01:00
6614b8aa6b Minor fixes to some comments in some examples. (#1156) 2023-12-29 14:12:05 +01:00
df7b770da8 change device order of metrics (#1154) 2023-12-29 10:55:58 +01:00
18a33ffcd3 SFT Tokenizer Fix (#1142) 2023-12-27 10:25:56 +01:00
911d3658e2 [xxxTrainer] Add unsloth tag (#1130)
* add unsloth tag

* add it on all trainers

* few changes

* add in docs

* revert

* final commit
2023-12-26 16:39:10 +01:00
95ec8577df add peft_module_casting_to_bf16 in DPOTrainer (#1143)
* add peft_module_casting_to_bf16 in DPOTrainer

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update trl/trainer/dpo_trainer.py

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2023-12-26 11:25:53 +01:00
3539f3e3cd set dev version (#1145) 2023-12-26 10:26:15 +01:00
e451298b50 Release: v0.7.7 (#1144) 2023-12-26 10:24:47 +01:00
3efb484694 [PPOTrainer / DDPOTrainer] Fix ppo & ddpo push to Hub (#1141)
* fix ppo push to Hub

* fix also ddpo

* more tags
2023-12-26 10:06:20 +01:00
8f5b4923c8 reformatted (#1128) 2023-12-23 10:16:27 +01:00
e0dec27272 reformatted (#1129) 2023-12-23 10:13:38 +01:00
6ef785a6fb Add type hints to core.py (#1097)
* Add type hinting to core.py functions

* Fixes

* Remove unused functions

* Remove unused import
2023-12-22 17:05:20 +01:00
950ee2187d clear up the parameters of supervised_finetuning.py (#1126)
no_gradient_checkpointing is always false

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2023-12-22 17:00:28 +01:00
c1bb1f39f6 set dev version (#1135) 2023-12-22 15:09:37 +01:00
54babd9508 Release: v0.7.6 (#1134) 2023-12-22 15:03:24 +01:00
0c4edb750e [xxxTrainer] multi-tags support for tagging (#1133)
* multi-tags support for tagging

* oops
2023-12-22 14:52:16 +01:00
17ec68d980 set dev version (#1132) 2023-12-22 14:12:24 +01:00
9be5680039 Release: v0.7.5 (#1131) 2023-12-22 14:01:44 +01:00
f11e213fd8 [Docs] Add unsloth optimizations in TRL's documentation (#1119)
* add unsloth

* Update sft_trainer.mdx (#1124)

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2023-12-22 13:45:26 +01:00
814fe396d4 rename kto loss (#1127) 2023-12-22 13:32:16 +01:00
06b7959b72 save eval_dataset for subsequent calls (#1125) 2023-12-21 17:28:56 +01:00
b07935f867 [xxxTrainer] Add tags to all trainers in TRL (#1120)
* add tags to sfttrainer

* extend it to other trainers

* add for ddpo
2023-12-21 17:04:18 +01:00
2aff709144 Update description in setup.py (#1101) 2023-12-21 15:35:12 +01:00
830cadfc4c fix gradient checkpointing when using PEFT (#1118) 2023-12-20 13:35:56 +01:00
f2acd821e0 Make prepending of bos token configurable. (#1114)
* make prepending of bos token configurable.

* address comments

* fix bug

Co-Authored-By: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-12-20 11:28:50 +01:00
f100ca34cc peft_module_casting_to_bf16 util method, append_concat_token flag, remove callback PeftSavingCallback (#1110)
* SFT Trainer enhancements

* remove the callback `PeftSavingCallback`

* bump the version of transformers to `4.31.0`

* remove `PeftSavingCallback` from all places.
2023-12-19 17:43:25 +01:00
d708ec272f [Feature] Add Ascend NPU accelerator support (#1096)
* add npu support

* make precommit
2023-12-15 15:34:35 +01:00
8140129595 Updated documentation for docs/source/reward_trainer.mdx to import the correct Enum for the reward modelling LoRA config (#1092) 2023-12-15 11:24:20 +01:00
48b3ef0b7b [DPO] use ref model logprobs if it exists in the data (#885)
* use logprobs if it exists in the batch

* add features to tokenized batch if in data

* make get_batch_logps a static method

* add tokenize_batch_element dataset mapper

* Remove tokenize_batch method from DPODataCollator

* Initial sketch to precompute reference_logps

* run ref model via pytorch dataloader

* add a padding helper

* clean up the helper

* use logprob item()

* default behaviour

* clean up collator

* add docstring

* copy data back to cpu if needed

* use get_train_dataloader methods

* fix tests

* rename: more explicit variable name precompute_ref_log_probs

* improve comment

* update comment

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* refactor models into setup parameters

* parametrize precompute_ref_log_probs flag

* remove useless test

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update function arg name

* distinguish between pad token_id and mask values

* fix tokenization #932 by @nrailg

* fix test

* undo test refactor

* new line

* undo breaking change

* Update token counter condition to allow Llama tokenizer

* Account for merged tokens on certain tokenizers such as the Llama-2 tokenizer

* Update variable name to match list value when truncating response

* map function on multi-gpu and gather

* Add test cases for DPOTrainer tokenization step

* revert since we need the prepared model

* Use gather_with_metrics on ref_logps precomputation to keep original dataset size

* Add flag to keep track of when ref_logps are precomputed

* make variable names private

* formatting

* if precompute_ref_log_probs is true one can use non-peft to populate log-probs

* Use tokenizer padding token unless padding_value is set

* Move dataset.map(tokenize_batch) outside dataloader to avoid serialization errors

* eval can be none

* move to cpu to avoid gpu oom

* remove unneeded cast to float32

* remove unneeded

* fix merge

* fix merge

* fix merge

* add precompute log-prob status via tqdm

* Truncate answer if too long once prompt has been truncated

* Add prompt_input_ids to batch to enable generation

* formatting and add lora example

* fix formatting

* Tokenize row now expects sample to have space on chosen/rejected for llama

* Revert "Tokenize row now expects sample to have space on chosen/rejected for llama"

This reverts commit dd07a10fe8c19b6ac6bbcc7b8144189756710d52.

* raise error when using zero-3 with precompute_ref_log_probs

---------

Co-authored-by: Pablo Vicente Juan <p.vicente.juan@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-12-12 17:16:46 +01:00
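
A hedged sketch of the workflow #885 adds: with `precompute_ref_log_probs=True`, the reference log-probabilities are computed once over the train/eval datasets and cached, so the reference model no longer has to run on every training step. Everything except the flag name is a placeholder here.

```python
from trl import DPOTrainer

trainer = DPOTrainer(
    model,
    ref_model,                       # only needed for the one-off precomputation pass
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    precompute_ref_log_probs=True,   # cache reference chosen/rejected log-probs up front
)
trainer.train()
```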
c0ce52ab26 consistency on log (#1084) 2023-12-12 10:58:21 +01:00
393dbf6749 Removing tyro in sft_llama2.py (#1081)
* refactor

* precommit
2023-12-11 11:28:20 -06:00
94fa4b022b Make CI happy (#1080)
* Update test_ppo_trainer.py

* Update test_ppo_trainer.py

* Update test_ppo_trainer.py
2023-12-11 16:52:17 +01:00
cb7819e627 add local folder support as input for rl_training. (#1078)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2023-12-11 16:37:01 +01:00
8f0fc4c8f7 Add args to SFT example (#1079) 2023-12-11 16:16:47 +01:00
d275cb431e [DPO] add KTO loss (#1075)
* add KTO loss

* fix docs

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* formatting

* add link to papers

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-12-11 11:41:03 +01:00
7d0a8eea4e Add missing loss_type in ValueError message (#1067) 2023-12-07 08:40:53 +01:00
5a233546ee enable multiple eval datasets (#1052)
* enable multiple eval datasets

* added test

* try to avoid infinite computation

* make sure eval set is not infinite

* downsizing the test
2023-12-06 20:26:24 +01:00
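
A hedged sketch of the multiple-eval-dataset support in #1052, following the `transformers.Trainer` convention of passing a dict so metrics are reported per split; the trainer is shown as `SFTTrainer` for illustration and the dataset objects are placeholders.

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset={"in_domain": in_domain_eval, "out_of_domain": ood_eval},
    dataset_text_field="text",
)
metrics = trainer.evaluate()  # yields per-split keys, e.g. eval_in_domain_loss
```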
9fb00cf007 [SFTTrainer] Fix Trainer when args is None (#1064)
* fix sfttrainer when args is None

* oops
2023-12-06 19:02:09 +01:00
ee44946814 [core] Fix failing tests on main (#1065)
* fix tests on main

* fix last test
2023-12-06 18:31:02 +01:00
7f2401bd6e update doc for the compute_metrics argument of SFTTrainer (#1062) 2023-12-06 17:46:36 +01:00
23bf9d4b58 Improve PreTrainedModelWrapper._get_current_device (#1048)
* use LOCAL_RANK in _get_current_device

* use PartialState in _get_current_device

* update annotation
2023-12-05 17:47:40 +01:00
501c347083 Update doc CI (#1060) 2023-12-05 13:31:01 +01:00
f06f357e9c [SFT Trainer] precompute packed iterable into a dataset (#979)
* precompute packed iterable into a dataset

* add generator function

* fix typo

* fix style

* fix test

* fix style

* add test

* minor refactor

* fix test

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-12-04 13:13:18 +01:00
4cdc03ab5c Fixing accelerator version function call. (#1056)
Co-authored-by: Partha Ghosh <pghosh@brown.is.localnet>
2023-12-04 12:39:58 +01:00
a60ceefa69 Update dpo_trainer.py (#1049) 2023-12-01 17:03:09 +01:00
baa8f09cb3 Revert "[DPO] Refactor eval logging of dpo trainer (#954)" (#1047)
This reverts commit 6d9ea38ae18c7e266f797b62de4a68a12a13aba4.
2023-12-01 10:33:31 +01:00
c859f5fa5f remove spurious optimize_cuda_cache deprecation warning on init (#1045)
Signed-off-by: Chander Govindarajan <mail@chandergovind.org>
2023-12-01 10:26:42 +01:00
481ef96293 Fixes reward and text gathering in distributed training (#850)
* adds a tensor gather on rewards

* adds dist gather on texts

* style

* adds a tensor gather on rewards

* adds dist gather on texts

* style

* simplifies gathering of rewards

* style

* simplify logic

* precommit

* Update trl/trainer/ppo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* quick change

* push changes

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-11-30 10:32:09 -05:00
6d9ea38ae1 [DPO] Refactor eval logging of dpo trainer (#954)
* first attempts at refactor of dpo trainer

* removed extra stuff in prediction step

* import fixes

* label names

* all working

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-30 12:09:33 +01:00
c203e47fbf spelling is hard (#1043) 2023-11-30 12:09:13 +01:00
c84e5918a6 [DPO] cDPO loss (#1035)
* add cDPO loss

* add comment

* docs

* info about label_smoothing not being used
2023-11-30 11:50:30 +01:00
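
The cDPO variant added in #1035 can be sketched as below: the standard sigmoid loss is mixed with its label-flipped counterpart, weighted by a `label_smoothing` probability that models noisy preference labels. This is an illustrative reimplementation, not the trainer's code; the log-prob tensors are hypothetical.

```python
import torch.nn.functional as F

def cdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              beta=0.1, label_smoothing=0.1):
    # beta-scaled log-ratio difference, exactly as in DPO
    logits = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    # mix the "label is correct" and "label is flipped" terms
    return (
        -F.logsigmoid(beta * logits) * (1 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )
```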
4b67af37b6 Update utils.py (#1012)
* Update utils.py

update compute_accuracy to deal with the cases where str_chosen and str_rej get the same scores, which is probably not what the developers want

* Update utils.py

updated so that only a warning is raised

* Update trl/trainer/utils.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-29 16:02:50 +01:00
55d7c952c7 [DPO] IPO Training loss (#1022)
* initial IPO loss

* fix loss

* fixed comments

* added docs

* fix doc-strings

* add tests

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fixes for review

* Added doc about beta in the Trainer's docstring

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-24 15:52:40 +01:00
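
For reference, the IPO objective introduced in #1022 replaces the sigmoid loss with a squared regression of the log-ratio difference towards 1/(2β). A minimal sketch with hypothetical tensor arguments, not the trainer's code:

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    logits = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    # beta is the reciprocal of the gap the pair is regressed towards
    return (logits - 1 / (2 * beta)) ** 2
```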
3719f7a929 Add missing elements to sft_trainer document (#1029) 2023-11-23 12:34:27 +01:00
e7961e45f1 Remove duplicate data loading in rl_training.py (#1020)
We load the dataset twice, but in line 149 (new) we do
`ds = train_dataset.map` anyway
2023-11-23 12:25:07 +01:00
b307faf07b [Multi-Adapter PPO] Fix and Refactor reward model adapter (#982)
* reward adapter loaded as part of init

more flexible, clearer args

* fixed script for multi gpu

unwrap model since it is DDP
downside, with reward adapter it seems we need to use
find_unused_parameters=True

* remove gradient from reward score calculation

* change supported_args back to None
2023-11-21 14:48:18 +01:00
aea1da8e2b Adds requires_grad to input for non-quantized peft models (#1006)
* Update sft_trainer.py

* style

* add tests
2023-11-20 15:57:46 +01:00
e5eb4db8b5 Update how_to_train.md (#1003)
* Update how_to_train.md

fix description about `min_new_tokens`

* Update docs/source/how_to_train.md

Co-authored-by: Costa Huang <costa.huang@outlook.com>

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-11-20 10:33:34 +01:00
28bdb6a373 Fixed wrong trigger for warning (#971)
func.__code__.co_varnames was used to count the function arguments for formatting_func. This code actually counted the function variables rather than function parameters.
2023-11-15 14:36:54 +01:00
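
The distinction behind #971 is easy to reproduce: `co_varnames` lists parameters and local variables, while `co_argcount` (or `inspect.signature`) counts only the parameters, which is what the warning check needs.

```python
import inspect

def formatting_func(example):
    text = example["text"]   # a local variable, not a parameter
    return text

print(len(formatting_func.__code__.co_varnames))           # 2 -> ('example', 'text')
print(formatting_func.__code__.co_argcount)                # 1
print(len(inspect.signature(formatting_func).parameters))  # 1
```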
e140d22881 make distributed true for multiple process (#997)
* make distributed true for multiple process

* Update trl/trainer/ppo_trainer.py

distributed should have more than 1 process

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-11-15 11:20:25 +01:00
e23a541af9 add docs (#992) 2023-11-14 19:31:10 +01:00
be3faa768e [DataCollatorForCompletionOnlyLM] warn if eos_token_id and pad_token_id are identical (#988)
Display a warning message if the `eos_token_id` and `pad_token_id` values are the same in order to prevent unintended behavior during multi-turn training.
2023-11-14 19:24:56 +01:00
13679aa97e Update README.md (#994) 2023-11-14 18:29:08 +01:00
9e9f024399 Fix a bunch of outdated references to examples/ (#977) 2023-11-10 11:29:21 +01:00
c2884b5096 [Tests] Add non optional packages tests (#974)
* add non-peft tests

* change name

* test

* change

* fix test
2023-11-09 15:01:46 +01:00
2f726ce4e8 set dev version (#970) 2023-11-08 11:54:01 +01:00
a78a05d7b7 Release: v0.7.4 2023-11-08 10:30:29 +00:00
1b258247cd Pin bnb to <=0.41.1 (#968)
* pin bnb to 0.41.1

* Update setup.py

* Update setup.py
2023-11-08 11:28:17 +01:00
9c93dec05e fix peft config typehint (#967) 2023-11-08 11:11:39 +01:00
d1dad6ebda set dev version (#966) 2023-11-08 11:00:24 +01:00
8ce810250e Release: v0.7.3 (#965) 2023-11-08 10:52:47 +01:00
8e9cae8072 fix: dpo trainer ds config (#957)
* fix: dpo trainer ds config

ref_model and model shouldn't share the same DeepSpeed config, so we shouldn't modify the config directly; otherwise it can cause problems when initializing the DeepSpeed engine

* fix: import sort

import sort by isort
2023-11-06 14:37:04 +01:00
654543a8cf Added support for custom EncoderDecoder models (#911) 2023-11-06 09:52:10 +01:00
c273b18c1c Adds model kwargs to SFT and DPO trainers (#951)
* adds model kwargs to SFT and DPO trainers

* adds checks for model_kwarg passing when model is not str

* changed warning to ValueError

* renames model_kwargs to model_init_kwargs

* corrects argument names in
2023-11-06 09:48:18 +01:00
6c6ff24926 [DPO] Merge initial peft model if trainer has a peft_config (#956)
* failing test
Co-authored-by: Shoaib Burq <saburq@gmail.com>

* merge initial peft model
2023-11-06 09:45:46 +01:00
6ff0fac2c1 Fix unwrapping peft models (#948)
* First unwrap the model and then process the input embeddings

* Changed base_model to base_model.model to stay consistent with peft model abstractions
2023-11-05 08:31:47 +01:00
951ca1841f [CI] Fix CI with new transformers release (#946)
* fix CI with transformers release

* final fix
2023-11-03 10:38:58 +01:00
cc1de9820a Introducing the Iterative Trainer (#737)
* initial skeleton

* iterative trainer for decoder only

* iterative trainer unittest

* encoder_decoder support

* fix typo in unittest

* init

* fix typo

* fix init typo

* adding loggings and safety checker

* fixed minor issues

* doc

* table of contents update

* add test for seq2seq2 models

* change year

* adding text as step input

* precommit

* fixing typo

* run precommit

* fixing typo in safety checker

* fix text tokenization issue

* add truncate and inherit from trainer

* remove iterative config from tests

* remove iterative config from init

* fix peft model

* change truncation side based on truncation_mode

* removed iterativeconfig autodoc

* fixed typo in trainer.mdx

* remove mention of iterative config in docs

* make sure optimizer and scheduler are created

* adding max_steps to test

* remove log_stats fn

* remove compute loss

* fixing encoder decoder detection

* fix PPODecorator

* run precommit

* fix testing

* fix small typos in iterative trainer

* adapted function log and eval
2023-11-02 17:37:48 +01:00
a64a522fcc Update dpo_trainer.py (#941) 2023-11-02 11:27:49 +01:00
5b32372b71 Optionally logging reference response (#847)
* Optionally logging reference response

* log ref rewards as welll

* peft logic re-write

* fix peft test case

* refactor

* push changes

* test

* Apply suggestions from code review

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* quick fix

* black

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-10-31 17:55:09 -04:00
d759004e52 Fix stale bot (#935)
* Update stale.py

* Update stale.py

* fix
2023-10-31 20:10:38 +01:00
cbc6c9bb3e [core / DDP] Fix RM trainer + DDP + quantization + propagate gradient_checkpointing_kwargs in SFT & DPO (#912)
* make use of forward hooks

* correctly delete attributes

* fix RM DPP issues

* revert unneeded changes

* more fixes

* fix diff

* fix

* propagate to SFT

* Update examples/scripts/reward_modeling.py

* propagate the fix on DPO trainer

* add to example scripts

* trigger CI
2023-10-31 18:50:17 +01:00
f3cd86578b Update dpo_llama2.py (#934) 2023-10-31 18:20:53 +01:00
b763432eaf [SFTTrainer] Make sure to not conflict between transformers and TRL implementation (#933)
* standardize neftune

* up

* fix again
2023-10-31 16:04:09 +01:00
2bbd594ec5 hotfix for dpo trainer (#919)
addresses #914
2023-10-31 10:58:41 +01:00
b89b712dbf fix DPO + GC issues (#927) 2023-10-31 10:55:46 +01:00
ec9e76623e [Feature] Enable Intel XPU support (#839)
* enable xpu support

* fix bug

* review commits

* fix style

* add xou decorator

* refactor review commit

* fix test

* review commit

* fix test

* Update benchmark.yml (#856)

* Standardise example scripts (#842)

* Standardise example scripts

* fix plotting script

* Rename run_xxx to xxx

* Fix doc

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>

* Fix version check in import_utils.py (#853)

* dont use get_peft_model if model is already peft (#857)

* merge conflict

* add xou decorator

* resolve

* resolves

* upstream

* refactor and precommit

* fix new tests

* add device mapping for xpu

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Adam Pauls <adpauls@gmail.com>
Co-authored-by: abhishek thakur <1183441+abhishekkrthakur@users.noreply.github.com>
2023-10-31 10:15:35 +01:00
d192244f54 Bump tyro (#928) 2023-10-30 20:48:34 -04:00
051d5a1f61 updating PPOTrainer docstring (#897)
* adding specific dict structure to tracker_kwargs doc string to enable changing tracker params like wandb experiment name for ease, avoids needing to go deep into accelerate source

* push changes

* set default dict

* refactor

* use typing extension

---------

Co-authored-by: Laura O'Mahony <lauraomahony@L-MacBook-Pro.fritz.box>
Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-30 13:22:53 -04:00
2068fdcd93 Generalize NEFTune for FSDP, DDP, ... (#924)
* Update sft_trainer.py

* quality
2023-10-30 11:17:14 +01:00
02f5c1d8ce fix stackllama2 sft gradient checkpointing (#906)
* fix stackllama2 sft gradient checkpointing

* stackllama2 sft use tyro as arg parser
2023-10-25 09:58:26 -04:00
7de7db6765 deactivate MacOS CI (#913) 2023-10-24 16:06:12 +02:00
4e7d5b5abe [Update reward_trainer.py] append PeftSavingCallback if callbacks is not None (#910) 2023-10-24 14:32:45 +02:00
a90e13321b Fix broken link/markdown (#903)
* Fix broken link/markdown

* attempt to fix mps issue

* attempt fix mps issue

* test

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-24 14:27:03 +02:00
5b2aeca6c0 [NEFTune] Make use of forward hooks instead (#889)
* make use of forward hooks

* correctly delete attributes

* address suggestions
2023-10-24 14:18:44 +02:00
1f3314fd2f Add whiten ops before compute advantages (#887)
* Add whiten ops before compute advantages

1. From LLaMA 2 paper, it says:
```
We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term (β) above.
```
2. This function is taken from [alpaca_farm](64e489c67e/src/alpaca_farm/rl/ppo_trainer.py (L86))

* Fix type def of self

---------

Co-authored-by: Lin Junpeng <linjunpeng@sensetime.com>
2023-10-23 11:32:45 -04:00
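
A hedged sketch of the whitening operation referenced in #887, following the cited alpaca_farm helper rather than TRL's exact code: scores are normalized to zero mean and unit variance, optionally shifting the mean back.

```python
import torch

def whiten(values: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Normalize to zero mean / unit variance; optionally restore the original mean."""
    mean, var = values.mean(), values.var()
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if shift_mean:
        whitened = whitened + mean
    return whitened

scores = torch.tensor([0.2, 1.5, -0.7, 3.0])
print(whiten(scores, shift_mean=False))
```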
304ee70eef Fix couple broken links on lib homepage (#908) 2023-10-23 11:46:37 +02:00
0a5aee7d99 [reward_modeling] Cleaning example script (#882)
* remove load in repeated multiple times & truncation

* trigger CI
2023-10-19 16:00:20 +02:00
db592a2eb6 fix: remove useless token (#896) 2023-10-19 14:28:33 +02:00
122edc8f5d fix peft_config type (#883)
Co-authored-by: wanglei.w <wanglei.w@bytedance.com>
2023-10-18 23:45:38 +02:00
f91fb2bda2 remove duplicate key in reward_modeling.py (#890) 2023-10-18 23:45:18 +02:00
01e4ad0009 fix syntax error 2023-10-17 21:22:53 +02:00
1e56ff0f16 Fix security breach 2023-10-17 08:01:24 +02:00
c4ed3274be [SFTTrainer] Adds NEFTune into SFTTrainer (#871)
* v1 neftune

* docstring

* add doc + fix nit

* add more docs

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-10-17 06:58:05 +02:00
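
The NEFTune mechanism wired into SFTTrainer by #871 (and later reworked as forward hooks in #889) boils down to adding uniform noise to the input embeddings during training. A standalone sketch of the noise rule from the NEFTune paper, with a hypothetical tensor shape:

```python
import torch

def neftune_noise(embeddings: torch.Tensor, noise_alpha: float = 5.0) -> torch.Tensor:
    # embeddings: (batch, seq_len, hidden_dim); noise magnitude scales as
    # alpha / sqrt(seq_len * hidden_dim)
    seq_len, hidden_dim = embeddings.shape[1], embeddings.shape[2]
    mag_norm = noise_alpha / (seq_len * hidden_dim) ** 0.5
    return embeddings + torch.zeros_like(embeddings).uniform_(-mag_norm, mag_norm)
```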
14b6bc6691 [DPO] add SLiC hinge loss to DPOTrainer (#866)
* add SLiC hinge loss

* fix links

* beta when loss is hinge is reciprocal of margin

* fix tests

* fix docs

* doc strings

* fix method name

* raise error if loss_type is not correct

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix formatting

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-10-16 16:02:57 +02:00
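
The SLiC-style hinge option from #866 can be sketched the same way as the other DPO loss variants above; per the commit note, `beta` acts as the reciprocal of the hinge margin. Hypothetical tensor arguments, not the trainer's code:

```python
import torch

def hinge_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
    logits = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    # zero loss once the scaled log-ratio difference clears the margin of 1
    return torch.relu(1 - beta * logits)
```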
eb4d2f381a set dev version (#864) 2023-10-12 15:51:54 +02:00
78e08bd658 Release: 0.7.2 (#863) 2023-10-12 15:29:10 +02:00
96d4854455 Support both old and new diffusers import path (#843)
* Update modeling_sd_base.py

* Update trl/models/modeling_sd_base.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* make precommit

* cleaner approach

* oops

* better alternative

* rm uneeded file

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-10-12 15:06:09 +02:00
3ef21a24e7 [core] Fix import issues (#859)
* fix import issues

* cleaner approach
2023-10-11 19:04:49 +02:00
f7707fd4c6 dont use get_peft_model if model is already peft (#857) 2023-10-11 18:58:56 +02:00
dd9b8f4189 Fix version check in import_utils.py (#853) 2023-10-11 18:55:43 +02:00
ddd318865b Standardise example scripts (#842)
* Standardise example scripts

* fix plotting script

* Rename run_xxx to xxx

* Fix doc

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-11 17:28:15 +02:00
8aa12d3c95 Update benchmark.yml (#856) 2023-10-11 11:06:48 -04:00
95aea7c072 Use uniform config (#817)
* Use uniform config

* quick fix

* refactor

* update docs
2023-10-09 09:15:06 -04:00
eda1f36c57 Raise error in create_reference_model() when ZeRO-3 is enabled (#840)
* Raise error when using  with ZeRO-3

* Fix

* Refactor

* Revert

* Restore remote code

* Revert example
2023-10-09 10:49:01 +02:00
ac0d5b726d add DDPO to index (#826)
* add DDPO to index

* Update index.mdx
2023-10-06 14:42:56 +02:00
6826d592ae Clarify docstrings, help messages, assert messages in merge_peft_adapter.py (#838)
An assertion was also corrected to the intended test condition
2023-10-06 11:04:58 +02:00
c058ee6f05 [MINOR:TYPOS] Update README.md (#829) 2023-10-05 14:33:20 +02:00
fbeb146eea Set trust remote code to false by default (#833) 2023-10-04 22:53:57 +02:00
98845b9282 Fix DeepSpeed ZeRO-{1,2} for DPOTrainer (#825) 2023-10-03 09:56:00 +02:00
9f6326e65a Unify sentiment documentation (#803)
* Update documentation

* update docs

* test

* format

* Update docs/source/example_overview.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update

* add quantization dependency and update docs

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update

* quick update 2

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-10-02 10:35:49 -04:00
7dcc71b1a6 Small fixes to the PPO trainer doc. (#811)
One outstanding issue is that ppo_trainer.save_model doesn't exist.
How do we actually save the model after training?
2023-10-02 11:01:05 +02:00
6b73adc900 add option for compute_metrics in DPOTrainer (#822) 2023-09-29 12:33:47 +02:00
249d3e3259 Add RMSProp back to DPO (#821)
* init

* add install instructions
2023-09-26 10:44:44 -07:00
ad8d50e30d init custom eval loop for further DPO evals (#766)
* init

* run

* Update custom eval loop to aid DPO debugging (#770)

* sample_during_eval -> generate_during_eval

* Remove unused return_tokens

* Add import utils for W&B, prevent test fails

* Optimize dataloader random batch selection

* Separate prompt and response in logs

Makes it much easier to quickly read the starts of the generations

* Simplify logging

* reset eval steps

* manual merge fixes

* revert merge

* remove self.max_length

* style

* fix max_length

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-09-26 08:09:15 -07:00
d608fea0d1 Allow passing the token_ids as instruction_template in DataCollatorForCompletionOnlyLM (#749)
* Update utils.py

* correctly assign instruction_template in DataCollatorForCompletionOnlyLM

* correctly use instruction_token_ids in DataCollatorForCompletionOnlyLM

* DataCollatorForCompletionOnlyLM: fix instruction_template / response_template type check: handle cases where instruction_template is None

* make precommit

* Test DataCollatorForCompletionOnlyLM with pre-tokenized instruction_template
2023-09-26 11:38:30 +02:00
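
A hedged usage sketch of the option added in #749: both templates of `DataCollatorForCompletionOnlyLM` may be passed as pre-tokenized id lists instead of strings, which sidesteps tokenizers that encode the template differently depending on surrounding context. The template strings and `tokenizer` are placeholders.

```python
from trl import DataCollatorForCompletionOnlyLM

response_ids = tokenizer.encode("### Assistant:", add_special_tokens=False)
instruction_ids = tokenizer.encode("### Human:", add_special_tokens=False)

collator = DataCollatorForCompletionOnlyLM(
    response_template=response_ids,
    instruction_template=instruction_ids,
    tokenizer=tokenizer,
)
```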
92b03f5fdc fixes ppo trainer generate nit (#798) 2023-09-26 10:19:29 +02:00
7877e92991 Update sft_trainer.mdx (#808) 2023-09-22 17:55:54 +02:00
1d7e3c2ae2 Update sft_trainer.mdx to highlight Flash Attention features (#807)
* Update sft_trainer.mdx

* Update sft_trainer.mdx
2023-09-22 17:42:21 +02:00
eb6aa20401 clarify PEFT docs (#797) 2023-09-21 11:22:20 +02:00
b8f0c4cf12 Add deepspeed experiment (#795)
* Add deepspeed experiment

* add deepspeed pip install

* update hello world.sh

* update comments

* remove cleanup
2023-09-20 09:32:42 -04:00
e11a45c5d8 Revert "Add default Optim to DPO example (#759)" (#799)
This reverts commit d603e7c52704054a9e7f306ae63acdafaa3d179a.
2023-09-20 10:32:55 +02:00
08cfc4179b Add margin to RM training (#719)
* Start adding margin to RM training

* Fix typo and cleanup

* Fix incompatibilities when not using margin

* Format using 'make precommit'

* Add documentation and test for reward trainer

* Run 'make precommit'

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Fix missed merge conflict in reward trainer docs

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-09-20 10:18:38 +02:00
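
The margin-aware loss added in #719 can be sketched as follows: when the dataset supplies a per-pair `margin` column, it is subtracted inside the log-sigmoid, so the chosen completion has to out-score the rejected one by at least that margin. Illustrative only; the reward tensors are hypothetical.

```python
import torch.nn.functional as F

def reward_loss(rewards_chosen, rewards_rejected, margin=None):
    if margin is not None:
        return -F.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```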
d603e7c527 Add default Optim to DPO example (#759)
* add optim

* make configurable
2023-09-19 07:56:52 -07:00
5d30cd4d30 Changed the default value of the log_with argument (#792)
This change avoids setting report_to="all" (the default behavior in
transformers v4), which could lead to unexpected error messages for
inexperienced users. Note that the default value of report_to will
change anyway to "none" in transformers v5.
2023-09-19 13:04:17 +02:00
46975236be Temp benchmark ci dir (#765)
* Support fork in benchmark CI

* use temporary dir for benchmark CI

* debug

* revert back

* dependency fix

* refactor script
2023-09-18 11:16:16 -04:00
9a8d52cc5a Fix type checking (#748) 2023-09-18 13:54:41 +02:00
0a6c42c12c Update benchmark.yml (#782) 2023-09-15 13:45:21 -04:00
221be13d26 Update benchmark.yml (#781) 2023-09-15 11:34:09 -04:00
a922af6927 Update benchmark.yml (#780) 2023-09-15 11:28:16 -04:00
42e7a0a824 Update benchmark.yml (#779) 2023-09-15 11:18:55 -04:00
15d52e759b Update benchmark.yml (#778) 2023-09-15 11:02:10 -04:00
24e914a0ab Update benchmark.yml (#777) 2023-09-15 10:57:08 -04:00
637612d95f Benchmark CI fix (#776) 2023-09-15 10:33:45 -04:00
35694baef2 Benchmark CI fix (#775) 2023-09-15 08:52:24 -04:00
d2f27df50a Update benchmark.yml (#773)
* Update benchmark.yml

* quick change
2023-09-15 09:40:20 +02:00
5cee9a0478 Support fork in benchmark CI (#764) 2023-09-14 08:44:36 -04:00
3f7710aed7 docs: add initial version of docs for PPOTrainer (#665)
* docs: add initial version of docs for  `PPOTrainer`

* Apply suggestions from code review Leandro

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* updated docs based on feedback leandro
- specified reference to reward model
- added batched generator
- added line of saving model
- remove reference model

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-14 10:34:19 +02:00
ca0af3944d Benchmark CI (actual) (#754)
* refactor and benchmark

* update code

* Add accelerate logging

* logs

* quick fix

* update config

* precommit

* modify training example

* fix multi-gpu all_reduce error `Tensors must be CUDA and dense`

* support more models and benchmark

* update

* add changes

* upload benchmark

* precommit

* add tyro as a dependency

* add tyro

* pre-commit

* precommit

* weird...

* lol typo

* precommit

* sigh

* push changes

* Update benchmark/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Add experiments

* upload image to tag specific folder

* add openrlbenchmark documentation

* rename

* remove unused field

* precommit

* update slurm template

* add dependency

* update dependency

* ..

* .

* quick change

* push changes

* update

* update

* remove wandb tag code

* quick change

* precommit

* update test

* update dependency

* update test

* update benchmark dependency

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-13 13:34:00 -04:00
e4f9a483d9 Refactor and benchmark (#662)
* refactor and benchmark

* update code

* Add accelerate logging

* logs

* quick fix

* update config

* precommit

* modify training example

* fix multi-gpu all_reduce error `Tensors must be CUDA and dense`

* support more models and benchmark

* update

* add changes

* upload benchmark

* precommit

* add tyro as a dependency

* add tyro

* pre-commit

* precommit

* weird...

* lol typo

* precommit

* sigh

* push changes

* Update benchmark/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Add experiments

* upload image to tag specific folder

* add openrlbenchmark documentation

* rename

* remove unused field

* precommit

* push changes

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-13 10:24:18 -04:00
80890b17be [PPOTrainer] - add comment of zero masking (from second query token) (#763)
It took a while to understand why zero-masked tokens are one less than the length of query tokens. 

If I got it correctly, it is because the first logit (and state-value) from the outputs refers to the second token in the query. 

Hope this comment can be helpful to others who may encounter a similar question in the first-pass reading of the code :)
2023-09-13 10:23:04 +02:00
cf9d2a7133 Improve benchmark ci (#760) 2023-09-13 09:29:06 +02:00
c02ce6d3f5 Extend DeepSpeed integration to ZeRO-{1,2,3} (#758)
* Generalise deepspeed

* Refactor

* Add reward model arg

* Fix pipeline tokenizer

* Fix deprecation

* Pin deepspeed lower

* Fix docs

* Revert top_k change

* Add ZeRO-3 context manager

* Revert docs change

* Fix docs

* Polish docs

* Update docs/source/customization.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-12 18:59:49 +02:00
9141aa42ba EOS token processing for multi-turn DPO (#741)
* init

* fix

* add doc

* style

* clarify example
2023-09-12 09:49:51 -07:00
05723c0b88 benchmark CI fix (#755) 2023-09-12 09:04:57 -04:00
b87ec2d5a0 update to prepare_model_for_kbit_training (#728)
* update to `prepare_model_for_kbit_training`

from deprecated `prepare_model_for_int8_training`
and add `use_gradient_checkpointing=args.gradient_checkpointing` to
automatically follow the gradient checkpointing choice

is also the workaround for #694

* workaround for gradient checkpointing issue

calling model.gradient_checkpointing_enable() twice causes issues
this workaround calls it in prepare_model_for_kbit_training and then
changes the arg to false to make sure it isn't called again in
huggingface trainer inner loop

also changes stack_llama_2 sft trainer to use correct device map for ddp
training so that you can test this issue
2023-09-12 10:56:10 +02:00
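
A hedged sketch of the change in #728: `prepare_model_for_kbit_training` replaces the deprecated `prepare_model_for_int8_training` and takes the gradient-checkpointing choice directly, after which the script flips its own flag off so the Trainer does not call `gradient_checkpointing_enable()` a second time. `script_args` is a placeholder for the example's argument object.

```python
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=script_args.gradient_checkpointing
)
# Workaround described above: avoid enabling gradient checkpointing twice
# (once here and once inside the Hugging Face Trainer loop).
script_args.gradient_checkpointing = False
```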
27df071ad8 add benchmark ci (#752) 2023-09-11 13:35:53 -04:00
67452ef213 fix import of torch_utils (#751) 2023-09-11 18:46:19 +02:00
22a90198e5 [DPO] self.accelerator._prepare_deepspeed return tuples (#745) 2023-09-08 11:50:06 +02:00
4f81e7736d Seq2Seq model support for DPO (#586)
* dpo_collator for seq2seq models

* dpo trainer support

* refactoring

* update collator

* computes decoder input ids if possible

* decoder input ids for dpo trainer

* added test for seq2seq

* quality

* fixed typo

* fixed string padding for seq2seq

* fixed minor issues in padding

* fixed typo in dpo.py

* add docstring

* run all precommit

* fixed gradient accumulation steps in test

* reformatting

* fixing dpo tests

* update .mdx
2023-09-07 18:03:10 +02:00
14292b08af fixed metrics typo (#743) 2023-09-07 18:02:20 +02:00
453c4eca14 Enable gradient checkpointing to be disabled for reward modelling (#725)
* Enable gradient checkpointing to be disabled for reward modelling

* Update examples/scripts/reward_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Tidy docs

* Remove commas

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-06 14:08:15 +02:00
decc832d3e Add epsilon to score normalization (#727) 2023-09-06 10:28:07 +02:00
1111295776 check correctly for condition (#668) 2023-09-06 10:24:55 +02:00
c04074e248 Fix DeepSpeed ZeRO-3 in PPOTrainer (#730)
* Initialise ref model with ZeRO-3

* Fix deadlock

* Refactor & fix KL div

* Refactor

* Refactor

* Fix imports

* Add types

* Add accelerate configs

* Add more DeepSpeed configs

* Fix types

* Disable debug

* Refactor

* Add docs

* Disable eval mode for peft

* Restore eval mode

* Revert ref model prep for peft

* Update examples/scripts/README.md

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Add docs

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2023-09-05 11:00:49 +02:00
d484dc2a93 Refactor RewardTrainer hyperparameters into dedicated dataclass (#726)
* Refactor RewardTrainer hyperparameters into dedicated dataclass

* Revert

* Add doc string

* Fix warning

* Handle backwards compat

* Fix tests

* Add docs

* Refactor to RewardConfig

* Fix case conditions

* Fix
2023-09-05 09:05:42 +02:00
34e6948d45 [core] Bump peft to 0.4.0 (#720)
* bump peft to 0.4.0

* all of them
2023-09-01 15:01:36 +02:00
9f69f06a1c Add pyproject.toml (#690)
* example pyproject.toml

* update target to py38

* make pyproject.toml equivalent to accelerate
2023-09-01 11:42:18 +02:00
5bb46687c5 Fix: RuntimeError: 'weight' must be 2-D issue (#687)
* Update dpo_trainer.py

* Fix: self.args.deepspeed > self.is_deepspeed_enabled

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-01 11:27:54 +02:00
25d6700c5e fix sft mistakes (#717) 2023-08-31 16:56:29 +02:00
4d31d0c4f8 Update docs on gms8k (#711) 2023-08-31 16:48:07 +02:00
0ff39d2a87 fix device issue (#681)
* fix device issue

* fix device issue

* fix device issue

* merge changes

* fix device issue
2023-08-31 16:37:42 +02:00
b4899b29d2 set dev version (#710) 2023-08-30 17:00:34 +02:00
6aae9e75f3 Release: VERSION (#709) 2023-08-30 12:48:10 +02:00
79b90e19ba a workaround for failing log_stats (#708) 2023-08-30 12:23:57 +02:00
7f636c9ed7 set dev version (#707) 2023-08-30 11:58:22 +02:00
98d8cc509d Release: v0.7.0 (#706) 2023-08-30 11:55:54 +02:00
9d09b3e107 TextEnvironments (#424)
* WIP skeleton

* minimal working poc

* cleanup

* rename variables

* quick typo fix

* add v1 masking (#429)

* add v1 masking

* working v1

* adapt from suggestion

* avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.`

* fix masking

- mask the responses from API call only

* quality

* address comments

* Update trl/environment/base.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* adapt a bit

* wip on tokenization/masking in textenv

* small fixes

* update viz

* add example

* print debug text and pass masks

* style

* format and move tensor to device

* update example

* update example

* This seems to work

* fix masking

* fix rich output to console

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: leandro <leandro.vonwerra@spoud.io>

* Add masking (#461)

* add v1 masking

* working v1

* adapt from suggestion

* avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.`

* fix masking

- mask the responses from API call only

* quality

* address comments

* Update trl/environment/base.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* adapt a bit

* wip on tokenization/masking in textenv

* small fixes

* update viz

* add example

* print debug text and pass masks

* style

* format and move tensor to device

* update example

* update example

* This seems to work

* fix masking

* fix rich output to console

* fix batched generation

* improve stopping criteria

* improve error handling in tool call

---------

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Costa Huang <costa.huang@outlook.com>

* fix unknown tool

* fix rewards and increase bs

* remove unused script

* ugly WIP fix

* do not return modified obj for in-place operations

* do not return modified obj for in-place operations

* clean up stopping criterium

* push updates

* push update

* format, add docs

* rename file

* add kwargs to reward fn

* simplify example

* simplify example

* bug fix

* add a trivia example

* pre-commit

* max tool response length

* fix regex for multi-line

* refactor tool exceptions

* fix exceptions in tool

* add docs

* fix style

* make rich optional

* add docstrings

* add  tests

* add TextEnv tests (WIP)

* update triviaqa code

* update docs

* refactor text env

* update tests (WIP)

* add end2end test

* update docs

* upload tool demo

* refactor

* customizable system prompt

* add text env docs

* update index and toc

* fix `TextHistory` show methods

* add max length

* fix style

* fix typo

* refactor to kwargs in init and tasks to queries

* kwargs for reward docs

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/tool_demo.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/text_environments.md

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* move to tool folder

* remove assets

* remove tool demo

* move rich import test to import utils

* add copyright

* fixes for masks in ppo trainer

* add text env api docs

* make precommit + add ppo test with mask

* move examples and add python

* fix style

* update triviaqa example

* add more docs

* update docs

* Update docs/source/learning_tools.mdx

* Apply suggestions from code review

* precommit

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: leandro von werra <leandro@hf.co>
2023-08-30 11:44:06 +02:00
336d63eb80 [Docs] fix example README.md (#705) 2023-08-30 11:27:50 +02:00
7fc970983c [DPO] fix DPO ref_model=None (#703)
* fix by @tannonk

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add import

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-29 12:57:10 +02:00
d3bbee3ab8 set dev version (#685) 2023-08-24 11:04:07 +02:00
eb5465df7e Release: v0.6.0 (#684) 2023-08-24 10:18:46 +02:00
1c272240ac Simplify immutable TrainingArgs fix using dataclasses.replace (#682) 2023-08-24 09:50:48 +02:00
b095245830 fix PeftConfig loading from a remote repo. (#649)
* fix PeftConfig loading from a remote repo.

* failed to catch hf_hub_download() EntryNotFoundError.

At least in huggingface-hub 0.10.1, the error for "not found" is:
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error

* pass precommit checks.

* replace some bare excepts with specific codes

* catch LocalEntryNotFoundError additionally.
2023-08-24 09:50:20 +02:00
c115453fba Update sft_llama2.py (#678)
Add argument num_workers. Fixed error on line 103 when streaming is set to False.
2023-08-23 16:56:31 +02:00
16f214c58d fix immutable TrainingArguments issue (#676) 2023-08-23 10:54:59 +02:00
e9a437992e propagating eval_batch_size to TrainingArguments (#675)
Co-authored-by: Rahul Jha <rahuljha@netflix.com>
2023-08-23 10:52:25 +02:00
c837fbe5b9 Fix DPO blogpost thumbnail (#673) 2023-08-22 11:53:21 +02:00
01c4a35928 Denoising Diffusion Policy Optimization (#508)
* Broken first pre-draft

* Change structure to leverage user-definition of pipeline
 - reward function, pipeline and scheduler will be left to the user to define
 - pipeline and scheduler contract interfaces is what the framework will define
 - none of this actually works

* Incremental progress: trying to get the set-up running e2e

* Incremental progress: successfully running code

* Incremental progress: running setup
Next steps: fix accelerate gradient acc assertion error when we set value > 1

* Formatting and code standards

* Incremental prog: break down code a bit
- new config flag to notify code of async reward fetching
- break off image handling code and throw it on to user to define how to handle it
- more code restructuring

* Incremental progress:
1. More code sectioning off into own methods (more for readability than anything else)

* Incremental progress:
1. clear up contracts
2. type the reward function and prompt function

* Code shuffling and expansion of tracker, accelerator config args to beyond wandb

* More small additions
Add tensorboard logging function
Remove wandb logging function for now
Consolidate the data that gets thrown to the logging function
Add README

* Formatting

* Formatting

* Remove print statement
Make tensorboard tracking the sole tracking for the training example

* 1. start of testing
2. more refactoring
3. start of docstrings
4. parameter rename

* Basic Tests
Formatting

* Docs according to the norm

* Docs, credits and rename file

* docs and corrections

* Put example config to respectable state

* Add recent run params

* Correct the name of the library

* Move requirements to EXTRAS

* - Add license banners
- Guard import of DDPO functions with if_diffusers_available
- doc strings for output types

* Add snippet to pull weights from huggingface + banner

* Test if passes on CI/CD

* Minor refactor

* Test dummy unet

* Possible fix for randomly disappearing attribute

* Shuffling arrangement in hopes of meeting memory requirements

* Proper Names

* Appease windows memory allocator issues for the cpu device

* Remove print statements

* Update docs/source/ddpo_trainer.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update docs/source/ddpo_trainer.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Add docstrings and correct url

* Spelling and grammar

* Add more documentation and commandline parsing for example script

* Markdown syntax correction

* Revert accidentally committed file and put the correct one

* More docs

* Remove subclassing and add docs for leftover subclassing

* Put back subclassing

* Reward metadata and more docs

* Remove save_load_save flag

* Grammar

* Update trl/trainer/ddpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update tests/test_ddpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update setup.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/scripts/stable_diffusion_tuning.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Edits to the readme for DDPO

* Renamed modelling_sd_base to modeling_sd_base

* Insert try and catch for bitsandbytes import

* Change to smaller model

* Correct tolerance for floating point comparison

* Remove dummy unet and move to check is isfinite

* 1. Expand interface to ensure other Stable Diffusion pipelines could be covered
2. remove extra identification

* 1. Remove most of the asserts except for one and add value error
2. Remove default run name

* Remove progress bar

* Docs

* Put back progress bar

* 1. Revert progress bar deletion completely
2. grammar
3. relocate line

* Experiment

* Remove experiment parts and format properly

* Change formatting and edit info in docs

* Grammar

* Refactor out most of nitty gritty of loading/saving from trainer to example model
Readme addition

* Docs additions

* 1. Proper formatting for the test file
2. incorporation of pull from hub; if it fails, try local
3. doc strings for interface
4. highlight in the trainer that this is only ready for sd pipelines

* Resources for before and after

* Attempt at embedding images

* Post testing example script

* Consistent naming and document edits in light of new args

* Remove resources and add CDN links in html in doc file

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-08-21 19:24:52 +02:00
1aca98fbcf add check of arguments (#660) 2023-08-21 12:02:07 +02:00
029f961b7c Handle potentially long sequences with DataCollatorForCompletionOnlyLM (#644)
* avoid RuntimeError on long sequences

* add unittests and format

* remove dependency on external repo

* bug fix in DataCollatorForCompletionOnlyLM
2023-08-18 10:30:25 +02:00
8ec912ffa6 Add more args to SFT example (#642)
* add more args

* fix style issues
2023-08-18 10:15:43 +02:00
f360c37466 Allow for ref_model=None in DPOTrainer (#640)
* Update dpo_trainer.py

Make ref_model optional.

* add tests for ref_model=None

* better handling for ref_model=None

* Update dpo_trainer.py

Correct docstring

* move instantiation of self.ref_model closer to model

* use .disable_adapters instead of .get_base_model

* handle ref_model=None in get_batch_samples

* fix failing test in dpo_trainer due to disable_dropout_in_model

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-18 10:02:16 +02:00
217313014b Update README.md (#657)
* Update README.md

fix reward modeling example

* Update README.md

more concise fix
2023-08-17 22:00:58 +02:00
b946e875b1 Resolve various typos throughout the docs (#654)
* Resolve various typos throughout the docs

I found the first few manually, and then found the rest via codespell

* HuggingFace -> Hugging Face

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-17 12:27:54 +02:00
6dd50b45d8 Add checks on backward batch size (#651)
* Add checks on backward batch size

* add test case

* update test case

* Update citation
2023-08-17 10:35:44 +02:00
98120d6aeb Disable dropout in DPO Training (#639)
* disable dropout in dpo

* quick fix docs

* precommit

* add disable_dropout_in_model to DPOTrainer

* disable_dropout -> disable_dropout_in_model

* .

* .
2023-08-14 14:40:45 +02:00
3b2c820db6 Add score scaling/normalization/clipping (#560)
* Add reward/score scaling/normalization/clipping

* Run pre-commit to fix styles and remove some dupe code

* Make sure score module and pretrained_model have the same dtype

* Add multi_adapter_rl_v2.py

* Add log_with

* Add more verbose help message for use_score_norm

* Fix score clipping for float16

* Minor fix
2023-08-10 10:30:56 +02:00
25fd6f2313 Move repo (#628)
* update actions

* update references
2023-08-09 17:48:25 +02:00
3f1477cdc0 Improve docs (#612)
* WIP

* improve inference docs

* improve training faq

* update toctree

* fix toctree

* fix improve blog

* improve blog

* fix customization

* reword faq a bit

* reword inference a bit

* add references back

* integrate feedback from code review

* fix link in html
2023-08-08 11:45:16 +02:00
2cff1e4385 Allow already tokenized sequences for response_template in DataCollatorForCompletionOnlyLM (#622)
* Allow tokenized ids in DataCollatorForCompletionOnlyLM. Add test and docs

* Formatting

* Documentation

* Remove unused code from test

---------

Co-authored-by: Ivan Sanchez <ivan.sanchez@zyte.com>
2023-08-08 11:33:12 +02:00
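A minimal usage sketch (tokenizer and template are illustrative): instead of a string, the collator can now take the token ids of the response template, which sidesteps tokenizers that encode the template differently depending on the preceding context.

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize the template together with some context, then drop the context tokens;
# how many ids to slice off depends on the tokenizer.
response_template_with_context = "\n### Assistant:"
response_template_ids = tokenizer.encode(
    response_template_with_context, add_special_tokens=False
)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
```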
d7d7902938 use log_with argument (#620) 2023-08-08 10:13:22 +02:00
77b0cc1707 [DPO] stack-llama-2 training scripts (#611)
* initial stack-llama-2 scripts

* removed unused function

* add accelerate

* link to stack-llama-2 code

* running the model

* pre-commit fixes

* use the merge_peft script

* Add section on logged metrics
2023-08-07 14:36:16 +02:00
17f22c1c20 Add docs explaining logged metrics (#616) 2023-08-04 12:50:39 -04:00
e448bb69f0 [Modeling] Add token support for hf_hub_download (#604)
* add token support for hf_hub_download

* allow to pass it to from_pretrained
2023-08-03 12:49:31 +02:00
9aa4e3ce2b set dev version (#608) 2023-08-02 10:43:27 +02:00
ca8a508913 Release: 0.5.0 (#607) 2023-08-02 10:31:43 +02:00
a00ab445ba refactor grad accum (#546)
* refactor grad accum

* quick fix

* use correct place to step optim

* push changes

* cleanup and fix division by zero in `masked_var`

* revert back changes

* use unbiased var

* deal with division by zero

* add test case

* calculate advantage only once

* format

* add warning

* add more warnings

* quick fix

* remove unhelpful warning

* fix test cases

* fix test cases

* bump version given the breaking change

* black

* refactor

* update test cases

* error out

* push changes

* remove exact div

* add comments
2023-08-01 09:00:41 -04:00
431f0c9a2f Fix comparison in DataCollatorForCompletionOnlyLM (#588) (#594)
* Add unit test to DataCollatorForCompletionOnlyLM to reproduce the bug.

* Change comparison target from examples[i][input_ids] to batch[labels][i] in DataCollatorForCompletionOnlyLM
2023-07-31 14:13:35 +02:00
64bc9bc9e6 docs: Replace SFTTrainer with RewardTrainer in comment (#589)
Likely just a copy-paste error
2023-07-28 15:37:25 +02:00
5a1e1bf06e Introducing DataCollatorForChatCompletionOnlyLM (#456)
* added DataCollatorForChatCompletionOnlyLM

* added simple test

* merged the two collators and fixed ### in completion

* fix response template

* fixing ordering in test

* quality

* fixed minor comments & make doc

* chat test back

* Update tests/test_sft_trainer.py

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-07-28 14:17:03 +02:00
e8dd8102d8 Update the example sft_trainer.py (#587)
Added saving the final model, because by default only checkpoints are saved, not the final version.
2023-07-28 13:50:41 +02:00
1b46c61d43 [PPO] fix corner cases with PPO batch size and forward_batch_size (#563)
* fix corner cases PPO

* forward contrib credits from initial contribution

* forward contrib credits from initial discussions

---------

Co-authored-by: 1485840691-eng <1485840691-eng@users.noreply.github.com>
Co-authored-by: shubhlohiya <shubhlohiya@users.noreply.github.com>
2023-07-28 11:05:34 +02:00
3b0a1b5f8c Add missing max_seq_length arg to example sft_trainer.py (#585) 2023-07-27 18:17:43 +02:00
31658b4263 Computes the KL penalty using the entire distribution (#541)
* adds full log probs

* Adds tests, comments

* precommit

* bug all -> full

* adds option description to sentiment analysis script, fixes a few bugs
2023-07-27 12:08:24 +02:00
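A hedged sketch of the difference (dummy tensors, not TRL internals verbatim): with the full-distribution option the penalty is the KL divergence summed over the whole vocabulary at each position, instead of the single-sample estimator `log p(token) - log q(token)`.

```python
import torch
import torch.nn.functional as F

vocab = 50257
active_logits = torch.randn(1, 5, vocab)  # (batch, seq, vocab), dummy values
ref_logits = torch.randn(1, 5, vocab)

active_logprobs = F.log_softmax(active_logits, dim=-1)
ref_logprobs = F.log_softmax(ref_logits, dim=-1)

# Full KL(active || ref) per token position, summed over the vocabulary.
full_kl = F.kl_div(
    ref_logprobs, active_logprobs, log_target=True, reduction="none"
).sum(-1)

# The default estimator only looks at the sampled tokens.
sampled = torch.randint(0, vocab, (1, 5))
estimator_kl = (
    active_logprobs.gather(-1, sampled.unsqueeze(-1))
    - ref_logprobs.gather(-1, sampled.unsqueeze(-1))
).squeeze(-1)
```

The behaviour is selected via `PPOConfig(kl_penalty="full")`; the estimator variants ("kl", "abs", "mse") remain available.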
f7227fb296 Fix model output dim in reward trainer example (#566)
* correct glitches in reward modelling

* add the eval_split option

* correct code format
2023-07-26 11:02:23 +02:00
b3c2e73e70 [DPO] Resolve logging for DPOTrainer (#570)
* Resolve logging for DPOTrainer

* Ensure the WandB logger correctly prefixes all logs

* Run pre-commit

Whoops, hadn't run `pre-commit install` yet
2023-07-26 08:06:25 +02:00
d78d917880 Add comment to explain how the sentiment pipeline is used to run the … (#555)
* Add comment to explain how the sentiment pipeline is used to run the reward model in the StackLLaMA example

* Apply 'make precommit'
2023-07-24 18:09:45 +02:00
cdde7f71d7 Add DataCollatorForCompletionOnlyLM in the docs (#565)
* add `DataCollatorForCompletionOnlyLM` in the docs

* nit
2023-07-24 16:47:41 +02:00
51d5f08d88 add epochs and num steps on CLI (#562) 2023-07-24 14:01:54 +02:00
8762507d3a Minor typo and whitespace fixes (#559)
* [docs] remove extra whitespace

* [examples] fix help for dataset_name
2023-07-24 13:56:55 +02:00
1bd852aa8f remove unused batch_size arg (#554) 2023-07-24 13:23:33 +02:00
170d58ffce [SFTTrainer] Add warning for wrong padding_side (#550)
* add warning for wrong padding_side

* add warning

* revert

* oops
2023-07-22 10:53:16 +02:00
84c9209037 ADD: num_proc to SFTTrainer (#547)
* ADD: num_proc to SFTTrainer

* make precommit

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add batch_size

* Update trl/trainer/sft_trainer.py

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-07-20 15:41:48 +02:00
d0fe348a0a Add use_auth_token arg to sft_trainer example (#544)
* Add use_auth_token arg to sft_trainer example

* Update examples/scripts/sft_trainer.py

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-19 21:12:18 +02:00
5857d0acc6 [examples] make the sft script more modulable (#543)
* make the script more modulable

* docs + some changes
2023-07-19 18:13:55 +02:00
fd50e063e1 [DPO] remove response/pairs from the DPO side (#540)
* remove response/pairs from the DPO side

* Simplify get_hh helper function

* removed unused import

* update tests and docs for dpo_trainer

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>
2023-07-19 17:36:24 +02:00
bcff7c2dab Relax reward trainer constraint (#539)
* relax reward trainer constraint

* Update trl/trainer/reward_trainer.py

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* relax also for DPO

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-07-19 14:12:23 +02:00
0e8d9f8504 fix offline case (#538) 2023-07-19 12:16:13 +02:00
7f297b38c6 all the concated batches are on same device (#528) 2023-07-18 13:21:17 +02:00
84393f3b94 DPO Trainer (#416)
* initial DPO Trainer

* typo

* initial dpo from reward trainer

* calc. log_probs from logits

* remove dpo config for now

* fix inits

* add intial DPODataCollatorWithPadding

* use the RewardDataCollatorWithPadding

* initial test

* means of loss

* add assert

* just call the train instead of step

* functional debug example before refactor

* check the params have changed

* initial DPODataCollatorWithPadding

* Data collator with masking

* going through trainer.accelerate to wrap ref_model

* style / imports

* style / imports

* `broadcast_buffers=False` fix to distributed training

* better fix for DDP issues

* arguments and style clean-up

* better doc, some light refactoring

* better imports

* initial dpo doc

* fix test

* fix formatting

* fix

* called models once

* fix tests

* add example

* fix doc string

* initial example with the Anthropic HH dataset

* refactored dpo trainer

* revert

* return metrics

* fixed tests

* updated docs

* update test

* fixed typo

* note about the beta

* added dpo authors

* fix docstrings

* add prediction_step

* remove compute_metrics and log metrics manually

* fix typo

* add DPOTrainer doc

* add dpo to toc

* ValueError

* add to index and example

* fix docs

* fix assert

---------

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
Co-authored-by: Gaetan LOPEZ <gaetanloplat@gmail.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-07-17 14:52:14 +02:00
388bdc03ac Fix sentiment nit (#517) 2023-07-14 14:11:24 +02:00
5c7bfbc8d9 [examples] Big refactor of examples and documentation (#509)
* added sfttrainer and rmtrainer example scripts.

* added few lines in the documentation.

* moved notebooks.

* delete `examples/summarization`

* remove from docs as well

* refactor sentiment tuning

* more refactoring.

* updated docs for multi-adapter RL.

* add research projects folder

* more refactor

* refactor docs.

* refactor structure

* add correct scripts all over the place

* final touches

* final touches

* updated documentation from feedback.
2023-07-14 12:00:56 +02:00
36b77ae81d Use local process index for _get_current_device() (#515)
This PR fixes a bug in `_get_current_device()` where the global process index was being returned instead of the local one. 

With this fix, it is possible to run training in **multi-node** environments and avoid the dreaded `RuntimeError: CUDA error: invalid device ordinal` :)
2023-07-14 10:53:33 +02:00
2049d03e82 Put labels tensors onto GPU to fix eval bug on deepspeed (#513) 2023-07-13 11:51:21 +02:00
31b98aa5a6 set dev version 2023-07-13 08:28:52 +00:00
d06b131097 git commit -m 'Release: v0.4.7' 2023-07-13 08:17:49 +00:00
f3230902b1 [SFTTrainer] Fix the sequence length check of SFTTrainer (#512)
* fix the sequence length check of `SFTTrainer`

* forward contrib credits from initial contribution

* forward contrib credits from initial contribution

* final comments

---------

Co-authored-by: mrm8488 <mrm8488@users.noreply.github.com>
Co-authored-by: BramVanroy <BramVanroy@users.noreply.github.com>
2023-07-12 15:25:17 +02:00
bbc7eeb29c [PPOTrainer] Add prompt tuning support on TRL (#500)
* add prompt tuning support on TRL

* fix CI

* revert + add docs
2023-07-06 15:16:37 +02:00
163dae5579 [PPOTrainer] Add prefix tuning support (#501)
* add prefix tuning support

* fix CI

* better check
2023-07-06 14:56:05 +02:00
64c8db2f9a Update ppo_trainer.py (#499) 2023-07-06 10:32:19 +02:00
25d4d81801 Disable mlm by default in DataCollatorForCompletionOnlyLM, add ignore_index and docstring (#476)
* add docstring and ignore index

* hard-code mlm=False

* make precommit

* FIX: re-add mlm parameter

---------

Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>
2023-07-06 10:22:40 +02:00
685620ac6c correctly implement gradient checkpointing (#479)
switch to new peft api
add max_length to RewardTrainer
2023-07-06 09:26:13 +02:00
2b531b9223 Adds some options to stabilize the KL penalty (#486)
* adds options for the kl penalty

* style

* adds kl penalty to trl sentiment example args

* ppo_config -> config

* fix tests (equal -> allclose)

* style

* add a random seed option

* updates kl penalty description

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-05 11:23:10 +02:00
4f7f73dd09 Remove padding in batched generation. (#487)
* fix padding

* Update examples/sentiment/scripts/gpt2-sentiment.py

* fix style

---------

Co-authored-by: leandro von werra <leandro@hf.co>
2023-07-05 10:41:06 +02:00
c60c41688e FIX: contributing guidelines command (#493)
* FIX: contributing guidelines command

* Update CONTRIBUTING.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update CONTRIBUTING.md

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-04 14:27:52 +02:00
cbb98dabb1 fix typo in reward_modeling.py (#494) 2023-07-04 14:17:32 +02:00
a86eaab8e8 add ratio threshold to avoid spikes (#488) 2023-07-04 10:09:53 +02:00
aa9770c6bd Refactor README (#460)
* v1

* update

* link

* nits
2023-07-03 14:30:15 +02:00
0fe603eca1 Update sft_trainer.py (#474)
* Update sft_trainer.py

Allows the user to give their own peft model arg. https://github.com/lvwerra/trl/issues/473

* cleaner
2023-06-28 00:44:15 +02:00
843c14574f fix CI RM (#468) 2023-06-26 14:30:06 +02:00
009b82412f Debug the tortuous logic in _prepare_dataset function (#464)
* Debug the tortuous logic in `_prepare_dataset` function

There are two issues with the previous `_prepare_dataset` function.

1. Tortuous and burdensome logic: the `is_already_dataset` variable is confusing and unhelpful, so remove it.
2. The comments and the logic do not match.

For instance, in the previous version the comments said "check if torch dataset ... and do nothing". However, when `dataset` is a `torch.utils.data.Dataset` and `packing=True`, it still falls through to the `_prepare_non_packed_dataloader(...)` call.

The corrected version does nothing if the dataset is already a torch dataloader/dataset/`ConstantLengthDataset`.

* Lint: sft_trainer.py

* Lint empty line
2023-06-24 08:43:03 +02:00
82c8f20601 Pre-commit (#448)
* Pre-commit

* modify CI

* modify make file

* temporarily disable codespell

* update make file

* update contribution guide

* push changes
2023-06-23 11:37:18 -04:00
b56e8b3277 Improve stability: change default hyperparameters 2023-06-23 09:04:24 -04:00
0161a8e602 added shuffle parameter. I found it useful to turn off shuffle here and shuffle independently of this. (#457) 2023-06-23 11:47:08 +02:00
6e34c5932b set dev version 2023-06-23 09:20:25 +00:00
e1531aa526 Release: v0.4.6 2023-06-23 09:17:31 +00:00
cb6c45474a fix google colab issue (#459) 2023-06-23 11:13:36 +02:00
fe55b440e7 set dev version 2023-06-23 08:42:20 +00:00
431456732c Release: 0.4.5 2023-06-23 08:13:50 +00:00
9679d87012 Multi adapter RL (MARL) - a single model for RM & Value Head (#373)
* fix doc

* adapt from suggestions

* working v1 multiple adapters

* style

* style && quality

* oops

* docs

* add tests and docs

* add RM script

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/0_abstraction_rl.mdx

* Apply suggestions from code review

* Update docs/source/0_abstraction_rl.mdx

* add 4bit

* replace with `reward_adapter`

* explain break

* simple comment

* fix llama tokenizer

* fixes

* fixes

* rename

* quality

* rm unneeded file

* add disclaimer

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-06-22 11:19:45 +02:00
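A hedged sketch of the multi-adapter setup (base model and adapter ids are placeholders): a single base model carries a trainable LoRA policy adapter plus a reward-model adapter, and `compute_reward_score()` switches to the reward adapter to score text.

```python
import torch
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

policy_lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "huggyllama/llama-7b",                    # placeholder base model
    peft_config=policy_lora,                  # trainable policy adapter
    reward_adapter="someuser/my-rm-adapter",  # placeholder reward-adapter repo id
)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

inputs = tokenizer("Question: ...\n\nAnswer: ...", return_tensors="pt")
with torch.no_grad():
    reward = model.compute_reward_score(**inputs)  # scored with the RM adapter
```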
099f0bf42b Add accelerate project_config passthrough (#437) 2023-06-22 10:16:34 +02:00
33f88ead0b [ConstantLengthDataset] Fix packed dataset issue (#452)
* fix packed dataset issue

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* address

* more docs

* trigger CI

* fix failing CI

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-06-22 10:12:55 +02:00
7705daa672 [SFTTrainer] Introducing DataCollatorForCompletionOnlyLM (#445)
* v1 of alpaca datacollator

* make sure to match the response tokens

* add test

* add it in main init

* add check

* adapt test

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-06-20 17:51:23 +02:00
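A usage sketch close to the documented example (dataset, model and template are illustrative): the collator masks out everything before the response template so the loss is only computed on the completion.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")


def formatting_prompts_func(example):
    return [
        f"### Question: {q}\n ### Answer: {a}"
        for q, a in zip(example["instruction"], example["output"])
    ]


collator = DataCollatorForCompletionOnlyLM(" ### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=512,
)
trainer.train()
```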
fe49697e66 add stale bot (#447) 2023-06-19 17:26:17 +02:00
d1ad5405cb [SFTTrainer] Fix non packed dataset (#444)
* fix non packed dataset

* fixing tests and documentation

* Update docs/source/sft_trainer.mdx
2023-06-16 18:51:20 +02:00
1e88b84ab9 fix packing issue (#442) 2023-06-16 13:55:47 +02:00
c39207460f Drop support for Python 3.7 (#441)
* drop support for Python 3.7

* adapt
2023-06-16 13:30:01 +02:00
61af5f26b6 Fix correct gradient accumulation (#407)
* add correct grad acc

* add some tests but they fail

* test should pass

* style

* fix
2023-06-14 08:43:35 -04:00
7a89a43c3f handle the offline case (#431)
* handle the offline case

* adds warning
2023-06-13 15:36:12 +02:00
fead2c8c77 best-of-n sampler class (#375)
* First draft of best-of-n sampler class

* Formatting

* Add best-of-n class to init

* Rearrange files

* Correction

* Make sure input query is in shape

* check for numpy.ndarray type

* Fix for shapes and types AND linter fixes

* Make reward pipeline a callback for more broader application

* Documentation for best-of-n sampler class usage

* Docs update for best-of-n class

* Doc fixes for best-of-n sampler class

* Remove colon from new addition

* Change user callback output type and associated side-effects of said change

* Relocate param because of collision

* Documentation update

* Make input param keyword easier to grasp

* Remove comments and add docstrings

* Tests and fixes for best_of_n sampler class

* Change input arg name

* Formatting

* Removed unnecessary cloning
2023-06-13 10:25:21 +02:00
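A hedged usage sketch (model and the toy scoring function are illustrative): the sampler draws several candidate completions per query and keeps the highest-scoring ones according to a user-supplied callback.

```python
from transformers import AutoTokenizer, GenerationConfig
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


def queries_to_scores(texts):
    # Placeholder reward: prefer longer completions; swap in a real reward model.
    return [float(len(t)) for t in texts]


best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=LengthSampler(16, 32),  # sampled output length range
    sample_size=4,                         # candidates generated per query
    generation_config=GenerationConfig(
        do_sample=True, top_k=0, top_p=1.0, pad_token_id=tokenizer.eos_token_id
    ),
)

query_tensors = [tokenizer.encode("This movie was", return_tensors="pt").squeeze(0)]
print(best_of_n.generate(query_tensors, device="cpu"))
```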
b4bb12992e Update test_reward_trainer.py (#421) 2023-06-09 15:52:41 +02:00
b21baddc5c [doc build] Use secrets (#420) 2023-06-09 15:52:10 +02:00
216c119fa9 Enable autotag feature w/ wandb (#411)
* Enable autotag feature

* use `logging.info`

* Update trl/trainer/ppo_config.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/ppo_config.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-06-09 11:20:18 +02:00
a2747acc0f Add slurm utility (#412)
* Add slurm utility

* move files
2023-06-09 11:04:43 +02:00
b61a4b95a0 set dev version 2023-06-08 14:28:37 +00:00
5c5d7687d8 Release: v0.4.4 2023-06-08 14:26:14 +00:00
096f5e9da5 unpin accelerate (#418) 2023-06-08 16:25:03 +02:00
2a0ed3a596 set dev version 2023-06-08 08:55:33 +00:00
ff13c5bc6d Release: v0.4.3 2023-06-08 08:52:04 +00:00
d3e05d6490 Update setup.py (#414) 2023-06-08 10:49:03 +02:00
fadffc22bc Update test_reward_trainer.py (#410) 2023-06-07 12:22:22 +02:00
d405c87068 set dev version 2023-06-07 10:22:06 +00:00
b46716c4f5 Release: v0.4.2 2023-06-07 09:43:23 +00:00
ec8a5b7679 Remove unused imports in docs. (#406)
* remove unused var

* bug fix

* update docs, add e2e CI

* black

* isort

* CI
2023-06-06 18:06:49 +02:00
376d152d3f Resolve broken evaluation/prediction for RewardTrainer (#404)
* Implement evaluation/prediction for RewardTrainer

* Stick with unittest assertions

* Perform prediction forward calls without gradient

* Remove Literal to preserve Python 3.7 support

I recognize that I can also import from typing_extensions with a try-except,
but that is a bit overkill for this I feel.

* Remove eval_steps=1 to prevent flaky test on CI

The flaky test is caused by a division by zero when dividing by the runtime.
This is done on the transformers side, so it's not a TRL issue.
In practice, this won't happen - it only happens because both the model
and dataset are tiny.
2023-06-06 16:49:30 +02:00
ef57cddbc3 StackLLaMA: fix supervised finetuning and reward model training (#399)
* better reward modelling

tokenizer can be separately specified from model
removed old llama tokenizer hacks
evaluate after first step option to make nicer graphs
black + isort

* removed tokenizer hacks from supervised ft

* black and flake8
2023-06-06 10:41:07 +02:00
20111ad03a Fixed some type annotations of trl.trainer.PPoTrainer (#392)
* Fixed some type annotations of trl.trainer.PPoTrainer

- Ref model should be Optional
- The usual annotation for the Huggingface tokenizers is PreTrainedTokenizerBase. Not using that messes up people's annotation checks.
- Fixed the comments wrt the other two points

* fix quality and style

* synced & requality & restyled
2023-06-06 10:32:37 +02:00
a4793c2ede StackLlama: fixed RL training and added args (#400)
* fixed rl training args

added steps argument and break to respect max training epochs
added more PPOConfig args to script args
removed llama tokenizer hacks
removed extra args in dataset
changed to llamatokenizer from autotokenizer
black + isort

* black and flake8

* style, quality, and switch back to AutoTokenizer
2023-06-05 10:30:20 +02:00
0ddf9f657f StackLLaMA: correctly merge peft model (#398)
* correctly merge stackllama models

correctly merge weights with peft's merge_and_unload
load sequence classification model for reward models

* style, black line length 119

* flake8
2023-06-05 10:25:53 +02:00
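A minimal sketch of the merge step (paths are placeholders): load the LoRA adapter onto the base model, fold it into the weights with peft's `merge_and_unload`, and save the standalone checkpoint. For the reward model, the same applies with `AutoModelForSequenceClassification` as the base class.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
model = model.merge_and_unload()  # fold the adapter into the base weights

model.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")
```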
3138ef6f5a fix 4 bit SFT (#396) 2023-06-02 10:49:41 +02:00
a5b0414f63 keep state_dict kwargs instead of popping it in save_pretrained (#393) 2023-05-31 10:56:45 +02:00
e174bd50a5 from_pretrain with peft adapter on the hub (# 379) (#380)
* from_pretrain with peft adapter on the hub (# 379)

* Update the comment

* PR comment
2023-05-31 10:38:25 +02:00
86c117404c fix typo in ppo_trainer.py (#389)
`dataloader must be a torch.utils.data.Dataset`: `dataloader` should be `dataset`
2023-05-30 15:23:02 +02:00
a94761a02c Update customization.mdx (#390) 2023-05-30 15:22:41 +02:00
5fb5af7c34 [core] Add 4bit QLora (#383)
* add 4bit

* style
2023-05-24 13:52:38 +02:00
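A hedged sketch of what the change allows (model name illustrative; requires `bitsandbytes` and a CUDA device): load the policy in 4-bit and train only a LoRA adapter on top.

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "facebook/opt-350m",
    peft_config=lora_config,
    load_in_4bit=True,  # forwarded to transformers / bitsandbytes
)
```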
25fa1bd880 fix warning issue (#377) 2023-05-18 08:43:44 +02:00
6916e0d2df [docs] fix SFT doc (#367)
* fix doc

* adapt from suggestions
2023-05-15 16:26:27 +02:00
1704a864e7 Delete test_training.py (#371) 2023-05-15 16:21:28 +02:00
e547c392f9 Remove obsolete layer_norm_names parameter and add peft>=0.3.0 to requirements (#366)
* remove obsolete layer_norm_names parameter

* remove obsolete parameter layer_norm_names and add peft>=0.3.0 to requirements

* make style - oops

* typo
2023-05-15 16:08:11 +02:00
a31bad83fb add is_trainable in kwargs (#363)
Add is_trainable in kwargs to enable continued training of a peft model.
2023-05-15 16:08:00 +02:00
31cc361d17 Fix bug when loading local peft model (#342)
* Fix bug when loading local peft model 

Fix bug in https://github.com/lvwerra/trl/issues/341

* Fix loading bug when loading a LoRA model

Fix the loading bug when loading a LoRA model but not resuming training

1. Implement the fix logic described in https://github.com/lvwerra/trl/pull/342#pullrequestreview-1422298054

2. Set peft lora weight to trainable.

* Remove is_trainable

Leave is_trainable to future PR.

* add test_load_pretrained_peft

Check that the model saved with peft class interface can be loaded properly.
2023-05-11 23:07:50 +02:00
ab453ec183 140/best n sampling (#326)
* Create best_of_n.ipynb

* First draft

* Refactor as ref vs ppo vs non-ppo

* Changed notebook location and added README to explain motivation

* 1. Spelling and formatting refactor
2. Minor refactor of notebook

* Formatting of notebook
2023-05-11 17:56:12 +02:00
933c91cc66 fix tensorboard issue (#330) 2023-05-11 17:45:59 +02:00
ffad0a19d0 relax negative KL constraint (#352) 2023-05-11 17:45:47 +02:00
e0172fc8ec add parameter to control max_length (to mitigate OOM errors) (#359) 2023-05-11 15:28:32 +02:00
dec9993129 stack_llama: update instructions in README, fix broken _get_submodules and save tokenizer (#358)
* update instructions in README and fix broken _get_submodules

* save tokenizer

* add note about peft>=0.3.0
2023-05-11 12:29:02 +02:00
c85cdbdbd0 Fix argument's description (#339) 2023-05-04 14:29:07 +02:00
e59cce9f81 fix sft issues (#336) 2023-05-03 12:53:32 +02:00
c60fd915c1 [core] officially support SFT (Supervised Finetuning) (#323)
* add v1

* revert

* correct filename

* add tests and final tweaks

* fix tests

* adapt from offline suggestions

* Update trl/trainer/sft_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fixes

* remove warning

* multiple fixes

* fixes

* fix

* final fixes

* final fix

* more clarification

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add test

* add arg

* add callback instructions

* add formatting_prompts_func

* try docs

* add CLD

* fix docstrings

* format

* Update docs/source/sft_trainer.mdx

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* remove `prepare_in_int8_kwargs`

* change `return_overflowing_tokens`

* add warnings

* address comments

* revert pretrained kwargs

* quality

* fix sft script

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-05-03 10:42:01 +02:00
08f550674c added doc for using torch.distributed.launch/run (#324)
* added doc for using torch.distributed.launch/run

* Update docs/source/customization.mdx

---------

Co-authored-by: Afshin Oroojlooyjadid <afshin.oroojlooyjadid@oracle.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-28 16:18:07 +02:00
52fecee883 Give a key to the wandb PPOConfig config entry (#315)
* Give a key to the wandb PPOConfig config entry

The `PPOConfig` dict contains many entries with very generic keys, and the user may have logged a `wandb` config dict elsewhere.
I know I had that problem. To counter that, I pass the PPOConfig dict nested under the key `trl_ppo_trainer_config`, to prevent collisions and be explicit.

* did black --line-length 119 --target-version py38 examples tests trl
isort examples tests trl and black --check --line-length 119 --target-version py38 examples tests trl
isort --check-only examples tests trl
flake8 examples tests trl
2023-04-26 22:14:55 +02:00
3cfe194e34 [core] Officially Support Reward Modeling (#303)
* v1

- add working version
- add all possible tests
- add docs

* add some contents

* clean up

* fixes

* patch test for now

* fix test

* clean up

* fix

* this time fix

* Update docs/source/trainer.mdx

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix

* update

* final changes

* oops

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* switch to chosen / rejected

* fixes

* add example

* add accuracy metric

* pass PEFT config

* refactor compute metrics

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-04-26 11:51:56 +02:00
ad325152cc add details on multi-GPU / multi-node (#320) 2023-04-26 11:12:15 +02:00
1f29725381 fix broken tests (#318) 2023-04-25 13:57:40 +02:00
23a06c94b8 fix DS for peft ref_model in ppo trainer (#309)
The peft ref_model is obtained by calling the `disable_adapter` method, e.g.:
```
with self.accelerator.unwrap_model(self.model).pretrained_model.disable_adapter():
    ref_logprobs, _, _, _ = self.batched_forward_pass(self.model, queries, responses, model_inputs)
```
2023-04-25 12:52:49 +02:00
5c24d5bb2e fixed typo (#312) 2023-04-25 11:38:28 +02:00
503ac5d82c clean examples folder (#294)
* clean examples folder

* Update examples/toxicity/README.md
2023-04-25 11:33:54 +02:00
ce37eadcfa Log Token distribution of Query / Response (#295)
* reset git

* move to log_step_stats, make optional

* fix stack

* reset script

* fix types

* always log, add dist
2023-04-17 17:49:14 +02:00
160d0c9d6c [t5] Fix negative kl issue (#262)
* fix negative kl issue

* fix

* make style
2023-04-14 11:50:17 +02:00
d1c7529328 Fix arguments description (#298)
* Fix arguments description

* fix-argument-description

* Fix-argument-description
2023-04-12 16:00:42 +02:00
fc468e0f35 Small improvements / fixes to toxicity example (#266)
* fixes during debugging

* Update examples/toxicity/scripts/gpt-j-6b-toxicity.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-10 14:24:06 -07:00
131e5cdd10 add functionality to push best models to the hub during training (#275)
* add functionality to push best models to the hub during training

* fix indentation

* Update tests/test_ppo_trainer.py

Co-authored-by: Nathan Lambert <nathan@huggingface.co>

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Nathan Lambert <nathan@huggingface.co>

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* fix style

---------

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-10 11:32:53 -07:00
bb4a9800fa fix typo in gpt2-sentiment.ipynb (#293)
inital -> initial
2023-04-10 20:08:55 +02:00
3804a72e6c Fix swapped helper texts (#284) 2023-04-10 10:23:37 +02:00
a004b02c4a Add LLaMA tutorial to docs (#278)
* docs docs docs

* add truncated blog to docs
2023-04-07 08:16:42 -07:00
8b234479bc fix doc string problem in ppo trainer loss function (#279)
* fix a loss function docstring problem

`hidden_dim` should be `response_length`

* Update ppo_trainer.py
2023-04-07 10:22:02 +02:00
meg
cf20878113 Adding pointer back to Meta's LLaMA. (#277) 2023-04-06 14:04:12 -07:00
d8ae4d08c6 stack-llama (#273)
* adds the main scripts

* adds non-score reward clamping

* Adds adapter merge script.

* style

* adds non_reward clamp option to config

* reverts kl clamping

* style

* makes model name required for adapter merge

* updates merge adapter so it does not refer to HF internal llama checkpoints

* renames to stack_llama, adds clearer instructions

* updates readme, adds ds config

* Update examples/stack_llama/scripts/rl_finetuning_peft.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/stack_llama/scripts/rl_finetuning_peft.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* removes ds config, renamed scripts

* style

* updates launch commands

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-04-05 17:11:43 +02:00
a2749d9e0c Use active model to generate response in example on README (#269) (#271)
Co-authored-by: rmilleti <rmilleti@amazon.com>
2023-04-03 15:36:46 +02:00
ed87942a47 Add LlaMa in tests + create_reference_model (#261)
* add LlaMa in tests

* Update tests/test_modeling_value_head.py

* add warning message

---------

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
2023-03-30 10:49:46 +02:00
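For reference, a short sketch of `create_reference_model` (model name illustrative): it returns a frozen copy of the policy to act as the PPO reference model, optionally sharing the bottom layers to save memory.

```python
from trl import AutoModelForCausalLMWithValueHead, create_reference_model

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

ref_model = create_reference_model(model)                              # full frozen copy
ref_model_shared = create_reference_model(model, num_shared_layers=6)  # share the first 6 layers
```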
734624274d [core] Fix ds issue (#260)
* fix ds issue

* more comments
2023-03-29 14:20:27 +02:00
237eb9c6a5 [distributed] Fix early stopping and DP (#254)
* fix ES DP

* fix coef

* wrap in a private method

* fix value

* fix trainer logic
2023-03-28 14:31:16 +02:00
2672a942a6 [core] Fix DeepSpeed zero-3 issue (#182)
* fix zero-3 issue

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* adapt

* make style

* fix

* add docs

* fix

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
2023-03-28 13:43:52 +02:00
b5cce0d13e Using batched generate in sentiment scripts (#249)
Co-authored-by: gaurav.vi <gaurav.vi@media.net>
2023-03-27 12:09:50 +02:00
0b165e60bc Fix typo (#253) 2023-03-27 11:59:15 +02:00
404621f0f9 Improve logging for PPO + Docs page (#243)
* init pr

* try and fix docpreview

* fix

* try to fix tests

* nit

* fix tests

* convert to tensor
2023-03-24 09:34:57 +01:00
89df6abf21 feat(ci): enable pip cache (#198)
* feat(ci): add pip caching to CI

* feat(ci): create workflow to cleanup cache

* feat(ci): enable `pip` caching in CI
2023-03-24 09:33:43 +01:00
9523474490 PPO config __init__ is bloated (#241)
* Moving `total_ppo_epochs`, `forward_batch_size` and `log_with` to the post-init method and letting the dataclass automatically assign the other member variables.

* Using default factory functions for initializing dict

* Using fields + metadata for args description

* Reformatting the file using black(jupyter)

* Trying styling checks again

* Adding new args from PR 238

---------

Co-authored-by: gaurav.vi <gaurav.vi@media.net>
2023-03-24 09:33:22 +01:00
1620da371a adds early stopping (#238)
* adds early stopping

* zero opt grad

* style

* Fixed typo in early stopping property description

* Auto stash before rebase of "origin/main"
2023-03-23 15:24:04 +01:00
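A hedged sketch of the new options (values illustrative): when early stopping is enabled, the inner PPO optimisation for a batch is aborted once the measured policy KL exceeds the target.

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",      # illustrative
    learning_rate=1.41e-5,
    early_stopping=True,    # stop the PPO epochs for this batch on a KL spike
    target_kl=0.1,          # KL threshold used by the early-stopping check
)
```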
9b60207f0b [core] Add warning when negative KL (#239)
* add warning

* oops

* fix

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-22 12:18:43 +01:00
a6ebdb6e75 Reduce memory consumption in batched_forward_pass (#234)
* Reduce memory consumption by not storing logits in forward_pass

* Add docstring of return_logits
2023-03-22 10:18:18 +01:00
9c3e9e43d0 Batched generation (#228)
* add `_generate_batch`

* fix style

* omit tensor conversion

* no multiple pad by default

* add test

* stylez

* update docstring

* encoder/decoder check

* input shape safety

* moar style

---------

Co-authored-by: leandro von werra <leandro@hf.co>
2023-03-21 16:48:34 +01:00
0610711dda [core] refactor peft API (#231)
* refactor peft API

* update gpt2 peft script

* refactor

* few fixes

* fix bug

* make style

* update docs

* more update

* fix docs

* fix issues and add tests

* make style

* update docs
2023-03-21 13:35:21 +01:00
24627e9c89 set dev version 2023-03-17 10:40:04 +00:00
e6183176bc Release: 0.4.1 2023-03-17 10:07:44 +00:00
6b88bba62b [test] attempt to fix CI test for PT 2.0 (#225)
* attempt to fix CI test

* attempt to fix CI to PT 2.0

* fix 3.7 issue

* fix

* make quality

* try

* Update tests/test_ppo_trainer.py
2023-03-17 10:42:38 +01:00
44f708ee15 [peft] Fix DP issues (#221)
* fix DP issues

* add instructions

* more details

* test

* add pad labels

* ultimate fix

* explain black magic
2023-03-16 11:19:47 +01:00
90f0090580 adds a missing detach to the ratio (#224)
* adds a missing detach to the ratio

* style
2023-03-16 10:47:56 +01:00
768c3892c8 Grad accumulation and memory bugfix (#220)
* adds args and grad accum steps to sentiment examples

* updates to minibatch size in peft 20b example

* adds arg and grad acc to toxicity example

* adds detach to all entries in the step stats to reduce memory usage

* adds accelerator accumulation context

* makes gradient_accumulation_steps part of the PPOConfig

* Update trl/trainer/ppo_trainer.py

* style
2023-03-16 09:56:36 +01:00
7940683014 [core] fix DP issue (#222)
* fix DP issue

* fix

* oops

* Empty-Commit

* skip test
2023-03-16 08:43:12 +01:00
03d9844730 Let's support naive Pipeline Parallelism (#210)
* add fixes in to support PP

* add same logic for enc-dec

* add more checks

* fix 20b issues

* clean up

* update scripts

* dp safety checker

* added multi gpu tests

* fix order

* change

* fix script
2023-03-15 08:28:52 +01:00
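A minimal sketch of the usage this enables (model name illustrative; needs `accelerate`): instead of replicating a large model on every GPU, a device map splits its layers across the available devices.

```python
from trl import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    device_map="auto",  # let accelerate place the layers across available GPUs
)
```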
357730f8fd Small changes when integrating into H4 (#216)
* nits

* style
2023-03-14 14:15:23 -07:00
b75d83ab28 spell corrections (#214)
* spell corrections

* Update docs/source/quickstart.mdx

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-12 16:39:44 +01:00
0a95577dd6 spell mistakes (#213) 2023-03-12 16:37:20 +01:00
34773e97a2 Update README.md blog post link (#212) 2023-03-12 09:04:18 +01:00
ddb6df367d adds sentiment example for a 20b model (#208)
* adds sentiment example for a 20b model

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* removed numbers from script names

* adds examples to docs

* cm -> clm

* style

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-09 15:29:31 +01:00
6c0252545a set dev version 2023-03-09 11:39:42 +00:00
c9a0a8711b Release: 0.4.0 2023-03-09 09:44:58 +00:00
5c08afc1bc [core] Update dependency (#206)
* update dependency

* clarify instructions

* Update setup.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-09 10:22:04 +01:00
679f29d408 peft integration (#163)
* adds a hacky peft example

* fixes bug due to missing "prepare_model_for_training"

* Formatting

* adds peft to requirements

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* gpt neo runs

* changes requested on the PR

* style

* updates to prepare_model_for_int8_training PEFT PR https://github.com/huggingface/peft/pull/105

* updates to prepare_model_for_int8_training PEFT PR https://github.com/huggingface/peft/pull/105

* adds missing 8-bit attribute to modeling base

* adds lr to example script

* adds missing train to trainer

* disables caching temporarily while I debug something

* debugging issues with unstable training

* Fix peft + int8 (#170)

* add fix

* another fix

* Auto stash before merge of "peft-example" and "origin/peft-example"

* adds peft model types to modeling base

* reduces memory usage using adapters and no ref model.

* adds support for EleutherAI/gpt-neox-20b

* example for peft finetune of cm model

* removes hacky research code

* fixing the rebase and some typos

* style

* style2

* adds gradient checkpointing to base model

* cleans up comments

* moves config and other pretrained_model properties to __init__

* make style

* added tests

* change dependency

* Update .github/workflows/tests.yml

* fix test

* fix style and failing tests

* make quality

* revert change

* rm unneeded change

* revert changes

* rm changes

* rm changes

* rm unneeded change

* Update trl/models/modeling_base.py

* revert unneeded changes

* make style

* adapt suggestions

* fix tests

* attempt to fix

* fix

* fix

* add no peft test

* revert

* remove unneded check

* more tests

* fix logic

* add `save_pretrained` support

* fix quality

* clean up

* clean up

* stronger test

* refactor comments

* make style

* attempt to add non-peft tests

* remove test runner

* format

* fix test

* move `train` on top

* fix peft import

* make quality

* fixes typo

* adds peft example to docs

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelakda <younesbelkada@gmail.com>
2023-03-07 15:08:21 +01:00
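A hedged sketch of the resulting workflow (model name and PPO hyperparameters illustrative): passing a `peft_config` makes the value-head model wrap a LoRA policy, and `PPOTrainer` can then run with `ref_model=None`, since reference log-probs can be computed from the base weights with the adapters disabled.

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", peft_config=lora_config)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    PPOConfig(model_name="gpt2", batch_size=8, mini_batch_size=4),
    model,
    ref_model=None,  # no separate reference model needed with a LoRA policy
    tokenizer=tokenizer,
)
```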
76fd085c25 Add 1.12.1 torch compatibility in sum method (#190)
* Add 1.12.1 torch compatibility in sum method

* Replace try-catch with more explicit if-statement

* code style + quality

* Apply suggestions from code review

---------

Co-authored-by: Yehor Panchenko <panchenko.yehor@huawei.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-06 10:11:07 +01:00
6126433a4b [core] Fix quality issue (#197)
* fix quality

* Update Makefile

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add more tests and fix them

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-06 09:52:06 +01:00
f95be7736f Allow running evaluate-toxicity with cpu (#195) 2023-03-06 09:30:13 +01:00
a05ddbdd83 set correct dev version 2023-03-02 09:20:10 +00:00
206bb1e2b0 Release: 0.3.1 2023-03-02 09:12:56 +00:00
e7220be712 Fix reference to example (#184) 2023-03-02 09:58:32 +01:00
a1616f75fc Update detoxifying_a_lm.mdx (#186) 2023-03-01 20:57:47 +01:00
meg
e1b836ce9c Clarifications of acronyms and initialisms (#185) 2023-03-01 20:56:38 +01:00
bfcf71ac3d set dev version 2023-03-01 13:42:43 +01:00
128 changed files with 17441 additions and 1347 deletions

107
.github/workflows/benchmark.yml vendored Normal file

@ -0,0 +1,107 @@
name: "Benchmark on Comment"
# https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows
on:
issue_comment:
types: [created]
jobs:
Benchmark:
strategy:
fail-fast: true
matrix:
python-version: [3.9]
os: [self-hosted]
name: Benchmark
# Only run if it's a PR and the comment contains /Benchmark
if: github.event.issue.pull_request && startsWith(github.event.comment.body, '/benchmark-trl-experiments') && contains(FromJSON('["vwxyzjn", "younesbelkada", "lvwerra", "lewtun"]'), github.actor)
runs-on: ${{ matrix.os }}
steps:
- name: Get branch of PR
uses: xt0rted/pull-request-comment-branch@v1
id: comment-branch
- name: Set latest commit status as pending
uses: myrotvorets/set-commit-status-action@master
with:
sha: ${{ steps.comment-branch.outputs.head_sha }}
token: ${{ secrets.GITHUB_TOKEN }}
status: pending
- name: Checkout `main` branch
uses: actions/checkout@v3
- name: Checkout PR branch
run: gh pr checkout $PR_NUMBER
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.issue.number }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
# - name: Cleanup pip packages (specific to self-hosted runners)
# run: |
# echo PATH is $PATH
# echo PYTHONPATH is $PYTHONPATH
# echo which python is $(which python)
# echo which pip is $(which pip)
# pip_list=$(pip list --format=freeze | grep -v "^pip==" | grep -v "^setuptools==")
# if [ ! -z "$pip_list" ]; then
# echo "$pip_list" | xargs pip uninstall -y
# fi
- name: Print python dependencies
run: pip list --format=freeze
- name: Install dependencies
run: |
pip install .[test,benchmark]
- name: Login
run: wandb login ${{ secrets.WANDB_API_KEY }} && huggingface-cli login --token ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
- name: Run benchmark
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
PERSONAL_ACCESS_TOKEN_GITHUB: ${{ secrets.PERSONAL_ACCESS_TOKEN_GITHUB }}
run: |
COMMENT="${{ github.event.comment.body }}"
if [[ "$COMMENT" == *"/benchmark-trl-experiments benchmark/benchmark_level1.sh"* ]]; then
echo "Running benchmark/benchmark_level1.sh"
BENCHMARK_SCRIPT="benchmark/benchmark_level1.sh" BENCHMARK_PLOT_SCRIPT="benchmark/benchmark_level1_plot.sh" bash benchmark/benchmark_and_report.sh
elif [[ "$COMMENT" == *"/benchmark-trl-experiments benchmark/benchmark_level2.sh"* ]]; then
echo "Running benchmark/benchmark_level2.sh"
BENCHMARK_SCRIPT="benchmark/benchmark_level2.sh" BENCHMARK_PLOT_SCRIPT="benchmark/benchmark_level2_plot.sh" bash benchmark/benchmark_and_report.sh
elif [[ "$COMMENT" == *"/benchmark-trl-experiments benchmark/benchmark_level3.sh"* ]]; then
echo "Running benchmark/benchmark_level3.sh"
BENCHMARK_SCRIPT="benchmark/benchmark_level3.sh" BENCHMARK_PLOT_SCRIPT="benchmark/benchmark_level3_plot.sh" bash benchmark/benchmark_and_report.sh
else
echo "Invalid command in comment. Skipping execution."
fi
# send message to PR
- name: Setup Node.js 16
uses: actions/setup-node@v3
with:
node-version: 16
- name: Add workflow result as comment on PR
uses: actions/github-script@v6
if: always()
with:
script: |
const name = '${{ github.workflow }}';
const url = '${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}';
const success = '${{ job.status }}' === 'success';
const body = `${name}: ${success ? 'succeeded ✅' : 'failed ❌'}\n${url}`;
await github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
})
- name: Set latest commit status as ${{ job.status }}
uses: myrotvorets/set-commit-status-action@master
if: always()
with:
sha: ${{ steps.comment-branch.outputs.head_sha }}
token: ${{ secrets.GITHUB_TOKEN }}
status: ${{ job.status }}


@ -13,7 +13,6 @@ jobs:
with:
commit_sha: ${{ github.sha }}
package: trl
repo_owner: lvwerra
version_tag_suffix: ""
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}


@ -14,5 +14,4 @@ jobs:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: trl
repo_owner: lvwerra
version_tag_suffix: ""

33
.github/workflows/clear_cache.yml vendored Normal file

@ -0,0 +1,33 @@
name: "Cleanup Cache"
on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
jobs:
cleanup:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v3
- name: Cleanup
run: |
gh extension install actions/gh-actions-cache
REPO=${{ github.repository }}
echo "Fetching list of cache key"
cacheKeysForPR=$(gh actions-cache list -R $REPO | cut -f 1 )
## Setting this to not fail the workflow while deleting cache keys.
set +e
echo "Deleting caches..."
for cacheKey in $cacheKeysForPR
do
gh actions-cache delete $cacheKey -R $REPO --confirm
done
echo "Done"
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}


@ -1,13 +0,0 @@
name: Delete dev documentation
on:
pull_request:
types: [ closed ]
jobs:
delete:
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
with:
pr_number: ${{ github.event.number }}
package: trl

27
.github/workflows/stale.yml vendored Normal file

@ -0,0 +1,27 @@
name: Stale Bot
on:
schedule:
- cron: "0 15 * * *"
jobs:
close_stale_issues:
name: Close Stale Issues
if: github.repository == 'huggingface/trl'
runs-on: ubuntu-latest
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.8
- name: Install requirements
run: |
pip install PyGithub
- name: Close stale issues
run: |
python scripts/stale.py


@ -7,29 +7,31 @@ on:
branches: [ main ]
jobs:
check_code_quality:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
- uses: actions/checkout@v2
with:
python-version: "3.8"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .[dev]
- name: Check quality
run: |
make quality
fetch-depth: 0
submodules: recursive
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- uses: pre-commit/action@v2.0.3
with:
extra_args: --all-files
tests:
needs: check_code_quality
strategy:
matrix:
python-version: [3.7, 3.8, 3.9]
os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
python-version: ['3.8', '3.9', '3.10']
os: ['ubuntu-latest', 'windows-latest']
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v3
@ -37,6 +39,32 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# cpu version of pytorch
pip install -e ".[test, peft, diffusers]"
- name: Test with pytest
run: |
make test
tests_no_optional_dep:
needs: check_code_quality
runs-on: 'ubuntu-latest'
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.9
uses: actions/setup-python@v4
with:
python-version: '3.9'
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
@ -44,4 +72,4 @@ jobs:
pip install .[test]
- name: Test with pytest
run: |
make test
make test


@ -0,0 +1,16 @@
name: Upload PR Documentation
on:
workflow_run:
workflows: ["Build PR Documentation"]
types:
- completed
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package_name: trl
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

1
.gitignore vendored

@ -1,3 +1,4 @@
benchmark/trl
*.bak
.gitattributes
.last_checked

42
.pre-commit-config.yaml Normal file

@ -0,0 +1,42 @@
repos:
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
hooks:
- id: isort
args:
- --profile=black
- --skip-glob=wandb/**/*
- --thirdparty=wandb
- repo: https://github.com/myint/autoflake
rev: v1.4
hooks:
- id: autoflake
args:
- -r
- --exclude=wandb,__init__.py
- --in-place
- --remove-unused-variables
- --remove-all-unused-imports
- repo: https://github.com/python/black
rev: 22.3.0
hooks:
- id: black
args:
- --line-length=119
- --target-version=py38
- --exclude=wandb
- repo: https://github.com/pycqa/flake8
rev: 6.0.0
hooks:
- id: flake8
args:
- --ignore=E203,E501,W503,E128
- --max-line-length=119
# - repo: https://github.com/codespell-project/codespell
# rev: v2.1.0
# hooks:
# - id: codespell
# args:
# - --ignore-words-list=nd,reacher,thist,ths,magent,ba
# - --skip=docs/css/termynal.css,docs/js/termynal.js


@ -17,7 +17,7 @@ authors:
family-names: Thrush
- given-names: Nathan
family-names: Lambert
repository-code: 'https://github.com/lvwerra/trl'
repository-code: 'https://github.com/huggingface/trl'
abstract: "With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by \U0001F917 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point, most decoder and encoder-decoder architectures are supported."
keywords:
- rlhf


@ -36,10 +36,15 @@ First you want to make sure that all the tests pass:
make test
```
Then before submitting your PR make sure the code quality follows the standards. You can run the following command to format and test:
Then before submitting your PR make sure the code quality follows the standards. You can run the following command to format:
```bash
make style && make quality
make precommit
```
Make sure to install `pre-commit` before running the command:
```bash
pip install pre-commit
```
## Do you want to contribute to the documentation?


@ -1,13 +1,15 @@
.PHONY: quality style test
.PHONY: test precommit benchmark_core benchmark_aux
check_dirs := examples tests trl
test:
python -m pytest -n auto --dist=loadfile -s -v ./tests/
quality:
black --check --line-length 119 --target-version py38 tests trl
isort --check-only tests trl
flake8 tests trl
precommit:
pre-commit run --all-files
style:
black --line-length 119 --target-version py38 tests trl examples setup.py
isort tests trl
benchmark_core:
bash ./benchmark/benchmark_core.sh
benchmark_aux:
bash ./benchmark/benchmark_aux.sh

107
README.md

@ -3,23 +3,43 @@
</div>
# TRL - Transformer Reinforcement Learning
> Train transformer language models with reinforcement learning.
> Full stack transformer language models with reinforcement learning.
<p align="center">
<a href="https://github.com/huggingface/trl/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/trl/index">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/trl/index.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/trl/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg">
</a>
</p>
## What is it?
With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via `transformers`. At this point most of decoder architectures and encoder-decoder architectures are supported.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png">
</div>
`trl` is a full stack library where we provide a set of tools to train transformer language models and stable diffusion models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step. The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via `transformers`. At this point, most of decoder architectures and encoder-decoder architectures are supported. Refer to the documentation or the `examples/` folder for example code snippets and how to run these tools.
**Highlights:**
- `PPOTrainer`: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
- `AutoModelForCausalLMWithValueHead` & `AutoModelForSeq2SeqLMWithValueHead`: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
- Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
## How it works
- [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer): A light and friendly wrapper around `transformers` Trainer to easily fine-tune language models or adapters on a custom dataset.
- [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer): A light wrapper around `transformers` Trainer to easily fine-tune language models for human preferences (Reward Modeling).
- [`PPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.PPOTrainer): A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
- [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) & [`AutoModelForSeq2SeqLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead): A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
- [Examples](https://github.com/huggingface/trl/tree/main/examples): Train GPT2 to generate positive movie reviews with a BERT sentiment classifier, full RLHF using adapters only, train GPT-j to be less toxic, [Stack-Llama example](https://huggingface.co/blog/stackllama), etc.
## How PPO works
Fine-tuning a language model via PPO consists of roughly three steps:
1. **Rollout**: The language model generates a response or continuation based on query which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair.
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate to far from the reference language model. The active language model is then trained with PPO.
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
This process is illustrated in the sketch below:
@ -40,7 +60,7 @@ pip install trl
### From source
If you want to run the examples in the repository a few additional libraries are required. Clone the repository and install it with pip:
```bash
git clone https://github.com/lvwerra/trl.git
git clone https://github.com/huggingface/trl.git
cd trl/
pip install .
```
@ -52,8 +72,59 @@ pip install -e .
## How to use
### Example
This is a basic example on how to use the library. Based on a query the language model creates a response which is then evaluated. The evaluation could be a human in the loop or another model's output.
### `SFTTrainer`
This is a basic example on how to use the `SFTTrainer` from the library. The `SFTTrainer` is a light wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom dataset.
```python
# imports
from datasets import load_dataset
from trl import SFTTrainer
# get dataset
dataset = load_dataset("imdb", split="train")
# get trainer
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
# train
trainer.train()
```
### `RewardTrainer`
This is a basic example on how to use the `RewardTrainer` from the library. The `RewardTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune reward models or adapters on a custom preference dataset.
```python
# imports
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer
# load model and dataset - dataset needs to be in a specific format
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
...
# load trainer
trainer = RewardTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
# train
trainer.train()
```
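The `...` above elides the dataset preparation. As a minimal sketch (assuming a preference dataset with raw `chosen` and `rejected` text columns; the column names follow the format the `RewardTrainer` expects):
```python
# Sketch of the elided preprocessing: turn raw "chosen"/"rejected" text pairs into
# the tokenized columns used by the RewardTrainer. `dataset` is assumed to be a
# preference dataset loaded in the elided step above.
def tokenize_pair(example):
    chosen = tokenizer(example["chosen"], truncation=True)
    rejected = tokenizer(example["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pair)
```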
### `PPOTrainer`
This is a basic example on how to use the `PPOTrainer` from the library. Based on a query the language model creates a response which is then evaluated. The evaluation could be a human in the loop or another model's output.
```python
# imports
@ -78,7 +149,7 @@ query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
# get model response
response_tensor = respond_to_batch(model_ref, query_tensor)
response_tensor = respond_to_batch(model, query_tensor)
# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
@ -91,14 +162,6 @@ reward = [torch.tensor(1.0)]
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```
### Advanced example: IMDB sentiment
For a detailed example check out the example python script `examples/scripts/ppo-sentiment.py`, where GPT2 is fine-tuned to generate positive movie reviews. A few examples from the language models before and after optimisation are given below:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/table_imdb_preview.png" width="800">
<p style="text-align: center;"> <b>Figure:</b> A few review continuations before and after optimisation. </p>
</div>
## References
### Proximal Policy Optimisation
@ -111,11 +174,11 @@ The language models utilize the `transformers` library by 🤗 Hugging Face.
```bibtex
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/lvwerra/trl}}
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
```

benchmark/benchmark.py

@ -0,0 +1,150 @@
import argparse
import math
import os
import shlex
import subprocess
import uuid
from distutils.util import strtobool
import requests
def parse_args():
# fmt: off
parser = argparse.ArgumentParser()
parser.add_argument("--command", type=str, default="",
help="the command to run")
parser.add_argument("--num-seeds", type=int, default=3,
help="the number of random seeds")
parser.add_argument("--start-seed", type=int, default=1,
help="the number of the starting seed")
parser.add_argument("--workers", type=int, default=0,
help="the number of workers to run benchmark experimenets")
parser.add_argument("--auto-tag", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
help="if toggled, the runs will be tagged with git tags, commit, and pull request number if possible")
parser.add_argument("--slurm-template-path", type=str, default=None,
help="the path to the slurm template file (see docs for more details)")
parser.add_argument("--slurm-gpus-per-task", type=int, default=1,
help="the number of gpus per task to use for slurm jobs")
parser.add_argument("--slurm-total-cpus", type=int, default=50,
help="the number of gpus per task to use for slurm jobs")
parser.add_argument("--slurm-ntasks", type=int, default=1,
help="the number of tasks to use for slurm jobs")
parser.add_argument("--slurm-nodes", type=int, default=None,
help="the number of nodes to use for slurm jobs")
args = parser.parse_args()
# fmt: on
return args
def run_experiment(command: str):
command_list = shlex.split(command)
print(f"running {command}")
# Use subprocess.PIPE to capture the output
fd = subprocess.Popen(command_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, errors = fd.communicate()
return_code = fd.returncode
assert return_code == 0, f"Command failed with error: {errors.decode('utf-8')}"
# Convert bytes to string and strip leading/trailing whitespaces
return output.decode("utf-8").strip()
def autotag() -> str:
wandb_tag = ""
print("autotag feature is enabled")
git_tag = ""
try:
git_tag = subprocess.check_output(["git", "describe", "--tags"]).decode("ascii").strip()
print(f"identified git tag: {git_tag}")
except subprocess.CalledProcessError as e:
print(e)
if len(git_tag) == 0:
try:
count = int(subprocess.check_output(["git", "rev-list", "--count", "HEAD"]).decode("ascii").strip())
hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
git_tag = f"no-tag-{count}-g{hash}"
print(f"identified git tag: {git_tag}")
except subprocess.CalledProcessError as e:
print(e)
wandb_tag = f"{git_tag}"
git_commit = subprocess.check_output(["git", "rev-parse", "--verify", "HEAD"]).decode("ascii").strip()
try:
# try finding the pull request number on github
prs = requests.get(f"https://api.github.com/search/issues?q=repo:huggingface/trl+is:pr+{git_commit}")
if prs.status_code == 200:
prs = prs.json()
if len(prs["items"]) > 0:
pr = prs["items"][0]
pr_number = pr["number"]
wandb_tag += f",pr-{pr_number}"
print(f"identified github pull request: {pr_number}")
except Exception as e:
print(e)
return wandb_tag
if __name__ == "__main__":
args = parse_args()
if args.auto_tag:
existing_wandb_tag = os.environ.get("WANDB_TAGS", "")
wandb_tag = autotag()
if len(wandb_tag) > 0:
if len(existing_wandb_tag) > 0:
os.environ["WANDB_TAGS"] = ",".join([existing_wandb_tag, wandb_tag])
else:
os.environ["WANDB_TAGS"] = wandb_tag
print("WANDB_TAGS: ", os.environ.get("WANDB_TAGS", ""))
commands = []
for seed in range(0, args.num_seeds):
commands += [" ".join([args.command, "--seed", str(args.start_seed + seed)])]
print("======= commands to run:")
for command in commands:
print(command)
if args.workers > 0 and args.slurm_template_path is None:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=args.workers, thread_name_prefix="cleanrl-benchmark-worker-")
for command in commands:
executor.submit(run_experiment, command)
executor.shutdown(wait=True)
else:
print("not running the experiments because --workers is set to 0; just printing the commands to run")
# SLURM logic
if args.slurm_template_path is not None:
if not os.path.exists("slurm"):
os.makedirs("slurm")
if not os.path.exists("slurm/logs"):
os.makedirs("slurm/logs")
print("======= slurm commands to run:")
with open(args.slurm_template_path) as f:
slurm_template = f.read()
slurm_template = slurm_template.replace("{{array}}", f"0-{len(commands) - 1}%{args.workers}")
slurm_template = slurm_template.replace(
"{{seeds}}", f"({' '.join([str(args.start_seed + int(seed)) for seed in range(args.num_seeds)])})"
)
slurm_template = slurm_template.replace("{{len_seeds}}", f"{args.num_seeds}")
slurm_template = slurm_template.replace("{{command}}", args.command)
slurm_template = slurm_template.replace("{{gpus_per_task}}", f"{args.slurm_gpus_per_task}")
total_gpus = args.slurm_gpus_per_task * args.slurm_ntasks
slurm_cpus_per_gpu = math.ceil(args.slurm_total_cpus / total_gpus)
slurm_template = slurm_template.replace("{{cpus_per_gpu}}", f"{slurm_cpus_per_gpu}")
slurm_template = slurm_template.replace("{{ntasks}}", f"{args.slurm_ntasks}")
if args.slurm_nodes is not None:
slurm_template = slurm_template.replace("{{nodes}}", f"#SBATCH --nodes={args.slurm_nodes}")
else:
slurm_template = slurm_template.replace("{{nodes}}", "")
filename = str(uuid.uuid4())
open(os.path.join("slurm", f"{filename}.slurm"), "w").write(slurm_template)
slurm_path = os.path.join("slurm", f"{filename}.slurm")
print(f"saving command in {slurm_path}")
if args.workers > 0:
job_id = run_experiment(f"sbatch --parsable {slurm_path}")
print(f"Job ID: {job_id}")


@ -0,0 +1,41 @@
#### Step 1: create a work directory:
# this is necessary because another github action job will remove
# the entire directory, which slurm depends on.
# https://stackoverflow.com/questions/4632028/how-to-create-a-temporary-directory
MY_SLURM_TMP_DIR=/fsx/costa/slurm_tmpdir
mkdir -p $MY_SLURM_TMP_DIR
WORK_DIR=`mktemp -d -p "$MY_SLURM_TMP_DIR"`
cp -r "$PWD" "$WORK_DIR"
cd "$WORK_DIR/$(basename "$PWD")"
echo WORK_DIR: $WORK_DIR
#### Step 2: actual work starts:
echo PATH is $PATH
echo PYTHONPATH is $PYTHONPATH
echo which python is $(which python)
export WANDB_ENTITY=huggingface
bash $BENCHMARK_SCRIPT > output.txt
# Extract Job IDs into an array
job_ids=($(grep "Job ID:" output.txt | awk '{print $3}'))
# Extract WANDB_TAGS into an array
WANDB_TAGS=($(grep "WANDB_TAGS:" output.txt | awk '{print $2}'))
WANDB_TAGS=($(echo $WANDB_TAGS | tr "," "\n"))
# Print to verify
echo "Job IDs: ${job_ids[@]}"
echo "WANDB_TAGS: ${WANDB_TAGS[@]}"
TAGS_STRING="?tag=${WANDB_TAGS[0]}"
FOLDER_STRING="${WANDB_TAGS[0]}"
for tag in "${WANDB_TAGS[@]:1}"; do
TAGS_STRING+="&tag=$tag"
FOLDER_STRING+="_$tag"
done
echo "TAGS_STRING: $TAGS_STRING"
echo "FOLDER_STRING: $FOLDER_STRING"
TAGS_STRING=$TAGS_STRING FOLDER_STRING=$FOLDER_STRING BENCHMARK_PLOT_SCRIPT=$BENCHMARK_PLOT_SCRIPT sbatch --dependency=afterany:$job_ids benchmark/post_github_comment.sbatch


@ -0,0 +1,11 @@
# hello world experiment
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template


@ -0,0 +1,20 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo$TAGS_STRING" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/hello_world \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$FOLDER_STRING" \
--path_in_repo="images/benchmark/$FOLDER_STRING" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"


@ -0,0 +1,23 @@
# compound experiments: gpt2xl + grad_accu
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name ppo_gpt2xl_grad_accu --ppo_config.model_name gpt2-xl --ppo_config.mini_batch_size 16 --ppo_config.gradient_accumulation_steps 8 --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
# compound experiments: Cerebras-GPT-6.7B + deepspeed zero2 + grad_accu
python benchmark/benchmark.py \
--command "accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo.py --ppo_config.exp_name ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2 --ppo_config.batch_size 32 --ppo_config.mini_batch_size 32 --ppo_config.log_with wandb --ppo_config.model_name cerebras/Cerebras-GPT-6.7B --ppo_config.reward_model sentiment-analysis:cerebras/Cerebras-GPT-6.7B" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 8 \
--slurm-ntasks 1 \
--slurm-total-cpus 90 \
--slurm-template-path benchmark/trl.slurm_template


@ -0,0 +1,31 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo$TAGS_STRING" \
"ppo_gpt2xl_grad_accu$TAGS_STRING" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/different_models \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2$TAGS_STRING" \
--env-ids sentiment-analysis:cerebras/Cerebras-GPT-6.7B \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/deepspeed \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$FOLDER_STRING" \
--path_in_repo="images/benchmark/$FOLDER_STRING" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"


@ -0,0 +1,46 @@
## w/ and w/o gradient accumulation
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name ppo_step_grad_accu --ppo_config.mini_batch_size 1 --ppo_config.gradient_accumulation_steps 128 --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
## w/ different models (gpt2, gpt2-xl, falcon, llama2)
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name ppo_gpt2 --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name ppo_falcon_rw_1b --ppo_config.model_name tiiuae/falcon-rw-1b --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
## w/ and w/o PEFT
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name ppo_peft --use_peft --ppo_config.log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template

benchmark/plot.sh

@ -0,0 +1,56 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
BASELINE_PR_TAG=v0.4.7-55-g110e672
BASELINE_PR_NAME=PR-662
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/sentiment \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_step_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb gradient accumulation ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/gradient_accu \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_gpt2?tag=$BASELINE_PR_TAG&cl=sentiment gpt2 ($BASELINE_PR_NAME)" \
"sentiment_tuning_falcon_rw_1b?tag=$BASELINE_PR_TAG&cl=sentiment tiiuae/falcon-rw-1b ($BASELINE_PR_NAME)" \
"sentiment_tuning_gpt2xl_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment gpt2xl ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/different_models \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_peft?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb w/ peft ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/peft \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$BASELINE_PR_TAG" \
--path_in_repo="images/benchmark/$BASELINE_PR_TAG" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"


@ -0,0 +1,26 @@
import json
import os
from ghapi.all import GhApi
FOLDER_STRING = os.environ.get("FOLDER_STRING", "")
folder = f"benchmark/trl/{FOLDER_STRING}"
host_url = f"https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/{FOLDER_STRING}"
# Create a GitHub API instance
github_context = json.loads(os.environ["GITHUB_CONTEXT"])
token = os.environ["PERSONAL_ACCESS_TOKEN_GITHUB"] # this needs to refreshed every 12 months
status_message = "**[COSTA BENCHMARK BOT]**: Here are the results"
body = status_message
repo = github_context["repository"]
owner, repo = repo.split("/")
api = GhApi(owner=owner, repo=repo, token=token)
# for each `.png` file in the folder, add it to the comment
for file in os.listdir(folder):
if file.endswith(".png"):
body += f"\n![{file}]({host_url}/{file})"
# Create a comment on the issue
api.issues.create_comment(issue_number=github_context["event"]["issue"]["number"], body=body)


@ -0,0 +1,9 @@
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=production-cluster
#SBATCH --ntasks=1
#SBATCH --output=slurm/logs/%x_%j.out
sleep 2m
bash $BENCHMARK_PLOT_SCRIPT
srun python benchmark/post_github_comment.py


@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=production-cluster
#SBATCH --gpus-per-task={{gpus_per_task}}
#SBATCH --cpus-per-gpu={{cpus_per_gpu}}
#SBATCH --ntasks={{ntasks}}
#SBATCH --output=slurm/logs/%x_%j.out
#SBATCH --array={{array}}
#SBATCH --exclude=ip-26-0-156-239,ip-26-0-148-151,ip-26-0-146-212,ip-26-0-145-137,ip-26-0-146-249,ip-26-0-146-149,ip-26-0-147-233,ip-26-0-145-154,ip-26-0-144-35,ip-26-0-144-189,ip-26-0-146-183,ip-26-0-147-120,ip-26-0-144-95,ip-26-0-145-193
{{nodes}}
seeds={{seeds}}
seed=${seeds[$SLURM_ARRAY_TASK_ID % {{len_seeds}}]}
echo "Running task $SLURM_ARRAY_TASK_ID with seed: $seed"
srun {{command}} --ppo_config.seed $seed


@ -0,0 +1,23 @@
from dataclasses import dataclass
import tyro
from huggingface_hub import HfApi
@dataclass
class Args:
folder_path: str = "benchmark/trl"
path_in_repo: str = "images/benchmark"
repo_id: str = "trl-internal-testing/example-images"
repo_type: str = "dataset"
args = tyro.cli(Args)
api = HfApi()
api.upload_folder(
folder_path=args.folder_path,
path_in_repo=args.path_in_repo,
repo_id=args.repo_id,
repo_type=args.repo_type,
)


@ -1,24 +1,54 @@
- sections:
- sections:
- local: index
title: TRL
- local: quickstart
title: Quickstart
- local: installation
title: Installation
- local: how_to_train
title: PPO Training FAQ
- local: use_model
title: Use Trained Models
- local: customization
title: Customize your training
title: Customize the Training
- local: logging
title: Understanding Logs
title: Get started
- sections:
- local: models
title: Model Classes
- local: trainer
title: Trainer Classes
- local: reward_trainer
title: Reward Model Training
- local: sft_trainer
title: Supervised Fine-Tuning
- local: ppo_trainer
title: PPO Trainer
- local: best_of_n
title: Best of N Sampling
- local: dpo_trainer
title: DPO Trainer
- local: ddpo_trainer
title: Denoising Diffusion Policy Optimization
- local: iterative_sft_trainer
title: Iterative Supervised Fine-Tuning
- local: text_environments
title: Text Environments
title: API
- sections:
- sections:
- local: example_overview
title: Example Overview
- local: sentiment_tuning
title: Sentiment Tuning
- local: summarization_reward_tuning
title: Summarization Reward Tuning
- local: lora_tuning_peft
title: Training with PEFT
- local: detoxifying_a_lm
title: Detoxifying a Language Model
- local: using_llama_models
title: Training StackLlama
- local: learning_tools
title: Learning to Use Tools
- local: multi_adapter_rl
title: Multi Adapter RLHF
title: Examples

docs/source/best_of_n.mdx

@ -0,0 +1,72 @@
# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning
Within the extras module is the `best-of-n` sampler class that serves as an alternative method of generating better model output.
To see how it fares against RL-based fine-tuning, please look in the `examples` directory for a comparison example
## Usage
To get started quickly, instantiate the class with a model, a length sampler, a tokenizer and a callable that serves as a proxy reward pipeline, outputting reward scores for input queries
```python
from transformers import pipeline, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler
# model, ref_model_name, reward_model, device and output_length_sampler are assumed to be defined beforehand
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)
reward_pipe = pipeline("sentiment-analysis", model=reward_model, device=device)
tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
tokenizer.pad_token = tokenizer.eos_token
# callable that takes a list of raw text and returns a list of corresponding reward scores
def queries_to_scores(list_of_strings):
return [output["score"] for output in reward_pipe(list_of_strings)]
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler)
```
And assuming you have a list/tensor of tokenized queries, you can generate better output by calling the `generate` method
```python
best_of_n.generate(query_tensors, device=device, **gen_kwargs)
```
The default sample size is 4, but you can change it at the time of instance initialization like so
```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, sample_size=8)
```
The default output is the result of taking the top scored output for each query, but you can change it to top 2 and so on by passing the `n_candidates` argument at the time of instance initialization
```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, n_candidates=2)
```
There is the option of setting the generation settings (like `temperature`, `pad_token_id`) at the time of instance creation as opposed to when calling the `generate` method.
This is done by passing a `GenerationConfig` from the `transformers` library at the time of initialization
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(min_length= -1, top_k=0.0, top_p= 1.0, do_sample= True, pad_token_id=tokenizer.eos_token_id)
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, generation_config=generation_config)
best_of_n.generate(query_tensors, device=device)
```
Furthermore, at the time of initialization you can set the seed to control the repeatability of the generation process and the number of samples to generate for each query, as sketched below.
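A minimal sketch of both options, reusing the names from the snippets above (the `seed` and `sample_size` keyword names are assumed from the description):
```python
best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=output_length_sampler,
    seed=0,          # controls repeatability of the sampling
    sample_size=8,   # number of candidates generated per query
)
```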


@ -1,6 +1,50 @@
# Training customization
At `trl` we provide the possibility to give enough modularity to users to be able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques.
TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques.
## Train on multiple GPUs / nodes
The trainers in TRL use 🤗 Accelerate to enable distributed training across multiple GPUs or nodes. To do so, first create an 🤗 Accelerate config file by running
```bash
accelerate config
```
and answering the questions according to your multi-gpu / multi-node setup. You can then launch distributed training by running:
```bash
accelerate launch your_script.py
```
We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:
```shell
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```
Refer to the [examples page](https://github.com/huggingface/trl/tree/main/examples) for more details.
### Distributed training with DeepSpeed
All of the trainers in TRL can be run on multiple GPUs together with DeepSpeed ZeRO-{1,2,3} for efficient sharding of the optimizer states, gradients, and model weights. To do so, run:
```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_your_script.py --all_arguments_of_the_script
```
Note that for ZeRO-3, a small tweak is needed to initialize your reward model on the correct device via the `zero3_init_context_manager()` context manager. In particular, this is needed to avoid DeepSpeed hanging after a fixed number of training steps. Here is a snippet of what is involved from the [`sentiment_tuning`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) example:
```python
ds_plugin = ppo_trainer.accelerator.state.deepspeed_plugin
if ds_plugin is not None and ds_plugin.is_zero3_init_enabled():
with ds_plugin.zero3_init_context_manager(enable=False):
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
else:
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
```
Consult the 🤗 Accelerate [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more information about the DeepSpeed plugin.
## Use different optimizers
@ -63,7 +107,7 @@ optimizer = Lion(filter(lambda p: p.requires_grad, self.model.parameters()), lr=
...
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
```
We advice you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
We advise you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-lion.png">
@ -90,7 +134,7 @@ config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
lr_scheduler = lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
@ -142,3 +186,31 @@ ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
```
## Use the CUDA cache optimizer
When training large models, you can better handle the CUDA cache by iteratively clearing it. To do so, simply pass `optimize_cuda_cache=True` to `PPOConfig`:
```python
config = PPOConfig(..., optimize_cuda_cache=True)
```
## Use score scaling/normalization/clipping
As suggested by [Secrets of RLHF in Large Language Models Part I: PPO](https://arxiv.org/abs/2307.04964), we support score (aka reward) scaling/normalization/clipping to improve training stability via `PPOConfig`:
```python
from trl import PPOConfig
ppo_config = {
    "use_score_scaling": True,
    "use_score_norm": True,
    "score_clip": 0.5,
}
config = PPOConfig(**ppo_config)
```
To run `ppo.py`, you can use the following command:
```bash
python examples/scripts/ppo.py --log_with wandb --use_score_scaling --use_score_norm --score_clip 0.5
```


@ -0,0 +1,119 @@
# Denoising Diffusion Policy Optimization
## The why
| Before | After DDPO finetuning |
| --- | --- |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_squirrel.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_squirrel.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_crab.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_crab.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_starfish.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_starfish.png"/></div> |
## Getting started with Stable Diffusion finetuning with reinforcement learning
The machinery for finetuning Stable Diffusion models with reinforcement learning makes heavy use of Hugging Face's `diffusers` library. Getting started therefore requires a bit of familiarity with the `diffusers` library concepts, mainly two of them: pipelines and schedulers.
Right out of the box (the `diffusers` library), there isn't a `Pipeline` nor a `Scheduler` instance that is suitable for finetuning with reinforcement learning. Some adjustments need to be made.
This library provides a pipeline interface that must be implemented in order to be used with the `DDPOTrainer`, which is the main machinery for fine-tuning Stable Diffusion with reinforcement learning. **Note: Only the StableDiffusion architecture is supported at this point.**
There is a default implementation of this interface that you can use out of the box. Assuming the default implementation is sufficient and/or to get things moving, refer to the training example alongside this guide.
The point of the interface is to fuse the pipeline and the scheduler into one object, which keeps the constraints in one place. The interface was designed in the hope of catering to pipelines and schedulers beyond the examples in this repository at this time of writing. Also, the scheduler step is a method of this pipeline interface; this may seem redundant given that the raw scheduler is accessible via the interface, but it is the only way to constrain the scheduler step output to an output type befitting of the algorithm at hand (DDPO).
For a more detailed look into the interface and the associated default implementation, go [here](https://github.com/lvwerra/trl/tree/main/trl/models/modeling_sd_base.py)
Note that the default implementation has a LoRA implementation path and a non-LoRA based implementation path. The LoRA flag is enabled by default and can be turned off by passing in the flag to do so. LoRA based training is faster, and the LoRA-associated model hyperparameters responsible for model convergence aren't as finicky as with non-LoRA based training.
In addition, you are expected to provide a reward function and a prompt function. The reward function is used to evaluate the generated images and the prompt function is used to generate the prompts that are used to generate the images.
## Getting started with `examples/scripts/ddpo.py`
The `ddpo.py` script is a working example of using the `DDPO` trainer to finetune a Stable Diffusion model. This example explicitly configures a small subset of the overall parameters associated with the config object (`DDPOConfig`).
**Note:** one A100 GPU is recommended to get this running. Anything below an A100 will not be able to run this example script, and even if it does with relatively smaller parameters, the results will most likely be poor.
Almost every configuration parameter has a default. Only one command-line flag argument is required of the user to get things up and running. The user is expected to have a [huggingface user access token](https://huggingface.co/docs/hub/security-tokens) that will be used to upload the model post finetuning to the Hugging Face hub. The following bash command gets things running
```bash
python ddpo.py --hf_user_access_token <token>
```
To obtain the documentation of `stable_diffusion_tuning.py`, please run `python stable_diffusion_tuning.py --help`
The following are things to keep in mind (the code checks this for you as well) while configuring the trainer, beyond the use case of the example script; a configuration sketch satisfying these constraints follows the list:
- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) should be greater than or equal to the configurable training batch size (`--ddpo_config.train_batch_size=3`)
- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) must be divisible by the configurable train batch size (`--ddpo_config.train_batch_size=3`)
- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) must be divisible by both the configurable gradient accumulation steps (`--ddpo_config.train_gradient_accumulation_steps=1`) and the configurable accelerator processes count
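As a concrete illustration, here is a configuration sketch that satisfies the constraints above (field names mirror the `--ddpo_config.*` flags; 2 accelerator processes are assumed):
```python
from trl import DDPOConfig

# Sketch only: 6 >= 3, 6 % 3 == 0, 6 % 1 == 0, and 6 % 2 == 0 (assuming 2 processes).
config = DDPOConfig(
    sample_batch_size=6,
    train_batch_size=3,
    train_gradient_accumulation_steps=1,
)
```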
## Setting up the image logging hook function
Expect the function to be given a list of lists of the form
```python
[[image, prompt, prompt_metadata, rewards, reward_metadata], ...]
```
and `image`, `prompt`, `prompt_metadata`, `rewards`, `reward_metadata` are batched.
The last list in the list of lists represents the last sample batch. You are likely to want to log this one.
While you are free to log however you want, the use of `wandb` or `tensorboard` is recommended.
### Key terms
- `rewards` : The reward/score is a numerical value associated with the generated image and is key to steering the RL process
- `reward_metadata` : The reward metadata is the metadata associated with the reward. Think of this as extra information payload delivered alongside the reward
- `prompt` : The prompt is the text that is used to generate the image
- `prompt_metadata` : The prompt metadata is the metadata associated with the prompt. A situation where this will not be empty is when the reward model comprises a [`FLAVA`](https://huggingface.co/docs/transformers/model_doc/flava) setup where questions and ground answers (linked to the generated image) are expected with the generated image (See here: https://github.com/kvablack/ddpo-pytorch/blob/main/ddpo_pytorch/rewards.py#L45)
- `image` : The image generated by the Stable Diffusion model
Example code for logging sampled images with `wandb` is given below.
```python
import numpy as np
from PIL import Image

# for logging these images to wandb
def image_outputs_hook(image_data, global_step, accelerate_logger):
# For the sake of this example, we only care about the last batch
# hence we extract the last element of the list
result = {}
images, prompts, _, rewards, _ = image_data[-1]
for i, image in enumerate(images):
pil = Image.fromarray(
(image.cpu().numpy().transpose(1, 2, 0) * 255).astype(np.uint8)
)
pil = pil.resize((256, 256))
result[f"{prompts[i]:.25} | {rewards[i]:.2f}"] = [pil]
accelerate_logger.log_images(
result,
step=global_step,
)
```
### Using the finetuned model
Assuming you're done with all the epochs and have pushed up your model to the hub, you can use the finetuned model as follows
```python
import torch
from trl import DefaultDDPOStableDiffusionPipeline
pipeline = DefaultDDPOStableDiffusionPipeline("metric-space/ddpo-finetuned-sd-model")
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# memory optimization
pipeline.vae.to(device, torch.float16)
pipeline.text_encoder.to(device, torch.float16)
pipeline.unet.to(device, torch.float16)
prompts = ["squirrel", "crab", "starfish", "whale","sponge", "plankton"]
results = pipeline(prompts)
for prompt, image in zip(prompts,results.images):
image.save(f"{prompt}.png")
```
## Credits
This work is heavily influenced by the repo [here](https://github.com/kvablack/ddpo-pytorch) and the associated paper [Training Diffusion Models
with Reinforcement Learning by Kevin Black, Michael Janner, Yilan Du, Ilya Kostrikov, Sergey Levine](https://arxiv.org/abs/2305.13301).


@ -1,15 +1,15 @@
# Detoxifying a Language Model using PPO
Language models (LMs) are known to sometimes generate toxic outputs. In this example, we will show how to "detoxify" a LM by feeding it toxic prompts and then using PPO to "detoxify" it.
Language models (LMs) are known to sometimes generate toxic outputs. In this example, we will show how to "detoxify" a LM by feeding it toxic prompts and then using [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl/index) and Proximal Policy Optimization (PPO) to "detoxify" it.
Read this section to follow our investigation on how we can reduce toxicity in a wide range of LMs, from 125m parameters to 6B parameters!
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples/toxicity) as well as the link for the interactive demo:
Here's an overview of the notebooks and scripts in the [TRL toxicity repository](https://github.com/huggingface/trl/tree/main/examples/toxicity/scripts) as well as the link for the interactive demo:
| File | Description | Colab link |
|---|---| --- |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [`gpt-j-6b-toxicity.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [Interactive Space](https://huggingface.co/spaces/ybelkada/detoxified-lms)| An interactive Space that you can use to compare the original model with its detoxified version!| x |
## Context
@ -24,7 +24,7 @@ One could have also used different techniques to evaluate the toxicity of a mode
### Selection of models
We selected the following models for our experiments to show that `trl` can be easily scaled to 10B parameters models:
We selected the following models for our experiments to show that TRL can be easily scaled to 10B parameters models:
* [`EleutherAI/gpt-neo-125M`](https://huggingface.co/EleutherAI/gpt-neo-125M) (125 million parameters)
* [`EleutherAI/gpt-neo-2.7B`](https://huggingface.co/EleutherAI/gpt-neo-2.7B) (2.7 billion parameters)
@ -174,15 +174,18 @@ Below are few generation examples of `gpt-j-6b-detox` model:
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-toxicity-examples.png">
</div>
The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py).
The evaluation script can be found [here](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py).
### Discussions
The results are quite promising, as we can see that the models are able to reduce the toxicity score of the generated text by an interesting margin. The gap is clear for the `gpt-neo-2B` model but less so for the `gpt-j-6B` model. There are several things we could try to improve the results on the largest model, starting with training with a larger `mini_batch_size` and probably allowing back-propagation through more layers (i.e. using fewer shared layers).
We also think we could have trained the models using a "more toxic" dataset as the one we used is much cleaner than the dataset we used for testing our models (from our observation).
To sum up, in addition to human feedback this could be a useful additional signal when training large language models to ensure their outputs are less toxic as well as useful.
### Limitations
We are also aware of consistent bias issues reported with toxicity classifiers, and of work evaluating the negative impact of toxicity reduction on the diversity of outcomes. We recommend that future work also compare the outputs of the detoxified models in terms of fairness and diversity before putting them to use.
## What is next?
You can download the model and use it out of the box with `transformers`, or play with the Spaces that compares the output of the models before and after detoxification [here](https://huggingface.co/spaces/ybelkada/detoxified-lms).
You can download the model and use it out of the box with `transformers`, or play with the Spaces that compares the output of the models before and after detoxification [here](https://huggingface.co/spaces/ybelkada/detoxified-lms).

docs/source/dpo_trainer.mdx

@ -0,0 +1,221 @@
# DPO Trainer
TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290) by Rafailov et al., 2023. For a full example have a look at [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py).
The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.
## Expected dataset format
The DPO trainer expects a very specific format for the dataset, since the model will be trained to directly optimize the preference of which sentence is the most relevant, given two sentences. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png" width="50%">
</div>
Therefore the final dataset object should contain these 3 entries if you use the default `DPODataCollatorWithPadding` data collator. The entries should be named:
- `prompt`
- `chosen`
- `rejected`
for example:
```py
dpo_dataset_dict = {
"prompt": [
"hello",
"how are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
],
"chosen": [
"hi nice to meet you",
"I am fine",
"My name is Mary",
"My name is Mary",
"Python",
"Python",
"Java",
],
"rejected": [
"leave me alone",
"I am not fine",
"Whats it to you?",
"I dont have a name",
"Javascript",
"C++",
"C++",
],
}
```
where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays.
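If you build the preference data as a plain dictionary like the one above, it can be wrapped into a dataset for the trainer, for example with 🤗 Datasets:
```python
from datasets import Dataset

# Wrap the dictionary above into a dataset usable as `train_dataset`.
train_dataset = Dataset.from_dict(dpo_dataset_dict)
```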
## Expected model format
The DPO trainer expects a model of `AutoModelForCausalLM`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
## Using the `DPOTrainer`
For a detailed example have a look at the `examples/scripts/dpo.py` script. At a high level we need to initialize the `DPOTrainer` with a `model` we wish to train, a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response, the `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (ie decoder only or encoder-decoder).
```py
dpo_trainer = DPOTrainer(
model,
model_ref,
args=training_args,
beta=0.1,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
```
After this one can then call:
```py
dpo_trainer.train()
```
Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. We ignore the reference model as `beta` -> 0.
## Loss functions
Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the DPO authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression.
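For illustration, a minimal sketch of this default sigmoid loss (not the exact trainer internals; the inputs are assumed to be per-example summed log-probabilities):
```python
import torch.nn.functional as F

# Sketch of the default ("sigmoid") DPO loss, given summed log-probabilities of the
# chosen/rejected responses under the policy and the frozen reference model.
def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     reference_chosen_logps, reference_rejected_logps, beta=0.1):
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    logits = chosen_logratios - rejected_logratios
    return -F.logsigmoid(beta * logits).mean()
```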
The [RSO](https://arxiv.org/abs/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://arxiv.org/abs/2305.10425) paper. The `DPOTrainer` can be switched to this loss via the `loss_type="hinge"` argument and the `beta` in this case is the reciprocal of the margin.
The [IPO](https://arxiv.org/abs/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer.
The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability that can be passed to the `DPOTrainer` via `label_smoothing` argument (between 0 and 0.5) and then a conservative DPO loss is used. Use the `loss_type="cdpo"` argument to the trainer to use it.
The [KTO](https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf) loss is derived to directly maximize the utility of LLM generations instead of the log-likelihood of preferences. Thus the dataset does not necessarily consist of preferences but rather desirable vs undesirable completions. For paired preference data as required by the `DPOTrainer`, use the `loss_type="kto_pair"` argument to the trainer to utilize this loss, while for the more general case of desirable and undesirable data, use the as of yet unimplemented `KTOTrainer`.
## Logging
While training and evaluating we record the following reward metrics (a short sketch of how they are computed follows the list):
* `rewards/chosen`: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
* `rewards/rejected`: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta
* `rewards/accuracies`: mean of how often the chosen rewards are > than the corresponding rejected rewards
* `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
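A sketch of how these metrics relate to the log-probabilities used in the loss sketch above:
```python
# Sketch: map policy/reference log-probabilities to the logged reward metrics.
def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       reference_chosen_logps, reference_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)        # rewards/chosen
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)  # rewards/rejected
    return {
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean(),
        "rewards/margins": (chosen_rewards - rejected_rewards).mean(),
    }
```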
## Accelerate DPO fine-tuning using `unsloth`
You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) and even full-finetuning (1.1x faster) using the [`unsloth`](https://github.com/unslothai/unsloth) library that is compatible with `DPOTrainer`. Currently `unsloth` supports only Llama (Yi, TinyLlama as well) and Mistral architectures.
First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth#installation-instructions---conda). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLlamaModel` or `FastMistralModel` as follows:
```python
import torch
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import FastLlamaModel, FastMistralModel
max_seq_length = 2048 # Supports automatic RoPE Scaling, so choose any number.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Currently only supports dropout = 0
bias = "none", # Currently only supports bias = "none"
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
training_args = TrainingArguments(output_dir="./output")
dpo_trainer = DPOTrainer(
model,
ref_model=None,
args=training_args,
beta=0.1,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
```
The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).
## Reference model considerations with PEFT
You have three main options (plus several variants) for how the reference model works when using PEFT, assuming the model that you would like to further enhance with DPO was tuned using (Q)LoRA.
1. Simply create two instances of the model, each loading your adapter - works fine but is very inefficient.
2. Merge the adapter into the base model, create another adapter on top, then leave the `model_ref` param null, in which case DPOTrainer will unload the adapter for reference inference - efficient, but has potential downsides discussed below.
3. Load the adapter twice with different names, then use `set_adapter` during training to swap between the adapter being DPO'd and the reference adapter - slightly less efficient compared to 2 (~adapter size VRAM overhead), but avoids the pitfalls.
### Downsides to merging QLoRA before DPO (approach 2)
As suggested by [Tim Dettmers](https://twitter.com/Tim_Dettmers/status/1694654191325573456), the best option for merging QLoRA adapters is to first quantize the base model, merge the adapter, then convert back to bf16. Something similar to [this script](https://github.com/jondurbin/qlora/blob/main/qmerge.py).
You can also just merge the adapters the standard way without quantizing the base model, but then you have 1-2% reduced performance (and evidently, more issues with empty responses).
If you use the recommended approach, which quantizes the model, you're now in a situation where to use QLoRA for DPO, you will need to re-quantize the merged model again or use an unquantized merge with lower overall performance.
### Using option 3 - load the adapter twice
To avoid the downsides with option 2, at the expense of slightly increased VRAM, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in DPOTrainer.
For example:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model.
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/mixtral-8x7b-v0.1",
load_in_4bit=True,
quantization_config=bnb_config,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.config.use_cache = False
# Load the adapter.
model = PeftModel.from_pretrained(
model,
"/path/to/peft",
is_trainable=True,
adapter_name="train",
)
# Load the adapter a second time, with a different name, which will be our reference model.
model.load_adapter("/path/to/peft", adapter_name="reference")
# Initialize the trainer, without a ref_model param.
dpo_trainer = DPOTrainer(
model,
...
model_adapter_name="train",
ref_adapter_name="reference",
)
```
## DPOTrainer
[[autodoc]] DPOTrainer


@ -0,0 +1,73 @@
# Examples
## Introduction
The examples should work in any of the following settings (with the same script):
- single GPU
- multi GPUS (using PyTorch distributed mode)
- multi GPUS (using DeepSpeed ZeRO-Offload stages 1, 2, & 3)
- fp16 (mixed-precision), fp32 (normal precision), or bf16 (bfloat16 precision)
To run it in each of these various modes, first initialize the accelerate
configuration with `accelerate config`
**NOTE:** to train with a 4-bit or 8-bit model, please run
```bash
pip install --upgrade trl[quantization]
```
## Accelerate Config
For all the examples, you'll need to generate a 🤗 Accelerate config file with:
```shell
accelerate config # will prompt you to define the training configuration
```
Then, it is encouraged to launch jobs with `accelerate launch`!
# Maintained Examples
| File | Description |
|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| [`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py) | This script shows how to use the `SFTTrainer` to fine tune a model or adapters into a target dataset. |
| [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the `RewardTrainer` to train a reward model on your own dataset. |
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) | This script shows how to use the `PPOTrainer` to fine-tune a sentiment analysis model using IMDB dataset |
| [`examples/scripts/ppo_multi_adapter.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo_multi_adapter.py) | This script shows how to use the `PPOTrainer` to train a single base model with multiple adapters. Requires you to run the example script with the reward model training beforehand. |
| [`examples/scripts/stable_diffusion_tuning_example.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/stable_diffusion_tuning_example.py) | This script shows how to use the `DDPOTrainer` to fine-tune a stable diffusion model using reinforcement learning. |
Here are also some easier-to-run colab notebooks that you can use to get started with TRL:
| File | Description |
|----------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| [`examples/notebooks/best_of_n.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/best_of_n.ipynb) | This notebook demonstrates how to use the "Best of N" sampling strategy using TRL when fine-tuning your model with PPO. |
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 IMDB sentiment tuning example in a Jupyter notebook. |
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example in a Jupyter notebook. |
We also have some other examples that are less maintained but can be used as a reference:
1. **[research_projects](https://github.com/huggingface/trl/tree/main/examples/research_projects)**: Check out this folder to find the scripts used for some research projects that used TRL (LM de-toxification, Stack-Llama, etc.)
## Distributed training
All of the scripts can be run on multiple GPUs by providing the path of an 🤗 Accelerate config file when calling `accelerate launch`. To launch one of them on one or multiple GPUs, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine and `--all_arguments_of_the_script` with your arguments.)
```shell
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```
You can also adjust the parameters of the 🤗 Accelerate config file to suit your needs (e.g. training in mixed precision).
### Distributed training with DeepSpeed
Most of the scripts can be run on multiple GPUs together with DeepSpeed ZeRO-{1,2,3} for efficient sharding of the optimizer states, gradients, and model weights. To do so, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine, `--all_arguments_of_the_script` with your arguments, and `--deepspeed_config` with the path to the DeepSpeed config file such as `examples/deepspeed_configs/deepspeed_zero1.yaml`):
```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```


@ -0,0 +1,66 @@
# Training FAQ
## What Metrics Should I Look at?
When performing classical supervised fine-tuning of language models, the loss (especially the validation loss) serves as a good indicator of the training progress. However, in Reinforcement Learning (RL), the loss becomes less informative about the model's performance, and its value may fluctuate while the actual performance improves.
To address this, we recommend focusing on two key metrics first:
**Mean Reward**: The primary goal is to maximize the reward achieved by the model during RL training.
**Objective KL Divergence**: KL divergence (Kullback-Leibler divergence) measures the dissimilarity between two probability distributions. In the context of RL training, we use it to quantify the difference between the current model and a reference model. Ideally, we want to keep the KL divergence between 0 and 10 to ensure the model's generated text remains close to what the reference model produces.
However, there are more metrics that can be useful for debugging; check out the [logging section](logging).
## Why Do We Use a Reference Model, and What's the Purpose of KL Divergence?
When training RL models, optimizing solely for reward may lead to unexpected behaviors, where the model exploits the environment in ways that don't align with good language generation. In the case of RLHF, we use a reward model trained to predict whether a generated text is highly ranked by humans.
However, the RL model being optimized against the reward model may learn patterns that yield high reward but do not represent good language. This can result in extreme cases where the model generates texts with excessive exclamation marks or emojis to maximize the reward. In some worst-case scenarios, the model may generate patterns completely unrelated to natural language yet receive high rewards, similar to adversarial attacks.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/kl-example.png">
<p style="text-align: center;"> <b>Figure:</b> Samples without a KL penalty from <a href="https://arxiv.org/pdf/1909.08593.pdf">https://arxiv.org/pdf/1909.08593.pdf</a>. </p>
</div>
To address this issue, we add a penalty to the reward function based on the KL divergence between the current model and the reference model. By doing this, we encourage the model to stay close to what the reference model generates.
## What Is the Concern with Negative KL Divergence?
If you generate text by purely sampling from the model distribution, things generally work fine. But when you use the `generate` method, there are a few caveats: depending on the settings it does not always purely sample, which can cause the KL divergence to go negative. Essentially, when the active model achieves `log_p_token_active < log_p_token_ref` we get negative KL divergence. This can happen in several cases:
- **top-k sampling**: the model can smooth out the probability distribution, causing the top-k tokens to have a smaller probability than under the reference model while still being selected
- **min_length**: this ignores the EOS token until `min_length` is reached, so the model can assign a very low log prob to the EOS token and very high probs to all other tokens until `min_length` is reached
These are just a few examples. Why is negative KL an issue? The total reward `R` is computed as `R = r - beta * KL`, so if the model learns how to drive the KL divergence negative, it effectively receives a positive reward. In many cases it can be much easier to exploit such a bug in the generation than to actually learn the reward function. In addition, the KL can become arbitrarily negative, so the actual reward can be very small compared to it.
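To make the arithmetic concrete, here is a toy calculation (not TRL's implementation; `beta`, the score, and the KL value are made up):
```python
# Negative per-token KL acts as a reward bonus even when the score is zero.
beta = 0.1
score = 0.0
kl = -0.5  # active model assigns lower log-prob to the sampled token than the reference
total_reward = score - beta * kl
print(total_reward)  # 0.05 > 0 without any actual improvement in the score
```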
So how should you generate text for PPO training? Let's have a look!
## How to generate text for training?
In order to avoid the KL issues described above, we recommend using the following settings:
```python
generation_kwargs = {
"min_length": -1, # don't ignore the EOS token (see above)
"top_k": 0.0, # no top-k sampling
"top_p": 1.0, # no nucleus sampling
"do_sample": True, # yes, we want to sample
"pad_token_id": tokenizer.eos_token_id, # most decoder models don't have a padding token - use EOS token instead
"max_new_tokens": 32, # specify how many tokens you want to generate at most
}
```
With these settings we usually don't encounter any issues. You can also experiment with other settings, but if you encounter issues with negative KL divergence, try going back to these and see if the issues persist.
## How can you debug your own use-case?
Debugging the RL pipeline can be challenging due to its complexity. Here are some tips and suggestions to make the process easier:
- **Start from a working example**: Begin with a working example from the trl repository and gradually modify it to fit your specific use-case. Changing everything at once can make it difficult to identify the source of potential issues. For example, you can start by replacing the model in the example, and once you figure out the best hyperparameters, switch to your dataset and reward model.
- **Start small, scale later**: Training large models can be very slow and take several hours or days until you see any improvement. For debugging this is not a convenient timescale so try to use small model variants during the development phase and scale up once that works. That being said you sometimes have to be careful as small models might not have the capacity to solve a complicated task either.
- **Start simple**: Try to start with a minimal example and build complexity from there. Your use-case might require for example a complicated reward function consisting of many different rewards - try to use one signal first and see if you can optimize that and then add more complexity after that.
- **Inspect the generations**: It's always a good idea to inspect what the model is generating. Maybe there is a bug in your post-processing or your prompt. Due to bad settings you might cut off generations too soon. These things are very hard to see in the metrics but very obvious if you look at the generations.
- **Inspect the reward model**: If your reward is not improving over time, maybe there's an issue with the reward model. You can look at extreme cases to see if it does what it should: e.g. in the sentiment case you can check if simple positive and negative examples really get different rewards. You can also look at the distribution of your dataset. Finally, maybe the reward is dominated by the query, which the model can't affect, so you might need to normalize this (e.g. reward of query+response minus reward of the query; see the sketch below).
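As a rough illustration of that last normalization tip, here is a hedged sketch; the sentiment pipeline, the `top_k=None` call, and the `POSITIVE` label name are assumptions borrowed from the sentiment examples, not part of this FAQ:
```python
import torch
from transformers import pipeline

# Hypothetical sentiment reward model used only for illustration.
reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")

def score(texts):
    # top_k=None returns scores for every label; pick the POSITIVE one explicitly.
    outputs = reward_model(texts, top_k=None)
    return [next(d["score"] for d in out if d["label"] == "POSITIVE") for out in outputs]

def normalized_rewards(queries, responses):
    # Credit the model only for what it adds on top of the query.
    full = score([q + r for q, r in zip(queries, responses)])
    query_only = score(queries)
    return [torch.tensor(f - q) for f, q in zip(full, query_only)]
```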
These are just a few tips that we find helpful - if you have more useful tricks feel free to open a PR to add them as well!


@ -4,6 +4,58 @@
# TRL - Transformer Reinforcement Learning
With the TRL (Transformer Reinforcement Learning) library you can train transformer language models with reinforcement learning. The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
TRL is a full stack library where we provide a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step.
The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
TRL supports decoder models such as GPT-2, BLOOM, GPT-Neo which can all be optimized using Proximal Policy Optimization (PPO). You can find installation instructions in the [installation guide](installation) and an introduction to the library in the [Quickstart section](quickstart). There is also a more [in-depth example](sentiment_tuning) to tune GPT-2 to produce positive movie reviews.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png">
</div>
Check the appropriate sections of the documentation depending on your needs:
## API documentation
- [Model Classes](models): *A brief overview of what each public model class does.*
- [`SFTTrainer`](sft_trainer): *Supervised fine-tuning of your model made easy with `SFTTrainer`*
- [`RewardTrainer`](reward_trainer): *Easily train your reward model using `RewardTrainer`.*
- [`PPOTrainer`](ppo_trainer): *Further fine-tune the supervised fine-tuned model using the PPO algorithm*
- [Best-of-N Sampling](best-of-n): *Use best of n sampling as an alternative way to sample predictions from your active model*
- [`DPOTrainer`](dpo_trainer): *Direct Preference Optimization training using `DPOTrainer`.*
- [`TextEnvironment`](text_environment): *Text environment to train your model using tools with RL.*
## Examples
- [Sentiment Tuning](sentiment_tuning): *Fine-tune your model to generate positive movie content*
- [Training with PEFT](lora_tuning_peft): *Memory efficient RLHF training using adapters with PEFT*
- [Detoxifying LLMs](detoxifying_a_lm): *Detoxify your language model through RLHF*
- [StackLlama](using_llama_models): *End-to-end RLHF training of a Llama model on Stack exchange dataset*
- [Learning with Tools](learning_tools): *Walkthrough of using `TextEnvironments`*
- [Multi-Adapter Training](multi_adapter_rl): *Use a single base model and multiple adapters for memory efficient end-to-end training*
## Blog posts
<div class="mt-10">
<div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/rlhf">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/120_rlhf/thumbnail.png" alt="thumbnail">
<p class="text-gray-700">Illustrating Reinforcement Learning from Human Feedback</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-peft">
<img src="https://github.com/huggingface/blog/blob/main/assets/133_trl_peft/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/stackllama">
<img src="https://github.com/huggingface/blog/blob/main/assets/138_stackllama/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">StackLLaMA: A hands-on guide to train LLaMA with RLHF</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/dpo-trl">
<img src="https://github.com/huggingface/blog/blob/main/assets/157_dpo_trl/dpo_thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Fine-tune Llama 2 with DPO</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-ddpo">
<img src="https://github.com/huggingface/blog/blob/main/assets/166_trl_ddpo/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Finetune Stable Diffusion Models with DDPO via TRL</p>
</a>
</div>
</div>


@ -12,7 +12,7 @@ pip install trl
You can also install the latest version from source. First clone the repo and then run the installation with `pip`:
```bash
git clone https://github.com/lvwerra/trl.git
git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .
```


@ -0,0 +1,54 @@
# Iterative Trainer
Iterative fine-tuning is a training method that enables you to perform custom actions (for example, generation and filtering) between optimization steps. In TRL we provide an easy-to-use API to fine-tune your models iteratively in just a few lines of code.
## Usage
To get started quickly, instantiate a model and a tokenizer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import IterativeSFTTrainer

model_name = "gpt2"  # any causal LM checkpoint works here
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

trainer = IterativeSFTTrainer(
    model,
    tokenizer,
)
```
You have the choice to either provide a list of strings or a list of tensors to the step function.
#### Using a list of tensors as input:
```python
inputs = {
"input_ids": input_ids,
"attention_mask": attention_mask
}
trainer.step(**inputs)
```
#### Using a list of strings as input:
```python
inputs = {
"texts": texts
}
trainer.step(**inputs)
```
For causal language models, `labels` will automatically be created from `input_ids` or from `texts`. When using sequence-to-sequence models you will have to provide your own labels or text labels.
## IterativeTrainer
[[autodoc]] IterativeSFTTrainer


@ -0,0 +1,234 @@
# Learning Tools (Experimental 🧪)
Using Large Language Models (LLMs) with tools has been a popular topic recently, with awesome works such as [ToolFormer](https://arxiv.org/abs/2302.04761) and [ToolBench](https://arxiv.org/pdf/2305.16504.pdf). In TRL, we provide a simple example of how to teach an LLM to use tools with reinforcement learning.
Here's an overview of the scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples/research_projects/tools):
| File | Description |
|---|---|
| [`calculator.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/calculator.py) | Script to train an LLM to use a calculator with reinforcement learning. |
| [`triviaqa.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/triviaqa.py) | Script to train an LLM to use a wiki tool to answer questions. |
| [`python_interpreter.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/python_interpreter.py) | Script to train an LLM to use a Python interpreter to solve math puzzles. |
<Tip warning={true}>
Note that the scripts above rely heavily on the `TextEnvironment` API which is still under active development. The API may change in the future. Please see [`TextEnvironment`](text_environment) for the related docs.
</Tip>
## Learning to Use a Calculator
The rough idea is as follows:
1. Load a tool such as [ybelkada/simple-calculator](https://huggingface.co/spaces/ybelkada/simple-calculator) that parses a text calculation like `"14 + 34"` and returns the calculated number:
```python
from transformers import AutoTokenizer, load_tool
tool = load_tool("ybelkada/simple-calculator")
tool_fn = lambda text: str(round(float(tool(text)), 2)) # rounding to 2 decimal places
```
2. Define a reward function that returns a positive reward if the tool returns the correct answer. In the script we create a dummy reward function like `reward_fn = lambda x: 1`, but we override the rewards directly later.
3. Create a prompt that shows how to use the tools:
```python
# system prompt
prompt = """\
What is 13.1-3?
<request><SimpleCalculatorTool>13.1-3<call>10.1<response>
Result=10.1<submit>
What is 4*3?
<request><SimpleCalculatorTool>4*3<call>12<response>
Result=12<submit>
What is 12.1+1?
<request><SimpleCalculatorTool>12.1+1<call>13.1<response>
Result=13.1<submit>
What is 12.1-20?
<request><SimpleCalculatorTool>12.1-20<call>-7.9<response>
Result=-7.9<submit>"""
```
4. Create a `trl.TextEnvironment` with the model:
```python
env = TextEnvironment(
model,
tokenizer,
{"SimpleCalculatorTool": tool_fn},
reward_fn,
prompt,
generation_kwargs=generation_kwargs,
)
```
5. Then generate some data such as `tasks = ["\n\nWhat is 13.1-3?", "\n\nWhat is 4*3?"]` and run the environment with `queries, responses, masks, rewards, histories = env.run(tasks)`. The environment will look for the `<call>` token in the prompt and append the tool output to the response; it will also return the mask associated with the response. You can further use the `histories` to visualize the interaction between the model and the tool; `histories[0].show_text()` will show the text with color-coded tool output and `histories[0].show_tokens(tokenizer)` will visualize the tokens.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/learning_tools.png)
6. Finally, we can train the model with `train_stats = ppo_trainer.step(queries, responses, rewards, masks)`. The trainer will use the mask to ignore the tool output when computing the loss; make sure to pass that argument to `step`.
## Experiment results
We trained a model with the above script for 10 random seeds. You can reproduce the run with the following command. Feel free to remove the `--slurm-*` arguments if you don't have access to a slurm cluster.
```
WANDB_TAGS="calculator_final" python benchmark/benchmark.py \
--command "python examples/research_projects/tools/calculator.py" \
--num-seeds 10 \
--start-seed 1 \
--workers 10 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 8 \
--slurm-template-path benchmark/trl.slurm_template
```
We can then use [`openrlbenchmark`](https://github.com/openrlbenchmark/openrlbenchmark) which generates the following plot.
```
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=openrlbenchmark&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.tracker_project_name&cen=trl_ppo_trainer_config.value.log_with&metrics=env/reward_mean&metrics=objective/kl' \
'wandb?tag=calculator_final&cl=calculator_mask' \
--env-ids trl \
--check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename static/0compare \
--scan-history
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/learning_tools_chart.png)
As we can see, while 1-2 experiments crashed for some reason, most of the runs achieved near-perfect proficiency in the calculator task.
## (Early Experiments 🧪): learning to use a wiki tool for question answering
The [ToolFormer](https://arxiv.org/abs/2302.04761) paper shows an interesting use case that utilizes a Wikipedia search tool to help answer questions. In this section, we attempt to perform similar experiments but use RL instead to teach the model to use a wiki tool on the [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) dataset.
<Tip warning={true}>
**Note that many settings are different so the results are not directly comparable.**
</Tip>
### Building a search index
Since [ToolFormer](https://arxiv.org/abs/2302.04761) did not open-source its code, we needed to first replicate the search index. The paper mentions that the authors built the search index using a BM25 retriever that indexes the Wikipedia dump from [KILT](https://github.com/facebookresearch/KILT).
Fortunately, [`pyserini`](https://github.com/castorini/pyserini) already implements the BM25 retriever and provides a prebuilt index for the KILT Wikipedia dump. We can use the following code to search the index.
```python
from pyserini.search.lucene import LuceneSearcher
import json
searcher = LuceneSearcher.from_prebuilt_index('wikipedia-kilt-doc')
def search(query):
hits = searcher.search(query, k=1)
hit = hits[0]
contents = json.loads(hit.raw)['contents']
return contents
print(search("tennis racket"))
```
```
Racket (sports equipment)
A racket or racquet is a sports implement consisting of a handled frame with an open hoop across which a network of strings or catgut is stretched tightly. It is used for striking a ball or shuttlecock in games such as squash, tennis, racquetball, and badminton. Collectively, these games are known as racket sports. Racket design and manufacturing has changed considerably over the centuries.
The frame of rackets for all sports was traditionally made of solid wood (later laminated wood) and the strings of animal intestine known as catgut. The traditional racket size was limited by the strength and weight of the wooden frame which had to be strong enough to hold the strings and stiff enough to hit the ball or shuttle. Manufacturers started adding non-wood laminates to wood rackets to improve stiffness. Non-wood rackets were made first of steel, then of aluminum, and then carbon fiber composites. Wood is still used for real tennis, rackets, and xare. Most rackets are now made of composite materials including carbon fiber or fiberglass, metals such as titanium alloys, or ceramics.
...
```
We then basically deployed this snippet as a Hugging Face space [here](https://huggingface.co/spaces/vwxyzjn/pyserini-wikipedia-kilt-doc), so that we can use the space as a `transformers.Tool` later.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pyserini.png)
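One plausible way to consume the space, mirroring the calculator example above (treat this as a sketch rather than the exact script code):
```python
from transformers import load_tool

# Load the deployed pyserini space as a tool and query it like a function.
wiki_tool = load_tool("vwxyzjn/pyserini-wikipedia-kilt-doc")
print(wiki_tool("tennis racket")[:200])
```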
### Experiment settings
We use the following settings:
* use the `bigcode/starcoderbase` model as the base model
* use the `pyserini-wikipedia-kilt-doc` space as the wiki tool and only use the first paragraphs of the search result, allowing the `TextEnvironment` to obtain at most `max_tool_reponse=400` response tokens from the tool.
* test if the response contains the answer string; if so, give a reward of 1, otherwise give a reward of 0.
* notice this is a simplified evaluation criterion. In [ToolFormer](https://arxiv.org/abs/2302.04761), the authors check if the first 20 words of the response contain the correct answer.
* use the following prompt that demonstrates the usage of the wiki tool.
```python
prompt = """\
Answer the following question:
Q: In which branch of the arts is Patricia Neary famous?
A: Ballets
A2: <request><Wiki>Patricia Neary<call>Patricia Neary (born October 27, 1942) is an American ballerina, choreographer and ballet director, who has been particularly active in Switzerland. She has also been a highly successful ambassador for the Balanchine Trust, bringing George Balanchine's ballets to 60 cities around the globe.<response>
Result=Ballets<submit>
Q: Who won Super Bowl XX?
A: Chicago Bears
A2: <request><Wiki>Super Bowl XX<call>Super Bowl XX was an American football game between the National Football Conference (NFC) champion Chicago Bears and the American Football Conference (AFC) champion New England Patriots to decide the National Football League (NFL) champion for the 1985 season. The Bears defeated the Patriots by the score of 46–10, capturing their first NFL championship (and Chicago's first overall sports victory) since 1963, three years prior to the birth of the Super Bowl. Super Bowl XX was played on January 26, 1986 at the Louisiana Superdome in New Orleans.<response>
Result=Chicago Bears<submit>
Q: """
```
### Result and Discussion
Our experiments show that the agent can learn to use the wiki tool to answer questions. The learning curves mostly trend upward, but one of the experiments crashed.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/triviaqa_learning_curves.png)
Wandb report is [here](https://wandb.ai/costa-huang/cleanRL/reports/TriviaQA-Final-Experiments--Vmlldzo1MjY0ODk5) for further inspection.
Note that the correct rate of the trained model is on the low end, which could be due to the following reasons:
* **incorrect searches:** When given the question `"What is Bruce Willis' real first name?"`, if the model searches for `Bruce Willis`, our wiki tool returns "Patrick Poivey (born 18 February 1948) is a French actor. He is especially known for his voice: he is the French dub voice of Bruce Willis since 1988." But a correct search result would be "Walter Bruce Willis (born March 19, 1955) is an American former actor. He achieved fame with a leading role on the comedy-drama series Moonlighting (1985–1989) and appeared in over a hundred films, gaining recognition as an action hero after his portrayal of John McClane in the Die Hard franchise (1988–2013) and other roles."
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/real_first_name.png)
* **unnecessarily long response**: The wiki tool by default sometimes outputs very long sequences. E.g., when the wiki tool searches for "Brown Act":
* Our wiki tool returns "The Ralph M. Brown Act, located at California Government Code 54950 "et seq.", is an act of the California State Legislature, authored by Assemblymember Ralph M. Brown and passed in 1953, that guarantees the public's right to attend and participate in meetings of local legislative bodies."
* [ToolFormer](https://arxiv.org/abs/2302.04761)'s wiki tool returns "The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies." which is more succinct.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/brown_act.png)
## (Early Experiments 🧪): solving math puzzles with python interpreter
In this section, we attempt to teach the model to use a python interpreter to solve math puzzles. The rough idea is to give the agent a prompt like the following:
```python
prompt = """\
Example of using a Python API to solve math questions.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
<request><PythonInterpreter>
def solution():
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
money_left = money_initial - money_spent
result = money_left
return result
print(solution())
<call>8<response>
Result = 8 <submit>
Q: """
```
The training experiment can be found at https://wandb.ai/lvwerra/trl-gsm8k/runs/a5odv01y.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gms8k_learning_curve.png)

docs/source/logging.mdx

@ -0,0 +1,75 @@
# Logging
As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging.
By default, the TRL [`PPOTrainer`] saves a lot of relevant information to `wandb` or `tensorboard`.
Upon initialization, pass one of these two options to the [`PPOConfig`]:
```python
config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)
```
If you want to log with tensorboard, add the kwarg `project_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
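For example, a minimal tensorboard setup might look like this (the model name and logging directory are placeholders):
```python
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./ppo_logs"},  # where tensorboard event files are written
)
```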
## PPO Logging
Here's a brief explanation for the logged metrics provided in the data:
Key metrics to monitor. We want to maximize the reward, maintain a low KL divergence, and maximize entropy:
1. `env/reward_mean`: The average reward obtained from the environment. Alias `ppo/mean_scores`, which is used to specifically monitor the reward model.
1. `env/reward_std`: The standard deviation of the reward obtained from the environment. Alias `ppo/std_scores`, which is used to specifically monitor the reward model.
1. `env/reward_dist`: The histogram distribution of the reward obtained from the environment.
1. `objective/kl`: The mean Kullback-Leibler (KL) divergence between the old and new policies. It measures how much the new policy deviates from the old policy. The KL divergence is used to compute the KL penalty in the objective function.
1. `objective/kl_dist`: The histogram distribution of the `objective/kl`.
1. `objective/kl_coef`: The coefficient for Kullback-Leibler (KL) divergence in the objective function.
1. `ppo/mean_non_score_reward`: The **KL penalty** calculated by `objective/kl * objective/kl_coef` as the total reward for optimization to prevent the new policy from deviating too far from the old policy.
1. `objective/entropy`: The entropy of the model's policy, calculated by `-logprobs.sum(-1).mean()`. High entropy means the model's actions are more random, which can be beneficial for exploration.
Training stats:
1. `ppo/learning_rate`: The learning rate for the PPO algorithm.
1. `ppo/policy/entropy`: The entropy of the model's policy, calculated by `pd = torch.nn.functional.softmax(logits, dim=-1); entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)`. It measures the randomness of the policy.
1. `ppo/policy/clipfrac`: The fraction of probability ratios (old policy / new policy) that fell outside the clipping range in the PPO objective. This can be used to monitor the optimization process.
1. `ppo/policy/approxkl`: The approximate KL divergence between the old and new policies, measured by `0.5 * masked_mean((logprobs - old_logprobs) ** 2, mask)`, corresponding to the `k2` estimator in http://joschu.net/blog/kl-approx.html
1. `ppo/policy/policykl`: Similar to `ppo/policy/approxkl`, but measured by `masked_mean(old_logprobs - logprobs, mask)`, corresponding to the `k1` estimator in http://joschu.net/blog/kl-approx.html
1. `ppo/policy/ratio`: The histogram distribution of the ratio between the new and old policies, used to compute the PPO objective.
1. `ppo/policy/advantages_mean`: The average of the GAE (Generalized Advantage Estimation) advantage estimates. The advantage function measures how much better an action is compared to the average action at a state.
1. `ppo/policy/advantages`: The histogram distribution of `ppo/policy/advantages_mean`.
1. `ppo/returns/mean`: The mean of the TD(λ) returns, calculated by `returns = advantage + values`, another indicator of model performance. See https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for more details.
1. `ppo/returns/var`: The variance of the TD(λ) returns, calculated by `returns = advantage + values`, another indicator of model performance.
1. `ppo/val/mean`: The mean of the values, used to monitor the value function's performance.
1. `ppo/val/var` : The variance of the values, used to monitor the value function's performance.
1. `ppo/val/var_explained`: The explained variance for the value function, used to monitor the value function's performance.
1. `ppo/val/clipfrac`: The fraction of the value function's predicted values that are clipped.
1. `ppo/val/vpred`: The predicted values from the value function.
1. `ppo/val/error`: The mean squared error between the `ppo/val/vpred` and returns, used to monitor the value function's performance.
1. `ppo/loss/policy`: The policy loss for the Proximal Policy Optimization (PPO) algorithm.
1. `ppo/loss/value`: The loss for the value function in the PPO algorithm. This value quantifies how well the function estimates the expected future rewards.
1. `ppo/loss/total`: The total loss for the PPO algorithm. It is the sum of the policy loss and the value function loss.
Stats on queries, responses, and logprobs:
1. `tokens/queries_len_mean`: The average length of the queries tokens.
1. `tokens/queries_len_std`: The standard deviation of the length of the queries tokens.
1. `tokens/queries_dist`: The histogram distribution of the length of the queries tokens.
1. `tokens/responses_len_mean`: The average length of the responses tokens.
1. `tokens/responses_len_std`: The standard deviation of the length of the responses tokens.
1. `tokens/responses_dist`: The histogram distribution of the length of the responses tokens. (Costa: inconsistent naming, should be `tokens/responses_len_dist`)
1. `objective/logprobs`: The histogram distribution of the log probabilities of the actions taken by the model.
1. `objective/ref_logprobs`: The histogram distribution of the log probabilities of the actions taken by the reference model.
### Crucial values
During training, many values are logged, here are the most important ones:
1. `env/reward_mean`,`env/reward_std`, `env/reward_dist`: the properties of the reward distribution from the "environment" / reward model
1. `ppo/mean_non_score_reward`: The mean negated KL penalty during training (shows the delta between the reference model and the new policy over the batch in the step)
Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning variables):
1. `ppo/loss/value`: it will spike / NaN when not going well.
1. `ppo/policy/ratio`: `ratio` being 1 is a baseline value, meaning that the probability of sampling a token is the same under the new and old policy. If the ratio is too high like 200, it means the probability of sampling a token is 200 times higher under the new policy than the old policy. This is a sign that the new policy is too different from the old policy, which will likely cause overoptimization and collapse training later on.
1. `ppo/policy/clipfrac` and `ppo/policy/approxkl`: if `ratio` is too high, the `ratio` is going to get clipped, resulting in high `clipfrac` and high `approxkl` as well.
1. `objective/kl`: it should stay positive so that the policy is not too far away from the reference policy.
1. `objective/kl_coef`: The target coefficient with [`AdaptiveKLController`]. Often increases before numerical instabilities.


@ -0,0 +1,144 @@
# Examples of using peft with trl to finetune 8-bit models with Low Rank Adaption (LoRA)
The notebooks and scripts in these examples show how to use Low Rank Adaptation (LoRA) to fine-tune models in a memory-efficient manner. Most PEFT methods supported in the peft library are covered, but note that some methods, such as prompt tuning, are not supported.
For more information on LoRA, see the [original paper](https://arxiv.org/abs/2106.09685).
Here's an overview of the `peft`-enabled notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):
| File | Task | Description | Colab link |
|---|---|---|---|
| [`stack_llama/rl_training.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. | |
| [`stack_llama/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. | |
| [`stack_llama/supervised_finetuning.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. | |
## Installation
Note: peft is in active development, so we install directly from their GitHub page.
Peft also relies on the latest version of transformers.
```bash
pip install trl[peft]
pip install bitsandbytes loralib
pip install git+https://github.com/huggingface/transformers.git@main
#optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## How to use it?
Simply declare a `PeftConfig` object in your script and pass it through `.from_pretrained` to load the TRL+PEFT model.
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead
model_id = "edbeeching/gpt-neo-125M-imdb"
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
model_id,
peft_config=lora_config,
)
```
And if you want to load your model in 8bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
load_in_8bit=True,
peft_config=lora_config,
)
```
... or in 4bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_4bit=True,
)
```
## Launch scripts
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment_peft.py # launches training
```
## Using `trl` + `peft` and Data Parallelism
You can scale up to as many GPUs as you want, as long as you are able to fit the training process in a single device. The only tweak you need to apply is to load the model as follows:
```python
from peft import LoraConfig
...
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
)
```
And if you want to load your model in 8bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_8bit=True,
)
```
... or in 4bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_4bit=True,
)
```
Finally, make sure that the rewards are computed on the correct device as well; for that, you can use `ppo_trainer.model.current_device`.
## Naive pipeline parallelism (NPP) for large models (>60B models)
The `trl` library also supports naive pipeline parallelism (NPP) for large models (>60B parameters). In this paradigm, the model and the adapters are loaded across multiple GPUs, and the activations and gradients are naively communicated across the GPUs. This supports `int8` models as well as other `dtype` models.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-npp.png">
</div>
### How to use NPP?
Simply load your model with a custom `device_map` argument on the `from_pretrained` to split your model across multiple devices. Check out this [nice tutorial](https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) on how to properly create a `device_map` for your model.
Also make sure to have the `lm_head` module on the first GPU device as it may throw an error if it is not on the first device. At the time of writing, you need to install the `main` branch of `accelerate`: `pip install git+https://github.com/huggingface/accelerate.git@main` and `peft`: `pip install git+https://github.com/huggingface/peft.git@main`.
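A hedged sketch of what such a load might look like; the checkpoint, LoRA settings, and reliance on `device_map="auto"` are assumptions rather than the prescribed recipe:
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "huggyllama/llama-65b",   # placeholder large checkpoint
    peft_config=lora_config,
    load_in_8bit=True,
    device_map="auto",        # or a hand-written dict mapping module names to GPU indices
)
# Inspect the inferred split and make sure `lm_head` ended up on the first GPU.
print(model.pretrained_model.hf_device_map)
```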
### Launch scripts
Although the `trl` library is powered by `accelerate`, you should run your training script in a single process. Note that we do not support Data Parallelism together with NPP yet.
```bash
python PATH_TO_SCRIPT
```
## Fine-tuning Llama-2 model
You can easily fine-tune the Llama 2 model using `SFTTrainer` and the official script! For example, to fine-tune llama2-7b on the Guanaco dataset, run (tested on a single NVIDIA T4-16GB):
```bash
python examples/scripts/sft.py --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --batch_size 4 --gradient_accumulation_steps 2
```


@ -0,0 +1,100 @@
# Multi Adapter RL (MARL) - a single base model for everything
Here we present an approach that uses a single base model for the entire PPO algorithm - which includes retrieving the reference logits, computing the active logits and the rewards. This feature is experimental as we have not tested the convergence of the approach. We encourage the community to let us know if they run into any issues.
## Requirements
You just need to install `peft` and, optionally, `bitsandbytes` if you want to use 8-bit base models for more memory-efficient fine-tuning.
## Summary
You need to address this approach in three stages that we summarize as follows:
1- Train a base model on the target domain (e.g. `imdb` dataset) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py)
3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")
Make sure to use the same model (i.e. same architecture and same weights) for stages 2 & 3.
## Quickstart
Let us assume you have trained your reward adapter on the `llama-7b` model using `RewardTrainer` and pushed the weights to the Hub under `trl-lib/llama-7b-hh-rm-adapter`.
When doing PPO, before passing the model to `PPOTrainer` create your model as follows:
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"
# PPO adapter
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
model_name,
peft_config=lora_config,
reward_adapter=rm_adapter_id,
)
...
trainer = PPOTrainer(
model=model,
...
)
...
```
Then, inside your PPO training loop, call the `compute_reward_score` method by accessing the `model` attribute of the `PPOTrainer`.
```python
rewards = trainer.model.compute_reward_score(**inputs)
```
## Advanced usage
### Control on the adapter name
If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. What you can do is to train multiple adapters on the same base model to fine-tune on different policies.
In this case, you want control over which adapter to activate again after retrieving the reward. For that, simply pass the appropriate adapter name to the `ppo_adapter_name` argument when calling `compute_reward_score`.
```python
adapter_name_policy_1 = "policy_1"
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
...
```
### Using 4-bit and 8-bit base models
For more memory efficient fine-tuning, you can load your base model in 8-bit or 4-bit while keeping the adapters in the default precision (float32).
Just pass the appropriate arguments (i.e. `load_in_8bit=True` or `load_in_4bit=True`) to `AutoModelForCausalLMWithValueHead.from_pretrained` as follows (assuming you have installed `bitsandbytes`):
```python
model_name = "llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"
# PPO adapter
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
model_name,
peft_config=lora_config,
reward_adapter=rm_adapter_id,
load_in_8bit=True,
)
...
trainer = PPOTrainer(
model=model,
...
)
...
```

docs/source/ppo_trainer.mdx

@ -0,0 +1,151 @@
# PPO Trainer
TRL supports the [PPO](https://arxiv.org/abs/1707.06347) Trainer for training language models on any reward signal with RL. The reward signal can come from a handcrafted rule, a metric or from preference data using a Reward Model. For a full example have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). The trainer is heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).
The first step is to train your SFT model (see the [SFTTrainer](sft_trainer)), to ensure the data we train on is in-distribution for the PPO algorithm. In addition we need to train a Reward model (see [RewardTrainer](reward_trainer)) which will be used to optimize the SFT model using the PPO algorithm.
## Expected dataset format
The `PPOTrainer` expects to align a generated response with a query given the rewards obtained from the Reward model. During each step of the PPO algorithm we sample a batch of prompts from the dataset, then use these prompts to generate responses from the SFT model. Next, the Reward model is used to compute the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model using the PPO algorithm.
Therefore the dataset should contain a text column which we can rename to `query`. Each of the other data points required to optimize the SFT model is obtained during the training loop.
Here is an example with the [HuggingFaceH4/cherry_picked_prompts](https://huggingface.co/datasets/HuggingFaceH4/cherry_picked_prompts) dataset:
```py
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])
```
Resulting in the following subset of the dataset:
```py
ppo_dataset_dict = {
"query": [
"Explain the moon landing to a 6 year old in a few sentences.",
"Why arent birds real?",
"What happens if you fire a cannonball directly at a pumpkin at high speeds?",
"How can I steal from a grocery store without getting caught?",
"Why is it important to eat socks after meditating? "
]
}
```
## Using the `PPOTrainer`
For a detailed example have a look at the [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook. At a high level we need to initialize the `PPOTrainer` with a `model` we wish to train. Additionally, we require a `reward_model` which we will use to rate the generated responses.
### Initializing the `PPOTrainer`
The `PPOConfig` dataclass controls all the hyperparameters and settings for the PPO algorithm and trainer.
```py
from trl import PPOConfig
config = PPOConfig(
model_name="gpt2",
learning_rate=1.41e-5,
)
```
Now we can initialize our model. Note that PPO also requires a reference model, but this model is generated by the `PPOTrainer` automatically. The model can be initialized as follows:
```py
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
```
As mentioned above, the reward can be generated using any function that returns a single value for a string, be it a simple rule (e.g. length of string), a metric (e.g. BLEU), or a reward model based on human preferences. In this example we use a reward model and initialize it using `transformers.pipeline` for ease of use.
```py
from transformers import pipeline
reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")
```
Lastly, we pretokenize our dataset using the `tokenizer` to ensure we can efficiently generate responses during the training loop:
```py
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["query"])
return sample
dataset = dataset.map(tokenize, batched=False)
```
Now we are ready to initialize the `PPOTrainer` using the defined config, datasets, and model.
```py
from trl import PPOTrainer
ppo_trainer = PPOTrainer(
model=model,
config=config,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
```
### Starting the training loop
Because the `PPOTrainer` needs an active `reward` per execution step, we need to define a method to get rewards during each step of the PPO algorithm. In this example we will be using the sentiment `reward_model` initialized above.
To guide the generation process we use the `generation_kwargs` which are passed to the `model.generate` method for the SFT-model during each step. A more detailed example can be found over [here](how_to_train#how-to-generate-text-for-training).
```py
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
}
```
We can then loop over all examples in the dataset and generate a response for each query. We then calculate the reward for each generated response using the `reward_model` and pass these rewards to the `ppo_trainer.step` method. The `ppo_trainer.step` method will then optimize the SFT model using the PPO algorithm.
```py
import torch
from tqdm import tqdm
for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
for batch in tqdm(ppo_trainer.dataloader):
query_tensors = batch["input_ids"]
#### Get response from SFTModel
response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
#### Compute reward score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = reward_model(texts)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
#### Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
#### Save model
ppo_trainer.save_model("my_ppo_model")
```
## Logging
While training and evaluating we log the following metrics:
- `stats`: The statistics of the PPO algorithm, including the loss, entropy, etc.
- `batch`: The batch of data used to train the SFT model.
- `rewards`: The rewards obtained from the Reward model.
## PPOTrainer
[[autodoc]] PPOTrainer
[[autodoc]] PPOConfig


@ -4,9 +4,9 @@
Fine-tuning a language model via PPO consists of roughly three steps:
1. **Rollout**: The language model generates a response or continuation based on query which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate to far from the reference language model. The active language model is then trained with PPO.
1. **Rollout**: The language model generates a response or continuation based on a query which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback, or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
The full process is illustrated in the following figure:
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png"/>
@ -19,36 +19,46 @@ The following code illustrates the steps above.
# 0. imports
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# 2. initialize trainer
ppo_config = {'batch_size': 1}
ppo_config = {"batch_size": 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
# 3. encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(model.pretrained_model.device)
# 4. generate model response
response_tensor = respond_to_batch(model, query_tensor)
response_txt = tokenizer.decode(response_tensor[0,:])
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"max_new_tokens": 20,
}
response_tensor = ppo_trainer.generate([item for item in query_tensor], return_prompt=False, **generation_kwargs)
response_txt = tokenizer.decode(response_tensor[0])
# 5. define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]
reward = [torch.tensor(1.0, device=model.pretrained_model.device)]
# 6. train model with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```
In general, you would run steps 3-6 in a for-loop and run it on many diverse queries. You can find a more realistic examples in the examples section.
In general, you would run steps 3-6 in a for-loop and run it on many diverse queries. You can find more realistic examples in the examples section.
## How to use a trained model
@ -69,10 +79,10 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model-ppo")
```
You can also load your model with `AutoModelForCausalLMWithValueHead` if you want to use the value head, for example to continue a training.
You can also load your model with `AutoModelForCausalLMWithValueHead` if you want to use the value head, for example to continue training.
```python
from trl.model import AutoModelForCausalLMWithValueHead
model = AutoModelForCausalLMWithValueHead.from_pretrained("my-fine-tuned-model-ppo")
```
```


@ -0,0 +1,77 @@
# Reward Modeling
TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
Check out a complete flexible example at [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).
## Expected dataset format
The [`RewardTrainer`] expects a very specific format for the dataset since the model will be trained on pairs of examples to predict which of the two is preferred. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png" width="50%">
</div>
Therefore the final dataset object should contain at least these four entries if you use the default [`RewardDataCollatorWithPadding`] data collator (a sketch of producing them follows the list). The entries should be named:
- `input_ids_chosen`
- `attention_mask_chosen`
- `input_ids_rejected`
- `attention_mask_rejected`
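For illustration, here is a minimal preprocessing sketch that produces these entries from a pairwise-preference dataset; the `chosen`/`rejected` column names follow the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) layout, and the tokenizer choice is just an example:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def preprocess(examples):
    # Tokenize the preferred ("chosen") and non-preferred ("rejected") texts
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(preprocess, batched=True)
```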
## Using the `RewardTrainer`
After preparing your dataset, you can use the [`RewardTrainer`] in the same way as the `Trainer` class from 🤗 Transformers.
You should pass an `AutoModelForSequenceClassification` model to the [`RewardTrainer`], along with a [`RewardConfig`] which configures the hyperparameters of the training.
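For illustration, a minimal (non-PEFT) setup might look like the sketch below, assuming `dataset` already contains the four tokenized columns described above; the hyperparameter values are placeholders:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# A sequence classifier with a single output is used as the reward model
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

training_args = RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    max_length=512,  # maximum sequence length handled by the default collator
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,  # the preprocessed dataset from the sketch above
)
trainer.train()
```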
### Leveraging 🤗 PEFT to train a reward model
Just pass a `peft_config` in the keyword arguments of [`RewardTrainer`], and the trainer should automatically take care of converting the model into a PEFT model!
```python
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
...
trainer = RewardTrainer(
model=model,
args=training_args,
tokenizer=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
)
trainer.train()
```
### Adding a margin to the loss
As in the [Llama 2 paper](https://huggingface.co/papers/2307.09288), you can add a margin to the loss by adding a `margin` column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.
```python
def add_margin(row):
# Assume you have a score_chosen and score_rejected columns that you want to use to compute the margin
return {'margin': row['score_chosen'] - row['score_rejected']}
dataset = dataset.map(add_margin)
```
## RewardConfig
[[autodoc]] RewardConfig
## RewardTrainer
[[autodoc]] RewardTrainer


@ -1,35 +1,130 @@
# Sentiment Tuning Examples
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):
| File | Description |
|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | This script shows how to use the `PPOTrainer` to fine-tune a sentiment analysis model using IMDB dataset |
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 imdb sentiment tuning example on a jupyter notebook. |
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example on a jupyter notebook. |
## Usage
```bash
pip install trl
#optional: wandb
pip install wandb
# 1. run directly
python examples/scripts/ppo.py
# 2. run via `accelerate` (recommended), enabling more features (e.g., multiple GPUs, deepspeed)
accelerate config # will prompt you to define the training configuration
accelerate launch examples/scripts/ppo.py # launches training
# 3. get help text and documentation
python examples/scripts/ppo.py --help
# 4. configure logging with wandb and, say, mini_batch_size=1 and gradient_accumulation_steps=16
python examples/scripts/ppo.py --ppo_config.log_with wandb --ppo_config.mini_batch_size 1 --ppo_config.gradient_accumulation_steps 16
```
Note: if you don't want to log with `wandb`, remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## Few notes on multi-GPU
To run in a multi-GPU setup with DDP (Distributed Data Parallel), change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`, as sketched below. If you want to apply naive pipeline parallelism, you can use `device_map="auto"` instead.
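For illustration, a minimal sketch of the DDP loading pattern described above (the model name is just a placeholder):
```python
from accelerate import Accelerator
from trl import AutoModelForCausalLMWithValueHead

# DDP: place one full copy of the model on this process' GPU
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map={"": Accelerator().process_index},
)

# Naive pipeline parallelism instead: spread the layers over all visible GPUs
# model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", device_map="auto")
```
The script is then launched with `accelerate launch yourscript.py`, as noted above.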
## Benchmarks
Below are some benchmark results for `examples/scripts/ppo.py`. To reproduce locally, please check out the `--command` arguments below.
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/sentiment.png)
## With and without gradient accumulation
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name sentiment_tuning_step_grad_accu --ppo_config.mini_batch_size 1 --ppo_config.gradient_accumulation_steps 128 --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/gradient_accu.png)
## Comparing different models (gpt2, gpt2-xl, falcon, llama2)
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name sentiment_tuning_gpt2 --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name sentiment_tuning_gpt2xl_grad_accu --ppo_config.model_name gpt2-xl --ppo_config.mini_batch_size 16 --ppo_config.gradient_accumulation_steps 8 --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name sentiment_tuning_falcon_rw_1b --ppo_config.model_name tiiuae/falcon-rw-1b --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/different_models.png)
## With and without PEFT
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --ppo_config.exp_name sentiment_tuning_peft --use_peft --ppo_config.log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/peft.png)

docs/source/sft_trainer.mdx Normal file

@ -0,0 +1,487 @@
# Supervised Fine-tuning Trainer
Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.
Check out a complete flexible example at [`examples/scripts/sft.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/sft.py).
## Quickstart
If you have a dataset hosted on the 🤗 Hub, you can easily fine-tune your SFT model using [`SFTTrainer`] from TRL. Let us assume your dataset is `imdb`, the text you want to predict is inside the `text` field of the dataset, and you want to fine-tune the `facebook/opt-350m` model.
The following code-snippet takes care of all the data pre-processing and training for you:
```python
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("imdb", split="train")
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
```
Make sure to pass a correct value for `max_seq_length` as the default value will be set to `min(tokenizer.model_max_length, 1024)`.
You can also construct a model outside of the trainer and pass it as follows:
```python
from transformers import AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
trainer = SFTTrainer(
model,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
```
The above snippets will use the default training arguments from the [`transformers.TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class. If you want to modify that, make sure to create your own `TrainingArguments` object and pass it to the [`SFTTrainer`] constructor, as is done in the [`supervised_finetuning.py` script](https://github.com/huggingface/trl/blob/main/examples/stack_llama/scripts/supervised_finetuning.py) in the stack-llama example.
## Advanced usage
### Train on completions only
You can use the `DataCollatorForCompletionOnlyLM` to train your model on the completions only. Note that this only works when `packing=False`.
To instantiate that collator for instruction data, pass a response template and the tokenizer. Here is an example of how it would work to fine-tune `opt-350m` on completions only on the CodeAlpaca dataset:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
def formatting_prompts_func(example):
output_texts = []
for i in range(len(example['instruction'])):
text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
output_texts.append(text)
return output_texts
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
trainer = SFTTrainer(
model,
train_dataset=dataset,
formatting_func=formatting_prompts_func,
data_collator=collator,
)
trainer.train()
```
To instantiate that collator for assistant style conversation data, pass a response template, an instruction template and the tokenizer. Here is an example of how it would work to fine-tune `opt-350m` on assistant completions only on the Open Assistant Guanaco dataset:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
instruction_template = "### Human:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)
trainer = SFTTrainer(
model,
train_dataset=dataset,
dataset_text_field="text",
data_collator=collator,
)
trainer.train()
```
Make sure to have a `pad_token_id` that is different from `eos_token_id`; otherwise the model may not properly learn to predict EOS (End of Sentence) tokens during generation. One way to set this up is sketched below.
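For example, one way to register a dedicated padding token (the token string is only illustrative):
```python
# Give the tokenizer a pad token that differs from the EOS token
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))  # account for the newly added token
model.config.pad_token_id = tokenizer.pad_token_id
```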
#### Using token_ids directly for `response_template`
Some tokenizers like Llama 2 (`meta-llama/Llama-2-XXb-hf`) tokenize sequences differently depending whether they have context or not. For example:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
def print_tokens_with_ids(txt):
tokens = tokenizer.tokenize(txt, add_special_tokens=False)
token_ids = tokenizer.encode(txt, add_special_tokens=False)
print(list(zip(tokens, token_ids)))
prompt = """### User: Hello\n\n### Assistant: Hi, how can I help you?"""
print_tokens_with_ids(prompt) # [..., ('▁Hello', 15043), ('<0x0A>', 13), ('<0x0A>', 13), ('##', 2277), ('#', 29937), ('▁Ass', 4007), ('istant', 22137), (':', 29901), ...]
response_template = "### Assistant:"
print_tokens_with_ids(response_template) # [('▁###', 835), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]
```
In this case, and due to lack of context in `response_template`, the same string ("### Assistant:") is tokenized differently:
- Text (with context): `[2277, 29937, 4007, 22137, 29901]`
- `response_template` (without context): `[835, 4007, 22137, 29901]`
This will lead to an error when the `DataCollatorForCompletionOnlyLM` does not find the `response_template` in the dataset example text:
```
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
```
To solve this, you can tokenize the `response_template` with the same context as in the dataset, truncate it as needed, and pass the `token_ids` directly to the `response_template` argument of the `DataCollatorForCompletionOnlyLM` class. For example:
```python
response_template_with_context = "\n### Assistant:" # We added context here: "\n". This is enough for this tokenizer
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:] # Now we have it like in the dataset texts: `[2277, 29937, 4007, 22137, 29901]`
data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
```
### Format your input prompts
For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response.
This allows people to format examples like [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca) did as follows:
```bash
Below is an instruction ...
### Instruction
{prompt}
### Response:
{completion}
```
Let us assume your dataset has two fields, `question` and `answer`. Therefore you can just run:
```python
...
def formatting_prompts_func(example):
output_texts = []
for i in range(len(example['question'])):
text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
output_texts.append(text)
return output_texts
trainer = SFTTrainer(
model,
train_dataset=dataset,
formatting_func=formatting_prompts_func,
)
trainer.train()
```
To properly format your input, make sure to process all the examples by looping over them and returning a list of processed texts. Check out a full example of how to use the SFTTrainer on the alpaca dataset [here](https://github.com/huggingface/trl/pull/444#issue-1760952763)
### Packing dataset ([`ConstantLengthDataset`])
[`SFTTrainer`] supports _example packing_, where multiple short examples are packed in the same input sequence to increase training efficiency. This is done with the [`ConstantLengthDataset`] utility class that returns constant length chunks of tokens from a stream of examples. To enable the usage of this dataset class, simply pass `packing=True` to the [`SFTTrainer`] constructor.
```python
...
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
packing=True
)
trainer.train()
```
Note that if you use a packed dataset and pass `max_steps` in the training arguments, you will probably train your model for more than a few epochs, depending on how you have configured the packed dataset and the training protocol. Double check that you know and understand what you are doing.
#### Customize your prompts using packed dataset
If your dataset has several fields that you want to combine, for example if the dataset has `question` and `answer` fields and you want to combine them, you can pass a formatting function to the trainer that will take care of that. For example:
```python
def formatting_func(example):
text = f"### Question: {example['question']}\n ### Answer: {example['answer']}"
return text
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
packing=True,
formatting_func=formatting_func
)
trainer.train()
```
You can also customize the [`ConstantLengthDataset`] much more by directly passing the arguments to the [`SFTTrainer`] constructor. Please refer to that class' signature for more information.
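For illustration, a rough sketch of building the packed dataset directly; the argument names below are assumptions based on the class signature, so double-check them against the [`ConstantLengthDataset`] reference:
```python
from trl.trainer import ConstantLengthDataset

packed_dataset = ConstantLengthDataset(
    tokenizer,                  # tokenizer used to encode the texts
    dataset,                    # the raw dataset loaded above
    dataset_text_field="text",  # column containing the text
    seq_length=1024,            # length of each packed chunk
    infinite=False,             # stop after one pass over the data
)

# Each element is a constant-length chunk of token ids
example = next(iter(packed_dataset))
print(example["input_ids"].shape)
```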
### Control over the pretrained model
You can directly pass the kwargs of the `from_pretrained()` method to the [`SFTTrainer`]. For example, if you want to load a model in a different precision, analogous to
```python
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
```
you can pass the corresponding `model_init_kwargs` to the trainer:
```python
...
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
model_init_kwargs={
"torch_dtype": torch.bfloat16,
},
)
trainer.train()
```
Note that all keyword arguments of `from_pretrained()` are supported.
### Training adapters
We also support a tight integration with the 🤗 PEFT library, so that any user can conveniently train adapters and share them on the Hub instead of training the entire model.
```python
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
dataset = load_dataset("imdb", split="train")
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
trainer = SFTTrainer(
"EleutherAI/gpt-neo-125m",
train_dataset=dataset,
dataset_text_field="text",
peft_config=peft_config
)
trainer.train()
```
You can also continue training your `PeftModel`. For that, first load a `PeftModel` outside `SFTTrainer` and pass it directly to the trainer without passing the `peft_config` argument, as sketched below.
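A minimal sketch of that pattern; the adapter path is a placeholder and `dataset` is assumed to be defined as in the snippets above:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
model = PeftModel.from_pretrained(base_model, "path/to/my/adapter", is_trainable=True)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    # no peft_config here: the model is already a PeftModel
)
trainer.train()
```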
### Training adapters with base 8 bit models
For that, you need to first load your 8-bit model outside the trainer and pass a `PeftConfig` to the trainer. For example:
```python
...
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLM.from_pretrained(
"EleutherAI/gpt-neo-125m",
load_in_8bit=True,
device_map="auto",
)
trainer = SFTTrainer(
model,
train_dataset=dataset,
dataset_text_field="text",
peft_config=peft_config,
)
trainer.train()
```
## Using Flash Attention and Flash Attention 2
You can benefit from Flash Attention 1 & 2 using SFTTrainer out of the box with minimal code changes.
First, to make sure you have all the latest features from transformers, install transformers from source:
```bash
pip install -U git+https://github.com/huggingface/transformers.git
```
Note that Flash Attention currently only works on GPU and in a half-precision regime (when using adapters, the base model is loaded in half precision).
Note also that both features are perfectly compatible with other tools such as quantization.
### Using Flash-Attention 1
For Flash Attention 1 you can use the `BetterTransformer` API and force-dispatch the API to use Flash Attention kernel. First, install the latest optimum package:
```bash
pip install -U optimum
```
Once you have loaded your model, wrap the `trainer.train()` call under the `with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):` context manager:
```diff
...
+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
trainer.train()
```
Note that you cannot train your model using Flash Attention 1 on an arbitrary dataset as `torch.scaled_dot_product_attention` does not support training with padding tokens if you use Flash Attention kernels. Therefore you can only use that feature with `packing=True`. If your dataset contains padding tokens, consider switching to Flash Attention 2 integration.
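Putting the pieces together, a rough end-to-end sketch under these constraints (half precision, `packing=True`, `optimum` installed; the model and dataset choices are only examples):
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16)
model = model.to_bettertransformer()  # dispatch attention through BetterTransformer

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,  # required: Flash Attention 1 cannot handle padding tokens
)

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    trainer.train()
```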
Below are some numbers you can get in terms of speedup and memory efficiency, using Flash Attention 1, on a single NVIDIA-T4 16GB.
| use_flash_attn_1 | model_name | max_seq_len | batch_size | time per training step |
|----------------|-------------------|-------------|------------|------------------------|
| x | facebook/opt-350m | 2048 | 8 | ~59.1s |
| | facebook/opt-350m | 2048 | 8 | **OOM** |
| x | facebook/opt-350m | 2048 | 4 | ~30.3s |
| | facebook/opt-350m | 2048 | 4 | ~148.9s |
### Using Flash Attention-2
To use Flash Attention 2, first install the latest `flash-attn` package:
```bash
pip install -U flash-attn
```
And add `use_flash_attention_2=True` when calling `from_pretrained`:
```python
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_4bit=True,
use_flash_attention_2=True
)
```
If you don't use quantization, make sure your model is loaded in half precision and dispatched to a supported GPU device, for example as in the sketch below.
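For example, a non-quantized setup might look like the following sketch (the model id is a placeholder for any architecture with Flash Attention 2 support):
```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative choice
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision
    use_flash_attention_2=True,
).to("cuda")
```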
After loading your model, you can either train it as it is, or attach adapters and train adapters on it in case your model is quantized.
In contrast to Flash Attention 1, the integration makes it possible to train your model on an arbitrary dataset that also includes padding tokens.
### Enhance the model's performance using NEFTune
NEFTune is a technique to boost the performance of chat models and was introduced by the paper ["NEFTune: Noisy Embeddings Improve Instruction Finetuning"](https://arxiv.org/abs/2310.05914) from Jain et al. It consists of adding noise to the embedding vectors during training. According to the abstract of the paper:
> Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/neft-screenshot.png">
</div>
To use it in `SFTTrainer`, simply pass `neftune_noise_alpha` when creating your `SFTTrainer` instance. Note that to avoid any surprising behaviour, NEFTune is disabled after training to restore the original behaviour of the embedding layer.
```python
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("imdb", split="train")
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
neftune_noise_alpha=5,
)
trainer.train()
```
We have tested NEFTune by training `mistralai/Mistral-7B-v0.1` on the [OpenAssistant dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) and validated that using NEFTune led to a performance boost of ~25% on MT Bench.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-neftune-mistral-7b.png">
</div>
Note however, that the amount of performance gain is _dataset dependent_ and in particular, applying NEFTune on synthetic datasets like [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) typically produces smaller gains.
### Accelerate fine-tuning 2x using `unsloth`
You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) and even full-finetuning (1.1x faster) using the [`unsloth`](https://github.com/unslothai/unsloth) library that is compatible with `SFTTrainer`. Currently `unsloth` supports only Llama (Yi, TinyLlama as well) and Mistral architectures.
First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth#installation-instructions---conda). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLlamaModel` or `FastMistralModel` as follows:
```python
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLlamaModel, FastMistralModel
max_seq_length = 2048 # Supports automatic RoPE Scaling, so choose any number.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Currently only supports dropout = 0
bias = "none", # Currently only supports bias = "none"
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
args = TrainingArguments(output_dir="./output")
trainer = SFTTrainer(
model = model,
args = args,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
)
trainer.train()
```
The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).
## Best practices
Pay attention to the following best practices when training a model with that trainer:
- By default, [`SFTTrainer`] pads the sequences to the `max_seq_length` argument of the [`SFTTrainer`]. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide a default value, so there is a check to retrieve the minimum between 2048 and that value. Make sure to check it before training.
- For training adapters in 8bit, you might need to tweak the arguments of the `prepare_model_for_kbit_training` method from PEFT, hence we advise users to use `prepare_in_int8_kwargs` field, or create the `PeftModel` outside the [`SFTTrainer`] and pass it.
- For a more memory-efficient training using adapters, you can load the base model in 8bit, for that simply add `load_in_8bit` argument when creating the [`SFTTrainer`], or create a base model in 8bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure not to pass to the trainer any additional keyword arguments that relate to the `from_pretrained()` method.
## GPTQ Conversion
You may experience some issues with GPTQ Quantization after completing training. Lowering `gradient_accumulation_steps` to `4` will resolve most issues during the quantization process to GPTQ format.
## SFTTrainer
[[autodoc]] SFTTrainer
## ConstantLengthDataset
[[autodoc]] trainer.ConstantLengthDataset


@ -1,30 +0,0 @@
# Summarization Example
The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization).
Here's an overview of the relevant files in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples):
| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |
## Installation
```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```
```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely, then tune a GPT model on summarization and then instantiate the reward model
# with it. In other words, pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the pretrained gpt2 like we do in the following examples of how
# to run this script.
# Example of running this script with the small size gpt2 on a 40GB A100 (A100's support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16
# Example of running this script with the xl size gpt2 on 16 40GB A100's. Here the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```


@ -0,0 +1,197 @@
# Text Environments
Text environments provide a learning ground for language agents. They allow a language model to use tools to accomplish a task, such as using a Python interpreter to answer math questions or using a search index for trivia questions. Having access to tools lets language models solve tasks that would be very hard for the model itself but are trivial for the appropriate tools. A good example is arithmetic on large numbers, which becomes a simple copy-paste task once you have access to a calculator.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv.png">
</div>
Let's dive into how text environments work and start with tools!
## Tools
One of the core building blocks of text environments are the tools the model can use to solve tasks. In general, tools can be any Python function that takes a string as input and returns a string. The `TextEnvironment` offers two options for tools: either go with predefined tools from `transformers.Tool` or define your own function or class with a `__call__` method. Let's have a look at both!
### `transformers.Tool`
Text environments fully support tools of the class `transformers.Tool`. The advantage of building tools in that framework is that they can easily be shared.
```Python
from transformers import load_tool
# simple calculator tool that runs +-/* operations
calc_tool = load_tool("ybelkada/simple-calculator")
# python interpreter that executes program and returns outputs
py_tool = load_tool("lvwerra/python-interpreter")
# wikipedia search index that returns best search match
wiki_tool = load_tool("vwxyzjn/pyserini-wikipedia-kilt-doc")
```
These tools are either loaded from the hub or from a local folder. Using the tool is as simple as calling them with a text query:
```Python
calc_tool("1/2")
>>> "0.5"
```
Note that both input and return values are strings to enable easy usage with a language model.
### Custom Tools
The following is an example of a tool that adds two integers:
```Python
def add(text):
int_1, int_2 = text.split("+")
result = int(int_1) + int(int_2)
return str(result)
print(add("1+1"))
>>> "2"
```
We looked at basic examples, such as a calculator, but the principle holds for more complex tools as well, such as a web search tool where you input the query and get the search results in return. Now let's look at how the model can use the tools with the call syntax.
### Call syntax
In order to have a unified way for the model to call a tool we created a simple syntax that looks as follows:
```python
"<request><TOOL_NAME>QUERY<call>TOOL_RESPONSE<response>"
```
There are a few special tokens involved, so let's decompose it: First, the model can signal that it wants to use a tool by emitting the `<request>` token. After that, we want to know the name of the tool to call, which is done by enclosing the tool name in `<>` brackets. Once we know which tool to call, the tool query follows in free-text form. The `<call>` token signifies the end of the query and stops the model generation. At this point the model output is parsed and the query is sent to the tool. The environment appends the tool response to the string, followed by the `<response>` token to mark the end of the tool output.
Let's look at the concrete example of the calculator and assume its name is `Calculator` (more on how the name of a tool is inferred later):
```python
"<request><Calculator>1/2<call>0.5<response>"
```
Finally, the episode is ended and generation stops when the model generates `<submit>` which marks the interaction as completed.
Now let's have a look at how we can create a new text environment!
## Create a `TextEnvironment`
```python
prompt = """\
What is 13-3?
<request><SimpleCalculatorTool>13-3<call>10.0<response>
Result=10<submit>
"""
def reward_fn(result, answer):
"""Simplified reward function returning 1 if result matches answer and 0 otherwise."""
result_parsed = result.split("=")[1].split("<")[0]
return int(result_parsed==answer)
text_env = TextEnvironment(
    model=model,
    tokenizer=tokenizer,
    tools={"SimpleCalculatorTool": load_tool("ybelkada/simple-calculator")},
    reward_fn=reward_fn,
    prompt=prompt,
    max_turns=1,
    max_tool_response=100,
    generation_kwargs={"do_sample": True},
)
```
Let's decompose the settings:
| Argument | Description |
|:-------------------|:----------------|
| `model` | Language model to interact with the environment and generate requests. |
| `tokenizer` | Tokenizer of language model handling tokenization of strings. |
| `tools` | A `list` or `dict` of tools. If a `list` is passed, the tool names are inferred from the class names; if a `dict` is passed, the keys are the tool names.|
| `reward_fn` | A function that takes a string as input and returns a reward. It can have extra arguments that are passed to `.run()`, such as the ground truth.|
| `prompt` | Prompt to prepend to every task. Usually a few examples to demonstrate to the model how to use the tools in a few-shot fashion. |
| `max_turns` | Maximum number of interactions between model and tools before episode ends.|
| `max_tool_response`| The tool response is truncated to this number to avoid running out of model context.|
| `max_length` | The maximum number of tokens to allow in an episode. |
| `generation_kwargs`| Generation settings used by the language model. |
You can customize the environment to your needs and add custom tools and settings. Let's see how you can use the environment to have the model interact with the available tools!
## Run an Episode
To run a set of queries through the text environment one can simply use the `run` method.
```python
queries = ["What is 1/2?"]
answers = ["0.5"]
queries, responses, masks, rewards, histories = text_env.run(queries, answers=answers)
```
This will execute the model/tool feedback loop for each query until either no tool is called anymore, the maximum number of turns is reached, or the maximum number of tokens in an episode is exceeded. The extra `kwargs` (e.g. `answers=answers` above) passed to `run` will be passed on to the reward function.
There are five objects that are returned by `run`:
- `queries`: a list of the tokenized queries
- `responses`: all tokens that have been generated within the environment, including model and tool tokens
- `masks`: mask that indicates which tokens have been generated by the model and which tokens are generated by the tool
- `rewards`: a list of rewards for each query/response
- `histories`: list of `TextHistory` objects, which are useful objects containing all the above and also the text equivalents
The masks are crucial for training, as we don't want to optimize tokens that the model has not generated, i.e. the tokens produced by the tools.
Next, we'll train a PPO step with the generated responses!
### Train
Training on episodes from the `TextEnvironment` is straightforward and simply requires forwarding all the returned variables except the `TextHistory` objects to the `step` method:
```python
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
```
## `TextHistory`
The `TextHistory` object stores the interactions between the model and the text environment. It stores tokens and text generated in each turn and their source in each turn (model or system) as well as rewards. Let's go through the class attributes and methods.
### Attributes
The following table summarises the available attributes of the `TextHistory` class:
| Attribute | Description |
|:-------------------|:----------------|
| `text` | The full string of the text generated in the text environment with both model and system generated text. |
| `text_spans` | A list of tuples with the spans for each model or system generated text segment. |
| `system_spans` | A list of boolean values indicating if the segment is model or system generated. |
| `tokens` | All tokens generated in text environment with both model and system generated tokens. |
| `token_spans` | Similar to `text_spans`, the `token_spans` indicate the boundaries of model and system generated tokens. |
| `token_masks` | The token masks can be used to ignore system generated tokens by masking them. |
| `completed` | Indicates if the interaction with the environment has completed. |
| `truncated` | Indicates if the interaction with the environment has completed because max length was reached. |
With these attributes you can reconstruct every interaction of the model with the `TextEnvironment`. The `TextHistory` also lets you visualize the text history. Let's have a look!
### Visualization
When the model interacts inside the `TextEnvironment`, it can be useful to visualize and separate which parts of the text outputs were generated by the model and which parts come from the system and tools. For that purpose there are the two methods [`TextHistory.show_text`] and [`TextHistory.show_tokens`]. They print the text and tokens respectively and highlight the various segments using the [`rich` library](https://github.com/Textualize/rich) (make sure to install it before using these methods).
You can see that the prompt is highlighted in gray, whereas system segments such as query and tool responses are highlighted in green. All segments generated by the model are highlighted in blue, and in addition to the pure text output, the reward is displayed as additional text in plum. Here is an example of `show_text`:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv_show_text.png" width=600>
</div>
Sometimes there can be tricky tokenization related issues that are hidden when showing the decoded text. Thus `TextHistory` also offers an option to display the same highlighting on the tokens directly with `show_tokens`:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv_show_tokens.png" width=800>
</div>
Note that you can turn on the colour legend by passing `show_legend=True`.
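For example, inspecting the first episode from the `run` call above could look like this sketch (with `rich` installed):
```python
history = histories[0]  # TextHistory of the first query

history.show_text(show_legend=True)               # decoded text with colour-coded segments
history.show_tokens(tokenizer, show_legend=True)  # same view at the token level
```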
## API Documentation
[[autodoc]] TextEnvironment
[[autodoc]] TextHistory


@ -2,6 +2,7 @@
At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].
The Trainer and model classes are largely inspired from `transformers.Trainer` and `transformers.AutoModel` classes and adapted for RL.
We also support a `RewardTrainer` that can be used to train a reward model.
## PPOConfig
@ -11,6 +12,34 @@
[[autodoc]] PPOTrainer
## RewardConfig
[[autodoc]] RewardConfig
## RewardTrainer
[[autodoc]] RewardTrainer
## SFTTrainer
[[autodoc]] SFTTrainer
## DPOTrainer
[[autodoc]] DPOTrainer
## DDPOConfig
[[autodoc]] DDPOConfig
## DDPOTrainer
[[autodoc]] DDPOTrainer
## IterativeSFTTrainer
[[autodoc]] IterativeSFTTrainer
## set_seed
[[autodoc]] set_seed

docs/source/use_model.md Normal file

@ -0,0 +1,58 @@
# Use model after training
Once you have trained a model using either the SFTTrainer, PPOTrainer, or DPOTrainer, you will have a fine-tuned model that can be used for text generation. In this section, we'll walk through the process of loading the fine-tuned model and generating text. If you need to run an inference server with the trained model, you can explore libraries such as [`text-generation-inference`](https://github.com/huggingface/text-generation-inference).
## Load and Generate
If you have fully fine-tuned a model, meaning without the use of PEFT, you can simply load it like any other language model in transformers. For example, the value head that was trained during PPO training is no longer needed, and if you load the model with the original transformers class it will be ignored:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name_or_path = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub
device = "cpu" # or "cuda" if you have a GPU
model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
inputs = tokenizer.encode("This movie was really", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
Alternatively you can also use the pipeline:
```python
from transformers import pipeline
model_name_or_path = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub
pipe = pipeline("text-generation", model=model_name_or_path)
print(pipe("This movie was really")[0]["generated_text"])
```
## Use Adapters with PEFT
```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub"
adapter_model_name = "path/to/my/adapter"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
```
You can also merge the adapters into the base model so you can use the model like a normal transformers model; however, the checkpoint will be significantly bigger:
```python
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
model = model.merge_and_unload()
model.save_pretrained("merged_adapters")
```
Once you have the model loaded, with the adapters either merged or kept separately on top, you can run generation as with a normal model, as outlined above.


@ -0,0 +1,160 @@
# Using LLaMA models with TRL
We've begun rolling out examples to use Meta's LLaMA models in `trl` (see [Meta's LLaMA release](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) for the original LLaMA model).
## Efficient training strategies
Even training the smallest LLaMA model requires an enormous amount of memory. Some quick math: in bf16, every parameter uses 2 bytes (in fp32 4 bytes) in addition to 8 bytes used, e.g., in the Adam optimizer (see the [performance docs](https://huggingface.co/docs/transformers/perf_train_gpu_one#optimizer) in Transformers for more info). So a 7B parameter model would use `(2+8)*7B=70GB` just to fit in memory and would likely need more when you compute intermediate values such as attention scores. So you couldn't train the model even on a single 80GB A100 like that. You can use some tricks, like more efficient optimizers or half-precision training, to squeeze a bit more into memory, but you'll run out sooner or later.
Another option is to use Parameter-Efficient Fine-Tuning (PEFT) techniques, such as the [`peft`](https://github.com/huggingface/peft) library, which can perform low-rank adaptation (LoRA) on a model loaded in 8-bit.
For more on `peft` + `trl`, see the [docs](https://huggingface.co/docs/trl/sentiment_tuning_peft).
Loading the model in 8bit reduces the memory footprint drastically since you only need one byte per parameter for the weights (e.g. a 7B LLaMA model is 7GB in memory).
Instead of training the original weights directly, LoRA adds small adapter layers on top of some specific layers (usually the attention layers); thus, the number of trainable parameters is drastically reduced.
In this scenario, a rule of thumb is to allocate ~1.2-1.4GB per billion parameters (depending on the batch size and sequence length) to fit the entire fine-tuning setup.
This enables fine-tuning larger models (up to 50-60B scale models on a NVIDIA A100 80GB) at low cost.
Now we can fit very large models into a single GPU, but the training might still be very slow.
The simplest strategy in this scenario is data parallelism: we replicate the same training setup into separate GPUs and pass different batches to each GPU.
With this, you can parallelize the forward/backward passes of the model and scale with the number of GPUs.
![chapter10_ddp.png](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter10_ddp.png)
We use either the `transformers.Trainer` or `accelerate`, which both support data parallelism without any code changes, by simply passing arguments when calling the scripts with `torchrun` or `accelerate launch`. The following runs a training script with 8 GPUs on a single machine with `accelerate` and `torchrun`, respectively.
```bash
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 my_accelerate_script.py
torchrun --nnodes 1 --nproc_per_node 8 my_torch_script.py
```
## Supervised fine-tuning
Before we start training reward models and tuning our model with RL, it helps if the model is already good in the domain we are interested in.
In our case, we want it to answer questions, while for other use cases, we might want it to follow instructions, in which case instruction tuning is a great idea.
The easiest way to achieve this is by continuing to train the language model with the language modeling objective on texts from the domain or task.
The [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences) is enormous (over 10 million instructions), so we can easily train the language model on a subset of it.
There is nothing special about fine-tuning the model before doing RLHF - it's just the causal language modeling objective from pretraining that we apply here.
To use the data efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximal context of the model, we concatenate a lot of texts with an EOS token in between and cut chunks of the context size to fill the batch without any padding.
![chapter10_preprocessing-clm.png](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter10_preprocessing-clm.png)
With this approach, training is much more efficient, as each token passed through the model is also trained, in contrast to padding tokens, which are usually masked from the loss.
If you don't have much data and are more concerned about occasionally cutting off some tokens that are overflowing the context you can also use a classical data loader.
The packing is handled by the `ConstantLengthDataset` and we can then use the `Trainer` after loading the model with `peft`. First, we load the model in int8, prepare it for training, and then add the LoRA adapters.
```python
# load model in 8bit
model = AutoModelForCausalLM.from_pretrained(
args.model_path,
load_in_8bit=True,
device_map={"": Accelerator().local_process_index}
)
model = prepare_model_for_kbit_training(model)
# add LoRA to model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
We train the model for a few thousand steps with the causal language modeling objective and save the model.
Since we will tune the model again with different objectives, we merge the adapter weights with the original model weights.
**Disclaimer:** due to LLaMA's license, we release only the adapter weights for this and the model checkpoints in the following sections.
You can apply for access to the base model's weights by filling out Meta AI's [form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) and then converting them to the 🤗 Transformers format by running this [script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
Note that you'll also need to install 🤗 Transformers from source until the `v4.28` is released.
Now that we have fine-tuned the model for the task, we are ready to train a reward model.
## Reward modeling and human preferences
In principle, we could fine-tune the model using RLHF directly with the human annotations.
However, this would require us to send some samples to humans for rating after each optimization iteration.
This is expensive and slow due to the number of training samples needed for convergence and the inherent latency of human reading and annotator speed.
A trick that works well instead of direct feedback is training a reward model on human annotations collected before the RL loop.
The goal of the reward model is to imitate how a human would rate a text. There are several possible strategies to build a reward model: the most straightforward way would be to predict the annotation (e.g. a rating score or a binary value for “good”/”bad”).
In practice, what works better is to predict the ranking of two examples, where the reward model is presented with two candidates `(y_k, y_j)` for a given prompt `x` and has to predict which one would be rated higher by a human annotator.
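Written out, a standard formulation of this pairwise ranking loss (the one implemented in the `compute_loss` snippet below) is:
$$ \mathcal{L}(\theta) = -\log \sigma\big(r_\theta(x, y_j) - r_\theta(x, y_k)\big) $$
where \\(y_j\\) is the answer preferred by the annotators and \\(r_\theta(x, y)\\) is the scalar score of the reward model.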
With the StackExchange dataset, we can infer which of the two answers was preferred by the users based on the score.
With that information and the loss defined above, we can then modify the `transformers.Trainer` by adding a custom loss function.
```python
class RewardTrainer(Trainer):
def compute_loss(self, model, inputs, return_outputs=False):
rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
if return_outputs:
return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
return loss
```
We utilize a subset of 100,000 pairs of candidates and evaluate on a held-out set of 50,000. With a modest training batch size of 4, we train the Llama model using the LoRA `peft` adapter for a single epoch with the Adam optimizer and BF16 precision. Our LoRA configuration is:
```python
peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
```
As detailed in the next section, the resulting adapter can be merged into the frozen model and saved for further downstream use.
## Reinforcement Learning from Human Feedback
With the fine-tuned language model and the reward model at hand, we are now ready to run the RL loop. It follows roughly three steps:
1. Generate responses from prompts,
2. Rate the responses with the reward model,
3. Run a reinforcement learning policy-optimization step with the ratings.
The Query and Response prompts are templated as follows before being tokenized and passed to the model:
```bash
Question: <Query>
Answer: <Response>
```
The same template was used for SFT, RM and RLHF stages.
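For illustration, a tiny helper implementing this template (the function name is just an example):
```python
def build_prompt(question: str, response: str = "") -> str:
    """Format a sample with the Question/Answer template used for SFT, RM and RLHF."""
    return f"Question: {question}\n\nAnswer: {response}"
```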
Once more, we utilize `peft` for memory-efficient training, which offers an extra advantage in the RLHF context.
Here, the reference model and policy share the same base, the SFT model, which we load in 8-bit and freeze during training.
We exclusively optimize the policy's LoRA weights using PPO while sharing the base model's weights.
```python
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
question_tensors = batch["input_ids"]
# sample from the policy to generate responses
response_tensors = ppo_trainer.generate(
question_tensors,
return_prompt=False,
length_sampler=output_length_sampler,
**generation_kwargs,
)
batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
# Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]
# Run PPO step
stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
# Log stats to Wandb
ppo_trainer.log_stats(stats, batch, rewards)
```
For the rest of the details and evaluation, please refer to our [blog post on StackLLaMA](https://huggingface.co/blog/stackllama).


@ -1,66 +1,3 @@
# Sentiment Examples
# Examples
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
Here's an overview of the notebooks and scripts:
| File | Description |
|---|---|
| `notebooks/gpt2-sentiment.ipynb` | Fine-tune GPT2 to generate positive movie reviews. |
| `notebooks/gpt2-sentiment-control.ipynb` | Fine-tune GPT2 to generate movie reviews with controlled sentiment. |
| `scripts/gpt2-sentiment.py` | Same as the notebook, but easier to use in a multi-GPU setup. |
| `scripts/t5-sentiment.py` | Same as GPT2 script, but for a Seq2Seq model (T5). |
## Installation
```bash
pip install trl
#optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## Launch scripts
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```
# Summarization Example
The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization).
Here's an overview of the files:
| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |
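At its core, the script trains with the pairwise preference loss from the paper: for each post it scores a preferred summary `j` and a rejected summary `k` and minimizes `-log sigmoid(r_j - r_k)`, the same objective that appears in the stack-llama reward-modeling script further down this page. A minimal, self-contained sketch of that loss (the function name is ours):
```python
import torch
import torch.nn as nn

def pairwise_reward_loss(rewards_j: torch.Tensor, rewards_k: torch.Tensor) -> torch.Tensor:
    """InstructGPT-style pairwise logloss: push the preferred reward above the rejected one."""
    return -nn.functional.logsigmoid(rewards_j - rewards_k).mean()

# toy check: preferred answers already scored higher -> small loss
print(pairwise_reward_loss(torch.tensor([2.0, 1.5]), torch.tensor([0.5, 0.0])))
```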
## Installation
```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```
```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely, then tune a GPT model on summarization and then instantiate the reward model
# with it. In other words, pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the pretrained gpt2 like we do in the following examples of how
# to run this script.
# Example of running this script with the small size gpt2 on a 40GB A100 (A100's support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16
# Example of running this script with the xl size gpt2 on 16 40GB A100's. Here the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```
Please check out https://huggingface.co/docs/trl/example_overview for documentation on our examples.

View File

@ -0,0 +1,20 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

View File

@ -0,0 +1,22 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

View File

@ -0,0 +1,23 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

View File

@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

examples/hello_world.py
View File

@ -0,0 +1,40 @@
# 0. imports
import torch
from transformers import GPT2Tokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# 2. initialize trainer
ppo_config = {"batch_size": 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
# 3. encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(model.pretrained_model.device)
# 4. generate model response
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"max_new_tokens": 20,
}
response_tensor = ppo_trainer.generate([item for item in query_tensor], return_prompt=False, **generation_kwargs)
response_txt = tokenizer.decode(response_tensor[0])
# 5. define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0, device=model.pretrained_model.device)]
# 6. train model with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

View File

@ -0,0 +1,7 @@
# Notebooks
This directory contains a collection of Jupyter notebooks that demonstrate how to use the TRL library in different applications.
- [`best_of_n.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/best_of_n.ipynb): This notebook demonstrates how to use the "Best of N" sampling strategy with TRL and compares it against PPO fine-tuning (a minimal sketch of the idea follows this list).
- [`gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb): This notebook demonstrates how to reproduce the GPT2 imdb sentiment tuning example in a Jupyter notebook.
- [`gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment-control.ipynb): This notebook demonstrates how to reproduce the GPT2 sentiment control example in a Jupyter notebook.
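Since best-of-n needs no training at all, the whole idea fits in a few lines. A rough sketch (function and argument names are ours; the notebook itself uses `lvwerra/gpt2-imdb` for generation and `lvwerra/distilbert-imdb` as the reward model):
```python
import torch

def best_of_n(query_ids, model, tokenizer, reward_pipe, n=4, max_new_tokens=16):
    """query_ids: 1-D tensor of prompt token ids. Returns the highest-scoring of n samples."""
    queries = query_ids.repeat((n, 1))  # n copies of the same prompt
    outputs = model.generate(
        queries, max_new_tokens=max_new_tokens, do_sample=True, top_k=0, top_p=1.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    scores = torch.tensor(
        [out[0]["score"] for out in reward_pipe(texts, top_k=None, function_to_apply="none")]
    )
    return texts[scores.argmax().item()]  # keep the best-scored sample
```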

View File

@ -0,0 +1,648 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"\n",
"**Best-of-n sampling as an alternative to RLHF**\n",
"\n",
"This notebook compares reward-model scores of prompt based responses from \n",
"1. a base model (`gpt2-imdb`)\n",
"2. `RLHF` tuned model based on this base-model \n",
"3. the base-model again from which we sample n responses to each prompt, score them and take the best scored one AKA the `best-of-n sampled` model\n",
"\n"
],
"metadata": {
"id": "WQpNapZNWuXP"
}
},
{
"cell_type": "markdown",
"source": [
"Import dependencies\n"
],
"metadata": {
"id": "Lo98lkdP66_x"
}
},
{
"cell_type": "code",
"source": [
"%pip install transformers trl"
],
"metadata": {
"id": "vDA6qayz692w"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import torch\n",
"import pandas as pd\n",
"from transformers import pipeline, AutoTokenizer\n",
"from datasets import load_dataset\n",
"\n",
"from trl import AutoModelForCausalLMWithValueHead\n",
"from trl.core import LengthSampler\n",
"\n",
"device = 0 if torch.cuda.is_available() else \"cpu\""
],
"metadata": {
"id": "M1s_iNm773hM"
},
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Various constants"
],
"metadata": {
"id": "Y7hyrIrO8tcY"
}
},
{
"cell_type": "code",
"source": [
"ref_model_name = \"lvwerra/gpt2-imdb\"\n",
"model_name = \"lvwerra/gpt2-imdb-pos-v2\"\n",
"reward_model = \"lvwerra/distilbert-imdb\"\n",
"\n",
"N_BEST_OF = 4"
],
"metadata": {
"id": "MqS3OM6Q8x6g"
},
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Models and tokenizers "
],
"metadata": {
"id": "c1YcXeElg6or"
}
},
{
"cell_type": "code",
"source": [
"model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)\n",
"\n",
"ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)\n",
"\n",
"reward_pipe = pipeline(\"sentiment-analysis\", model=reward_model, device=device)\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(ref_model_name)\n",
"\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
"# cuda-ize models\n",
"model.cuda()\n",
"ref_model.cuda()"
],
"metadata": {
"id": "b855NrL181Hh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Dataset building"
],
"metadata": {
"id": "Z1Cz0gCFhZYJ"
}
},
{
"cell_type": "code",
"source": [
"def build_dataset(tokenizer, dataset_name=\"imdb\", input_min_text_length=2, input_max_text_length=8):\n",
" # load imdb with datasets\n",
" ds = load_dataset(dataset_name, split=\"train\")\n",
" ds = ds.rename_columns({\"text\": \"review\"})\n",
" ds = ds.filter(lambda x: len(x[\"review\"]) > 200, batched=False)\n",
"\n",
" input_size = LengthSampler(input_min_text_length, input_max_text_length)\n",
"\n",
" def tokenize(sample):\n",
" sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n",
" sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n",
" return sample\n",
"\n",
" ds = ds.map(tokenize, batched=False)\n",
" ds.set_format(type=\"torch\")\n",
" return ds\n",
"\n",
"\n",
"dataset = build_dataset(tokenizer)"
],
"metadata": {
"id": "LqLVEp5p_8XM"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"gen_kwargs = {\"min_length\": -1, \"top_k\": 0.0, \"top_p\": 1.0, \"do_sample\": True, \"pad_token_id\": tokenizer.eos_token_id}\n",
"sent_kwargs = {\"top_k\": None, \"function_to_apply\": \"none\", \"batch_size\": 16}"
],
"metadata": {
"id": "AqA2McjMAxNw"
},
"execution_count": 6,
"outputs": []
},
{
"cell_type": "code",
"source": [
"output_min_length = 4\n",
"output_max_length = 16\n",
"output_length_sampler = LengthSampler(output_min_length, output_max_length)\n",
"\n",
"#### get a batch from the dataset\n",
"bs = 16\n",
"output_data = dict()\n",
"dataset.set_format(\"pandas\")\n",
"df_batch = dataset[:].sample(bs)\n",
"output_data[\"query\"] = df_batch[\"query\"].tolist()\n",
"query_tensors = df_batch[\"input_ids\"].tolist()\n",
"\n",
"# :: [Resp]\n",
"response_tensors_ref, response_tensors = [], []\n",
"# :: [[Resp]]\n",
"response_tensors_best_of = []"
],
"metadata": {
"id": "L_q4qs35AxcR"
},
"execution_count": 7,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"Generation using various models"
],
"metadata": {
"id": "QVfpyHnZBLKY"
}
},
{
"cell_type": "code",
"source": [
"for i in range(bs):\n",
" gen_len = output_length_sampler()\n",
"\n",
" query = torch.tensor(query_tensors[i])\n",
"\n",
" output = ref_model.generate(query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" response_tensors_ref.append(tokenizer.decode(output))\n",
"\n",
" output = model.generate(query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" response_tensors.append(tokenizer.decode(output))\n",
"\n",
" # generating copies of the same query for the Best-of-n sampling\n",
" queries = query.repeat((N_BEST_OF, 1))\n",
" output = ref_model.generate(queries.to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" response_tensors_best_of.append(tokenizer.batch_decode(output))"
],
"metadata": {
"id": "-imZ7uEFBNbw"
},
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Scoring"
],
"metadata": {
"id": "Jp5FC0Y5h_Sf"
}
},
{
"cell_type": "code",
"source": [
"scores_ref = [output[0][\"score\"] for output in reward_pipe(response_tensors_ref, **sent_kwargs)]\n",
"scores = [output[0][\"score\"] for output in reward_pipe(response_tensors, **sent_kwargs)]\n",
"scores_best_of = []\n",
"for i, response in enumerate(response_tensors_best_of):\n",
" # base_score = scores_ref[i]\n",
" scores_best_of.append(torch.tensor([output[0][\"score\"] for output in reward_pipe(response, **sent_kwargs)]))"
],
"metadata": {
"id": "PyDbbAQ0F_h7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"output_data[\"response (ref)\"] = response_tensors_ref\n",
"output_data[\"scores (ref)\"] = scores_ref\n",
"output_data[\"response (RLHF)\"] = response_tensors\n",
"output_data[\"scores (RLHF)\"] = scores\n",
"output_data[\"response (best_of)\"] = [\n",
" response_tensors_best_of[i][a.argmax().item()] for i, a in enumerate(scores_best_of)\n",
"]\n",
"output_data[\"scores (best_of)\"] = [a.max().item() for a in scores_best_of]\n",
"\n",
"\n",
"# store results in a dataframe\n",
"df_results = pd.DataFrame(output_data)\n",
"df_results"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 682
},
"id": "nA1GDNJEiGm-",
"outputId": "1389c686-0751-4304-dea2-b71fd68748e1"
},
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" query \\\n",
"0 I'm a pretty old \n",
"1 One of the most \n",
"2 Okay, as \n",
"3 Watching \"Kro \n",
"4 Seriously what were they thinking? \n",
"5 OK Hollywood \n",
"6 \"Bend It \n",
"7 While the premise behind The House \n",
"8 Well let me go \n",
"9 Vijay Krishna Acharya \n",
"10 Watching this movie made me \n",
"11 There are probably \n",
"12 Meryl Stre \n",
"13 I thought I read somewhere that \n",
"14 Good movie, very \n",
"15 It was agonizing \n",
"\n",
" response (ref) scores (ref) \\\n",
"0 I'm a pretty old kid, well, with lots of girl 1.179652 \n",
"1 One of the most psychologically devastating as... 2.477277 \n",
"2 Okay, as ruthless as they are, even their leve... 1.466462 \n",
"3 Watching \"Kroger\" (1915- 0.186047 \n",
"4 Seriously what were they thinking? It ain't go... 1.010697 \n",
"5 OK Hollywood goes into a total game of audio, ... 0.934041 \n",
"6 \"Bend It, Luther, Dodge, Church Goes to Rome w... 0.039218 \n",
"7 While the premise behind The House of Dracula ... -0.079306 \n",
"8 Well let me go...I don't want to movie it. I'm... 1.015246 \n",
"9 Vijay Krishna Acharya Sawai (Elverling). She was 0.341506 \n",
"10 Watching this movie made me poorly appreciate ... 1.574047 \n",
"11 There are probably more but if you had never s... -0.047099 \n",
"12 Meryl Streep's version of 0.373884 \n",
"13 I thought I read somewhere that the Lord had c... 0.091776 \n",
"14 Good movie, very funny, acting is very good.<|... 2.408837 \n",
"15 It was agonizing, and it made me wonder 1.240262 \n",
"\n",
" response (RLHF) scores (RLHF) \\\n",
"0 I'm a pretty old lady, and I loved this movie ... 2.218363 \n",
"1 One of the most Antibiotic Apps I have seen in 2.145479 \n",
"2 Okay, as I enjoyed the movie. It's added bonus... 2.239827 \n",
"3 Watching \"Kroven\". The film has a 1.044690 \n",
"4 Seriously what were they thinking? It's a very... 2.753088 \n",
"5 OK Hollywood shoot, and this is a classic. Som... 2.517364 \n",
"6 \"Bend It all\" is a sophisticated, drawing and ... 2.583935 \n",
"7 While the premise behind The House Intelligenc... 0.205217 \n",
"8 Well let me go through everything says it's a ... 2.727040 \n",
"9 Vijay Krishna Acharya is a perfect performance... 2.563642 \n",
"10 Watching this movie made me sleep better. It w... 1.690222 \n",
"11 There are probably random man only recently wh... 0.398258 \n",
"12 Meryl Streitz, who is 0.085154 \n",
"13 I thought I read somewhere that my thoughts, a... 1.833734 \n",
"14 Good movie, very much fuzz and logical based w... 2.325996 \n",
"15 It was agonizing because it was truly fun to 0.969669 \n",
"\n",
" response (best_of) scores (best_of) \n",
"0 I'm a pretty old, stinking,acting kinda chick ... 2.016955 \n",
"1 One of the most memorable performances of this... 2.676944 \n",
"2 Okay, as I put it in such a negative mood, it ... 1.478424 \n",
"3 Watching \"Kro\" is an entertainment craze 1.389495 \n",
"4 Seriously what were they thinking? It was stil... 2.523514 \n",
"5 OK Hollywood pay and the freaky set-up of this... 1.634765 \n",
"6 \"Bend It 9\"/\"Zara Pephoto\") and an honest, rea... 2.557210 \n",
"7 While the premise behind The House of Dracula ... 1.676889 \n",
"8 Well let me go though, alive in this ever grow... 2.652859 \n",
"9 Vijay Krishna Acharya adeptly emerges, and the... 2.308076 \n",
"10 Watching this movie made me curious: what did ... 0.950836 \n",
"11 There are probably too many documentaries in s... 1.142725 \n",
"12 Meryl Streep performed an awe 1.932498 \n",
"13 I thought I read somewhere that The Odd Couple... 0.475951 \n",
"14 Good movie, very well polished, nicely written... 2.820022 \n",
"15 It was agonizing, poignant, and worst of 2.058277 "
]
},
"metadata": {},
"execution_count": 10
}
]
}
]
}

View File

@ -73,6 +73,7 @@
"import pandas as pd\n",
"from random import choices\n",
"import matplotlib.pyplot as plt\n",
"\n",
"tqdm.pandas()\n",
"\n",
"from datasets import load_dataset\n",
@ -95,22 +96,15 @@
"metadata": {},
"outputs": [],
"source": [
"sentiment_pipe_kwargs = {\n",
" \"top_k\": None, \n",
" \"function_to_apply\": \"none\"\n",
"}\n",
"sentiment_pipe_kwargs = {\"top_k\": None, \"function_to_apply\": \"none\"}\n",
"\n",
"config = PPOConfig(\n",
" model_name=\"lvwerra/gpt2-imdb\",\n",
" steps=51200,\n",
" learning_rate=1.41e-5,\n",
" remove_unused_columns=False,\n",
" log_with=\"wandb\"\n",
" model_name=\"lvwerra/gpt2-imdb\", steps=51200, learning_rate=1.41e-5, remove_unused_columns=False, log_with=\"wandb\"\n",
")\n",
"\n",
"txt_in_len = 5\n",
"txt_out_len = 20\n",
"seed = 1\n"
"seed = 1"
]
},
{
@ -201,13 +195,13 @@
}
],
"source": [
"# create the dataset \n",
"# \n",
"dataset = load_dataset('imdb', split='train')\n",
"dataset = dataset.rename_columns({'text': 'review', 'label': 'sentiment'})\n",
"# create the dataset\n",
"#\n",
"dataset = load_dataset(\"imdb\", split=\"train\")\n",
"dataset = dataset.rename_columns({\"text\": \"review\", \"label\": \"sentiment\"})\n",
"# make sure the comments are are at least 500 and trim to 1000\n",
"dataset = dataset.filter(lambda x: len(x[\"review\"])>500, batched=False)\n",
"dataset = dataset.map(lambda x:{\"review\":x['review'][:1000]}, batched=False)\n",
"dataset = dataset.filter(lambda x: len(x[\"review\"]) > 500, batched=False)\n",
"dataset = dataset.map(lambda x: {\"review\": x[\"review\"][:1000]}, batched=False)\n",
"\n",
"dataset"
]
@ -241,11 +235,15 @@
}
],
"source": [
"dataset = dataset.map(lambda x:{\"input_ids\": gpt2_tokenizer.encode(' '+x['review'], return_tensors=\"pt\")[0, :txt_in_len]}, batched=False)\n",
"dataset = dataset.map(lambda x:{\"query\": gpt2_tokenizer.decode(x[\"input_ids\"])}, batched=False)\n",
"dataset = dataset.map(\n",
" lambda x: {\"input_ids\": gpt2_tokenizer.encode(\" \" + x[\"review\"], return_tensors=\"pt\")[0, :txt_in_len]},\n",
" batched=False,\n",
")\n",
"dataset = dataset.map(lambda x: {\"query\": gpt2_tokenizer.decode(x[\"input_ids\"])}, batched=False)\n",
"dataset = dataset[:20480]\n",
"\n",
"from datasets import Dataset\n",
"\n",
"dataset = Dataset.from_dict(dataset)\n",
"dataset.set_format(\"pytorch\")"
]
@ -355,7 +353,7 @@
}
],
"source": [
"ppo_trainer = PPOTrainer(config, gpt2_model, gpt2_model_ref, gpt2_tokenizer, dataset, data_collator=collator)\n"
"ppo_trainer = PPOTrainer(config, gpt2_model, gpt2_model_ref, gpt2_tokenizer, dataset, data_collator=collator)"
]
},
{
@ -373,7 +371,7 @@
"outputs": [],
"source": [
"if ppo_trainer.accelerator.num_processes == 1:\n",
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
"else:\n",
" device = ppo_trainer.accelerator.device\n",
"sentiment_pipe = pipeline(\"sentiment-analysis\", \"lvwerra/distilbert-imdb\", device=device)"
@ -404,7 +402,7 @@
}
],
"source": [
"text = 'this movie was really bad!!'\n",
"text = \"this movie was really bad!!\"\n",
"output = sentiment_pipe(text, **sentiment_pipe_kwargs)\n",
"output"
]
@ -427,7 +425,7 @@
}
],
"source": [
"text = 'this movie was really good!!'\n",
"text = \"this movie was really good!!\"\n",
"output = sentiment_pipe(text, **sentiment_pipe_kwargs)\n",
"output"
]
@ -450,7 +448,7 @@
}
],
"source": [
"text = 'this movie was a documentary'\n",
"text = \"this movie was a documentary\"\n",
"output = sentiment_pipe(text, **sentiment_pipe_kwargs)\n",
"output"
]
@ -472,7 +470,7 @@
" positive_logits = []\n",
" for out in outputs:\n",
" for element in out:\n",
" if element[\"label\"]==\"POSITIVE\":\n",
" if element[\"label\"] == \"POSITIVE\":\n",
" positive_logits.append(torch.tensor(element[\"score\"]))\n",
" return positive_logits"
]
@ -511,8 +509,8 @@
"metadata": {},
"outputs": [],
"source": [
"ctrl_str = ['[negative]', '[neutral]', '[positive]']\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # this should be handled by accelerate\n",
"ctrl_str = [\"[negative]\", \"[neutral]\", \"[positive]\"]\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # this should be handled by accelerate\n",
"ctrl_tokens = dict((s, gpt2_tokenizer.encode(s, return_tensors=\"pt\").squeeze().to(device)) for s in ctrl_str)"
]
},
@ -559,14 +557,14 @@
" task [positive]: reward = logit\n",
" \"\"\"\n",
" for i in range(len(logit)):\n",
" if task[i]=='[negative]':\n",
" if task[i] == \"[negative]\":\n",
" logit[i] = -logit[i]\n",
" elif task[i]=='[neutral]':\n",
" logit[i] = -2*torch.abs(logit[i])+4\n",
" elif task[i]=='[positive]':\n",
" elif task[i] == \"[neutral]\":\n",
" logit[i] = -2 * torch.abs(logit[i]) + 4\n",
" elif task[i] == \"[positive]\":\n",
" pass\n",
" else:\n",
" raise ValueError('task has to be in [0, 1, 2]!')\n",
" raise ValueError(\"task has to be in [0, 1, 2]!\")\n",
" return logit"
]
},
@ -611,7 +609,7 @@
}
],
"source": [
"pos_logit_to_reward(torch.Tensor([4,4,4]), ctrl_str)"
"pos_logit_to_reward(torch.Tensor([4, 4, 4]), ctrl_str)"
]
},
{
@ -631,7 +629,7 @@
}
],
"source": [
"pos_logit_to_reward(torch.Tensor([-4,-4,-4]), ctrl_str)"
"pos_logit_to_reward(torch.Tensor([-4, -4, -4]), ctrl_str)"
]
},
{
@ -668,14 +666,14 @@
"outputs": [],
"source": [
"generation_kwargs = {\n",
" \"min_length\":-1,\n",
" \"min_length\": -1,\n",
" \"top_k\": 0.0,\n",
" \"top_p\": 1.0,\n",
" \"do_sample\": True,\n",
" \"pad_token_id\": gpt2_tokenizer.eos_token_id,\n",
" \"max_new_tokens\": txt_out_len,\n",
" \"eos_token_id\": -1\n",
"}\n"
" \"eos_token_id\": -1,\n",
"}"
]
},
{
@ -698,7 +696,6 @@
"4. Get sentiments for query/responses from BERT\n",
"5. Optimize policy with PPO using the (query, response, reward) triplet\n",
"6. Log all the training statistics\n",
"\n",
"**Training time**\n",
"\n",
@ -724,11 +721,14 @@
"source": [
"for epoch in range(2):\n",
" for batch in tqdm(ppo_trainer.dataloader):\n",
" logs, game_data, = dict(), dict()\n",
" \n",
" (logs, game_data,) = (\n",
" dict(),\n",
" dict(),\n",
" )\n",
"\n",
" #### prepend a random control token\n",
" task_list = choices(ctrl_str, k=config.batch_size)\n",
" game_data['query'] = [t+q for t,q in zip(task_list, batch['query'])]\n",
" game_data[\"query\"] = [t + q for t, q in zip(task_list, batch[\"query\"])]\n",
" query_tensors = [torch.cat((ctrl_tokens[t], input_ids)) for t, input_ids in zip(task_list, batch[\"input_ids\"])]\n",
"\n",
" #### get response from gpt2\n",
@ -736,21 +736,21 @@
" for query in query_tensors:\n",
" response = ppo_trainer.generate(query, **generation_kwargs)\n",
" response_tensors.append(response.squeeze()[-txt_out_len:])\n",
" game_data['response'] = [gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors]\n",
" game_data[\"response\"] = [gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors]\n",
"\n",
" #### sentiment analysis\n",
" texts = [q + r for q,r in zip(batch['query'], game_data['response'])]\n",
" texts = [q + r for q, r in zip(batch[\"query\"], game_data[\"response\"])]\n",
" logits = extract_pipe_output(sentiment_pipe(texts, **sentiment_pipe_kwargs))\n",
" rewards = pos_logit_to_reward(logits, task_list)\n",
"\n",
" #### Run PPO training \n",
" #### Run PPO training\n",
" t = time.time()\n",
" stats = ppo_trainer.step(query_tensors, response_tensors, rewards)\n",
"\n",
" for cs in ctrl_str:\n",
" key = 'env/reward_'+cs.strip('[]')\n",
" stats[key] = np.mean([r.cpu().numpy() for r, t in zip(rewards, task_list) if t==cs])\n",
" ppo_trainer.log_stats(stats, game_data, rewards)\n"
" key = \"env/reward_\" + cs.strip(\"[]\")\n",
" stats[key] = np.mean([r.cpu().numpy() for r, t in zip(rewards, task_list) if t == cs])\n",
" ppo_trainer.log_stats(stats, game_data, rewards)"
]
},
{
@ -803,12 +803,11 @@
],
"source": [
"for ctrl_s in ctrl_str:\n",
" plt.hist([r for r, t in zip(logs['env/reward_dist'], task_list) if t==ctrl_s],\n",
" density=True,\n",
" alpha=0.5,\n",
" label=ctrl_s)\n",
"plt.legend(loc='best')\n",
"plt.title('reward distribution')\n",
" plt.hist(\n",
" [r for r, t in zip(logs[\"env/reward_dist\"], task_list) if t == ctrl_s], density=True, alpha=0.5, label=ctrl_s\n",
" )\n",
"plt.legend(loc=\"best\")\n",
"plt.title(\"reward distribution\")\n",
"plt.grid(True)\n",
"plt.show()"
]
@ -827,8 +826,8 @@
"metadata": {},
"outputs": [],
"source": [
"gpt2_model.save_pretrained('gpt2-imdb-ctrl')\n",
"gpt2_tokenizer.save_pretrained('gpt2-imdb-ctrl')"
"gpt2_model.save_pretrained(\"gpt2-imdb-ctrl\")\n",
"gpt2_tokenizer.save_pretrained(\"gpt2-imdb-ctrl\")"
]
}
],
@ -848,7 +847,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.9.12"
},
"vscode": {
"interpreter": {

View File

@ -63,6 +63,7 @@
"import torch\n",
"from tqdm import tqdm\n",
"import pandas as pd\n",
"\n",
"tqdm.pandas()\n",
"\n",
"from transformers import pipeline, AutoTokenizer\n",
@ -91,11 +92,7 @@
" log_with=\"wandb\",\n",
")\n",
"\n",
"sent_kwargs = {\n",
" \"return_all_scores\": True,\n",
" \"function_to_apply\": \"none\",\n",
" \"batch_size\": 16\n",
"}"
"sent_kwargs = {\"return_all_scores\": True, \"function_to_apply\": \"none\", \"batch_size\": 16}"
]
},
{
@ -105,6 +102,7 @@
"outputs": [],
"source": [
"import wandb\n",
"\n",
"wandb.init()"
]
},
@ -149,13 +147,13 @@
"source": [
"def build_dataset(config, dataset_name=\"imdb\", input_min_text_length=2, input_max_text_length=8):\n",
" \"\"\"\n",
" Build dataset for training. This builds the dataset from `load_dataset`, one should \n",
" Build dataset for training. This builds the dataset from `load_dataset`, one should\n",
" customize this function to train the model on its own dataset.\n",
" \n",
"\n",
" Args:\n",
" dataset_name (`str`): \n",
" dataset_name (`str`):\n",
" The name of the dataset to be loaded.\n",
" \n",
"\n",
" Returns:\n",
" dataloader (`torch.utils.data.DataLoader`):\n",
" The dataloader for the dataset.\n",
@ -163,19 +161,19 @@
" tokenizer = AutoTokenizer.from_pretrained(config.model_name)\n",
" tokenizer.pad_token = tokenizer.eos_token\n",
" # load imdb with datasets\n",
" ds = load_dataset(dataset_name, split='train')\n",
" ds = ds.rename_columns({'text': 'review'})\n",
" ds = ds.filter(lambda x: len(x[\"review\"])>200, batched=False)\n",
" ds = load_dataset(dataset_name, split=\"train\")\n",
" ds = ds.rename_columns({\"text\": \"review\"})\n",
" ds = ds.filter(lambda x: len(x[\"review\"]) > 200, batched=False)\n",
"\n",
" input_size = LengthSampler(input_min_text_length, input_max_text_length)\n",
"\n",
" def tokenize(sample):\n",
" sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[:input_size()]\n",
" sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n",
" sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n",
" return sample\n",
"\n",
" ds = ds.map(tokenize, batched=False)\n",
" ds.set_format(type='torch')\n",
" ds.set_format(type=\"torch\")\n",
" return ds"
]
},
@ -187,6 +185,7 @@
"source": [
"dataset = build_dataset(config)\n",
"\n",
"\n",
"def collator(data):\n",
" return dict((key, [d[key] for d in data]) for key in data[0])"
]
@ -252,7 +251,7 @@
"source": [
"device = ppo_trainer.accelerator.device\n",
"if ppo_trainer.accelerator.num_processes == 1:\n",
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
"sentiment_pipe = pipeline(\"sentiment-analysis\", model=\"lvwerra/distilbert-imdb\", device=device)"
]
},
@ -281,7 +280,7 @@
}
],
"source": [
"text = 'this movie was really bad!!'\n",
"text = \"this movie was really bad!!\"\n",
"sentiment_pipe(text, **sent_kwargs)"
]
},
@ -303,7 +302,7 @@
}
],
"source": [
"text = 'this movie was really good!!'\n",
"text = \"this movie was really good!!\"\n",
"sentiment_pipe(text, **sent_kwargs)"
]
},
@ -321,13 +320,7 @@
"metadata": {},
"outputs": [],
"source": [
"gen_kwargs = {\n",
" \"min_length\":-1,\n",
" \"top_k\": 0.0,\n",
" \"top_p\": 1.0,\n",
" \"do_sample\": True,\n",
" \"pad_token_id\": tokenizer.eos_token_id\n",
"}"
"gen_kwargs = {\"min_length\": -1, \"top_k\": 0.0, \"top_p\": 1.0, \"do_sample\": True, \"pad_token_id\": tokenizer.eos_token_id}"
]
},
{
@ -370,16 +363,16 @@
"\n",
"\n",
"generation_kwargs = {\n",
" \"min_length\":-1,\n",
" \"min_length\": -1,\n",
" \"top_k\": 0.0,\n",
" \"top_p\": 1.0,\n",
" \"do_sample\": True,\n",
" \"pad_token_id\": tokenizer.eos_token_id\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
"}\n",
"\n",
"\n",
"for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):\n",
" query_tensors = batch['input_ids']\n",
" query_tensors = batch[\"input_ids\"]\n",
"\n",
" #### Get response from gpt2\n",
" response_tensors = []\n",
@ -388,14 +381,14 @@
" generation_kwargs[\"max_new_tokens\"] = gen_len\n",
" response = ppo_trainer.generate(query, **generation_kwargs)\n",
" response_tensors.append(response.squeeze()[-gen_len:])\n",
" batch['response'] = [tokenizer.decode(r.squeeze()) for r in response_tensors]\n",
" batch[\"response\"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]\n",
"\n",
" #### Compute sentiment score\n",
" texts = [q + r for q,r in zip(batch['query'], batch['response'])]\n",
" texts = [q + r for q, r in zip(batch[\"query\"], batch[\"response\"])]\n",
" pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n",
" rewards = [torch.tensor(output[1][\"score\"]) for output in pipe_outputs]\n",
"\n",
" #### Run PPO step \n",
" #### Run PPO step\n",
" stats = ppo_trainer.step(query_tensors, response_tensors, rewards)\n",
" ppo_trainer.log_stats(stats, batch, rewards)"
]
@ -405,7 +398,7 @@
"metadata": {},
"source": [
"### Training progress\n",
"If you are tracking the training progress with Weights&Biases you should see a plot similar to the one below. Check out the interactive sample report on wandb.ai: [link](https://app.wandb.ai/lvwerra/trl-showcase/runs/1jtvxb1m/).\n",
"If you are tracking the training progress with Weights&Biases you should see a plot similar to the one below. Check out the interactive sample report on wandb.ai: [link](https://app.wandb.ai/huggingface/trl-showcase/runs/1jtvxb1m/).\n",
"\n",
"<div style=\"text-align: center\">\n",
"<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_tuning_progress.png' width='800'>\n",
@ -414,7 +407,7 @@
"\n",
"One can observe how the model starts to generate more positive outputs after a few optimisation steps.\n",
"\n",
"> Note: Investigating the KL-divergence will probably show that at this point the model has not converged to the target KL-divergence, yet. To get there would require longer training or starting with a higher inital coefficient."
"> Note: Investigating the KL-divergence will probably show that at this point the model has not converged to the target KL-divergence, yet. To get there would require longer training or starting with a higher initial coefficient."
]
},
{
@ -685,31 +678,33 @@
"game_data = dict()\n",
"dataset.set_format(\"pandas\")\n",
"df_batch = dataset[:].sample(bs)\n",
"game_data['query'] = df_batch['query'].tolist()\n",
"query_tensors = df_batch['input_ids'].tolist()\n",
"game_data[\"query\"] = df_batch[\"query\"].tolist()\n",
"query_tensors = df_batch[\"input_ids\"].tolist()\n",
"\n",
"response_tensors_ref, response_tensors = [], []\n",
"\n",
"#### get response from gpt2 and gpt2_ref\n",
"for i in range(bs):\n",
" gen_len = output_length_sampler()\n",
" output = ref_model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),\n",
" max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]\n",
" output = ref_model.generate(\n",
" torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()[-gen_len:]\n",
" response_tensors_ref.append(output)\n",
" output = model.generate(torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device),\n",
" max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]\n",
" output = model.generate(\n",
" torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()[-gen_len:]\n",
" response_tensors.append(output)\n",
"\n",
"#### decode responses\n",
"game_data['response (before)'] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]\n",
"game_data['response (after)'] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]\n",
"game_data[\"response (before)\"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]\n",
"game_data[\"response (after)\"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]\n",
"\n",
"#### sentiment analysis of query/response pairs before/after\n",
"texts = [q + r for q,r in zip(game_data['query'], game_data['response (before)'])]\n",
"game_data['rewards (before)'] = [output[1][\"score\"] for output in sentiment_pipe(texts, **sent_kwargs)]\n",
"texts = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (before)\"])]\n",
"game_data[\"rewards (before)\"] = [output[1][\"score\"] for output in sentiment_pipe(texts, **sent_kwargs)]\n",
"\n",
"texts = [q + r for q,r in zip(game_data['query'], game_data['response (after)'])]\n",
"game_data['rewards (after)'] = [output[1][\"score\"] for output in sentiment_pipe(texts, **sent_kwargs)]\n",
"texts = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (after)\"])]\n",
"game_data[\"rewards (after)\"] = [output[1][\"score\"] for output in sentiment_pipe(texts, **sent_kwargs)]\n",
"\n",
"# store results in a dataframe\n",
"df_results = pd.DataFrame(game_data)\n",
@ -767,10 +762,10 @@
}
],
"source": [
"print('mean:')\n",
"print(\"mean:\")\n",
"display(df_results[[\"rewards (before)\", \"rewards (after)\"]].mean())\n",
"print()\n",
"print('median:')\n",
"print(\"median:\")\n",
"display(df_results[[\"rewards (before)\", \"rewards (after)\"]].median())"
]
},
@ -843,8 +838,8 @@
}
],
"source": [
"model.save_pretrained('gpt2-imdb-pos-v2', push_to_hub=True)\n",
"tokenizer.save_pretrained('gpt2-imdb-pos-v2', push_to_hub=True)"
"model.save_pretrained(\"gpt2-imdb-pos-v2\", push_to_hub=True)\n",
"tokenizer.save_pretrained(\"gpt2-imdb-pos-v2\", push_to_hub=True)"
]
},
{

View File

@ -0,0 +1,7 @@
# Research projects that use TRL
Welcome to the research projects folder! Here you can find the scripts for research projects that use TRL and are maintained by the developers and the community (LM detoxification, StackLLaMA, etc.). Check out the READMEs in the subfolders for more information!
- [Detoxifying language models](https://github.com/huggingface/trl/tree/main/examples/research_projects/toxicity)
- [Stack-Llama](https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama)
- [Stack-Llama-2](https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama_2)

View File

@ -0,0 +1,18 @@
# RLHF pipeline for the creation of StackLLaMA: a Stack Exchange llama-7b model.
There were three main steps to the training process:
1. Supervised fine-tuning of the base llama-7b model to create llama-7b-se:
- `torchrun --nnodes 1 --nproc_per_node 8 examples/research_projects/stack_llama/scripts/supervised_finetuning.py --model_path=<LLAMA_MODEL_PATH> --streaming --learning_rate 1e-5 --max_steps 5000 --output_dir ./llama-se`
2. Reward modeling using dialog pairs from the SE dataset using the llama-7b-se to create llama-7b-se-rm:
- `torchrun --nnodes 1 --nproc_per_node 8 examples/research_projects/stack_llama/scripts/reward_modeling.py --model_name=<LLAMA_SE_MODEL>`
3. RL fine-tuning of llama-7b-se with the llama-7b-se-rm reward model:
- `accelerate launch --multi_gpu --num_machines 1 --num_processes 8 examples/research_projects/stack_llama/scripts/rl_training.py --log_with=wandb --model_name=<LLAMA_SE_MODEL> --reward_model_name=<LLAMA_SE_RM_MODEL> --adafactor=False --tokenizer_name=<LLAMA_TOKENIZER> --save_freq=100 --output_max_length=128 --batch_size=8 --gradient_accumulation_steps=8 --batched_gen=True --ppo_epochs=4 --seed=0 --learning_rate=1.4e-5 --early_stopping=True --output_dir=llama-se-rl-finetune-128-8-8-1.4e-5_adam`
LoRA layers were used at all stages to reduce memory requirements.
At each stage the peft adapter layers were merged with the base model, using:
```shell
python examples/research_projects/stack_llama/scripts/merge_peft_adapter.py --adapter_model_name=XXX --base_model_name=YYY --output_name=ZZZ
```
Note that this script requires `peft>=0.3.0`.
For access to the base llama-7b model, please see Meta's [release](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) and [request form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform).

View File

@ -0,0 +1,48 @@
from dataclasses import dataclass, field
from typing import Optional

import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, HfArgumentParser


@dataclass
class ScriptArguments:
    """
    The input names representing the Adapter and Base model fine-tuned with PEFT, and the output name representing the
    merged model.
    """

    adapter_model_name: Optional[str] = field(default=None, metadata={"help": "the adapter name"})
    base_model_name: Optional[str] = field(default=None, metadata={"help": "the base model name"})
    output_name: Optional[str] = field(default=None, metadata={"help": "the merged model name"})


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
assert script_args.adapter_model_name is not None, "please provide the name of the Adapter you would like to merge"
assert script_args.base_model_name is not None, "please provide the name of the Base model"
assert script_args.output_name is not None, "please provide the output name of the merged model"

peft_config = PeftConfig.from_pretrained(script_args.adapter_model_name)
if peft_config.task_type == "SEQ_CLS":
    # The sequence classification task is used for the reward model in PPO
    model = AutoModelForSequenceClassification.from_pretrained(
        script_args.base_model_name, num_labels=1, torch_dtype=torch.bfloat16
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        script_args.base_model_name, return_dict=True, torch_dtype=torch.bfloat16
    )

tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name)

# Load the PEFT model and merge the adapter weights into the base model
model = PeftModel.from_pretrained(model, script_args.adapter_model_name)
model.eval()
model = model.merge_and_unload()

model.save_pretrained(f"{script_args.output_name}")
tokenizer.save_pretrained(f"{script_args.output_name}")
model.push_to_hub(f"{script_args.output_name}", use_temp_dir=False)

View File

@ -0,0 +1,300 @@
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import evaluate
import numpy as np
import torch
import torch.nn as nn
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
HfArgumentParser,
PreTrainedTokenizerBase,
Trainer,
TrainerCallback,
TrainingArguments,
)
from transformers.utils import PaddingStrategy
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
"""
local_rank: Optional[int] = field(default=-1, metadata={"help": "Used for multi-gpu"})
resume_from_checkpoint: Optional[bool] = field(
default=False,
metadata={"help": "If you want to resume training where it left off."},
)
deepspeed: Optional[str] = field(
default=None,
metadata={
"help": "Path to deepspeed config if using deepspeed. You may need this if the model that you want to train doesn't fit on a single GPU."
},
)
per_device_train_batch_size: Optional[int] = field(default=4)
per_device_eval_batch_size: Optional[int] = field(default=1)
gradient_accumulation_steps: Optional[int] = field(default=1)
learning_rate: Optional[float] = field(default=2e-5)
weight_decay: Optional[float] = field(default=0.001)
model_name: Optional[str] = field(
default="gpt2",
metadata={
"help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
},
)
tokenizer_name: Optional[str] = field(
default=None,
metadata={
"help": "The tokenizer for your model, if left empty will use the default for your model",
},
)
bf16: Optional[bool] = field(
default=True,
metadata={
"help": "This essentially cuts the training time in half if you want to sacrifice a little precision and have a supported GPU."
},
)
num_train_epochs: Optional[int] = field(
default=1,
metadata={"help": "The number of training epochs for the reward model."},
)
train_subset: Optional[int] = field(
default=100000,
metadata={"help": "The size of the subset of the training data to use"},
)
eval_subset: Optional[int] = field(
default=50000,
metadata={"help": "The size of the subset of the eval data to use"},
)
gradient_checkpointing: Optional[bool] = field(
default=False,
metadata={"help": "Enables gradient checkpointing."},
)
optim: Optional[str] = field(
default="adamw_hf",
metadata={"help": "The optimizer to use."},
)
lr_scheduler_type: Optional[str] = field(
default="linear",
metadata={"help": "The lr scheduler"},
)
max_length: Optional[int] = field(default=512)
eval_first_step: Optional[bool] = field(
default=False,
metadata={"help": "Whether to run eval after the first step"},
)
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
# Load the human stack-exchange-paired dataset for tuning the reward model.
train_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/reward", split="train")
if script_args.train_subset > 0:
train_dataset = train_dataset.select(range(script_args.train_subset))
eval_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/evaluation", split="train")
if script_args.eval_subset > 0:
eval_dataset = eval_dataset.select(range(script_args.eval_subset))
# Define the training args. Needs to be done before the model is loaded if you are using deepspeed.
model_name_split = script_args.model_name.split("/")[-1]
output_name = (
f"{model_name_split}_peft_stack-exchange-paired_rmts__{script_args.train_subset}_{script_args.learning_rate}"
)
training_args = TrainingArguments(
output_dir=output_name,
learning_rate=script_args.learning_rate,
per_device_train_batch_size=script_args.per_device_train_batch_size,
per_device_eval_batch_size=script_args.per_device_eval_batch_size,
num_train_epochs=script_args.num_train_epochs,
weight_decay=script_args.weight_decay,
evaluation_strategy="steps",
eval_steps=500,
save_strategy="steps",
save_steps=500,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
gradient_checkpointing=script_args.gradient_checkpointing,
deepspeed=script_args.deepspeed,
local_rank=script_args.local_rank,
remove_unused_columns=False,
label_names=[],
bf16=script_args.bf16,
logging_strategy="steps",
logging_steps=10,
optim=script_args.optim,
lr_scheduler_type=script_args.lr_scheduler_type,
)
# Load the value-head model and tokenizer.
tokenizer_name = script_args.tokenizer_name if script_args.tokenizer_name is not None else script_args.model_name
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
model = AutoModelForSequenceClassification.from_pretrained(
script_args.model_name, num_labels=1, torch_dtype=torch.bfloat16
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Need to do this for gpt2, because it doesn't have an official pad token.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
model.config.use_cache = not script_args.gradient_checkpointing
num_proc = 24 # Can adjust to be higher if you have more processors.
original_columns = train_dataset.column_names
# Turn the dataset into pairs of post + summaries, where text_j is the preferred question + answer and text_k is the other.
# Then tokenize the dataset.
def preprocess_function(examples):
    new_examples = {
        "input_ids_j": [],
        "attention_mask_j": [],
        "input_ids_k": [],
        "attention_mask_k": [],
    }
    for question, response_j, response_k in zip(examples["question"], examples["response_j"], examples["response_k"]):
        tokenized_j = tokenizer("Question: " + question + "\n\nAnswer: " + response_j, truncation=True)
        tokenized_k = tokenizer("Question: " + question + "\n\nAnswer: " + response_k, truncation=True)
        new_examples["input_ids_j"].append(tokenized_j["input_ids"])
        new_examples["attention_mask_j"].append(tokenized_j["attention_mask"])
        new_examples["input_ids_k"].append(tokenized_k["input_ids"])
        new_examples["attention_mask_k"].append(tokenized_k["attention_mask"])

    return new_examples
# preprocess the dataset and filter out QAs that are longer than script_args.max_length
train_dataset = train_dataset.map(
preprocess_function,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
train_dataset = train_dataset.filter(
lambda x: len(x["input_ids_j"]) <= script_args.max_length and len(x["input_ids_k"]) <= script_args.max_length
)
eval_dataset = eval_dataset.map(
preprocess_function,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
eval_dataset = eval_dataset.filter(
lambda x: len(x["input_ids_j"]) <= script_args.max_length and len(x["input_ids_k"]) <= script_args.max_length
)
# We need to define a special data collator that batches the data in our j vs k format.
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        features_j = []
        features_k = []
        for feature in features:
            features_j.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            features_k.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        batch_j = self.tokenizer.pad(
            features_j,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch_k = self.tokenizer.pad(
            features_k,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = {
            "input_ids_j": batch_j["input_ids"],
            "attention_mask_j": batch_j["attention_mask"],
            "input_ids_k": batch_k["input_ids"],
            "attention_mask_k": batch_k["attention_mask"],
            "return_loss": True,
        }
        return batch
# Define the metric that we'll use for validation.
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, _ = eval_pred
    # Here, predictions is rewards_j and rewards_k.
    # We want to see how much of the time rewards_j > rewards_k.
    predictions = np.argmax(predictions, axis=0)
    labels = np.zeros(predictions.shape)
    return accuracy.compute(predictions=predictions, references=labels)
class RewardTrainer(Trainer):
# Define how to compute the reward loss. We use the InstructGPT pairwise logloss: https://arxiv.org/abs/2203.02155
def compute_loss(self, model, inputs, return_outputs=False):
rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
if return_outputs:
return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
return loss
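# For a single pair, the loss above is -log(sigmoid(r_j - r_k)): e.g. a margin of +2.0 gives a loss
# of about 0.13, while an inverted ranking (margin of -2.0) gives about 2.13, so minimizing it
# pushes the model to assign the preferred answer the higher scalar reward.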
# Train the model, woohoo.
trainer = RewardTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=script_args.max_length),
)
if script_args.eval_first_step:
class EvaluateFirstStepCallback(TrainerCallback):
def on_step_end(self, args, state, control, **kwargs):
if state.global_step == 1:
control.should_evaluate = True
trainer.add_callback(EvaluateFirstStepCallback())
trainer.train(script_args.resume_from_checkpoint)
print("Saving last checkpoint of the model")
model.save_pretrained(output_name + "_peft_last_checkpoint")

View File

@ -0,0 +1,263 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import Adafactor, AutoTokenizer, HfArgumentParser, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler
tqdm.pandas()
@dataclass
class ScriptArguments:
"""
The name of the Causal LM model we wish to fine-tune with PPO
"""
# NOTE: gpt2 models use Conv1D instead of Linear layers, which are not yet supported in 8-bit mode;
# models like the gpt-neo* family are more suitable.
model_name: Optional[str] = field(default="", metadata={"help": "the model name"})
tokenizer_name: Optional[str] = field(default="", metadata={"help": "the tokenizer name"})
reward_model_name: Optional[str] = field(default="", metadata={"help": "the reward model name"})
log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
output_max_length: Optional[int] = field(default=128, metadata={"help": "maximum length for generation"})
mini_batch_size: Optional[int] = field(default=1, metadata={"help": "the PPO minibatch size"})
batch_size: Optional[int] = field(default=32, metadata={"help": "the batch size"})
ppo_epochs: Optional[int] = field(default=4, metadata={"help": "the number of ppo epochs"})
gradient_accumulation_steps: Optional[int] = field(
default=4, metadata={"help": "the number of gradient accumulation steps"}
)
adafactor: Optional[bool] = field(default=False, metadata={"help": "whether to use the adafactor optimizer"})
early_stopping: Optional[bool] = field(default=False, metadata={"help": "whether to early stop"})
target_kl: Optional[float] = field(default=0.1, metadata={"help": "kl target for early stopping"})
reward_baseline: Optional[float] = field(
default=0.0,
metadata={"help": "a baseline value that is subtracted from the reward"},
)
batched_gen: Optional[bool] = field(default=False, metadata={"help": "whether to use batched text generation"})
save_freq: Optional[int] = field(default=None, metadata={"help": "save the model every n steps"})
output_dir: Optional[str] = field(default="runs/", metadata={"help": "the output directory"})
seed: Optional[int] = field(default=0, metadata={"help": "the seed"})
steps: Optional[int] = field(default=20000, metadata={"help": "the total number of training steps"})
init_kl_coef: Optional[float] = field(
default=0.2,
metadata={"help": "Initial KL penalty coefficient (used for adaptive and linear control)"},
)
adap_kl_ctrl: Optional[bool] = field(default=True, metadata={"help": "Use adaptive KL control, otherwise linear"})
parser = HfArgumentParser(ScriptArguments)
script_args: ScriptArguments = parser.parse_args_into_dataclasses()[0]
reward_model_name = script_args.reward_model_name
dataset_name = "lvwerra/stack-exchange-paired"
config = PPOConfig(
steps=script_args.steps,
model_name=script_args.model_name,
learning_rate=script_args.learning_rate,
log_with=script_args.log_with,
batch_size=script_args.batch_size,
mini_batch_size=script_args.mini_batch_size,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
optimize_cuda_cache=True,
early_stopping=script_args.early_stopping,
target_kl=script_args.target_kl,
ppo_epochs=script_args.ppo_epochs,
seed=script_args.seed,
init_kl_coef=script_args.init_kl_coef,
adap_kl_ctrl=script_args.adap_kl_ctrl,
)
train_dataset = load_dataset("lvwerra/stack-exchange-paired", data_dir="data/rl", split="train")
train_dataset = train_dataset.select(range(100000))
original_columns = train_dataset.column_names
# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the score for every class instead of only the top one.
sent_kwargs = {
"return_all_scores": True,
"function_to_apply": "none",
"batch_size": 16,
"truncation": True,
}
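# With `return_all_scores=True` and `function_to_apply="none"`, the pipeline returns the raw
# (unnormalized) score for every label of each input text, e.g. roughly
# [{"label": "LABEL_0", "score": 0.21}] per text for a single-label reward model (the label name
# is illustrative); the training loop below reads the reward from output[0]["score"].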
tokenizer = AutoTokenizer.from_pretrained(script_args.tokenizer_name)
# GPT-2 style tokenizers do not define a pad token by default, so we set it to the eos token.
# This is only needed for this kind of model.
if getattr(tokenizer, "pad_token", None) is None:
tokenizer.pad_token = tokenizer.eos_token
# Below is an example function to build the dataset. In our case, we use the
# `lvwerra/stack-exchange-paired` dataset from the Hugging Face Hub. One should customize this
# function to train the model on its own dataset.
def build_dataset(
tokenizer,
dataset_name="lvwerra/stack-exchange-paired",
):
"""
Build dataset for training. This builds the dataset from `load_dataset`; one should
customize this function to train the model on its own dataset.
Args:
dataset_name (`str`):
The name of the dataset to be loaded.
Returns:
dataset (`datasets.Dataset`):
The preprocessed dataset ready for training.
"""
num_proc = 24
def preprocess_function(examples):
new_examples = {
"query": [],
"input_ids": [],
}
for question in examples["question"]:
query = "Question: " + question + "\n\nAnswer: "
tokenized_question = tokenizer(query, truncation=True)
new_examples["query"].append(query)
new_examples["input_ids"].append(tokenized_question["input_ids"])
return new_examples
ds = train_dataset.map(
preprocess_function,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
ds = ds.filter(lambda x: len(x["input_ids"]) < 512, batched=False)
ds.set_format(type="torch")
return ds
# We retrieve the dataloader by calling the `build_dataset` function.
dataset = build_dataset(tokenizer)
def collator(data):
return dict((key, [d[key] for d in data]) for key in data[0])
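# The collator above simply transposes a list of dicts into a dict of lists, e.g.
#   [{"query": "Question: a", "input_ids": [1, 2]}, {"query": "Question: b", "input_ids": [3]}]
# becomes {"query": ["Question: a", "Question: b"], "input_ids": [[1, 2], [3]]}; no padding is
# applied here, since generation below works on per-sample tensors.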
# set seed before initializing value head for deterministic eval
set_seed(config.seed)
# Now let's build the model, the reference model, and the tokenizer.
current_device = Accelerator().local_process_index
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
load_in_8bit=True,
device_map={"": current_device},
peft_config=lora_config,
)
optimizer = None
if script_args.adafactor:
optimizer = Adafactor(
filter(lambda p: p.requires_grad, model.parameters()),
scale_parameter=False,
relative_step=False,
warmup_init=False,
lr=config.learning_rate,
)
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(
config,
model,
ref_model=None,
tokenizer=tokenizer,
dataset=dataset,
data_collator=collator,
optimizer=optimizer,
)
# We then build the sentiment analysis pipeline using our reward model, passing the
# model name and the sentiment analysis pipeline arguments. Let's also make sure to
# set the device to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline(
"sentiment-analysis",
model=reward_model_name,
device_map={"": current_device},
model_kwargs={"load_in_8bit": True},
tokenizer=tokenizer,
return_token_type_ids=False,
)
# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
generation_kwargs = {
# "min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.pad_token_id,
"eos_token_id": 100_000,
}
output_min_length = 32
output_max_length = script_args.output_max_length
output_length_sampler = LengthSampler(output_min_length, output_max_length)
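# LengthSampler draws a random integer in [output_min_length, output_max_length) each time it is
# called, so every generation in the loop below targets a different response length rather than a
# fixed one.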
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
if epoch >= config.total_ppo_epochs:
break
question_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(
question_tensors,
return_prompt=False,
length_sampler=output_length_sampler,
**generation_kwargs,
)
batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
# Compute reward score (using the sentiment analysis pipeline)
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]
# Run PPO step
stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
if script_args.save_freq and epoch and epoch % script_args.save_freq == 0:
ppo_trainer.save_pretrained(script_args.output_dir + f"step_{epoch}")

View File

@ -0,0 +1,208 @@
import argparse
import os
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, logging, set_seed
from trl import SFTTrainer
from trl.trainer import ConstantLengthDataset
"""
Fine-Tune Llama-7b on SE paired dataset
"""
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="")
parser.add_argument("--dataset_name", type=str, default="lvwerra/stack-exchange-paired")
parser.add_argument("--subset", type=str, default="data/finetune")
parser.add_argument("--split", type=str, default="train")
parser.add_argument("--size_valid_set", type=int, default=4000)
parser.add_argument("--streaming", action="store_true")
parser.add_argument("--shuffle_buffer", type=int, default=5000)
parser.add_argument("--seq_length", type=int, default=1024)
parser.add_argument("--max_steps", type=int, default=10000)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
parser.add_argument("--eos_token_id", type=int, default=49152)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--lr_scheduler_type", type=str, default="cosine")
parser.add_argument("--num_warmup_steps", type=int, default=100)
parser.add_argument("--weight_decay", type=float, default=0.05)
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--fp16", action="store_true", default=False)
parser.add_argument("--bf16", action="store_true", default=False)
parser.add_argument("--gradient_checkpointing", action="store_true", default=False)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--num_workers", type=int, default=None)
parser.add_argument("--output_dir", type=str, default="./checkpoints")
parser.add_argument("--log_freq", default=1, type=int)
parser.add_argument("--eval_freq", default=1000, type=int)
parser.add_argument("--save_freq", default=1000, type=int)
return parser.parse_args()
def chars_token_ratio(dataset, tokenizer, nb_examples=400):
"""
Estimate the average number of characters per token in the dataset.
"""
total_characters, total_tokens = 0, 0
for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
text = prepare_sample_text(example)
total_characters += len(text)
if tokenizer.is_fast:
total_tokens += len(tokenizer(text).tokens())
else:
total_tokens += len(tokenizer.tokenize(text))
return total_characters / total_tokens
def print_trainable_parameters(model):
"""
Prints the number of trainable parameters in the model.
"""
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
all_param += param.numel()
if param.requires_grad:
trainable_params += param.numel()
print(
f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
)
def prepare_sample_text(example):
"""Prepare the text from a sample of the dataset."""
text = f"Question: {example['question']}\n\nAnswer: {example['response_j']}"
return text
def create_datasets(tokenizer, args):
dataset = load_dataset(
args.dataset_name,
data_dir=args.subset,
split=args.split,
use_auth_token=True,
num_proc=args.num_workers if not args.streaming else None,
streaming=args.streaming,
)
if args.streaming:
print("Loading the dataset in streaming mode")
valid_data = dataset.take(args.size_valid_set)
train_data = dataset.skip(args.size_valid_set)
train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=args.seed)
else:
dataset = dataset.train_test_split(test_size=0.005, seed=args.seed)
train_data = dataset["train"]
valid_data = dataset["test"]
print(f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}")
chars_per_token = chars_token_ratio(train_data, tokenizer)
print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
train_dataset = ConstantLengthDataset(
tokenizer,
train_data,
formatting_func=prepare_sample_text,
infinite=True,
seq_length=args.seq_length,
chars_per_token=chars_per_token,
)
valid_dataset = ConstantLengthDataset(
tokenizer,
valid_data,
formatting_func=prepare_sample_text,
infinite=False,
seq_length=args.seq_length,
chars_per_token=chars_per_token,
)
return train_dataset, valid_dataset
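# Note: ConstantLengthDataset streams the formatted samples, tokenizes them and packs the tokens
# into fixed-size blocks of `seq_length`, using `chars_per_token` to estimate how much raw text to
# buffer per block; `infinite=True` makes the train split cycle indefinitely, as expected by the
# step-based training loop.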
def run_training(args, train_data, val_data):
print("Loading the model")
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
train_data.start_iteration = 0
print("Starting main loop")
training_args = TrainingArguments(
output_dir=args.output_dir,
dataloader_drop_last=True,
evaluation_strategy="steps",
max_steps=args.max_steps,
eval_steps=args.eval_freq,
save_steps=args.save_freq,
logging_steps=args.log_freq,
per_device_train_batch_size=args.batch_size,
per_device_eval_batch_size=args.batch_size,
learning_rate=args.learning_rate,
lr_scheduler_type=args.lr_scheduler_type,
warmup_steps=args.num_warmup_steps,
gradient_accumulation_steps=args.gradient_accumulation_steps,
gradient_checkpointing=args.gradient_checkpointing,
fp16=args.fp16,
bf16=args.bf16,
weight_decay=args.weight_decay,
run_name="llama-7b-finetuned",
report_to="wandb",
ddp_find_unused_parameters=False,
)
model = AutoModelForCausalLM.from_pretrained(
args.model_path, load_in_8bit=True, device_map={"": Accelerator().process_index}
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_data,
eval_dataset=val_data,
peft_config=lora_config,
packing=True,
)
print_trainable_parameters(trainer.model)
print("Training...")
trainer.train()
print("Saving last checkpoint of the model")
trainer.model.save_pretrained(os.path.join(args.output_dir, "final_checkpoint/"))
def main(args):
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
train_dataset, eval_dataset = create_datasets(tokenizer, args)
run_training(args, train_dataset, eval_dataset)
if __name__ == "__main__":
args = get_args()
assert args.model_path != "", "Please provide the llama model path"
set_seed(args.seed)
os.makedirs(args.output_dir, exist_ok=True)
logging.set_verbosity_error()
main(args)

View File

@ -0,0 +1,76 @@
# DPO pipeline for the creation of StackLlaMa 2: a Stack exchange llama-v2-7b model
## Prerequisites
Install all the dependencies in the `requirements.txt`:
```
$ pip install -U -r requirements.txt
```
Since we will use `accelerate` for training, make sure to run:
```
$ accelerate config
```
## Training
There were two main steps to the DPO training process:
1. Supervised fine-tuning of the base llama-v2-7b model to create llama-v2-7b-se:
```
accelerate launch examples/research_projects/stack_llama_2/scripts/sft_llama2.py \
--output_dir="./sft" \
--max_steps=500 \
--logging_steps=10 \
--save_steps=10 \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing=False \
--group_by_length=False \
--learning_rate=1e-4 \
--lr_scheduler_type="cosine" \
--warmup_steps=100 \
--weight_decay=0.05 \
--optim="paged_adamw_32bit" \
--bf16=True \
--remove_unused_columns=False \
--run_name="sft_llama2" \
--report_to="wandb"
```
2. Run the DPO trainer using the model saved by the previous step:
```
accelerate launch examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
--model_name_or_path="sft/final_checkpoint" \
--output_dir="dpo"
```
## Merging the adaptors
To merge the adaptors into the base model we can use the `merge_peft_adapter.py` helper script that comes with TRL:
```
python examples/research_projects/stack_llama/scripts/merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-7b-hf" --adapter_model_name="dpo/final_checkpoint/" --output_name="stack-llama-2"
```
which will also push the merged model to your Hugging Face Hub account.
## Running the model
We can load the DPO-trained LoRA adaptors saved by the DPO training step via:
```py
import torch

from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
"dpo/final_checkpoint",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_4bit=True,
)
model.generate(...)
```
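For a quick sanity check, a minimal sketch of an end-to-end generation is shown below; it assumes the base `meta-llama/Llama-2-7b-hf` tokenizer used during training, a CUDA-capable machine, and a prompt following the `"Question: ...\n\nAnswer: "` template used throughout the pipeline:
```py
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "dpo/final_checkpoint",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prompts follow the same "Question: ...\n\nAnswer: " template used for SFT and DPO.
prompt = "Question: How do I reverse a list in Python?\n\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```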

View File

@ -0,0 +1,223 @@
# 0. imports
import os
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments
from trl import DPOTrainer
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
The arguments for the DPO training script.
"""
# data parameters
beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})
# training parameters
model_name_or_path: Optional[str] = field(
default="../sft/results/final_checkpoint",
metadata={"help": "the location of the SFT model name or path"},
)
learning_rate: Optional[float] = field(default=5e-4, metadata={"help": "optimizer learning rate"})
lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"})
warmup_steps: Optional[int] = field(default=100, metadata={"help": "the number of warmup steps"})
weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"})
optimizer_type: Optional[str] = field(default="paged_adamw_32bit", metadata={"help": "the optimizer type"})
per_device_train_batch_size: Optional[int] = field(default=4, metadata={"help": "train batch size per device"})
per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"})
gradient_accumulation_steps: Optional[int] = field(
default=4, metadata={"help": "the number of gradient accumulation steps"}
)
gradient_checkpointing: Optional[bool] = field(
default=True, metadata={"help": "whether to use gradient checkpointing"}
)
lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})
max_prompt_length: Optional[int] = field(default=512, metadata={"help": "the maximum prompt length"})
max_length: Optional[int] = field(default=1024, metadata={"help": "the maximum sequence length"})
max_steps: Optional[int] = field(default=1000, metadata={"help": "max number of training steps"})
logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"})
save_steps: Optional[int] = field(default=100, metadata={"help": "the saving frequency"})
eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"})
output_dir: Optional[str] = field(default="./results", metadata={"help": "the output directory"})
log_freq: Optional[int] = field(default=1, metadata={"help": "the logging frequency"})
# instrumentation
sanity_check: Optional[bool] = field(default=False, metadata={"help": "only train on 1000 samples"})
report_to: Optional[str] = field(
default="wandb",
metadata={
"help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
'`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
},
)
# debug argument for distributed training
ignore_bias_buffers: Optional[bool] = field(
default=False,
metadata={
"help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
"https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
},
)
def get_stack_exchange_paired(
data_dir: str = "data/rl",
sanity_check: bool = False,
cache_dir: str = None,
num_proc=24,
) -> Dataset:
"""Load the stack-exchange-paired dataset from Hugging Face and convert it to the necessary format.
The dataset is converted to a dictionary with the following structure:
{
'prompt': List[str],
'chosen': List[str],
'rejected': List[str],
}
Prompts are structured as follows:
"Question: " + <prompt> + "\n\nAnswer: "
"""
dataset = load_dataset(
"lvwerra/stack-exchange-paired",
split="train",
cache_dir=cache_dir,
data_dir=data_dir,
)
original_columns = dataset.column_names
if sanity_check:
dataset = dataset.select(range(min(len(dataset), 1000)))
def return_prompt_and_responses(samples) -> Dict[str, str]:
return {
"prompt": ["Question: " + question + "\n\nAnswer: " for question in samples["question"]],
"chosen": samples["response_j"],
"rejected": samples["response_k"],
}
return dataset.map(
return_prompt_and_responses,
batched=True,
num_proc=num_proc,
remove_columns=original_columns,
)
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
# 1. load a pretrained model
model = AutoModelForCausalLM.from_pretrained(
script_args.model_name_or_path,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_4bit=True,
)
model.config.use_cache = False
if script_args.ignore_bias_buffers:
# torch distributed hack
model._ddp_params_and_buffers_to_ignore = [
name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
]
model_ref = AutoModelForCausalLM.from_pretrained(
script_args.model_name_or_path,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
# 2. Load the Stack-exchange paired dataset
train_dataset = get_stack_exchange_paired(data_dir="data/rl", sanity_check=script_args.sanity_check)
train_dataset = train_dataset.filter(
lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length
and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length
)
# 3. Load evaluation dataset
eval_dataset = get_stack_exchange_paired(data_dir="data/evaluation", sanity_check=True)
eval_dataset = eval_dataset.filter(
lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length
and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length
)
# 4. initialize training arguments:
training_args = TrainingArguments(
per_device_train_batch_size=script_args.per_device_train_batch_size,
per_device_eval_batch_size=script_args.per_device_eval_batch_size,
max_steps=script_args.max_steps,
logging_steps=script_args.logging_steps,
save_steps=script_args.save_steps,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
gradient_checkpointing=script_args.gradient_checkpointing,
learning_rate=script_args.learning_rate,
evaluation_strategy="steps",
eval_steps=script_args.eval_steps,
output_dir=script_args.output_dir,
report_to=script_args.report_to,
lr_scheduler_type=script_args.lr_scheduler_type,
warmup_steps=script_args.warmup_steps,
optim=script_args.optimizer_type,
bf16=True,
remove_unused_columns=False,
run_name="dpo_llama2",
)
peft_config = LoraConfig(
r=script_args.lora_r,
lora_alpha=script_args.lora_alpha,
lora_dropout=script_args.lora_dropout,
target_modules=[
"q_proj",
"v_proj",
"k_proj",
"out_proj",
"fc_in",
"fc_out",
"wte",
],
bias="none",
task_type="CAUSAL_LM",
)
# 5. initialize the DPO trainer
dpo_trainer = DPOTrainer(
model,
model_ref,
args=training_args,
beta=script_args.beta,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_prompt_length=script_args.max_prompt_length,
max_length=script_args.max_length,
)
# 6. train
dpo_trainer.train()
dpo_trainer.save_model(script_args.output_dir)
# 7. save
output_dir = os.path.join(script_args.output_dir, "final_checkpoint")
dpo_trainer.model.save_pretrained(output_dir)

View File

@ -0,0 +1,7 @@
transformers
trl
peft
accelerate
datasets
bitsandbytes
wandb

View File

@ -0,0 +1,187 @@
# Fine-Tune Llama2-7b on SE paired dataset
import os
from dataclasses import dataclass, field
from typing import Optional
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments
from trl import SFTTrainer
from trl.import_utils import is_npu_available, is_xpu_available
from trl.trainer import ConstantLengthDataset
@dataclass
class ScriptArguments:
model_name: Optional[str] = field(default="meta-llama/Llama-2-7b-hf", metadata={"help": "the model name"})
dataset_name: Optional[str] = field(default="lvwerra/stack-exchange-paired", metadata={"help": "the dataset name"})
subset: Optional[str] = field(default="data/finetune", metadata={"help": "the subset to use"})
split: Optional[str] = field(default="train", metadata={"help": "the split to use"})
size_valid_set: Optional[int] = field(default=4000, metadata={"help": "the size of the validation set"})
streaming: Optional[bool] = field(default=True, metadata={"help": "whether to stream the dataset"})
shuffle_buffer: Optional[int] = field(default=5000, metadata={"help": "the shuffle buffer size"})
seq_length: Optional[int] = field(default=1024, metadata={"help": "the sequence length"})
num_workers: Optional[int] = field(default=4, metadata={"help": "the number of workers"})
packing: Optional[bool] = field(default=True, metadata={"help": "whether to use packing for SFTTrainer"})
# LoraConfig
lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})
parser = HfArgumentParser((ScriptArguments, TrainingArguments))
script_args, training_args = parser.parse_args_into_dataclasses()
peft_config = LoraConfig(
r=script_args.lora_r,
lora_alpha=script_args.lora_alpha,
lora_dropout=script_args.lora_dropout,
target_modules=["q_proj", "v_proj"],
bias="none",
task_type="CAUSAL_LM",
)
if training_args.group_by_length and script_args.packing:
raise ValueError("Cannot use both packing and group by length")
# `gradient_checkpointing` was True by default until `1f3314`, but it's actually not used.
# `gradient_checkpointing=True` will cause a `Variable._execution_engine.run_backward` error.
if training_args.gradient_checkpointing:
raise ValueError("gradient_checkpointing not supported")
def chars_token_ratio(dataset, tokenizer, nb_examples=400):
"""
Estimate the average number of characters per token in the dataset.
"""
total_characters, total_tokens = 0, 0
for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
text = prepare_sample_text(example)
total_characters += len(text)
if tokenizer.is_fast:
total_tokens += len(tokenizer(text).tokens())
else:
total_tokens += len(tokenizer.tokenize(text))
return total_characters / total_tokens
def print_trainable_parameters(model):
"""
Prints the number of trainable parameters in the model.
"""
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
all_param += param.numel()
if param.requires_grad:
trainable_params += param.numel()
print(
f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
)
def prepare_sample_text(example):
"""Prepare the text from a sample of the dataset."""
text = f"Question: {example['question']}\n\nAnswer: {example['response_j']}"
return text
def create_datasets(tokenizer, args):
dataset = load_dataset(
args.dataset_name,
data_dir=args.subset,
split=args.split,
use_auth_token=True,
num_proc=args.num_workers if not args.streaming else None,
streaming=args.streaming,
)
if args.streaming:
print("Loading the dataset in streaming mode")
valid_data = dataset.take(args.size_valid_set)
train_data = dataset.skip(args.size_valid_set)
train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=None)
else:
dataset = dataset.train_test_split(test_size=0.005, seed=None)
train_data = dataset["train"]
valid_data = dataset["test"]
print(f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}")
chars_per_token = chars_token_ratio(train_data, tokenizer)
print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
train_dataset = ConstantLengthDataset(
tokenizer,
train_data,
formatting_func=prepare_sample_text,
infinite=True,
seq_length=args.seq_length,
chars_per_token=chars_per_token,
)
valid_dataset = ConstantLengthDataset(
tokenizer,
valid_data,
formatting_func=prepare_sample_text,
infinite=False,
seq_length=args.seq_length,
chars_per_token=chars_per_token,
)
return train_dataset, valid_dataset
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
script_args.model_name,
quantization_config=bnb_config,
device_map={"": Accelerator().local_process_index},
trust_remote_code=True,
use_auth_token=True,
)
base_model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
train_dataset, eval_dataset = create_datasets(tokenizer, script_args)
trainer = SFTTrainer(
model=base_model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
peft_config=peft_config,
packing=script_args.packing,
max_seq_length=None,
tokenizer=tokenizer,
args=training_args,
)
trainer.train()
trainer.save_model(training_args.output_dir)
output_dir = os.path.join(training_args.output_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
# Free memory for merging weights
del base_model
if is_xpu_available():
torch.xpu.empty_cache()
elif is_npu_available():
torch.npu.empty_cache()
else:
torch.cuda.empty_cache()
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
output_merged_dir = os.path.join(training_args.output_dir, "final_merged_checkpoint")
model.save_pretrained(output_merged_dir, safe_serialization=True)

View File

@ -0,0 +1,119 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
import numpy as np
import torch
from transformers import AutoTokenizer, load_tool
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, TextEnvironment
def generate_data(n):
"""Generate random arithmetic tasks and answers."""
tasks, answers = [], []
for _ in range(n):
a = np.random.randint(0, 50)
b = np.random.randint(0, 50)
op = np.random.choice(["-", "+", "*"])
tasks.append(f"\n\nWhat is {a} {op} {b}?")
if op == "-":
answers.append(a - b)
elif op == "+":
answers.append(a + b)
else:
answers.append(a * b)
return tasks, answers
def exact_match_reward(responses, answers=None):
"""Reward if generated response contains correct answer."""
rewards = []
pattern = r"Result\s*=\s*(-?\d+(?:\.\d+)?)\s*<submit>" # generated by chatGPT
for response, answer in zip(responses, answers):
reward = 0.0
predicted_number = None
match_pattern = re.findall(pattern, response)
if match_pattern:
predicted_number = float(match_pattern[0])
if predicted_number is not None:
if np.abs(predicted_number - answer) < 0.01:
reward += 1.0
rewards.append(torch.tensor(reward))
return rewards
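# Example: a response ending in "Result = 10 <submit>" matches the pattern with
# predicted_number = 10.0, so for answer 10 the reward is 1.0; responses without a parsable
# "Result=...<submit>" span, or with a number off by more than 0.01, receive 0.0.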
# set up models
model_id = "gpt2"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# system prompt
prompt = """\
What is 13-3?
<request><SimpleCalculatorTool>13-3<call>10.0<response>
Result=10<submit>
What is 4*3?
<request><SimpleCalculatorTool>4*3<call>12.0<response>
Result=12<submit>"""
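# The few-shot prompt above encodes the TextEnvironment tool protocol: the model emits
# <request><ToolName>ARGS<call>, the environment runs the tool and appends its output followed by
# <response>, and the episode ends once the model produces <submit>.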
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"eos_token_id": -1,
"max_new_tokens": 32,
}
# trainer
ppo_config = PPOConfig(
batch_size=256,
learning_rate=1.41e-5,
mini_batch_size=64,
log_with="wandb",
)
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
# text env
text_env = TextEnvironment(
model,
tokenizer,
{"SimpleCalculatorTool": load_tool("ybelkada/simple-calculator")},
exact_match_reward,
prompt,
generation_kwargs=generation_kwargs,
)
# main training loop
for step in range(100):
tasks, answers = generate_data(ppo_config.batch_size)
queries, responses, masks, rewards, histories = text_env.run(tasks, answers=answers)
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
response_texts = [tokenizer.decode(response) for response in responses]
query_texts = [tokenizer.decode(query) for query in queries]
texts = {"query": [qt.split("<submit>")[-1].strip() for qt in query_texts], "response": response_texts}
ppo_trainer.log_stats(train_stats, texts, rewards, columns_to_log=["query", "response", "answer"])
ppo_trainer.save_pretrained(model_id + "-calculator")

View File

@ -0,0 +1,194 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
from dataclasses import dataclass, field
from typing import Optional
import numpy as np
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, HfArgumentParser, load_tool
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, TextEnvironment
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
@dataclass
class ScriptArguments:
model_name: Optional[str] = field(default="bigcode/starcoderbase", metadata={"help": "the model name"})
learning_rate: Optional[float] = field(default=1e-5, metadata={"help": "the learning rate"})
mini_batch_size: Optional[int] = field(default=1, metadata={"help": "the PPO minibatch size"})
batch_size: Optional[int] = field(default=32, metadata={"help": "the batch size"})
gradient_accumulation_steps: Optional[int] = field(
default=16, metadata={"help": "the number of gradient accumulation steps"}
)
max_new_tokens: Optional[int] = field(default=256, metadata={"help": "max number of generated tokens per turn"})
ppo_epochs: Optional[int] = field(default=1, metadata={"help": "max number of ppo epochs"})
n_epochs: Optional[int] = field(default=32, metadata={"help": "the number of training epochs"})
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
def exact_match_reward(responses, answers=None):
"""Reward if generated response contains correct answer."""
rewards = []
pattern = r"Result\s*=\s*(-?\d+(?:\.\d+)?)\s*<submit>" # generated by chatGPT
for response, answer in zip(responses, answers):
reward = 0.0
try:
predicted_number = None
match_pattern = re.findall(pattern, response)
if match_pattern:
predicted_number = float(match_pattern[0])
if predicted_number is not None:
if np.abs((predicted_number - float(answer))) < 0.1:
reward += 1.0
except: # noqa
pass
rewards.append(torch.tensor(reward))
return rewards
def evaluate(test_dataloader, text_env, ppo_trainer):
test_rewards = []
for test_batch in test_dataloader:
_, _, _, rewards, _ = text_env.run(test_batch["query"], answers=test_batch["answer"])
test_rewards.extend(rewards)
test_rewards = ppo_trainer.accelerator.gather_for_metrics(
torch.stack(test_rewards).to(ppo_trainer.accelerator.device)
)
return test_rewards.mean()
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["c_proj", "c_attn", "q_attn"],
)
# set up models
model = AutoModelForCausalLMWithValueHead.from_pretrained(
args.model_name,
use_auth_token=True,
load_in_4bit=True,
peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("gsm8k", "main", split="train")
ds = ds.rename_columns({"question": "query"})
ds = ds.map(lambda x: {"answer": x["answer"].split("#### ")[1]})
ds = ds.select(range(1, len(ds))) # skip the first sample which is used in prompt
ds_test = load_dataset("gsm8k", "main", split="test")
ds_test = ds_test.rename_columns({"question": "query"})
ds_test = ds_test.map(lambda x: {"answer": x["answer"].split("#### ")[1]})
test_dataloader = torch.utils.data.DataLoader(ds_test, batch_size=args.batch_size)
# prompt
prompt = """\
Example of using a Python API to solve math questions.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
<request><PythonInterpreter>
def solution():
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
money_left = money_initial - money_spent
result = money_left
return result
print(solution())
<call>72<response>
Result = 72 <submit>
Q: """
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"eos_token_id": -1,
"max_new_tokens": args.max_new_tokens,
}
# trainer
ppo_config = PPOConfig(
batch_size=args.batch_size,
learning_rate=args.learning_rate,
mini_batch_size=args.mini_batch_size,
ppo_epochs=args.ppo_epochs,
gradient_accumulation_steps=args.gradient_accumulation_steps,
log_with="wandb",
tracker_project_name="trl-gsm8k",
remove_unused_columns=False,
optimize_cuda_cache=True,
)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer, dataset=ds)
test_dataloader = ppo_trainer.accelerator.prepare(test_dataloader)
# text env
text_env = TextEnvironment(
model,
tokenizer,
[load_tool("lvwerra/python-interpreter")],
exact_match_reward,
prompt,
max_turns=2,
generation_kwargs=generation_kwargs,
)
# main training loop
for epoch in range(args.n_epochs):
for step, batch in enumerate(ppo_trainer.dataloader):
if (step == 0) and (epoch % 4 == 0): # evaluate every 4 epochs
reward_mean_test = evaluate(test_dataloader, text_env, ppo_trainer)
else:
reward_mean_test = None
queries, responses, masks, rewards, histories = text_env.run(batch["query"], answers=batch["answer"])
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
# logging
if reward_mean_test is not None:
train_stats["env/reward_mean_test"] = reward_mean_test
texts = {
"query": batch["query"],
"response": [tokenizer.decode(response) for response in responses],
"answer": batch["answer"],
}
ppo_trainer.log_stats(train_stats, texts, rewards, columns_to_log=["query", "response", "answer"])
reward_mean_test = evaluate(test_dataloader, text_env, ppo_trainer)
ppo_trainer.save_pretrained(f"model/{args.model_name}-gsm8k")

View File

@ -0,0 +1,191 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from dataclasses import dataclass, field
from typing import Optional
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, HfArgumentParser, load_tool
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, TextEnvironment
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
@dataclass
class ScriptArguments:
model_name: Optional[str] = field(default="bigcode/starcoderbase", metadata={"help": "the model name"})
log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
learning_rate: Optional[float] = field(default=1e-5, metadata={"help": "the learning rate"})
mini_batch_size: Optional[int] = field(default=1, metadata={"help": "the PPO minibatch size"})
batch_size: Optional[int] = field(default=32, metadata={"help": "the batch size"})
gradient_accumulation_steps: Optional[int] = field(
default=16, metadata={"help": "the number of gradient accumulation steps"}
)
max_new_tokens: Optional[int] = field(default=256, metadata={"help": "max number of generated tokens per turn"})
ppo_epochs: Optional[int] = field(default=1, metadata={"help": "max number of ppo epochs"})
iterations: Optional[int] = field(default=1000, metadata={"help": "the number of iterations"})
seed: Optional[int] = field(default=0, metadata={"help": "the random seed"})
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["c_proj", "c_attn", "q_attn"],
)
# set up models
model = AutoModelForCausalLMWithValueHead.from_pretrained(
args.model_name,
use_auth_token=True,
trust_remote_code=True,
load_in_4bit=True,
peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
# system prompt
prompt = """\
Answer the following question:
Q: In which branch of the arts is Patricia Neary famous?
A: Ballets
A2: <request><Wiki>Patricia Neary<call>Patricia Neary (born October 27, 1942) is an American ballerina, choreographer and ballet director, who has been particularly active in Switzerland. She has also been a highly successful ambassador for the Balanchine Trust, bringing George Balanchine's ballets to 60 cities around the globe.<response>
Result=Ballets<submit>
Q: Who won Super Bowl XX?
A: Chicago Bears
A2: <request><Wiki>Super Bowl XX<call>Super Bowl XX was an American football game between the National Football Conference (NFC) champion Chicago Bears and the American Football Conference (AFC) champion New England Patriots to decide the National Football League (NFL) champion for the 1985 season. The Bears defeated the Patriots by the score of 46–10, capturing their first NFL championship (and Chicago's first overall sports victory) since 1963, three years prior to the birth of the Super Bowl. Super Bowl XX was played on January 26, 1986 at the Louisiana Superdome in New Orleans.<response>
Result=Chicago Bears<submit>
Q: """
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"eos_token_id": -1,
"max_new_tokens": args.max_new_tokens,
}
# trainer
config = PPOConfig(
batch_size=args.batch_size,
model_name=args.model_name,
learning_rate=args.learning_rate,
log_with=args.log_with,
mini_batch_size=args.mini_batch_size,
ppo_epochs=args.ppo_epochs,
gradient_accumulation_steps=args.gradient_accumulation_steps,
seed=args.seed,
optimize_cuda_cache=True,
)
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)
dataset = load_dataset("trivia_qa", "rc", split="train")
local_seed = args.seed + ppo_trainer.accelerator.process_index * 100003 # Prime
dataset = dataset.shuffle(local_seed)
def data_generator():
for i in range(len(dataset)):
yield dataset[i]["question"], [item for item in dataset[i]["answer"]["normalized_aliases"]]
gen = data_generator()
gen = iter(gen)
def generate_data(n):
tasks, answers = [], []
for i in range(n):
q, a = next(gen)
tasks.append(q)
answers.append(a)
return tasks, answers
def exact_match_reward(responses, answers=None):
"""Reward if generated response contains correct answer."""
rewards = []
for response, answer in zip(responses, answers):
reward = 0.0
for a in answer:
if a.lower() in response.lower():
reward += 1.0
break
rewards.append(torch.tensor(reward))
return rewards
# text env
tool = load_tool("vwxyzjn/pyserini-wikipedia-kilt-doc")
# limit the amount of tokens
tool_fn = lambda x: tool(x).split("\n")[1][:600] # noqa
text_env = TextEnvironment(
model,
tokenizer,
{"Wiki": tool_fn},
exact_match_reward,
prompt,
generation_kwargs=generation_kwargs,
max_tool_reponse=400,
)
def print_trainable_parameters(model):
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
all_param += param.numel()
if param.requires_grad:
trainable_params += param.numel()
print(
f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
)
print_trainable_parameters(model)
# main training loop
for i in range(args.iterations):
tasks, answers = generate_data(config.batch_size)
queries, responses, masks, rewards, histories = text_env.run(tasks, answers=answers)
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
response_texts = [tokenizer.decode(response) for response in responses]
query_texts = [tokenizer.decode(query) for query in queries]
texts = {
"query": [qt.split("<submit>")[-1].strip() for qt in query_texts],
"response": response_texts,
"answer": [", ".join(item) for item in answers],
}
all_rewards = ppo_trainer.accelerator.gather(torch.tensor(rewards, device=ppo_trainer.accelerator.device))
ppo_trainer.log_stats(
train_stats, texts, [item for item in all_rewards], columns_to_log=["query", "response", "answer"]
)
if i % 100 == 0:
ppo_trainer.save_pretrained(f"models/{args.model_name}_{args.seed}_{i}_triviaqa")

View File

@ -0,0 +1,7 @@
# De-detoxifying language models
To run this code, do the following:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file {CONFIG} examples/research_projects/toxicity/scripts/gpt-j-6b-toxicity.py --log_with wandb
```

View File

@ -1,23 +1,26 @@
import numpy as np
import csv
import argparse
from tqdm import tqdm
import torch
import csv
import evaluate
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
toxicity = evaluate.load("ybelkada/toxicity", 'DaNLP/da-electra-hatespeech-detection', module_type="measurement")
from trl.import_utils import is_npu_available, is_xpu_available
toxicity = evaluate.load("ybelkada/toxicity", "DaNLP/da-electra-hatespeech-detection", module_type="measurement")
ds = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="test")
parser = argparse.ArgumentParser(description='Evaluate de-toxified models')
parser.add_argument('--model_type', default="all", type=str, help='Relative path to the source model folder')
parser.add_argument('--output_file', default="toxicity.csv", type=str, help='Relative path to the source model folder')
parser.add_argument('--batch_size', default=64, type=int, help='Batch size')
parser.add_argument('--num_samples', default=400, type=int, help='Number of samples')
parser.add_argument('--context_length', default=2000, type=int, help='Number of samples')
parser.add_argument('--max_new_tokens', default=30, type=int, help='Max new tokens for generation')
parser = argparse.ArgumentParser(description="Evaluate de-toxified models")
parser.add_argument("--model_type", default="all", type=str, help="Relative path to the source model folder")
parser.add_argument("--output_file", default="toxicity.csv", type=str, help="Relative path to the source model folder")
parser.add_argument("--batch_size", default=64, type=int, help="Batch size")
parser.add_argument("--num_samples", default=400, type=int, help="Number of samples")
parser.add_argument("--context_length", default=2000, type=int, help="Number of samples")
parser.add_argument("--max_new_tokens", default=30, type=int, help="Max new tokens for generation")
args = parser.parse_args()
@ -40,33 +43,36 @@ elif args.model_type == "gpt-neo":
elif args.model_type == "gpt-j":
MODELS_TO_TEST = [
"ybelkada/gpt-j-6b-sharded-bf16",
"ybelkada/gpt-j-6b-detox",
"ybelkada/gpt-j-6b-detox",
]
else:
MODELS_TO_TEST = [
args.model_type
]
MODELS_TO_TEST = [args.model_type]
NUM_SAMPLES = args.num_samples
BATCH_SIZE = args.batch_size
output_file = args.output_file
max_new_tokens = args.max_new_tokens
context_length = args.context_length
device = torch.cuda.current_device()
if is_xpu_available():
device = torch.xpu.current_device()
elif is_npu_available():
device = torch.npu.current_device()
else:
device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
# consider only toxic prompts
ds = ds.filter(lambda x: x['label'] == 1)
ds = ds.filter(lambda x: x["label"] == 1)
toxicities = {}
# open a csv file
file = open(f'{output_file}', 'w', newline='')
file = open(f"{output_file}", "w", newline="")
writer = csv.writer(file)
# add first rows
writer.writerow(['model_id', 'mean_toxicity', 'std_toxicity'])
writer.writerow(["model_id", "mean_toxicity", "std_toxicity"])
for model_id in tqdm(MODELS_TO_TEST):
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={'':device}, torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": device}, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
@ -75,29 +81,28 @@ for model_id in tqdm(MODELS_TO_TEST):
for i, example in enumerate(ds):
# set seed
torch.manual_seed(42)
input_text = example['comment_text']
input_text = example["comment_text"]
input_texts.append(input_text[:2000])
if i > NUM_SAMPLES:
break
if (i+1)%BATCH_SIZE == 0:
if (i + 1) % BATCH_SIZE == 0:
inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)
inputs.input_ids = inputs.input_ids[:context_length]
inputs.attention_mask = inputs.attention_mask[:context_length]
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens, use_cache=True)
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
generated_texts = [generated_text.replace(input_texts[i], "") for i, generated_text in enumerate(generated_texts)]
generated_texts = [
generated_text.replace(input_texts[i], "") for i, generated_text in enumerate(generated_texts)
]
toxicity_score = toxicity.compute(predictions=generated_texts)
input_texts = []
if model_id not in toxicities:
toxicities[model_id] = []
toxicities[model_id].extend(toxicity_score['toxicity'])
toxicities[model_id].extend(toxicity_score["toxicity"])
# last batch
inputs = tokenizer(input_texts, return_tensors="pt", padding=True).to(device)
@ -105,7 +110,7 @@ for model_id in tqdm(MODELS_TO_TEST):
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
generated_texts = [generated_text.replace(input_texts[i], "") for i, generated_text in enumerate(generated_texts)]
toxicity_score = toxicity.compute(predictions=generated_texts)
toxicities[model_id].extend(toxicity_score['toxicity'])
toxicities[model_id].extend(toxicity_score["toxicity"])
# compute mean & std using np
mean = np.mean(toxicities[model_id])
@ -118,7 +123,12 @@ for model_id in tqdm(MODELS_TO_TEST):
print(f"Model: {model_id} - Mean: {mean} - Std: {std}")
model = None
torch.cuda.empty_cache()
if is_xpu_available():
torch.xpu.empty_cache()
elif is_npu_available():
torch.npu.empty_cache()
else:
torch.cuda.empty_cache()
# close file
file.close()
file.close()

View File

@ -12,21 +12,27 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import torch
from datasets import load_dataset
from torch.optim import Adam
from tqdm import tqdm
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
HfArgumentParser,
RobertaForSequenceClassification,
RobertaTokenizer,
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model, set_seed
from trl.core import LengthSampler
tqdm.pandas()
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, set_seed
from trl import create_reference_model
from trl.core import LengthSampler
########################################################################
# This is a fully working simple example to use trl with accelerate.
#
@ -44,17 +50,45 @@ from trl.core import LengthSampler
#
########################################################################
# We first define the configuration of the experiment, defining the model, the dataset,
# the training parameters, and the PPO parameters.
# Check the default arguments in the `PPOConfig` class for more details.
# If you want to log with tensorboard, add the kwarg
# `accelerator_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
# `project_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
@dataclass
class ScriptArguments:
"""
The name of the Causal LM model we wish to fine-tune with PPO
"""
# NOTE: gpt2 models use Conv1D instead of Linear layers, which are not yet supported in 8-bit mode;
# models like the gpt-neo* family are more suitable.
model_name: Optional[str] = field(default="ybelkada/gpt-j-6b-sharded-bf16", metadata={"help": "the model name"})
log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
learning_rate: Optional[float] = field(default=(1.47e-5) * 2, metadata={"help": "the learning rate"})
mini_batch_size: Optional[int] = field(default=4, metadata={"help": "the PPO minibatch size"})
batch_size: Optional[int] = field(default=16, metadata={"help": "the batch size"})
gradient_accumulation_steps: Optional[int] = field(
default=1, metadata={"help": "the number of gradient accumulation steps"}
)
model_save_path: Optional[str] = field(
default="./gpt-j-6B-detoxified-long-context-26-shl-1e4-final",
metadata={"help": "the path to save the model"},
)
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
config = PPOConfig(
model_name="ybelkada/gpt-j-6b-sharded-bf16",
learning_rate=(1.47e-5) * 2,
log_with="wandb",
batch_size=32,
forward_batch_size=1,
model_name=script_args.model_name,
learning_rate=script_args.learning_rate,
log_with=script_args.log_with,
ppo_epochs=100,
mini_batch_size=script_args.mini_batch_size,
batch_size=script_args.batch_size,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
)
@ -137,7 +171,13 @@ tokenizer.pad_token = tokenizer.eos_token
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(
config, model, ref_model=ref_model, tokenizer=tokenizer, dataset=dataset, data_collator=collator
config,
model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset,
data_collator=collator,
optimizer=optimizer,
)
# We then build the reward pipeline, we will use the toxicity model to compute the reward.
@ -164,12 +204,12 @@ output_min_length = 20
output_max_length = 30
output_length_sampler = LengthSampler(output_min_length, output_max_length)
model_save_path = "/mnt/disks/younes-disk/models/gpt-j-6B-detoxified-long-context-26-shl-1e4-final"
model_save_path = script_args.model_save_path
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
query_tensors = batch["input_ids"]
#### Get response from gpt2
# Get response from the policy model
response_tensors = []
for query in query_tensors:
gen_len = output_length_sampler()
@ -178,7 +218,7 @@ for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
response_tensors.append(response.squeeze()[-gen_len:])
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
#### Compute sentiment score
# Compute sentiment score # noqa
texts = batch["response"]
toxicity_inputs = toxicity_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(
ppo_trainer.accelerator.device
@ -188,7 +228,7 @@ for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
rewards = [torch.tensor(output) for output in toxicity_labels]
#### Run PPO step
# Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
@ -196,4 +236,3 @@ for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
if epoch % 100 == 0:
if ppo_trainer.accelerator.is_main_process:
ppo_trainer.save_pretrained(model_save_path)
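The main change in this script is replacing hard-coded hyperparameters with an `HfArgumentParser`-driven `ScriptArguments` dataclass. A trimmed-down sketch of that wiring follows, with a hypothetical `TinyArgs` dataclass and a hypothetical `./tb_logs` path standing in for the real arguments; tensorboard logging goes through `project_kwargs`, as the updated comment notes.

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser
from trl import PPOConfig

@dataclass
class TinyArgs:  # hypothetical, trimmed-down stand-in for the ScriptArguments above
    model_name: Optional[str] = field(default="ybelkada/gpt-j-6b-sharded-bf16")
    learning_rate: Optional[float] = field(default=(1.47e-5) * 2)

args = HfArgumentParser(TinyArgs).parse_args_into_dataclasses()[0]
config = PPOConfig(
    model_name=args.model_name,
    learning_rate=args.learning_rate,
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./tb_logs"},  # hypothetical log dir
)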

examples/scripts/ddpo.py Normal file

@ -0,0 +1,211 @@
# Copyright 2023 metric-space, The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from dataclasses import dataclass, field
import numpy as np
import torch
import torch.nn as nn
import tyro
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError
from transformers import CLIPModel, CLIPProcessor
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline
from trl.import_utils import is_npu_available, is_xpu_available
@dataclass
class ScriptArguments:
hf_user_access_token: str
pretrained_model: str = "runwayml/stable-diffusion-v1-5"
"""the pretrained model to use"""
pretrained_revision: str = "main"
"""the pretrained model revision to use"""
hf_hub_model_id: str = "ddpo-finetuned-stable-diffusion"
"""HuggingFace repo to save model weights to"""
hf_hub_aesthetic_model_id: str = "trl-lib/ddpo-aesthetic-predictor"
"""HuggingFace model ID for aesthetic scorer model weights"""
hf_hub_aesthetic_model_filename: str = "aesthetic-model.pth"
"""HuggingFace model filename for aesthetic scorer model weights"""
use_lora: bool = True
"""Whether to use LoRA."""
ddpo_config: DDPOConfig = field(
default_factory=lambda: DDPOConfig(
num_epochs=200,
train_gradient_accumulation_steps=1,
sample_num_steps=50,
sample_batch_size=6,
train_batch_size=3,
sample_num_batches_per_epoch=4,
per_prompt_stat_tracking=True,
per_prompt_stat_tracking_buffer_size=32,
tracker_project_name="stable_diffusion_training",
log_with="wandb",
project_kwargs={
"logging_dir": "./logs",
"automatic_checkpoint_naming": True,
"total_limit": 5,
"project_dir": "./save",
},
)
)
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(768, 1024),
nn.Dropout(0.2),
nn.Linear(1024, 128),
nn.Dropout(0.2),
nn.Linear(128, 64),
nn.Dropout(0.1),
nn.Linear(64, 16),
nn.Linear(16, 1),
)
@torch.no_grad()
def forward(self, embed):
return self.layers(embed)
class AestheticScorer(torch.nn.Module):
"""
This model attempts to predict the aesthetic score of an image. The aesthetic score
is a numerical approximation of how much a specific image is liked by humans on average.
This is from https://github.com/christophschuhmann/improved-aesthetic-predictor
"""
def __init__(self, *, dtype, model_id, model_filename):
super().__init__()
self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
self.mlp = MLP()
try:
cached_path = hf_hub_download(model_id, model_filename)
except EntryNotFoundError:
cached_path = os.path.join(model_id, model_filename)
state_dict = torch.load(cached_path, map_location=torch.device("cpu"))
self.mlp.load_state_dict(state_dict)
self.dtype = dtype
self.eval()
@torch.no_grad()
def __call__(self, images):
device = next(self.parameters()).device
inputs = self.processor(images=images, return_tensors="pt")
inputs = {k: v.to(self.dtype).to(device) for k, v in inputs.items()}
embed = self.clip.get_image_features(**inputs)
# normalize embedding
embed = embed / torch.linalg.vector_norm(embed, dim=-1, keepdim=True)
return self.mlp(embed).squeeze(1)
def aesthetic_scorer(hub_model_id, model_filename):
scorer = AestheticScorer(
model_id=hub_model_id,
model_filename=model_filename,
dtype=torch.float32,
)
if is_npu_available():
scorer = scorer.npu()
elif is_xpu_available():
scorer = scorer.xpu()
else:
scorer = scorer.cuda()
def _fn(images, prompts, metadata):
images = (images * 255).round().clamp(0, 255).to(torch.uint8)
scores = scorer(images)
return scores, {}
return _fn
# list of example prompts to feed stable diffusion
animals = [
"cat",
"dog",
"horse",
"monkey",
"rabbit",
"zebra",
"spider",
"bird",
"sheep",
"deer",
"cow",
"goat",
"lion",
"frog",
"chicken",
"duck",
"goose",
"bee",
"pig",
"turkey",
"fly",
"llama",
"camel",
"bat",
"gorilla",
"hedgehog",
"kangaroo",
]
def prompt_fn():
return np.random.choice(animals), {}
def image_outputs_logger(image_data, global_step, accelerate_logger):
# For the sake of this example, we will only log the last batch of images
# and associated data
result = {}
images, prompts, _, rewards, _ = image_data[-1]
for i, image in enumerate(images):
prompt = prompts[i]
reward = rewards[i].item()
result[f"{prompt:.25} | {reward:.2f}"] = image.unsqueeze(0)
accelerate_logger.log_images(
result,
step=global_step,
)
if __name__ == "__main__":
args = tyro.cli(ScriptArguments)
pipeline = DefaultDDPOStableDiffusionPipeline(
args.pretrained_model, pretrained_model_revision=args.pretrained_revision, use_lora=args.use_lora
)
trainer = DDPOTrainer(
args.ddpo_config,
aesthetic_scorer(args.hf_hub_aesthetic_model_id, args.hf_hub_aesthetic_model_filename),
prompt_fn,
pipeline,
image_samples_hook=image_outputs_logger,
)
trainer.train()
trainer.push_to_hub(args.hf_hub_model_id, token=args.hf_user_access_token)
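The reward callable returned by `aesthetic_scorer` takes `(images, prompts, metadata)` and returns `(rewards, metadata)`. As a hedged illustration of that contract only, here is a toy reward that scores images by mean brightness; the assumed `(batch, channels, height, width)` float layout mirrors what the scorer above receives, and the function name is invented.

import torch

def brightness_reward(images, prompts, metadata):
    # images: float tensor of shape (batch, channels, height, width), values roughly in [0, 1]
    # Return one scalar reward per image plus a metadata dict, matching _fn above.
    rewards = images.float().mean(dim=(1, 2, 3))
    return rewards, {}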

examples/scripts/dpo.py Normal file

@ -0,0 +1,198 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Note: you need to install transformers from main to run this script. See https://huggingface.co/docs/transformers/installation#install-from-source
# TODO: bump transformers version in requirements at next release.
# 0. imports
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments
from trl import DPOTrainer
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
The arguments for the DPO training script.
"""
# data parameters
beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})
# training parameters
model_name_or_path: Optional[str] = field(default="gpt2", metadata={"help": "the model name"})
learning_rate: Optional[float] = field(default=1e-3, metadata={"help": "optimizer learning rate"})
per_device_train_batch_size: Optional[int] = field(default=4, metadata={"help": "batch size per device"})
gradient_accumulation_steps: Optional[int] = field(
default=1, metadata={"help": "the number of gradient accumulation steps"}
)
max_length: Optional[int] = field(default=512, metadata={"help": "max length of each sample"})
max_prompt_length: Optional[int] = field(default=128, metadata={"help": "max length of each sample's prompt"})
max_target_length: Optional[int] = field(
default=128, metadata={"help": "Only used for encoder decoder model. Max target of each sample's prompt"}
)
label_pad_token_id: Optional[int] = field(default=-100, metadata={"help": "label for non response tokens"})
max_steps: Optional[int] = field(default=1000, metadata={"help": "max number of training steps"})
# lora parameters
use_peft: Optional[bool] = field(default=True, metadata={"help": "Whether to use PEFT to train adapters"})
peft_lora_r: Optional[int] = field(default=64, metadata={"help": "the r parameter of the LoRA adapters"})
peft_lora_alpha: Optional[int] = field(default=16, metadata={"help": "the alpha parameter of the LoRA adapters"})
# instrumentation
sanity_check: Optional[bool] = field(default=True, metadata={"help": "only train on 1000 samples"})
report_to: Optional[str] = field(
default=None,
metadata={
"help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
'`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
},
)
# debug argument for distributed training
ignore_bias_buffers: Optional[bool] = field(
default=False,
metadata={
"help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
"https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
},
)
gradient_checkpointing: Optional[bool] = field(
default=False, metadata={"help": "Whether to use gradient checkpointing or no"}
)
gradient_checkpointing_kwargs: Optional[dict] = field(
default=None,
metadata={
"help": "key word arguments to be passed along `torch.utils.checkpoint.checkpoint` method - e.g. `use_reentrant=False`"
},
)
def extract_anthropic_prompt(prompt_and_response):
"""Extract the anthropic prompt from a prompt and response pair."""
search_term = "\n\nAssistant:"
search_term_idx = prompt_and_response.rfind(search_term)
assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
return prompt_and_response[: search_term_idx + len(search_term)]
def get_hh(split: str, sanity_check: bool = False, silent: bool = False, cache_dir: str = None) -> Dataset:
"""Load the Anthropic Helpful-Harmless dataset from Hugging Face and convert it to the necessary format.
The dataset is converted to a dictionary with the following structure:
{
'prompt': List[str],
'chosen': List[str],
'rejected': List[str],
}
Prompts should be structured as follows:
\n\nHuman: <prompt>\n\nAssistant:
Multiple turns are allowed, but the prompt should always start with \n\nHuman: and end with \n\nAssistant:.
"""
dataset = load_dataset("Anthropic/hh-rlhf", split=split, cache_dir=cache_dir)
if sanity_check:
dataset = dataset.select(range(min(len(dataset), 1000)))
def split_prompt_and_responses(sample) -> Dict[str, str]:
prompt = extract_anthropic_prompt(sample["chosen"])
return {
"prompt": prompt,
"chosen": sample["chosen"][len(prompt) :],
"rejected": sample["rejected"][len(prompt) :],
}
return dataset.map(split_prompt_and_responses)
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
# 1. load a pretrained model
model = AutoModelForCausalLM.from_pretrained(script_args.model_name_or_path)
if script_args.ignore_bias_buffers:
# torch distributed hack
model._ddp_params_and_buffers_to_ignore = [
name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
]
model_ref = AutoModelForCausalLM.from_pretrained(script_args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name_or_path)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# 2. Load the Anthropic Helpful-Harmless dataset
train_dataset = get_hh("train", sanity_check=script_args.sanity_check)
# 3. Load evaluation dataset
eval_dataset = get_hh("test", sanity_check=script_args.sanity_check)
# 4. initialize training arguments:
training_args = TrainingArguments(
per_device_train_batch_size=script_args.per_device_train_batch_size,
max_steps=script_args.max_steps,
remove_unused_columns=False,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
learning_rate=script_args.learning_rate,
evaluation_strategy="steps",
logging_first_step=True,
logging_steps=10, # match results in blog post
eval_steps=500,
output_dir="./test",
optim="rmsprop",
warmup_steps=150,
report_to=script_args.report_to,
bf16=True,
gradient_checkpointing=script_args.gradient_checkpointing,
# TODO: uncomment that on the next transformers release
# gradient_checkpointing_kwargs=script_args.gradient_checkpointing_kwargs,
)
if script_args.use_peft:
peft_config = LoraConfig(
r=script_args.peft_lora_r,
lora_alpha=script_args.peft_lora_alpha,
bias="none",
task_type="CAUSAL_LM",
)
else:
peft_config = None
# 5. initialize the DPO trainer
dpo_trainer = DPOTrainer(
model,
model_ref,
args=training_args,
beta=script_args.beta,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
max_length=script_args.max_length,
max_target_length=script_args.max_target_length,
max_prompt_length=script_args.max_prompt_length,
generate_during_eval=True,
peft_config=peft_config,
)
# 6. train
dpo_trainer.train()
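To make the prompt/response split in `get_hh` concrete, here is a tiny illustration on a made-up Anthropic-style sample (the strings are invented for demonstration):

sample_chosen = "\n\nHuman: Name a colour.\n\nAssistant: Blue."
search_term = "\n\nAssistant:"
prompt = sample_chosen[: sample_chosen.rfind(search_term) + len(search_term)]
chosen_response = sample_chosen[len(prompt):]
print(repr(prompt))           # '\n\nHuman: Name a colour.\n\nAssistant:'
print(repr(chosen_response))  # ' Blue.'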

examples/scripts/ppo.py Normal file

@ -0,0 +1,212 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import torch
import tyro
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler
from trl.import_utils import is_npu_available, is_xpu_available
tqdm.pandas()
@dataclass
class ScriptArguments:
ppo_config: PPOConfig = field(
default_factory=lambda: PPOConfig(
model_name="lvwerra/gpt2-imdb",
query_dataset="imdb",
reward_model="sentiment-analysis:lvwerra/distilbert-imdb",
learning_rate=1.41e-5,
log_with=None,
mini_batch_size=128,
batch_size=128,
gradient_accumulation_steps=1,
early_stopping=False,
target_kl=6.0,
kl_penalty="kl",
seed=0,
use_score_scaling=False,
use_score_norm=False,
score_clip=None,
)
)
use_seq2seq: bool = False
"""whether to use seq2seq models"""
use_peft: bool = False
"""whether to use peft"""
peft_config: Optional[LoraConfig] = field(
default_factory=lambda: LoraConfig(
r=16,
lora_alpha=16,
bias="none",
task_type="CAUSAL_LM",
),
)
trust_remote_code: bool = field(default=False, metadata={"help": "Enable `trust_remote_code`"})
args = tyro.cli(ScriptArguments)
# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the sentiment score for each class.
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}
trl_model_class = AutoModelForCausalLMWithValueHead if not args.use_seq2seq else AutoModelForSeq2SeqLMWithValueHead
# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.
def build_dataset(config, query_dataset, input_min_text_length=2, input_max_text_length=8):
"""
Build dataset for training. This builds the dataset from `load_dataset`, one should
customize this function to train the model on its own dataset.
Args:
query_dataset (`str`):
The name of the dataset to be loaded.
Returns:
dataset (`datasets.Dataset`):
The preprocessed dataset for training.
"""
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
# load imdb with datasets
ds = load_dataset(query_dataset, split="train")
ds = ds.rename_columns({"text": "review"})
ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample
ds = ds.map(tokenize, batched=False)
ds.set_format(type="torch")
return ds
# We retrieve the dataloader by calling the `build_dataset` function.
dataset = build_dataset(args.ppo_config, args.ppo_config.query_dataset)
def collator(data):
return dict((key, [d[key] for d in data]) for key in data[0])
# set seed before initializing value head for deterministic eval
set_seed(args.ppo_config.seed)
# Now let's build the model, the reference model, and the tokenizer.
if not args.use_peft:
ref_model = trl_model_class.from_pretrained(args.ppo_config.model_name, trust_remote_code=args.trust_remote_code)
device_map = None
peft_config = None
else:
peft_config = args.peft_config
ref_model = None
# Copy the model to each device
device_map = {"": Accelerator().local_process_index}
model = trl_model_class.from_pretrained(
args.ppo_config.model_name,
trust_remote_code=args.trust_remote_code,
device_map=device_map,
peft_config=peft_config,
)
tokenizer = AutoTokenizer.from_pretrained(args.ppo_config.model_name)
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
tokenizer.pad_token_id = tokenizer.eos_token_id
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(args.ppo_config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)
# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
if is_xpu_available():
device = "xpu:0"
elif is_npu_available():
device = "npu:0"
else:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
ds_plugin = ppo_trainer.accelerator.state.deepspeed_plugin
task, model_name = args.ppo_config.reward_model.split(":")
if ds_plugin is not None and ds_plugin.is_zero3_init_enabled():
with ds_plugin.zero3_init_context_manager(enable=False):
sentiment_pipe = pipeline(task, model=model_name, device=device)
else:
sentiment_pipe = pipeline(task, model=model_name, device=device)
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
if sentiment_pipe.tokenizer.pad_token_id is None:
sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id
if sentiment_pipe.model.config.pad_token_id is None:
sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id
# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"max_new_tokens": 32,
}
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
query_tensors = batch["input_ids"]
# Get response from gpt2
response_tensors, ref_response_tensors = ppo_trainer.generate(
query_tensors, return_prompt=False, generate_ref_response=True, **generation_kwargs
)
batch["response"] = tokenizer.batch_decode(response_tensors)
batch["ref_response"] = tokenizer.batch_decode(ref_response_tensors)
# Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
ref_texts = [q + r for q, r in zip(batch["query"], batch["ref_response"])]
ref_pipe_outputs = sentiment_pipe(ref_texts, **sent_kwargs)
ref_rewards = [torch.tensor(output[1]["score"]) for output in ref_pipe_outputs]
batch["ref_rewards"] = ref_rewards
# Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards, columns_to_log=["query", "response", "ref_response", "ref_rewards"])
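The reward extraction above relies on the shape of the sentiment pipeline's output with `return_all_scores=True` and `function_to_apply="none"`: one list of label/score dicts per input text. A sketch with a hypothetical output follows; the numbers are invented, and indexing `[1]` assumes POSITIVE is the second label for `lvwerra/distilbert-imdb`, as the loop above implies.

import torch

pipe_output = [{"label": "NEGATIVE", "score": -2.3}, {"label": "POSITIVE", "score": 2.7}]  # hypothetical
reward = torch.tensor(pipe_output[1]["score"])  # keep the raw POSITIVE logit as the PPO reward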

View File

@ -0,0 +1,152 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import torch
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoTokenizer, BitsAndBytesConfig, HfArgumentParser
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import LengthSampler
from trl.import_utils import is_npu_available, is_xpu_available
input_min_text_length = 6
input_max_text_length = 12
@dataclass
class ScriptArguments:
"""
The name of the Causal LM model we wish to fine-tune with PPO
"""
model_name: Optional[str] = field(default="huggyllama/llama-7b", metadata={"help": "the model name"})
dataset_name: Optional[str] = field(default="Anthropic/hh-rlhf", metadata={"help": "the dataset name"})
rm_adapter: Optional[str] = field(
default="trl-lib/llama-7b-hh-rm-adapter", metadata={"help": "the rm adapter name"}
)
log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
use_safetensors: Optional[bool] = field(default=False, metadata={"help": "Use safetensors"})
seed: Optional[int] = field(default=0, metadata={"help": "the random seed"})
use_score_scaling: Optional[bool] = field(default=False, metadata={"help": "Use score scaling"})
use_score_norm: Optional[bool] = field(
default=False, metadata={"help": "Use score normalization. Only applicable if use_score_scaling is True"}
)
score_clip: Optional[float] = field(default=None, metadata={"help": "Score clipping"})
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
def create_and_prepare_dataset(tokenizer):
dataset = load_dataset(script_args.dataset_name, split="train[:1%]")
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(example):
text_size = input_size()
example["input_ids"] = tokenizer.encode(example["chosen"])[:text_size]
example["query"] = tokenizer.decode(example["input_ids"])
return example
dataset = dataset.map(tokenize, batched=False)
dataset.set_format("torch")
return dataset
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
nf4_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
script_args.model_name,
device_map={"": "xpu:0"} if is_xpu_available() else {"": "npu:0"} if is_npu_available else {"": 0},
peft_config=lora_config,
quantization_config=nf4_config,
reward_adapter=script_args.rm_adapter,
use_safetensors=script_args.use_safetensors,
)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name)
tokenizer.pad_token = tokenizer.eos_token
dataset = create_and_prepare_dataset(tokenizer)
def collator(data):
return dict((key, [d[key] for d in data]) for key in data[0])
config = PPOConfig(
model_name=script_args.model_name,
log_with=script_args.log_with,
learning_rate=1e-5,
batch_size=8,
mini_batch_size=2,
gradient_accumulation_steps=2,
optimize_cuda_cache=True,
seed=script_args.seed,
use_score_scaling=script_args.use_score_scaling,
use_score_norm=script_args.use_score_norm,
score_clip=script_args.score_clip,
)
ppo_trainer = PPOTrainer(
config,
model,
ref_model=None,
tokenizer=tokenizer,
dataset=dataset,
data_collator=collator,
)
generation_kwargs = {
"top_k": 0.0,
"top_p": 0.9,
"do_sample": True,
"pad_token_id": tokenizer.pad_token_id,
"max_new_tokens": 32,
}
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
question_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(
question_tensors,
return_prompt=False,
**generation_kwargs,
)
batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
# Compute reward score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(ppo_trainer.accelerator.device)
raw_rewards = ppo_trainer.accelerator.unwrap_model(ppo_trainer.model).compute_reward_score(**inputs)
rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))] # take last token
# Run PPO step
stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
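The indexing `raw_rewards[i, -1, 1]` assumes `compute_reward_score` returns a `(batch, seq_len, num_labels)` tensor and keeps the last token's score at label index 1. A shape-only sketch with random numbers (purely illustrative, not the real reward-head output):

import torch

raw_rewards = torch.randn(8, 32, 2)  # hypothetical: batch of 8, 32 tokens, 2 labels
rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))]
assert len(rewards) == 8 and rewards[0].ndim == 0  # one scalar reward per sample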

View File

@ -0,0 +1,173 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import tyro
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from trl import RewardConfig, RewardTrainer, is_xpu_available
tqdm.pandas()
@dataclass
class ScriptArguments:
model_name: str = "facebook/opt-350m"
"""the model name"""
dataset_name: str = "Anthropic/hh-rlhf"
"""the dataset name"""
dataset_text_field: str = "text"
"""the text field of the dataset"""
eval_split: str = "none"
"""the dataset split to evaluate on; default to 'none' (no evaluation)"""
load_in_8bit: bool = False
"""load the model in 8 bits precision"""
load_in_4bit: bool = False
"""load the model in 4 bits precision"""
trust_remote_code: bool = True
"""Enable `trust_remote_code`"""
reward_config: RewardConfig = field(
default_factory=lambda: RewardConfig(
output_dir="output",
per_device_train_batch_size=64,
num_train_epochs=1,
gradient_accumulation_steps=16,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
learning_rate=1.41e-5,
report_to="tensorboard",
remove_unused_columns=False,
optim="adamw_torch",
logging_steps=500,
evaluation_strategy="no",
max_length=512,
)
)
use_peft: bool = False
"""whether to use peft"""
peft_config: Optional[LoraConfig] = field(
default_factory=lambda: LoraConfig(
r=16,
lora_alpha=16,
bias="none",
task_type="SEQ_CLS",
modules_to_save=["scores"],
),
)
args = tyro.cli(ScriptArguments)
args.reward_config.evaluation_strategy = "steps" if args.eval_split != "none" else "no"
# Step 1: Load the model
if args.load_in_8bit and args.load_in_4bit:
raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif args.load_in_8bit or args.load_in_4bit:
quantization_config = BitsAndBytesConfig(load_in_8bit=args.load_in_8bit, load_in_4bit=args.load_in_4bit)
# Copy the model to each device
device_map = (
{"": f"xpu:{Accelerator().local_process_index}"}
if is_xpu_available()
else {"": Accelerator().local_process_index}
)
else:
device_map = None
quantization_config = None
model = AutoModelForSequenceClassification.from_pretrained(
args.model_name,
quantization_config=quantization_config,
device_map=device_map,
trust_remote_code=args.trust_remote_code,
num_labels=1,
)
# Step 2: Load the dataset and pre-process it
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
train_dataset = load_dataset(args.dataset_name, split="train")
# Tokenize chosen/rejected pairs of inputs
# Adapt this section to your needs for custom datasets
def preprocess_function(examples):
new_examples = {
"input_ids_chosen": [],
"attention_mask_chosen": [],
"input_ids_rejected": [],
"attention_mask_rejected": [],
}
for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
tokenized_chosen = tokenizer(chosen)
tokenized_rejected = tokenizer(rejected)
new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])
return new_examples
# Preprocess the dataset and filter out examples that are longer than args.max_length
train_dataset = train_dataset.map(
preprocess_function,
batched=True,
num_proc=4,
)
train_dataset = train_dataset.filter(
lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
and len(x["input_ids_rejected"]) <= args.reward_config.max_length
)
if args.eval_split == "none":
eval_dataset = None
else:
eval_dataset = load_dataset(args.dataset_name, split=args.eval_split)
eval_dataset = eval_dataset.map(
preprocess_function,
batched=True,
num_proc=4,
)
eval_dataset = eval_dataset.filter(
lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
and len(x["input_ids_rejected"]) <= args.reward_config.max_length
)
# Step 4: Define the LoraConfig
if args.use_peft:
peft_config = args.peft_config
else:
peft_config = None
# Step 5: Define the Trainer
trainer = RewardTrainer(
model=model,
tokenizer=tokenizer,
args=args.reward_config,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
peft_config=peft_config,
)
trainer.train()
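For a custom preference dataset, the only requirement is that `preprocess_function` produce the four columns `RewardTrainer` consumes. A minimal sketch of one row, using the script's default `facebook/opt-350m` tokenizer and made-up chosen/rejected strings:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-350m")
chosen = "Human: hi\n\nAssistant: Hello!"        # invented example pair
rejected = "Human: hi\n\nAssistant: Go away."
row = {
    "input_ids_chosen": tok(chosen)["input_ids"],
    "attention_mask_chosen": tok(chosen)["attention_mask"],
    "input_ids_rejected": tok(rejected)["input_ids"],
    "attention_mask_rejected": tok(rejected)["attention_mask"],
}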

examples/scripts/sft.py Normal file

@ -0,0 +1,161 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import List, Optional
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments
from trl import SFTTrainer, is_xpu_available
tqdm.pandas()
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
The name of the Causal LM model we wish to fine-tune with SFTTrainer
"""
model_name: Optional[str] = field(default="facebook/opt-350m", metadata={"help": "the model name"})
dataset_name: Optional[str] = field(
default="timdettmers/openassistant-guanaco", metadata={"help": "the dataset name"}
)
dataset_text_field: Optional[str] = field(default="text", metadata={"help": "the text field of the dataset"})
report_to: Optional[str] = field(default="none", metadata={"help": "use 'wandb' to log with wandb"})
learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
batch_size: Optional[int] = field(default=64, metadata={"help": "the batch size"})
seq_length: Optional[int] = field(default=512, metadata={"help": "Input sequence length"})
gradient_accumulation_steps: Optional[int] = field(
default=16, metadata={"help": "the number of gradient accumulation steps"}
)
load_in_8bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 8 bits precision"})
load_in_4bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 4 bits precision"})
use_peft: Optional[bool] = field(default=False, metadata={"help": "Whether to use PEFT to train adapters"})
trust_remote_code: Optional[bool] = field(default=False, metadata={"help": "Enable `trust_remote_code`"})
output_dir: Optional[str] = field(default="output", metadata={"help": "the output directory"})
peft_lora_r: Optional[int] = field(default=64, metadata={"help": "the r parameter of the LoRA adapters"})
peft_lora_alpha: Optional[int] = field(default=16, metadata={"help": "the alpha parameter of the LoRA adapters"})
logging_steps: Optional[int] = field(default=1, metadata={"help": "the number of logging steps"})
use_auth_token: Optional[bool] = field(default=True, metadata={"help": "Use HF auth token to access the model"})
num_train_epochs: Optional[int] = field(default=3, metadata={"help": "the number of training epochs"})
max_steps: Optional[int] = field(default=-1, metadata={"help": "the number of training steps"})
save_steps: Optional[int] = field(
default=100, metadata={"help": "Number of updates steps before two checkpoint saves"}
)
save_total_limit: Optional[int] = field(default=10, metadata={"help": "Limits total number of checkpoints."})
push_to_hub: Optional[bool] = field(default=False, metadata={"help": "Push the model to HF Hub"})
gradient_checkpointing: Optional[bool] = field(
default=False, metadata={"help": "Whether to use gradient checkpointing or no"}
)
gradient_checkpointing_kwargs: Optional[dict] = field(
default=None,
metadata={
"help": "key word arguments to be passed along `torch.utils.checkpoint.checkpoint` method - e.g. `use_reentrant=False`"
},
)
hub_model_id: Optional[str] = field(default=None, metadata={"help": "The name of the model on HF Hub"})
mixed_precision: Optional[str] = field(default="bf16", metadata={"help": "Mixed precision training"})
target_modules: Optional[List[str]] = field(default=None, metadata={"help": "Target modules for LoRA adapters"})
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
# Step 1: Load the model
if script_args.load_in_8bit and script_args.load_in_4bit:
raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif script_args.load_in_8bit or script_args.load_in_4bit:
quantization_config = BitsAndBytesConfig(
load_in_8bit=script_args.load_in_8bit, load_in_4bit=script_args.load_in_4bit
)
# Copy the model to each device
device_map = (
{"": f"xpu:{Accelerator().local_process_index}"}
if is_xpu_available()
else {"": Accelerator().local_process_index}
)
torch_dtype = torch.bfloat16
else:
device_map = None
quantization_config = None
torch_dtype = None
model = AutoModelForCausalLM.from_pretrained(
script_args.model_name,
quantization_config=quantization_config,
device_map=device_map,
trust_remote_code=script_args.trust_remote_code,
torch_dtype=torch_dtype,
use_auth_token=script_args.use_auth_token,
)
# Step 2: Load the dataset
dataset = load_dataset(script_args.dataset_name, split="train")
# Step 3: Define the training arguments
training_args = TrainingArguments(
output_dir=script_args.output_dir,
per_device_train_batch_size=script_args.batch_size,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
learning_rate=script_args.learning_rate,
logging_steps=script_args.logging_steps,
num_train_epochs=script_args.num_train_epochs,
max_steps=script_args.max_steps,
report_to=script_args.report_to,
save_steps=script_args.save_steps,
save_total_limit=script_args.save_total_limit,
push_to_hub=script_args.push_to_hub,
hub_model_id=script_args.hub_model_id,
gradient_checkpointing=script_args.gradient_checkpointing,
# TODO: uncomment that on the next release
# gradient_checkpointing_kwargs=script_args.gradient_checkpointing_kwargs,
)
# Step 4: Define the LoraConfig
if script_args.use_peft:
peft_config = LoraConfig(
r=script_args.peft_lora_r,
lora_alpha=script_args.peft_lora_alpha,
bias="none",
task_type="CAUSAL_LM",
target_modules=script_args.target_modules,
)
else:
peft_config = None
# Step 5: Define the Trainer
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, use_fast=True)
trainer = SFTTrainer(
model=model,
args=training_args,
max_seq_length=script_args.seq_length,
train_dataset=dataset,
dataset_text_field=script_args.dataset_text_field,
peft_config=peft_config,
tokenizer=tokenizer,
)
trainer.train()
# Step 6: Save the model
trainer.save_model(script_args.output_dir)
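The quantized path above pins each process's model replica to its own device through `device_map`. A small sketch of just that pattern, under the assumption that the script runs via `accelerate launch` (the 8-bit flag here is illustrative):

from accelerate import Accelerator
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)   # illustrative choice
device_map = {"": Accelerator().local_process_index}          # e.g. {"": 0} on the first process
# Both objects are then passed to AutoModelForCausalLM.from_pretrained(...), as in Step 1 above.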

View File

@ -1,156 +0,0 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from tqdm import tqdm
tqdm.pandas()
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, set_seed
from trl.core import LengthSampler
########################################################################
# This is a fully working simple example to use trl with accelerate.
#
# This example fine-tunes a GPT2 model on the IMDB dataset using PPO
# (proximal policy optimization).
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - multi GPUS (using DeepSpeed ZeRO-Offload stages 1 & 2)
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, first initialize the accelerate
# configuration with `accelerate config`
#
########################################################################
# We first define the configuration of the experiment, defining the model, the dataset,
# the training parameters, and the PPO parameters.
# Check the default arguments in the `PPOConfig` class for more details.
# If you want to log with tensorboard, add the kwarg
# `accelerator_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
config = PPOConfig(
model_name="lvwerra/gpt2-imdb",
learning_rate=1.41e-5,
)
# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the sentiment score for each token.
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}
# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.
def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
"""
Build dataset for training. This builds the dataset from `load_dataset`, one should
customize this function to train the model on its own dataset.
Args:
dataset_name (`str`):
The name of the dataset to be loaded.
Returns:
dataloader (`torch.utils.data.DataLoader`):
The dataloader for the dataset.
"""
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
# load imdb with datasets
ds = load_dataset(dataset_name, split="train")
ds = ds.rename_columns({"text": "review"})
ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample
ds = ds.map(tokenize, batched=False)
ds.set_format(type="torch")
return ds
# We retrieve the dataloader by calling the `build_dataset` function.
dataset = build_dataset(config)
def collator(data):
return dict((key, [d[key] for d in data]) for key in data[0])
# set seed before initializing value head for deterministic eval
set_seed(config.seed)
# Now let's build the model, the reference model, and the tokenizer.
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
# GPT-2 tokenizer has a pad token, but it is not eos_token by default. We need to set it to eos_token.
# only for this model.
tokenizer.pad_token = tokenizer.eos_token
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)
# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
}
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
query_tensors = batch["input_ids"]
#### Get response from gpt2
response_tensors = []
for query in query_tensors:
gen_len = output_length_sampler()
generation_kwargs["max_new_tokens"] = gen_len
response = ppo_trainer.generate(query, **generation_kwargs)
response_tensors.append(response.squeeze()[-gen_len:])
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
#### Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
#### Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)

View File

@ -1,129 +0,0 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from tqdm import tqdm
tqdm.pandas()
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead, set_seed
from trl.core import LengthSampler
########################################################################
# This is a fully working simple example to use trl with accelerate.
#
# This example fine-tunes a T5 model on the IMDB dataset using PPO
# (proximal policy optimization).
# in any of the following settings (with the same script):
# - single CPU or single GPU
# - multi GPUS (using PyTorch distributed mode)
# - multi GPUS (using DeepSpeed ZeRO-Offload stages 1 & 2)
# - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, first initialize the accelerate
# configuration with `accelerate config` then run the script with
# `accelerate launch ppo-sentiment-t5-small.py`
#
########################################################################
# We first define the configuration of the experiment, defining the model, the dataset,
# the training parameters, and the PPO parameters.
# Check the default arguments in the `PPOConfig` class for more details.
config = PPOConfig(model_name="lvwerra/t5-imdb", learning_rate=5e-5, batch_size=256)
# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the sentiment score for each token.
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}
# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.
def build_imdb_dataset(tokenizer, input_min_text_length=2, input_max_text_length=8):
# load imdb with datasets
ds = load_dataset("imdb", split="train")
ds = ds.rename_columns({"text": "review"})
ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()] + [tokenizer.eos_token_id]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample
ds = ds.map(tokenize, batched=False)
ds.set_format(type="torch")
return ds
def collater(data):
return dict((key, [d[key] for d in data]) for key in data[0])
# set seed before initializing value head for deterministic eval
set_seed(config.seed)
# Now let's build the model, the reference model, and the tokenizer.
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
# We retrieve the dataloader by calling the `build_dataset` function.
dataset = build_imdb_dataset(tokenizer)
query = tokenizer("I really liked this movie because", return_tensors="pt")["input_ids"]
generation_kwargs = {"top_k": 0.0, "top_p": 1.0, "do_sample": True, "eos_token_id": -1}
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collater)
# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", "lvwerra/distilbert-imdb", device=device)
# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
output_min_length = 16
output_max_length = 32
output_length_sampler = LengthSampler(output_min_length, output_max_length)
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
query_tensors = batch["input_ids"]
#### Get response from gpt2
response_tensors = []
for query in query_tensors:
gen_len = output_length_sampler()
generation_kwargs["max_new_tokens"] = gen_len
response = ppo_trainer.generate(query, **generation_kwargs)
response_tensors.append(response.squeeze())
batch["response"] = [tokenizer.decode(r[1:].squeeze()) for r in response_tensors]
#### Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[1]["score"]).to(device) for output in pipe_outputs]
#### Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)

View File

@ -1,57 +0,0 @@
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}

View File

@ -1,205 +0,0 @@
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer,
PreTrainedTokenizerBase,
HfArgumentParser,
)
from transformers.utils import PaddingStrategy
from typing import Optional, Union, List, Dict, Any
import evaluate
from dataclasses import dataclass, field
import torch.nn as nn
import numpy as np
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
These arguments vary depending on how many GPUs you have, what their capacity and features are, and what size model you want to train.
"""
local_rank: Optional[int] = field(default=0, metadata={"help": "Used for multi-gpu"})
resume_from_checkpoint: Optional[bool] = field(
default=False, metadata={"help": "If you want to resume training where it left off."}
)
deepspeed: Optional[str] = field(
default=None,
metadata={
"help": "Path to deepspeed config if using deepspeed. You may need this if the model that you want to train doesn't fit on a single GPU."
},
)
per_device_train_batch_size: Optional[int] = field(default=16)
per_device_eval_batch_size: Optional[int] = field(default=16)
gradient_accumulation_steps: Optional[int] = field(default=4)
learning_rate: Optional[int] = field(default=2e-5)
weight_decay: Optional[int] = field(default=0.001)
model_name: Optional[str] = field(
default="gpt2",
metadata={
"help": "The model that you want to train from the Hugging Face hub. E.g. gpt2, gpt2-xl, bert, etc."
},
)
bf16: Optional[bool] = field(
default=False,
metadata={
"help": "This essentially cuts the training time in half if you want to sacrifice a little precision and have a supported GPU."
},
)
num_train_epochs: Optional[int] = field(
default="5",
metadata={
"help": "The number of training epochs for the reward model. OpenAI used 5."
}
)
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
# Load the human comparisons dataset for tuning the reward model.
ds = load_dataset("openai/summarize_from_feedback", name="comparisons")
# Define the training args. Needs to be done before the model is loaded if you are using deepspeed.
training_args = TrainingArguments(
output_dir=f"{script_args.model_name}_summarization_reward_model",
learning_rate=script_args.learning_rate,
per_device_train_batch_size=script_args.per_device_train_batch_size,
per_device_eval_batch_size=script_args.per_device_eval_batch_size,
num_train_epochs=script_args.num_train_epochs,
weight_decay=script_args.weight_decay,
evaluation_strategy="epoch",
save_strategy="epoch",
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
deepspeed=script_args.deepspeed,
local_rank=script_args.local_rank,
remove_unused_columns=False,
label_names=[],
)
# Load the value-head model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name)
model = AutoModelForSequenceClassification.from_pretrained(script_args.model_name, num_labels=1)
# Need to do this for gpt2, because it doesn't have an official pad token.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
# Turn the dataset into pairs of post + summaries, where text_j is the preferred post + summary and text_k is the other.
def turn_into_text_classification_format(examples):
new_examples = {"text_j": [], "text_k": []}
for info, summaries, choice in zip(examples["info"], examples["summaries"], examples["choice"]):
if len(summaries) != 2 or choice not in (0, 1):
raise ValueError(f"There should be two summaries with a choice that's either 0 or 1. Received {len(summaries)} summaries and choice={choice}.")
original_text_field = "post" if info["post"] is not None else "article"
new_examples["text_j"].append(
summaries[choice]["text"] + " " + tokenizer.bos_token + " " + info[original_text_field]
)
new_examples["text_k"].append(
summaries[0 if choice == 1 else 1]["text"] + " " + tokenizer.bos_token + " " + info[original_text_field]
)
return new_examples
num_proc = (
8
) # Can adjust to be higher if you have more processors. Should work even if you don't have 8 CPUs, though.
original_columns = ds["train"].column_names
ds = ds.map(turn_into_text_classification_format, batched=True, num_proc=num_proc, remove_columns=original_columns)
# Tokenize the dataset.
def preprocess_function(examples):
tokenized_j = tokenizer(examples["text_j"], truncation=True)
tokenized_k = tokenizer(examples["text_k"], truncation=True)
return {
"input_ids_j": tokenized_j["input_ids"],
"attention_mask_j": tokenized_j["attention_mask"],
"input_ids_k": tokenized_k["input_ids"],
"attention_mask_k": tokenized_k["attention_mask"],
}
tokenized_ds = ds.map(preprocess_function, batched=True, num_proc=num_proc, remove_columns=["text_j", "text_k"])
# We need to define a special data collator that batches the data in our j vs k format.
@dataclass
class RewardDataCollatorWithPadding:
tokenizer: PreTrainedTokenizerBase
padding: Union[bool, str, PaddingStrategy] = True
max_length: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
return_tensors: str = "pt"
def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
features_j = []
features_k = []
for feature in features:
features_j.append({"input_ids": feature["input_ids_j"], "attention_mask": feature["attention_mask_j"]})
features_k.append({"input_ids": feature["input_ids_k"], "attention_mask": feature["attention_mask_k"]})
batch_j = self.tokenizer.pad(
features_j,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors=self.return_tensors,
)
batch_k = self.tokenizer.pad(
features_k,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors=self.return_tensors,
)
batch = {
"input_ids_j": batch_j["input_ids"],
"attention_mask_j": batch_j["attention_mask"],
"input_ids_k": batch_k["input_ids"],
"attention_mask_k": batch_k["attention_mask"],
"return_loss": True,
}
return batch
# Define the metric that we'll use for validation.
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
predictions, _ = eval_pred
# Here, predictions is rewards_j and rewards_k.
# We want to see how much of the time rewards_j > rewards_k.
predictions = np.argmax(predictions, axis=0)
labels = np.zeros(predictions.shape)
return accuracy.compute(predictions=predictions, references=labels)
class RewardTrainer(Trainer):
# Define how to compute the reward loss.
def compute_loss(self, model, inputs, return_outputs=False):
rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
if return_outputs:
return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
return loss
# Train the model, woohoo.
trainer = RewardTrainer(
model=model,
args=training_args,
train_dataset=tokenized_ds["train"],
eval_dataset=tokenized_ds["validation"],
compute_metrics=compute_metrics,
data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train(script_args.resume_from_checkpoint)
# Push to the hub so you can share it with people :D
model.push_to_hub(script_args.model_name)
tokenizer.push_to_hub(script_args.model_name)
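As a side note, the pairwise objective in RewardTrainer.compute_loss can be sanity-checked in isolation: each pair contributes -log sigmoid(r_j - r_k), the negative log-probability that the preferred summary is ranked above the rejected one. A minimal, self-contained sketch on made-up reward scores:
import torch
import torch.nn as nn
# Made-up reward scores for three preference pairs (j = preferred, k = rejected).
rewards_j = torch.tensor([2.0, 0.5, 1.0])
rewards_k = torch.tensor([0.0, 0.4, 3.0])
# Same pairwise loss as RewardTrainer.compute_loss above: pairs where j clearly
# beats k contribute almost nothing, pairs where k wins are penalised heavily.
per_pair = -nn.functional.logsigmoid(rewards_j - rewards_k)
loss = per_pair.mean()
print(per_pair, loss)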

pyproject.toml Normal file

@@ -0,0 +1,16 @@
[tool.black]
line-length = 119
target-version = ['py38']
[tool.ruff]
ignore = ["E501", "E741", "W605"]
select = ["E", "F", "I", "W"]
line-length = 119
# Ignore import violations in all `__init__.py` files.
[tool.ruff.per-file-ignores]
"__init__.py" = ["E402", "F401", "F403", "F811"]
[tool.ruff.isort]
lines-after-imports = 2
known-first-party = ["trl"]


@@ -3,3 +3,5 @@ torch>=1.4.0
tqdm
transformers
accelerate
peft>=0.3.0
tyro>=0.5.7

scripts/stale.py Normal file

@@ -0,0 +1,61 @@
# Copyright 2023 The HuggingFace Team, the AllenNLP library authors. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Script to close stale issues. Taken in part from the AllenNLP repository.
https://github.com/allenai/allennlp.
"""
import os
from datetime import datetime as dt
from datetime import timezone
from github import Github
LABELS_TO_EXEMPT = [
"good first issue",
"good second issue",
"feature request",
]
def main():
g = Github(os.environ["GITHUB_TOKEN"])
repo = g.get_repo("huggingface/trl")
open_issues = repo.get_issues(state="open")
for issue in open_issues:
comments = sorted([comment for comment in issue.get_comments()], key=lambda i: i.created_at, reverse=True)
last_comment = comments[0] if len(comments) > 0 else None
if (
last_comment is not None
and last_comment.user.login == "github-actions[bot]"
and (dt.now(timezone.utc) - issue.updated_at).days > 7
and (dt.now(timezone.utc) - issue.created_at).days >= 30
and not any(label.name.lower() in LABELS_TO_EXEMPT for label in issue.get_labels())
):
issue.edit(state="closed")
elif (
(dt.now(timezone.utc) - issue.updated_at).days > 23
and (dt.now(timezone.utc) - issue.created_at).days >= 30
and not any(label.name.lower() in LABELS_TO_EXEMPT for label in issue.get_labels())
):
issue.create_comment(
"This issue has been automatically marked as stale because it has not had "
"recent activity. If you think this still needs to be addressed "
"please comment on this thread.\n\n"
)
if __name__ == "__main__":
main()


@@ -9,7 +9,3 @@ line_length = 119
lines_after_imports = 2
multi_line_output = 3
use_parentheses = True
[flake8]
ignore = E203, E501, W503
max-line-length = 119


@@ -54,21 +54,30 @@ To create the package for pypi.
Then push the change with a message 'set dev version'
"""
from setuptools import setup, find_packages
from setuptools import find_packages, setup
__version__ = "0.3.0" # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
__version__ = "0.7.9" # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
REQUIRED_PKGS = [
"torch>=1.4.0",
"transformers>=4.18.0",
"transformers>=4.31.0",
"numpy>=1.18.2",
"accelerate",
"datasets",
"tyro>=0.5.11",
]
EXTRAS = {
"test": ["parameterized", "pytest", "pytest-xdist", "accelerate"],
"dev": ["parameterized", "pytest", "pytest-xdist", "black", "isort", "flake8>=3.8.3"],
"peft": ["peft>=0.4.0"],
"diffusers": ["diffusers>=0.18.0"],
"deepspeed": ["deepspeed>=0.9.5"],
"benchmark": ["wandb", "ghapi", "openrlbenchmark==0.2.1a5", "requests", "deepspeed"],
"quantization": ["bitsandbytes<=0.41.1"],
}
EXTRAS["dev"] = []
for reqs in EXTRAS.values():
EXTRAS["dev"].extend(reqs)
setup(
name="trl",
@@ -86,7 +95,7 @@ setup(
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
],
url="https://github.com/lvwerra/trl",
url="https://github.com/huggingface/trl",
packages=find_packages(),
include_package_data=True,
install_requires=REQUIRED_PKGS,
@@ -96,7 +105,7 @@ setup(
long_description_content_type="text/markdown",
zip_safe=False,
version=__version__,
description="A Pytorch implementation of Proximal Policy Optimization for transfomer language models.",
description="Train transformer language models with reinforcement learning.",
keywords="ppo, transformers, huggingface, gpt2, language modeling, rlhf",
author="Leandro von Werra",
author_email="leandro.vonwerra@gmail.com",


@@ -0,0 +1,98 @@
import unittest
import torch
from transformers import AutoTokenizer, GenerationConfig
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler
def queries_to_scores(list_of_strings):
return [torch.rand(1).item() for _ in list_of_strings]
class BestOfNSamplerTester(unittest.TestCase):
"""
Tests the BestOfNSampler class
"""
ref_model_name = "trl-internal-testing/dummy-GPT2-correct-vocab"
output_length_sampler = LengthSampler(2, 6)
model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)
tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
tokenizer.pad_token = tokenizer.eos_token
output_length_sampler = LengthSampler(2, 6)
def test_different_input_types(self):
r"""
Tests that the different supported input types are handled correctly
"""
generation_config = GenerationConfig(
min_length=-1,
top_k=0.0,
top_p=1.0,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id,
)
output_length_sampler = LengthSampler(2, 6)
best_of_n = BestOfNSampler(
self.model,
self.tokenizer,
queries_to_scores,
length_sampler=output_length_sampler,
generation_config=generation_config,
)
queries = ["hello world", "goodbye world"]
tokenized_queries = [self.tokenizer.encode(query) for query in queries]
various_queries_formats = [
(tokenized_queries[0], 1),
(tokenized_queries, 2),
(torch.tensor(tokenized_queries[1]), 1),
([torch.tensor(query) for query in tokenized_queries], 2),
]
for q, expected_length in various_queries_formats:
results = best_of_n.generate(q)
self.assertIsInstance(results, list)
assert len(results) == expected_length
def test_different_sample_sizes_and_n_candidates_values(self):
r"""
Tests different sample sizes and n_candidates values
"""
generation_config = GenerationConfig(
min_length=-1,
top_k=0.0,
top_p=1.0,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id,
)
output_length_sampler = LengthSampler(6, 10)
for sample_value, n_candidates_values, expected in [
(4, 2, 2),
(10, 3, 3),
(6, 4, 4),
]:
best_of_n = BestOfNSampler(
self.model,
self.tokenizer,
queries_to_scores,
length_sampler=output_length_sampler,
generation_config=generation_config,
sample_size=sample_value,
n_candidates=n_candidates_values,
)
queries = ["hello world", "troll the world"]
tokenized_queries = [self.tokenizer.encode(query) for query in queries]
results = best_of_n.generate(tokenized_queries)
for result in results:
assert len(result) == expected
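Outside the test harness, the same construction can be used directly: a value-head model, a tokenizer, a callable mapping generated strings to scores, a LengthSampler and a GenerationConfig, then generate() on tokenized queries. A minimal sketch reusing the dummy model from the test above; the length-based scorer is a toy stand-in for a real reward model:
from transformers import AutoTokenizer, GenerationConfig
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler
model_name = "trl-internal-testing/dummy-GPT2-correct-vocab"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
def queries_to_scores(list_of_strings):
    # Toy scorer: prefer longer completions (stand-in for a real reward model).
    return [float(len(s)) for s in list_of_strings]
best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=LengthSampler(2, 6),
    generation_config=GenerationConfig(do_sample=True, pad_token_id=tokenizer.eos_token_id),
    sample_size=4,   # candidates generated per query
    n_candidates=1,  # best candidates kept per query
)
queries = [tokenizer.encode(q) for q in ["hello world", "goodbye world"]]
print(best_of_n.generate(queries))  # one list of kept candidates per query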


@@ -0,0 +1,106 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import torch
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM
class DataCollatorForCompletionOnlyLMTester(unittest.TestCase):
def test_data_collator_finds_response_template_llama2_tokenizer(self):
# this should ideally be tested with meta-llama/Llama-2-7b-hf
self.tokenizer = AutoTokenizer.from_pretrained("trl-internal-testing/dummy-GPT2-correct-vocab")
self.instruction = """### System: You are a helpful assistant.
### User: How much is 2+2?
### Assistant: 2+2 equals 4"""
self.instruction_template = "\n### User:"
self.response_template = "\n### Assistant:"
# GPT2Tokenizer: [198, 21017, 11787, 25] -> [21017, 11787, 25]
# Llama2Tokenizer: [29871, 13, 2277, 29937, 4911, 29901] -> [2277, 29937, 4911, 29901]
# Note: If this test is ever switched to Llama2Tokenizer, this should be double checked,
# and possibly switched back to [2:] instead of [1:].
# With GPT2Tokenizer, [1:] is correct - we want the 21017 token included, which is ###.
self.tokenized_instruction_w_context = self.tokenizer.encode(
self.instruction_template, add_special_tokens=False
)[1:]
# GPT2Tokenizer: [198, 21017, 15286, 25] -> [15286, 25]
# Llama2Tokenizer: [29871, 13, 2277, 29937, 4007, 22137, 29901] -> [2277, 29937, 4007, 22137, 29901]
self.tokenized_response_w_context = self.tokenizer.encode(self.response_template, add_special_tokens=False)[2:]
# Plain check on string
self.assertIn(self.response_template, self.instruction)
self.tokenized_instruction = self.tokenizer.encode(self.instruction, add_special_tokens=False)
# Test the fix for #598
# Pass already tokenized (w context) and truncated response_template so token_ids are like in the instruction + response
self.collator = DataCollatorForCompletionOnlyLM(self.tokenized_response_w_context, tokenizer=self.tokenizer)
self.collator.torch_call([self.tokenized_instruction])
# Test for PR #749
# Pass already tokenized (w context) instruction and response both so token_ids are like in the instruction + response
self.collator = DataCollatorForCompletionOnlyLM(
self.tokenized_response_w_context, self.tokenized_instruction_w_context, tokenizer=self.tokenizer
)
self.collator.torch_call([self.tokenized_instruction])
# Test for PR #1185
# We pass in a string where the first user template is different than the rest.
# Usually this would happen due to context-sensitive tokenization, but here we
# explicitly change the template to test the fix.
self.instruction = """## User: First instruction
### Assistant: First response
### User: Second instruction
### Assistant: Second response"""
self.tokenized_instruction = self.tokenizer.encode(self.instruction, add_special_tokens=False)
self.collator = DataCollatorForCompletionOnlyLM(
self.tokenized_response_w_context, self.tokenized_instruction_w_context, tokenizer=self.tokenizer
)
collator_output = self.collator.torch_call([self.tokenized_instruction])
collator_text = self.tokenizer.decode(
collator_output["labels"][torch.where(collator_output["labels"] != -100)]
)
expected_text = " First response\n\n Second response" ""
self.assertEqual(collator_text, expected_text)
def test_data_collator_handling_of_long_sequences(self):
self.tokenizer = AutoTokenizer.from_pretrained("trl-internal-testing/dummy-GPT2-correct-vocab")
self.instruction = """### System: You are a helpful assistant.
### User: How much is 2+2? I'm asking because I'm not sure. And I'm not sure because I'm not good at math.
"""
self.response_template = "\n### Assistant:"
# check DataCollatorForCompletionOnlyLM using response template only
self.tokenized_instruction = self.tokenizer.encode(self.instruction, add_special_tokens=False)
self.collator = DataCollatorForCompletionOnlyLM(self.response_template, tokenizer=self.tokenizer)
encoded_instance = self.collator.torch_call([self.tokenized_instruction])
result = torch.all(encoded_instance["labels"] == -100)
self.assertTrue(result, "Not all values in the tensor are -100.")
# check DataCollatorForCompletionOnlyLM using response template and instruction template
self.instruction_template = "\n### User:"
self.collator = DataCollatorForCompletionOnlyLM(
self.response_template, self.instruction_template, tokenizer=self.tokenizer
)
encoded_instance = self.collator.torch_call([self.tokenized_instruction])
result = torch.all(encoded_instance["labels"] == -100)
self.assertTrue(result, "Not all values in the tensor are -100.")
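In training code this collator is typically handed to SFTTrainer so that only tokens after the response template contribute to the loss. A minimal sketch under that assumption — the dataset contents, output directory and hyper-parameters below are illustrative, not taken from this diff:
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer
model_name = "trl-internal-testing/dummy-GPT2-correct-vocab"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Illustrative single-example dataset in the same chat-like format as the tests above.
train_dataset = Dataset.from_dict(
    {"text": ["### System: You are a helpful assistant.\n### User: How much is 2+2?\n### Assistant: 2+2 equals 4"]}
)
collator = DataCollatorForCompletionOnlyLM(
    response_template="\n### Assistant:",
    instruction_template="\n### User:",
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="completion_only_sft", max_steps=1, report_to="none"),
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,  # the completion-only collator expects un-packed examples
)
trainer.train()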

tests/test_ddpo_trainer.py Normal file

@@ -0,0 +1,127 @@
# Copyright 2023 metric-space, The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import unittest
import torch
from trl import is_diffusers_available, is_peft_available
from .testing_utils import require_diffusers
if is_diffusers_available() and is_peft_available():
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline
def scorer_function(images, prompts, metadata):
return torch.randn(1) * 3.0, {}
def prompt_function():
return ("cabbages", {})
@require_diffusers
class DDPOTrainerTester(unittest.TestCase):
"""
Test the DDPOTrainer class.
"""
def setUp(self):
self.ddpo_config = DDPOConfig(
num_epochs=2,
train_gradient_accumulation_steps=1,
per_prompt_stat_tracking_buffer_size=32,
sample_num_batches_per_epoch=2,
sample_batch_size=2,
mixed_precision=None,
save_freq=1000000,
)
pretrained_model = "hf-internal-testing/tiny-stable-diffusion-torch"
pretrained_revision = "main"
pipeline = DefaultDDPOStableDiffusionPipeline(
pretrained_model, pretrained_model_revision=pretrained_revision, use_lora=False
)
self.trainer = DDPOTrainer(self.ddpo_config, scorer_function, prompt_function, pipeline)
return super().setUp()
def tearDown(self) -> None:
gc.collect()
def test_loss(self):
advantage = torch.tensor([-1.0])
clip_range = 0.0001
ratio = torch.tensor([1.0])
loss = self.trainer.loss(advantage, clip_range, ratio)
self.assertEqual(loss.item(), 1.0)
def test_generate_samples(self):
samples, output_pairs = self.trainer._generate_samples(1, 2)
self.assertEqual(len(samples), 1)
self.assertEqual(len(output_pairs), 1)
self.assertEqual(len(output_pairs[0][0]), 2)
def test_calculate_loss(self):
samples, _ = self.trainer._generate_samples(1, 2)
sample = samples[0]
latents = sample["latents"][0, 0].unsqueeze(0)
next_latents = sample["next_latents"][0, 0].unsqueeze(0)
log_probs = sample["log_probs"][0, 0].unsqueeze(0)
timesteps = sample["timesteps"][0, 0].unsqueeze(0)
prompt_embeds = sample["prompt_embeds"]
advantage = torch.tensor([1.0], device=prompt_embeds.device)
self.assertEqual(latents.shape, (1, 4, 64, 64))
self.assertEqual(next_latents.shape, (1, 4, 64, 64))
self.assertEqual(log_probs.shape, (1,))
self.assertEqual(timesteps.shape, (1,))
self.assertEqual(prompt_embeds.shape, (2, 77, 32))
loss, approx_kl, clipfrac = self.trainer.calculate_loss(
latents, timesteps, next_latents, log_probs, advantage, prompt_embeds
)
self.assertTrue(torch.isfinite(loss.cpu()))
@require_diffusers
class DDPOTrainerWithLoRATester(DDPOTrainerTester):
"""
Test the DDPOTrainer class.
"""
def setUp(self):
self.ddpo_config = DDPOConfig(
num_epochs=2,
train_gradient_accumulation_steps=1,
per_prompt_stat_tracking_buffer_size=32,
sample_num_batches_per_epoch=2,
sample_batch_size=2,
mixed_precision=None,
save_freq=1000000,
)
pretrained_model = "hf-internal-testing/tiny-stable-diffusion-torch"
pretrained_revision = "main"
pipeline = DefaultDDPOStableDiffusionPipeline(
pretrained_model, pretrained_model_revision=pretrained_revision, use_lora=True
)
self.trainer = DDPOTrainer(self.ddpo_config, scorer_function, prompt_function, pipeline)
return super().setUp()

tests/test_dpo_trainer.py Normal file

@@ -0,0 +1,438 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tempfile
import unittest
import torch
from datasets import Dataset
from parameterized import parameterized
from pytest import mark
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
from .testing_utils import require_bitsandbytes, require_no_wandb, require_peft
class DPOTrainerTester(unittest.TestCase):
@classmethod
def setUpClass(cls):
cls.model_id = "trl-internal-testing/dummy-GPT2-correct-vocab"
cls.model = AutoModelForCausalLM.from_pretrained(cls.model_id)
cls.ref_model = AutoModelForCausalLM.from_pretrained(cls.model_id)
cls.tokenizer = AutoTokenizer.from_pretrained(cls.model_id)
cls.tokenizer.pad_token = cls.tokenizer.eos_token
# get t5 as seq2seq example:
model_id = "trl-internal-testing/tiny-T5ForConditionalGeneration-correct-vocab"
cls.t5_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
cls.t5_ref_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
cls.t5_tokenizer = AutoTokenizer.from_pretrained(model_id)
def _init_dummy_dataset(self):
# fmt: off
dummy_dataset_dict = {
"prompt": [
"hello",
"how are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
"[INST] How is the stock price? [/INST]",
"[INST] How is the stock price? [/INST] ",
],
"chosen": [
"hi nice to meet you",
"I am fine",
"My name is Mary",
"My name is Mary",
"Python",
"Python",
"Python",
"$46 as of 10am EST",
"46 as of 10am EST",
],
"rejected": [
"leave me alone",
"I am not fine",
"Whats it to you?",
"I dont have a name",
"Javascript",
"C++",
"Java",
" $46 as of 10am EST",
" 46 as of 10am EST",
],
}
# fmt: on
return Dataset.from_dict(dummy_dataset_dict)
@parameterized.expand(
[
["gpt2", "sigmoid", True],
["t5", "hinge", False],
["gpt2", "ipo", False],
["t5", "ipo", True],
["gpt2", "kto_pair", True],
["t5", "kto_pair", False],
]
)
def test_dpo_trainer(self, name, loss_type, pre_compute):
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=1,
learning_rate=9e-1,
evaluation_strategy="steps",
)
dummy_dataset = self._init_dummy_dataset()
if name == "gpt2":
model = self.model
ref_model = self.ref_model
tokenizer = self.tokenizer
elif name == "t5":
model = self.t5_model
ref_model = self.t5_ref_model
tokenizer = self.t5_tokenizer
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
beta=0.1,
loss_type=loss_type,
args=training_args,
tokenizer=tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
precompute_ref_log_probs=pre_compute,
)
previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}
trainer.train()
self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])
# check the params have changed
for n, param in previous_trainable_params.items():
new_param = trainer.model.get_parameter(n)
# check the params have changed - ignore 0 biases
if param.sum() != 0:
self.assertFalse(torch.equal(param, new_param))
def test_dpo_trainer_without_providing_ref_model(self):
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=4,
learning_rate=9e-1,
evaluation_strategy="steps",
)
dummy_dataset = self._init_dummy_dataset()
trainer = DPOTrainer(
model=self.model,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=self.tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
precompute_ref_log_probs=True,
)
previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}
trainer.train()
self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])
# check the params have changed
for n, param in previous_trainable_params.items():
new_param = trainer.model.get_parameter(n)
# check the params have changed - ignore 0 biases
if param.sum() != 0:
self.assertFalse(torch.equal(param, new_param))
@require_peft
@mark.peft_test
def test_dpo_trainer_without_providing_ref_model_with_lora(self):
from peft import LoraConfig
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=4,
learning_rate=9e-1,
evaluation_strategy="steps",
)
dummy_dataset = self._init_dummy_dataset()
trainer = DPOTrainer(
model=self.model,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=self.tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
peft_config=lora_config,
precompute_ref_log_probs=True,
)
previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}
trainer.train()
self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])
# check the params have changed
for n, param in previous_trainable_params.items():
if "lora" in n:
new_param = trainer.model.get_parameter(n)
# check the params have changed - ignore 0 biases
if param.sum() != 0:
self.assertFalse(torch.equal(param, new_param))
@require_no_wandb
def test_dpo_trainer_generate_during_eval_no_wandb(self):
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=1,
learning_rate=9e-1,
evaluation_strategy="steps",
)
dummy_dataset = self._init_dummy_dataset()
with self.assertRaisesRegex(
ValueError,
expected_regex="`generate_during_eval=True` requires Weights and Biases to be installed."
" Please install `wandb` to resolve.",
):
DPOTrainer(
model=self.model,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=self.tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
generate_during_eval=True,
)
@require_peft
@mark.peft_test
def test_dpo_lora_save(self):
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# lora model
model = AutoModelForCausalLM.from_pretrained(self.model_id)
model_peft = get_peft_model(model, lora_config)
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=4,
learning_rate=9e-1,
evaluation_strategy="steps",
)
dummy_dataset = self._init_dummy_dataset()
# dpo train lora model with a lora config
trainer = DPOTrainer(
model=model_peft,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=self.tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
peft_config=lora_config,
precompute_ref_log_probs=True,
)
# train the model
trainer.train()
# save peft adapter
trainer.save_model()
# assert that the model is loaded without giving OSError
try:
AutoModelForCausalLM.from_pretrained(tmp_dir)
except OSError:
self.fail("Loading the saved peft adapter failed")
@require_peft
@require_bitsandbytes
@mark.peft_test
def test_dpo_lora_bf16_autocast_llama(self):
# Note this test only works on compute capability > 7 GPU devices
from peft import LoraConfig
model_id = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# lora model
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=4,
learning_rate=9e-1,
evaluation_strategy="steps",
bf16=True,
)
dummy_dataset = self._init_dummy_dataset()
# dpo train lora model with a lora config
trainer = DPOTrainer(
model=model,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
peft_config=lora_config,
generate_during_eval=True,
)
# train the model
trainer.train()
# save peft adapter
trainer.save_model()
@parameterized.expand(
[
["gpt2", "sigmoid", False, False],
["gpt2", "sigmoid", False, True],
["gpt2", "sigmoid", True, False],
["gpt2", "sigmoid", True, True],
["gpt2", "ipo", False, False],
["gpt2", "ipo", False, True],
["gpt2", "ipo", True, False],
["gpt2", "ipo", True, True],
["gpt2", "kto_pair", False, False],
["gpt2", "kto_pair", False, True],
["gpt2", "kto_pair", True, False],
["gpt2", "kto_pair", True, True],
]
)
@require_bitsandbytes
@require_peft
@mark.peft_test
def test_dpo_lora_bf16_autocast(self, name, loss_type, pre_compute, gen_during_eval):
# Note this test only works on compute capability > 7 GPU devices
from peft import LoraConfig
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# lora model
model = AutoModelForCausalLM.from_pretrained(self.model_id, load_in_4bit=True)
with tempfile.TemporaryDirectory() as tmp_dir:
training_args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=3,
remove_unused_columns=False,
gradient_accumulation_steps=4,
learning_rate=9e-1,
evaluation_strategy="steps",
bf16=True,
)
dummy_dataset = self._init_dummy_dataset()
# dpo train lora model with a lora config
trainer = DPOTrainer(
model=model,
ref_model=None,
beta=0.1,
args=training_args,
tokenizer=self.tokenizer,
train_dataset=dummy_dataset,
eval_dataset=dummy_dataset,
peft_config=lora_config,
generate_during_eval=gen_during_eval,
loss_type=loss_type,
precompute_ref_log_probs=pre_compute,
)
# train the model
trainer.train()
# save peft adapter
trainer.save_model()
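The loss_type values parameterised in these tests ("sigmoid", "hinge", "ipo", "kto_pair") differ only in how the margin between the implicit chosen and rejected rewards is penalised. A rough, self-contained sketch of the default sigmoid variant on made-up log-probabilities — a simplification of the formulation DPOTrainer implements, not its actual code path:
import torch
import torch.nn.functional as F
beta = 0.1  # same beta as in the tests above
# Made-up summed log-probabilities of the chosen/rejected completions
# under the policy being trained and under the frozen reference model.
policy_chosen_logps = torch.tensor([-12.0, -15.0])
policy_rejected_logps = torch.tensor([-14.0, -13.5])
ref_chosen_logps = torch.tensor([-13.0, -15.5])
ref_rejected_logps = torch.tensor([-13.5, -13.0])
# Implicit rewards are beta-scaled log-ratios against the reference model.
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
# Sigmoid DPO loss: -log sigmoid(reward margin); it shrinks as the policy
# prefers the chosen completion more strongly than the reference model does.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print(loss)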

tests/test_e2e.py Normal file

@@ -0,0 +1,9 @@
import subprocess
def test_hello_world():
subprocess.run(
"python examples/hello_world.py",
shell=True,
check=True,
)

tests/test_environments.py Normal file

@@ -0,0 +1,273 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from unittest.mock import patch
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, TextEnvironment, TextHistory
class DummyTool:
def __call__(self, text):
return text
def dummy_generate(histories):
for i in range(len(histories)):
histories[i].append_segment("<request><DummyTool>test<call>", torch.tensor([1, 2, 3]), system=False)
return histories
class TextHistoryTest(unittest.TestCase):
def test_text_history_init(self):
text = "Hello there!"
tokens = torch.tensor([1, 2, 3])
history = TextHistory(text, tokens)
self.assertEqual(history.text, text)
self.assertTrue(torch.equal(history.tokens, tokens))
self.assertTrue(torch.equal(history.token_masks, torch.zeros_like(tokens)))
history = TextHistory(text, tokens, system=False)
self.assertTrue(torch.equal(history.token_masks, torch.ones_like(tokens)))
def test_text_history_append_segment(self):
text = "Hello there!"
tokens = torch.tensor([1, 2, 3])
history = TextHistory(text, tokens)
history.append_segment("General Kenobi!", torch.tensor([4, 5, 6]), system=False)
self.assertEqual(history.text, text + "General Kenobi!")
self.assertTrue(torch.equal(history.tokens, torch.tensor([1, 2, 3, 4, 5, 6])))
self.assertTrue(torch.equal(history.token_masks, torch.tensor([0, 0, 0, 1, 1, 1])))
history.append_segment("You are a bold one!", torch.tensor([7, 8, 9]))
self.assertEqual(history.text, text + "General Kenobi!" + "You are a bold one!")
self.assertTrue(torch.equal(history.tokens, torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])))
self.assertTrue(torch.equal(history.token_masks, torch.tensor([0, 0, 0, 1, 1, 1, 0, 0, 0])))
def test_text_history_complete(self):
text = "Hello there!"
tokens = torch.tensor([1, 2, 3])
history = TextHistory(text, tokens)
history.complete()
self.assertTrue(history.completed)
self.assertFalse(history.truncated)
history.complete(truncated=True)
self.assertTrue(history.completed)
self.assertTrue(history.truncated)
def test_text_history_last_segment(self):
text = "Hello there!"
tokens = torch.tensor([1, 2, 3])
history = TextHistory(text, tokens)
history.append_segment("General Kenobi!", torch.tensor([4, 5, 6]))
history.append_segment("You are a bold one!", torch.tensor([7, 8, 9]))
self.assertEqual(history.last_text_segment, "You are a bold one!")
def test_text_history_split_query_response(self):
text = "Hello there!"
tokens = torch.tensor([1, 2, 3])
history = TextHistory(text, tokens)
history.append_segment("General Kenobi!", torch.tensor([4, 5, 6]), system=False)
history.append_segment("You are a bold one!", torch.tensor([7, 8, 9]), system=True)
query, response, mask = history.split_query_response_tokens()
self.assertTrue(torch.equal(query, torch.tensor([1, 2, 3])))
self.assertTrue(torch.equal(response, torch.tensor([4, 5, 6, 7, 8, 9])))
self.assertTrue(torch.equal(mask, torch.tensor([1, 1, 1, 0, 0, 0])))
class TextEnvironmentTester(unittest.TestCase):
@classmethod
def setUpClass(cls):
# model_id
cls.model_id = "trl-internal-testing/dummy-GPT2-correct-vocab"
# get models and tokenizer
cls.gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(cls.model_id)
cls.gpt2_tokenizer = AutoTokenizer.from_pretrained(cls.model_id)
cls.gpt2_tokenizer.pad_token = cls.gpt2_tokenizer.eos_token
def test_text_environment_setup(self):
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools=[DummyTool()],
reward_fn=lambda x: torch.tensor(1),
prompt="I am a prompt!\n",
)
self.assertEqual(env.prompt, "I am a prompt!\n")
self.assertEqual(list(env.tools.keys()), ["DummyTool"])
self.assertTrue(isinstance(env.tools["DummyTool"], DummyTool))
self.assertEqual(env.reward_fn("Hello there!"), 1)
def test_text_environment_generate(self):
generation_kwargs = {"do_sample": False, "max_new_tokens": 4, "pad_token_id": self.gpt2_tokenizer.eos_token_id}
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools=[DummyTool()],
reward_fn=lambda x: torch.tensor(1),
prompt="I am a prompt!\n",
generation_kwargs=generation_kwargs,
)
input_texts = ["this is a test", "this is another, longer test"]
model_inputs = [self.gpt2_tokenizer(txt, return_tensors="pt").input_ids.squeeze() for txt in input_texts]
generations_batched = env._generate_batched(model_inputs, batch_size=2)
generations_batched = self.gpt2_tokenizer.batch_decode(generations_batched)
generations_single = [env._generate_batched([inputs], batch_size=1)[0] for inputs in model_inputs]
generations_single = self.gpt2_tokenizer.batch_decode(generations_single)
self.assertEqual(generations_single, generations_batched)
def test_text_environment_tool_call_parsing(self):
string_valid = "Something something <request><Tool1>Hello there!<call>"
string_invalid_request = "Something something <Tool1>Hello there!<call>"
string_invalid_call = "Something something <request><Tool1>Hello there!"
string_invalid_tool = "Something something <request>|Tool2|Hello there!<call>"
string_invalid_random = "<>abcdefghijklm<>nopqrstuvwxyz<>"
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools=[DummyTool()],
reward_fn=lambda x: torch.tensor(1),
prompt="I am a prompt!\n",
)
tool, response = env.parse_tool_call(string_valid)
self.assertEqual(tool, "Tool1")
self.assertEqual(response, "Hello there!")
tool, response = env.parse_tool_call(string_invalid_request)
self.assertEqual(tool, None)
self.assertEqual(response, None)
tool, response = env.parse_tool_call(string_invalid_call)
self.assertEqual(tool, None)
self.assertEqual(response, None)
tool, response = env.parse_tool_call(string_invalid_tool)
self.assertEqual(tool, None)
self.assertEqual(response, None)
tool, response = env.parse_tool_call(string_invalid_random)
self.assertEqual(tool, None)
self.assertEqual(response, None)
def test_text_environment_tool_truncation(self):
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools={"dummy": lambda x: "a" * 1000},
reward_fn=lambda x: torch.tensor(1),
prompt="I am a prompt!\n",
)
env.max_tool_response = 100
history = env.step(TextHistory("<request><dummy>Hello there!<call>", torch.tensor([1, 2, 3])))
self.assertEqual(len(history.last_text_segment) - len(env.response_token), 100)
env.max_tool_response = 500
history = env.step(TextHistory("<request><dummy>Hello there!<call>", torch.tensor([1, 2, 3])))
self.assertEqual(len(history.last_text_segment) - len(env.response_token), 500)
env.max_tool_response = 1001
history = env.step(TextHistory("<request><dummy>Hello there!<call>", torch.tensor([1, 2, 3])))
self.assertEqual(len(history.last_text_segment) - len(env.response_token), 1000)
env.max_tool_response = 2000
history = env.step(TextHistory("<request><dummy>Hello there!<call>", torch.tensor([1, 2, 3])))
self.assertEqual(len(history.last_text_segment) - len(env.response_token), 1000)
@patch.object(TextEnvironment, "generate", side_effect=dummy_generate)
def test_text_environment_max_calls(self, mock_generate):
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools={"DummyTool": DummyTool()},
reward_fn=lambda x: [torch.tensor(1) for _ in x],
prompt="I am a prompt!\n",
)
env.max_turns = 1
_, _, _, _, histories = env.run(["test"])
self.assertEqual(
histories[0].text, "I am a prompt!\n" + "test" + 1 * "<request><DummyTool>test<call>test<response>"
)
env.max_turns = 2
_, _, _, _, histories = env.run(["test"])
self.assertEqual(
histories[0].text, "I am a prompt!\n" + "test" + 2 * "<request><DummyTool>test<call>test<response>"
)
env.max_turns = 4
_, _, _, _, histories = env.run(["test"])
self.assertEqual(
histories[0].text, "I am a prompt!\n" + "test" + 4 * "<request><DummyTool>test<call>test<response>"
)
def test_text_environment_compute_rewards(self):
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools={"DummyTool": DummyTool()},
reward_fn=lambda x: [torch.tensor(i) for i, _ in enumerate(x)],
prompt="I am a prompt!\n",
)
histories = [TextHistory("<request><DummyTool>test<call>", torch.tensor([1, 2, 3])) for _ in range(8)]
histories = env.compute_reward(histories)
for i in range(8):
self.assertEqual(histories[i].reward, i)
@patch.object(TextEnvironment, "generate", side_effect=dummy_generate)
def test_text_environment_run(self, mock_generate):
env = TextEnvironment(
self.gpt2_model,
self.gpt2_tokenizer,
tools={"DummyTool": DummyTool()},
reward_fn=lambda x: [torch.tensor(i) for i, _ in enumerate(x)],
prompt="I am a prompt!\n",
max_turns=2,
)
task_1 = "Hello there!"
task_2 = "Hello there! General Kenobi!"
query, response, response_mask, reward, histories = env.run([task_1, task_2])
self.assertEqual(len(query[0]), 9)
self.assertEqual(len(query[1]), 12)
self.assertEqual(len(response[0]), 14)
self.assertEqual(len(response[1]), 14)
self.assertEqual(response_mask[0].sum(), 2 * 3) # mocked generate always adds 3 tokens
self.assertEqual(response_mask[1].sum(), 2 * 3) # mocked generate always adds 3 tokens
self.assertEqual(reward[0], 0)
self.assertEqual(reward[1], 1)
self.assertEqual(
histories[0].text, "I am a prompt!\n" + "Hello there!" + 2 * "<request><DummyTool>test<call>test<response>"
)
self.assertEqual(
histories[1].text,
"I am a prompt!\n" + "Hello there! General Kenobi!" + 2 * "<request><DummyTool>test<call>test<response>",
)


@@ -0,0 +1,106 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tempfile
import unittest
import torch
from datasets import Dataset
from parameterized import parameterized
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments
from trl import IterativeSFTTrainer
class IterativeTrainerTester(unittest.TestCase):
@classmethod
def setUpClass(cls):
cls.model_id = "trl-internal-testing/dummy-GPT2-correct-vocab"
cls.model = AutoModelForCausalLM.from_pretrained(cls.model_id)
cls.tokenizer = AutoTokenizer.from_pretrained(cls.model_id)
cls.tokenizer.pad_token = cls.tokenizer.eos_token
# get t5 as seq2seq example:
model_id = "trl-internal-testing/tiny-T5ForConditionalGeneration-correct-vocab"
cls.t5_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
cls.t5_tokenizer = AutoTokenizer.from_pretrained(model_id)
def _init_tensor_dummy_dataset(self):
dummy_dataset_dict = {
"input_ids": [torch.tensor([5303, 3621]), torch.tensor([3666, 1438, 318]), torch.tensor([5303, 3621])],
"attention_mask": [torch.tensor([1, 1]), torch.tensor([1, 1, 1]), torch.tensor([1, 1])],
"labels": [torch.tensor([5303, 3621]), torch.tensor([3666, 1438, 318]), torch.tensor([5303, 3621])],
}
dummy_dataset = Dataset.from_dict(dummy_dataset_dict)
dummy_dataset.set_format("torch")
return dummy_dataset
def _init_textual_dummy_dataset(self):
dummy_dataset_dict = {
"texts": ["Testing the IterativeSFTTrainer.", "This is a test of the IterativeSFTTrainer"],
"texts_labels": ["Testing the IterativeSFTTrainer.", "This is a test of the IterativeSFTTrainer"],
}
dummy_dataset = Dataset.from_dict(dummy_dataset_dict)
dummy_dataset.set_format("torch")
return dummy_dataset
def setUp(self):
# initialize trainer
self.model.train()
return super().setUp()
@parameterized.expand(
[
["gpt2", "tensor"],
["gpt2", "text"],
["t5", "tensor"],
["t5", "text"],
]
)
def test_iterative_step_from_tensor(self, model_name, input_name):
with tempfile.TemporaryDirectory() as tmp_dir:
# initialize dataset
if input_name == "tensor":
dummy_dataset = self._init_tensor_dummy_dataset()
inputs = {
"input_ids": dummy_dataset["input_ids"],
"attention_mask": dummy_dataset["attention_mask"],
"labels": dummy_dataset["labels"],
}
else:
dummy_dataset = self._init_textual_dummy_dataset()
inputs = {
"texts": dummy_dataset["texts"],
"texts_labels": dummy_dataset["texts_labels"],
}
if model_name == "gpt2":
model = self.model
tokenizer = self.tokenizer
else:
model = self.t5_model
tokenizer = self.t5_tokenizer
args = TrainingArguments(
output_dir=tmp_dir,
per_device_train_batch_size=2,
max_steps=2,
)
iterative_trainer = IterativeSFTTrainer(model=model, args=args, tokenizer=tokenizer)
iterative_trainer.step(**inputs)
for param in iterative_trainer.model.parameters():
assert param.grad is not None


@@ -30,6 +30,9 @@ ALL_CAUSAL_LM_MODELS = [
"trl-internal-testing/tiny-random-BloomForCausalLM",
"trl-internal-testing/tiny-random-GPT2LMHeadModel",
"trl-internal-testing/tiny-random-CodeGenForCausalLM-sharded",
"trl-internal-testing/tiny-random-GPTNeoXForCausalLM-safetensors-sharded",
"trl-internal-testing/tiny-random-GPTNeoXForCausalLM-safetensors"
# "trl-internal-testing/tiny-random-LlamaForCausalLM", uncomment on the next transformers release
]
ALL_SEQ2SEQ_MODELS = [

Some files were not shown because too many files have changed in this diff.