Compare commits

...

106 Commits

Author SHA1 Message Date
4c71daf461 Pin transformers version <4.47 (#2447)
* pin transformers to <4.47

* update version
2024-12-06 13:51:05 +01:00
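For reference, a pin like the one described above typically looks as follows in a `setup.py` dependency list; the neighbouring package name is an illustrative assumption, not TRL's actual `setup.py`.

```python
# Illustrative sketch only: how a "<4.47" pin is commonly expressed in setup.py.
REQUIRED_PKGS = [
    "torch>=1.13.0",               # placeholder neighbouring pin
    "transformers>=4.46.0,<4.47",  # upper bound added, as in the commit above
]
```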
c1e9ea8ecf Version 0.12.0 -> 0.12.1 2024-11-14 01:31:36 +00:00
f66282424f 👈 Add tokenizer arg back and add deprecation guidelines (#2348)
* Add deprecation and backward compatibility guidelines

* Update tokenizer argument in trainer classes

* Add warning message for TRL Judges API
2024-11-11 23:09:17 +00:00
14ef1aba15 Release v0.12.0 2024-11-01 10:24:21 +00:00
6138439df4 🧓 Specify and test min versions (#2303)
* Add conditional check for LLMBlender availability in test_judges.py

* Fix import issues and update test requirements

* Remove unused imports

* Add require_peft decorator to test cases

* Fix import_utils module to use correct package name for llm_blender

* Found min version and test

* Update Slack notification titles

* Update dependencies versions

* Update GitHub Actions workflow to include setup.py and reorder file paths

* Revert "Update Slack notification titles"

This reverts commit be02a7f2de87905e86a847540770968d0416934a.

* Update Slack notification titles

* Remove pull_request branch restriction in tests.yml

* add check code quality back

* Fix PairRMJudge model loading issue
2024-11-01 00:26:53 +01:00
d57a181163 🧩 Add optimizer_cls_and_kwargs attribute to PPOTrainer and RLOOTrainer (#2302) 2024-10-31 23:10:11 +01:00
73c3970c1f 🙅 Ensure dependency optionality (#2301)
* Add conditional check for LLMBlender availability in test_judges.py

* Fix import issues and update test requirements

* Remove unused imports

* Add require_peft decorator to test cases

* Fix import_utils module to use correct package name for llm_blender
2024-10-31 22:37:49 +01:00
013a32b396 Remove stale bot (#2300) 2024-10-31 21:16:30 +01:00
24fb32733f 🔧 Use standard unittest assertion methods (#2283)
* WIP: Partial unit test update

* Update unittest format

* Update tests/slow/test_sft_slow.py comment

* Refactor unit tests: replace pytest.raises with self.assertRaises

* Fix: Restore accidentally deleted 'ref_model' parameter in DPOTrainer

* Re-run pre-commit

* fix: Incorrectly replacing non-TestCase assert

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-31 15:10:43 +01:00
bb56c6e6af 💾 Fix _save_checkpoint for online methods (#2288)
* Update trainer_utils import and save strategy in online_dpo_trainer.py

* fix back-compat for online-dpo

* better comment

* Update transformers dependency to commit f33904
2024-10-31 12:35:25 +01:00
06be6f409a 🖇️ Better dependency and partitioning of CI tests (#2298)
* clean deps

* new tests

* tests

* Add tests without optional dependencies workflow

* Update dependencies in tests.yml

* cpu version of torch

* Update dependencies and installation commands

* Disable fail-fast in test workflow

* Update test matrix in workflows file

* try fix windows

* Remove "rich" from required packages in setup.py

* Update dependency installation in tests.yml

* Add torch and deepspeed installation for windows-latest

* Fix conditional statement in workflow file

* Add torch and deepspeed installation for Windows

* Fix if statement

* Update torch and deepspeed dependencies

* Update liger package requirement for non-Windows platforms

* remove scipy dep

* Add torch GPU requirement for testing_utils

* Update trl/trainer/judges.py
2024-10-31 11:08:51 +01:00
b2696578ce 🍬 Use any reward model for online methods (#2276)
* Refactor reward processing in OnlineDPOTrainer

* Refactor completion decoding and reward processing

* remove strip

* remove warning

* Add reward_tokenizer to training script

* Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer test

* propagate to xpo and nash

* style

* reduce memory requirement with inference_mode

* fix tests

* pairrm judge llmblender

* setUpClass(cls)

* Add setUpClass method to TestJudges class

* truncation left for reward tokenizer

* don't logcompletion without eval dataset

* only eval when possible
2024-10-28 16:21:40 +01:00
0ce3b65928 🔌 Fix type hint in LogCompletionsCallback (#2285)
* Update callbacks.py for fix small python type error

* Update callbacks.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-28 11:49:35 +01:00
e155cb8a66 ⛓️💥 Don't use eval_dataset in scripts when no eval strategy (#2270) 2024-10-28 11:40:51 +01:00
ea7a1be92c 🧮 Fix the computation of KL divergence loss (#2277) 2024-10-25 18:16:02 +02:00
110d0884c7 🏁 Add bos_token_id only if it exists (#2279)
Co-authored-by: sean.jung <sean.jung@sean-ai.local>
2024-10-25 18:15:08 +02:00
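As a hedged illustration of the guard named in this commit title (not TRL's exact code), the BOS token is only prepended when the tokenizer actually defines one:

```python
# Minimal sketch, assuming a Hugging Face tokenizer; not the exact TRL implementation.
def prepend_bos_if_exists(tokenizer, input_ids: list[int]) -> list[int]:
    # Some tokenizers define no BOS token, in which case bos_token_id is None
    # and nothing should be prepended.
    if tokenizer.bos_token_id is not None and (not input_ids or input_ids[0] != tokenizer.bos_token_id):
        return [tokenizer.bos_token_id] + input_ids
    return input_ids
```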
57ba9b93aa 🧘 Replace F.log(F.sigmoid(log_odds) with F.logsigmoid(log_odds) (#2274)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-24 20:51:55 +02:00
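This replacement is a standard numerical-stability rewrite: `log(sigmoid(x))` underflows for very negative `x`, while `logsigmoid` computes the same quantity stably. A small self-contained check:

```python
import torch
import torch.nn.functional as F

log_odds = torch.tensor([-200.0, -5.0, 0.0, 5.0])

naive = torch.log(torch.sigmoid(log_odds))  # sigmoid(-200) underflows to 0, so the first entry is -inf
stable = F.logsigmoid(log_odds)             # same quantity computed in log-space; first entry is -200.0

print(naive, stable)
```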
0de75b26f2 🧼 Refactor log_reports.py for Improved Logging, File Processing, and Slack Payload Handling (#2249)
* Update log_reports.py

* comments text update

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* emoji added

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update scripts/log_reports.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update scripts/log_reports.py

* style

* Update scripts/log_reports.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-24 20:48:12 +02:00
e615974a03 ♾️ Fix test generation max_new_tokens (#2272)
* `eval_strategy="steps" if eval_dataset else "no"`

* tmp skip test

* drop `eval_strategy` in `test_sft_trainer_uncorrect_data`

* remove eval strategy

* Add parameterized test for generate method

* Revert "`eval_strategy="steps" if eval_dataset else "no"`"

This reverts commit 1e8b331fa2c222a699cb3563f44f5702a7d6f50b.

* Revert "tmp skip test"

This reverts commit 44558f84cc43e20254b567d608b44d059a14913b.

* Revert "drop `eval_strategy` in `test_sft_trainer_uncorrect_data`"

This reverts commit a1ef7016286649fce10b3665159abcbfac2219e3.

* Revert "remove eval strategy"

This reverts commit cb7fafa874b108ba91b29f15944b7c4a41705d6d.

* style

* Refactor test_generate method in test_modeling_value_head.py

* `max_new_tokens=9`
2024-10-24 20:20:01 +02:00
c2bb1eed14 Add torch_dtype to model kwargs in reward modeling example (#2266)
Update model_kwargs to include torch_dtype.

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-24 20:12:26 +02:00
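A hedged sketch of what passing `torch_dtype` through `model_kwargs` looks like in a reward-modeling script; the model id and dtype below are placeholders rather than the values used in the example:

```python
import torch
from transformers import AutoModelForSequenceClassification

model_kwargs = dict(
    num_labels=1,                # scalar reward head
    torch_dtype=torch.bfloat16,  # dtype is now forwarded instead of silently defaulting to float32
)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    **model_kwargs,
)
```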
9c376c571f [Judges] use the pair-judges in online-preference trainers (#2243)
* use the pair-judges

* add test

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* decode and skip special characters

* initial nash

* return tensors

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* add back the logging

* use batch_decode

* add judges api to XPO trainer

* Update tests/test_online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* judge in examples

* judge in config

* add back logs when using reward model

* typo

* add back model_scores logging when using reward model

* log scores for reward model only

* better cond on what to log

* same for rlhf reward

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* use decode_and_strip_padding

* error if both reward and judge or none are set

* remove unused check

* Uniform way to pass conversation into judge

* heading -> leading

* LogCompletionsCallback compat with online method

* Update Online DPO doc

* check if data is conversational for judges

* update example

* remove comment

* use zip

* fix stats xpo

* Replace judge with PairRMJudge and import AutoModelForSequenceClassification

* update xpo documentation

* Remove doc duplication

* update nash doc

* XPO trl chat

* nash md doc

* HfPairwiseJudge

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-24 16:47:10 +02:00
16994738d0 Conversational dataset support for KTOTrainer (#2248)
* `get_batch_sample` -> `generate_from_model[_and_ref]`

* add `num_items_in_batch=None`

* `num_items_in_batch` in `training_step`

* Fix return type hint

* desc for unpair dataset util

* update example

* process in KTO

* Update doc

* KTO  doc rewrite

* fix orpo doc

* add other dataset config names in test

* update doc image

* fix links in doc

* Update reward and log probability metrics in KTOTrainer doc

* skip enc-dec test

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-24 14:01:41 +02:00
99225bb6d6 Bump the minimum transformers version to v4.46 (#2245)
* Bump the minimum transformers version

* Bump version in `requirements.txt`

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-24 10:42:30 +02:00
88be2c07e5 🚩 setup_chat_format: throw error if there is already a template in base model (#2252)
* setup_chat_format: throw error if there was already a template

* fix lint

* clarify in docs

* fix test?

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-22 13:29:32 +02:00
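The behaviour described in the title can be sketched as a simple guard; this is an illustration of the check, not the actual `setup_chat_format` source:

```python
# Hedged sketch of the guard described above.
def ensure_no_existing_chat_template(tokenizer):
    if tokenizer.chat_template is not None:
        raise ValueError(
            "The tokenizer already has a chat template. Clear it first "
            "(tokenizer.chat_template = None) or keep using the existing template."
        )
```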
f2349d2af0 Adjust padding in batch generation (#2251)
* pad batch generation

* Use pad utility

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/utils.py

* reshaping

* fix test_utils.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-22 09:36:43 +02:00
d843b3dadd Use processing_class instead of tokenizer in LogCompletionsCallback (#2261) 2024-10-22 09:35:04 +02:00
84dab850f6 🧽 Fix typo in dataset format doc (#2259)
doc update
2024-10-21 17:06:19 +02:00
92f6d246d3 🏗️ Refactor DPO data processing (#2209)
* in progress

* refactor concatenated_inputs and concatenated_forward

* progress

* further modif

* padding side

* eos prompt enc dec

* prompt_padding_side

* drop prompt padding side collator

* working on decoder only

* dpo trainer

* Fix loss_mask type conversion bug

* bad attention mask

* try to get the same tokens as main

* fix loss mask

* fix unused col

* added comment

* raise error when padding token not set

* remove private method tests

* initial vlm support

* make it work for paligemma

* minor test updates

* style

* improve readability

* improve doc

* style

* flush left and truncate

* flush left in the code

* fix empty_cols and make max_length optional

* always add eos token

* minor changes and doc

* style

* fix docstring

* preference collator in doc

* fix doc

* optional max_completion_length

* Investigating CI failing

* style

* just dpo trainer test

* just idefics

* paligemma

* llava

* test cli

* dataset in test

* all tests

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

* reference to ref

* rich descriptions

* fix logits reporting

* fix truncation

* remove chat template from dpo_vlm

* `get_batch_sample` -> `generate_from_model[_and_ref]`

* add `num_items_in_batch=None`

* `num_items_in_batch` in `training_step`

* Fix return type hint

* test tokenize row

* fix test

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-21 12:47:33 +02:00
31b7820aad 🔀 Rename get_batch_sample and add num_items_in_batch to compute_loss (#2246) 2024-10-18 21:02:24 +02:00
b9aa965cce Enhance log report script: add error handling and logging (#2232)
* Update log_example_reports.py

1. Added logging: Imported the logging module and set up a logger in the main function. This allows for better error tracking and debugging.

2. Improved file reading: Used a with statement to ensure the file is properly closed after reading. Also added error handling to catch and log any issues when reading the file.

3. Error handling for Slack SDK import: Added a try-except block to handle cases where the slack_sdk might not be installed.

4. Enhanced Slack message sending: Added error handling and logging for the Slack message sending process. This will help identify any issues with the Slack integration.

* style

* Update log_reports.py

1. Logging: Added logging to track errors and important events.

2. Error Handling: Wrapped the log file processing in a try-except block to handle potential errors gracefully.

3. Logging Total Failed Tests: Added a log statement to report the total number of failed tests

* style

* further improve

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-18 19:40:30 +02:00
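The bullets above describe standard patterns (a module-level logger, a `with` block for file reading, and an import guard for the optional `slack_sdk` dependency); a condensed, hedged sketch in which the function names and environment variable are assumptions:

```python
import logging
import os

logger = logging.getLogger(__name__)

def read_log(path: str) -> str:
    # "with" guarantees the file handle is closed; failures are logged instead of crashing the report.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        logger.exception("Could not read log file %s", path)
        return ""

def post_to_slack(channel: str, text: str) -> None:
    # Guard the optional dependency so the script still runs when slack_sdk is missing.
    try:
        from slack_sdk import WebClient
    except ImportError:
        logger.warning("slack_sdk is not installed; skipping Slack notification.")
        return
    try:
        client = WebClient(token=os.environ.get("SLACK_BOT_TOKEN", ""))
        client.chat_postMessage(channel=channel, text=text)
    except Exception:
        logger.exception("Failed to send Slack message.")
```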
a67f2143c3 Update SFT examples (#2244) 2024-10-17 14:11:46 +02:00
494b4afa10 [CLI] Setting capture output to False (#2239)
* setting capture output to False

* Update trl/commands/cli.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-17 11:04:23 +02:00
02f4e750c0 DPO support remove_unused_columns (#2233) 2024-10-16 10:00:27 +02:00
2ba3005d1c Updated ScriptArguments warning messages (#2230) 2024-10-15 07:46:58 +02:00
7e394b03e8 🎭 Deprecate [SFT/DPO/Reward]ScriptArguments in favour of ScriptArguments (#2145)
* `DPOScriptArguments` to `ScriptArguments`

* use dataset_train_split

* Use scriptarguments

* dataset names in command lines

* use `ScriptArguments` everywhere

* ignore bias buffer to end

* remove in v0.13

* rm comment

* update test commands

* Update docs/source/rloo_trainer.md

* Update tests/test_rloo_trainer.py

* Added dataset_train_split argument to ppo.py and rloo.py

* update scripts with dataset_train_split
2024-10-14 11:14:58 +02:00
14f3613dac Update commands for code linting in contributing guidelines (#2225)
* update commands for code linting in contributing guidelines

* update docs on code formatting in contributing guidelines

* fix markdown rendering error

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* "sans" -> "without"

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-13 09:22:24 +02:00
5e24101b36 📒 Fix type/format confusions (#2223) 2024-10-11 23:39:19 +02:00
b81a6121c3 Add GKD to dataset_formats.mdx (#2222)
* Update dataset_formats.mdx

* Update dataset_formats.mdx

* Update docs/source/dataset_formats.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Modified to Prompt-completion

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-11 21:52:20 +02:00
7f0d246235 Add Sequence-Level KD (#2220)
* Fix templates for dpo, etc.

* Update dpo.py

Add the third issue fixs

* make this a utility.

* Add Sequence-Level KD

* add to the docs-strings and the documentation

* reviewed

* Update docs/source/gkd_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-11 20:14:09 +02:00
70036bf87f 🕊️ Migration PPOv2 -> PPO (#2174)
* delete old ppo

* rename ppov2 files

* PPOv2 -> PPO

* rm old doc

* rename ppo doc file

* rm old test

* rename test

* re-add v2 with deprecation

* style

* start update customization

* Lion

* Finish update customization

* remove ppo_multi_adapter

* remove ppo example

* update some doc

* rm test no peft

* rm hello world

* processing class

* Update docs/source/detoxifying_a_lm.mdx

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update trl/trainer/ppov2_config.py

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update docs/source/customization.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/detoxifying_a_lm.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* po to example overview

* drop lion

* remove "Use 8-bit optimizer"

* Update docs/source/customization.mdx

* Update docs/source/customization.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* it applies to all trainers

---------

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-11 17:28:39 +02:00
d0aa421e5e Conversational dataset support for ORPOTrainer (#2184)
* default learning rate

* update trainer

* update test

* update script

* update dataset format

* add line in dpo doc

* update orpo doc

* refine implicit/explicit

* update demo chat
2024-10-11 17:08:28 +02:00
5375d71bbd trl env report all cuda devices (#2216) 2024-10-11 16:32:34 +02:00
6004e033a4 Updated README.md with CLI examples and additional usage instructions (#2199)
* Updated README.md with CLI examples and additional usage instructions

Added Command Line Interface (CLI) examples for SFT, DPO, and Chat features.
Improved the "How to Use" section by providing code examples for SFTTrainer and RewardTrainer.
Included installation instructions for both Python Package and source-based installation.
Refined highlights to better showcase efficiency and scalability features.
Updated the repository clone instructions for working with examples.
Added new links to CLI documentation and contribution guide for better navigation.

* Update README.md

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

* update badges

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-11 16:31:38 +02:00
f436c3e1c9 Update README.md (#2180)
* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update README.md

* Update README.md

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
2024-10-11 16:14:46 +02:00
cd1aa6bdcc [Judges] Soft judges for PairRM (#2221)
* initial soft judges

* add soft-judge to PairRM

* remove comments

* fix from review
2024-10-11 15:53:42 +02:00
b3f93f0bad Report to "none" in GKD test (#2214) 2024-10-10 19:05:55 +02:00
6c32c8bfcd Improve slack reporting (#2182)
* Update log_example_reports.py

1. Added logging: Imported the logging module and set up a logger in the main function. This allows for better error tracking and debugging.

2. Improved file reading: Used a with statement to ensure the file is properly closed after reading. Also added error handling to catch and log any issues when reading the file.

3. Error handling for Slack SDK import: Added a try-except block to handle cases where the slack_sdk might not be installed.

4. Enhanced Slack message sending: Added error handling and logging for the Slack message sending process. This will help identify any issues with the Slack integration.

* style

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-10 17:42:06 +02:00
3107a40f16 Update incorrect data processing in DataCollatorForChatML (#2172)
* Update incorrect data processing in DataCollatorForChatML

Fix the extra BOS token and the absence of an EOS token in the returned input_ids, and potentially the absence of a target string in the returned labels.

* Update trl/trainer/utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

* move comment

* add test for DataCollatorForChatML

* update comment with more details

* update assert reports and comments, and add verification that the last token of input_ids is the EOS token

* new line at the end of file for code quality

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* update tests

* fix test

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* formatting

* fix typo

* simplify

* Revert "simplify"

This reverts commit 7e4006c87265665183032932ca05dffef567e38b.

* tokenize full messages

* dont add eos

* eos is in the last token

* simplify DataCollatorForChatML

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-10 12:49:10 +02:00
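The fix above boils down to two invariants on the collated batch: no duplicated BOS at the start, and the sequence ends with an EOS whose target tokens are kept in the labels. A hedged, test-style sketch of those checks, where `batch` is assumed to be the output of `DataCollatorForChatML`:

```python
# Illustrative assertions only; not taken from tests/test_utils.py.
def check_chatml_batch(batch, tokenizer):
    input_ids = batch["input_ids"][0].tolist()
    labels = batch["labels"][0].tolist()

    # No extra BOS token at the start.
    if tokenizer.bos_token_id is not None:
        assert input_ids[:2].count(tokenizer.bos_token_id) <= 1

    # The sequence ends with EOS, and the labels keep the target string (not all masked out).
    assert input_ids[-1] == tokenizer.eos_token_id
    assert any(label != -100 for label in labels)
```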
419791695c Drop decoder_input_ids in DPOTrainer (#2208) 2024-10-10 10:20:40 +02:00
7e5924d17e [GKD] interpolate in prob. space (#2204)
* interpolate in prob. space

* better var names

* use logsumexp

* set beta dtype

* beta tensor
2024-10-09 12:13:18 +02:00
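"Interpolate in prob. space" with `logsumexp` refers to forming the log of a mixture of the student and teacher distributions without leaving log-space. A hedged sketch of that computation; which distribution `beta` weights is a convention choice and is not asserted to match TRL's:

```python
import math
import torch

def mixture_log_probs(student_log_probs, teacher_log_probs, beta: float):
    # log(beta * p_teacher + (1 - beta) * p_student), computed stably in log-space.
    # Assumes 0 < beta < 1 and that the inputs are log-probabilities over the vocabulary.
    stacked = torch.stack(
        [teacher_log_probs + math.log(beta), student_log_probs + math.log(1.0 - beta)]
    )
    return torch.logsumexp(stacked, dim=0)

# Toy usage: the mixture is itself a valid distribution.
student = torch.log_softmax(torch.randn(2, 5, 32), dim=-1)
teacher = torch.log_softmax(torch.randn(2, 5, 32), dim=-1)
mix = mixture_log_probs(student, teacher, beta=0.5)
print(torch.allclose(mix.exp().sum(-1), torch.ones(2, 5), atol=1e-5))  # True
```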
ed9ea74b62 [DPO] Adding weighted preference optimization (WPO) (#2141)
* skeleton

* add weighting arg in config

* formatting

* fix doc

* do not compute gradients in weighting term

* fixed detach

* add WPO doc

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-08 19:52:54 +02:00
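The bullets "do not compute gradients in weighting term" and "fixed detach" point at a common pattern: the per-example weight is computed from detached quantities so it rescales the loss without contributing gradients. A generic, hedged sketch; the weighting formula below is illustrative, not necessarily WPO's exact definition:

```python
import torch

# Toy stand-ins for per-example DPO losses and policy log-probabilities.
per_example_loss = torch.randn(8, requires_grad=True)
chosen_logps = torch.randn(8, requires_grad=True)
rejected_logps = torch.randn(8, requires_grad=True)

# detach() (or a torch.no_grad() block) keeps the weight out of the backward graph.
weight = torch.exp(chosen_logps.detach() + rejected_logps.detach())

loss = (weight * per_example_loss).mean()
loss.backward()
print(chosen_logps.grad)  # None: no gradient flows through the weighting term
```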
511c92c91c Get the aux_loss_coef at BCOTrainer, CPOTrainer, KTOTrainer, and ORPOTrainer initialization (#2201)
* Fix aux_loss coefficient bug of BCOTrainer

* Fix aux_loss coefficient bug of CPOTrainer

* Fix aux_loss coefficient bug of KTOTrainer

* Fix aux_loss coefficient bug of ORPOTrainer
2024-10-08 16:17:09 +02:00
c6cb6353a5 Get the aux_loss_coef at DPOTrainer initialization (#2200) 2024-10-08 16:06:48 +02:00
adb3e0560b ♾️ [CI] Use transformers from source in "tests_no_optional_dep" (#2198) 2024-10-08 12:19:04 +02:00
adf58d80d0 skip_prompt=True in TextIteratorStreamer (#2193)
* skip_prompt in `TextIteratorStreamer`

* Update trl/commands/cli.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update generation streamer in chat.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-07 17:38:40 +02:00
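For context, `skip_prompt=True` tells the streamer not to re-emit the prompt tokens, so the chat CLI streams only newly generated text. A hedged usage sketch (the model id is a placeholder):

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Tell me a joke.", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# generate() runs in a background thread while the main thread consumes streamed text.
thread = Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=64, streamer=streamer))
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```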
9aa022503c Update README.md (#2186)
* Update README.md

Fix grammatical errors in README.md
fixes issue #2185

Description:

I found a grammatical error in the README.md of the project. This PR fixes the error to improve the overall readability and clarity of the documentation.

Changes:
- Corrected grammatical errors
- Updated lines to reflect the correct grammar

Reasoning: The original text contained a grammatical error that could confuse readers. This fix ensures that the documentation is accurate and easy to understand.

Closes #2185

* Update README.md

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
2024-10-07 14:30:00 +02:00
82ad390caf Fix RLOO checkpointing (#2114)
* Fix RLOO checkpointing for transformers>=4.45.0

* Add missing import

* Fix pre-commit issues

* Added test for RLOO checkpointing

* Ensure that tokenizer matches SFT and Reward model

* Pre-commit formatting

* processing class

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-07 13:11:17 +02:00
ac038ef03a Update CONTRIBUTING.md (#2181)
* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-07 06:56:19 -04:00
51ca76b749 [CI] fix dpo gpu ci tests (#2189)
* fix dpo ci test

* color-blind
2024-10-07 10:59:43 +02:00
7005ab4d11 🃏 Model card: "unsloth" tag (#2173) 2024-10-07 10:57:05 +02:00
ffb1ab74ba Update documentation CLI Chat (#2191) 2024-10-07 10:33:51 +02:00
47d08a9626 Rename trainer arg tokenizer to processing_class (#2162) 2024-10-07 09:39:32 +02:00
70327c18e6 add trl to tag for models (#2178) 2024-10-07 08:12:44 +02:00
f05c3fa8fc minor KTO setting changes + KL batch size (#2153)
* add argument for dropout

* increase default lr

* change default lr in examples

* fix bug in calculation of KL batch size

* KL batch size should be args.per_device_train_batch_size

* Update kto_trainer.mdx with hparam recs

* typo

* allow dropout to be disabled

* update lr in sample script

* Update kto_config.py

* Update trl/trainer/kto_trainer.py

* Update docs/source/kto_trainer.mdx

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-06 13:13:11 +02:00
4799ba4842 Capybara replaced with ultrafeedback_binarized (#2183) 2024-10-05 18:49:48 +02:00
d45c86e2a7 Conversational dataset support for CPOTrainer (#2144)
* extract prompt and apply chat template in cpo trainer

* default learning rate

* simplify example

* update doc

* test all formats

* extend extract prompt

* improve doc format

* link in dataset formats

* Update docs/source/cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-04 18:01:02 +02:00
c6b0d1358b 🗑️ Set deprecation version for DPO and SFT arguments to version 0.13 (#2170) 2024-10-04 17:46:55 +02:00
3321084e30 Update trl version in CITATION.cff (#2171) 2024-10-04 12:24:09 +02:00
a9cffc7caf Default dataset_text_field to "text" (#2078)
* clarify ConstantLengthDataset usage

* dont provide dataset text field when formatting func is provided

* kto maybe_apply_chat_template

* default text field

* doc

* remove maybe_apply_chat_template from kto example

* dataset text field always a str

* remove `dataset_text_field="text"`

* update doc
2024-10-04 10:55:47 +02:00
32a928cfc2 🏷️ Model badges in trainer documentation (#2160) 2024-10-04 10:55:06 +02:00
1a3bb372ac Fix typo in error message (#2168)
occured -> occurred
2024-10-04 09:36:52 +02:00
d4564b7c64 ↩️ Revert tokenizer hotfix #2163 2024-10-04 00:14:12 +02:00
1be4d86ccc 🩹 [Hotfix] Add setter for tokenizer (#2163) 2024-10-03 16:13:50 +02:00
78249d9de4 Conversational dataset support for DPOTrainer (#2131)
* conversational dataset support for dpo

* support standard dataset for extract prompt

* test standard dataset for extract prompt

* fix maybe

* fix maybe apply prompt

* style

* overwrite default learning rate of DPO

* style

* rlaif script

* `writer_batch_size` in `train_test_split`

* initial dpo doc refactoring

* vision data section in doc

* lil format modif

* refine Vision datasets

* refine doc

* test new loss type format

* restrcture loss function

* table loss type

* simplify `unsloth`

* improve doc

* logged metrics up

* refine loss section

* Fix label_smoothing parameter in DPOConfig

* dataset for test

* update readme

* Update docs/source/dpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* try colorized code block

* refine doc style

* further refine doc

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* re add pali gemma test

* Add missing period

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-02 10:04:03 +02:00
5c21de30ae [CI] Don't use eval_strategy="steps" when no eval dataset (#2152)
* `eval_strategy="steps" if eval_dataset else "no"`

* tmp skip test

* drop `eval_strategy` in `test_sft_trainer_uncorrect_data`

* remove eval strategy
2024-10-01 21:46:41 +02:00
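The first bullet spells out the pattern; a hedged sketch of how it reads in a script, where the config class and the `eval_dataset` variable are assumptions for illustration:

```python
from trl import SFTConfig

eval_dataset = None  # e.g. dataset.get("test") in a script; None when no eval split exists

training_args = SFTConfig(
    output_dir="out",
    eval_strategy="steps" if eval_dataset is not None else "no",
)
```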
0a566f0c58 🩹 Fix attention mask warning in chat CLI (#2147)
* explicit attention mask

* fix chat command
2024-10-01 10:53:18 +02:00
de3876577c [GKD] Set custom EOS tokens in generation config (#2142)
* Expose EOS token IDs in GKD generation

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Revert

* Refactor EOS token setting

* Remove EOS from config

* Refactor

* Add unit test

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-30 13:53:16 +02:00
1201aa61b4 rename example (#2139) 2024-09-27 21:45:21 +02:00
c00722ce0a 🃏 Model card for TRL (#2123)
* template and util

* test for online dpo

* template in package_data

* template in manifest

* standardize push_to_hub

* wandb badge and quick start

* bco

* xpo

* simplify `create_model_card`

* cpo

* kto

* dpo

* gkd

* orpo

* style

* nash-md

* alignprop

* bco citation

* citation template

* cpo citation

* ddpo

* fix alignprop

* dpo

* gkd citation

* kto

* online dpo citation

* orpo citation

* citation in utils

* optional citation

* reward

* optional trainer citation

* sft

* remove add_model_tags bco

* Remove unnecessary code for adding model tags

* Fix model tag issue and update URL format

* Remove unused code for adding model tags

* Add citation for XPOTrainer

* Remove unused code in SFTTrainer

* Add model card generation in RLOOTrainer

* Remove unused import and method call in reward_trainer.py

* Add model card generation

* Remove unused code and update error message in ORPOTrainer class

* Add import statements and create model card in IterativeSFTTrainer

* Add dataset name to push_to_hub() call

* Update trainer.push_to_hub() dataset names

* script args

* test

* better doc

* fix tag test

* fix test tag

* Add tags parameter to create_model_card method

* doc

* script args

* Update trl/templates/model_card.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* unittest's `assertIn` instead of `assert`

* Update trl/templates/model_card.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-27 15:23:05 +02:00
124189c86a Add correct label for WinRateCallback table (#2134)
Small fix to make it clear in WandB which table is which
2024-09-27 10:33:41 +02:00
d5eeaab462 arXiv to HF papers (#2133) 2024-09-27 09:00:49 +02:00
5368be1e1e 🧹 Style (#2132)
* drop `# flake8: noqa` in examples

* `__init__.py`

* fix init

* unwrap_model_for_generation

* ignore import violation in init
2024-09-26 21:02:48 +02:00
b169e1030d Add table for WinRateCallback (#2116)
* Add table for WinRateCallback

* Fix tests

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Refactor

* Remove super

* Clean

* Clean

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-26 19:28:44 +02:00
9af4734178 ♻️ Standardize script_args (#2130) 2024-09-26 15:23:42 +02:00
a0d714949f Tokenize row during in training_step in OnlineDPOTrainer (#2117)
* tokenize while training

* same for nashmd and xpo

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-26 11:58:14 +02:00
a0e28143ec Eos token encouragement Clarification (#2128)
* Update nash_md_trainer.md

* Update online_dpo_trainer.md

* Update xpo_trainer.mdx

* Fixing XPO Script Location
2024-09-26 11:47:48 +02:00
32d9d34eb1 Standardize pushing to Hub in examples (#2126) 2024-09-26 10:00:51 +02:00
fb1b48fdbe Remove max_length from RewardDataCollatorWithPadding (#2119) 2024-09-26 09:59:12 +02:00
b5e4bc5984 Update example_overview.md (#2125) 2024-09-25 20:45:57 +02:00
7a24565d9d Generalizes VSFT script to support REDACTED (#2120)
* generalizes vst script

* precommit

* change launch command to use accelerate

* updates docs

* rename to sft_vlm

* fix script location

* fix formatting

* comma

* add model link

* fix name

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-25 19:54:44 +02:00
44a06fc487 BCOTrainer conversational dataset support (#2107)
* update test

* maybe_apply_chat_template

* simplify bco example

* Update documentation

* Update examples/scripts/bco.py

* Update docs/source/bco_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-24 18:15:57 +02:00
a84fc5d815 Fix packing test (#2111)
* Fix pack test

* same for eval
2024-09-24 17:12:54 +02:00
80038a5a92 [online-dpo] allow parse-args as list of floats (#2108)
* use a separate argument for list of floats

* do super first

* fix docstrings

* typos

* use list of floats only

* check if it has len

* fix docstring

* fix suggestion

* fix default

* Update trl/trainer/online_dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/xpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/nash_md_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/nash_md_config.py

* additional tests

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-24 16:56:27 +02:00
cece86b182 fix formatting (#2109)
* fix formatting

* formatting
2024-09-24 16:05:55 +02:00
d005980d8b Fix documentation links (#2105) 2024-09-24 15:35:29 +02:00
cc23b511e4 [RewardTrainer] Tokenize inputs within trainer (#2102)
* Pretokenize in reward modelling

* Fix README example

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Move chat template formatting inside trainer

* Refactor tests

* Fix README

* Disable wandb

* Update readme

* add comment `remove_unused_columns`

* Update trl/trainer/reward_config.py

* doc

* implicit*

* explicit

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-24 13:03:32 +02:00
2cad48d511 [CLI] trl env for printing system info (#2104) 2024-09-24 09:57:24 +02:00
6859e048da Fix PPO/RLOO examples (#2100) 2024-09-23 11:49:36 +02:00
92eea1f239 Clean up README and remove openrlbenchmark dependency (#2085)
* Clean up README

* Add Kashif and Quentin

* Refactor

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Add citation

* Omit benchmarks from dev install

* Remove openrlbenchmark

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-23 09:21:41 +02:00
663002f609 KTO: fix logits metric, add logits metric to BCOTrainer too (#2094)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2024-09-21 19:08:10 +02:00
44d998b2af Fix _process_tokens for empty prompts in KTOTrainer (#2093)
The function _process_tokens in trl/trainer/kto_trainer.py crashes if prompt_input_ids is an empty list.
- added a check for nonzero length
- added a check for nonzero length of answer_input_ids for consistency

The checks happen when determining whether to subtract 1 from max_length (which is done when BOS or EOS is already present).
2024-09-21 12:49:54 +02:00
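A hedged sketch of the guard described above; the variable names follow the description rather than the exact code in kto_trainer.py:

```python
def adjusted_max_length(max_length, prompt_input_ids, answer_input_ids, bos_token_id, eos_token_id):
    # Only inspect the first/last token when the sequences are non-empty,
    # so empty prompts or answers no longer crash the adjustment.
    if len(prompt_input_ids) > 0 and bos_token_id is not None and prompt_input_ids[0] == bos_token_id:
        max_length -= 1  # BOS already present
    if len(answer_input_ids) > 0 and eos_token_id is not None and answer_input_ids[-1] == eos_token_id:
        max_length -= 1  # EOS already present
    return max_length
```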
9b80f3d50c fix: device could be in meta, transformers#33154 (#2089)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2024-09-21 09:11:34 +02:00
2038e52c30 Fix typo in orpo example. (#2092) 2024-09-21 09:11:01 +02:00
10c2f63b2a training_args for all TrainingArguments (#2082) 2024-09-19 15:03:47 +02:00
9fb871f62f [SFT] fix neftune_noise_alpha in SFTTrainer (#1841)
* fix neftune_noise_alpha

* del neftune_noise_alpha first

* check len after removing handle

* make sure we do not load twice

* Update trl/trainer/sft_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* remove neftune from SFTTrainer as the superclass has it now

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-19 11:57:36 +02:00
3cec013a20 Bump dev version 2024-09-19 10:47:21 +02:00
183 changed files with 8144 additions and 10265 deletions


@@ -15,7 +15,7 @@ body:
id: system-info
attributes:
label: System Info
description: Please share your system info with us. You can run the command `transformers-cli env` and copy-paste its output below.
description: Please share your system info with us. You can run the command `trl env` and copy-paste its output below.
placeholder: trl version, transformers version, platform, python version, ...
validations:
required: true


@@ -1,27 +0,0 @@
name: Stale Bot
on:
schedule:
- cron: "0 15 * * *"
jobs:
close_stale_issues:
name: Close Stale Issues
if: github.repository == 'huggingface/trl'
runs-on: ubuntu-latest
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: 3.8
- name: Install requirements
run: |
pip install PyGithub
- name: Close stale issues
run: |
python scripts/stale.py


@@ -1,46 +0,0 @@
name: tests on transformers PEFT main
on:
push:
branches: [ main ]
env:
CI_SLACK_CHANNEL: ${{ secrets.CI_PUSH_MAIN_CHANNEL }}
jobs:
tests:
strategy:
matrix:
python-version: ['3.9', '3.10', '3.11']
os: ['ubuntu-latest', 'windows-latest']
fail-fast: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# install PEFT & transformers from source
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/transformers.git
# cpu version of pytorch
pip install ".[test, diffusers]"
- name: Test with pytest
run: |
make test
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: 🤗 Results of the TRL CI on transformers/PEFT main
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}


@@ -1,88 +1,163 @@
name: tests
name: Tests
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
# Run only when relevant files are modified
- "trl/**.py"
- ".github/**.yml"
- "examples/**.py"
- "scripts/**.py"
- ".github/**.yml"
- "tests/**.py"
- "trl/**.py"
- "setup.py"
env:
TQDM_DISABLE: 1
CI_SLACK_CHANNEL: ${{ secrets.CI_PUSH_MAIN_CHANNEL }}
jobs:
check_code_quality:
name: Check code quality
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.9]
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
submodules: recursive
- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
python-version: 3.12
- uses: pre-commit/action@v3.0.1
with:
extra_args: --all-files
tests:
needs: check_code_quality
name: Tests
strategy:
matrix:
python-version: ['3.9', '3.10', '3.11']
python-version: ['3.9', '3.10', '3.11', '3.12']
os: ['ubuntu-latest', 'windows-latest']
fail-fast: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# install PEFT & transformers from source
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/transformers.git
# cpu version of pytorch
pip install ".[test, diffusers]"
- name: Test with pytest
run: |
make test
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
tests_no_optional_dep:
needs: check_code_quality
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[dev]"
- name: Test with pytest
run: |
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with ${{ matrix.python-version }} on ${{ matrix.os }} with lastest dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_dev:
name: Tests with dev dependencies
runs-on: 'ubuntu-latest'
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
uses: actions/setup-python@v5
with:
python-version: '3.9'
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# cpu version of pytorch
pip install .[test]
- name: Test with pytest
run: |
make test
- uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install -U git+https://github.com/huggingface/accelerate.git
python -m pip install -U git+https://github.com/huggingface/datasets.git
python -m pip install -U git+https://github.com/huggingface/transformers.git
python -m pip install ".[dev]"
- name: Test with pytest
run: |
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with ${{ matrix.python-version }} on ${{ matrix.os }} with dev dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_wo_optional_deps:
name: Tests without optional dependencies
runs-on: 'ubuntu-latest'
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[test]"
- name: Test with pytest
run: |
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with ${{ matrix.python-version }} on ${{ matrix.os }} without optional dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_min_versions:
name: Tests with minimum versions
runs-on: 'ubuntu-latest'
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: "pip"
cache-dependency-path: |
setup.py
requirements.txt
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install accelerate==0.34.0
python -m pip install datasets==2.21.0
python -m pip install transformers==4.46.0
python -m pip install ".[dev]"
- name: Test with pytest
run: |
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with ${{ matrix.python-version }} on ${{ matrix.os }} with minimum versions
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

.gitignore (vendored): 1 change

@@ -1,4 +1,3 @@
benchmark/trl
*.bak
.gitattributes
.last_checked


@@ -17,6 +17,12 @@ authors:
family-names: Thrush
- given-names: Nathan
family-names: Lambert
- given-names: Shengyi
family-names: Huang
- given-names: Kashif
family-names: Rasul
- given-names: Quentin
family-names: Gallouédec
repository-code: 'https://github.com/huggingface/trl'
abstract: "With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by \U0001F917 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point, most decoder and encoder-decoder architectures are supported."
keywords:
@@ -25,4 +31,4 @@ keywords:
- pytorch
- transformers
license: Apache-2.0
version: 0.2.1
version: 0.11.1


@@ -20,7 +20,7 @@ There are several ways you can contribute to TRL:
* Fix outstanding issues with the existing code.
* Submit issues related to bugs or desired new features.
* Implement trainers for new post-training algorithms.
* Contribute to the examples or to the documentation.
* Contribute to the examples or the documentation.
If you don't know where to start, there is a special [Good First
Issue](https://github.com/huggingface/trl/contribute) listing. It will give you a list of
@@ -62,7 +62,7 @@ Once you've confirmed the bug hasn't already been reported, please include the f
To get the OS and software versions automatically, run the following command:
```bash
transformers-cli env
trl env
```
### Do you want a new feature?
@@ -74,19 +74,19 @@ If there is a new feature you'd like to see in TRL, please open an issue and des
Whatever it is, we'd love to hear about it!
2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better we'll be able to help you.
3. Provide a *code snippet* that demonstrates the features usage.
3. Provide a *code snippet* that demonstrates the feature's usage.
4. If the feature is related to a paper, please include a link.
If your issue is well written we're already 80% of the way there by the time you create it.
## Do you want to implement a new trainer?
New post-training methods are published on a frequent basis and those which satisfy the following criteria are good candidates to be integrated in TRL:
New post-training methods are published frequently and those that satisfy the following criteria are good candidates to be integrated into TRL:
* **Simplicity:** does the new method achieve similar performance as prior methods, but with less complexity? A good example is [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO), which provided a simpler and compelling alternative to RLHF methods.
* **Efficiency:** does the new method provide a significant improvement in training efficiency? A good example is [Odds Ratio Preference Optimization](https://arxiv.org/abs/2403.07691v2), which utilises a similar objective as DPO, but requires half the GPU VRAM.
* **Simplicity:** Does the new method achieve similar performance as prior methods, but with less complexity? A good example is Direct Preference Optimization (DPO) [[Rafailov et al, 2023]](https://huggingface.co/papers/2305.18290), which provided a simpler and compelling alternative to RLHF methods.
* **Efficiency:** Does the new method provide a significant improvement in training efficiency? A good example is Odds Ratio Preference Optimization (ORPO) [[Hong et al, 2023]](https://huggingface.co/papers/2403.07691), which utilizes a similar objective as DPO but requires half the GPU VRAM.
Methods which only provide incremental improvements at the expense of added complexity or compute costs are unlikely to be included in TRL.
Methods that only provide incremental improvements at the expense of added complexity or compute costs are unlikely to be included in TRL.
If you want to implement a trainer for a new post-training method, first open an issue and provide the following information:
@@ -102,7 +102,7 @@ Based on the community and maintainer feedback, the next step will be to impleme
## Do you want to add documentation?
We're always looking for improvements to the documentation that make it more clear and accurate. Please let us know how the documentation can be improved, such as typos, dead links and any missing, unclear or inaccurate content.. We'll be happy to make the changes or help you make a contribution if you're interested!
We're always looking for improvements to the documentation that make it more clear and accurate. Please let us know how the documentation can be improved, such as typos, dead links, and any missing, unclear, or inaccurate content... We'll be happy to make the changes or help you contribute if you're interested!
## Submitting a pull request (PR)
@@ -133,7 +133,7 @@ Follow these steps to start contributing:
3. Create a new branch to hold your development changes, and do this for every new PR you work on.
Start by synchronizing your `main` branch with the `upstream/main` branch (ore details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
```bash
$ git checkout main
@@ -180,18 +180,21 @@ Follow these steps to start contributing:
$ make test
```
TRL relies on `ruff` to format its source code
consistently. After you make changes, apply automatic style corrections and code verifications
that can't be automated in one go with:
TRL relies on `ruff` for maintaining consistent code formatting across its source files. Before submitting any PR, you should apply automatic style corrections and run code verification checks.
This target is also optimized to only work with files modified by the PR you're working on.
We provide a `precommit` target in the `Makefile` that simplifies this process by running all required checks and optimizations on only the files modified by your PR.
If you prefer to run the checks one after the other, the following command apply the
style corrections:
To apply these checks and corrections in one step, use:
```bash
$ make precommit
```
```bash
$ make precommit
```
This command runs the following:
- Executes `pre-commit` hooks to automatically fix style issues with `ruff` and other tools.
- Runs additional scripts such as adding copyright information.
If you prefer to apply the style corrections separately or review them individually, the `pre-commit` hook will handle the formatting for the files in question.
Once you're happy with your changes, add changed files using `git add` and
make a commit with `git commit` to record your changes locally:
@@ -221,10 +224,7 @@ Follow these steps to start contributing:
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's ok if maintainers ask you for changes. It happens to core contributors
too! So everyone can see the changes in the Pull request, work in your local
branch and push the changes to your fork. They will automatically appear in
the pull request.
7. It's ok if maintainers ask you for changes. It happens to core contributors too! To ensure everyone can review your changes in the pull request, work on your local branch and push the updates to your fork. They will automatically appear in the pull request.
### Checklist
@@ -245,14 +245,41 @@ Follow these steps to start contributing:
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
the [tests folder](https://github.com/huggingface/trl/tree/main/tests).
We use `pytest` in order to run the tests. From the root of the
repository, here's how to run tests with `pytest` for the library:
We use `pytest` to run the tests. From the root of the
repository here's how to run tests with `pytest` for the library:
```bash
$ python -m pytest -sv ./tests
```
In fact, that's how `make test` is implemented (sans the `pip install` line)!
That's how `make test` is implemented (without the `pip install` line)!
You can specify a smaller set of tests in order to test only the feature
You can specify a smaller set of tests to test only the feature
you're working on.
### Deprecation and Backward Compatibility
Our approach to deprecation and backward compatibility is flexible and based on the features usage and impact. Each deprecation is carefully evaluated, aiming to balance innovation with user needs.
When a feature or component is marked for deprecation, its use will emit a warning message. This warning will include:
- **Transition Guidance**: Instructions on how to migrate to the alternative solution or replacement.
- **Removal Version**: The target version when the feature will be removed, providing users with a clear timeframe to transition.
Example:
```python
warnings.warn(
"The `Trainer.foo` method is deprecated and will be removed in version 0.14.0. "
"Please use the `Trainer.bar` class instead.",
FutureWarning,
)
```
The deprecation and removal schedule is based on each feature's usage and impact, with examples at two extremes:
- **Experimental or Low-Use Features**: For a feature that is experimental or has limited usage, backward compatibility may not be maintained between releases. Users should therefore anticipate potential breaking changes from one version to the next.
- **Widely-Used Components**: For a feature with high usage, we aim for a more gradual transition period of approximately **5 months**, generally scheduling deprecation around **5 minor releases** after the initial warning.
These examples represent the two ends of a continuum. The specific timeline for each feature will be determined individually, balancing innovation with user stability needs.


@@ -2,4 +2,5 @@ include settings.ini
include LICENSE
include CONTRIBUTING.md
include README.md
recursive-exclude * __pycache__
recursive-exclude * __pycache__
include trl/templates/*.md


@@ -1,4 +1,4 @@
.PHONY: test precommit benchmark_core benchmark_aux common_tests slow_tests test_examples tests_gpu
.PHONY: test precommit common_tests slow_tests test_examples tests_gpu
check_dirs := examples tests trl
@@ -18,12 +18,6 @@ precommit:
pre-commit run --all-files
python scripts/add_copyrights.py
benchmark_core:
bash ./benchmark/benchmark_core.sh
benchmark_aux:
bash ./benchmark/benchmark_aux.sh
tests_gpu:
python -m pytest tests/test_* $(if $(IS_GITHUB_CI),--report-log "common_tests.log",)

README.md: 217 changes

@@ -1,207 +1,199 @@
# TRL - Transformer Reinforcement Learning
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_banner_dark.png">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_banner_dark.png" alt="TRL Banner">
</div>
# TRL - Transformer Reinforcement Learning
> Full stack library to fine-tune and align large language models.
<hr> <br>
<h3 align="center">
<p>A comprehensive library to post-train foundation models</p>
</h3>
<p align="center">
<a href="https://github.com/huggingface/trl/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/trl/index">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/trl/index.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/trl/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg">
</a>
<a href="https://github.com/huggingface/trl/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue"></a>
<a href="https://huggingface.co/docs/trl/index"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/trl/index.svg?down_color=red&down_message=offline&up_color=blue&up_message=online"></a>
<a href="https://github.com/huggingface/trl/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg"></a>
</p>
## Overview
## What is it?
The `trl` library is a full stack tool to fine-tune and align transformer language and diffusion models using methods such as Supervised Fine-tuning step (SFT), Reward Modeling (RM) and the Proximal Policy Optimization (PPO) as well as Direct Preference Optimization (DPO).
The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library and thus allows to use any model architecture available there.
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
## Highlights
- **`Efficient and scalable`**:
- [`accelerate`](https://github.com/huggingface/accelerate) is the backbone of `trl` which allows to scale model training from a single GPU to a large scale multi-node cluster with methods such as DDP and DeepSpeed.
- [`PEFT`](https://github.com/huggingface/peft) is fully integrated and allows to train even the largest models on modest hardware with quantisation and methods such as LoRA or QLoRA.
- [`unsloth`](https://github.com/unslothai/unsloth) is also integrated and allows to significantly speed up training with dedicated kernels.
- **`CLI`**: With the [CLI](https://huggingface.co/docs/trl/clis) you can fine-tune and chat with LLMs without writing any code using a single command and a flexible config system.
- **`Trainers`**: The Trainer classes are an abstraction to apply many fine-tuning methods with ease such as the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.DPOTrainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`PPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.PPOTrainer), [`CPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.CPOTrainer), and [`ORPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.ORPOTrainer).
- **`AutoModels`**: The [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) & [`AutoModelForSeq2SeqLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead) classes add an additional value head to the model which allows to train them with RL algorithms such as PPO.
- **`Examples`**: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier, full RLHF using adapters only, train GPT-j to be less toxic, [StackLlama example](https://huggingface.co/blog/stackllama), etc. following the [examples](https://github.com/huggingface/trl/tree/main/examples).
- **Efficient and scalable**:
- Leverages [🤗 Accelerate](https://github.com/huggingface/accelerate) to scale from single GPU to multi-node clusters using methods like DDP and DeepSpeed.
- Full integration with [`PEFT`](https://github.com/huggingface/peft) enables training on large models with modest hardware via quantization and LoRA/QLoRA.
- Integrates [Unsloth](https://github.com/unslothai/unsloth) for accelerating training using optimized kernels.
- **Command Line Interface (CLI)**: A simple interface lets you fine-tune and interact with models without needing to write code.
- **Trainers**: Various fine-tuning methods are easily accessible via trainers like [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`ORPOTrainer`](https://huggingface.co/docs/trl/orpo_trainer) and more.
- **AutoModels**: Use pre-defined model classes like [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) to simplify reinforcement learning (RL) with LLMs.
## Installation
### Python package
Install the library with `pip`:
### Python Package
Install the library using `pip`:
```bash
pip install trl
```
### From source
If you want to use the latest features before an official release you can install from source:
If you want to use the latest features before an official release, you can install TRL from source:
```bash
pip install git+https://github.com/huggingface/trl.git
```
### Repository
If you want to use the examples, you can clone the repository with the following command:
```bash
git clone https://github.com/huggingface/trl.git
```
## Command Line Interface (CLI)
You can use TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-tuning (SFT), Direct Preference Optimization (DPO) and test your aligned model with the chat CLI:
You can use the TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), or vibe check your model with the chat CLI:
**SFT:**
```bash
trl sft --model_name_or_path facebook/opt-125m --dataset_name stanfordnlp/imdb --output_dir opt-sft-imdb
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/Capybara \
--output_dir Qwen2.5-0.5B-SFT
```
**DPO:**
```bash
trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style --output_dir opt-sft-hh-rlhf
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name argilla/Capybara-Preferences \
--output_dir Qwen2.5-0.5B-DPO
```
**Chat:**
```bash
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
trl chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
```
Read more about the CLI in the [relevant documentation section](https://huggingface.co/docs/trl/main/en/clis) or use `--help` for more details.
## How to use
For more flexibility and control over the training, you can use the dedicated trainer classes to fine-tune the model in Python.
For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
### `SFTTrainer`
This is a basic example of how to use the `SFTTrainer` from the library. The `SFTTrainer` is a light wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom dataset.
Here is a basic example of how to use the `SFTTrainer`:
```python
# imports
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
from trl import SFTTrainer
# get dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = load_dataset("trl-lib/Capybara", split="train")
# get trainer
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT")
trainer = SFTTrainer(
"facebook/opt-350m",
args=training_args,
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
# train
trainer.train()
```
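As mentioned in the highlights, the trainers integrate with PEFT. Here is a hedged sketch of the same SFT setup with a LoRA adapter; it assumes `peft` is installed, and the LoRA hyperparameters are illustrative rather than recommended values.
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

# LoRA keeps the base model frozen and trains small adapter matrices instead.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

training_args = SFTConfig(output_dir="Qwen2.5-0.5B-SFT-LoRA")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```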
### `RewardTrainer`
This is a basic example of how to use the `RewardTrainer` from the library. The `RewardTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune reward models or adapters on a custom preference dataset.
Here is a basic example of how to use the `RewardTrainer`:
```python
# imports
from trl import RewardConfig, RewardTrainer
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer
# load model and dataset - dataset needs to be in a specific format
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
"Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
model.config.pad_token_id = tokenizer.pad_token_id
...
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# load trainer
training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_batch_size=2)
trainer = RewardTrainer(
args=training_args,
model=model,
tokenizer=tokenizer,
processing_class=tokenizer,
train_dataset=dataset,
)
# train
trainer.train()
```
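Once trained, the reward model is a regular sequence-classification model, so it can be used for scoring with plain Transformers. A small sketch, assuming the model was saved to the `output_dir` used above; the example text is illustrative.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-Reward")
model = AutoModelForSequenceClassification.from_pretrained("Qwen2.5-0.5B-Reward")

text = "What is the capital of France?\nThe capital of France is Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With num_labels=1, the single logit is the scalar reward for this text.
    reward = model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```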
### `PPOTrainer`
### `RLOOTrainer`
This is a basic example of how to use the `PPOTrainer` from the library. Based on a query the language model creates a response which is then evaluated. The evaluation could be a human in the loop or another model's output.
`RLOOTrainer` implements a [REINFORCE-style optimization](https://huggingface.co/papers/2402.14740) for RLHF that is more performant and memory-efficient than PPO. Here is a basic example of how to use the `RLOOTrainer`:
```python
# imports
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
from trl import RLOOConfig, RLOOTrainer, apply_chat_template
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoModelForSequenceClassification,
AutoTokenizer,
)
# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = create_reference_model(model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained(
"Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("trl-lib/ultrafeedback-prompt")
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset = dataset.map(lambda x: tokenizer(x["prompt"]), remove_columns="prompt")
# initialize trainer
ppo_config = PPOConfig(batch_size=1, mini_batch_size=1)
# encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
# get model response
response_tensor = respond_to_batch(model, query_tensor)
# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer)
# define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]
# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
training_args = RLOOConfig(output_dir="Qwen2.5-0.5B-RL")
trainer = RLOOTrainer(
config=training_args,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
)
trainer.train()
```
### `DPOTrainer`
`DPOTrainer` is a trainer that uses [Direct Preference Optimization algorithm](https://huggingface.co/papers/2305.18290). This is a basic example of how to use the `DPOTrainer` from the library. The `DPOTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune reward models or adapters on a custom preference dataset.
`DPOTrainer` implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the `DPOTrainer`:
```python
# imports
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer
from trl import DPOConfig, DPOTrainer
# load model and dataset - dataset needs to be in a specific format
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
...
# load trainer
trainer = DPOTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
# train
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(model=model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```
## Development
If you want to contribute to `trl` or customizing it to your needs make sure to read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and make sure you make a dev install:
If you want to contribute to `trl` or customize it to your needs, make sure to read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and do a dev install:
```bash
git clone https://github.com/huggingface/trl.git
@ -209,20 +201,11 @@ cd trl/
make dev
```
## References
### Proximal Policy Optimisation
The PPO implementation largely follows the structure introduced in the paper **"Fine-Tuning Language Models from Human Preferences"** by D. Ziegler et al. \[[paper](https://huggingface.co/papers/1909.08593), [code](https://github.com/openai/lm-human-preferences)].
### Direct Preference Optimization
DPO is based on the original implementation of **"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"** by E. Mitchell et al. \[[paper](https://huggingface.co/papers/2305.18290), [code](https://github.com/eric-mitchell/direct-preference-optimization)]
## Citation
```bibtex
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
@ -230,3 +213,7 @@ DPO is based on the original implementation of **"Direct Preference Optimization
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
## License
This repository's source code is available under the [Apache-2.0 License](LICENSE).

View File

@ -1,164 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import math
import os
import shlex
import subprocess
import uuid
from distutils.util import strtobool
import requests
def parse_args():
# fmt: off
parser = argparse.ArgumentParser()
parser.add_argument("--command", type=str, default="",
help="the command to run")
parser.add_argument("--num-seeds", type=int, default=3,
help="the number of random seeds")
parser.add_argument("--start-seed", type=int, default=1,
help="the number of the starting seed")
parser.add_argument("--workers", type=int, default=0,
help="the number of workers to run benchmark experimenets")
parser.add_argument("--auto-tag", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
help="if toggled, the runs will be tagged with git tags, commit, and pull request number if possible")
parser.add_argument("--slurm-template-path", type=str, default=None,
help="the path to the slurm template file (see docs for more details)")
parser.add_argument("--slurm-gpus-per-task", type=int, default=1,
help="the number of gpus per task to use for slurm jobs")
parser.add_argument("--slurm-total-cpus", type=int, default=50,
help="the number of gpus per task to use for slurm jobs")
parser.add_argument("--slurm-ntasks", type=int, default=1,
help="the number of tasks to use for slurm jobs")
parser.add_argument("--slurm-nodes", type=int, default=None,
help="the number of nodes to use for slurm jobs")
args = parser.parse_args()
# fmt: on
return args
def run_experiment(command: str):
command_list = shlex.split(command)
print(f"running {command}")
# Use subprocess.PIPE to capture the output
fd = subprocess.Popen(command_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, errors = fd.communicate()
return_code = fd.returncode
assert return_code == 0, f"Command failed with error: {errors.decode('utf-8')}"
# Convert bytes to string and strip leading/trailing whitespaces
return output.decode("utf-8").strip()
def autotag() -> str:
wandb_tag = ""
print("autotag feature is enabled")
git_tag = ""
try:
git_tag = subprocess.check_output(["git", "describe", "--tags"]).decode("ascii").strip()
print(f"identified git tag: {git_tag}")
except subprocess.CalledProcessError as e:
print(e)
if len(git_tag) == 0:
try:
count = int(subprocess.check_output(["git", "rev-list", "--count", "HEAD"]).decode("ascii").strip())
hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
git_tag = f"no-tag-{count}-g{hash}"
print(f"identified git tag: {git_tag}")
except subprocess.CalledProcessError as e:
print(e)
wandb_tag = f"{git_tag}"
git_commit = subprocess.check_output(["git", "rev-parse", "--verify", "HEAD"]).decode("ascii").strip()
try:
# try finding the pull request number on github
prs = requests.get(f"https://api.github.com/search/issues?q=repo:huggingface/trl+is:pr+{git_commit}")
if prs.status_code == 200:
prs = prs.json()
if len(prs["items"]) > 0:
pr = prs["items"][0]
pr_number = pr["number"]
wandb_tag += f",pr-{pr_number}"
print(f"identified github pull request: {pr_number}")
except Exception as e:
print(e)
return wandb_tag
if __name__ == "__main__":
args = parse_args()
if args.auto_tag:
existing_wandb_tag = os.environ.get("WANDB_TAGS", "")
wandb_tag = autotag()
if len(wandb_tag) > 0:
if len(existing_wandb_tag) > 0:
os.environ["WANDB_TAGS"] = ",".join([existing_wandb_tag, wandb_tag])
else:
os.environ["WANDB_TAGS"] = wandb_tag
print("WANDB_TAGS: ", os.environ.get("WANDB_TAGS", ""))
commands = []
for seed in range(0, args.num_seeds):
commands += [" ".join([args.command, "--seed", str(args.start_seed + seed)])]
print("======= commands to run:")
for command in commands:
print(command)
if args.workers > 0 and args.slurm_template_path is None:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=args.workers, thread_name_prefix="cleanrl-benchmark-worker-")
for command in commands:
executor.submit(run_experiment, command)
executor.shutdown(wait=True)
else:
print("not running the experiments because --workers is set to 0; just printing the commands to run")
# SLURM logic
if args.slurm_template_path is not None:
if not os.path.exists("slurm"):
os.makedirs("slurm")
if not os.path.exists("slurm/logs"):
os.makedirs("slurm/logs")
print("======= slurm commands to run:")
with open(args.slurm_template_path) as f:
slurm_template = f.read()
slurm_template = slurm_template.replace("{{array}}", f"0-{len(commands) - 1}%{args.workers}")
slurm_template = slurm_template.replace(
"{{seeds}}", f"({' '.join([str(args.start_seed + int(seed)) for seed in range(args.num_seeds)])})"
)
slurm_template = slurm_template.replace("{{len_seeds}}", f"{args.num_seeds}")
slurm_template = slurm_template.replace("{{command}}", args.command)
slurm_template = slurm_template.replace("{{gpus_per_task}}", f"{args.slurm_gpus_per_task}")
total_gpus = args.slurm_gpus_per_task * args.slurm_ntasks
slurm_cpus_per_gpu = math.ceil(args.slurm_total_cpus / total_gpus)
slurm_template = slurm_template.replace("{{cpus_per_gpu}}", f"{slurm_cpus_per_gpu}")
slurm_template = slurm_template.replace("{{ntasks}}", f"{args.slurm_ntasks}")
if args.slurm_nodes is not None:
slurm_template = slurm_template.replace("{{nodes}}", f"#SBATCH --nodes={args.slurm_nodes}")
else:
slurm_template = slurm_template.replace("{{nodes}}", "")
filename = str(uuid.uuid4())
open(os.path.join("slurm", f"{filename}.slurm"), "w").write(slurm_template)
slurm_path = os.path.join("slurm", f"{filename}.slurm")
print(f"saving command in {slurm_path}")
if args.workers > 0:
job_id = run_experiment(f"sbatch --parsable {slurm_path}")
print(f"Job ID: {job_id}")

View File

@ -1,26 +0,0 @@
export WANDB_ENTITY=huggingface
export WANDB_PROJECT=trl
bash $BENCHMARK_SCRIPT > output.txt
# Extract Job IDs into an array
job_ids=($(grep "Job ID:" output.txt | awk '{print $3}'))
# Extract WANDB_TAGS into an array
WANDB_TAGS=($(grep "WANDB_TAGS:" output.txt | awk '{print $2}'))
WANDB_TAGS=($(echo $WANDB_TAGS | tr "," "\n"))
# Print to verify
echo "Job IDs: ${job_ids[@]}"
echo "WANDB_TAGS: ${WANDB_TAGS[@]}"
TAGS_STRING="?tag=${WANDB_TAGS[0]}"
FOLDER_STRING="${WANDB_TAGS[0]}"
for tag in "${WANDB_TAGS[@]:1}"; do
TAGS_STRING+="&tag=$tag"
FOLDER_STRING+="_$tag"
done
echo "TAGS_STRING: $TAGS_STRING"
echo "FOLDER_STRING: $FOLDER_STRING"
TAGS_STRING=$TAGS_STRING FOLDER_STRING=$FOLDER_STRING BENCHMARK_PLOT_SCRIPT=$BENCHMARK_PLOT_SCRIPT sbatch --dependency=afterany:$job_ids benchmark/post_github_comment.sbatch

View File

@ -1,44 +0,0 @@
# hello world experiment
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/dpo.py --model_name_or_path=gpt2 --per_device_train_batch_size 4 --max_steps 1000 --learning_rate 1e-3 --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 500 --output_dir="dpo_anthropic_hh" --optim adamw_torch --warmup_steps 150 --report_to wandb --bf16 --logging_first_step --no_remove_unused_columns" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/sft.py --model_name_or_path="facebook/opt-350m" --report_to="wandb" --learning_rate=1.41e-5 --per_device_train_batch_size=64 --gradient_accumulation_steps=16 --output_dir="sft_openassistant-guanaco" --logging_steps=1 --num_train_epochs=3 --max_steps=-1 --push_to_hub --gradient_checkpointing" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/reward_modeling.py --model_name_or_path=facebook/opt-350m --output_dir="reward_modeling_anthropic_hh" --per_device_train_batch_size=64 --num_train_epochs=1 --gradient_accumulation_steps=16 --gradient_checkpointing=True --learning_rate=1.41e-5 --report_to="wandb" --remove_unused_columns=False --optim="adamw_torch" --logging_steps=10 --eval_strategy="steps" --max_length=512" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template

View File

@ -1,50 +0,0 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo$TAGS_STRING" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/ppo \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/rewards/accuracies&metrics=train/loss' \
"gpt2$TAGS_STRING" \
--env-ids dpo_anthropic_hh \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/dpo \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/loss&metrics=eval/accuracy&metrics=eval/loss' \
"facebook/opt-350m$TAGS_STRING" \
--env-ids reward_modeling_anthropic_hh \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/reward_modeling \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/loss' \
"facebook/opt-350m$TAGS_STRING" \
--env-ids sft_openassistant-guanaco \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/sft \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$FOLDER_STRING" \
--path_in_repo="images/benchmark/$FOLDER_STRING" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"

View File

@ -1,23 +0,0 @@
# compound experiments: gpt2xl + grad_accu
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name ppo_gpt2xl_grad_accu --model_name gpt2-xl --mini_batch_size 16 --gradient_accumulation_steps 8 --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
# compound experiments: Cerebras-GPT-6.7B + deepspeed zero2 + grad_accu
python benchmark/benchmark.py \
--command "accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo.py --exp_name ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2 --batch_size 32 --mini_batch_size 32 --log_with wandb --model_name cerebras/Cerebras-GPT-6.7B --reward_model sentiment-analysis:cerebras/Cerebras-GPT-6.7B" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 8 \
--slurm-ntasks 1 \
--slurm-total-cpus 90 \
--slurm-template-path benchmark/trl.slurm_template

View File

@ -1,31 +0,0 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo$TAGS_STRING" \
"ppo_gpt2xl_grad_accu$TAGS_STRING" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/different_models \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2$TAGS_STRING" \
--env-ids sentiment-analysis:cerebras/Cerebras-GPT-6.7B \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$FOLDER_STRING/deepspeed \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$FOLDER_STRING" \
--path_in_repo="images/benchmark/$FOLDER_STRING" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"

View File

@ -1,46 +0,0 @@
## w/ and w/o gradient accumulation
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name ppo_step_grad_accu --mini_batch_size 1 --gradient_accumulation_steps 128 --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
## w/ different models (gpt2, gpt2-xl, falcon, llama2)
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name ppo_gpt2 --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name ppo_falcon_rw_1b --model_name tiiuae/falcon-rw-1b --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
## w/ and w/o PEFT
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name ppo_peft --use_peft --log_with wandb" \
--num-seeds 3 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template

View File

@ -1,56 +0,0 @@
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
BASELINE_PR_TAG=v0.4.7-55-g110e672
BASELINE_PR_NAME=PR-662
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/sentiment \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_step_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb gradient accumulation ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/gradient_accu \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_gpt2?tag=$BASELINE_PR_TAG&cl=sentiment gpt2 ($BASELINE_PR_NAME)" \
"sentiment_tuning_falcon_rw_1b?tag=$BASELINE_PR_TAG&cl=sentiment tiiuae/falcon-rw-1b ($BASELINE_PR_NAME)" \
"sentiment_tuning_gpt2xl_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment gpt2xl ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/different_models \
--scan-history
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
"sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
"sentiment_tuning_peft?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb w/ peft ($BASELINE_PR_NAME)" \
--env-ids sentiment-analysis:lvwerra/distilbert-imdb \
--no-check-empty-runs \
--pc.ncols 2 \
--pc.ncols-legend 1 \
--output-filename benchmark/trl/$BASELINE_PR_TAG/peft \
--scan-history
python benchmark/upload_benchmark.py \
--folder_path="benchmark/trl/$BASELINE_PR_TAG" \
--path_in_repo="images/benchmark/$BASELINE_PR_TAG" \
--repo_id="trl-internal-testing/example-images" \
--repo_type="dataset"

View File

@ -1,40 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from ghapi.all import GhApi
FOLDER_STRING = os.environ.get("FOLDER_STRING", "")
folder = f"benchmark/trl/{FOLDER_STRING}"
host_url = f"https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/{FOLDER_STRING}"
# Create a GitHub API instance
github_context = json.loads(os.environ["GITHUB_CONTEXT"])
token = os.environ["PERSONAL_ACCESS_TOKEN_GITHUB"] # this needs to refreshed every 12 months
status_message = "**[COSTA BENCHMARK BOT]**: Here are the results"
body = status_message
repo = github_context["repository"]
owner, repo = repo.split("/")
api = GhApi(owner=owner, repo=repo, token=token)
# for each `.png` file in the folder, add it to the comment
for file in os.listdir(folder):
if file.endswith(".png"):
body += f"\n![{file}]({host_url}/{file})"
# Create a comment on the issue
api.issues.create_comment(issue_number=github_context["event"]["issue"]["number"], body=body)

View File

@ -1,9 +0,0 @@
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=hopper-cpu
#SBATCH --ntasks=1
#SBATCH --output=slurm/logs/%x_%j.out
sleep 2m
bash $BENCHMARK_PLOT_SCRIPT
srun python benchmark/post_github_comment.py

View File

@ -1,3 +0,0 @@
BENCHMARK_SCRIPT="benchmark/benchmark_level1.sh" \
BENCHMARK_PLOT_SCRIPT="benchmark/benchmark_level1_plot.sh" \
bash benchmark/benchmark_and_report.sh

View File

@ -1,19 +0,0 @@
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=hopper-prod
#SBATCH --gpus-per-task={{gpus_per_task}}
#SBATCH --cpus-per-gpu={{cpus_per_gpu}}
#SBATCH --ntasks={{ntasks}}
#SBATCH --output=slurm/logs/%x_%j.out
#SBATCH --array={{array}}
##SBATCH --exclude=ip-26-0-149-199
module load cuda/12.1
{{nodes}}
seeds={{seeds}}
seed=${seeds[$SLURM_ARRAY_TASK_ID % {{len_seeds}}]}
echo "Running task $SLURM_ARRAY_TASK_ID with seed: $seed"
srun {{command}} --seed $seed

View File

@ -1,37 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
import tyro
from huggingface_hub import HfApi
@dataclass
class Args:
folder_path: str = "benchmark/trl"
path_in_repo: str = "images/benchmark"
repo_id: str = "trl-internal-testing/example-images"
repo_type: str = "dataset"
args = tyro.cli(Args)
api = HfApi()
api.upload_folder(
folder_path=args.folder_path,
path_in_repo=args.path_in_repo,
repo_id=args.repo_id,
repo_type=args.repo_type,
)

View File

@ -41,7 +41,6 @@ accelerate launch $EXTRA_ACCELERATE_ARGS \
--dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--max_steps $MAX_STEPS \
--dataset_text_field 'text' \
--per_device_train_batch_size $BATCH_SIZE \
--max_seq_length $SEQ_LEN \
$EXTRA_TRAINING_ARGS

View File

@ -42,8 +42,6 @@
title: ORPO
- local: ppo_trainer
title: PPO
- local: ppov2_trainer
title: PPOv2
- local: reward_trainer
title: Reward
- local: rloo_trainer

View File

@ -1,5 +1,7 @@
# Aligning Text-to-Image Diffusion Models with Reward Backpropagation
[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop,trl)
## The why
If your reward function is differentiable, directly backpropagating gradients from the reward model to the diffusion model is significantly more sample- and compute-efficient (25x) than using a policy gradient algorithm like DDPO.
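To make the contrast concrete, here is a toy PyTorch sketch of the reward-backpropagation idea; it is not the AlignProp implementation, and the tiny modules stand in for the diffusion model and the reward model.
```python
import torch

# Toy stand-ins: a "generator" plays the role of the diffusion model and a
# differentiable "reward model" scores its outputs (illustration only).
generator = torch.nn.Linear(8, 8)
reward_model = torch.nn.Linear(8, 1)

z = torch.randn(16, 8)
samples = generator(z)

# Reward backpropagation: maximize the reward by descending its negative,
# letting gradients flow through the reward model into the generator.
loss = -reward_model(samples).mean()
loss.backward()
```
A policy-gradient method such as DDPO would instead treat the reward as a fixed scalar weight on the log-probability of the sampled trajectory, which requires many more samples to estimate the same gradient signal.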

View File

@ -1,54 +1,15 @@
# BCO Trainer
[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)
TRL supports the Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example have a look at [`examples/scripts/bco.py`].
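To make "the classifier's logit serves as a reward" concrete, here is a schematic sketch of the objective; it is not the `BCOTrainer` implementation, and it omits the reward-shift term used in the paper.
```python
import torch
import torch.nn.functional as F


def bco_loss_sketch(logratios: torch.Tensor, labels: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Schematic BCO-style objective.

    `logratios` is log pi_theta(y|x) - log pi_ref(y|x) per example, and `labels`
    is 1 for desirable completions and 0 for undesirable ones.
    """
    rewards = beta * logratios  # implicit reward used as the classifier logit
    # Desirable completions are pushed toward a positive logit, undesirable ones
    # toward a negative logit, exactly like a binary classifier.
    losses = torch.where(labels.bool(), -F.logsigmoid(rewards), -F.logsigmoid(-rewards))
    return losses.mean()
```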
## Expected dataset format
The BCO trainer expects a very specific format for the dataset as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, model completion, and a label to indicate whether the completion is "good" or "bad", we expect a dataset with the following columns:
- `prompt`
- `completion`
- `label`
for example:
```
bco_dataset_dict = {
"prompt": [
"Hey, hello",
"How are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
],
"completion": [
"hi nice to meet you",
"leave me alone",
"I don't have a name",
"My name is Mary",
"Python",
"C++",
"Java",
],
"label": [
True,
False,
False,
True,
True,
False,
False,
],
}
```
where the `prompt` contains the context inputs, `completion` contains the corresponding responses and `label` contains the corresponding flag that indicates if the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays. It is required that the dataset contains at least one desirable and one undesirable completion.
## Expected dataset type
The [`BCOTrainer`] requires an [unpaired preference dataset](dataset_formats#unpaired-preference).
The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
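For reference, a single row in the standard unpaired-preference format looks like this (the values are illustrative):
```python
# One row of an unpaired preference dataset: a prompt, one completion, and a
# boolean label marking the completion as desirable (True) or undesirable (False).
example = {
    "prompt": "Which is the best programming language?",
    "completion": "It depends on what you want to build.",
    "label": True,
}
```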
## Expected model format
The BCO trainer expects a model of `AutoModelForCausalLM`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
@ -71,7 +32,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
)
```
After this one can then call:
@ -114,7 +75,7 @@ bco_trainer = BCOTrainer(
model_ref,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
embedding_func=embedding_func,
embedding_tokenizer=self.embedding_tokenizer,
)

View File

@ -7,6 +7,7 @@ Currently supported CLIs are:
- `trl sft`: fine-tune a LLM on a text/instruction dataset
- `trl dpo`: fine-tune a LLM with DPO on a preference dataset
- `trl chat`: quickly spin up a LLM fine-tuned for chatting
- `trl env`: get the system information
## Fine-tuning with the CLI
@ -25,8 +26,6 @@ model_name_or_path:
trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name:
stanfordnlp/imdb
dataset_text_field:
text
report_to:
none
learning_rate:
@ -97,23 +96,76 @@ python examples/datasets/anthropic_hh.py --push_to_hub --hf_entity your-hf-org
The chat CLI lets you quickly load the model and talk to it. Simply run the following:
```bash
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```

> [!TIP]
> To use the chat CLI with the developer installation, you must run `make dev`
>

<pre><code>$ trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?

<strong><span style="color: blue;">&lt;Qwen/Qwen1.5-0.5B-Chat&gt;:</span></strong>
There isn't a "best" programming language, as everyone has different style preferences, needs, and preferences. However, some people commonly use
languages like Python, Java, C++, and JavaScript, which are popular among developers for a variety of reasons, including readability, flexibility,
and scalability. Ultimately, it depends on personal preference, needs, and goals.
</code></pre>
Note that the chat interface relies on the tokenizer's [chat template](https://huggingface.co/docs/transformers/chat_templating) to format the inputs for the model. Make sure your tokenizer has a chat template defined.
Besides talking to the model there are a few commands you can use:
- **clear**: clears the current conversation and start a new one
- **example {NAME}**: load example named `{NAME}` from the config and use it as the user input
- **set {SETTING_NAME}={SETTING_VALUE};**: change the system prompt or generation settings (multiple settings are separated by a ';').
- **reset**: same as clear but also resets the generation configs to defaults if they have been changed by **set**
- **save {SAVE_NAME} (optional)**: save the current chat and settings to file by default to `./chat_history/{MODEL_NAME}/chat_{DATETIME}.yaml` or `{SAVE_NAME}` if provided
- **exit**: closes the interface
- `clear`: clears the current conversation and start a new one
- `example {NAME}`: load example named `{NAME}` from the config and use it as the user input
- `set {SETTING_NAME}={SETTING_VALUE};`: change the system prompt or generation settings (multiple settings are separated by a `;`).
- `reset`: same as clear but also resets the generation configs to defaults if they have been changed by `set`
- `save` or `save {SAVE_NAME}`: save the current chat and settings to file by default to `./chat_history/{MODEL_NAME}/chat_{DATETIME}.yaml` or `{SAVE_NAME}` if provided
- `exit`: closes the interface
The default examples are defined in `examples/scripts/config/default_chat_config.yaml`, but you can pass your own with `--config CONFIG_FILE`, where you can also specify the default generation parameters.
## Getting the system information
You can get the system information by running the following command:
```bash
trl env
```
This will print out the system information, including the GPU information, the CUDA version, the PyTorch version, the Transformers version, the TRL version, and any optional dependencies that are installed.
```txt
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.1
- CUDA device: NVIDIA H100 80GB HBM3
- Transformers version: 4.45.0.dev0
- Accelerate version: 0.34.2
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- Datasets version: 3.0.0
- HF Hub version: 0.24.7
- TRL version: 0.12.0.dev0+acb4d70
- bitsandbytes version: 0.41.1
- DeepSpeed version: 0.15.1
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.12.0
```
This information is required when reporting an issue.

View File

@ -1,100 +1,67 @@
# CPO Trainer
Contrastive Preference Optimization (CPO) as introduced in the paper [Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://huggingface.co/papers/2401.08417) by Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. At a high-level, CPO trains models to
avoid generating adequate, but not perfect translations in Machine Translation (MT) tasks. However, CPO is a general approximation to the DPO loss and can be applied to other domains like chat.
[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)
## Overview
Contrastive Preference Optimization (CPO) as introduced in the paper [Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://huggingface.co/papers/2401.08417) by [Haoran Xu](https://huggingface.co/haoranxu), [Amr Sharaf](https://huggingface.co/amrsharaf), [Yunmo Chen](https://huggingface.co/yunmochen), Weiting Tan, Lingfeng Shen, Benjamin Van Durme, [Kenton Murray](https://huggingface.co/Kenton), and [Young Jin Kim](https://huggingface.co/ykim362). At a high-level, CPO trains models to avoid generating adequate, but not perfect translations in Machine Translation (MT) tasks. However, CPO is a general approximation to the DPO loss and can be applied to other domains like chat.
CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT's methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. Second, SFT lacks a mechanism to prevent the model from rejecting mistakes in translations. The CPO objective is derived from the DPO objective.
## SimPO
The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the `CPOTrainer`. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, we can use SimPO easily by turning on `loss_type="simpo"` and `cpo_alpha=0` in the `CPOConfig`.
## Quick start
## CPO-SimPO
We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO Github](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the CPOConfig.
This example demonstrates how to train a model using the CPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
## Expected dataset format
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
The CPO trainer expects a format identical to the DPO trainer, which should include three entries. These entries should be named as follows:
Below is the script to train the model:
- `prompt`
- `chosen`
- `rejected`
```python
# train_cpo.py
from datasets import load_dataset
from trl import CPOConfig, CPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
for example:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
```py
cpo_dataset_dict = {
"prompt": [
"hello",
"how are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
],
"chosen": [
"hi nice to meet you",
"I am fine",
"My name is Mary",
"My name is Mary",
"Python",
"Python",
"Java",
],
"rejected": [
"leave me alone",
"I am not fine",
"Whats it to you?",
"I dont have a name",
"Javascript",
"C++",
"C++",
],
}
```
where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays.
## Expected model format
The CPO trainer expects a model of `AutoModelForCausalLM`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
## Using the `CPOTrainer`
For a detailed example have a look at the `examples/scripts/cpo.py` script. At a high level we need to initialize the `CPOTrainer` with a `model` we wish to train. **Note that CPOTrainer eliminates the need to use the reference model, simplifying the optimization process.** The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above.
```py
cpo_config = CPOConfig(
beta=0.1,
)
cpo_trainer = CPOTrainer(
model,
args=cpo_config,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
```
After this one can then call:
```py
cpo_trainer.train()
training_args = CPOConfig(output_dir="Qwen2-0.5B-CPO", logging_steps=10)
trainer = CPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
## Loss functions
Execute the script using the following command:
Given the preference data, the `CPOTrainer` uses the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression.
```bash
accelerate launch train_cpo.py
```
The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. The `CPOTrainer` can be switched to this loss via the `loss_type="hinge"` argument and the `beta` in this case is the reciprocal of the margin.
## Expected dataset type
The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the CPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer. Note that the `beta` parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the `beta` the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike CPO which is summed only).
CPO requires a [preference dataset](dataset_formats#preference). The [`CPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
### For Mixture of Experts Models: Enabling the auxiliary loss
## Example script
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
We provide an example script to train a model using the CPO method. The script is available in [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py)
This option is enabled by setting `output_router_logits=True` in the model config (e.g. MixtralConfig).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
To test the CPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
## Logging
```bash
accelerate launch examples/scripts/cpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--logging_steps 25 \
--output_dir Qwen2-0.5B-CPO
```
## Logged metrics
While training and evaluating we record the following reward metrics:
@ -104,6 +71,34 @@ While training and evaluating we record the following reward metrics:
* `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
* `nll_loss`: the mean negative log likelihood loss of the policy model for the chosen responses
## CPO variants
### Simple Preference Optimization (SimPO)
The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, we can use SimPO easily by turning on `loss_type="simpo"` and `cpo_alpha=0` in the [`CPOConfig`].
### CPO-SimPO
We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO GitHub](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the [`CPOConfig`].
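As a quick illustration of the two settings described above, here is how they map onto [`CPOConfig`]; the `output_dir` values and the non-zero `cpo_alpha` are illustrative, and all other parameters are left at their defaults.
```python
from trl import CPOConfig

# SimPO: the SimPO loss without the behavior-cloning (BC) regularizer.
simpo_args = CPOConfig(output_dir="Qwen2-0.5B-SimPO", loss_type="simpo", cpo_alpha=0.0)

# CPO-SimPO: the SimPO loss combined with a non-zero BC regularizer.
cpo_simpo_args = CPOConfig(output_dir="Qwen2-0.5B-CPO-SimPO", loss_type="simpo", cpo_alpha=0.5)
```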
## Loss functions
The CPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`CPOConfig`]. The following loss functions are supported:
| `loss_type=` | Description |
| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the `beta` the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike DPO which is summed only). |
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
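A minimal sketch of how these two options can be passed when loading a MoE policy (the checkpoint name is illustrative; both arguments are standard `transformers` config fields for Mixtral-style models):
```python
from transformers import AutoModelForCausalLM

# Sketch: enable the router auxiliary loss on a MoE model before handing it to the trainer.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative MoE checkpoint
    output_router_logits=True,   # add the load-balancing loss to the model outputs
    router_aux_loss_coef=0.001,  # weight of the auxiliary loss in the total loss
)
```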
## CPOTrainer
[[autodoc]] CPOTrainer

View File

@ -1,6 +1,6 @@
# Training customization
TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples of how you can apply and test different techniques.
TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples of how you can apply and test different techniques. Note: Although these examples use the DPOTrainer, the customization applies to most (if not all) trainers.
## Train on multiple GPUs / nodes
@ -46,171 +46,118 @@ else:
Consult the 🤗 Accelerate [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more information about the DeepSpeed plugin.
## Use different optimizers
## Use different optimizers and schedulers
By default, the `DPOTrainer` creates a `torch.optim.AdamW` optimizer. You can create and define a different optimizer and pass it to `DPOTrainer` as follows:
By default, the `PPOTrainer` creates a `torch.optim.Adam` optimizer. You can create and define a different optimizer and pass it to `PPOTrainer`:
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
optimizer = optim.SGD(model.parameters(), lr=training_args.learning_rate)
# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
optimizers=(optimizer, None),
)
trainer.train()
```
For memory efficient fine-tuning, you can also pass `Adam8bit` optimizer from `bitsandbytes`:
### Add a learning rate scheduler
You can also play with your training by adding learning rate schedulers.
```python
import torch
import bitsandbytes as bnb
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
optimizer = optim.AdamW(model.parameters(), lr=training_args.learning_rate)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```
### Use LION optimizer
You can also use the [LION optimizer from Google](https://huggingface.co/papers/2302.06675). First, take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py) and copy it so that you can import the optimizer. Make sure to initialize the optimizer with only the trainable parameters, for more memory-efficient training:
```python
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=config.learning_rate)
...
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```
We advise you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-lion.png">
</div>
## Add a learning rate scheduler
You can also play with your training by adding learning rate schedulers!
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
optimizers=(optimizer, lr_scheduler),
)
trainer.train()
```
## Memory efficient fine-tuning by sharing layers
Another way to make fine-tuning more memory-efficient is to share layers between the reference model and the model you want to train.
```python
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import create_reference_model, DPOConfig, DPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ref_model = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:1%]")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
```
## Pass 8-bit reference models
<div>
Since `trl` supports all keyword arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.
Since `trl` supports all key word arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.
Read more about 8-bit model loading in `transformers` [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition).
</div>
Read more about 8-bit model loading in `transformers` [here](https://huggingface.co/docs/transformers/en/peft#load-in-8bit-or-4bit).
```python
# 0. imports
# pip install bitsandbytes
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
```
## Use the CUDA cache optimizer
When training large models, you should better handle the CUDA cache by iteratively clearing it. Do do so, simply pass `optimize_cuda_cache=True` to `PPOConfig`:
When training large models, you should better handle the CUDA cache by iteratively clearing it. To do so, simply pass `optimize_cuda_cache=True` to `DPOConfig`:
```python
config = PPOConfig(..., optimize_cuda_cache=True)
```
## Use score scaling/normalization/clipping
As suggested by [Secrets of RLHF in Large Language Models Part I: PPO](https://huggingface.co/papers/2307.04964), we support score (aka reward) scaling/normalization/clipping to improve training stability via `PPOConfig`:
```python
from trl import PPOConfig
ppo_config = {
    "use_score_scaling": True,
    "use_score_norm": True,
    "score_clip": 0.5,
}
config = PPOConfig(**ppo_config)
```
To run `ppo.py`, you can use the following command:
```
python examples/scripts/ppo.py --log_with wandb --use_score_scaling --use_score_norm --score_clip 0.5
training_args = DPOConfig(..., optimize_cuda_cache=True)
```

View File

@ -1,10 +1,11 @@
# Dataset formats
# Dataset formats and types
This guide provides an overview of the dataset formats supported by each trainer in TRL. Since conversational datasets are very common, we also provide a guide on how to use them, and how to convert them into a standard dataset format for TRL trainers.
This guide provides an overview of the dataset formats and types supported by each trainer in TRL.
## Overview of the dataset formats and types
The *format* of a dataset refers to how the data is structured, typically categorized as either *standard* or *conversational*. The *type* is associated with the specific task the dataset is designed for, such as *prompt-only* or *preference*. Each type is characterized by its columns, which vary according to the task, as shown in the table.
- The *format* of a dataset refers to how the data is structured, typically categorized as either *standard* or *conversational*.
- The *type* is associated with the specific task the dataset is designed for, such as *prompt-only* or *preference*. Each type is characterized by its columns, which vary according to the task, as shown in the table.
<table>
<tr>
@ -60,7 +61,7 @@ The *format* of a dataset refers to how the data is structured, typically catego
or, with implicit prompt:
<pre><code>{"chosen": [{"role": "user", "content": "What color is the sky?"},
{"role": "assistant", "content": "It is blue."}],
"rejected": [{"role": "user", "content": "What color is the sky?"},
"rejected": [{"role": "user", "content": "What color is the sky?"},
{"role": "assistant", "content": "It is green."}]}</code></pre>
</td>
</tr>
@ -78,8 +79,9 @@ The *format* of a dataset refers to how the data is structured, typically catego
</tr>
</table>
### Formats
### Standard dataset format
#### Standard
The standard dataset format typically consists of plain text strings. The columns in the dataset vary depending on the task. This is the format expected by TRL trainers. Below are examples of standard dataset formats for different tasks:
@ -90,7 +92,7 @@ example = {"text": "The sky is blue."}
example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```
### Conversational dataset format
#### Conversational
Conversational datasets are used for tasks involving dialogues or chat interactions between users and assistants. Unlike standard dataset formats, these contain sequences of messages where each message has a `role` (e.g., `"user"` or `"assistant"`) and `content` (the message text).
@ -119,7 +121,9 @@ example = {
Conversational datasets are useful for training chat models, but must be converted into a standard format before being used with TRL trainers. This is typically done using chat templates specific to the model being used. For more information, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section.
### Language modeling
### Types
#### Language modeling
A language modeling dataset consists of a column `"text"` (or `"messages"` for conversational datasets) containing a full sequence of text.
@ -127,7 +131,7 @@ A language modeling dataset consists of a column `"text"` (or `"messages"` for c
language_modeling_example = {"text": "The sky is blue."}
```
### Prompt-only
#### Prompt-only
In a prompt-only dataset, only the initial prompt (the question or partial sentence) is provided under the key `"prompt"`. The training typically involves generating the completion based on this prompt, where the model learns to continue or complete the given input.
@ -137,7 +141,7 @@ prompt_only_example = {"prompt": "The sky is"}
<Tip>
While both the prompt-only and language modeling formats are similar, they differ in how the input is handled. In the prompt-only format, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling format, the input is treated as a complete sentence or sequence. These two formats are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each format:
While both the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type:
```python
from transformers import AutoTokenizer
@ -145,12 +149,12 @@ from trl import apply_chat_template
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
# Example for prompt-only format
# Example for prompt-only type
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(prompt_only_example, tokenizer)
# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}
# Example for language modeling format
# Example for language modeling type
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(lm_example, tokenizer)
# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}
@ -161,7 +165,7 @@ apply_chat_template(lm_example, tokenizer)
</Tip>
### Prompt-completion
#### Prompt-completion
A prompt-completion dataset includes a `"prompt"` and a `"completion"`.
@ -169,18 +173,21 @@ A prompt-completion dataset includes a `"prompt"` and a `"completion"`.
prompt_completion_example = {"prompt": "The sky is", "completion": " blue."}
```
### Preference
#### Preference
A preference dataset is used for tasks where the model is trained to choose between two or more possible completions to the same prompt. This dataset includes a `"prompt"`, a `"chosen"` completion, and a `"rejected"` completion. The model is trained to select the `"chosen"` response over the `"rejected"` response.
Some datasets may not include the `"prompt"` column, in which case the prompt is implicit and directly included in the `"chosen"` and `"rejected"` completions. We recommend using explicit prompts whenever possible.
```python
# explicit prompt
preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."} # recommended
# or,
# implicit prompt
preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```
### Unpaired preference
Some preference datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots' DPO Collections](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) to identify preference datasets.
#### Unpaired preference
An unpaired preference dataset is similar to a preference dataset but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not.
@ -188,24 +195,25 @@ An unpaired preference dataset is similar to a preference dataset but instead of
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```
## Which dataset format to use?
## Which dataset type to use?
Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.
Choosing the right dataset type depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset types supported by each TRL trainer.
| Trainer | Expected dataset format |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |
| Trainer | Expected dataset type |
| ----------------------- | ------------------------------------------------------------------------------------------------------ |
| [`BCOTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`GKDTrainer`] | [Prompt-completion](#prompt-completion) |
| [`IterativeSFTTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
| [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
| [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
| [`ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`PPOTrainer`] | Tokenized language modeling |
| [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
| [`SFTTrainer`] | [Language modeling](#language-modeling) |
| [`XPOTrainer`] | [Prompt-only](#prompt-only) |
<Tip>
@ -216,7 +224,7 @@ For more information on how to work with conversational datasets, refer to the [
## Working with conversational datasets in TRL
Conversational datasets are increasingly common, especially for training chat models. However, TRL trainers (except [`SFTTrainer`]) don't support conversational datasets in their raw format. These datasets must first be converted into a standard format.
Conversational datasets are increasingly common, especially for training chat models. However, TRL trainers (except [`SFTTrainer`]) don't support conversational datasets in their raw format. These datasets must first be converted into a standard format.
Fortunately, TRL offers tools to easily handle this conversion, which are detailed below.
### Converting a conversational dataset into a standard dataset
@ -266,7 +274,8 @@ dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
<Tip warning={true}>
We recommend using the [`apply_chat_template`] function rather than directly calling `tokenizer.apply_chat_template`. Handling chat templates nonlanguage modeling datasets can be tricky and may lead to issues, such as inserting a system prompt in the middle of a conversation. For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] is designed to handle these intricacies and ensure the correct application of chat templates for various tasks.
We recommend using the [`apply_chat_template`] function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation.
For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] function is designed to handle these intricacies and ensure the correct application of chat templates for various tasks.
</Tip>
@ -304,7 +313,7 @@ Lets take the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb
As shown above, the dataset format does not match the expected structure. It's not in a conversational format, the column names differ, and the results pertain to different models (e.g., Bard, GPT-4) and aspects (e.g., "helpfulness", "honesty").
By using the provided conversion script [`examples/datasets/ultrafeedback.py`](https://github.com/huggingface/trl/tree/main/examples/datasets/ultrafeedback.py), you can transform this dataset into an unpaired preference format, and push it to the Hub:
By using the provided conversion script [`examples/datasets/ultrafeedback.py`](https://github.com/huggingface/trl/tree/main/examples/datasets/ultrafeedback.py), you can transform this dataset into an unpaired preference type, and push it to the Hub:
```sh
python examples/datasets/ultrafeedback.py --push_to_hub --repo_id trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness
@ -580,7 +589,7 @@ dataset = dataset.remove_columns(["chosen", "rejected"])
### From explicit to implicit prompt preference dataset
To convert a preference dataset with implicit prompt into a preference dataset with explicit prompt, concatenate the prompt to both chosen and rejected, and remove the prompt.
To convert a preference dataset with explicit prompt into a preference dataset with implicit prompt, concatenate the prompt to both chosen and rejected, and remove the prompt.
```python
from datasets import Dataset
@ -710,3 +719,35 @@ dataset = dataset.remove_columns(["completion", "label"])
>>> dataset[0]
{'prompt': 'The sky is'}
```
## Vision datasets
Some trainers also support fine-tuning vision-language models (VLMs) using image-text pairs. In this scenario, it's recommended to use a conversational format, as each model handles image placeholders in text differently.
A conversational vision dataset differs from a standard conversational dataset in two key ways:
1. The dataset must contain the key `images` with the image data.
2. The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`.
Example:
```python
# Textual dataset:
"content": "What color is the sky?"
# Vision dataset:
"content": [
{"type": "image"},
{"type": "text", "text": "What color is the sky in the image?"}
]
```
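Putting the two pieces together, a single conversational vision example might look like the sketch below (the image path and the use of a `"messages"` column are illustrative; the exact columns depend on the dataset type):
```python
from PIL import Image

# Sketch of one conversational vision example: an "images" column plus messages whose
# "content" is a list of typed parts, as described above.
example = {
    "images": [Image.open("sky.jpg")],  # illustrative image path
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What color is the sky in the image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "It is blue."}],
        },
    ],
}
```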
An example of a conversational vision dataset is the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Below is an embedded view of the dataset's training data, allowing you to explore it directly:
<iframe
src="https://huggingface.co/datasets/trl-lib/rlaif-v/embed/viewer/default/train"
frameborder="0"
width="100%"
height="560px"
></iframe>

View File

@ -1,4 +1,7 @@
# Denoising Diffusion Policy Optimization
[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo,trl)
## The why
| Before | After DDPO finetuning |

View File

@ -98,19 +98,15 @@ model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=
and the optimizer will take care of computing the gradients in `bfloat16` precision. Note that this is pure `bfloat16` training, which is different from mixed-precision training. If you want to train a model in mixed precision, do not load the model with `torch_dtype`; instead, specify the mixed precision argument when running `accelerate config`.
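For reference, a pure `bfloat16` load looks like the following sketch (the checkpoint name follows the surrounding example):
```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: load the whole model in bfloat16; gradients are then computed in bfloat16 as well.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.bfloat16)
```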
- Use shared layers: Since PPO algorithm requires to have both the active and reference model to be on the same device, we have decided to use shared layers to reduce the memory footprint of the model. This can be achieved by just speifying `num_shared_layers` argument when creating a `PPOTrainer`:
- Use shared layers: Since the PPO algorithm requires both the active and reference models to be on the same device, we have decided to use shared layers to reduce the memory footprint of the model. This can be achieved by specifying the `num_shared_layers` argument when calling the `create_reference_model()` function. For example, if you want to share the first 6 layers of the model, you can do it like this:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-shared-layers.png">
</div>
```python
ppo_trainer = PPOTrainer(
model=model,
tokenizer=tokenizer,
num_shared_layers=4,
...
)
ref_policy = create_reference_model(model, num_shared_layers=6)
trainer = PPOTrainer(..., ref_policy=ref_policy)
```
In the example above, this means that the shared layers of the model are frozen, since these layers are common to the active model and the reference model.

View File

@ -1,166 +1,131 @@
# DPO Trainer
TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by Rafailov et al., 2023. For a full example have a look at [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py).
[![](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo,trl)
The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.
## Overview
## How DPO works
TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by [Rafael Rafailov](https://huggingface.co/rmrafailov), Archit Sharma, Eric Mitchell, [Stefano Ermon](https://huggingface.co/ermonste), [Christopher D. Manning](https://huggingface.co/manning), [Chelsea Finn](https://huggingface.co/cbfinn).
Fine-tuning a language model via DPO consists of two steps and is easier than PPO:
The abstract from the paper is the following:
1. **Data collection**: Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
> While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
The first step is to train an SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.
Then, fine-tuning a language model via DPO consists of two steps and is easier than [PPO](ppo_trainer):
1. **Data collection**: Gather a [preference dataset](dataset_formats#preference) with positive and negative selected pairs of generation, given a prompt.
2. **Optimization**: Maximize the log-likelihood of the DPO loss directly.
DPO-compatible datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots/direct-preference-optimization-datasets](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) Collection to identify datasets that are likely to support DPO training.
This process is illustrated in the sketch below (from [Figure 1 of the DPO paper](https://huggingface.co/papers/2305.18290)):
This process is illustrated in the sketch below (from [figure 1 of the original paper](https://huggingface.co/papers/2305.18290)):
<img width="835" alt="Screenshot 2024-03-19 at 12 39 41" src="https://github.com/huggingface/trl/assets/49240599/9150fac6-3d88-4ca2-8ec6-2a6f3473216d">
![](https://github.com/huggingface/trl/assets/49240599/9150fac6-3d88-4ca2-8ec6-2a6f3473216d)
Read more about DPO algorithm in the [original paper](https://huggingface.co/papers/2305.18290).
## Quick start
## Expected dataset format
This example demonstrates how to train a model using the DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
The DPO trainer expects a very specific format for the dataset, since the model will be trained to directly optimize the preference for which sentence is the most relevant, given two sentences. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png", width="50%">
</div>
Below is the script to train the model:
Therefore the final dataset object should contain these 3 entries if you use the default [`DPODataCollatorWithPadding`] data collator. The entries should be named:
```python
# train_dpo.py
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
- `prompt`
- `chosen`
- `rejected`
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
for example:
```py
dpo_dataset_dict = {
"prompt": [
"hello",
"how are you",
"What is your name?",
"What is your name?",
"Which is the best programming language?",
"Which is the best programming language?",
"Which is the best programming language?",
],
"chosen": [
"hi nice to meet you",
"I am fine",
"My name is Mary",
"My name is Mary",
"Python",
"Python",
"Java",
],
"rejected": [
"leave me alone",
"I am not fine",
"Whats it to you?",
"I dont have a name",
"Javascript",
"C++",
"C++",
],
}
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", logging_steps=10)
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. As can be seen, a prompt can have multiple responses, and this is reflected in the entries being repeated in the dictionary's value arrays.
Execute the script using the following command:
[`DPOTrainer`] can be used to fine-tune visual language models (VLMs). In this case, the dataset must also contain the key `images`, and the trainer's `tokenizer` is the VLM's `processor`. For example, for Idefics2, the processor expects the dataset to have the following format:
Note: Currently, VLM support is exclusive to Idefics2 and does not extend to other VLMs.
```py
dpo_dataset_dict = {
'images': [
[Image.open('beach.jpg')],
[Image.open('street.jpg')],
],
'prompt': [
'The image <image> shows',
'<image> The image depicts',
],
'chosen': [
'a sunny beach with palm trees.',
'a busy street with several cars and buildings.',
],
'rejected': [
'a snowy mountain with skiers.',
'a calm countryside with green fields.',
],
}
```bash
accelerate launch train_dpo.py
```
## Expected model format
Distributed across 8 GPUs, the training takes approximately 3 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
The DPO trainer expects a model of `AutoModelForCausalLM` or `AutoModelForVision2Seq`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/dpo-qwen2-reward-margin.png)
## Using the `DPOTrainer`
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-DPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
For a detailed example, have a look at the `examples/scripts/dpo.py` script. At a high level, we need to initialize the [`DPOTrainer`] with a `model` we wish to train and a reference `ref_model`, which we will use to calculate the implicit rewards of the preferred and rejected responses; the `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e. decoder-only or encoder-decoder).
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-DPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
```py
training_args = DPOConfig(
beta=0.1,
)
dpo_trainer = DPOTrainer(
model,
ref_model,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer, # for visual language models, use tokenizer=processor instead
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-DPO&gt;:</span></strong>
The best programming language for specific applications can vary depending on the use case and knowledge level of the programmer. Here are some general factors that can be used as input to choose the best programming language:
<strong><span style="color: green;">1</span></strong> Ease of use: Some programming languages are more user-friendly than others, such as Python, Java, or Ruby. Python is popular due to its simplicity and great scalability.
<strong><span style="color: green;">2</span></strong> Versatility: The ability to work with a wide range of data structures and frameworks can define the language as versatile.
<strong><span style="color: green;">3</span></strong> Ease of learning: Different programming languages have different learning curves, so users must be willing to take some time to master one.
<strong><span style="color: green;">4</span></strong> Community support: The broader community of developers and enthusiasts in the selected programming language can provide great support and resources.
<strong><span style="color: green;">5</span></strong> Reusability: Languages that emphasize code reuse and can be easily modifiable can be more suitable for software development.
The best programming language based on these factors is subjective and depends on what the programmer intends to accomplish.
</code></pre>
## Expected dataset type
DPO requires a [preference dataset](dataset_formats#preference). The [`DPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
Although the [`DPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
### Special considerations for vision-language models
The [`DPOTrainer`] supports fine-tuning vision-language models (VLMs). For these models, a vision dataset is required. To learn more about the specific format for vision datasets, refer to the [Vision dataset format](dataset_formats#vision-datasets) section.
Additionally, unlike standard text-based models where a `tokenizer` is used, for VLMs, you should replace the `tokenizer` with a `processor`.
```diff
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = AutoModelForVision2Seq.from_pretrained(model_id)
- tokenizer = AutoTokenizer.from_pretrained(model_id)
+ processor = AutoProcessor.from_pretrained(model_id)
trainer = DPOTrainer(
model,
args=training_args,
train_dataset=train_dataset,
- processing_class=tokenizer,
+ processing_class=processor,
)
```
After this one can then call:
For a complete example of fine-tuning a vision-language model, refer to the script in [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py).
```py
dpo_trainer.train()
## Example script
We provide an example script to train a model using the DPO method. The script is available in [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py)
To test the DPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
```bash
accelerate launch examples/scripts/dpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--logging_steps 25 \
--output_dir Qwen2-0.5B-DPO
```
Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. As `beta` -> 0, the reference model is ignored.
## Loss functions
Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. To use this loss, set the `loss_type="sigmoid"` (default) in the [`DPOConfig`].
The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. To use this loss, set the `loss_type="hinge"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the margin.
The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. To use this loss, set `loss_type="ipo"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta` the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike DPO, which only sums).
The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0).
The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO, which corresponds to the forward KL. To use this loss, set `loss_type="exo_pair"` in the [`DPOConfig`]. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large.
The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. To use this loss, set `loss_type="nca_pair"` in the [`DPOConfig`].
The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) and set the `loss_type="robust"` in the [`DPOConfig`].
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. To use this loss, set the `loss_type="bco_pair"` in the [`DPOConfig`].
The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model=True` in the [`DPOConfig`].
The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to 1.0.
The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. To use this loss, set the `loss_type="sppo_hard"` in the [`DPOConfig`].
The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size.
The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the [`DPOConfig`].
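In configuration terms, each of these variants is selected through `loss_type` (and, where relevant, `beta` or `label_smoothing`) in the [`DPOConfig`]. A minimal sketch, using IPO with an illustrative `beta`:
```python
from trl import DPOConfig

# Sketch: choose one of the loss variants described above via loss_type.
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", loss_type="ipo", beta=0.1)
```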
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. MixtralConfig).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
## Logging
## Logged metrics
While training and evaluating we record the following reward metrics:
@ -169,59 +134,75 @@ While training and evaluating we record the following reward metrics:
- `rewards/accuracies`: mean of how often the chosen rewards are > than the corresponding rejected rewards
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
## Loss functions
The DPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`DPOConfig`]. The following loss functions are supported:
| `loss_type=` | Description |
| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta` the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike DPO, which only sums). |
| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO which corresponds to forward KL. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large. |
| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. |
| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) |
| `"bco_pair"` | The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. For unpaired data, we recommend the dedicated [`BCOTrainer`]. |
| `"sppo_hard"` | The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. |
| `"aot"` or `loss_type="aot_pair"` | The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size. |
| `"apo_zero"` or `loss_type="apo_down"` | The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. |
### Label smoothing
The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0).
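A minimal sketch, with an illustrative `label_smoothing` value:
```python
from trl import DPOConfig

# Sketch of the cDPO-style conservative loss: non-zero label_smoothing models the assumed
# probability that preference labels are flipped.
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", label_smoothing=0.1)
```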
### Syncing the reference model
The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model=True` in the [`DPOConfig`].
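A minimal sketch (the mixup weight and step count are illustrative):
```python
from trl import DPOConfig

# Sketch of TR-DPO-style reference model syncing during training.
training_args = DPOConfig(
    output_dir="Qwen2-0.5B-DPO",
    sync_ref_model=True,
    ref_model_mixup_alpha=0.6,  # illustrative weight for mixing policy and reference weights
    ref_model_sync_steps=512,   # illustrative number of steps between syncs
)
```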
### RPO loss
The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to `1.0`.
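A minimal sketch using the weight suggested in the paper:
```python
from trl import DPOConfig

# Sketch: add the RPO-style weighted SFT term on the chosen responses.
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", rpo_alpha=1.0)
```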
### WPO loss
The [WPO](https://huggingface.co/papers/2406.11827) paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the `use_weighting` flag to `True` in the [`DPOConfig`].
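A minimal sketch:
```python
from trl import DPOConfig

# Sketch: reweight preference pairs by their probability under the current policy (WPO).
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", use_weighting=True)
```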
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
## Accelerate DPO fine-tuning using `unsloth`
You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the [`unsloth`](https://github.com/unslothai/unsloth) library that is fully compatible with `SFTTrainer`. Currently `unsloth` supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO listed below:
| GPU | Model | Dataset | 🤗 | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM saved |
| -------- | --------- | ---------- | --- | ---------------------- | ---------- | ------------- |
| A100 40G | Zephyr 7b | Ultra Chat | 1x | 1.24x | **1.88x** | -11.6% |
| Tesla T4 | Zephyr 7b | Ultra Chat | 1x | 1.09x | **1.55x** | -18.6% |
First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLanguageModel` as follows:
```diff
  from datasets import load_dataset
  from trl import DPOConfig, DPOTrainer
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from unsloth import FastLanguageModel

- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
+ model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
+ model = FastLanguageModel.get_peft_model(model)
  train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

- training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", logging_steps=10)
+ training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", logging_steps=10, bf16=True)
  trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
  trainer.train()
```

A more complete example, which loads the model in 4-bit precision and patches it with fast LoRA weights, looks like this:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from unsloth import FastLanguageModel

max_seq_length = 2048  # Supports automatic RoPE Scaling, so choose any number.

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/zephyr-sft",
    max_seq_length=max_seq_length,
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False.
    # token="hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Dropout = 0 is currently optimized
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)

train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="./output",
    beta=0.1,
)

dpo_trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
dpo_trainer.train()
```
The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).
@ -295,3 +276,7 @@ dpo_trainer = DPOTrainer(
## DPOConfig
[[autodoc]] DPOConfig
## PreferenceCollator
[[autodoc]] trainer.dpo_trainer.PreferenceCollator

View File

@ -33,22 +33,22 @@ Then, it is encouraged to launch jobs with `accelerate launch`!
| File | Description |
| ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`examples/scripts/alignprop.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/alignprop.py) | This script shows how to use the [`AlignPropTrainer`] to fine-tune a diffusion model. |
| [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
| [`examples/scripts/chat.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/chat.py) | This script allows you to load and use a model as a chatbot. |
| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ddpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ddpo.py) | This script shows how to use the [`DDPOTrainer`] to fine-tune a stable diffusion model using reinforcement learning. |
| [`examples/scripts/dpo_visual.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_visual.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
| [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ppo_multi_adapter.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo_multi_adapter.py) | This script shows how to use the [`PPOTrainer`] to train a single base model with multiple adapters. Requires you to run the example script with the reward model training beforehand. |
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a sentiment analysis model using [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb). |
| [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the [`RewardTrainer`] to train a reward model on your own dataset. |
| [`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a model or adapters into a target dataset. |
| [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested on a [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model so users may see unexpected behaviour in other model architectures. |
| File | Description |
| ----------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`examples/scripts/alignprop.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/alignprop.py) | This script shows how to use the [`AlignPropTrainer`] to fine-tune a diffusion model. |
| [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
| [`examples/scripts/chat.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/chat.py) | This script allows you to load and use a model as a chatbot. |
| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ddpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ddpo.py) | This script shows how to use the [`DDPOTrainer`] to fine-tune a stable diffusion model using reinforcement learning. |
| [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
| [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language |
| [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
| [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the [`RewardTrainer`] to train a reward model on your own dataset. |
| [`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a model or adapters into a target dataset. |
| [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested with [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf), [LLaVA 1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf), and [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) models so users may see unexpected behaviour in other model architectures. |
Here are also some easier-to-run colab notebooks that you can use to get started with TRL:

View File

@ -1,5 +1,7 @@
# Generalized Knowledge Distillation Trainer
[![](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd,trl)
## Overview
Generalized Knowledge Distillation (GKD) was proposed in [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://huggingface.co/papers/2306.13649) by Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem.
@ -17,8 +19,9 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.
## Usage tips
The [`GKDTrainer`] is a wrapper around the [`SFTTrainer`] class that takes in a teacher model argument. It needs three parameters to be set via the [`GKDConfig`], namely:
* `lmbda`: controls the student data fraction, i.e., the proportion of on-policy student-generated outputs. When `lmbda=0.0`, the loss reduces to supervised JSD where the student is trained with the token-level probabilities of the teacher. When `lmbda=1.0`, the loss reduces to on-policy JSD, where the student generates output sequences and receives token-specific feedback on these sequences from the teacher. For values in between, the choice is made at random for each batch, with probability `lmbda` of using student-generated data.
* `seq_kd`: controls whether to perform Sequence-Level KD (can be viewed as supervised FT on teacher-generated output). When `seq_kd=True` and `lmbda=0.0`, the loss reduces to supervised JSD, where the teacher generates output sequences and the student receives token-specific feedback on these sequences from the teacher.
* `beta`: controls the interpolation in the generalized Jensen-Shannon Divergence. When `beta=0.0` the loss approximates forward KL divergence, while for `beta=1.0` the loss approximates reverse KL divergence. For values in between [0, 1] it interpolates between the two.
The authors find that on-policy data (high `lmbda`) performs better and the optimal `beta` varied depending on the task and evaluation method.
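For example, a configuration that mixes mostly on-policy student data with a generalized JSD halfway between forward and reverse KL might look like this (the values are illustrative):

```python
from trl import GKDConfig

training_args = GKDConfig(
    output_dir="gkd-model",
    lmbda=0.7,     # 70% of batches use on-policy, student-generated outputs
    beta=0.5,      # interpolate between forward (0.0) and reverse (1.0) KL
    seq_kd=False,  # disable sequence-level KD on teacher-generated outputs
)
```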
@ -67,19 +70,19 @@ eval_dataset = Dataset.from_dict(
}
)
training_args = GKDConfig(output_dir="gkd-model", per_device_train_batch_size=1)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
### Expected dataset format
### Expected dataset type
The dataset should be formatted as a list of "messages" where each message is a list of dictionaries with the following keys:
* `role`: either `system`, `assistant` or `user`

View File

@ -1,5 +1,8 @@
# Iterative Trainer
[![](https://img.shields.io/badge/All_models-Iterative_SFT-blue)](https://huggingface.co/models?other=iterative-sft,trl)
Iterative fine-tuning is a training method that enables performing custom actions (such as generation and filtering) between optimization steps. In TRL we provide an easy-to-use API to fine-tune your models iteratively in just a few lines of code.
## Usage

View File

@ -1,5 +1,11 @@
# Judges
<Tip warning={true}>
TRL Judges is an experimental API which is subject to change at any time.
</Tip>
TRL provides judges to easily compare two completions.
Make sure to have installed the required dependencies by running:

View File

@ -1,102 +1,134 @@
# KTO Trainer
TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the [paper](https://huggingface.co/papers/2402.01306) by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela.
For a full example have a look at [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py).
[![](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto,trl)
Depending on how good your base model is, you may or may not need to do SFT before KTO.
This is different from standard RLHF and DPO, which always require SFT.
## Overview
Kahneman-Tversky Optimization (KTO) was introduced in [KTO: Model Alignment as Prospect Theoretic Optimization](https://huggingface.co/papers/2402.01306) by [Kawin Ethayarajh](https://huggingface.co/kawine), [Winnie Xu](https://huggingface.co/xwinxu), [Niklas Muennighoff](https://huggingface.co/Muennighoff), Dan Jurafsky, [Douwe Kiela](https://huggingface.co/douwekiela).
The abstract from the paper is the following:
> Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them being human-aware loss functions (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO), and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B. Crucially, KTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.
The official code can be found in [ContextualAI/HALOs](https://github.com/ContextualAI/HALOs).
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Younes Belkada](https://huggingface.co/ybelkada), [Lewis Tunstall](https://huggingface.co/lewtun) and Pablo Vicente.
## Quick start
This example demonstrates how to train a model using the KTO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [KTO Mix 14k](https://huggingface.co/datasets/trl-lib/kto-mix-14k). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/kto-mix-14k/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_kto.py
from datasets import load_dataset
from trl import KTOConfig, KTOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")
training_args = KTOConfig(output_dir="Qwen2-0.5B-KTO", logging_steps=10)
trainer = KTOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_kto.py
```
Distributed across 8 x H100 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/kto-qwen2-reward-margin.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-KTO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-KTO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-KTO&gt;:</span></strong>
The best programming language can vary depending on individual preferences, industry-specific requirements, technical skills, and familiarity with the specific use case or task. Here are some widely-used programming languages that have been noted as popular and widely used:
Here are some other factors to consider when choosing a programming language for a project:
<strong><span style="color: green;">1</span> JavaScript</strong>: JavaScript is at the heart of the web and can be used for building web applications, APIs, and interactive front-end applications like frameworks like React and Angular. It's similar to C, C++, and F# in syntax structure and is accessible and easy to learn, making it a popular choice for beginners and professionals alike.
<strong><span style="color: green;">2</span> Java</strong>: Known for its object-oriented programming (OOP) and support for Java 8 and .NET, Java is used for developing enterprise-level software applications, high-performance games, as well as mobile apps, game development, and desktop applications.
<strong><span style="color: green;">3</span> C++</strong>: Known for its flexibility and scalability, C++ offers comprehensive object-oriented programming and is a popular choice for high-performance computing and other technical fields. It's a powerful platform for building real-world applications and games at scale.
<strong><span style="color: green;">4</span> Python</strong>: Developed by Guido van Rossum in 1991, Python is a high-level, interpreted, and dynamically typed language known for its simplicity, readability, and versatility.
</code></pre>
## Expected dataset format
The KTO trainer expects a very specific format for the dataset as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, model completion, and a label to indicate whether the completion is "good" or "bad", we expect a dataset with the following columns:
KTO requires an [unpaired preference dataset](dataset_formats#unpaired-preference). Alternatively, you can provide a *paired* preference dataset (also known simply as a *preference dataset*). In this case, the trainer will automatically convert it to an unpaired format by separating the chosen and rejected responses, assigning `label = True` to the chosen completions and `label = False` to the rejected ones.
- `prompt`
- `completion`
- `label`
The [`KTOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
for example:
In theory, the dataset should contain at least one chosen and one rejected completion. However, some users have successfully run KTO using *only* chosen or only rejected data. If using only rejected data, it is advisable to adopt a conservative learning rate.
```python
kto_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```
## Example script
We provide an example script to train a model using the KTO method. The script is available in [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py)
To test the KTO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/kto-mix-14k), run the following command:
```bash
accelerate launch examples/scripts/kto.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/kto-mix-14k \
--num_train_epochs 1 \
--logging_steps 25 \
--output_dir Qwen2-0.5B-KTO
```
where the `prompt` contains the context inputs, `completion` contains the corresponding responses and `label` contains the corresponding flag that indicates if the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays. It is required that the dataset contains at least one desirable and one undesirable completion.
## Expected model format
The KTO trainer expects a model of type `AutoModelForCausalLM`, compared to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.
## Using the `KTOTrainer`
For a detailed example have a look at the `examples/scripts/kto.py` script. At a high level we need to initialize the `KTOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response.
The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).
The `desirable_weight` and `undesirable_weight` refer to the weights placed on the losses for desirable/positive and undesirable/negative examples.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.
<Tip>
It is strongly recommended you use a learning rate between `5e-7` and `5e-6` with an effective batch size between `8` and `32`, for both LoRA and full finetuning. Even if you are working with a small dataset, we do not recommend using a learning rate outside this range; instead, using smaller batch sizes and/or more training epochs will give you better results.
</Tip>
```py
training_args = KTOConfig(
beta=0.1,
desirable_weight=1.0,
undesirable_weight=1.0,
learning_rate=5e-7,
)
kto_trainer = KTOTrainer(
model,
ref_model,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
```
After this one can then call:
```py
kto_trainer.train()
```
## Usage tips
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. MixtralConfig).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
### Batch size recommendations
Use a per-step batch size that is at least 4, and an effective batch size between 16 and 128. Even if your effective batch size is large, if your per-step batch size is small, the KL estimate in KTO will be noisy.
### Learning rate recommendations
Each choice of `beta` has a maximum learning rate it can tolerate before learning performance degrades. For the default setting of `beta = 0.1`, the learning rate should typically not exceed `1e-6` for most models. As `beta` decreases, the learning rate should also be reduced accordingly. In general, we strongly recommend keeping the learning rate between `5e-7` and `5e-6`. Even with small datasets, we advise against using a learning rate outside this range. Instead, opt for more epochs to achieve better results.
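Putting the batch size and learning rate recommendations together, a configuration inside the recommended ranges might look like this (assuming 8 GPUs, so the effective batch size is 32; all values are illustrative):

```python
from trl import KTOConfig

training_args = KTOConfig(
    output_dir="Qwen2-0.5B-KTO",
    per_device_train_batch_size=4,  # per-step batch size of at least 4
    gradient_accumulation_steps=1,  # 8 GPUs x 4 x 1 = effective batch size of 32
    learning_rate=5e-7,             # within the recommended 5e-7 to 5e-6 range
)
```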
### Imbalanced data
The `desirable_weight` and `undesirable_weight` of the [`KTOConfig`] refer to the weights placed on the losses for desirable/positive and undesirable/negative examples.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.
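For example, with a hypothetical dataset of 10,000 desirable and 3,000 undesirable examples, keeping `desirable_weight=1.0` and setting `undesirable_weight=3.0` gives a weighted ratio of 10,000 / (3.0 × 3,000) ≈ 1.11, which falls in the recommended 1:1 to 4:3 range:

```python
from trl import KTOConfig

# Counts are hypothetical: 10,000 desirable vs 3,000 undesirable examples.
training_args = KTOConfig(
    output_dir="Qwen2-0.5B-KTO",
    desirable_weight=1.0,
    undesirable_weight=3.0,  # 1.0 * 10_000 / (3.0 * 3_000) ≈ 1.11
)
```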
## Logged metrics
While training and evaluating we record the following reward metrics:
- `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
- `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
- `logps/chosen`: the mean log probabilities of the chosen completions
- `logps/rejected`: the mean log probabilities of the rejected completions
- `logits/chosen`: the mean logits of the chosen completions
- `logits/rejected`: the mean logits of the rejected completions
- `kl`: the KL divergence between the policy model and the reference model
## KTOTrainer

View File

@ -90,6 +90,7 @@ WANDB_TAGS="calculator_final" python benchmark/benchmark.py \
We can then use [`openrlbenchmark`](https://github.com/openrlbenchmark/openrlbenchmark) which generates the following plot.
```
# pip install openrlbenchmark==0.2.1a5
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=openrlbenchmark&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.tracker_project_name&cen=trl_ppo_trainer_config.value.log_with&metrics=env/reward_mean&metrics=objective/kl' \
'wandb?tag=calculator_final&cl=calculator_mask' \

View File

@ -1,15 +1,14 @@
# Logging
As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging.
By default, the TRL [`PPOTrainer`] saves a lot of relevant information to `wandb` or `tensorboard`.
Upon initialization, pass one of these two options to the [`PPOConfig`]:
```
training_args = PPOConfig(..., report_to="wandb")  # or "tensorboard"
```
If you want to log with tensorboard, also set the `logging_dir` argument of [`PPOConfig`] to the directory where the logs should be written.
## PPO Logging

View File

@ -1,5 +1,7 @@
# Nash-MD Trainer
[![](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md,trl)
## Overview
Nash-MD was proposed in the paper [Nash Learning from Human Feedback](https://huggingface.co/papers/2312.00886) by Rémi Munos, [Michal Valko](https://huggingface.co/misovalko), Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, and Andrea Michi.
@ -12,7 +14,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.
## Quick start
This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
@ -26,21 +28,17 @@ Below is the script to train the model:
```python
# train_nash_md.py
from datasets import load_dataset
from trl import NashMDConfig, NashMDTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD", logging_steps=10)
trainer = NashMDTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
@ -51,22 +49,54 @@ Execute the script using the following command:
accelerate launch train_nash_md.py
```
Distributed across 8 GPUs, the training takes approximately 3 hours.
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-NashMD) performs, you can use the [TRL Chat CLI](clis#chat-interface).
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-NashMD
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-NashMD&gt;:</span></strong>
The best programming language depends on personal preference, the complexity of the project, and the specific requirements of the task. Some programming languages that are often recommended include Python, Java, and JavaScript, and there are many other languages to choose from depending on individual needs.
</code></pre>
## Expected dataset type
Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`NashMDTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
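As a reference, a single prompt-only example looks roughly like this in the standard and conversational formats (the prompts are illustrative):

```python
# Standard format
example = {"prompt": "The sky is"}

# Conversational format (the chat template is applied automatically)
example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
```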
## Usage tips
### Use a reward model
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
trainer = NashMDTrainer(
...
- judge=judge,
+ reward_model=reward_model,
)
```
<Tip warning={true}>
Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training.
</Tip>
### Encourage EOS token generation
We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`NashMDConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`NashMDConfig`]:
```python
training_args = NashMDConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
### Logging Completions
@ -87,21 +117,17 @@ This callback logs the model's generated completions directly to Weights & Biase
We provide an example script to train a model using the Nash-MD method. The script is available in [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py)
To test the Nash-MD script with the [Pythia 14M model](https://huggingface.co/EleutherAI/pythia-14m) on the TL;DR summarization task, run the following command:
To test the Nash-MD script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/nash_md.py \
--model_name_or_path EleutherAI/pythia-14m \
--reward_model_path EleutherAI/pythia-14m \
--dataset_name trl-lib/tldr \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir pythia-14m-tldr-nash-md \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--num_train_epochs 3 \
--max_new_tokens 64 \
--logging_steps 25 \
--output_dir Qwen2.5-0.5B-NashMD-PairRM \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--push_to_hub
```
@ -114,6 +140,7 @@ The logged metrics are as follows:
* `loss/score`: The mean reinforce score loss.
* `rewards/chosen`: The mean scores (according to the reward model) of the model completions.
* `rewards/rejected`: The mean scores (according to the reward model) of the mixture completions.
* `rewards/probabilities`: The mean probability (according to the reward model or judge) that the model completions are preferred over the mixture completions.
* `rewards/accuracies`: The accuracies of the Nash-MD's implicit reward model.
* `rewards/margins`: The mean reward margin (according to reward model) between the chosen and mixture completions.
* `logps/chosen`: The mean log probabilities of the chosen completions.

View File

@ -1,5 +1,7 @@
# Online DPO Trainer
[![](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo,trl)
## Overview
Online DPO was proposed in [Direct Language Model Alignment from Online AI Feedback](https://huggingface.co/papers/2402.04792) by Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel.
@ -8,13 +10,11 @@ The abstract from the paper is the following:
> Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, thus the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation in several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable, via instruction prompts to the LLM annotator.
The current implementation uses reward models for scoring completions -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use.
This post-training method was contributed by [Michael Noukhovitch](https://huggingface.co/mnoukhov), [Shengyi Costa Huang](https://huggingface.co/vwxyzjn), [Quentin Gallouédec](https://huggingface.co/qgallouedec), and [Edward Beeching](https://huggingface.co/edbeeching).
## Quick start
This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
@ -28,21 +28,17 @@ Below is the script to train the model:
```python
# train_online_dpo.py
from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
@ -55,37 +51,51 @@ accelerate launch train_online_dpo.py
Distributed across 8 GPUs, the training takes approximately 1 hour. You can verify the training progress by checking the reward graph. An increasing trend in both the reward for rejected and chosen completions indicates that the model is improving and generating better responses over time.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/online-dpo-qwen2-reward.png)
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/online-dpo-qwen2.png)
To see how the trained model performs, use the following code to generate completions:
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-OnlineDPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
```python
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="online-dpo-qwen2/checkpoint-1773", device="cuda")
>>> question = "Why is the problem always DNS?"
>>> output = generator([{"role": "user", "content": question}], max_new_tokens=200, return_full_text=False)[0]
>>> print(output["generated_text"])
The reason why the problem of DNS (Domain Name System) can always be encountered is that it is designed to provide reliable and accurate information about the availability, ownership, or expiration of domain names. However, there may be some circumstances where the system fails to resolve an IP address correctly, leading to the problem of DNS.
For example, if the server hosting the domain name does not have the correct IP address associated with it, or if the IP address is incorrectly formatted, then the DNS system will fail to resolve the domain name correctly. Additionally, if the server hosting the domain name has been compromised, then the DNS system may also fail to resolve the domain name correctly.
It's worth noting that the exact cause of DNS failure can vary depending on the specific situation, so it's important to carefully check all relevant factors before attempting to resolve the issue. If you suspect that your DNS problem may be caused by a bug in the system, you should report it to the DNS provider directly for further investigation.
```
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-OnlineDPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-OnlineDPO&gt;:</span></strong>
The best programming language depends on your specific needs and priorities. Some people prefer imperative programming languages (like Haskell or Lisp), while others prefer functional programming languages (like Scala or Python). It's important to consider your work style, programming environment, and project requirements when choosing a programming language.
</code></pre>
## Expected dataset type
Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (unlike offline DPO, that expects [preference dataset](dataset_formats#preference)). The [`OnlineDPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
## Usage tips
### Use a reward model
Make sure that the SFT model and reward model use the _same_ chat template. Otherwise, you may find the model completions are scored incorrectly during training.
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
+ reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")
trainer = OnlineDPOTrainer(
...
- judge=judge,
+ reward_model=reward_model,
+ reward_processing_class=reward_tokenizer,
...
)
```
### Encourage EOS token generation
When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`OnlineDPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`OnlineDPOConfig`]:
```python
training_args = OnlineDPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
### Logging Completions
@ -107,33 +117,29 @@ This callback logs the model's generated completions directly to Weights & Biase
We provide an example script to train a model using the online DPO method. The script is available in [`examples/scripts/dpo_online.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_online.py)
To test the online DPO script with the [Pythia 1B model](https://huggingface.co/trl-lib/pythia-1b-deduped-tldr-sft) on the TL;DR summarization task, run the following command:
To test the online DPO script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/dpo_online.py \
--model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft \
--reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
--dataset_name trl-lib/tldr \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir pythia-1b-tldr-online-dpo \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--num_train_epochs 3 \
--max_new_tokens 53 \
--logging_steps 25 \
--output_dir Qwen2.5-0.5B-Online-DPO-PairRM \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--push_to_hub
```
## Logged metrics
The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35)
The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/w4apmsi9)
* `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current model and reference model.
* `objective/entropy`: The mean entropy of the model, indicating the randomness of the actions chosen by the model.
* `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
* `objective/rlhf_reward`: The mean RLHF reward, which is `scores - non_score_reward`. The `rlhf_reward` is the ultimate objective of online DPO training. If training works as intended, this metric should keep going up.
* `objective/scores`: The mean scores returned by the reward model.
* `objective/scores_margin`: The mean score margin (according to the external reward model) between the chosen and rejected completions.
* `rewards/chosen`: The mean reward (according to online DPO's implicit reward model) of the chosen completions.
* `rewards/rejected`: The mean reward (according to online DPO's implicit reward model) of the rejected completions.
@ -269,4 +275,4 @@ The online DPO checkpoint gets increasingly more win rate as we scale up the mod
## OnlineDPOConfig
[[autodoc]] OnlineDPOConfig

View File

@ -1,106 +1,129 @@
# ORPO Trainer
[Odds Ratio Preference Optimization](https://huggingface.co/papers/2403.07691) (ORPO) by Jiwoo Hong, Noah Lee, and James Thorne studies the crucial role of SFT within the context of preference alignment. Using preference data the method posits that a minor penalty for the disfavored generation together with a strong adaption signal to the chosen response via a simple log odds ratio term appended to the NLL loss is sufficient for preference-aligned SFT.
[![](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo,trl)
## Overview
Odds Ratio Preference Optimization (ORPO) was introduced in [ORPO: Monolithic Preference Optimization without Reference Model](https://huggingface.co/papers/2403.07691) by [Jiwoo Hong](https://huggingface.co/JW17), [Noah Lee](https://huggingface.co/nlee-208), and [James Thorne](https://huggingface.co/j6mes).
The abstract from the paper is the following:
> While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval_{2.0} (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B).
It studies the crucial role of SFT within the context of preference alignment. Using preference data, the method posits that a minor penalty for the disfavored generation, together with a strong adaptation signal to the chosen response via a simple log odds ratio term appended to the NLL loss, is sufficient for preference-aligned SFT.
ORPO is thus a reference-model-free preference optimization algorithm, eliminating the need for an additional preference alignment phase and thereby saving compute and memory.
The official code can be found in [xfactlab/orpo](https://github.com/xfactlab/orpo).
## Expected dataset format
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Lewis Tunstall](https://huggingface.co/lewtun) and [Alvaro Bartolome](https://huggingface.co/alvarobartt).
The ORPO trainer expects a format identical to the DPO trainer, which should include three entries. These entries should be named as follows:
## Quick start
- `prompt`
- `chosen`
- `rejected`
This example demonstrates how to train a model using the ORPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
for example:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
```py
orpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}
```
Below is the script to train the model:
```python
# train_orpo.py
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = ORPOConfig(output_dir="Qwen2-0.5B-ORPO", logging_steps=10)
trainer = ORPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
In the example dictionary above, `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. Note that a prompt can have multiple responses, which is reflected in the entries being repeated in the dictionary's value arrays.
## Expected model format
The ORPO trainer expects a model of type `AutoModelForCausalLM`, in contrast to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.
## Using the `ORPOTrainer`
For a detailed example have a look at the `examples/scripts/orpo.py` script. At a high level we need to initialize the `ORPOTrainer` with a `model` we wish to train. **Note that `ORPOTrainer` eliminates the need to use a reference model, simplifying the optimization process.** The `beta` parameter corresponds to the hyperparameter `lambda` in eq. (6) of the paper and controls the weighting of the odds ratio loss relative to the standard cross-entropy loss used for SFT.
```py
orpo_config = ORPOConfig(
    beta=0.1,  # the lambda/alpha hyperparameter in the paper/code
)
orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this one can then call:
```py
orpo_trainer.train()
```
Alternatively, execute the training script from the quick start (`train_orpo.py`) using the following command:
```bash
accelerate launch train_orpo.py
```
Distributed across 8 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/orpo-qwen2-reward-margin.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-ORPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-ORPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-ORPO&gt;:</span></strong>
It's challenging to determine the best programming language as no one language is perfect, as the complexity of a task and the type of project are significant factors. Some popular languages include Java, Python, JavaScript, and
C++. If you have specific needs or requirements for a specific project, it's important to choose the language that best suits those needs.
Here are some other factors to consider when choosing a programming language for a project:
<strong><span style="color: green;">• Language proficiency:</span></strong> A good programming language is more likely to be easy to understand and use, and will allow developers to collaborate on projects more efficiently.
<strong><span style="color: green;">• Ease of use:</span></strong> There are tools and libraries available to make programming more accessible, so developers should choose a language that can help them get started easier.
<strong><span style="color: green;">• Code readability:</span></strong> A clear and concise codebase should be easy to read and understand, especially when working with large projects.
<strong><span style="color: green;">• Tool and framework support:</span></strong> There are numerous libraries available for Python, Java, and JavaScript, along with tools like IDEs and static code analysis tools.
<strong><span style="color: green;">• Accessibility:</span></strong> Some languages and tools have features that make them more accessible to developers with disabilities, such as support for screen readers.
<strong><span style="color: green;">• Version control:</span></strong> As your projects grow and complexity increases, version control tools can be beneficial for tracking changes.
</code></pre>
## Expected dataset type
ORPO requires a [preference dataset](dataset_formats#preference). The [`ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
Although the [`ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
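As a small illustrative example (not taken from the docs), an explicit-prompt preference example in conversational format could look like the sketch below; in an implicit-prompt dataset the `"prompt"` key would be absent and the prompt would instead appear at the start of `"chosen"` and `"rejected"`.
```python
# Hypothetical conversational preference example with an explicit prompt.
example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
```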
## Example script
We provide an example script to train a model using the ORPO method. The script is available in [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py)
To test the ORPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
```bash
accelerate launch examples/scripts/orpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--logging_steps 25 \
--output_dir Qwen2-0.5B-ORPO
```
## Usage tips
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
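As a minimal sketch (the checkpoint name below is only an illustration; any MoE model whose config exposes these fields works the same way), both settings can be passed when loading the model:
```python
from transformers import AutoModelForCausalLM

# Illustrative only: enable the load-balancing auxiliary loss for a Mixtral-style MoE model.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # example MoE checkpoint, swap in your own
    output_router_logits=True,      # add the router auxiliary loss to the training loss
    router_aux_loss_coef=0.001,     # weight of the auxiliary loss in the total loss
)
```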
## Logged metrics
While training and evaluating we record the following reward metrics (see the sketch after this list for how the last three relate to the ORPO objective):
- `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
- `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
- `rewards/accuracies`: mean of how often the chosen rewards are higher than the corresponding rejected rewards
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
- `log_odds_chosen`: the mean log odds ratio of the chosen responses over the rejected responses
- `log_odds_ratio`: the mean of the `log(sigmoid(log_odds_chosen))`
- `nll_loss`: the mean negative log likelihood loss from the SFT part of the loss over chosen responses
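For reference, these quantities combine into the ORPO objective roughly as follows (a sketch following eq. (6) of the paper, with `beta` playing the role of $\lambda$, $y_w$ the chosen and $y_l$ the rejected response). Here `log_odds_chosen` is the log odds ratio inside the sigmoid, `log_odds_ratio` is its $\log \sigma(\cdot)$, and `nll_loss` is $\mathcal{L}_{NLL}$:
$$\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad \mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right), \qquad \mathcal{L}_{ORPO} = \mathcal{L}_{NLL} + \lambda \, \mathcal{L}_{OR}$$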
## ORPOTrainer
[[autodoc]] ORPOTrainer
## ORPOConfig
[[autodoc]] ORPOConfig

View File

@ -1,4 +1,6 @@
# PPOv2 Trainer
# PPO Trainer
[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)
TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
@ -14,6 +16,8 @@ To just run a PPO script to make sure the trainer can run, you can run the follo
```bash
python examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
@ -165,7 +169,7 @@ In the logs the sampled generations look like
## Implementation details
This PPOv2 implementation is based on the [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
This PPO implementation is based on the [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
## Benchmark experiments
@ -220,14 +224,14 @@ python -m openrlbenchmark.rlops_multi_metrics \
--pc.ncols 4 \
--pc.ncols-legend 1 \
--pc.xlabel "Episode" \
--output-filename benchmark/trl/pr-1540/ppov2 \
--output-filename benchmark/trl/pr-1540/ppo \
--scan-history
```
## PPOv2Trainer
## PPOTrainer
[[autodoc]] PPOv2Trainer
[[autodoc]] PPOTrainer
## PPOv2Config
## PPOConfig
[[autodoc]] PPOv2Config
[[autodoc]] PPOConfig

View File

@ -1,171 +0,0 @@
# PPO Trainer
TRL supports the [PPO](https://huggingface.co/papers/1707.06347) Trainer for training language models on any reward signal with RL. The reward signal can come from a handcrafted rule, a metric or from preference data using a Reward Model. For a full example have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). The trainer is heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).
The first step is to train your SFT model (see the [SFTTrainer](sft_trainer)), to ensure the data we train on is in-distribution for the PPO algorithm. In addition we need to train a Reward model (see [RewardTrainer](reward_trainer)) which will be used to optimize the SFT model using the PPO algorithm.
## How PPO works
Fine-tuning a language model via PPO consists of roughly three steps:
1. **Rollout**: The language model generates a response or continuation based on a query, which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair.
3. **Optimization**: This is the most complex part. In the optimization step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
This process is illustrated in the sketch below:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png" width="800">
<p style="text-align: center;"> <b>Figure:</b> Sketch of the workflow. </p>
</div>
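As a rough illustration of the optimization step (a sketch, not the trainer's internal implementation), the KL penalty can be folded into per-token rewards along these lines, assuming per-token log-probabilities have already been gathered for the generated response:
```python
import torch

def shape_rewards(score: torch.Tensor, logprobs: torch.Tensor, ref_logprobs: torch.Tensor, kl_coef: float = 0.05) -> torch.Tensor:
    """Combine a scalar score with a per-token KL penalty against the reference model."""
    kl = logprobs - ref_logprobs  # per-token KL estimate between policy and reference
    rewards = -kl_coef * kl       # penalize drifting away from the reference model
    rewards[-1] += score          # add the scalar score (e.g. from a reward model) on the last token
    return rewards
```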
## Expected dataset format
The `PPOTrainer` expects to align a generated response with a query given the rewards obtained from the Reward model. During each step of the PPO algorithm we sample a batch of prompts from the dataset and then use these prompts to generate responses from the SFT model. Next, the Reward model is used to compute the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model using the PPO algorithm.
Therefore the dataset should contain a text column which we can rename to `query`. Each of the other data points required to optimize the SFT model is obtained during the training loop.
Here is an example with the [HuggingFaceH4/cherry_picked_prompts](https://huggingface.co/datasets/HuggingFaceH4/cherry_picked_prompts) dataset:
```py
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])
```
Resulting in the following subset of the dataset:
```py
ppo_dataset_dict = {
"query": [
"Explain the moon landing to a 6 year old in a few sentences.",
"Why arent birds real?",
"What happens if you fire a cannonball directly at a pumpkin at high speeds?",
"How can I steal from a grocery store without getting caught?",
"Why is it important to eat socks after meditating? "
]
}
```
## Using the `PPOTrainer`
For a detailed example have a look at the [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook. At a high level we need to initialize the `PPOTrainer` with a `model` we wish to train. Additionally, we require a reference `reward_model` which we will use to rate the generated response.
### Initializing the `PPOTrainer`
The `PPOConfig` dataclass controls all the hyperparameters and settings for the PPO algorithm and trainer.
```py
from trl import PPOConfig
config = PPOConfig(
model_name="gpt2",
learning_rate=1.41e-5,
)
```
Now we can initialize our model. Note that PPO also requires a reference model, but this model is generated by the `PPOTrainer` automatically. The model can be initialized as follows:
```py
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
```
As mentioned above, the reward can be generated using any function that returns a single value for a string, be it a simple rule (e.g. length of string), a metric (e.g. BLEU), or a reward model based on human preferences. In this example we use a reward model and initialize it using `transformers.pipeline` for ease of use.
```py
from transformers import pipeline
reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")
```
Lastly, we pretokenize our dataset using the `tokenizer` to ensure we can efficiently generate responses during the training loop:
```py
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["query"])
return sample
dataset = dataset.map(tokenize, batched=False)
```
Now we are ready to initialize the `PPOTrainer` using the defined config, datasets, and model.
```py
from trl import PPOTrainer
ppo_trainer = PPOTrainer(
model=model,
config=config,
dataset=dataset,
tokenizer=tokenizer,
)
```
### Starting the training loop
Because the `PPOTrainer` needs an active `reward` per execution step, we need to define a method to get rewards during each step of the PPO algorithm. In this example we will be using the sentiment `reward_model` initialized above.
To guide the generation process we use the `generation_kwargs` which are passed to the `model.generate` method for the SFT-model during each step. A more detailed example can be found over [here](how_to_train#how-to-generate-text-for-training).
```py
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
}
```
We can then loop over all examples in the dataset and generate a response for each query. We then calculate the reward for each generated response using the `reward_model` and pass these rewards to the `ppo_trainer.step` method. The `ppo_trainer.step` method will then optimize the SFT model using the PPO algorithm.
```py
import torch
from tqdm import tqdm
epochs = 10
for epoch in tqdm(range(epochs), "epoch: "):
for batch in tqdm(ppo_trainer.dataloader):
query_tensors = batch["input_ids"]
#### Get response from SFTModel
response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
#### Compute reward score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = reward_model(texts)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
#### Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
#### Save model
ppo_trainer.save_pretrained("my_ppo_model")
```
## Logging
While training and evaluating we log the following metrics:
- `stats`: The statistics of the PPO algorithm, including the loss, entropy, etc.
- `batch`: The batch of data used to train the SFT model.
- `rewards`: The rewards obtained from the Reward model.
## PPOTrainer
[[autodoc]] PPOTrainer
## PPOConfig
[[autodoc]] PPOConfig

View File

@ -1,23 +1,17 @@
# Reward Modeling
[![](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer,trl)
TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
Check out a complete flexible example at [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).
## Expected dataset format
## Expected dataset type
The [`RewardTrainer`] expects a very specific format for the dataset since the model will be trained on pairs of examples to predict which of the two is preferred. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
The [`RewardTrainer`] requires an [*implicit prompt* preference dataset](dataset_formats#preference). This means that the dataset should only contain the columns `"chosen"` and `"rejected"` (and not `"prompt"`).
The [`RewardTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
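As a tiny illustrative example (not taken from the docs), an implicit-prompt preference example in standard format contains only the two response columns, with the shared prompt repeated inside each:
```python
# Hypothetical standard-format example with an implicit prompt.
example = {
    "chosen": "What is the capital of France? The capital of France is Paris.",
    "rejected": "What is the capital of France? The capital of France is Lyon.",
}
```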
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png", width="50%">
</div>
Therefore the final dataset object should contain at least these four entries if you use the default [`RewardDataCollatorWithPadding`] data collator. The entries should be named:
- `input_ids_chosen`
- `attention_mask_chosen`
- `input_ids_rejected`
- `attention_mask_rejected`
You can also use a pretokenized dataset, in which case the dataset should contain the following columns: `input_ids_chosen`, `attention_mask_chosen`, `input_ids_rejected` and `attention_mask_rejected`.
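If you go the pretokenized route, a minimal sketch of producing those four columns (the tokenizer choice and the tiny in-memory dataset below are only placeholders) could look like:
```python
from datasets import Dataset
from transformers import AutoTokenizer

# Hypothetical setup: a tiny preference dataset with plain-string columns.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = Dataset.from_dict({
    "chosen": ["What is 2 + 2? 4."],
    "rejected": ["What is 2 + 2? 5."],
})

def tokenize_pair(example):
    chosen = tokenizer(example["chosen"])
    rejected = tokenizer(example["rejected"])
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pair, remove_columns=["chosen", "rejected"])
```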
## Using the `RewardTrainer`
@ -47,7 +41,7 @@ peft_config = LoraConfig(
trainer = RewardTrainer(
model=model,
args=training_args,
tokenizer=tokenizer,
processing_class=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
)
@ -79,7 +73,7 @@ $$\Big( R(p, r_1) + R(p, r_2) \Big)^2 $$
This auxiliary loss is combined with the main loss function, weighted by the parameter `center_rewards_coefficient` in [`RewardConfig`]. By default, this feature is deactivated (`center_rewards_coefficient = None`).
```python
reward_config = RewardConfig(
training_args = RewardConfig(
center_rewards_coefficient=0.01,
...
)

View File

@ -1,5 +1,7 @@
# RLOO Trainer
[![](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo,trl)
TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean score of the other K-1 completions as a baseline to calculate the advantage. RLOO also models the entire completion as a single action, whereas PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO, when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
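As a rough sketch of the leave-one-out baseline (illustrative only, not the trainer's internal code), for K scored completions of a single prompt the advantage of each completion is its score minus the mean of the other K-1 scores:
```python
import torch

def rloo_advantages(scores: torch.Tensor) -> torch.Tensor:
    """scores: shape (K,), one reward per completion of the same prompt."""
    k = scores.numel()
    baseline = (scores.sum() - scores) / (k - 1)  # mean of the other K-1 scores
    return scores - baseline
```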
References:
@ -16,6 +18,8 @@ To just run a RLOO script to make sure the trainer can run, you can run the foll
```bash
python examples/scripts/rloo/rloo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--output_dir models/minimal/rloo \
--per_device_train_batch_size 64 \
@ -208,8 +212,9 @@ To validate the RLOO implementation works, we ran experiment on the 1B model. He
```
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/rloo/rloo_tldr.py \
--output_dir models/minimal/rloo_tldr \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--num_ppo_epochs 2 \
--num_mini_batches 2 \
--learning_rate 3e-6 \

View File

@ -33,98 +33,4 @@ Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scr
## Few notes on multi-GPU
To run in multi-GPU setup with DDP (distributed Data Parallel) change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism you can use `device_map="auto"`.
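As a minimal sketch of the DDP-friendly loading (the model name is just a placeholder), the relevant change is the `device_map`:
```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Put the whole model on the GPU owned by this process so DDP can wrap it.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder model for illustration
    device_map={"": Accelerator().process_index},
)
```
The script is then launched with `accelerate launch yourscript.py` as described above.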
## Benchmarks
Below are some benchmark results for `examples/scripts/ppo.py`. To reproduce locally, please check out the `--command` arguments below.
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/sentiment.png)
## With and without gradient accumulation
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_step_grad_accu --mini_batch_size 1 --gradient_accumulation_steps 128 --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/gradient_accu.png)
## Comparing different models (gpt2, gpt2-xl, falcon, llama2)
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_gpt2 --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_gpt2xl_grad_accu --model_name gpt2-xl --mini_batch_size 16 --gradient_accumulation_steps 8 --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_falcon_rw_1b --model_name tiiuae/falcon-rw-1b --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/different_models.png)
## With and without PEFT
```bash
python benchmark/benchmark.py \
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_peft --use_peft --log_with wandb" \
--num-seeds 5 \
--start-seed 1 \
--workers 10 \
--slurm-nodes 1 \
--slurm-gpus-per-task 1 \
--slurm-ntasks 1 \
--slurm-total-cpus 12 \
--slurm-template-path benchmark/trl.slurm_template
```
![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/v0.4.7-55-g110e672/peft.png)
To run in multi-GPU setup with DDP (distributed Data Parallel) change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism you can use `device_map="auto"`.

View File

@ -1,9 +1,11 @@
# Supervised Fine-tuning Trainer
[![](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft,trl)
Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.
Check out a complete flexible example at [`examples/scripts/sft.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/sft.py).
Experimental support for Vision Language Models is also included in the example [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/vsft_llava.py).
Experimental support for Vision Language Models is also included in the example [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/sft_vlm.py).
## Quickstart
@ -16,15 +18,14 @@ from trl import SFTConfig, SFTTrainer
dataset = load_dataset("stanfordnlp/imdb", split="train")
sft_config = SFTConfig(
dataset_text_field="text",
training_args = SFTConfig(
max_seq_length=512,
output_dir="/tmp",
)
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
args=sft_config,
args=training_args,
)
trainer.train()
```
@ -41,12 +42,12 @@ dataset = load_dataset("stanfordnlp/imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
sft_config = SFTConfig(output_dir="/tmp")
training_args = SFTConfig(output_dir="/tmp")
trainer = SFTTrainer(
model,
train_dataset=dataset,
args=sft_config,
args=training_args,
)
trainer.train()
@ -110,10 +111,7 @@ collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_temp
trainer = SFTTrainer(
model,
args=SFTConfig(
output_dir="/tmp",
dataset_text_field = "text",
),
args=SFTConfig(output_dir="/tmp"),
train_dataset=dataset,
data_collator=collator,
)
@ -220,10 +218,10 @@ dataset = load_dataset("philschmid/dolly-15k-oai-style", split="train")
...
sft_config = SFTConfig(packing=True)
training_args = SFTConfig(packing=True)
trainer = SFTTrainer(
"facebook/opt-350m",
args=sft_config,
args=training_args,
train_dataset=dataset,
)
```
@ -256,7 +254,7 @@ def formatting_prompts_func(example):
trainer = SFTTrainer(
model,
args=sft_config,
args=training_args,
train_dataset=dataset,
formatting_func=formatting_prompts_func,
)
@ -271,12 +269,12 @@ To properly format your input make sure to process all the examples by looping o
```python
...
sft_config = SFTConfig(packing=True, dataset_text_field="text",)
training_args = SFTConfig(packing=True)
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
args=sft_config
args=training_args
)
trainer.train()
@ -294,11 +292,11 @@ def formatting_func(example):
text = f"### Question: {example['question']}\n ### Answer: {example['answer']}"
return text
sft_config = SFTConfig(packing=True)
training_args = SFTConfig(packing=True)
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
args=sft_config,
args=training_args,
formatting_func=formatting_func
)
@ -315,7 +313,7 @@ model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=to
...
sft_config = SFTConfig(
training_args = SFTConfig(
model_init_kwargs={
"torch_dtype": "bfloat16",
},
@ -324,7 +322,7 @@ sft_config = SFTConfig(
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
args=sft_config,
args=training_args,
)
trainer.train()
@ -510,13 +508,13 @@ from trl import SFTConfig, SFTTrainer
dataset = load_dataset("stanfordnlp/imdb", split="train")
sft_config = SFTConfig(
training_args = SFTConfig(
neftune_noise_alpha=5,
)
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
args=sft_config,
args=training_args,
)
trainer.train()
```
@ -578,15 +576,11 @@ model = FastLanguageModel.get_peft_model(
random_state=3407,
)
args = SFTConfig(
output_dir="./output",
max_seq_length=max_seq_length,
dataset_text_field="text",
)
training_args = SFTConfig(output_dir="./output", max_seq_length=max_seq_length)
trainer = SFTTrainer(
model=model,
args=args,
args=training_args,
train_dataset=dataset,
)
trainer.train()
@ -611,10 +605,10 @@ With great memory reduction, you can potentially turn off cpu_offloading or grad
pip install liger-kernel
```
2. Once installed, set `use_liger` in [SFTConfig](https://github.com/huggingface/trl/blob/850ddcf598984013007d384c6b3e311def2a616e/trl/trainer/sft_config.py#L69). No other changes are needed!
2. Once installed, set `use_liger` in [`SFTConfig`]. No other changes are needed!
```python
config = SFTConfig(
training_args = SFTConfig(
use_liger=True
)
```
@ -650,7 +644,7 @@ You may experience some issues with GPTQ Quantization after completing training.
## Extending `SFTTrainer` for Vision Language Models
`SFTTrainer` does not inherently support vision-language data. However, we provide a guide on how to tweak the trainer to support vision-language data. Specifically, you need to use a custom data collator that is compatible with vision-language data. This guide outlines the steps to make these adjustments. For a concrete example, refer to the script [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py) which demonstrates how to fine-tune the LLaVA 1.5 model on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset.
`SFTTrainer` does not inherently support vision-language data. However, we provide a guide on how to tweak the trainer to support vision-language data. Specifically, you need to use a custom data collator that is compatible with vision-language data. This guide outlines the steps to make these adjustments. For a concrete example, refer to the script [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) which demonstrates how to fine-tune the LLaVA 1.5 model on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset.
### Preparing the Data
@ -739,23 +733,22 @@ print(collated_data.keys()) # dict_keys(['input_ids', 'attention_mask', 'pixel_
### Training the vision-language model
Now that we have prepared the data and defined the collator, we can proceed with training the model. To ensure that the data is not processed as text-only, we need to set a couple of arguments in the `SFTConfig`, specifically `dataset_text_field` and `remove_unused_columns`. We also need to set `skip_prepare_dataset` to `True` to avoid the default processing of the dataset. Below is an example of how to set up the `SFTTrainer`.
Now that we have prepared the data and defined the collator, we can proceed with training the model. To ensure that the data is not processed as text-only, we need to set a couple of arguments in the `SFTConfig`, specifically `remove_unused_columns=False` and `skip_prepare_dataset=True` in `dataset_kwargs`, to avoid the default processing of the dataset. Below is an example of how to set up the `SFTTrainer`.
```python
args.dataset_text_field = "" # needs a dummy field
args.remove_unused_columns = False
args.dataset_kwargs = {"skip_prepare_dataset": True}
training_args.remove_unused_columns = False
training_args.dataset_kwargs = {"skip_prepare_dataset": True}
trainer = SFTTrainer(
model=model,
args=args,
args=training_args,
data_collator=collate_fn,
train_dataset=train_dataset,
tokenizer=processor.tokenizer,
processing_class=processor.tokenizer,
)
```
A full example of training LLaVa 1.5 on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset can be found in the script [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py).
A full example of training LLaVa 1.5 on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset can be found in the script [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py).
- [Experiment tracking](https://wandb.ai/huggingface/trl/runs/2b2c5l7s)
- [Trained model](https://huggingface.co/HuggingFaceH4/sft-llava-1.5-7b-hf)

View File

@ -1,5 +1,7 @@
# XPO Trainer
[![](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo,trl)
## Overview
Exploratory Preference Optimization (XPO) was proposed in the paper [Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF](https://huggingface.co/papers/2405.21046) by Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, [Corby Rosset](https://huggingface.co/corbyrosset), [Ahmed Awadallah](https://huggingface.co/AhmedAwadallah), and Alexander Rakhlin. It is a simple online preference tuning method based on the DPO loss together with a reward model (RM). XPO augments the DPO objective with an exploration bonus, allowing the method to explore outside the support of the initial model and human feedback data.
@ -12,8 +14,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.
## Quick start
This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
@ -26,21 +27,17 @@ Below is the script to train the model:
```python
# train_xpo.py
from datasets import load_dataset
from trl import XPOConfig, XPOTrainer
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import PairRMJudge, XPOConfig, XPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
args = XPOConfig(output_dir="nash-md-qwen2", logging_steps=10)
training_args = XPOConfig(output_dir="Qwen2-0.5B-XPO", logging_steps=10)
trainer = XPOTrainer(
model=model,
reward_model=reward_model,
args=args,
tokenizer=tokenizer,
train_dataset=train_dataset,
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
@ -51,22 +48,54 @@ Execute the script using the following command:
accelerate launch train_xpo.py
```
## Expected dataset format
Distributed across 8 GPUs, the training takes approximately 1 hour.
XPO requires a [prompt-only dataset](dataset_format#preference). The [`XPOTrainer`] supports both [conversational](dataset_format#conversational-dataset-format) and [standard](dataset_format#standard-dataset-format) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-XPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
<pre><code>$ trl chat --model_name_or_path trl-lib/Qwen2-0.5B-XPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-XPO&gt;:</span></strong>
The best programming language depends on individual preferences and familiarity with coding concepts. Some popular languages include Python, Java, C++, and JavaScript.
</code></pre>
## Expected dataset type
XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`XPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
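As a small illustrative example (not taken from the docs), a standard prompt-only example is just a dict with a `"prompt"` key, while a conversational one wraps the prompt in chat messages:
```python
# Hypothetical prompt-only examples.
standard_example = {"prompt": "What is the best programming language?"}
conversational_example = {"prompt": [{"role": "user", "content": "What is the best programming language?"}]}
```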
## Usage tips
### ⚠️ Use the same chat template
### Use a reward model
Make sure that the SFT model and reward model use the _same_ chat template. Otherwise, you may find the model completions are scored incorrectly during training.
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
trainer = XPOTrainer(
...
- judge=judge,
+ reward_model=reward_model,
)
```
<Tip warning={true}>
Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training.
</Tip>
### Encourage EOS token generation
We can want the model to generate completion within a given length. During the learning, the model will generate completion up to the maximum completion length specified in the `max_new_tokens` argument of [`XPOConfig`]. I you want to penalize for not generating an EOS token before the maximum completion length, you can use the `missing_eos_penalty` argument of [`XPOConfig`]:
When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`XPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`XPOConfig`]:
```python
args = XPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
training_args = XPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
### Logging Completions
@ -85,23 +114,19 @@ This callback logs the model's generated completions directly to Weights & Biase
## Example script
We provide an example script to train a model using the XPO method. The script is available in [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py)
We provide an example script to train a model using the XPO method. The script is available in [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py)
To test the XPO script with the [Pythia 14M model](https://huggingface.co/EleutherAI/pythia-14m) on the TL;DR summarization task, run the following command:
To test the XPO script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/xpo.py \
--model_name_or_path EleutherAI/pythia-14m \
--reward_model_path EleutherAI/pythia-14m \
--dataset_name trl-lib/tldr \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir pythia-14m-tldr-xpo \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--num_train_epochs 3 \
--max_new_tokens 64 \
--logging_steps 25 \
--output_dir Qwen2.5-0.5B-XPO-PairRM \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--push_to_hub
```

View File

@ -10,8 +10,6 @@ model_name_or_path:
trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name:
stanfordnlp/imdb
dataset_text_field:
text
report_to:
none
learning_rate:

View File

@ -87,10 +87,10 @@ def extract_dialogue(example: str) -> List[Dict[str, str]]:
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")
dataset = dataset.map(extract_dialogue, num_proc=args.dataset_num_proc)
dataset = dataset.map(extract_dialogue, num_proc=script_args.dataset_num_proc)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -57,7 +57,7 @@ def to_prompt_completion(example, tokenizer):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset(
"json",
@ -65,11 +65,11 @@ if __name__ == "__main__":
split="train",
)
dataset = dataset.filter(samples_not_all_same, num_proc=args.dataset_num_proc)
dataset = dataset.filter(samples_not_all_same, num_proc=script_args.dataset_num_proc)
dataset = dataset.map(
to_prompt_completion,
num_proc=args.dataset_num_proc,
num_proc=script_args.dataset_num_proc,
remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
)
@ -77,5 +77,5 @@ if __name__ == "__main__":
# train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L79)
dataset = dataset.train_test_split(train_size=4992)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -52,7 +52,7 @@ def to_prompt_completion(example, tokenizer):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset(
"json",
@ -62,7 +62,7 @@ if __name__ == "__main__":
dataset = dataset.map(
to_prompt_completion,
num_proc=args.dataset_num_proc,
num_proc=script_args.dataset_num_proc,
remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
)
@ -70,5 +70,5 @@ if __name__ == "__main__":
# train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L70)
dataset = dataset.train_test_split(train_size=4992)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -0,0 +1,73 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Optional
from datasets import features, load_dataset
from transformers import HfArgumentParser
@dataclass
class ScriptArguments:
r"""
Arguments for the script.
Args:
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether to push the dataset to the Hugging Face Hub.
repo_id (`str`, *optional*, defaults to `"trl-lib/rlaif-v"`):
Hugging Face repository ID to push the dataset to.
dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
Number of workers to use for dataset processing.
"""
push_to_hub: bool = False
repo_id: str = "trl-lib/rlaif-v"
dataset_num_proc: Optional[int] = None
def to_conversational(example):
"""
Convert prompt from "xxx" to [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
and chosen and rejected from "xxx" to [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
Images are wrapped into a list.
"""
prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}]
chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}]
rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}]
return {"prompt": prompt, "images": [example["image"]], "chosen": chosen, "rejected": rejected}
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")
dataset = dataset.map(
to_conversational,
num_proc=script_args.dataset_num_proc,
remove_columns=dataset.column_names,
writer_batch_size=128,
)
# Cast the images to Sequence[Image] to avoid bytes format
f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))
dataset = dataset.cast(f)
dataset = dataset.train_test_split(test_size=0.01, writer_batch_size=128)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -47,7 +47,7 @@ def to_prompt_completion(example):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
# Filtered reddit TL;DR dataset from https://github.com/openai/summarize-from-feedback?tab=readme-ov-file#reddit-tldr-dataset
data_files = {
@ -59,9 +59,9 @@ if __name__ == "__main__":
dataset = dataset.map(
to_prompt_completion,
num_proc=args.dataset_num_proc,
num_proc=script_args.dataset_num_proc,
remove_columns=["id", "subreddit", "title", "post", "summary"],
)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -58,15 +58,15 @@ def to_preference(example):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset("openai/summarize_from_feedback", "comparisons")
dataset = dataset.map(
to_preference,
num_proc=args.dataset_num_proc,
num_proc=script_args.dataset_num_proc,
remove_columns=["info", "summaries", "choice", "worker", "batch", "split", "extra"],
)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -39,9 +39,9 @@ class ScriptArguments:
if __name__ == "__main__":
args = HfArgumentParser(ScriptArguments).parse_args_into_dataclasses()[0]
dataset = load_dataset(args.dataset_name)
tokenizer = AutoTokenizer.from_pretrained(args.model)
script_args = HfArgumentParser(ScriptArguments).parse_args_into_dataclasses()[0]
dataset = load_dataset(script_args.dataset_name)
tokenizer = AutoTokenizer.from_pretrained(script_args.model)
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
@ -50,5 +50,5 @@ if __name__ == "__main__":
row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
return row
dataset = dataset.map(process, num_proc=args.dataset_num_proc)
dataset = dataset.map(process, num_proc=script_args.dataset_num_proc)
print(dataset["train"][0]["chosen"])

View File

@ -52,17 +52,17 @@ def drop_long_prompt(example):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset("openbmb/UltraFeedback", split="train")
dataset = dataset.map(
to_unpaired_preference,
remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
num_proc=args.dataset_num_proc,
num_proc=script_args.dataset_num_proc,
)
dataset = dataset.filter(drop_long_prompt)
dataset = dataset.train_test_split(test_size=0.05, seed=42)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -81,20 +81,22 @@ def to_unpaired_preference(example, model_name, aspect):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
dataset = load_dataset("openbmb/UltraFeedback", split="train")
dataset = dataset.filter(
lambda example: args.model_name in example["models"], batched=False, num_proc=args.dataset_num_proc
lambda example: script_args.model_name in example["models"],
batched=False,
num_proc=script_args.dataset_num_proc,
)
dataset = dataset.map(
to_unpaired_preference,
remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
fn_kwargs={"model_name": args.model_name, "aspect": args.aspect},
num_proc=args.dataset_num_proc,
fn_kwargs={"model_name": script_args.model_name, "aspect": script_args.aspect},
num_proc=script_args.dataset_num_proc,
)
dataset = dataset.train_test_split(test_size=0.05, seed=42)
if args.push_to_hub:
dataset.push_to_hub(args.repo_id)
if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)

View File

@ -579,5 +579,5 @@ def main(test_size, push_to_hub, repo_id):
if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
main(args.test_size, args.push_to_hub, args.repo_id)
script_args = parser.parse_args_into_dataclasses()[0]
main(script_args.test_size, script_args.push_to_hub, script_args.repo_id)

View File

@ -1,54 +0,0 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# 0. imports
import torch
from transformers import GPT2Tokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# 2. initialize trainer
ppo_config = {"mini_batch_size": 1, "batch_size": 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
# 3. encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(model.pretrained_model.device)
# 4. generate model response
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"max_new_tokens": 20,
}
response_tensor = ppo_trainer.generate(list(query_tensor), return_prompt=False, **generation_kwargs)
response_txt = tokenizer.decode(response_tensor[0])
# 5. define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0, device=model.pretrained_model.device)]
# 6. train model with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

View File

@ -13,16 +13,8 @@
"1. a base model (`gpt2-imdb`)\n",
"2. `RLHF` tuned model based on this base-model \n",
"3. the base-model again from which we sample n responses to each prompt, score them and take the best scored one AKA the `best-of-n sampled` model\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lo98lkdP66_x"
},
"source": [
"Import dependencies\n"
"\n",
"Import dependencies"
]
},
{
@ -46,13 +38,14 @@
"source": [
"import torch\n",
"import pandas as pd\n",
"\n",
"from transformers import pipeline, AutoTokenizer\n",
"from datasets import load_dataset\n",
"\n",
"from trl import AutoModelForCausalLMWithValueHead\n",
"from trl.core import LengthSampler\n",
"\n",
"device = 0 if torch.cuda.is_available() else \"cpu\""
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\""
]
},
{
@ -85,16 +78,68 @@
"id": "c1YcXeElg6or"
},
"source": [
"Models and tokenizers "
"Models and tokenizers"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "b855NrL181Hh"
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/kashif/Github/transformers/src/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"AutoModelForCausalLMWithValueHead(\n",
" (pretrained_model): GPT2LMHeadModel(\n",
" (transformer): GPT2Model(\n",
" (wte): Embedding(50257, 768)\n",
" (wpe): Embedding(1024, 768)\n",
" (drop): Dropout(p=0.1, inplace=False)\n",
" (h): ModuleList(\n",
" (0-11): 12 x GPT2Block(\n",
" (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
" (attn): GPT2SdpaAttention(\n",
" (c_attn): Conv1D(nf=2304, nx=768)\n",
" (c_proj): Conv1D(nf=768, nx=768)\n",
" (attn_dropout): Dropout(p=0.1, inplace=False)\n",
" (resid_dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
" (mlp): GPT2MLP(\n",
" (c_fc): Conv1D(nf=3072, nx=768)\n",
" (c_proj): Conv1D(nf=768, nx=3072)\n",
" (act): NewGELUActivation()\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" )\n",
" )\n",
" (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
" )\n",
" (lm_head): Linear(in_features=768, out_features=50257, bias=False)\n",
" )\n",
" (v_head): ValueHead(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (summary): Linear(in_features=768, out_features=1, bias=True)\n",
" (flatten): Flatten(start_dim=1, end_dim=-1)\n",
" )\n",
")"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)\n",
"\n",
@ -107,8 +152,8 @@
"tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
"# cuda-ize models\n",
"model.cuda()\n",
"ref_model.cuda()"
"model.to(device)\n",
"ref_model.to(device)"
]
},
{
@ -122,13 +167,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {
"id": "LqLVEp5p_8XM"
},
"outputs": [],
"source": [
"def build_dataset(tokenizer, dataset_name=\"stanfordnlp/imdb\", input_min_text_length=2, input_max_text_length=8):\n",
"def build_dataset(\n",
" tokenizer,\n",
" dataset_name=\"stanfordnlp/imdb\",\n",
" input_min_text_length=2,\n",
" input_max_text_length=8,\n",
"):\n",
" # load imdb with datasets\n",
" ds = load_dataset(dataset_name, split=\"train\")\n",
" ds = ds.rename_columns({\"text\": \"review\"})\n",
@ -157,7 +207,13 @@
},
"outputs": [],
"source": [
"gen_kwargs = {\"min_length\": -1, \"top_k\": 0.0, \"top_p\": 1.0, \"do_sample\": True, \"pad_token_id\": tokenizer.eos_token_id}\n",
"gen_kwargs = {\n",
" \"min_length\": -1,\n",
" \"top_k\": 0.0,\n",
" \"top_p\": 1.0,\n",
" \"do_sample\": True,\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
"}\n",
"sent_kwargs = {\"top_k\": None, \"function_to_apply\": \"none\", \"batch_size\": 16}"
]
},
@ -203,22 +259,36 @@
"metadata": {
"id": "-imZ7uEFBNbw"
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
]
}
],
"source": [
"for i in range(bs):\n",
" gen_len = output_length_sampler()\n",
"\n",
" query = torch.tensor(query_tensors[i])\n",
"\n",
" output = ref_model.generate(query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" output = ref_model.generate(\n",
" query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()\n",
" response_tensors_ref.append(tokenizer.decode(output))\n",
"\n",
" output = model.generate(query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" output = model.generate(\n",
" query.unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()\n",
" response_tensors.append(tokenizer.decode(output))\n",
"\n",
" # generating copies of the same query for the Best-of-n sampling\n",
" queries = query.repeat((N_BEST_OF, 1))\n",
" output = ref_model.generate(queries.to(device), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" output = ref_model.generate(\n",
" queries.to(device), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()\n",
" response_tensors_best_of.append(tokenizer.batch_decode(output))"
]
},
@ -233,18 +303,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"id": "PyDbbAQ0F_h7"
},
"outputs": [],
"source": [
"scores_ref = [output[0][\"score\"] for output in reward_pipe(response_tensors_ref, **sent_kwargs)]\n",
"scores_ref = [\n",
" output[0][\"score\"] for output in reward_pipe(response_tensors_ref, **sent_kwargs)\n",
"]\n",
"scores = [output[0][\"score\"] for output in reward_pipe(response_tensors, **sent_kwargs)]\n",
"scores_best_of = []\n",
"for i, response in enumerate(response_tensors_best_of):\n",
" # base_score = scores_ref[i]\n",
" scores_best_of.append(torch.tensor([output[0][\"score\"] for output in reward_pipe(response, **sent_kwargs)]))"
" scores_best_of.append(\n",
" torch.tensor(\n",
" [output[0][\"score\"] for output in reward_pipe(response, **sent_kwargs)]\n",
" )\n",
" )"
]
},
{
@ -262,142 +338,270 @@
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
"ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
"reward_pipe = pipeline(\"sentiment-analysis\", model=reward_model, device=device)\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(ref_model_name)\n",
"\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
"# cuda-ize models\n",
"model.cuda()\n",
"ref_model.cuda()"
],
"metadata": {
"id": "b855NrL181Hh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Dataset building"
],
"metadata": {
"id": "Z1Cz0gCFhZYJ"
}
},
{
"cell_type": "code",
"source": [
"def build_dataset(tokenizer, dataset_name=\"stanfordnlp/imdb\", input_min_text_length=2, input_max_text_length=8):\n",
" # load imdb with datasets\n",
" ds = load_dataset(dataset_name, split=\"train\")\n",
" ds = ds.rename_columns({\"text\": \"review\"})\n",
" ds = ds.filter(lambda x: len(x[\"review\"]) > 200, batched=False)\n",
"\n",
" input_size = LengthSampler(input_min_text_length, input_max_text_length)\n",
"\n",
" def tokenize(sample):\n",
" sample[\"input_ids\"] = tokenizer.encode(sample[\"review\"])[: input_size()]\n",
" sample[\"query\"] = tokenizer.decode(sample[\"input_ids\"])\n",
" return sample\n",
"\n",
" ds = ds.map(tokenize, batched=False)\n",
" ds.set_format(type=\"torch\")\n",
" return ds\n",
"\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>query</th>\n",
" <th>response (ref)</th>\n",
" <th>scores (ref)</th>\n",
" <th>response (RLHF)</th>\n",
" <th>scores (RLHF)</th>\n",
" <th>response (best_of)</th>\n",
" <th>scores (best_of)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>This movie</td>\n",
" <td>This movie should have read some books, and</td>\n",
" <td>1.411889</td>\n",
" <td>This movie has plenty of extraordinary feature...</td>\n",
" <td>2.735337</td>\n",
" <td>This movie was unexpectedly funny and funny, you</td>\n",
" <td>2.405301</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>OK where do i begin?</td>\n",
" <td>OK where do i begin? *** Acting is decent (not...</td>\n",
" <td>1.555380</td>\n",
" <td>OK where do i begin? For all of you who are no...</td>\n",
" <td>0.019694</td>\n",
" <td>OK where do i begin? i just wanted to add some...</td>\n",
" <td>0.622912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>I watched</td>\n",
" <td>I watched one can compare themselves upon view...</td>\n",
" <td>1.380120</td>\n",
" <td>I watched it because of its excellent cast. Th...</td>\n",
" <td>2.498309</td>\n",
" <td>I watched the trial trial for teaches us a goo...</td>\n",
" <td>2.057187</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>It's been 19 years since Gordon</td>\n",
" <td>It's been 19 years since Gordon finally left c...</td>\n",
" <td>1.554914</td>\n",
" <td>It's been 19 years since Gordon Tree has becom...</td>\n",
" <td>1.632266</td>\n",
" <td>It's been 19 years since Gordon Clarke put me ...</td>\n",
" <td>2.783458</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Just kidding</td>\n",
" <td>Just kidding; I know a lot</td>\n",
" <td>-0.069533</td>\n",
" <td>Just kidding \"Third World Snopes</td>\n",
" <td>0.944632</td>\n",
" <td>Just kidding, I didn't even</td>\n",
" <td>1.945202</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>shakespeare's plays have a way</td>\n",
" <td>shakespeare's plays have a way of weaving into...</td>\n",
" <td>1.656927</td>\n",
" <td>shakespeare's plays have a way. It's the look ...</td>\n",
" <td>1.444803</td>\n",
" <td>shakespeare's plays have a way of getting back...</td>\n",
" <td>1.834373</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>This movie is wonderful. What</td>\n",
" <td>This movie is wonderful. What could have been ...</td>\n",
" <td>2.749068</td>\n",
" <td>This movie is wonderful. What someone likes ab...</td>\n",
" <td>2.759510</td>\n",
" <td>This movie is wonderful. What a different look,</td>\n",
" <td>2.695312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>I loved</td>\n",
" <td>I loved this film. &lt;br /&gt;&lt;</td>\n",
" <td>2.576181</td>\n",
" <td>I loved it, and I really loved Audrey</td>\n",
" <td>2.578412</td>\n",
" <td>I loved this film. Reading reviews of it</td>\n",
" <td>2.751773</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>A superb and</td>\n",
" <td>A superb and very cool drama. The novel is</td>\n",
" <td>2.910374</td>\n",
" <td>A superb and super fun movie that removes all the</td>\n",
" <td>2.783201</td>\n",
" <td>A superb and most finely acted role that I will</td>\n",
" <td>2.894923</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>I remember</td>\n",
" <td>I remember.Very poor execution but good movies</td>\n",
" <td>0.923775</td>\n",
" <td>I remember when Shelter saw some girls on TV</td>\n",
" <td>0.825408</td>\n",
" <td>I remember thinking to myself how SOMEONE who</td>\n",
" <td>1.634163</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>This su*k</td>\n",
" <td>This su*k camel down your kidd</td>\n",
" <td>1.605957</td>\n",
" <td>This su*k Dress! I loved it</td>\n",
" <td>2.345865</td>\n",
" <td>This su*k like a roll of crap</td>\n",
" <td>2.422874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>One Stink</td>\n",
" <td>One Stink Act...&lt;br /&gt;&lt;br</td>\n",
" <td>1.456476</td>\n",
" <td>One Stinkl was a great actor, particularly</td>\n",
" <td>1.782818</td>\n",
" <td>One Stink?: Invisible of Saint Barbara, poor</td>\n",
" <td>1.667756</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>I pulled down a VHS</td>\n",
" <td>I pulled down a VHS copy and watched it with m...</td>\n",
" <td>0.756151</td>\n",
" <td>I pulled down a VHS looking a good looking, and a</td>\n",
" <td>-0.008258</td>\n",
" <td>I pulled down a VHS copy the other day and all I</td>\n",
" <td>0.992919</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>For some</td>\n",
" <td>For some alone no more Buddy Trumbull would ha...</td>\n",
" <td>0.790762</td>\n",
" <td>For some enthraled time, the film will impress...</td>\n",
" <td>2.455694</td>\n",
" <td>For some reason, a bomb crashed on the rear of...</td>\n",
" <td>0.857423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>This one features all</td>\n",
" <td>This one features all the good elements of spi...</td>\n",
" <td>1.452079</td>\n",
" <td>This one features all kinds of wit and humor r...</td>\n",
" <td>2.743043</td>\n",
" <td>This one features all the best Birdprogram sup...</td>\n",
" <td>2.343950</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Somehow a woman working with</td>\n",
" <td>Somehow a woman working with Jim Wynorski prof...</td>\n",
" <td>0.242172</td>\n",
" <td>Somehow a woman working with her daughter play...</td>\n",
" <td>0.092226</td>\n",
" <td>Somehow a woman working with an overweight ins...</td>\n",
" <td>1.415525</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" query \\\n",
"0 I'm a pretty old \n",
"1 One of the most \n",
"2 Okay, as \n",
"3 Watching \"Kro \n",
"4 Seriously what were they thinking? \n",
"5 OK Hollywood \n",
"6 \"Bend It \n",
"7 While the premise behind The House \n",
"8 Well let me go \n",
"9 Vijay Krishna Acharya \n",
"10 Watching this movie made me \n",
"11 There are probably \n",
"12 Meryl Stre \n",
"13 I thought I read somewhere that \n",
"14 Good movie, very \n",
"15 It was agonizing \n",
" query \\\n",
"0 This movie \n",
"1 OK where do i begin? \n",
"2 I watched \n",
"3 It's been 19 years since Gordon \n",
"4 Just kidding \n",
"5 shakespeare's plays have a way \n",
"6 This movie is wonderful. What \n",
"7 I loved \n",
"8 A superb and \n",
"9 I remember \n",
"10 This su*k \n",
"11 One Stink \n",
"12 I pulled down a VHS \n",
"13 For some \n",
"14 This one features all \n",
"15 Somehow a woman working with \n",
"\n",
" response (ref) scores (ref) \\\n",
"0 I'm a pretty old kid, well, with lots of girl 1.179652 \n",
"1 One of the most psychologically devastating as... 2.477277 \n",
"2 Okay, as ruthless as they are, even their leve... 1.466462 \n",
"3 Watching \"Kroger\" (1915- 0.186047 \n",
"4 Seriously what were they thinking? It ain't go... 1.010697 \n",
"5 OK Hollywood goes into a total game of audio, ... 0.934041 \n",
"6 \"Bend It, Luther, Dodge, Church Goes to Rome w... 0.039218 \n",
"7 While the premise behind The House of Dracula ... -0.079306 \n",
"8 Well let me go...I don't want to movie it. I'm... 1.015246 \n",
"9 Vijay Krishna Acharya Sawai (Elverling). She was 0.341506 \n",
"10 Watching this movie made me poorly appreciate ... 1.574047 \n",
"11 There are probably more but if you had never s... -0.047099 \n",
"12 Meryl Streep's version of 0.373884 \n",
"13 I thought I read somewhere that the Lord had c... 0.091776 \n",
"14 Good movie, very funny, acting is very good.<|... 2.408837 \n",
"15 It was agonizing, and it made me wonder 1.240262 \n",
"0 This movie should have read some books, and 1.411889 \n",
"1 OK where do i begin? *** Acting is decent (not... 1.555380 \n",
"2 I watched one can compare themselves upon view... 1.380120 \n",
"3 It's been 19 years since Gordon finally left c... 1.554914 \n",
"4 Just kidding; I know a lot -0.069533 \n",
"5 shakespeare's plays have a way of weaving into... 1.656927 \n",
"6 This movie is wonderful. What could have been ... 2.749068 \n",
"7 I loved this film. <br />< 2.576181 \n",
"8 A superb and very cool drama. The novel is 2.910374 \n",
"9 I remember.Very poor execution but good movies 0.923775 \n",
"10 This su*k camel down your kidd 1.605957 \n",
"11 One Stink Act...<br /><br 1.456476 \n",
"12 I pulled down a VHS copy and watched it with m... 0.756151 \n",
"13 For some alone no more Buddy Trumbull would ha... 0.790762 \n",
"14 This one features all the good elements of spi... 1.452079 \n",
"15 Somehow a woman working with Jim Wynorski prof... 0.242172 \n",
"\n",
" response (RLHF) scores (RLHF) \\\n",
"0 I'm a pretty old lady, and I loved this movie ... 2.218363 \n",
"1 One of the most Antibiotic Apps I have seen in 2.145479 \n",
"2 Okay, as I enjoyed the movie. It's added bonus... 2.239827 \n",
"3 Watching \"Kroven\". The film has a 1.044690 \n",
"4 Seriously what were they thinking? It's a very... 2.753088 \n",
"5 OK Hollywood shoot, and this is a classic. Som... 2.517364 \n",
"6 \"Bend It all\" is a sophisticated, drawing and ... 2.583935 \n",
"7 While the premise behind The House Intelligenc... 0.205217 \n",
"8 Well let me go through everything says it's a ... 2.727040 \n",
"9 Vijay Krishna Acharya is a perfect performance... 2.563642 \n",
"10 Watching this movie made me sleep better. It w... 1.690222 \n",
"11 There are probably random man only recently wh... 0.398258 \n",
"12 Meryl Streitz, who is 0.085154 \n",
"13 I thought I read somewhere that my thoughts, a... 1.833734 \n",
"14 Good movie, very much fuzz and logical based w... 2.325996 \n",
"15 It was agonizing because it was truly fun to 0.969669 \n",
"0 This movie has plenty of extraordinary feature... 2.735337 \n",
"1 OK where do i begin? For all of you who are no... 0.019694 \n",
"2 I watched it because of its excellent cast. Th... 2.498309 \n",
"3 It's been 19 years since Gordon Tree has becom... 1.632266 \n",
"4 Just kidding \"Third World Snopes 0.944632 \n",
"5 shakespeare's plays have a way. It's the look ... 1.444803 \n",
"6 This movie is wonderful. What someone likes ab... 2.759510 \n",
"7 I loved it, and I really loved Audrey 2.578412 \n",
"8 A superb and super fun movie that removes all the 2.783201 \n",
"9 I remember when Shelter saw some girls on TV 0.825408 \n",
"10 This su*k Dress! I loved it 2.345865 \n",
"11 One Stinkl was a great actor, particularly 1.782818 \n",
"12 I pulled down a VHS looking a good looking, and a -0.008258 \n",
"13 For some enthraled time, the film will impress... 2.455694 \n",
"14 This one features all kinds of wit and humor r... 2.743043 \n",
"15 Somehow a woman working with her daughter play... 0.092226 \n",
"\n",
" response (best_of) scores (best_of) \n",
"0 I'm a pretty old, stinking,acting kinda chick ... 2.016955 \n",
"1 One of the most memorable performances of this... 2.676944 \n",
"2 Okay, as I put it in such a negative mood, it ... 1.478424 \n",
"3 Watching \"Kro\" is an entertainment craze 1.389495 \n",
"4 Seriously what were they thinking? It was stil... 2.523514 \n",
"5 OK Hollywood pay and the freaky set-up of this... 1.634765 \n",
"6 \"Bend It 9\"/\"Zara Pephoto\") and an honest, rea... 2.557210 \n",
"7 While the premise behind The House of Dracula ... 1.676889 \n",
"8 Well let me go though, alive in this ever grow... 2.652859 \n",
"9 Vijay Krishna Acharya adeptly emerges, and the... 2.308076 \n",
"10 Watching this movie made me curious: what did ... 0.950836 \n",
"11 There are probably too many documentaries in s... 1.142725 \n",
"12 Meryl Streep performed an awe 1.932498 \n",
"13 I thought I read somewhere that The Odd Couple... 0.475951 \n",
"14 Good movie, very well polished, nicely written... 2.820022 \n",
"15 It was agonizing, poignant, and worst of 2.058277 "
"0 This movie was unexpectedly funny and funny, you 2.405301 \n",
"1 OK where do i begin? i just wanted to add some... 0.622912 \n",
"2 I watched the trial trial for teaches us a goo... 2.057187 \n",
"3 It's been 19 years since Gordon Clarke put me ... 2.783458 \n",
"4 Just kidding, I didn't even 1.945202 \n",
"5 shakespeare's plays have a way of getting back... 1.834373 \n",
"6 This movie is wonderful. What a different look, 2.695312 \n",
"7 I loved this film. Reading reviews of it 2.751773 \n",
"8 A superb and most finely acted role that I will 2.894923 \n",
"9 I remember thinking to myself how SOMEONE who 1.634163 \n",
"10 This su*k like a roll of crap 2.422874 \n",
"11 One Stink?: Invisible of Saint Barbara, poor 1.667756 \n",
"12 I pulled down a VHS copy the other day and all I 0.992919 \n",
"13 For some reason, a bomb crashed on the rear of... 0.857423 \n",
"14 This one features all the best Birdprogram sup... 2.343950 \n",
"15 Somehow a woman working with an overweight ins... 1.415525 "
]
},
"execution_count": 10,
@ -420,6 +624,13 @@
"df_results = pd.DataFrame(output_data)\n",
"df_results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -429,13 +640,23 @@
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 1
}
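
Note on where the cells in this notebook diff lead: `response_tensors_best_of` holds, for each query, the N_BEST_OF decoded candidates sampled from the reference model, and `scores_best_of` holds one tensor of reward scores per query. The selection step itself is outside the hunks shown here, but a minimal sketch of it could look like the following (illustrative only; the variable names are taken from the cells above):

import torch

def select_best_of_n(candidates, scores):
    # candidates: list of lists of decoded strings, one inner list per query
    # scores: list of 1-D tensors, one reward score per candidate
    best = []
    for cand, score in zip(candidates, scores):
        best_idx = torch.argmax(score).item()  # highest-reward candidate for this query
        best.append(cand[best_idx])
    return best

# e.g. best_responses = select_best_of_n(response_tensors_best_of, scores_best_of)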

View File

@ -80,7 +80,12 @@
"\n",
"from transformers import AutoTokenizer, pipeline\n",
"\n",
"from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model"
"from trl import (\n",
" PPOTrainer,\n",
" PPOConfig,\n",
" AutoModelForCausalLMWithValueHead,\n",
" create_reference_model,\n",
")"
]
},
{
@ -99,7 +104,11 @@
"sentiment_pipe_kwargs = {\"top_k\": None, \"function_to_apply\": \"none\"}\n",
"\n",
"config = PPOConfig(\n",
" model_name=\"lvwerra/gpt2-imdb\", steps=51200, learning_rate=1.41e-5, remove_unused_columns=False, log_with=\"wandb\"\n",
" model_name=\"lvwerra/gpt2-imdb\",\n",
" steps=51200,\n",
" learning_rate=1.41e-5,\n",
" remove_unused_columns=False,\n",
" log_with=\"wandb\",\n",
")\n",
"\n",
"txt_in_len = 5\n",
@ -236,10 +245,16 @@
],
"source": [
"dataset = dataset.map(\n",
" lambda x: {\"input_ids\": gpt2_tokenizer.encode(\" \" + x[\"review\"], return_tensors=\"pt\")[0, :txt_in_len]},\n",
" lambda x: {\n",
" \"input_ids\": gpt2_tokenizer.encode(\" \" + x[\"review\"], return_tensors=\"pt\")[\n",
" 0, :txt_in_len\n",
" ]\n",
" },\n",
" batched=False,\n",
")\n",
"dataset = dataset.map(lambda x: {\"query\": gpt2_tokenizer.decode(x[\"input_ids\"])}, batched=False)\n",
"dataset = dataset.map(\n",
" lambda x: {\"query\": gpt2_tokenizer.decode(x[\"input_ids\"])}, batched=False\n",
")\n",
"dataset = dataset[:20480]\n",
"\n",
"from datasets import Dataset\n",
@ -353,7 +368,9 @@
}
],
"source": [
"ppo_trainer = PPOTrainer(config, gpt2_model, gpt2_ref_model, gpt2_tokenizer, dataset, data_collator=collator)"
"ppo_trainer = PPOTrainer(\n",
" config, gpt2_model, gpt2_ref_model, gpt2_tokenizer, dataset, data_collator=collator\n",
")"
]
},
{
@ -374,7 +391,9 @@
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
"else:\n",
" device = ppo_trainer.accelerator.device\n",
"sentiment_pipe = pipeline(\"sentiment-analysis\", \"lvwerra/distilbert-imdb\", device=device)"
"sentiment_pipe = pipeline(\n",
" \"sentiment-analysis\", \"lvwerra/distilbert-imdb\", device=device\n",
")"
]
},
{
@ -510,8 +529,13 @@
"outputs": [],
"source": [
"ctrl_str = [\"[negative]\", \"[neutral]\", \"[positive]\"]\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\") # this should be handled by accelerate\n",
"ctrl_tokens = dict((s, gpt2_tokenizer.encode(s, return_tensors=\"pt\").squeeze().to(device)) for s in ctrl_str)"
"device = torch.device(\n",
" \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
") # this should be handled by accelerate\n",
"ctrl_tokens = dict(\n",
" (s, gpt2_tokenizer.encode(s, return_tensors=\"pt\").squeeze().to(device))\n",
" for s in ctrl_str\n",
")"
]
},
{
@ -721,7 +745,10 @@
"source": [
"for epoch in range(2):\n",
" for batch in tqdm(ppo_trainer.dataloader):\n",
" (logs, game_data,) = (\n",
" (\n",
" logs,\n",
" game_data,\n",
" ) = (\n",
" dict(),\n",
" dict(),\n",
" )\n",
@ -729,14 +756,19 @@
" #### prepend a random control token\n",
" task_list = choices(ctrl_str, k=config.batch_size)\n",
" game_data[\"query\"] = [t + q for t, q in zip(task_list, batch[\"query\"])]\n",
" query_tensors = [torch.cat((ctrl_tokens[t], input_ids)) for t, input_ids in zip(task_list, batch[\"input_ids\"])]\n",
" query_tensors = [\n",
" torch.cat((ctrl_tokens[t], input_ids))\n",
" for t, input_ids in zip(task_list, batch[\"input_ids\"])\n",
" ]\n",
"\n",
" #### get response from gpt2\n",
" response_tensors = []\n",
" for query in query_tensors:\n",
" response = ppo_trainer.generate(query, **generation_kwargs)\n",
" response_tensors.append(response.squeeze()[-txt_out_len:])\n",
" game_data[\"response\"] = [gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors]\n",
" game_data[\"response\"] = [\n",
" gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors\n",
" ]\n",
"\n",
" #### sentiment analysis\n",
" texts = [q + r for q, r in zip(batch[\"query\"], game_data[\"response\"])]\n",
@ -749,7 +781,9 @@
"\n",
" for cs in ctrl_str:\n",
" key = \"env/reward_\" + cs.strip(\"[]\")\n",
" stats[key] = np.mean([r.cpu().numpy() for r, t in zip(rewards, task_list) if t == cs])\n",
" stats[key] = np.mean(\n",
" [r.cpu().numpy() for r, t in zip(rewards, task_list) if t == cs]\n",
" )\n",
" ppo_trainer.log_stats(stats, game_data, rewards)"
]
},
@ -804,7 +838,10 @@
"source": [
"for ctrl_s in ctrl_str:\n",
" plt.hist(\n",
" [r for r, t in zip(logs[\"env/reward_dist\"], task_list) if t == ctrl_s], density=True, alpha=0.5, label=ctrl_s\n",
" [r for r, t in zip(logs[\"env/reward_dist\"], task_list) if t == ctrl_s],\n",
" density=True,\n",
" alpha=0.5,\n",
" label=ctrl_s,\n",
" )\n",
"plt.legend(loc=\"best\")\n",
"plt.title(\"reward distribution\")\n",

View File

@ -136,7 +136,12 @@
"metadata": {},
"outputs": [],
"source": [
"def build_dataset(config, dataset_name=\"stanfordnlp/imdb\", input_min_text_length=2, input_max_text_length=8):\n",
"def build_dataset(\n",
" config,\n",
" dataset_name=\"stanfordnlp/imdb\",\n",
" input_min_text_length=2,\n",
" input_max_text_length=8,\n",
"):\n",
" \"\"\"\n",
" Build dataset for training. This builds the dataset from `load_dataset`, one should\n",
" customize this function to train the model on its own dataset.\n",
@ -223,7 +228,9 @@
"metadata": {},
"outputs": [],
"source": [
"ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)"
"ppo_trainer = PPOTrainer(\n",
" config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator\n",
")"
]
},
{
@ -243,7 +250,9 @@
"device = ppo_trainer.accelerator.device\n",
"if ppo_trainer.accelerator.num_processes == 1:\n",
" device = 0 if torch.cuda.is_available() else \"cpu\" # to avoid a `pipeline` bug\n",
"sentiment_pipe = pipeline(\"sentiment-analysis\", model=\"lvwerra/distilbert-imdb\", device=device)"
"sentiment_pipe = pipeline(\n",
" \"sentiment-analysis\", model=\"lvwerra/distilbert-imdb\", device=device\n",
")"
]
},
{
@ -311,7 +320,13 @@
"metadata": {},
"outputs": [],
"source": [
"gen_kwargs = {\"min_length\": -1, \"top_k\": 0.0, \"top_p\": 1.0, \"do_sample\": True, \"pad_token_id\": tokenizer.eos_token_id}"
"gen_kwargs = {\n",
" \"min_length\": -1,\n",
" \"top_k\": 0.0,\n",
" \"top_p\": 1.0,\n",
" \"do_sample\": True,\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
"}"
]
},
{
@ -378,7 +393,12 @@
" #### Compute sentiment score\n",
" texts = [q + r for q, r in zip(batch[\"query\"], batch[\"response\"])]\n",
" pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n",
" positive_scores = [item[\"score\"] for output in pipe_outputs for item in output if item[\"label\"] == \"POSITIVE\"]\n",
" positive_scores = [\n",
" item[\"score\"]\n",
" for output in pipe_outputs\n",
" for item in output\n",
" if item[\"label\"] == \"POSITIVE\"\n",
" ]\n",
" rewards = [torch.tensor(score) for score in positive_scores]\n",
"\n",
" #### Run PPO step\n",
@ -673,27 +693,45 @@
" query = torch.tensor(query_tensors[i]).to(device)\n",
"\n",
" gen_len = output_length_sampler()\n",
" query_response = ref_model.generate(query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" query_response = ref_model.generate(\n",
" query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()\n",
" response_len = len(query_response) - len(query)\n",
" response_tensors_ref.append(query_response[-response_len:])\n",
"\n",
" query_response = model.generate(query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs).squeeze()\n",
" query_response = model.generate(\n",
" query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs\n",
" ).squeeze()\n",
" response_len = len(query_response) - len(query)\n",
" response_tensors.append(query_response[-response_len:])\n",
"\n",
"#### decode responses\n",
"game_data[\"response (before)\"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]\n",
"game_data[\"response (after)\"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]\n",
"game_data[\"response (before)\"] = [\n",
" tokenizer.decode(response_tensors_ref[i]) for i in range(bs)\n",
"]\n",
"game_data[\"response (after)\"] = [\n",
" tokenizer.decode(response_tensors[i]) for i in range(bs)\n",
"]\n",
"\n",
"#### sentiment analysis of query/response pairs before/after\n",
"texts = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (before)\"])]\n",
"pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n",
"positive_scores = [item[\"score\"] for output in pipe_outputs for item in output if item[\"label\"] == \"POSITIVE\"]\n",
"positive_scores = [\n",
" item[\"score\"]\n",
" for output in pipe_outputs\n",
" for item in output\n",
" if item[\"label\"] == \"POSITIVE\"\n",
"]\n",
"game_data[\"rewards (before)\"] = positive_scores\n",
"\n",
"texts = [q + r for q, r in zip(game_data[\"query\"], game_data[\"response (after)\"])]\n",
"pipe_outputs = sentiment_pipe(texts, **sent_kwargs)\n",
"positive_scores = [item[\"score\"] for output in pipe_outputs for item in output if item[\"label\"] == \"POSITIVE\"]\n",
"positive_scores = [\n",
" item[\"score\"]\n",
" for output in pipe_outputs\n",
" for item in output\n",
" if item[\"label\"] == \"POSITIVE\"\n",
"]\n",
"game_data[\"rewards (after)\"] = positive_scores\n",
"\n",
"# store results in a dataframe\n",

View File

@ -233,7 +233,6 @@ eval_dataset = eval_dataset.filter(
class RewardDataCollatorWithPadding:
tokenizer: PreTrainedTokenizerBase
padding: Union[bool, str, PaddingStrategy] = True
max_length: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
return_tensors: str = "pt"
@ -256,14 +255,12 @@ class RewardDataCollatorWithPadding:
batch_j = self.tokenizer.pad(
features_j,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors=self.return_tensors,
)
batch_k = self.tokenizer.pad(
features_k,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors=self.return_tensors,
)
@ -308,7 +305,7 @@ trainer = RewardTrainer(
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=script_args.max_length),
data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer),
)

View File

@ -237,7 +237,7 @@ if __name__ == "__main__":
beta=script_args.beta,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
processing_class=tokenizer,
peft_config=peft_config,
max_prompt_length=script_args.max_prompt_length,
max_length=script_args.max_length,

View File

@ -187,7 +187,7 @@ trainer = SFTTrainer(
peft_config=peft_config,
max_seq_length=None,
formatting_func=prepare_sample_text,
tokenizer=tokenizer,
processing_class=tokenizer,
args=training_args,
)
trainer.train()

View File

@ -45,7 +45,7 @@ class ScriptArguments:
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
def exact_match_reward(responses, answers=None):
@ -90,12 +90,12 @@ lora_config = LoraConfig(
# set up models
model = AutoModelForCausalLMWithValueHead.from_pretrained(
args.model_name,
script_args.model_name,
use_auth_token=True,
load_in_4bit=True,
peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=True)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("openai/gsm8k", "main", split="train")
@ -107,7 +107,7 @@ ds_test = load_dataset("openai/gsm8k", "main", split="test")
ds_test = ds_test.rename_columns({"question": "query"})
ds_test = ds_test.map(lambda x: {"answer": x["answer"].split("#### ")[1]})
test_dataloader = torch.utils.data.DataLoader(ds_test, batch_size=args.batch_size)
test_dataloader = torch.utils.data.DataLoader(ds_test, batch_size=script_args.batch_size)
# prompt
prompt = """\
@ -138,16 +138,16 @@ generation_kwargs = {
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"eos_token_id": -1,
"max_new_tokens": args.max_new_tokens,
"max_new_tokens": script_args.max_new_tokens,
}
# trainer
ppo_config = PPOConfig(
batch_size=args.batch_size,
learning_rate=args.learning_rate,
mini_batch_size=args.mini_batch_size,
ppo_epochs=args.ppo_epochs,
gradient_accumulation_steps=args.gradient_accumulation_steps,
batch_size=script_args.batch_size,
learning_rate=script_args.learning_rate,
mini_batch_size=script_args.mini_batch_size,
ppo_epochs=script_args.ppo_epochs,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
log_with="wandb",
tracker_project_name="trl-gsm8k",
remove_unused_columns=False,
@ -169,7 +169,7 @@ text_env = TextEnvironment(
)
# main training loop
for epoch in range(args.n_epochs):
for epoch in range(script_args.n_epochs):
for step, batch in enumerate(ppo_trainer.dataloader):
if (step == 0) and (epoch % 4 == 0): # evaluate every 4 epochs
reward_mean_test = evaluate(test_dataloader, text_env, ppo_trainer)
@ -190,4 +190,4 @@ for epoch in range(args.n_epochs):
ppo_trainer.log_stats(train_stats, texts, rewards, columns_to_log=["query", "response", "answer"])
reward_mean_test = evaluate(test_dataloader, text_env, ppo_trainer)
ppo_trainer.save_pretrained(f"model/{args.model_name}-gsm8k")
ppo_trainer.save_pretrained(f"model/{script_args.model_name}-gsm8k")

View File

@ -45,7 +45,7 @@ class ScriptArguments:
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
lora_config = LoraConfig(
r=16,
@ -58,13 +58,13 @@ lora_config = LoraConfig(
# set up models
model = AutoModelForCausalLMWithValueHead.from_pretrained(
args.model_name,
script_args.model_name,
use_auth_token=True,
trust_remote_code=True,
load_in_4bit=True,
peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=True)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
# system prompt
@ -90,24 +90,24 @@ generation_kwargs = {
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"eos_token_id": -1,
"max_new_tokens": args.max_new_tokens,
"max_new_tokens": script_args.max_new_tokens,
}
# trainer
config = PPOConfig(
batch_size=args.batch_size,
model_name=args.model_name,
learning_rate=args.learning_rate,
log_with=args.log_with,
mini_batch_size=args.mini_batch_size,
ppo_epochs=args.ppo_epochs,
gradient_accumulation_steps=args.gradient_accumulation_steps,
seed=args.seed,
batch_size=script_args.batch_size,
model_name=script_args.model_name,
learning_rate=script_args.learning_rate,
log_with=script_args.log_with,
mini_batch_size=script_args.mini_batch_size,
ppo_epochs=script_args.ppo_epochs,
gradient_accumulation_steps=script_args.gradient_accumulation_steps,
seed=script_args.seed,
optimize_cuda_cache=True,
)
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)
dataset = load_dataset("mandarjoshi/trivia_qa", "rc", split="train")
local_seed = args.seed + ppo_trainer.accelerator.process_index * 100003 # Prime
local_seed = script_args.seed + ppo_trainer.accelerator.process_index * 100003 # Prime
dataset = dataset.shuffle(local_seed)
@ -175,7 +175,7 @@ def print_trainable_parameters(model):
print_trainable_parameters(model)
# main training loop
for i in range(args.iterations):
for i in range(script_args.iterations):
tasks, answers = generate_data(config.batch_size)
queries, responses, masks, rewards, histories = text_env.run(tasks, answers=answers)
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
@ -189,4 +189,4 @@ for i in range(args.iterations):
all_rewards = ppo_trainer.accelerator.gather(torch.tensor(rewards, device=ppo_trainer.accelerator.device))
ppo_trainer.log_stats(train_stats, texts, list(all_rewards), columns_to_log=["query", "response", "answer"])
if i % 100 == 0:
ppo_trainer.save_pretrained(f"models/{args.model_name}_{args.seed}_{i}_triviaqa")
ppo_trainer.save_pretrained(f"models/{script_args.model_name}_{script_args.seed}_{i}_triviaqa")

View File

@ -106,8 +106,8 @@ def image_outputs_logger(image_pair_data, global_step, accelerate_logger):
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, AlignPropConfig))
args, alignprop_config = parser.parse_args_into_dataclasses()
alignprop_config.project_kwargs = {
script_args, training_args = parser.parse_args_into_dataclasses()
training_args.project_kwargs = {
"logging_dir": "./logs",
"automatic_checkpoint_naming": True,
"total_limit": 5,
@ -115,11 +115,13 @@ if __name__ == "__main__":
}
pipeline = DefaultDDPOStableDiffusionPipeline(
args.pretrained_model, pretrained_model_revision=args.pretrained_revision, use_lora=args.use_lora
script_args.pretrained_model,
pretrained_model_revision=script_args.pretrained_revision,
use_lora=script_args.use_lora,
)
trainer = AlignPropTrainer(
alignprop_config,
aesthetic_scorer(args.hf_hub_aesthetic_model_id, args.hf_hub_aesthetic_model_filename),
training_args,
aesthetic_scorer(script_args.hf_hub_aesthetic_model_id, script_args.hf_hub_aesthetic_model_filename),
prompt_fn,
pipeline,
image_samples_hook=image_outputs_logger,
@ -127,4 +129,7 @@ if __name__ == "__main__":
trainer.train()
trainer.push_to_hub(args.hf_hub_model_id)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -17,7 +17,9 @@ Run the BCO training script with the commands below. In general, the optimal con
# Full training:
python examples/scripts/bco.py \
--model_name_or_path=nnheui/stablelm-2-1_6b-sft-full \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--trust_remote_code \
--dataset_name trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32 \
--num_train_epochs 1 \
@ -66,88 +68,15 @@ python examples/scripts/bco.py \
--lora_alpha=16
"""
import logging
from dataclasses import dataclass
from functools import partial
from typing import Literal, Optional
import torch
import torch.nn.functional as F
from accelerate import Accelerator, PartialState
from datasets import Dataset, load_dataset
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, PreTrainedModel
from trl import BCOConfig, BCOTrainer, ModelConfig, get_peft_config, setup_chat_format
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
The arguments for the BCO training script.
"""
llm_name: Literal["gpt-3.5-turbo", "llama-2-7b-chat", "llama-2-70b-chat"] = "gpt-3.5-turbo"
def build_helpfulness_dataset(llm_name: str, num_proc: Optional[int] = None) -> Dataset:
"""
Filter `llm_name` completions and binarize given their helpfulness score.
If helpfulness score is 5, it is desirable. Otherwise, it is undesirable.
"""
def get_model_rating(example, metric: str, llm_name: str):
try:
model_index = example["models"].index(llm_name)
return {metric: int(example["completions"][model_index]["annotations"][metric]["Rating"])}
except ValueError as e:
logging.warning(e)
return -1
def get_model_response(example, llm_name: str):
try:
model_index = example["models"].index(llm_name)
return {"response": example["completions"][model_index]["response"]}
except ValueError as e:
logging.warning(e)
return -1
dataset = load_dataset("openbmb/UltraFeedback")["train"]
dataset = dataset.filter(lambda example: llm_name in example["models"], batched=False, num_proc=num_proc)
dataset = dataset.filter(
lambda example: len(example["models"]) == len(example["completions"]), batched=False, num_proc=num_proc
)
METRIC = "helpfulness"
dataset = dataset.map(
get_model_rating,
batched=False,
fn_kwargs={"metric": METRIC, "llm_name": llm_name},
num_proc=num_proc,
)
dataset = dataset.map(
get_model_response,
batched=False,
fn_kwargs={"llm_name": llm_name},
num_proc=num_proc,
)
dataset = dataset.select_columns(["source", "instruction", "response", "helpfulness"])
dataset = dataset.rename_columns({"instruction": "prompt", "response": "completion"})
dataset = dataset.map(lambda example: {"label": example["helpfulness"] >= 5}, batched=False, num_proc=num_proc)
dataset = dataset.map(
lambda example: {"prompt": [{"role": "user", "content": example["prompt"]}]},
batched=False,
num_proc=num_proc,
)
dataset = dataset.train_test_split(test_size=0.05, seed=42)
return dataset
from trl import BCOConfig, BCOTrainer, ModelConfig, ScriptArguments, get_peft_config, setup_chat_format
def embed_prompt(input_ids: torch.LongTensor, attention_mask: torch.LongTensor, model: PreTrainedModel):
@ -175,9 +104,9 @@ def embed_prompt(input_ids: torch.LongTensor, attention_mask: torch.LongTensor,
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, BCOConfig, ModelConfig))
script_args, bco_args, model_args = parser.parse_args_into_dataclasses()
script_args, training_args, model_args = parser.parse_args_into_dataclasses()
bco_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
# Load a pretrained model
model = AutoModelForCausalLM.from_pretrained(
@ -197,19 +126,7 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
model, tokenizer = setup_chat_format(model, tokenizer)
# Apply chat template
def format_dataset(example):
example["prompt"] = tokenizer.apply_chat_template(
example["prompt"], tokenize=False, add_generation_prompt=True
)
return example
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
# Load the dataset
dataset = build_helpfulness_dataset(script_args.llm_name, num_proc=bco_args.dataset_num_proc)
dataset = dataset.map(format_dataset, batched=False, num_proc=bco_args.dataset_num_proc)
dataset = load_dataset(script_args.dataset_name)
accelerator = Accelerator()
embedding_model = AutoModel.from_pretrained(
@ -229,18 +146,22 @@ if __name__ == "__main__":
)
# Initialize the BCO trainer
bco_trainer = BCOTrainer(
trainer = BCOTrainer(
model,
ref_model,
args=bco_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_args),
embedding_func=embedding_func,
embedding_tokenizer=embedding_tokenizer,
)
# Train and push the model to the Hub
bco_trainer.train()
bco_trainer.save_model(bco_args.output_dir)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -14,16 +13,12 @@
# limitations under the License.
from trl.commands.cli_utils import init_zero_verbose
init_zero_verbose()
import copy
import json
import os
import sys
import pwd
import re
import sys
import time
from threading import Thread
@ -33,10 +28,13 @@ from rich.live import Live
from rich.markdown import Markdown
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from trl.commands.cli_utils import ChatArguments, TrlParser, init_zero_verbose
from trl import TrlParser, init_zero_verbose
from trl.commands.cli_utils import ChatArguments
from trl.trainer.utils import get_quantization_config
init_zero_verbose()
HELP_STRING = """\
**TRL CHAT INTERFACE**
@ -275,7 +273,7 @@ def chat_cli():
user = args.user
model, tokenizer = load_model_and_tokenizer(args)
generation_streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
pad_token_id, eos_token_ids = parse_eos_tokens(tokenizer, args.eos_tokens, args.eos_token_ids)
@ -337,10 +335,13 @@ def chat_cli():
chat.append({"role": "user", "content": user_input})
inputs = tokenizer.apply_chat_template(chat, return_tensors="pt", add_generation_prompt=True).to(
model.device
)
attention_mask = torch.ones_like(inputs)
generation_kwargs = dict(
inputs=tokenizer.apply_chat_template(chat, return_tensors="pt", add_generation_prompt=True).to(
model.device
),
inputs=inputs,
attention_mask=attention_mask,
streamer=generation_streamer,
max_new_tokens=current_args.max_new_tokens,
do_sample=current_args.do_sample,
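
The change above builds the prompt ids once, creates an explicit attention mask, and tells the streamer to skip the prompt. The mask matters because the chat CLI sets the pad token to the eos token, so `generate` cannot infer which positions are padding. A minimal sketch of the same pattern outside the CLI (using a small placeholder model; the real script uses the user-supplied model and `apply_chat_template`):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # pad == eos, as in the chat CLI

inputs = tokenizer("Hello, how are you?", return_tensors="pt").input_ids
attention_mask = torch.ones_like(inputs)           # single unpadded prompt -> all ones
output = model.generate(
    inputs=inputs,
    attention_mask=attention_mask,                 # passed explicitly, as in the hunk above
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))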

View File

@ -17,6 +17,7 @@ In general, the optimal configuration for CPO will be similar to that of DPO:
# regular:
python examples/scripts/cpo.py \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path=gpt2 \
--per_device_train_batch_size 4 \
--max_steps 1000 \
@ -33,6 +34,7 @@ python examples/scripts/cpo.py \
# peft:
python examples/scripts/cpo.py \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path=gpt2 \
--per_device_train_batch_size 4 \
--max_steps 1000 \
@ -52,27 +54,16 @@ python examples/scripts/cpo.py \
--lora_alpha=16
"""
from dataclasses import dataclass, field
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
from trl import CPOConfig, CPOTrainer, ModelConfig, get_peft_config
from trl import CPOConfig, CPOTrainer, ModelConfig, ScriptArguments, get_peft_config
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
@dataclass
class ScriptArguments:
dataset_name: str = field(
default="trl-internal-testing/hh-rlhf-helpful-base-trl-style",
metadata={"help": "The name of the dataset to use."},
)
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, CPOConfig, ModelConfig))
args, cpo_args, model_config = parser.parse_args_into_dataclasses()
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
################
# Model & Tokenizer
@ -89,32 +80,26 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
def process(row):
row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
return row
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = dataset.map(process, num_proc=cpo_args.dataset_num_proc)
################
# Training
################
trainer = CPOTrainer(
model,
args=cpo_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_config),
)
# train and save the model
trainer.train()
trainer.save_model(cpo_args.output_dir)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -185,8 +185,8 @@ def image_outputs_logger(image_data, global_step, accelerate_logger):
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, DDPOConfig))
args, ddpo_config = parser.parse_args_into_dataclasses()
ddpo_config.project_kwargs = {
script_args, training_args = parser.parse_args_into_dataclasses()
training_args.project_kwargs = {
"logging_dir": "./logs",
"automatic_checkpoint_naming": True,
"total_limit": 5,
@ -194,12 +194,14 @@ if __name__ == "__main__":
}
pipeline = DefaultDDPOStableDiffusionPipeline(
args.pretrained_model, pretrained_model_revision=args.pretrained_revision, use_lora=args.use_lora
script_args.pretrained_model,
pretrained_model_revision=script_args.pretrained_revision,
use_lora=script_args.use_lora,
)
trainer = DDPOTrainer(
ddpo_config,
aesthetic_scorer(args.hf_hub_aesthetic_model_id, args.hf_hub_aesthetic_model_filename),
training_args,
aesthetic_scorer(script_args.hf_hub_aesthetic_model_id, script_args.hf_hub_aesthetic_model_filename),
prompt_fn,
pipeline,
image_samples_hook=image_outputs_logger,
@ -207,4 +209,7 @@ if __name__ == "__main__":
trainer.train()
trainer.push_to_hub(args.hf_hub_model_id)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -47,27 +46,26 @@ python examples/scripts/dpo.py \
--lora_alpha 16
"""
from trl.commands.cli_utils import DPOScriptArguments, TrlParser
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import PartialState
from trl import (
DPOConfig,
DPOTrainer,
ModelConfig,
ScriptArguments,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
maybe_extract_prompt,
maybe_apply_chat_template,
)
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
if __name__ == "__main__":
parser = TrlParser((DPOScriptArguments, DPOConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
################
# Model & Tokenizer
@ -103,7 +101,7 @@ if __name__ == "__main__":
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
if args.ignore_bias_buffers:
if script_args.ignore_bias_buffers:
# torch distributed hack
model._ddp_params_and_buffers_to_ignore = [
name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
@ -112,13 +110,7 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
with PartialState().local_main_process_first():
dataset = dataset.map(maybe_extract_prompt, num_proc=training_args.dataset_num_proc)
dataset = dataset.map(
maybe_apply_chat_template, num_proc=training_args.dataset_num_proc, fn_kwargs={"tokenizer": tokenizer}
)
dataset = load_dataset(script_args.dataset_name)
##########
# Training
@ -127,14 +119,20 @@ if __name__ == "__main__":
model,
ref_model,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=peft_config,
)
trainer.train()
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
if training_args.eval_strategy != "no":
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -43,24 +42,30 @@ python examples/scripts/dpo_online.py \
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
from trl import (
DPOScriptArguments,
HfPairwiseJudge,
LogCompletionsCallback,
ModelConfig,
OnlineDPOConfig,
OnlineDPOTrainer,
OpenAIPairwiseJudge,
PairRMJudge,
ScriptArguments,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
LogCompletionsCallback,
)
from trl.commands.cli_utils import TrlParser
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
JUDGES = {"pair_rm": PairRMJudge, "openai": OpenAIPairwiseJudge, "hf": HfPairwiseJudge}
if __name__ == "__main__":
parser = TrlParser((DPOScriptArguments, OnlineDPOConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
args.gradient_checkpointing_kwargs = {"use_reentrant": True}
parser = TrlParser((ScriptArguments, OnlineDPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
script_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
torch_dtype = (
model_config.torch_dtype
@ -81,12 +86,28 @@ if __name__ == "__main__":
model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, **model_kwargs
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path,
num_labels=1,
trust_remote_code=model_config.trust_remote_code,
**model_kwargs,
)
if training_args.reward_model_path is not None:
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path,
num_labels=1,
trust_remote_code=model_config.trust_remote_code,
**model_kwargs,
)
reward_tokenizer = AutoTokenizer.from_pretrained(
training_args.reward_model_path,
trust_remote_code=model_config.trust_remote_code,
truncation=True,
truncation_side="left", # since we judge the completion, truncating left is more appropriate
)
else:
reward_model = None
reward_tokenizer = None
if training_args.judge is not None:
judge_cls = JUDGES[training_args.judge]
judge = judge_cls()
else:
judge = None
tokenizer = AutoTokenizer.from_pretrained(
model_config.model_name_or_path,
@ -99,20 +120,30 @@ if __name__ == "__main__":
if tokenizer.pad_token_id is None:
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
trainer = OnlineDPOTrainer(
model=model,
reward_model=reward_model,
judge=judge,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
reward_processing_class=reward_tokenizer,
peft_config=get_peft_config(model_config),
)
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
if training_args.eval_strategy != "no":
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -13,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
accelerate launch examples/scripts/dpo_visual.py \
accelerate launch examples/scripts/dpo_vlm.py \
--dataset_name HuggingFaceH4/rlaif-v_formatted \
--model_name_or_path HuggingFaceM4/idefics2-8b \
--per_device_train_batch_size 2 \
@ -27,9 +26,6 @@ accelerate launch examples/scripts/dpo_visual.py \
--lora_target_modules=all-linear
"""
from trl.commands.cli_utils import DPOScriptArguments, TrlParser
from accelerate import PartialState
import torch
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
@ -38,6 +34,8 @@ from trl import (
DPOConfig,
DPOTrainer,
ModelConfig,
ScriptArguments,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
@ -45,8 +43,8 @@ from trl import (
if __name__ == "__main__":
parser = TrlParser((DPOScriptArguments, DPOConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
################
# Model & Tokenizer
@ -96,7 +94,7 @@ if __name__ == "__main__":
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
if args.ignore_bias_buffers:
if script_args.ignore_bias_buffers:
# torch distributed hack
model._ddp_params_and_buffers_to_ignore = [
name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
@ -105,18 +103,7 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
def process(row):
row["prompt"] = processor.apply_chat_template(row["prompt"], tokenize=False)
row["chosen"] = processor.apply_chat_template(row["chosen"], tokenize=False)
row["rejected"] = processor.apply_chat_template(row["rejected"], tokenize=False)
return row
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = dataset.map(process, num_proc=training_args.dataset_num_proc)
dataset = load_dataset(script_args.dataset_name)
################
# Training
@ -125,11 +112,15 @@ if __name__ == "__main__":
model,
ref_model,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=processor,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=processor,
peft_config=peft_config,
)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)

View File

@ -59,12 +59,12 @@ class ScriptArguments:
# Parse the arguments
parser = HfArgumentParser(ScriptArguments)
args = parser.parse_args_into_dataclasses()[0]
script_args = parser.parse_args_into_dataclasses()[0]
# Load the dataset
dataset = load_dataset("trl-lib/tldr", split="validation")
if args.num_examples is not None:
dataset = dataset.select(range(args.num_examples))
if script_args.num_examples is not None:
dataset = dataset.select(range(script_args.num_examples))
# Extract the prompts and reference completions
prompts = dataset["prompt"]
@ -72,15 +72,15 @@ reference_completions = dataset["completion"]
# Generate the model completions
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=200) # very generous max token length
llm = LLM(model=args.model_name_or_path, tensor_parallel_size=1)
llm = LLM(model=script_args.model_name_or_path, tensor_parallel_size=1)
outputs = llm.generate(prompts, sampling_params)
model_completions = [output.outputs[0].text.strip() for output in outputs]
# Judge the outputs
if "gpt" in args.judge_model:
judge = OpenAIPairwiseJudge(args.judge_model)
if "gpt" in script_args.judge_model:
judge = OpenAIPairwiseJudge(script_args.judge_model)
else:
judge = HfPairwiseJudge(args.judge_model)
judge = HfPairwiseJudge(script_args.judge_model)
completions = [[c0, c1] for c0, c1 in zip(reference_completions, model_completions)]
best_idxs = judge.judge(prompts, completions)
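The script stops at best_idxs; a minimal sketch of turning those judgments into a win rate (the helper below is ours, not part of the script):
def win_rate(best_idxs):
    # completions were built as [reference, model], so index 1 means the
    # judge preferred the model completion over the reference one.
    return sum(idx == 1 for idx in best_idxs) / len(best_idxs)
# e.g. print(f"Model win rate: {win_rate(best_idxs):.2%}")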

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -45,26 +44,26 @@ python examples/scripts/gkd.py \
--lora_alpha 16
"""
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoTokenizer, GenerationConfig
from trl import (
GKDConfig,
GKDTrainer,
LogCompletionsCallback,
ModelConfig,
ScriptArguments,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
maybe_apply_chat_template,
LogCompletionsCallback,
)
from trl.commands.cli_utils import SFTScriptArguments, TrlParser
from accelerate import PartialState
if __name__ == "__main__":
parser = TrlParser((SFTScriptArguments, GKDConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
parser = TrlParser((ScriptArguments, GKDConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
################
# Model & Tokenizer
@ -94,6 +93,7 @@ if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained(
model_config.model_name_or_path,
revision=model_config.model_revision,
trust_remote_code=model_config.trust_remote_code,
padding_side="left",
)
@ -103,7 +103,7 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
with PartialState().local_main_process_first():
dataset = dataset.map(
@ -120,16 +120,22 @@ if __name__ == "__main__":
model=model_config.model_name_or_path,
teacher_model=training_args.teacher_model_name_or_path,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_config),
)
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
if training_args.eval_strategy != "no":
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
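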

View File

@ -17,6 +17,7 @@ Run the KTO training script with the commands below. In general, the optimal con
# Full training:
python examples/scripts/kto.py \
--dataset_name trl-lib/kto-mix-14k \
--model_name_or_path=trl-lib/qwen1.5-1.8b-sft \
--per_device_train_batch_size 16 \
--num_train_epochs 1 \
@ -33,6 +34,7 @@ python examples/scripts/kto.py \
# QLoRA:
python examples/scripts/kto.py \
--dataset_name trl-lib/kto-mix-14k \
--model_name_or_path=trl-lib/qwen1.5-1.8b-sft \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
@ -53,28 +55,22 @@ python examples/scripts/kto.py \
--lora_alpha=16
"""
from dataclasses import dataclass
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
from trl import KTOConfig, KTOTrainer, ModelConfig, get_peft_config, maybe_unpair_preference_dataset, setup_chat_format
# Define and parse arguments.
@dataclass
class ScriptArguments:
"""
The arguments for the KTO training script.
"""
dataset_name: str = "trl-lib/kto-mix-14k"
from trl import (
KTOConfig,
KTOTrainer,
ModelConfig,
ScriptArguments,
get_peft_config,
setup_chat_format,
)
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, KTOConfig, ModelConfig))
script_args, kto_args, model_args = parser.parse_args_into_dataclasses()
script_args, training_args, model_args = parser.parse_args_into_dataclasses()
# Load a pretrained model
model = AutoModelForCausalLM.from_pretrained(
@ -97,36 +93,21 @@ if __name__ == "__main__":
# Load the dataset
dataset = load_dataset(script_args.dataset_name)
# If needed, reformat a DPO-formatted dataset (prompt, chosen, rejected) to a KTO-format (prompt, completion, label)
dataset = maybe_unpair_preference_dataset(dataset, num_proc=kto_args.dataset_num_proc)
# Apply chat template
def format_dataset(example):
if isinstance(example["completion"], str):
example["prompt"] = tokenizer.apply_chat_template(example["prompt"], tokenize=False)
example["completion"] = tokenizer.apply_chat_template(example["completion"], tokenize=False)
else:
example["prompt"] = tokenizer.apply_chat_template(example["completion"][:-1], tokenize=False)
example["completion"] = tokenizer.apply_chat_template([example["completion"][-1]], tokenize=False)
return example
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = dataset.map(format_dataset, num_proc=kto_args.dataset_num_proc)
# Initialize the KTO trainer
kto_trainer = KTOTrainer(
trainer = KTOTrainer(
model,
ref_model,
args=kto_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_args),
)
# Train and push the model to the Hub
kto_trainer.train()
kto_trainer.save_model(kto_args.output_dir)
kto_trainer.push_to_hub()
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
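With the manual formatting removed, KTOTrainer handles chat templating and (un)pairing internally; an unpaired KTO record looks roughly like this (illustrative values, not taken from the dataset):
kto_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
    "label": True,  # True = desirable completion, False = undesirable
}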

View File

@ -1,5 +1,3 @@
# flake8: noqa
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -50,23 +48,29 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
from trl import (
DPOScriptArguments,
HfPairwiseJudge,
LogCompletionsCallback,
ModelConfig,
NashMDConfig,
NashMDTrainer,
OpenAIPairwiseJudge,
PairRMJudge,
ScriptArguments,
TrlParser,
get_kbit_device_map,
get_quantization_config,
LogCompletionsCallback,
)
from trl.commands.cli_utils import TrlParser
from trl.trainer.utils import SIMPLE_QUERY_CHAT_TEMPLATE
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
JUDGES = {"pair_rm": PairRMJudge, "openai": OpenAIPairwiseJudge, "hf": HfPairwiseJudge}
if __name__ == "__main__":
parser = TrlParser((DPOScriptArguments, NashMDConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
args.gradient_checkpointing_kwargs = {"use_reentrant": True}
parser = TrlParser((ScriptArguments, NashMDConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
script_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
torch_dtype = (
model_config.torch_dtype
@ -89,9 +93,23 @@ if __name__ == "__main__":
ref_model = AutoModelForCausalLM.from_pretrained(
model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, **model_kwargs
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path, num_labels=1, trust_remote_code=model_config.trust_remote_code
)
if training_args.reward_model_path is not None:
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path,
num_labels=1,
trust_remote_code=model_config.trust_remote_code,
**model_kwargs,
)
else:
reward_model = None
if training_args.judge is not None:
judge_cls = JUDGES[training_args.judge]
judge = judge_cls()
else:
judge = None
tokenizer = AutoTokenizer.from_pretrained(
model_config.model_name_or_path,
padding_side="left",
@ -100,26 +118,31 @@ if __name__ == "__main__":
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_QUERY_CHAT_TEMPLATE
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
trainer = NashMDTrainer(
model=model,
ref_model=ref_model,
reward_model=reward_model,
judge=judge,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
)
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
# train the model
if training_args.eval_strategy != "no":
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
trainer.train()
# save the model
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
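A hedged sketch of extending the JUDGES registry above with a custom judge; we assume BasePairwiseJudge exposes the judge(prompts, completions, ...) interface used elsewhere in this compare, returning the index of the preferred completion per prompt:
from trl import BasePairwiseJudge
class LengthJudge(BasePairwiseJudge):
    """Toy judge that prefers the shorter completion (illustrative only)."""
    def judge(self, prompts, completions, shuffle_order=True):
        return [0 if len(pair[0]) <= len(pair[1]) else 1 for pair in completions]
JUDGES["length"] = LengthJudge  # would then be selectable with --judge length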

View File

@ -17,6 +17,7 @@ In general, the optimal configuration for ORPO will be similar to that of DPO wi
# regular:
python examples/scripts/orpo.py \
--dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style \
--model_name_or_path=gpt2 \
--per_device_train_batch_size 4 \
--max_steps 1000 \
@ -33,6 +34,7 @@ python examples/scripts/orpo.py \
# peft:
python examples/scripts/orpo.py \
--dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style \
--model_name_or_path=gpt2 \
--per_device_train_batch_size 4 \
--max_steps 1000 \
@ -52,27 +54,16 @@ python examples/scripts/orpo.py \
--lora_alpha=16
"""
from dataclasses import dataclass, field
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
from trl import ModelConfig, ORPOConfig, ORPOTrainer, get_peft_config
from trl import ModelConfig, ORPOConfig, ORPOTrainer, ScriptArguments, get_peft_config
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
@dataclass
class ScriptArguments:
dataset_name: str = field(
default="trl-internal-testing/hh-rlhf-helpful-base-trl-style",
metadata={"help": "The name of the dataset to use."},
)
if __name__ == "__main__":
parser = HfArgumentParser((ScriptArguments, ORPOConfig, ModelConfig))
args, orpo_args, model_config = parser.parse_args_into_dataclasses()
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
################
# Model & Tokenizer
@ -89,33 +80,26 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
def process(row):
row["prompt"] = tokenizer.apply_chat_template(row["chosen"][:-1], tokenize=False)
row["chosen"] = tokenizer.apply_chat_template([row["chosen"][-1]], tokenize=False)
row["rejected"] = tokenizer.apply_chat_template([row["rejected"][-1]], tokenize=False)
return row
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = dataset.map(process, num_proc=orpo_args.dataset_num_proc)
################
# Training
################
trainer = ORPOTrainer(
model,
args=orpo_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_config),
)
# train and save the model
trainer.train()
trainer.save_model(orpo_args.output_dir)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
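The prompt-extraction mapping could be dropped because ORPOTrainer now accepts the conversational preference pairs as-is; one record of the dataset above looks roughly like this (made-up content, implicit prompt shared by both sides):
example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}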

View File

@ -1,198 +0,0 @@
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
python examples/scripts/ppo.py \
--log_with=wandb
"""
from dataclasses import dataclass, field
from typing import Optional
import torch
from accelerate import Accelerator, PartialState
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoTokenizer, HfArgumentParser, is_torch_npu_available, is_torch_xpu_available, pipeline
from trl import AutoModelForCausalLMWithValueHead, AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler
tqdm.pandas()
@dataclass
class ScriptArguments:
use_seq2seq: bool = field(default=False, metadata={"help": "whether to use seq2seq"})
trust_remote_code: bool = field(default=False, metadata={"help": "Enable `trust_remote_code`"})
# LoraConfig
use_peft: bool = field(default=False, metadata={"help": "whether to use peft"})
lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
lora_r: Optional[int] = field(default=16, metadata={"help": "the lora r parameter"})
parser = HfArgumentParser((ScriptArguments, PPOConfig))
args, ppo_config = parser.parse_args_into_dataclasses()
# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the sentiment score for each token.
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}
trl_model_class = AutoModelForCausalLMWithValueHead if not args.use_seq2seq else AutoModelForSeq2SeqLMWithValueHead
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
tokenizer.pad_token = tokenizer.eos_token
# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.
def build_dataset(query_dataset, dataset_num_proc, input_min_text_length=2, input_max_text_length=8):
"""
Build dataset for training. This builds the dataset from `load_dataset`, one should
customize this function to train the model on its own dataset.
Args:
query_dataset (`str`):
The name of the dataset to be loaded.
Returns:
dataloader (`torch.utils.data.DataLoader`):
The dataloader for the dataset.
"""
# load imdb with datasets
ds = load_dataset(query_dataset, split="train")
ds = ds.rename_columns({"text": "review"})
ds = ds.filter(lambda x: len(x["review"]) > 200, num_proc=dataset_num_proc)
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample
ds = ds.map(tokenize, num_proc=dataset_num_proc)
ds.set_format(type="torch")
return ds
# We retrieve the dataloader by calling the `build_dataset` function.
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = build_dataset(ppo_config.query_dataset, ppo_config.dataset_num_proc)
def collator(data):
return {key: [d[key] for d in data] for key in data[0]}
# set seed before initializing value head for deterministic eval
set_seed(ppo_config.seed)
# Now let's build the model, the reference model, and the tokenizer.
if not args.use_peft:
ref_model = trl_model_class.from_pretrained(ppo_config.model_name, trust_remote_code=args.trust_remote_code)
device_map = None
peft_config = None
else:
peft_config = LoraConfig(
r=args.lora_r,
lora_alpha=args.lora_alpha,
bias="none",
task_type="CAUSAL_LM",
)
ref_model = None
# Copy the model to each device
device_map = {"": Accelerator().local_process_index}
model = trl_model_class.from_pretrained(
ppo_config.model_name,
trust_remote_code=args.trust_remote_code,
device_map=device_map,
peft_config=peft_config,
)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
tokenizer.pad_token_id = tokenizer.eos_token_id
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)
# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
if is_torch_xpu_available():
device = "xpu:0"
elif is_torch_npu_available():
device = "npu:0"
else:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
ds_plugin = ppo_trainer.accelerator.state.deepspeed_plugin
task, model_name = ppo_config.reward_model.split(":")
if ds_plugin is not None and ds_plugin.is_zero3_init_enabled():
with ds_plugin.zero3_init_context_manager(enable=False):
sentiment_pipe = pipeline(task, model=model_name, device=device)
else:
sentiment_pipe = pipeline(task, model=model_name, device=device)
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
if sentiment_pipe.tokenizer.pad_token_id is None:
sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id
if sentiment_pipe.model.config.pad_token_id is None:
sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id
# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
"max_new_tokens": 32,
}
for batch in tqdm(ppo_trainer.dataloader):
query_tensors = batch["input_ids"]
# Get response from gpt2
response_tensors, ref_response_tensors = ppo_trainer.generate(
query_tensors, return_prompt=False, generate_ref_response=True, **generation_kwargs
)
batch["response"] = tokenizer.batch_decode(response_tensors)
batch["ref_response"] = tokenizer.batch_decode(ref_response_tensors)
# Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
ref_texts = [q + r for q, r in zip(batch["query"], batch["ref_response"])]
ref_pipe_outputs = sentiment_pipe(ref_texts, **sent_kwargs)
ref_rewards = [torch.tensor(output[1]["score"]) for output in ref_pipe_outputs]
batch["ref_rewards"] = ref_rewards
# Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards, columns_to_log=["query", "response", "ref_response", "ref_rewards"])
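For readers of the (now removed) legacy loop above, a small sketch of how the reward is pulled out of the pipeline output; the label ordering assumes the default IMDb sentiment reward model, which is an assumption on our part:
import torch
# With return_all_scores=True and function_to_apply="none", each pipeline
# output is a list of {"label", "score"} dicts; index 1 is taken as the
# positive-sentiment logit and used directly as the PPO reward.
pipe_output = [{"label": "NEGATIVE", "score": -2.3}, {"label": "POSITIVE", "score": 2.7}]
reward = torch.tensor(pipe_output[1]["score"])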

View File

@ -23,12 +23,14 @@ from transformers import (
HfArgumentParser,
)
from trl import ModelConfig, PPOv2Config, PPOv2Trainer
from trl import ModelConfig, PPOConfig, PPOTrainer, ScriptArguments
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
"""
python -i examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 64 \
@ -39,6 +41,8 @@ python -i examples/scripts/ppo/ppo.py \
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--output_dir models/minimal/ppo \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
@ -50,16 +54,15 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path EleutherAI/pythia-1b-deduped \
--local_rollout_forward_batch_size 1 \
--deepspeed3 \
--missing_eos_penalty 1.0
"""
if __name__ == "__main__":
parser = HfArgumentParser((PPOv2Config, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
parser = HfArgumentParser((ScriptArguments, PPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
shutil.rmtree(training_args.output_dir, ignore_errors=True)
################
# Model & Tokenizer
@ -73,22 +76,22 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
value_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
################
# Dataset
################
dataset = load_dataset("trl-internal-testing/descriptiveness-sentiment-trl-style", split="descriptiveness")
eval_samples = 20
dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)
eval_samples = 100
train_dataset = dataset.select(range(len(dataset) - eval_samples))
eval_dataset = dataset.select(range(len(dataset) - eval_samples, len(dataset)))
dataset_text_field = "prompt"
@ -107,7 +110,7 @@ if __name__ == "__main__":
tokenize,
batched=True,
remove_columns=dataset.column_names,
num_proc=config.dataset_num_proc,
num_proc=training_args.dataset_num_proc,
)
# Compute that only on the main process for faster data processing.
@ -119,9 +122,9 @@ if __name__ == "__main__":
################
# Training
################
trainer = PPOv2Trainer(
config=config,
tokenizer=tokenizer,
trainer = PPOTrainer(
config=training_args,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@ -130,7 +133,10 @@ if __name__ == "__main__":
eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
if config.push_to_hub:
trainer.push_to_hub()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
trainer.generate_completions()

View File

@ -23,12 +23,14 @@ from transformers import (
HfArgumentParser,
)
from trl import ModelConfig, PPOv2Config, PPOv2Trainer
from trl import ModelConfig, PPOConfig, PPOTrainer, ScriptArguments
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
"""
python examples/scripts/ppo/ppo_tldr.py \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 1 \
@ -43,6 +45,8 @@ python examples/scripts/ppo/ppo_tldr.py \
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/ppo/ppo_tldr.py \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
@ -58,10 +62,10 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml
if __name__ == "__main__":
parser = HfArgumentParser((PPOv2Config, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
parser = HfArgumentParser((ScriptArguments, PPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
shutil.rmtree(training_args.output_dir, ignore_errors=True)
################
# Model & Tokenizer
@ -75,23 +79,23 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
value_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
################
# Dataset
################
dataset = load_dataset("trl-internal-testing/tldr-preference-sft-trl-style")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]
dataset = load_dataset(script_args.dataset_name)
train_dataset = dataset[script_args.dataset_train_split]
eval_dataset = dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None
def prepare_dataset(dataset, tokenizer):
"""pre-tokenize the dataset before training; only collate during training"""
@ -107,25 +111,27 @@ if __name__ == "__main__":
return dataset.map(
tokenize,
remove_columns=dataset.column_names,
num_proc=config.dataset_num_proc,
num_proc=training_args.dataset_num_proc,
)
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
train_dataset = prepare_dataset(train_dataset, tokenizer)
eval_dataset = prepare_dataset(eval_dataset, tokenizer)
if eval_dataset is not None:
eval_dataset = prepare_dataset(eval_dataset, tokenizer)
# filtering
train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=config.dataset_num_proc)
eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=config.dataset_num_proc)
train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)
if eval_dataset is not None:
eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)
assert train_dataset[0]["input_ids"][-1] != tokenizer.eos_token_id, "The last token should not be an EOS token"
################
# Training
################
trainer = PPOv2Trainer(
config=config,
tokenizer=tokenizer,
trainer = PPOTrainer(
config=training_args,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@ -134,7 +140,10 @@ if __name__ == "__main__":
eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
if config.push_to_hub:
trainer.push_to_hub()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
trainer.generate_completions()

View File

@ -1,163 +0,0 @@
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional
import torch
from accelerate import PartialState
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import (
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
is_torch_npu_available,
is_torch_xpu_available,
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import LengthSampler
input_min_text_length = 6
input_max_text_length = 12
@dataclass
class ScriptArguments:
"""
The name of the Causal LM model we wish to fine-tune with PPO
"""
model_name: Optional[str] = field(default="huggyllama/llama-7b", metadata={"help": "the model name"})
dataset_name: Optional[str] = field(default="Anthropic/hh-rlhf", metadata={"help": "the dataset name"})
rm_adapter: Optional[str] = field(
default="trl-lib/llama-7b-hh-rm-adapter", metadata={"help": "the rm adapter name"}
)
log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
use_safetensors: Optional[bool] = field(default=False, metadata={"help": "Use safetensors"})
seed: Optional[int] = field(default=0, metadata={"help": "the random seed"})
use_score_scaling: Optional[bool] = field(default=False, metadata={"help": "Use score scaling"})
use_score_norm: Optional[bool] = field(
default=False, metadata={"help": "Use score normalization. Only applicable if use_score_scaling is True"}
)
score_clip: Optional[float] = field(default=None, metadata={"help": "Score clipping"})
dataset_num_proc: Optional[int] = field(
default=None, metadata={"help": "The number of workers to use to tokenize the data"}
)
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
def create_and_prepare_dataset(tokenizer, num_proc):
dataset = load_dataset(script_args.dataset_name, split="train[:1%]")
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(example):
text_size = input_size()
example["input_ids"] = tokenizer.encode(example["chosen"])[:text_size]
example["query"] = tokenizer.decode(example["input_ids"])
return example
dataset = dataset.map(tokenize, batched=False, num_proc=num_proc)
dataset.set_format("torch")
return dataset
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
nf4_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
script_args.model_name,
device_map={"": "xpu:0"} if is_torch_xpu_available() else {"": "npu:0"} if is_torch_npu_available else {"": 0},
peft_config=lora_config,
quantization_config=nf4_config,
reward_adapter=script_args.rm_adapter,
use_safetensors=script_args.use_safetensors,
)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name)
tokenizer.pad_token = tokenizer.eos_token
# Compute that only on the main process for faster data processing.
# see: https://github.com/huggingface/trl/pull/1255
with PartialState().local_main_process_first():
dataset = create_and_prepare_dataset(tokenizer, script_args.dataset_num_proc)
def collator(data):
return {key: [d[key] for d in data] for key in data[0]}
config = PPOConfig(
model_name=script_args.model_name,
log_with=script_args.log_with,
learning_rate=1e-5,
batch_size=8,
mini_batch_size=2,
gradient_accumulation_steps=2,
optimize_cuda_cache=True,
seed=script_args.seed,
use_score_scaling=script_args.use_score_scaling,
use_score_norm=script_args.use_score_norm,
score_clip=script_args.score_clip,
)
ppo_trainer = PPOTrainer(
config,
model,
ref_model=None,
tokenizer=tokenizer,
dataset=dataset,
data_collator=collator,
)
generation_kwargs = {
"top_k": 0.0,
"top_p": 0.9,
"do_sample": True,
"pad_token_id": tokenizer.pad_token_id,
"max_new_tokens": 32,
}
for _epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
question_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(
question_tensors,
return_prompt=False,
**generation_kwargs,
)
batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
# Compute reward score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(ppo_trainer.accelerator.device)
raw_rewards = ppo_trainer.accelerator.unwrap_model(ppo_trainer.model).compute_reward_score(**inputs)
rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))] # take last token
# Run PPO step
stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)

View File

@ -19,8 +19,6 @@ python examples/scripts/reward_modeling.py \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--remove_unused_columns False \
--gradient_checkpointing True \
--learning_rate 1.0e-5 \
--logging_steps 25 \
@ -32,17 +30,15 @@ LoRA:
python examples/scripts/reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--remove_unused_columns False \
--gradient_checkpointing True \
--learning_rate 1.0e-5 \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 /
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
@ -51,31 +47,25 @@ python examples/scripts/reward_modeling.py \
import warnings
import torch
from accelerate import PartialState
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, HfArgumentParser
from trl import (
ModelConfig,
RewardConfig,
RewardTrainer,
ScriptArguments,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
setup_chat_format,
)
from trl.commands.cli_utils import RewardScriptArguments
from trl.extras.dataset_formatting import conversations_formatting_function
tqdm.pandas()
if __name__ == "__main__":
parser = HfArgumentParser((RewardScriptArguments, RewardConfig, ModelConfig))
args, config, model_config = parser.parse_args_into_dataclasses()
config.gradient_checkpointing_kwargs = dict(use_reentrant=False)
parser = HfArgumentParser((ScriptArguments, RewardConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)
################
# Model & Tokenizer
@ -90,6 +80,8 @@ if __name__ == "__main__":
revision=model_config.model_revision,
device_map=get_kbit_device_map() if quantization_config is not None else None,
quantization_config=quantization_config,
use_cache=False if training_args.gradient_checkpointing else True,
torch_dtype=torch_dtype,
)
tokenizer = AutoTokenizer.from_pretrained(
model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, use_fast=True
@ -110,58 +102,20 @@ if __name__ == "__main__":
" Make sure to pass --lora_task_type SEQ_CLS when using this script with PEFT."
)
#############################
# Load and preprocess dataset
#############################
dataset = load_dataset(args.dataset_name)
def preprocess_function(examples):
new_examples = {
"input_ids_chosen": [],
"attention_mask_chosen": [],
"input_ids_rejected": [],
"attention_mask_rejected": [],
}
for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
tokenized_chosen = tokenizer(chosen)
tokenized_rejected = tokenizer(rejected)
new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])
return new_examples
with PartialState().local_main_process_first():
# Wrap inputs with chat template.
# This assumes the chosen/rejected columns are in the OpenAI messages format.
chosen_fn = conversations_formatting_function(tokenizer, "chosen")
rejected_fn = conversations_formatting_function(tokenizer, "rejected")
dataset = dataset.map(
lambda x: {"chosen": chosen_fn(x), "rejected": rejected_fn(x)}, num_proc=config.dataset_num_proc
)
# Tokenize inputs
dataset = dataset.map(
preprocess_function,
batched=True,
num_proc=config.dataset_num_proc,
)
# Filter out examples that are too long
dataset = dataset.filter(
lambda x: len(x["input_ids_chosen"]) <= config.max_length
and len(x["input_ids_rejected"]) <= config.max_length,
num_proc=config.dataset_num_proc,
)
##############
# Load dataset
##############
dataset = load_dataset(script_args.dataset_name)
##########
# Training
##########
trainer = RewardTrainer(
model=model,
tokenizer=tokenizer,
args=config,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
processing_class=tokenizer,
args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
peft_config=get_peft_config(model_config),
)
trainer.train()
@ -169,9 +123,14 @@ if __name__ == "__main__":
############################
# Save model and push to Hub
############################
trainer.save_model(config.output_dir)
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
trainer.save_model(config.output_dir)
trainer.push_to_hub()
trainer.save_model(training_args.output_dir)
if training_args.eval_strategy != "no":
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
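A short sketch (ours) of scoring a conversation with the reward model trained above; the path reuses the --output_dir from the docstring and the messages are invented:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen2-0.5B-Reward")
model = AutoModelForSequenceClassification.from_pretrained("Qwen2-0.5B-Reward", num_labels=1)
messages = [{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": "Hello, how can I help?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits[0, 0].item()  # higher = preferred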

View File

@ -23,13 +23,14 @@ from transformers import (
HfArgumentParser,
)
from trl import ModelConfig
from trl.trainer.rloo_trainer import RLOOConfig, RLOOTrainer
from trl import ModelConfig, RLOOConfig, RLOOTrainer, ScriptArguments
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
"""
python -i examples/scripts/rloo/rloo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
@ -42,6 +43,8 @@ python -i examples/scripts/rloo/rloo.py \
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/rloo/rloo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--output_dir models/minimal/rloo \
--rloo_k 2 \
--num_ppo_epochs 1 \
@ -54,16 +57,15 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path EleutherAI/pythia-1b-deduped \
--local_rollout_forward_batch_size 1 \
--deepspeed3 \
--missing_eos_penalty 1.0
"""
if __name__ == "__main__":
parser = HfArgumentParser((RLOOConfig, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
parser = HfArgumentParser((ScriptArguments, RLOOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
shutil.rmtree(training_args.output_dir, ignore_errors=True)
################
# Model & Tokenizer
@ -77,21 +79,21 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
reward_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
################
# Dataset
################
raw_datasets = load_dataset("trl-internal-testing/descriptiveness-sentiment-trl-style", split="descriptiveness")
eval_samples = 20
train_dataset = raw_datasets.select(range(len(raw_datasets) - eval_samples))
eval_dataset = raw_datasets.select(range(len(raw_datasets) - eval_samples, len(raw_datasets)))
dataset = load_dataset(script_args.dataset_name, split=script_args.dataset_train_split)
eval_samples = 100
train_dataset = dataset.select(range(len(dataset) - eval_samples))
eval_dataset = dataset.select(range(len(dataset) - eval_samples, len(dataset)))
dataset_text_field = "prompt"
def prepare_dataset(dataset, tokenizer):
@ -108,7 +110,7 @@ if __name__ == "__main__":
tokenize,
batched=True,
remove_columns=dataset.column_names,
num_proc=config.dataset_num_proc,
num_proc=training_args.dataset_num_proc,
)
# Compute that only on the main process for faster data processing.
@ -121,8 +123,8 @@ if __name__ == "__main__":
# Training
################
trainer = RLOOTrainer(
config=config,
tokenizer=tokenizer,
config=training_args,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@ -130,7 +132,10 @@ if __name__ == "__main__":
eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
if config.push_to_hub:
trainer.push_to_hub()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
trainer.generate_completions()

View File

@ -23,13 +23,14 @@ from transformers import (
HfArgumentParser,
)
from trl import ModelConfig
from trl.trainer.rloo_trainer import RLOOConfig, RLOOTrainer
from trl import ModelConfig, RLOOConfig, RLOOTrainer, ScriptArguments
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
"""
python examples/scripts/rloo/rloo_tldr.py \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 1 \
@ -44,6 +45,8 @@ python examples/scripts/rloo/rloo_tldr.py \
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/rloo/rloo_tldr.py \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--output_dir models/minimal/rloo_tldr \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
@ -61,10 +64,10 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml
if __name__ == "__main__":
parser = HfArgumentParser((RLOOConfig, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
parser = HfArgumentParser((ScriptArguments, RLOOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
shutil.rmtree(training_args.output_dir, ignore_errors=True)
################
# Model & Tokenizer
@ -78,20 +81,20 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
reward_model = AutoModelForSequenceClassification.from_pretrained(
config.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
training_args.reward_model_path, trust_remote_code=model_config.trust_remote_code, num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
policy = AutoModelForCausalLM.from_pretrained(
config.sft_model_path, trust_remote_code=model_config.trust_remote_code
training_args.sft_model_path, trust_remote_code=model_config.trust_remote_code
)
################
# Dataset
################
dataset = load_dataset("trl-internal-testing/tldr-preference-sft-trl-style")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]
dataset = load_dataset(script_args.dataset_name)
train_dataset = dataset[script_args.dataset_train_split]
eval_dataset = dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None
def prepare_dataset(dataset, tokenizer):
"""pre-tokenize the dataset before training; only collate during training"""
@ -107,7 +110,7 @@ if __name__ == "__main__":
return dataset.map(
tokenize,
remove_columns=dataset.column_names,
num_proc=config.dataset_num_proc,
num_proc=training_args.dataset_num_proc,
)
# Compute that only on the main process for faster data processing.
@ -116,16 +119,16 @@ if __name__ == "__main__":
train_dataset = prepare_dataset(train_dataset, tokenizer)
eval_dataset = prepare_dataset(eval_dataset, tokenizer)
# filtering
train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=config.dataset_num_proc)
eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=config.dataset_num_proc)
train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)
eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)
assert train_dataset[0]["input_ids"][-1] != tokenizer.eos_token_id, "The last token should not be an EOS token"
################
# Training
################
trainer = RLOOTrainer(
config=config,
tokenizer=tokenizer,
config=training_args,
processing_class=tokenizer,
policy=policy,
ref_policy=ref_policy,
reward_model=reward_model,
@ -133,7 +136,10 @@ if __name__ == "__main__":
eval_dataset=eval_dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
if config.push_to_hub:
trainer.push_to_hub()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
trainer.generate_completions()

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -13,60 +12,60 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
# regular:
# Full training
python examples/scripts/sft.py \
--model_name_or_path="facebook/opt-350m" \
--dataset_text_field="text" \
--report_to="wandb" \
--learning_rate=1.41e-5 \
--per_device_train_batch_size=64 \
--gradient_accumulation_steps=16 \
--output_dir="sft_openassistant-guanaco" \
--logging_steps=1 \
--num_train_epochs=3 \
--max_steps=-1 \
--push_to_hub \
--gradient_checkpointing
# peft:
python examples/scripts/sft.py \
--model_name_or_path="facebook/opt-350m" \
--dataset_text_field="text" \
--report_to="wandb" \
--learning_rate=1.41e-5 \
--per_device_train_batch_size=64 \
--gradient_accumulation_steps=16 \
--output_dir="sft_openassistant-guanaco" \
--logging_steps=1 \
--num_train_epochs=3 \
--max_steps=-1 \
--push_to_hub \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--gradient_checkpointing \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
# LoRA
python examples/scripts/sft.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--gradient_checkpointing \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r=64 \
--lora_alpha=16
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
"""
from trl.commands.cli_utils import SFTScriptArguments, TrlParser
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import (
ModelConfig,
ScriptArguments,
SFTConfig,
SFTTrainer,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
get_kbit_device_map,
)
if __name__ == "__main__":
parser = TrlParser((SFTScriptArguments, SFTConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
parser = TrlParser((ScriptArguments, SFTConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
################
# Model init kwargs & Tokenizer
@ -90,7 +89,7 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
################
# Training
@ -98,11 +97,15 @@ if __name__ == "__main__":
trainer = SFTTrainer(
model=model_config.model_name_or_path,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
peft_config=get_peft_config(model_config),
)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
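A minimal programmatic equivalent of the updated script, shown only as a sketch and assuming the model and dataset names from the docstring; the tokenizer is created internally when no processing_class is passed:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
dataset = load_dataset("trl-lib/Capybara", split="train")
training_args = SFTConfig(output_dir="Qwen2-0.5B-SFT")
trainer = SFTTrainer(model="Qwen/Qwen2-0.5B", args=training_args, train_dataset=dataset)
trainer.train()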

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -15,7 +14,10 @@
"""
pip install pillow
python examples/scripts/vsft_llava.py \
# Tested on 8x H100 GPUs
accelerate launch \
--config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/sft_vlm.py \
--dataset_name HuggingFaceH4/llava-instruct-mix-vsft \
--model_name_or_path llava-hf/llava-1.5-7b-hf \
--per_device_train_batch_size 8 \
@ -23,38 +25,35 @@ python examples/scripts/vsft_llava.py \
--output_dir sft-llava-1.5-7b-hf \
--bf16 \
--torch_dtype bfloat16 \
--gradient_checkpointing \
--use_peft \
--dataloader_num_workers 32 \
--lora_target_modules=all-linear
--gradient_checkpointing
For LLaVA-NeXT, use: (requires transformers>=4.45)
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf
For meta-llama/Llama-3.2-11B-Vision-Instruct, use: (requires transformers>=4.45.1)
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct
"""
from trl.commands.cli_utils import SFTScriptArguments, TrlParser
import torch
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from transformers import AutoModelForVision2Seq, AutoProcessor, LlavaForConditionalGeneration
from trl import (
ModelConfig,
ScriptArguments,
SFTConfig,
SFTTrainer,
TrlParser,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
get_kbit_device_map,
)
if __name__ == "__main__":
parser = TrlParser((SFTScriptArguments, SFTConfig, ModelConfig))
sft_script_args, training_args, model_config = parser.parse_args_and_config()
parser = TrlParser((ScriptArguments, SFTConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)
training_args.dataset_text_field = "" # need a dummy field
training_args.remove_unused_columns = False
training_args.dataset_kwargs = {"skip_prepare_dataset": True}
@ -88,14 +87,20 @@ if __name__ == "__main__":
def collate_fn(examples):
# Get the texts and images, and apply the chat template
texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in examples]
images = [example["images"][0] for example in examples]
images = [example["images"] for example in examples]
if isinstance(model, LlavaForConditionalGeneration):
# LLava1.5 does not support multiple images
images = [image[0] for image in images]
# Tokenize the texts and process the images
batch = processor(texts, images, return_tensors="pt", padding=True)
batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
# The labels are the input_ids, and we mask the padding tokens in the loss computation
labels = batch["input_ids"].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == processor.tokenizer.pad_token_id] = -100 #
# Ignore the image token index in the loss computation (model specific)
image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
labels[labels == image_token_id] = -100
batch["labels"] = labels
return batch
@ -103,7 +108,7 @@ if __name__ == "__main__":
################
# Dataset
################
dataset = load_dataset(sft_script_args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
################
# Training
@ -112,15 +117,17 @@ if __name__ == "__main__":
model=model,
args=training_args,
data_collator=collate_fn,
train_dataset=dataset[sft_script_args.dataset_train_split],
eval_dataset=dataset[sft_script_args.dataset_test_split],
tokenizer=processor.tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=processor.tokenizer,
peft_config=get_peft_config(model_config),
)
trainer.train()
# Save and push to hub
trainer.save_model(training_args.output_dir)
trainer.push_to_hub()
if Accelerator().is_main_process:
processor.push_to_hub(training_args.hub_model_id)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
if trainer.accelerator.is_main_process:
processor.push_to_hub(training_args.hub_model_id)
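To make the masking in collate_fn above concrete, a toy illustration (the ids are invented, not real token ids):
import torch
input_ids = torch.tensor([[11, 42, 42, 7, 0]])  # 42 = image token, 0 = pad (toy ids)
labels = input_ids.clone()
labels[labels == 0] = -100    # padding is ignored by the loss
labels[labels == 42] = -100   # image tokens are ignored by the loss
# labels is now tensor([[  11, -100, -100,    7, -100]])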

View File

@ -1,4 +1,3 @@
# flake8: noqa
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -33,23 +32,30 @@ python examples/scripts/xpo.py \
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
from trl import (
DPOScriptArguments,
HfPairwiseJudge,
LogCompletionsCallback,
ModelConfig,
OpenAIPairwiseJudge,
PairRMJudge,
ScriptArguments,
TrlParser,
XPOConfig,
XPOTrainer,
get_kbit_device_map,
get_quantization_config,
)
from trl.commands.cli_utils import TrlParser
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE
JUDGES = {"pair_rm": PairRMJudge, "openai": OpenAIPairwiseJudge, "hf": HfPairwiseJudge}
if __name__ == "__main__":
parser = TrlParser((DPOScriptArguments, XPOConfig, ModelConfig))
args, training_args, model_config = parser.parse_args_and_config()
args.gradient_checkpointing_kwargs = {"use_reentrant": True}
parser = TrlParser((ScriptArguments, XPOConfig, ModelConfig))
script_args, training_args, model_config = parser.parse_args_and_config()
script_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
torch_dtype = (
model_config.torch_dtype
@@ -72,9 +78,23 @@ if __name__ == "__main__":
ref_model = AutoModelForCausalLM.from_pretrained(
model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, **model_kwargs
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path, num_labels=1, trust_remote_code=model_config.trust_remote_code
)
if training_args.reward_model_path is not None:
reward_model = AutoModelForSequenceClassification.from_pretrained(
training_args.reward_model_path,
num_labels=1,
trust_remote_code=model_config.trust_remote_code,
**model_kwargs,
)
else:
reward_model = None
if training_args.judge is not None:
judge_cls = JUDGES[training_args.judge]
judge = judge_cls()
else:
judge = None
tokenizer = AutoTokenizer.from_pretrained(
model_config.model_name_or_path,
padding_side="left",
@@ -85,24 +105,29 @@ if __name__ == "__main__":
if tokenizer.chat_template is None:
tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
dataset = load_dataset(args.dataset_name)
dataset = load_dataset(script_args.dataset_name)
trainer = XPOTrainer(
model=model,
ref_model=ref_model,
reward_model=reward_model,
judge=judge,
args=training_args,
train_dataset=dataset[args.dataset_train_split],
eval_dataset=dataset[args.dataset_test_split],
tokenizer=tokenizer,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,
)
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
# train the model
if training_args.eval_strategy != "no":
generation_config = GenerationConfig(
max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
)
completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
trainer.add_callback(completions_callback)
trainer.train()
# save the model
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
trainer.push_to_hub(dataset_name=script_args.dataset_name)
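
The updated script accepts either a reward model path or a judge name and hands whichever is set to XPOTrainer. A hedged sketch of that selection logic, reusing the JUDGES mapping from the script; the helper name resolve_preference_source is illustrative:

from typing import Optional, Tuple

from trl import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge

JUDGES = {"pair_rm": PairRMJudge, "openai": OpenAIPairwiseJudge, "hf": HfPairwiseJudge}


def resolve_preference_source(
    judge_name: Optional[str], reward_model_path: Optional[str]
) -> Tuple[Optional[object], Optional[str]]:
    # Prefer a configured judge; otherwise fall back to the reward-model path.
    # Exactly one of the two is expected to be set, matching the script above.
    if judge_name is not None:
        return JUDGES[judge_name](), None
    return None, reward_model_path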


@@ -13,9 +13,10 @@ extend-select = ["E", "F", "I", "W", "UP", "B", "T", "C"]
[tool.ruff.lint.per-file-ignores]
# Allow prints in auxiliary scripts
"benchmark/**.py" = ["T201"]
"examples/**.py" = ["T201"]
"scripts/**.py" = ["T201"]
# Ignore import violations in all `__init__.py` files.
"__init__.py" = ["F401"]
[tool.ruff.lint.isort]
lines-after-imports = 2


@@ -1,7 +1,4 @@
datasets>=1.17.0
torch>=1.4.0
tqdm
transformers>=4.40.0
accelerate
peft>=0.3.0
tyro>=0.5.7
datasets
rich
transformers>=4.46.0
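
The trimmed requirements above keep only the runtime essentials with explicit minimum versions. A small, hypothetical helper for checking that an environment satisfies such pins; it assumes the packaging library is installed, and the function name meets_minimum is illustrative:

from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def meets_minimum(package: str, minimum: str) -> bool:
    # Return True if `package` is installed at or above `minimum`.
    try:
        return Version(version(package)) >= Version(minimum)
    except PackageNotFoundError:
        return False


print(meets_minimum("transformers", "4.46.0"))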


@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from datetime import date
@@ -26,17 +27,23 @@ parser.add_argument("--text_file_name", required=True)
def main(text_file_name, slack_channel_name=None):
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
message = ""
if os.path.isfile(text_file_name):
final_results = {}
file = open(text_file_name)
lines = file.readlines()
for line in lines:
result, config_name = line.split(",")
config_name = config_name.split("/")[-1].split(".yaml")[0]
final_results[config_name] = int(result)
try:
with open(text_file_name) as file:
for line in file:
result, config_name = line.strip().split(",")
config_name = config_name.split("/")[-1].split(".yaml")[0]
final_results[config_name] = int(result)
except Exception as e:
logger.error(f"Error reading file {text_file_name}: {str(e)}")
final_results = {}
no_error_payload = {
"type": "section",
@@ -55,7 +62,7 @@ def main(text_file_name, slack_channel_name=None):
"type": "section",
"text": {
"type": "plain_text",
"text": "🔴 Something is wrong with the workflow please check ASAP!"
"text": " Something is wrong with the workflow please check ASAP!"
"Something went wrong there is no text file being produced. Please check ASAP.",
"emoji": True,
},
@@ -82,7 +89,7 @@ def main(text_file_name, slack_channel_name=None):
for test_name, failed in final_results.items():
failed_table = tabulate(
[[test_name, "🟢" if not failed else "🔴"]],
[[test_name, "" if not failed else ""]],
headers=["Test Name", "Status"],
showindex="always",
tablefmt="grid",
@@ -95,7 +102,11 @@ def main(text_file_name, slack_channel_name=None):
payload.append(no_error_payload)
if os.environ.get("TEST_TYPE", "") != "":
from slack_sdk import WebClient
try:
from slack_sdk import WebClient
except ImportError:
logger.error("slack_sdk is not installed. Please install it to use Slack integration.")
return
if len(message) > MAX_LEN_MESSAGE:
print(f"Truncating long message from {len(message)} to {MAX_LEN_MESSAGE}")
@@ -131,10 +142,16 @@ def main(text_file_name, slack_channel_name=None):
print(payload)
client = WebClient(token=os.environ.get("SLACK_API_TOKEN"))
client.chat_postMessage(channel=f"#{slack_channel_name}", text=message, blocks=payload)
try:
client = WebClient(token=os.environ.get("SLACK_API_TOKEN"))
response = client.chat_postMessage(channel=f"#{slack_channel_name}", text=message, blocks=payload)
if response["ok"]:
logger.info("Message sent successfully to Slack.")
else:
logger.error(f"Failed to send message to Slack: {response['error']}")
except Exception as e:
logger.error(f"Error sending message to Slack: {str(e)}")
if __name__ == "__main__":
args = parser.parse_args()
main(args.text_file_name, args.slack_channel_name)
if __name__ == "__main__":
args = parser.parse_args()
main(args.text_file_name, args.slack_channel_name)
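
The rewritten report script wraps the Slack call in import and error guards so that a missing dependency or API failure is logged instead of crashing the CI job. A condensed sketch of that pattern; the function name post_to_slack is illustrative, and it assumes SLACK_API_TOKEN is set in the environment:

import logging
import os

logger = logging.getLogger(__name__)


def post_to_slack(channel: str, text: str, blocks: list) -> bool:
    # Guard the optional dependency, mirroring the script above.
    try:
        from slack_sdk import WebClient
    except ImportError:
        logger.error("slack_sdk is not installed. Please install it to use Slack integration.")
        return False
    # Guard the API call so CI keeps going even if Slack is unreachable.
    try:
        client = WebClient(token=os.environ.get("SLACK_API_TOKEN"))
        response = client.chat_postMessage(channel=f"#{channel}", text=text, blocks=blocks)
        return bool(response["ok"])
    except Exception as e:
        logger.error(f"Error sending message to Slack: {e}")
        return False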


@@ -13,6 +13,7 @@
# limitations under the License.
import argparse
import json
import logging
import os
from datetime import date
from pathlib import Path
@@ -20,132 +21,146 @@ from pathlib import Path
from tabulate import tabulate
MAX_LEN_MESSAGE = 2900 # slack endpoint has a limit of 3001 characters
MAX_LEN_MESSAGE = 2900 # Slack endpoint has a limit of 3001 characters
parser = argparse.ArgumentParser()
parser.add_argument("--slack_channel_name", default="trl-push-ci")
# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
def main(slack_channel_name=None):
failed = []
passed = []
group_info = []
def process_log_file(log):
failed_tests = []
passed_tests = []
section_num_failed = 0
total_num_failed = 0
empty_file = False or len(list(Path().glob("*.log"))) == 0
total_empty_files = []
for log in Path().glob("*.log"):
section_num_failed = 0
i = 0
try:
with open(log) as f:
for line in f:
line = json.loads(line)
i += 1
if line.get("nodeid", "") != "":
test = line["nodeid"]
if line.get("duration", None) is not None:
duration = f'{line["duration"]:.4f}'
if line.get("outcome", "") == "failed":
section_num_failed += 1
failed.append([test, duration, log.name.split("_")[0]])
total_num_failed += 1
else:
passed.append([test, duration, log.name.split("_")[0]])
empty_file = i == 0
group_info.append([str(log), section_num_failed, failed])
total_empty_files.append(empty_file)
os.remove(log)
failed = []
no_error_payload = {
"type": "section",
"text": {
"type": "plain_text",
"text": "🌞 There were no failures!"
if not any(total_empty_files)
else "Something went wrong there is at least one empty file - please check GH action results.",
"emoji": True,
},
}
try:
data = json.loads(line)
test_name = data.get("nodeid", "")
duration = f'{data["duration"]:.4f}' if "duration" in data else "N/A"
outcome = data.get("outcome", "")
message = ""
if test_name:
if outcome == "failed":
section_num_failed += 1
failed_tests.append([test_name, duration, log.stem.split("_")[0]])
else:
passed_tests.append([test_name, duration, log.stem.split("_")[0]])
except json.JSONDecodeError as e:
logging.warning(f"Could not decode line in {log}: {e}")
except FileNotFoundError as e:
logging.error(f"Log file {log} not found: {e}")
except Exception as e:
logging.error(f"Error processing log file {log}: {e}")
return failed_tests, passed_tests, section_num_failed
def main(slack_channel_name):
group_info = []
total_num_failed = 0
total_empty_files = []
log_files = list(Path().glob("*.log"))
if not log_files:
logging.info("No log files found.")
return
for log in log_files:
failed, passed, section_num_failed = process_log_file(log)
empty_file = not failed and not passed
total_num_failed += section_num_failed
total_empty_files.append(empty_file)
group_info.append([str(log), section_num_failed, failed])
# Clean up log file
try:
os.remove(log)
except OSError as e:
logging.warning(f"Could not remove log file {log}: {e}")
# Prepare Slack message payload
payload = [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "🤗 Results of the {} TRL tests.".format(os.environ.get("TEST_TYPE", "")),
},
"text": {"type": "plain_text", "text": f"🤗 Results of the {os.environ.get('TEST_TYPE', '')} TRL tests."},
},
]
if total_num_failed > 0:
for i, (name, num_failed, failed_tests) in enumerate(group_info):
message = ""
for name, num_failed, failed_tests in group_info:
if num_failed > 0:
if num_failed == 1:
message += f"*{name}: {num_failed} failed test*\n"
else:
message += f"*{name}: {num_failed} failed tests*\n"
failed_table = []
for test in failed_tests:
failed_report = test[0].split("::")
# Truncate the last string as some test names might be long
failed_report[-1] = failed_report[-1][:30] + ".."
failed_table.append(failed_report)
failed_table = tabulate(
failed_table,
headers=["Test Location", "Test Case", "Test Name"],
showindex="always",
tablefmt="grid",
maxcolwidths=[12, 12, 12],
message += f"*{name}: {num_failed} failed test(s)*\n"
failed_table = [
test[0].split("::")[:2] + [test[0].split("::")[-1][:30] + ".."] for test in failed_tests
]
message += (
"\n```\n"
+ tabulate(failed_table, headers=["Test Location", "Test Name"], tablefmt="grid")
+ "\n```\n"
)
message += "\n```\n" + failed_table + "\n```"
if total_empty_files[i]:
message += f"\n*{name}: Warning! Empty file - please check the GitHub action job *\n"
if any(total_empty_files):
message += f"\n*{name}: Warning! Empty file - check GitHub action job*\n"
# Logging
logging.info(f"Total failed tests: {total_num_failed}")
print(f"### {message}")
else:
payload.append(no_error_payload)
if os.environ.get("TEST_TYPE", "") != "":
from slack_sdk import WebClient
if len(message) > MAX_LEN_MESSAGE:
message = f"There are {total_num_failed} failed tests in total ! Cannot display the entire summary - please check the action results directly"
message = (
f"❌ There are {total_num_failed} failed tests in total! Please check the action results directly."
)
if len(message) != 0:
md_report = {
"type": "section",
"text": {"type": "mrkdwn", "text": message},
}
payload.append(md_report)
action_button = {
payload.append({"type": "section", "text": {"type": "mrkdwn", "text": message}})
payload.append(
{
"type": "section",
"text": {"type": "mrkdwn", "text": "*For more details:*"},
"accessory": {
"type": "button",
"text": {"type": "plain_text", "text": "Check Action results", "emoji": True},
"text": {"type": "plain_text", "text": "Check Action results"},
"url": f"https://github.com/huggingface/trl/actions/runs/{os.environ['GITHUB_RUN_ID']}",
},
}
payload.append(action_button)
)
payload.append(
{
"type": "context",
"elements": [
{
"type": "plain_text",
"text": f"On Push main {os.environ.get('TEST_TYPE')} results for {date.today()}",
}
],
}
)
date_report = {
"type": "context",
"elements": [
{
# Send to Slack
from slack_sdk import WebClient
slack_client = WebClient(token=os.environ.get("SLACK_API_TOKEN"))
slack_client.chat_postMessage(channel=f"#{slack_channel_name}", text=message, blocks=payload)
else:
payload.append(
{
"type": "section",
"text": {
"type": "plain_text",
"text": f"On Push main {os.environ.get('TEST_TYPE')} test results for {date.today()}",
"text": "✅ No failures! All tests passed successfully.",
"emoji": True,
},
],
}
payload.append(date_report)
print(payload)
client = WebClient(token=os.environ.get("SLACK_API_TOKEN"))
client.chat_postMessage(channel=f"#{slack_channel_name}", text=message, blocks=payload)
}
)
logging.info("All tests passed. No errors detected.")
if __name__ == "__main__":


@@ -1,59 +0,0 @@
# Copyright 2023 The HuggingFace Team, the AllenNLP library authors. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Script to close stale issue. Taken in part from the AllenNLP repository.
https://github.com/allenai/allennlp.
"""
import os
from datetime import datetime as dt
from datetime import timezone
from github import Github
LABELS_TO_EXEMPT = [
"good first issue",
"good second issue",
"feature request",
"help wanted",
]
def main():
g = Github(os.environ["GITHUB_TOKEN"])
repo = g.get_repo("huggingface/trl")
open_issues = repo.get_issues(state="open")
for issue in open_issues:
comments = sorted(issue.get_comments(), key=lambda i: i.created_at, reverse=True)
involved_users = [comment.user.login for comment in comments]
inactive_days = (dt.now(timezone.utc) - issue.updated_at).days
is_old = (dt.now(timezone.utc) - issue.created_at).days >= 30
has_comments = len([user for user in involved_users if user != "github-actions[bot]"]) > 0
to_exempt = any(label.name.lower() in LABELS_TO_EXEMPT for label in issue.get_labels())
if is_old and not to_exempt:
if has_comments and inactive_days > 23:
issue.create_comment(
"This issue has been automatically marked as stale because it has not had "
"recent activity. If you think this still needs to be addressed "
"please comment on this thread.\n\n"
)
elif involved_users and involved_users[0] == "github-actions[bot]" and inactive_days > 7:
issue.edit(state="closed")
if __name__ == "__main__":
main()


@@ -73,37 +73,26 @@ import os
from setuptools import find_packages, setup
__version__ = "0.11.0.dev0" # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
__version__ = "0.12.2" # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
REQUIRED_PKGS = [
"torch>=1.4.0",
"transformers>=4.40.0",
"numpy>=1.18.2;platform_system!='Windows'",
"numpy<2;platform_system=='Windows'",
"accelerate",
"datasets",
"tyro>=0.5.11",
"accelerate>=0.34.0",
"datasets>=2.21.0",
"rich", # rich shouldn't be a required package for trl, we should remove it from here
"transformers<4.47.0",
]
EXTRAS = {
"test": [
"parameterized",
"peft>=0.8.0",
"pytest",
"pytest-xdist",
"pytest-cov",
"pytest-xdist",
"scikit-learn",
"Pillow",
"pytest-rerunfailures",
"llm-blender>=0.0.2",
],
"peft": ["peft>=0.8.0"],
"liger": ["liger-kernel>=0.2.1"],
# Windows support is partially supported with DeepSpeed https://github.com/microsoft/DeepSpeed/tree/master#windows
"deepspeed": ["deepspeed>=0.14.4; sys_platform != 'win32'"],
"diffusers": ["diffusers>=0.18.0"],
"deepspeed": ["deepspeed>=0.14.4"],
"benchmark": ["wandb", "ghapi", "openrlbenchmark==0.2.1a5", "requests", "deepspeed"],
"quantization": ["bitsandbytes<=0.41.1"],
"llm_judge": ["openai>=1.23.2", "huggingface_hub>=0.22.2", "llm-blender>=0.0.2"],
# liger-kernel depends on triton, which is only available on Linux https://github.com/triton-lang/triton#compatibility
"liger": ["liger-kernel>=0.2.1; sys_platform != 'win32'"],
"llm_judge": ["openai>=1.23.2", "llm-blender>=0.0.2"],
"peft": ["peft>=0.8.0"],
"quantization": ["bitsandbytes"],
"scikit": ["scikit-learn"],
"test": ["parameterized", "pytest-cov", "pytest-rerunfailures", "pytest-xdist", "pytest"],
"vlm": ["Pillow"],
}
EXTRAS["dev"] = []
for reqs in EXTRAS.values():
@@ -127,17 +116,18 @@ try:
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
],
url="https://github.com/huggingface/trl",
entry_points={
"console_scripts": ["trl=trl.commands.cli:main"],
},
include_package_data=True,
package_data={"trl": ["commands/scripts/config/*", "commands/scripts/*"]},
package_data={"trl": ["commands/scripts/config/*", "commands/scripts/*", "templates/*.md"]},
packages=find_packages(exclude={"tests"}),
install_requires=REQUIRED_PKGS,
extras_require=EXTRAS,
python_requires=">=3.7",
python_requires=">=3.9",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
zip_safe=False,
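
The hunk that rewrites EXTRAS ends just as the "dev" aggregation loop begins, so its body is cut off in this diff. A sketch of the conventional pattern that line introduces, under the assumption that "dev" collects the requirements of every other extra; the abbreviated EXTRAS dict below is illustrative, not the full set from setup.py:

# Abbreviated stand-in for the EXTRAS dict defined in setup.py above.
EXTRAS = {
    "peft": ["peft>=0.8.0"],
    "test": ["parameterized", "pytest"],
}

# Conventional aggregation: "dev" is the union of every other extra.
EXTRAS["dev"] = []
for extra, reqs in list(EXTRAS.items()):
    if extra != "dev":
        EXTRAS["dev"].extend(reqs)

print(EXTRAS["dev"])
# -> ['peft>=0.8.0', 'parameterized', 'pytest']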

Some files were not shown because too many files have changed in this diff