Commit Graph

24 Commits

SHA1 Message Date
9df19e8a75 📜 Fix license and copyrights (#3264) 2025-04-08 15:22:58 -07:00
5877786b5a 🪄 Minor comment style modif (#2582) 2025-01-17 11:12:00 +01:00
1d23ecc36f ©️ Update copyrights year (#2547)
* happy new year

* fix wandb import sort
2025-01-07 14:53:09 +01:00
179ba53671 🐾 Process-supervised RM Trainer (#2127)
* initial skeleton

* tokenize fn

* adding bos and eos to tokenization fn

* prmtrainer

* fixing small typo in tokenize

* typo in input_ids and labels construction

* numpy dimension

* introduce the stepwise reward trainer

* update markdown files

* let user decide post step separator in config

* doc post_step_separator

* do not add post step_tokens to last step of the reasoning process

* renaming prm to stepwisereward

* formatting

* fix tokenize kwargs

* adapt test to the new post_token args

* adding example script

* fix small typo

* add create_model_card and renaming

* fixing booleans

* Adding the new stepwise_preference instead of placeholders for datasets

* formatting

* Update docs/source/_toctree.yml

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update examples/scripts/stepwise_reward_modeling.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* update push to hub

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* step_separator can't be None

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix suggested typos

* add citation

* reformat doc

* reordering init

* push to hub prm800k

* changing dataset in example

* change dataset format to align with the sky is blue example

* fix tokenization column names

* fix num labels in openai example

* add support for conversational dataset

* remove training whitespace

* replace tokenizer with processing class

* Update docs/source/dataset_formats.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* remove openai_prm800k

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/stepwise_reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/stepwise_reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* renaming

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* renaming

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* minor renamings in docs

* using prm800k instead of openai_prm800k

* update num labels to 2 following the new format

* changing doc examples to math examples

* change reference to dataset_formats.mdx

* changing dataset config in test

* remove conversational dataset support

* remove conv dataset support

* fix bos token

* fix scriptarguments in example

* completion to completions

* remove valuerror for step_separator inside steps

* run precommit

* remove conv dataset support

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* renaming zen dataset

* remove unused printing

* unknown label column

* introduce the train on last step arg

* _tokenize support train_on_last_step

* incorporate train_on_last_step to tests

* formatting

* remove comments in trainer

* Refactor `tokenize_row`

* Update max_completion_length parameter in StepwiseRewardConfig

* Collator

* Update comment

* Update type hint

* fix table

* Remove collator

* don't need pad token id

* add error back

* max length args

* use tokenizer arg

* Update doc

* label -> labels

* fixing tokenization issues in tokenize row

* correct labels for token classification

* adding max_length to tokenize_row

* reformat tests

* adding tests for tokenize row

* fixing typos in comments

* update doc

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Add math_shepherd.py script for dataset processing

* split the dataset

* formatting

* same evaluation method for the two training methods

* adding filtering to example script

* formatting

* Add features to avoid casting labels to bool in dataset tokenization

* Update docs/source/stepwise_reward_trainer.mdx [ci skip]

* Add learning_rate parameter to StepwiseRewardConfig class

* update doc

* Remove unused setup_chat_format function

* Fix warning message in stepwise_reward_modeling.py

* Update logging steps in stepwise_reward_trainer.mdx

* little doc change [ci skip]

* Fix copyrights

* fix space after copyrights

* Update dataset loading in stepwise_reward_modeling.py

* refine compute_accuracy and proper test

* fix tests

* style

* renamings

* renaming in init

* doc renaming

* fix sorting and tag

* experimental [ci skip]

* trigger CI

* other doc fix

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-12-13 15:56:10 +01:00
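For context on what this trainer consumes, here is a minimal, hedged sketch of a stepwise-supervision example row implied by the bullets above ("completion to completions", "label -> labels", one label per reasoning step); the prompt and step texts are illustrative placeholders, not taken from the PR.

```python
# Hedged sketch of a stepwise-supervision row: one boolean label per
# reasoning step, matching the renamed `completions` / `labels` columns.
example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.80 and that of 9.11 is 0.11.",  # correct step
        "Since 0.11 > 0.80, 9.11 is the larger number.",                 # incorrect step
    ],
    "labels": [True, False],  # with a train-on-last-step option, only the final label would count
}
```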
9410874787 ©️ Copyrights update (#2454)
* First changes

* Other files

* Finally

* rm comment

* fix nashmd

* Fix example

* Fix example [ci skip]
2024-12-10 10:40:00 +01:00
453db5cd79 🤏 New models for tests (#2287)
* first commit

* uncomment

* other tests adaptations

* Remove unused variable in test_setup_chat_format

* Remove unused import statement

* style

* Add Bart model

* Update BCOTrainerTester class in test_bco_trainer.py

* Update model IDs and tokenizers in test files

* Add new models and processors

* Update model IDs in test files

* Fix formatting issue in test_dataset_formatting.py

* Refactor dataset formatting in test_dataset_formatting.py

* Fix dataset sequence length in SFTTrainerTester

* Remove tokenizer

* Remove print statement

* Add reward_model_path and sft_model_path to PPO trainer

* Fix tokenizer padding issue

* Add chat template for testing purposes in PaliGemma model

* Update PaliGemma model and chat template

* Increase learning rate to speed up test

* Update model names in run_dpo.sh and run_sft.sh scripts

* Update model and dataset names

* Fix formatting issue in test_dataset_formatting.py

* Fix formatting issue in test_dataset_formatting.py

* Remove unused chat template

* Update model generation script

* additional models

* Update model references in test files

* Remove unused imports in test_online_dpo_trainer.py

* Add is_llm_blender_available import and update reward_tokenizer

* Refactor test_online_dpo_trainer.py: Move skipped test case decorator

* remove models without chat templates

* Update model names in scripts and tests

* Update model_id in test_modeling_value_head.py

* Update model versions in test files

* Fix formatting issue in test_dataset_formatting.py

* Update embedding model ID in BCOTrainerTester

* Update test_online_dpo_trainer.py with reward model changes

* Update expected formatted text in test_dataset_formatting.py

* Add reward_tokenizer to TestOnlineDPOTrainer

* fix tests

* Add SIMPLE_CHAT_TEMPLATE to T5 tokenizer

* Fix dummy_text format in test_rloo_trainer.py

* Skip outdated test for chatML data collator

* Add new vision language models

* Commented out unused model IDs in test_vdpo_trainer

* Update model and vision configurations in generate_tiny_models.py and test_dpo_trainer.py

* Update model and tokenizer references

* Don't push if it already exists

* Add comment explaining test skip

* Fix model_exists function call and add new models

* Update LlavaForConditionalGeneration model and processor

* `qgallouedec` -> `trl-internal-testing`
2024-11-25 16:31:56 +01:00
24fb32733f 🔧 Use standard unittest assertion methods (#2283)
* WIP: Partial unit test update

* Update unittest format

* Update tests/slow/test_sft_slow.py comment

* Refactor unit tests: replace pytest.raises with self.assertRaises

* Fix: Restore accidentally deleted 'ref_model' parameter in DPOTrainer

* Re-run pre-commit

* fix: Incorrectly replacing non-TestCase assert

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-31 15:10:43 +01:00
47d08a9626 Rename trainer arg tokenizer to processing_class (#2162) 2024-10-07 09:39:32 +02:00
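A minimal, hedged sketch of what this rename means at the call site, assuming a recent TRL release; the checkpoint and dataset names are placeholders, not taken from the commit.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Assumed preference dataset with `chosen` / `rejected` columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model", report_to="none"),
    train_dataset=dataset,
    processing_class=tokenizer,  # formerly passed as `tokenizer=...`
)
```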
cc23b511e4 [RewardTrainer] Tokenize inputs within trainer (#2102)
* Pretokenize in reward modelling

* Fix README example

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Move chat template formatting inside trainer

* Refactor tests

* Fix README

* Disable wandb

* Update readme

* add comment `remove_unused_columns`

* Update trl/trainer/reward_config.py

* doc

* implicit*

* explicit

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-24 13:03:32 +02:00
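As a hedged illustration of the user-facing effect: after this change the trainer accepts raw preference text and tokenizes (and chat-formats) it internally, instead of requiring pre-tokenized columns. The field names below follow the common TRL convention and are given as an assumption, not quoted from the PR.

```python
# Before: pairs were pre-tokenized into explicit columns, roughly
#   input_ids_chosen / attention_mask_chosen / input_ids_rejected / attention_mask_rejected
# After: a raw, implicit-prompt preference row is enough; the trainer tokenizes it.
raw_example = {
    "chosen": "The sky is blue because sunlight scatters off molecules in the atmosphere.",
    "rejected": "The sky is blue because it reflects the oceans.",
}
```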
07f0e687cb Use transformers utilities when possible (#2064)
* use transformers' availability functions

* require from transformers

* rm file

* fix no peft

* fix import

* don't alter _peft_available

* fix require_diffusers

* style

* transformers>=4.40 and add back `is_liger_kernel_available`
2024-09-16 15:56:49 +02:00
cbcaa46cd3 Various args and test fix (#1909)
* report to none

* simplify AlignPropTrainerTester

* rm unused marker

* Don't share setup in dpo trainer

* style

* don't share setup in test rich

* fix setup and classmethod

* fix args for sft

* test_trainer_args

* various arg fix

* report to none and vsdt simplifi

* drop generate_during_eval

* fix run_name

* style

* drop setUpClass

* style

* new ref values for ppo trainer tester

* update ref val

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-09 10:07:58 +02:00
332062372d Drop setUpClass in reward tester (#1895)
* drop setUp class in reward tester

* report to none

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-05 16:01:43 +02:00
a02513c3b7 Apply deprecated evaluation_strategy (#1559)
* Deprecate

* Update tests/test_dpo_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-05-23 12:48:00 +02:00
d10f7663b0 [peft] Update test_reward_trainer.py to fix tests (#1471)
* [peft] Update test_reward_trainer.py

Since we are requiring peft >= 0.4.0

* formatting
2024-03-22 19:12:54 +01:00
1e77d8aeb2 [core / xxxTrainer] Automatic tagging (#1329)
* automatic tagging

* add comments

* fix tests

* fix
2024-02-15 14:47:32 +01:00
9bc478ecbb pre-commit: replace linters + formatters with Ruff; fix some issues (#1300)
* pre-commit: replace linters + formatters with Ruff

* Don't use bare except

* Clean up `noqa`s

* Enable Ruff UP; apply auto-fixes

* Enable Ruff B; apply fixes

* Enable Ruff T with exceptions

* Enable Ruff C (complexity); autofix

* Upgrade Ruff to 0.2.0
2024-02-15 04:37:41 +01:00
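As a hedged illustration of the kind of auto-fix the enabled Ruff B rules apply (the snippet is illustrative, not taken from the diff):

```python
# Before (flagged by Ruff B006: mutable default argument):
#   def collect(tags=[]):
#       tags.append("rl")
#       return tags
# After the fix:
def collect(tags=None):
    """Collect tags without sharing a mutable default list between calls."""
    tags = [] if tags is None else list(tags)
    tags.append("rl")
    return tags
```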
ae8431bd50 Codemod Unittest assertions to bare asserts (#1301)
* Remove stray commas from test data

* Codemod Unittest assertions to bare asserts

* Make `assertAlmostEqual` tests more idiomatic

* DRY some test strings
2024-02-01 23:49:03 +01:00
08cfc4179b Add margin to RM training (#719)
* Start adding margin to RM training

* Fix typo and cleanup

* Fix incompatibilities when not using margin

* Format using 'make precommit'

* Add documentation and test for reward trainer

* Run 'make precommit'

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Fix missed merge conflict in reward trainer docs

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-09-20 10:18:38 +02:00
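A hedged sketch of the margin-aware pairwise loss this PR describes: when an example carries a margin, the chosen score must beat the rejected score by at least that margin before the loss vanishes. The numbers are toy values, and the exact in-trainer wiring may differ.

```python
import torch
import torch.nn.functional as F

rewards_chosen = torch.tensor([1.2, 0.3])    # toy scores for the preferred responses
rewards_rejected = torch.tensor([0.4, 0.5])  # toy scores for the rejected responses
margin = torch.tensor([0.5, 0.0])            # per-pair margin; 0.0 recovers the plain pairwise loss

loss = -F.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
print(loss.item())
```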
d484dc2a93 Refactor RewardTrainer hyperparameters into dedicated dataclass (#726)
* Refactor RewardTrainer hyperparameters into dedicated dataclass

* Revert

* Add doc string

* Fix warning

* Handle backwards compat

* Fix tests

* Add docs

* Refactor to RewardConfig

* Fix case conditions

* Fix
2023-09-05 09:05:42 +02:00
843c14574f fix CI RM (#468) 2023-06-26 14:30:06 +02:00
b4bb12992e Update test_reward_trainer.py (#421) 2023-06-09 15:52:41 +02:00
fadffc22bc Update test_reward_trainer.py (#410) 2023-06-07 12:22:22 +02:00
376d152d3f Resolve broken evaluation/prediction for RewardTrainer (#404)
* Implement evaluation/prediction for RewardTrainer

* Stick with unittest assertions

* Perform prediction forward calls without gradient

* Remove Literal to preserve Python 3.7 support

I recognize that I can also import from typing_extensions with a try-except,
but that is a bit overkill for this, I feel.

* Remove eval_steps=1 to prevent flaky test on CI

The flaky test is caused by a division by zero when dividing by the runtime.
This is done on the transformers side, so it's not a TRL issue.
In practice, this won't happen - it only happens because both the model
and dataset are tiny.
2023-06-06 16:49:30 +02:00
3cfe194e34 [core] Officially Support Reward Modeling (#303)
* v1

- add working version
- add all possible tests
- add docs

* add some contents

* clean up

* fixes

* patch test for now

* fix test

* clean up

* fix

* this time fix

* Update docs/source/trainer.mdx

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix

* update

* final changes

* oops

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* switch to chosen / rejected

* fixes

* add example

* add accuracy metric

* pass PEFT config

* refactor compute metrics

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-04-26 11:51:56 +02:00