9955ee7eaa
🐳 Docker update + Simplify Jobs doc (#3931)
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-13 18:35:55 -06:00
a647e5a78a
🗜 Hotfix: avoid passing quantization_config=None (#4019)
2025-09-09 14:50:15 -06:00
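For context, a minimal sketch of the pattern implied by #4019; the helper below is hypothetical rather than the exact TRL code, and the model id is a placeholder. The point is to forward quantization_config only when it is actually set, instead of passing an explicit None.

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    def load_model(model_id, quantization_config=None):
        # Only include quantization_config in the kwargs when it is set;
        # forwarding an explicit None can confuse downstream config handling.
        kwargs = {}
        if quantization_config is not None:
            kwargs["quantization_config"] = quantization_config
        return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)

    # e.g. load_model("Qwen/Qwen2.5-0.5B", BitsAndBytesConfig(load_in_4bit=True))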
d1bf56020d
⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer (#3783)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-05 16:58:49 -06:00
0c69fd2867
👷 Added Kernels on the Hub x TRL guide (#3969)
Co-authored-by: vb <vaibhavs10@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-09-04 15:37:02 +02:00
208e9f7df7
📏 torch_dtype to dtype everywhere (#4000)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-03 15:45:37 -06:00
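A sketch of the rename applied in #4000; recent transformers releases accept dtype in place of the deprecated torch_dtype keyword, but whether the new spelling works depends on the installed version.

    import torch
    from transformers import AutoModelForCausalLM

    # Before: AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
    # After (spelling used throughout TRL following this change):
    model = AutoModelForCausalLM.from_pretrained("gpt2", dtype=torch.bfloat16)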
0c91515b58
🧭 HF jobs x TRL guide (#3890)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-26 21:44:29 -07:00
a043fd74a3
Add uv scripts headers (#3767)
2025-07-25 07:48:40 -07:00
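The headers added in #3767 follow the inline script metadata format (PEP 723), which uv run reads to resolve dependencies on the fly; the dependency list below is illustrative.

    # /// script
    # dependencies = [
    #     "trl",
    #     "peft",
    # ]
    # ///

    # ...the example script body follows as usual, e.g.:
    from trl import SFTTrainer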
9df19e8a75
📜 Fix license and copyrights (#3264)
2025-04-08 15:22:58 -07:00
1d23ecc36f
©️ Update copyrights year (#2547)
* happy new year
* fix wandb import sort
2025-01-07 14:53:09 +01:00
ca850be0a2
🕹️ CLI refactor (#2380)
* Refactor main function in dpo.py
* Update setup.py and add cli.py
* Add examples to package data
* style
* Refactor setup.py file
* Add new file t.py
* Move dpo to package
* Update MANIFEST.in and setup.py, refactor trl/cli.py
* Add __init__.py to trl/scripts directory
* Add license header to __init__.py
* File moved instruction
* Add Apache License and update file path
* Move dpo.py to new location
* Refactor CLI and DPO script
* Refactor import structure in scripts package
* env
* rm config from chat arg
* rm old cli
* chat init
* test cli [skip ci]
* Add `datast_config_name` to `ScriptArguments` (#2440)
* add missing arg
* Add test cases for 'trl sft' and 'trl dpo' commands
* Add sft.py script and update cli.py to include sft command
* Move sft script
* chat
* style [ci skip]
* kto
* rm example config
* first step on doc
* see #2442
* see #2443
* fix chat windows
* ©️ Copyrights update (#2454)
* First changes
* Other files
* Finally
* rm comment
* fix nashmd
* Fix example
* Fix example [ci skip]
* 💬 Fix chat for windows (#2443)
* fix chat for windows
* add some tests back
* Revert "add some tests back"
This reverts commit 350aef52f53f8cf34fccd7ad0f78a3dd63867e06.
* 🆔 Add `datast_config` to `ScriptArguments` (#2440)
* datast_config_name
* Update trl/utils.py [ci skip]
* sort import
* typo [ci skip]
* Trigger CI
* Rename `dataset_config_name` to `dataset_config`
* 🏎 Fix deepspeed preparation of `ref_model` in `OnlineDPOTrainer` (#2417)
* Remove unused deepspeed code
* add model prep back
* add deepspeed even if it doesn't work
* rm old code
* Fix config name
* Remove `make dev` in favor of `pip install -e .[dev]`
* Update script paths and remove old symlink related things
* Fix chat script path [ci skip]
* style
2024-12-13 17:52:23 +01:00
460e780265
👯 Standardize model_args (#2442)
* `model_config` -> `model_args`
* sort
2024-12-10 12:51:20 +01:00
6a05feff02
🆔 Add datast_config to ScriptArguments (#2440)
* datast_config_name
* Update trl/utils.py [ci skip]
* sort import
* typo [ci skip]
* Trigger CI
* Rename `dataset_config_name` to `dataset_config`
2024-12-10 11:09:26 +01:00
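A sketch of how the field added in #2440 is used by the example scripts: dataset_config names an optional dataset configuration (subset) passed through to load_dataset. The dataset id is a placeholder, and the dataclass is built directly here for brevity rather than parsed from the command line.

    from datasets import load_dataset
    from trl import ScriptArguments

    script_args = ScriptArguments(
        dataset_name="trl-lib/tldr",   # placeholder dataset
        dataset_config=None,           # optional subset/configuration name
    )
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)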
9410874787
©️ Copyrights update (#2454)
* First changes
* Other files
* Finally
* rm comment
* fix nashmd
* Fix example
* Fix example [ci skip]
2024-12-10 10:40:00 +01:00
ac77c09223
Fix gradient_checkpointing_kwargs assignment in examples (#2331)
Co-authored-by: Ping <ping.zhu@jmuse.cn>
2024-11-07 09:28:10 +01:00
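For reference, a sketch of the corrected assignment pattern from #2331; the config class and values here are illustrative rather than the exact diff.

    from trl import SFTConfig

    training_args = SFTConfig(
        output_dir="out",
        gradient_checkpointing=True,
        # Assign the kwargs on the config itself; use_reentrant=False is the
        # commonly recommended setting with recent PyTorch versions.
        gradient_checkpointing_kwargs={"use_reentrant": False},
    )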
b2696578ce
🍬 Use any reward model for online methods (#2276)
* Refactor reward processing in OnlineDPOTrainer
* Refactor completion decoding and reward processing
* remove strip
* remove warning
* Add reward_tokenizer to training script
* Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer test
* propagate to xpo and nash
* style
* reduce memory requirement with inference_mode
* fix tests
* pairrm judge llmblender
* setUpClass(cls)
* Add setUpClass method to TestJudges class
* truncation left for reward tokenizer
* don't logcompletion without eval dataset
* only eval when possible
2024-10-28 16:21:40 +01:00
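A hedged sketch of what #2276 enables: an arbitrary sequence-classification reward model, with its own tokenizer, handed to OnlineDPOTrainer. The reward_processing_class name follows the bullet list above; model ids are placeholders and the exact signature should be checked against the installed TRL version.

    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
    from trl import OnlineDPOConfig, OnlineDPOTrainer

    policy_id, reward_id = "Qwen/Qwen2.5-0.5B-Instruct", "trl-lib/Qwen2-0.5B-Reward"

    trainer = OnlineDPOTrainer(
        model=AutoModelForCausalLM.from_pretrained(policy_id),
        reward_model=AutoModelForSequenceClassification.from_pretrained(reward_id, num_labels=1),
        processing_class=AutoTokenizer.from_pretrained(policy_id),
        reward_processing_class=AutoTokenizer.from_pretrained(reward_id),
        args=OnlineDPOConfig(output_dir="out"),
        train_dataset=Dataset.from_dict({"prompt": ["The sky is"]}),
    )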
e155cb8a66
⛓️ 💥 Don't use eval_dataset in scripts when no eval strategy (#2270)
2024-10-28 11:40:51 +01:00
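The convention the scripts adopt in #2270, sketched below: hand the trainer an eval split only when an evaluation strategy is configured, so no unused split is tokenized and held in memory. The dataset and model ids are placeholders.

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train").train_test_split(test_size=0.1)
    training_args = SFTConfig(output_dir="out", eval_strategy="no")

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset["train"],
        # Only pass the eval split when evaluation is actually enabled:
        eval_dataset=dataset["test"] if training_args.eval_strategy != "no" else None,
    )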
9c376c571f
[Judges] use the pair-judges in online-preference trainers (#2243)
* use the pair-judges
* add test
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* decode and skip special characters
* initial nash
* return tensors
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* add back the logging
* use batch_decode
* add judges api to XPO trainer
* Update tests/test_online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* judge in examples
* judge in config
* add back logs when using reward model
* typo
* add back model_scores logging when using reward model
* log scores for reward model only
* better cond on what to log
* same for rlhf reward
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* use decode_and_strip_padding
* error if both reward and judge or none are set
* remove unused check
* Uniform way to pass conversation into judge
* heading -> leading
* LogCompletionsCallback compat with online method
* Update Online DPO doc
* check if data is conversational for judges
* update example
* remove comment
* use zip
* fix stats xpo
* Replace judge with PairRMJudge and import AutoModelForSequenceClassification
* update xpo documentation
* Remove doc duplication
* update nash doc
* XPO trl chat
* nash md doc
* HfPairwiseJudge
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-24 16:47:10 +02:00
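A sketch of the judge-based path added in #2243: a pairwise judge ranks the two sampled completions in place of a reward model scoring them. PairRMJudge needs the optional llm-blender dependency, and the model id and tiny dataset are placeholders.

    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"

    trainer = OnlineDPOTrainer(
        model=AutoModelForCausalLM.from_pretrained(model_id),
        judge=PairRMJudge(),  # pairwise judge in place of a reward model
        processing_class=AutoTokenizer.from_pretrained(model_id),
        args=OnlineDPOConfig(output_dir="out"),
        train_dataset=Dataset.from_dict({"prompt": ["The sky is"]}),
    )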
7e394b03e8
🎭 Deprecate [SFT/DPO/Reward]ScriptArguments in favour of ScriptArguments (#2145)
* `DPOScriptArguments` to `ScriptArguments`
* use dataset_train_split
* Use scriptarguments
* dataset names in command lines
* use `ScriptArguments` everywhere
* ignore biais buffer to end
* remove in v0.13
* rm comment
* update test commands
* Update docs/source/rloo_trainer.md
* Update tests/test_rloo_trainer.py
* Added dataset_train_split argument to ppo.py and rloo.py
* update scripts with dataset_train_split
2024-10-14 11:14:58 +02:00
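After #2145 a single shared ScriptArguments dataclass backs the example scripts; a minimal sketch of the parsing side (run with e.g. --dataset_name ... on the command line):

    from trl import ScriptArguments, TrlParser

    # One shared dataclass replaces the per-script SFT/DPO/Reward variants.
    parser = TrlParser(ScriptArguments)
    (script_args,) = parser.parse_args_and_config()
    print(script_args.dataset_name, script_args.dataset_train_split)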
47d08a9626
Rename trainer arg tokenizer to processing_class (#2162)
2024-10-07 09:39:32 +02:00
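The rename from #2162 in a sketch, mirroring the upstream transformers move from tokenizer= to processing_class=; the model id and the tiny preference dataset are placeholders.

    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"
    train_dataset = Dataset.from_dict({
        "prompt": ["What color is the sky?"],
        "chosen": ["Blue."],
        "rejected": ["Green."],
    })

    trainer = DPOTrainer(
        model=AutoModelForCausalLM.from_pretrained(model_id),
        args=DPOConfig(output_dir="out"),
        train_dataset=train_dataset,
        processing_class=AutoTokenizer.from_pretrained(model_id),  # formerly: tokenizer=...
    )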
c00722ce0a
🃏 Model card for TRL (#2123)
* template and util
* test for online dpo
* template in package_data
* template in manifest
* standardize push_to_hub
* wandb badge and quick start
* bco
* xpo
* simplify `create_model_card`
* cpo
* kto
* dpo
* gkd
* orpo
* style
* nash-md
* alignprop
* bco citation
* citation template
* cpo citation
* ddpo
* fix alignprop
* dpo
* gkd citation
* kto
* online dpo citation
* orpo citation
* citation in utils
* optional citation
* reward
* optional trainer citation
* sft
* remove add_model_tags bco
* Remove unnecessary code for adding model tags
* Fix model tag issue and update URL format
* Remove unused code for adding model tags
* Add citation for XPOTrainer
* Remove unused code in SFTTrainer
* Add model card generation in RLOOTrainer
* Remove unused import and method call in reward_trainer.py
* Add model card generation
* Remove unused code and update error message in ORPOTrainer class
* Add import statements and create model card in IterativeSFTTrainer
* Add dataset name to push_to_hub() call
* Update trainer.push_to_hub() dataset names
* script args
* test
* better doc
* fix tag test
* fix test tag
* Add tags parameter to create_model_card method
* doc
* script args
* Update trl/templates/model_card.md
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* unittest's `assertIn` instead of `assert`
* Update trl/templates/model_card.md
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
---------
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-27 15:23:05 +02:00
5368be1e1e
🧹 Style (#2132)
* drop `# flake8: noqa` in examples
* `__init__.py`
* fix init
* unwrap_model_for_generation
* ignore import violation in init
2024-09-26 21:02:48 +02:00
9af4734178
♻️ Standardize script_args (#2130)
2024-09-26 15:23:42 +02:00
32d9d34eb1
Standardize pushing to Hub in examples (#2126)
2024-09-26 10:00:51 +02:00
6920c2d1bb
Conversational dataset support for Online DPO (#2075)
* first modifications in the documentation
* Add script for processing ultrafeedback prompt dataset
* Remove unused variable in ultrafeedback.py
* style
* apply chat template within the init
* extend test
* new default lr
* nash md and xpo conv test
* Update prompt length check to 512 characters
* remove `maybe_apply_chat_template` in XPO and Nash examples
* polish online dpo doc
* better section name
* LogCompletionsCallback doc
* optional generation config
* reorder stats (consistency with online dpo)
* update online dpo doc
* format online dpo config
* format nash_md config
* update nash md
* Nash MD -> Nash-MD
* xpo doc
* doc
2024-09-18 14:10:38 +02:00
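For reference, the two prompt-only formats involved in #2075, sketched as tiny datasets; after this change Online DPO applies the chat template to the conversational form internally, so the examples no longer need maybe_apply_chat_template.

    from datasets import Dataset

    # Standard (plain text) prompt-only format:
    standard = Dataset.from_dict({"prompt": ["The sky is"]})

    # Conversational prompt-only format, now accepted directly:
    conversational = Dataset.from_dict(
        {"prompt": [[{"role": "user", "content": "What color is the sky?"}]]}
    )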
dc2bd07408
Nash md (#1853)
* initial skeleton
* initial config and class
* move TrainerCallback to callbacks.py
* initial trainer mockup
* formatting
* add back header
* script with reward model
* call ref policy forward with torch no_grad
* fix api
* clean up the configs
* use the new API
* fix typo
* get get_reward without grads
* remove unused no_grad calls
* fix formatting
* initial GeometricMixtureWrapper
* Update trl/models/modeling_base.py
Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
* undo changes to callback
* GenerationMixin needs generation_config
* calculate score with model and mixture model outputs
* fix scores and mixture_scores tensors
* undo
* use interleaved version to calcuate chosen-rejected
* Revert "use interleaved version to calcuate chosen-rejected"
This reverts commit 4a63a60971a7db173d10771548f17f650d955c2a.
* fix mixture scores
* Fix global step
* use mixture_coeff
* record scores_margin only
* fix del
* First version of Nash MD trainer
* undo
* fix formatting
* fix toc
* initial refactorin
* mixin fixes
* fix refactoring
* cleanup comments
* add log_stats
* add test
* initial docs
* fix logs
* fix missing_eos_penalty
* fix output_dir
* add peft_config to docs and super
* undo init changes
* Update docs/source/_toctree.yml
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/nash_md_config.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* add dataset format
* add authors
* add dynamic parameter callback
* update test
* fix comments
* test GeometricMixtureWrapper
* header
* formatting
* formatting
* add paper and abstract
* Update docs/source/nash_md_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* DynamicParameterCallback
* drop callback in favor of getter
* revert kto config change
* revert kto config change
* fix contribution
* `coeff` to `coef`
* log dynamic coefs
* Update docs/source/nash_md_trainer.md
* Update docs/source/nash_md_trainer.md
* fix tests
* use self.ref_model
* one-line
---------
Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Co-authored-by: Daniil Tiapkin <daniil.tiapkin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-16 13:46:52 +02:00