* Refactor reward processing in OnlineDPOTrainer
* Refactor completion decoding and reward processing
* remove strip
* remove warning
* Add reward_tokenizer to training script
* Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer test
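  A rough sketch of the wiring the two entries above describe: passing a separate tokenizer for the reward model into the trainer. Checkpoint and dataset names are placeholders, and the keyword names (`processing_class`, `reward_processing_class`) may differ between TRL versions:

  ```python
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
  from trl import OnlineDPOConfig, OnlineDPOTrainer

  model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder policy model
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
  reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # may differ from the policy tokenizer

  trainer = OnlineDPOTrainer(
      model=model,
      reward_model=reward_model,
      args=OnlineDPOConfig(output_dir="online-dpo"),
      train_dataset=load_dataset("trl-lib/ultrafeedback-prompt", split="train"),
      processing_class=tokenizer,
      reward_processing_class=reward_tokenizer,
  )
  ```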
* propagate to xpo and nash
* style
* reduce memory requirement with inference_mode
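  A minimal sketch of the memory optimization above: scoring completions under `torch.inference_mode()` so no autograd state is kept. The helper name is illustrative, not the trainer's actual method:

  ```python
  import torch

  def score_completions(reward_model, input_ids, attention_mask):
      # inference_mode() skips autograd bookkeeping entirely, so no activations
      # are retained and peak memory drops during reward scoring.
      with torch.inference_mode():
          logits = reward_model(input_ids=input_ids, attention_mask=attention_mask).logits
      # For a sequence-classification reward model with a single label,
      # the scalar reward is the lone logit per sequence.
      return logits.squeeze(-1)
  ```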
* fix tests
* PairRM judge via llm-blender

* setUpClass(cls)
* Add setUpClass method to TestJudges class
* truncation left for reward tokenizer
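  The left-truncation setting above matters because the reward model must always see the completion it is scoring. A sketch, with a stand-in checkpoint name:

  ```python
  from transformers import AutoTokenizer

  reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")
  reward_tokenizer.pad_token = reward_tokenizer.eos_token
  # With truncation_side="left", overly long prompt+completion pairs lose the
  # oldest prompt tokens first, so the completion being scored is never cut off.
  reward_tokenizer.truncation_side = "left"

  texts = ["<a very long prompt> ... <the generated completion>"]
  batch = reward_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
  ```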
* don't log completions without eval dataset
* only eval when possible
* use the pairwise judges
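  For reference, the pairwise-judge path replaces explicit reward scores with preference decisions. A sketch of how `PairRMJudge` is typically called (requires the optional `llm-blender` dependency; exact behaviour may vary across TRL versions):

  ```python
  from trl import PairRMJudge

  judge = PairRMJudge()  # wraps llm-blender's PairRM ranker
  prompts = ["What is the capital of France?"]
  completions = [["Paris is the capital of France.", "I am not sure."]]
  # For each prompt, the judge returns the index (0 or 1) of the preferred completion.
  best_indices = judge.judge(prompts, completions)
  ```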
* add test
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* decode and skip special tokens
* initial nash
* return tensors
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* add back the logging
* use batch_decode
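  A small sketch of the decoding change above: one `batch_decode` call over the whole batch, with special tokens stripped, instead of decoding sequences in a Python loop. The token ids are arbitrary toy values:

  ```python
  import torch
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the policy tokenizer
  completion_ids = torch.tensor([[464, 3139, 286, 4881, 318, 6342, 13, 50256]])  # toy ids; 50256 is GPT-2's EOS
  # skip_special_tokens drops EOS/padding so the judge or reward model sees clean text.
  completions = tokenizer.batch_decode(completion_ids, skip_special_tokens=True)
  ```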
* add judges api to XPO trainer
* Update tests/test_online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* judge in examples
* judge in config
* add back logs when using reward model
* typo
* add back model_scores logging when using reward model
* log scores for reward model only
* better condition on what to log
* same for rlhf reward
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* use decode_and_strip_padding
* error if both reward and judge or none are set
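  The check above makes the reward source unambiguous. A hedged sketch of the validation, with argument names mirroring the trainer's `reward_model` / `judge`:

  ```python
  def validate_reward_source(reward_model=None, judge=None):
      # Exactly one of the two must be provided: a scalar reward model or a pairwise judge.
      if reward_model is not None and judge is not None:
          raise ValueError("Provide either `reward_model` or `judge`, not both.")
      if reward_model is None and judge is None:
          raise ValueError("Either `reward_model` or `judge` must be provided.")
  ```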
* remove unused check
* Uniform way to pass conversation into judge
* heading -> leading
* LogCompletionsCallback compat with online method
* Update Online DPO doc
* check if data is conversational for judges
* update example
* remove comment
* use zip
* fix stats xpo
* Replace judge with PairRMJudge and import AutoModelForSequenceClassification
* update xpo documentation
* Remove doc duplication
* update nash doc
* XPO trl chat
* nash md doc
* HfPairwiseJudge
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
* `DPOScriptArguments` to `ScriptArguments`
* use dataset_train_split
* Use scriptarguments
* dataset names in command lines
* use `ScriptArguments` everywhere
* ignore bias buffer to end
* remove in v0.13
* rm comment
* update test commands
* Update docs/source/rloo_trainer.md
* Update tests/test_rloo_trainer.py
* Added dataset_train_split argument to ppo.py and rloo.py
* update scripts with dataset_train_split
* initial xpo trainer
* compute rewards and ref log probs in smaller batches
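  A generic sketch of the sub-batching above (not the trainer's exact code): running the reward and reference forward passes chunk by chunk, so peak memory scales with the chunk size rather than the full generation batch:

  ```python
  import torch

  def forward_logits_in_chunks(model, input_ids, attention_mask, chunk_size=8):
      outputs = []
      for start in range(0, input_ids.size(0), chunk_size):
          with torch.no_grad():
              chunk_logits = model(
                  input_ids=input_ids[start : start + chunk_size],
                  attention_mask=attention_mask[start : start + chunk_size],
              ).logits
          outputs.append(chunk_logits)
      return torch.cat(outputs, dim=0)
  ```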
* add logging
* initial log docs
* fix global_step increment
* fix metric descriptions
* use messages API
* use training_step API
* fix logs
* add test
* add back max_new_tokens
* use max_new_tokens
* refactor
* top_k is an int
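  The two generation fixes above (bounding by `max_new_tokens` and passing `top_k` as an integer) map onto a standard `transformers` generation config, for example:

  ```python
  from transformers import GenerationConfig

  generation_config = GenerationConfig(
      max_new_tokens=64,   # bound the completion length, independent of prompt length
      do_sample=True,
      temperature=0.9,
      top_k=50,            # must be an int, not a float
  )
  ```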
* fix formatting
* fix the loss
* fix logging
* fix logging
* fix logging
* fix loss
* calculate pi_log_ratio once
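  A deliberately generic illustration of the optimization above, not the actual XPO loss: the policy/reference log-ratio is computed once and reused by every loss term instead of being recomputed:

  ```python
  import torch
  import torch.nn.functional as F

  policy_logprobs = torch.randn(4)  # toy per-sample log-probs
  ref_logprobs = torch.randn(4)

  pi_log_ratio = policy_logprobs - ref_logprobs   # computed a single time
  preference_term = -F.logsigmoid(pi_log_ratio)   # reused here ...
  regularization_term = pi_log_ratio              # ... and here
  loss = (preference_term + 0.1 * regularization_term).mean()
  ```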
* fix stats
* fix loss
* do not log loss again
* fix docs
* add disable_dropout_in_model via flag
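  A sketch of the flag above, assuming the helper is the `disable_dropout_in_model` utility used by the other TRL trainers and that the config field is named `disable_dropout`:

  ```python
  import torch.nn as nn
  from trl.trainer.utils import disable_dropout_in_model

  model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.1))  # toy stand-in for the policy

  disable_dropout = True  # the new config flag (name assumed)
  if disable_dropout:
      # Sets p=0 on every nn.Dropout module, making train-mode forward passes deterministic.
      disable_dropout_in_model(model)
  ```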
* comments
* revert doc change
* rm empty cache in online dpo
* improve doc xpo config
* some comment
* fix logging stats
* fix docs
* save the model
* fix model and reward model
* Update trl/trainer/xpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---------
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>